org.sonar.l10n.py.rules.python.S6969.html

This rule raises an issue when a Scikit-Learn Pipeline is created without specifying the memory argument.

Why is this an issue?

When the memory argument is not specified, the pipeline refits every transformer each time the pipeline is fitted, even if the data and the transformer parameters have not changed. This can be time-consuming if the transformers are expensive to fit or if the dataset is large.

However, if the intent is to refit the transformers every time, the memory argument should be set explicitly to None. This makes the intention clear.
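
As a minimal sketch of that second case, the pipeline below passes memory=None explicitly. The steps are the same ones used in the examples further down; None is also the default value, so behavior is unchanged and only the intent is documented.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# memory=None disables caching on purpose: every call to fit() refits the scaler.
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LinearDiscriminantAnalysis())
], memory=None)
```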

How to fix it

Specify the memory argument when creating a Scikit-Learn Pipeline.

Code examples

Noncompliant code example

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LinearDiscriminantAnalysis())
]) # Noncompliant: the memory parameter is not provided

Compliant solution

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LinearDiscriminantAnalysis())
], memory="cache_folder") # Compliant
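
The folder name "cache_folder" above is only a placeholder. The following is a sketch, assuming a throwaway cache is acceptable, of one way to manage the cache location: a temporary directory is created, used as the cache, and removed afterwards. The iris dataset and the variable names are illustrative choices, not part of the rule.

```python
from shutil import rmtree
from tempfile import mkdtemp

from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

cache_dir = mkdtemp()  # hypothetical throwaway cache location
try:
    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('classifier', LinearDiscriminantAnalysis())
    ], memory=cache_dir)
    pipeline.fit(X, y)  # first fit: the fitted scaler is written to cache_dir
    pipeline.fit(X, y)  # same data and parameters: the scaler is loaded from the cache
    accuracy = pipeline.score(X, y)
finally:
    rmtree(cache_dir)  # delete the cache once it is no longer needed
```

Only the transformers are cached; the final estimator (here the classifier) is always refitted.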

Pitfalls

If the pipeline is fitted on many different datasets, cached results are rarely reused and the cache can consume a lot of disk space. This is the case with sklearn.model_selection.HalvingGridSearchCV and sklearn.model_selection.HalvingRandomSearchCV: with their default configuration, the size of the dataset changes at every iteration.
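
The effect can be illustrated with a rough sketch that fits one cached pipeline on subsets of different sizes and measures the cache folder after each fit. The dir_size helper, the subset choices, and the iris dataset are assumptions made for this illustration; the on-disk layout of the cache is an implementation detail of joblib, so only the total size is inspected.

```python
import os
from shutil import rmtree
from tempfile import mkdtemp

from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def dir_size(path):
    """Return the total size in bytes of all files below path (hypothetical helper)."""
    return sum(
        os.path.getsize(os.path.join(root, name))
        for root, _, names in os.walk(path)
        for name in names
    )


X, y = load_iris(return_X_y=True)

cache_dir = mkdtemp()  # hypothetical throwaway cache location
sizes = []
try:
    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('classifier', LinearDiscriminantAnalysis())
    ], memory=cache_dir)
    # Each subset is a different input, so each fit adds a new cache entry
    # instead of reusing an existing one.
    for step in (3, 2, 1):
        pipeline.fit(X[::step], y[::step])
        sizes.append(dir_size(cache_dir))
finally:
    rmtree(cache_dir)

print(sizes)  # the cache grows with every new dataset
```

In such situations, passing memory=None explicitly may be the better choice.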
