org.sonar.l10n.py.rules.python.S6969.html

This rule raises an issue when a Scikit-Learn Pipeline is created without specifying the memory argument.
Why is this an issue?
When the memory argument is not specified, the pipeline recomputes its transformers every time it is fitted. This can be
time-consuming if the transformers are expensive to compute or if the dataset is large.
However, if the intent is to recompute the transformers every time, the memory argument should be set explicitly to None. This makes
the intention clear.
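As a minimal sketch of the "explicit None" case described above, the pipeline below reuses the same steps as the examples in this rule and simply states that no caching is wanted:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# memory=None is spelled out: the transformers are refitted on every call
# to fit(), and the choice not to cache is visible to the reader.
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LinearDiscriminantAnalysis())
], memory=None)
```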
How to fix it
Specify the memory argument when creating a Scikit-Learn Pipeline.
Code examples
Noncompliant code example
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LinearDiscriminantAnalysis())
]) # Noncompliant: the memory parameter is not provided

Compliant solution
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LinearDiscriminantAnalysis())
], memory="cache_folder") # Compliant
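
The following sketch shows one possible way to manage the cache location, assuming a temporary directory is acceptable for the use case. The synthetic dataset from make_classification and the variable names (cache_dir, cached_entries) are illustrative assumptions, not part of the rule.

```python
import os
import shutil
import tempfile

from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used only to make the sketch runnable.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Temporary directory that holds the cached, fitted transformers.
cache_dir = tempfile.mkdtemp()
try:
    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('classifier', LinearDiscriminantAnalysis())
    ], memory=cache_dir)

    pipeline.fit(X, y)                        # fits the scaler and writes it to the cache
    cached_entries = os.listdir(cache_dir)    # the cache directory is now populated
    pipeline.fit(X, y)                        # same data and parameters: the scaler can be loaded from the cache
    accuracy = pipeline.score(X, y)
finally:
    # Remove the cache once it is no longer needed so it does not accumulate on disk.
    shutil.rmtree(cache_dir)
```

Only the transformers are cached; the final estimator (here the classifier) is still fitted on each call.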

Pitfalls
If the pipeline is fitted on different datasets, cached results are rarely reused, so the cache may not be helpful and can consume a
lot of disk space. This is notably the case with sklearn.model_selection.HalvingGridSearchCV or
sklearn.model_selection.HalvingRandomSearchCV, because with the default configuration the size of the dataset changes at every iteration.
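To illustrate this pitfall, here is a hedged sketch that runs a HalvingGridSearchCV over a pipeline with memory set explicitly to None. The synthetic dataset and the parameter grid are assumptions chosen only to keep the example small; the n_resources_ attribute shows how many samples each iteration used.

```python
# HalvingGridSearchCV is experimental and must be enabled explicitly.
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.model_selection import HalvingGridSearchCV

from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used only to make the sketch runnable.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# The training set size differs between iterations, so cached transformers
# would rarely be reused: caching is disabled explicitly.
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LinearDiscriminantAnalysis())
], memory=None)

# Illustrative grid with four candidates.
param_grid = {
    'scaler__with_mean': [True, False],
    'classifier__solver': ['svd', 'lsqr'],
}

search = HalvingGridSearchCV(pipeline, param_grid, random_state=0)
search.fit(X, y)

# Number of samples used at each successive iteration.
print(search.n_resources_)
```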
Resources
Documentation

   Scikit-Learn documentation - Pipeline