org.sonar.l10n.py.rules.python.S6969.html
This rule raises an issue when a Scikit-Learn Pipeline is created without specifying the memory argument.
Why is this an issue?
When the memory argument is not specified, the pipeline recomputes its transformers every time it is fitted. This can be time-consuming if the transformers are expensive to compute or if the dataset is large.

If the intent is indeed to recompute the transformers every time, the memory argument should be set explicitly to None. This way the intention is clear.
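As a minimal sketch (assuming scikit-learn is installed), passing memory=None explicitly documents that recomputation is intentional:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Explicitly disabling caching documents the intent to recompute
# the transformers on every fit.
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LinearDiscriminantAnalysis())
], memory=None)  # Compliant: the choice not to cache is explicit
```

This form is equivalent to omitting the argument, but a reader can tell the default was chosen deliberately rather than overlooked.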
How to fix it
Specify the memory argument when creating a Scikit-Learn Pipeline.
Code examples
Noncompliant code example
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LinearDiscriminantAnalysis())
])  # Noncompliant: the memory parameter is not provided
Compliant solution
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LinearDiscriminantAnalysis())
], memory="cache_folder")  # Compliant
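A hard-coded cache folder can accumulate stale entries. A common pattern, sketched below under the assumption that scikit-learn is installed, is to cache into a temporary directory and remove it once the pipeline is no longer refitted (the dataset from make_classification is synthetic and purely illustrative):

```python
import shutil
import tempfile
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# A temporary directory holds the fitted-transformer cache and is
# easy to clean up afterwards.
cache_dir = tempfile.mkdtemp()
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LinearDiscriminantAnalysis())
], memory=cache_dir)  # Compliant

X, y = make_classification(n_samples=100, n_features=5, random_state=0)
pipeline.fit(X, y)  # first fit populates the cache
pipeline.fit(X, y)  # an identical refit can reuse cached transformer results

shutil.rmtree(cache_dir)  # remove the cache when it is no longer needed
```

Caching pays off mainly when the same transformers are refitted on the same data, for example during a grid search over classifier hyperparameters.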
Pitfalls
If the pipeline is used with different datasets, the cache may not be helpful and can consume a lot of space. This is true when using sklearn.model_selection.HalvingGridSearchCV or sklearn.model_selection.HalvingRandomSearchCV, because the size of the dataset changes at every iteration under the default configuration.
Resources
Documentation
- Scikit-Learn documentation - Pipeline