org.sonar.l10n.py.rules.python.S6709.html Maven / Gradle / Ivy

Go to download
Show more of this group Show more artifacts with this name
Show all versions of python-checks Show documentation
There is a newer version: 4.26.0.19456
This rule raises an issue when random number generators do not specify a seed parameter.
Why is this an issue?
Data science and machine learning tasks make extensive use of random number generation. It may, for example, be used for:

   Model initialization
    
       Randomness is used to initialize the parameters of machine learning models. Initializing parameters with random values helps to break
      symmetry and prevents models from getting stuck in local optima during training. By providing a random starting point, the model can explore
      different regions of the parameter space and potentially find better solutions. 
    
  
   Regularization techniques
    
       Randomness is used to introduce noise into the learning process. Techniques like dropout and data augmentation use random numbers to
      randomly drop or modify features or samples during training. This helps to regularize the model, reduce overfitting, and improve generalization
      performance. 
    
  
   Cross-validation and bootstrapping
    
       Randomness is often used in techniques like cross-validation, where data is split into multiple subsets. By using a predictable seed, the
      same data splits can be generated, allowing for fair and consistent model evaluation. 
    
  
   Hyperparameter tuning
    
       Many machine learning algorithms have hyperparameters that need to be tuned for optimal performance. Randomness is often used in techniques
      like random search or Bayesian optimization to explore the hyperparameter space. By using a fixed seed, the same set of hyperparameters can be
      explored, making the tuning process more controlled and reproducible. 
    
  
   Simulation and synthetic data generation
    
       Randomness is often used in techniques such as data augmentation and synthetic data generation to generate diverse and realistic datasets.
      
    
  

To ensure that results are reproducible, it is important to use a predictable seed in this context.
The preferred way to do this in numpy is by instantiating a Generator object, typically through
numpy.random.default_rng, which should be provided with a seed parameter.
Note that a global seed for RandomState can be set using numpy.random.seed or numpy.seed, this will set the
seed for RandomState methods such as numpy.random.randn. This approach is, however, deprecated and Generator
should be used instead. This is reported by rule {rule:python:S6711}.
Exception
In contexts that are not related to data science and machine learning, having a predictable seed may not be the desired behavior. Therefore, this
rule will only raise issues if machine learning and data science libraries are being used.
How to fix it in Numpy
To fix this issue, provide a predictable seed to the random number generator.
Code examples
Noncompliant code example
import numpy as np

def foo():
    generator = np.random.default_rng()  # Noncompliant: no seed parameter is provided
    x = generator.uniform()

Compliant solution
import numpy as np

def foo():
    generator = np.random.default_rng(42)  # Compliant
    x = generator.uniform()

How to fix it in Scikit-Learn
To fix this issue, provide a predictable seed to the estimator or the utility function.
Code examples
Noncompliant code example
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
X_train, _, y_train, _ = train_test_split(X, y) # Noncompliant: no seed parameter is provided

Compliant solution
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
import numpy as np

rng = np.random.default_rng(42)
X, y = load_iris(return_X_y=True)
X_train, _, y_train, _ = train_test_split(X, y, random_state=rng.integers(1)) # Compliant

Resources
Documentation

   NumPy documentation - NEP 19 RNG Policy 
   Scikit-learn documentation - Glossary random_state 
   Scikit-learn documentation - Controlling randomness
  

Standards

   STIG Viewer - Application Security and
  Development: V-222642 - The application must not contain embedded authentication data. 

Related rules

   {rule:python:S6711} - numpy.random.Generator should be preferred to numpy.random.RandomState