data-pre-processing

Step 1: [[Import Required Libraries ]]
  • [[NumPy]]: Numerical Computing
  • [[Pandas]]: Data Manipulation
Step 2: [[Importing Data set]]
Step 3: [[Handling the missing values]]
  • Detecting Missing Values:

    • isnull(): Detects missing values and returns True where they exist.
    • notnull(): Opposite of isnull(), returns True where values are not missing.
  • Dropping Missing Values:

    • If you want to remove rows or columns with missing data, you can use dropna().
  • Filling Missing Values:

    • Instead of dropping rows/columns, you can fill in missing values using fillna()
  • Replacing Missing Values with Interpolation:

    • Means, Median,Mode
    • sklearn.preprocessing
Step 4: Encoding Categorical Data
  • Label Encoder
Step 5: [[Splitting the dataset into test set & training set]]

- train_test_split()

- from sklearn.model_selection import train_test_split 
Step 6: [[Feature Scaling]]
  • Min-Max Scaling (Normalization): This technique scales the data between a fixed range, usually 0 and 1.
  • Z-score Normalization (Standardization): This method scales the data such that the mean is 0 and the standard deviation is 1.
  • MaxAbs Scaling: This scales the data by its maximum absolute value, preserving the sign of the data (useful for data that is already centered around 0).
  • Robust Scaling: This technique is less sensitive to outliers as it uses the median and interquartile range (IQR) instead of the mean and standard deviation for scaling.

When to Use Feature Scaling:

  • Algorithms based on distance: Algorithms like k-NN, K-Means, SVM, and PCA rely on the distance between points and can be influenced by the scale of features.

  • Gradient-based algorithms: Algorithms like gradient descent (used in linear regression, logistic regression, and neural networks) converge faster with feature scaling.

  • [[Neural Networks]]: Feature scaling improves performance because it ensures faster convergence and avoids issues with large weight updates.

When Not to Use Feature Scaling:

  • Tree-based algorithms: Algorithms like Decision Trees, Random Forests, and XGBoost do not require feature scaling because they are not distance-based and can handle features with varying magnitudes.

  • Feature scaling ensures that your model performs optimally, treating each feature equally and helping distance- or gradient-based algorithms perform well.