data-pre-processing | TNPSC Fuhrer Notes

Detecting Missing Values:
- isnull(): Detects missing values and returns True where they exist.
- notnull(): Opposite of isnull(), returns True where values are not missing.
Dropping Missing Values:
- If you want to remove rows or columns with missing data, you can use dropna().
Filling Missing Values:
- Instead of dropping rows/columns, you can fill in missing values using fillna()
Replacing Missing Values with Interpolation:
- Means, Median,Mode
- sklearn.preprocessing


- train_test_split()

- from sklearn.model_selection import train_test_split

Min-Max Scaling (Normalization): This technique scales the data between a fixed range, usually 0 and 1.
Z-score Normalization (Standardization): This method scales the data such that the mean is 0 and the standard deviation is 1.
MaxAbs Scaling: This scales the data by its maximum absolute value, preserving the sign of the data (useful for data that is already centered around 0).
Robust Scaling: This technique is less sensitive to outliers as it uses the median and interquartile range (IQR) instead of the mean and standard deviation for scaling.

When to Use Feature Scaling:

Algorithms based on distance: Algorithms like k-NN, K-Means, SVM, and PCA rely on the distance between points and can be influenced by the scale of features.
Gradient-based algorithms: Algorithms like gradient descent (used in linear regression, logistic regression, and neural networks) converge faster with feature scaling.
[[Neural Networks]]: Feature scaling improves performance because it ensures faster convergence and avoids issues with large weight updates.

When Not to Use Feature Scaling:

Tree-based algorithms: Algorithms like Decision Trees, Random Forests, and XGBoost do not require feature scaling because they are not distance-based and can handle features with varying magnitudes.
Feature scaling ensures that your model performs optimally, treating each feature equally and helping distance- or gradient-based algorithms perform well.