Quality and reliability of data

The following are common causes of unreliable data in datasets:

Note: Any sufficiently large or diverse dataset almost certainly contains outliers that fall outside your data schema or unit test bands. Determining how to handle outliers is an important part of machine learning. The [[Numerical data]] unit details how to handle numeric outliers.
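As a minimal sketch of detecting such outliers, the following flags values that fall outside an assumed "unit test band" for a feature; the specific range and values are hypothetical:

```python
# Hypothetical valid band for an age-like numeric feature.
LOW, HIGH = 0.0, 120.0

values = [23, 41, -5, 67, 999, 18]

# Flag any value that falls outside the expected band as an outlier.
outliers = [v for v in values if not (LOW <= v <= HIGH)]
```

How you then treat the flagged values (clip, drop, or investigate) depends on your data; see the [[Numerical data]] unit.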

Don't train a model on incomplete examples. Instead, fix or eliminate incomplete examples, either by deleting them or by imputing values for the missing fields.
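Both options can be sketched in plain Python. This is an illustrative example, not a prescribed pipeline; the examples are toy dicts, with `None` marking a missing value:

```python
from statistics import mean

# Toy dataset: each dict is one example; None marks a missing feature value.
examples = [
    {"sqft": 1200, "price": 300_000},
    {"sqft": None, "price": 450_000},   # incomplete example
    {"sqft": 2000, "price": 500_000},
]

# Option 1: eliminate incomplete examples entirely.
complete = [ex for ex in examples if None not in ex.values()]

# Option 2: fix incomplete examples by imputing the feature's mean,
# computed from the complete examples only.
sqft_mean = mean(ex["sqft"] for ex in complete)
imputed = [
    {**ex, "sqft": ex["sqft"] if ex["sqft"] is not None else sqft_mean}
    for ex in examples
]
```

Deletion is simpler but shrinks the dataset; imputation preserves every example at the cost of introducing synthetic values.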

Imputation is the process of generating well-reasoned substitute values for missing data, not random or deceptive data. Be careful: good imputation can improve your model; bad imputation can hurt your model.

One common approach is to use the [[Mean]] or [[Median]] as the imputed value. Therefore, if you represent a numerical feature with [[Z-score|Z-scores]], the imputed value is typically 0, because 0 is the mean Z-score.
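The worked arithmetic below shows why, using made-up feature values: imputing the mean and then Z-scoring the feature yields a Z-score of exactly 0 for the imputed value.

```python
from statistics import mean, pstdev

observed = [10.0, 12.0, 14.0, 16.0, 18.0]  # feature values that are present
mu = mean(observed)       # feature mean
sigma = pstdev(observed)  # population standard deviation

# Impute the mean for a missing value...
imputed_value = mu

# ...then Z-score the feature: z = (x - mean) / std dev.
z = (imputed_value - mu) / sigma
# Because the imputed value equals the mean, its Z-score is 0.
```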