numerical-data

A machine learning (ML) model’s health is determined by its data. Feed your model healthy data and it will thrive; feed your model junk and its predictions will be worthless.

Best practices for working with numerical data:

  • Remember that your ML model interacts with the data in the feature vector, not the data in the dataset.
  • Normalize most numerical features.
  • If your first normalization strategy doesn’t succeed, consider a different way to normalize your data.
  • Binning, also referred to as bucketing, is sometimes better than normalizing.
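
As a rough illustration of the normalization and binning practices above, here is a minimal sketch using only the Python standard library. It assumes Z-score scaling as the normalization strategy, which is only one of several options. The helper names `z_score` and `bucketize`, the sample `ages` values, and the bin boundaries are all hypothetical and chosen purely for illustration.

```python
from bisect import bisect_right
from statistics import mean, pstdev


def z_score(values):
    """Scale values to mean 0 and standard deviation 1 (Z-score scaling)."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        # A constant feature has no spread; map every value to 0.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def bucketize(values, boundaries):
    """Return the bin index of each value, given sorted bin boundaries."""
    return [bisect_right(boundaries, v) for v in values]


# Hypothetical feature values, invented for illustration.
ages = [22, 35, 47, 51, 68]

scaled = z_score(ages)
# Assumed boundaries give three bins: below 30, 30 up to 49, 50 and above.
bins = bucketize(ages, boundaries=[30, 50])
```

Either `scaled` or `bins` could then be placed in the feature vector. Which one works better depends on the feature and the model, so it is worth trying both.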

Think about what your data should look like, then write verification tests to validate those expectations. For example:

  • The absolute value of latitude should never exceed 90. You can write a test to check whether any latitude in your data falls outside the range -90 to 90.
  • If your data is restricted to the state of Florida, you can write tests to check that the latitudes fall between 24 and 31, inclusive.

Beyond verification tests:

  • Visualize your data with scatter plots and histograms. Look for anomalies.
  • Gather statistics not only on the entire dataset but also on smaller subsets of the dataset. That’s because aggregate statistics sometimes obscure problems in smaller sections of a dataset.
  • Document all your data transformations.
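
The latitude checks and the advice about subset statistics could be sketched roughly as follows, again with only the Python standard library. The helpers `out_of_range` and `mean_by_group`, the `source` field, and the sample records are hypothetical. The data is invented so that the overall mean looks plausible for Florida while one subset does not.

```python
from collections import defaultdict
from statistics import mean


def out_of_range(values, low, high):
    """Return the values that fall outside the inclusive range [low, high]."""
    return [v for v in values if not low <= v <= high]


def mean_by_group(rows, group_key, value_key):
    """Return the mean of value_key for each distinct group_key value."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row[value_key])
    return {name: mean(vals) for name, vals in groups.items()}


# Hypothetical records, invented for illustration.
rows = [
    {"source": "a", "latitude": 30.0},
    {"source": "a", "latitude": 30.5},
    {"source": "a", "latitude": 29.5},
    {"source": "a", "latitude": 30.0},
    {"source": "b", "latitude": 20.0},
    {"source": "b", "latitude": 21.0},
]
latitudes = [row["latitude"] for row in rows]

# Verification tests: each list should be empty if the data is healthy.
bad_anywhere = out_of_range(latitudes, -90, 90)  # valid latitudes
bad_florida = out_of_range(latitudes, 24, 31)    # Florida-only data

# Aggregate versus subset statistics.
overall_mean = mean(latitudes)
per_source = mean_by_group(rows, "source", "latitude")
```

Here `overall_mean` falls inside the Florida range, yet the mean for source "b" does not. That is the kind of problem an aggregate statistic can hide. In a real pipeline, these checks would more likely run as automated tests that fail when a list such as `bad_florida` is non-empty.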

Data is your most valuable resource, so treat it with care.