binning | TNPSC Fuhrer Notes

Binning (also called bucketing) is a [[Feature Engineering]] technique that groups different numerical subranges into bins or buckets. In many cases, binning turns numerical data into categorical data.

For example, consider a feature named X whose lowest value is 15 and highest value is 425. Using binning, you could represent X with the following five bins:

Bin 1: 15 to 34
Bin 2: 35 to 117
Bin 3: 118 to 279
Bin 4: 280 to 392
Bin 5: 393 to 425
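
As a rough illustration, the five bins above can be applied with NumPy. This is only a sketch: the array `boundaries`, the helper `bin_index`, and the sample values in `x` are names and data made up for this note, and the one-hot step is just one possible way to treat the bin number as a categorical value.

```python
import numpy as np

# Lower edges of bins 2 to 5, taken from the example above.
# (Hypothetical name; any value below 35 lands in bin 1.)
boundaries = np.array([35, 118, 280, 393])

def bin_index(values):
    """Return the 1-based bin number for each value of X."""
    # np.digitize gives 0 for values below the first edge, 1 for values
    # in [35, 118), and so on; adding 1 yields bin numbers 1..5.
    return np.digitize(values, boundaries) + 1

# Illustrative sample values, not real data.
x = np.array([15, 37, 115, 118, 300, 425])
bins = bin_index(x)

# One possible categorical encoding: one column per bin.
one_hot = np.eye(5, dtype=int)[bins - 1]
```

Here 37 and 115 both map to bin 2, so a model that only sees the bin number cannot tell them apart.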

Binning is a good alternative to [[Normalization|scaling]] or [[Clipping]] when either of the following conditions is met:

The overall linear relationship between the feature and the label is weak or nonexistent.
The feature values are clustered.

Binning can feel counterintuitive, given that the model in the previous example treats the values 37 and 115 identically. But when a feature appears more clumpy than linear, binning is a much better way to represent the data.

Quantile bucketing creates bucketing boundaries such that the number of examples in each bucket is exactly or nearly equal.

Quantile bucketing mostly hides the outliers, because extreme values land in the first or last bucket together with less extreme ones.
Bucketing with equal intervals works for many data distributions.
For skewed data, however, try quantile bucketing.
Equal intervals give extra information space to the long tail while compacting the large torso into a single bucket.
Quantile buckets give extra information space to the large torso while compacting the long tail into a single bucket.
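
The difference between the two schemes can be sketched on skewed data. Everything here is an assumption made for illustration: the log-normal sample stands in for a skewed feature, four buckets is an arbitrary choice, and `bucket_counts` is a hypothetical helper.

```python
import numpy as np

# Synthetic skewed feature (assumption): log-normal with a long right tail.
rng = np.random.default_rng(0)
values = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)
n_buckets = 4  # arbitrary choice for the sketch

# Equal-interval boundaries: split the range [min, max] into equal widths.
equal_edges = np.linspace(values.min(), values.max(), n_buckets + 1)

# Quantile boundaries: split so each bucket holds about the same
# number of examples.
quantile_edges = np.quantile(values, np.linspace(0, 1, n_buckets + 1))

def bucket_counts(data, edges):
    """Count how many examples fall into each bucket."""
    # Only the interior edges are passed to np.digitize, so the minimum
    # falls in the first bucket and the maximum in the last one.
    idx = np.digitize(data, edges[1:-1])
    return np.bincount(idx, minlength=len(edges) - 1)

equal_counts = bucket_counts(values, equal_edges)
quantile_counts = bucket_counts(values, quantile_edges)

print("equal intervals:", equal_counts)
print("quantiles:      ", quantile_counts)
```

With equal intervals, almost all examples pile into the first bucket and the remaining buckets are spent on the sparse tail. With quantile boundaries the counts come out nearly equal, so the dense part of the distribution is spread over several buckets.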