Handling Mixed-Variable Entropy With Gaussian Binning

How do you handle the notion of entropy for a system of both continuous and discrete variables? One idea comes from Generalized Additive Models (GAMs), which use dummy coding to represent categorical variables. Essentially, you would handle continuous variables using the formula for differential (continuous) entropy, then dummy code your categorical variables and apply the formula for discrete (Shannon) entropy.
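To make the two formulas concrete, here’s a minimal sketch in Python (assuming NumPy). The helper names `discrete_entropy` and `gaussian_differential_entropy` are my own, and the continuous estimate assumes the variable is roughly Gaussian; a real estimator (e.g., a k-nearest-neighbor one) would be more robust.

```python
import numpy as np

def discrete_entropy(labels):
    """Shannon entropy (in nats) of a discrete variable."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def gaussian_differential_entropy(x):
    """Differential entropy of x *assuming* it is Gaussian:
    0.5 * log(2 * pi * e * var(x))."""
    return 0.5 * np.log(2 * np.pi * np.e * np.var(x))

rng = np.random.default_rng(0)
cont = rng.normal(size=1000)                  # continuous column
cat = rng.choice(["a", "b", "c"], size=1000)  # categorical column

# Dummy coding: map each category to an integer code; the entropy of
# the codes equals the entropy of the original categories.
_, codes = np.unique(cat, return_inverse=True)

print(discrete_entropy(codes))                # discrete (Shannon) entropy
print(gaussian_differential_entropy(cont))    # continuous (differential) entropy
```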

One problem with this approach is that you may end up with an extreme case of overfitting: in the worst case, a tree node for every distinct value of each discrete variable. To alleviate this, binning is used to lump discrete (and sometimes continuous) values together, reducing the maximum number of possible nodes to the size of the bin set.
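As a sketch of what plain binning buys you, the following equal-width helper (my own, assuming NumPy) caps the number of distinct values a tree could ever split on at `n_bins`:

```python
import numpy as np

def equal_width_bins(x, n_bins=8):
    """Map raw values to bin indices 0..n_bins-1, capping the number
    of distinct values (and hence candidate tree nodes) at n_bins."""
    edges = np.linspace(x.min(), x.max(), n_bins + 1)
    return np.digitize(x, edges[1:-1])  # interior edges -> labels 0..n_bins-1

rng = np.random.default_rng(1)
x = rng.exponential(size=1000)
print(np.unique(equal_width_bins(x)))   # at most 8 distinct values remain
```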

Binning can be improved with a concept esoterically referred to as “Gaussian Binning,” in which the goal is to smooth out the histogram of bin membership by adding or removing members from each bin until the result is a Gaussian-like distribution.
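“Gaussian Binning” is not a standardized term, so take the following as one plausible reading rather than a definitive implementation: choose bin edges so that the bin-membership counts follow a discretized Gaussian profile. The function name `gaussian_binning` and the ±2σ span are my own assumptions.

```python
import numpy as np

def gaussian_binning(x, n_bins=8):
    """Choose bin edges so that bin-membership counts follow a
    roughly bell-shaped (discretized Gaussian) profile."""
    centers = np.linspace(-2.0, 2.0, n_bins)  # assumed +/- 2 sigma span
    target = np.exp(-0.5 * centers ** 2)      # Gaussian weight per bin
    target /= target.sum()
    # Interior edges sit at the empirical quantiles of the running target
    # mass, so bin i receives roughly target[i] of the data.
    edges = np.quantile(x, np.cumsum(target)[:-1])
    return np.digitize(x, edges)              # labels 0..n_bins-1

rng = np.random.default_rng(2)
x = rng.exponential(size=2000)
labels = gaussian_binning(x)
print(np.bincount(labels))  # counts rise toward the middle bins, then fall
```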

Essentially, you wouldn’t split your induction steps on the discrete variables themselves, but rather on their “bin” class membership. This not only helps control potential overfitting, but also gives us a parameter (the number of bins) to tune based on the type of information we are working with.
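Putting the pieces together, an induction step might score a split on bin labels rather than raw values. Here is a sketch under the same assumptions (NumPy; `discrete_entropy` repeated so the snippet runs on its own; `information_gain` is my own helper), with `n_bins` acting as the tuning parameter:

```python
import numpy as np

def discrete_entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def information_gain(y, bin_labels):
    """Parent entropy minus the weighted entropy within each bin:
    the usual ID3-style criterion, applied to bin membership."""
    gain = discrete_entropy(y)
    for b in np.unique(bin_labels):
        mask = bin_labels == b
        gain -= mask.mean() * discrete_entropy(y[mask])
    return gain

rng = np.random.default_rng(3)
x = rng.normal(size=500)
y = (x > 0.5).astype(int)  # toy binary target

# Fewer bins -> coarser splits -> less room to overfit.
for n_bins in (2, 4, 16):
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1])
    print(n_bins, information_gain(y, np.digitize(x, edges)))
```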
