Entropy: A Measure for Data Science & Machine Learning
Entropy. I first got introduced to this term while studying the laws of thermodynamics in chemistry and physics classes in high school, and it confused me for years, at least until recently. This is a good time for a disclaimer: you do not need any background in chemistry, physics, or any other science-related field to understand entropy.
To get started on entropy and its applications in Data Science & Machine Learning, we first need to explain what entropy is.
What is Entropy?
Entropy, according to textbook definitions, is a measure of disorder in a system; it is a measure of our ignorance about the physical state of that system. This is tough to explain in the abstract, so I will try to do it with an everyday scenario. Imagine you are trying to make pancakes; you have an egg, a cup of flour, a cup of sugar, yeast, salt, and whatever else. You dump them all in a mixing bowl.
If you close your eyes and reach your hand into the bowl ‘x’ times, you will end up touching different ingredients, depending on the different distribution of the ingredients in the bowl. Now imagine you stir all of these ingredients together for a very long time. You close your eyes and reach your hand into the bowl ‘x’ times, just like before. This time, however, you end up touching the same thing (well-mixed batter) no matter where you put your hand. This is because the ingredients now have the same distribution in space.
From this pancake analogy, you can say that entropy is low when the ingredients are not mixed, i.e., the measure of how well they are mixed is low. When the ingredients are very well mixed, entropy is high.
So what does this have to do with Data Science or Datasets?
Entropy is used for a lot of things in data science. For example, entropy can be used to build classification trees, which are used to classify data. Entropy is also the basis of `mutual information`, which quantifies the relationship between two things. It is also the basis of relative entropy (aka the Kullback-Leibler divergence) and cross-entropy, which show up all over the place, including in fancy dimension-reduction algorithms like t-SNE and UMAP. What these three things have in common is that they all use entropy, or something derived from it, to quantify similarities and differences, so let’s learn how entropy does that.
However, to talk about entropy, we first need to understand ‘Surprise’.
Surprise
Suppose a fisherman has three baskets: the first contains 9 Tunas and 1 Salmon, the second contains 9 Salmons and 1 Tuna, while the last contains 5 Salmons and 5 Tunas.
Now, if the fisherman randomly placed his hand in the first basket to pick a fish, the fact that there are 9 Tunas and only one Salmon means there is a higher probability that he will pick a Tuna, and since picking a Tuna is so probable, it would not be surprising if he did. In contrast, if the fisherman picked up the Salmon, we would be relatively surprised.
The second basket has a lot more Salmons than Tunas. Because there is now a higher probability of picking up a Salmon, we would not be very surprised if that happened, and because there is a relatively low probability of picking the Tuna, it would be relatively surprising if he did.
The third basket has an equal number of Tunas and Salmons, so regardless of which fish he picks up, we would be equally surprised. Taken together, these baskets tell us that surprise is in some way inversely related to probability: when the probability of picking up a Salmon is low, as it was in the first basket, the surprise is high, and when the probability of picking up a Tuna is high, the surprise is low.
How Is Surprise Related To Entropy?
Now that we have a general intuition of how probability is related to surprise, let’s relate surprise to entropy. In fancy statistics notation, we say that entropy is the expected value of surprise. As the previous examples suggest, entropy is a measure of the uncertainty of a variable: the more uncertain the variable is, the higher its entropy. So, based on this intuition, what do you think the entropy of the fisherman’s pick from the first basket will be?
We understand that there is a kind of inverse relationship between probability and surprise, so it’s tempting to just use the inverse of the probability as the value of surprise, but that would be wrong. Surprise is instead the log of the inverse of the probability: Surprise = log₂(1/p(x)) = -log₂(p(x)). For a detailed explanation, check out Claude Shannon’s original paper, ‘A Mathematical Theory of Communication’.
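To make that concrete, here is a minimal Python sketch (the `surprise` helper is just an illustrative name, not a standard library function) that computes the surprise of an event from its probability:

```python
import math

def surprise(p):
    """Surprise (self-information) of an event with probability p, in bits."""
    return math.log2(1 / p)

# The rarer the event, the larger the surprise.
print(surprise(0.1))  # ~3.32 bits: picking the lone Salmon from the first basket
print(surprise(0.9))  # ~0.15 bits: picking one of the nine Tunas
```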
Now, let’s try to calculate the value of surprise for picking a Tuna or a Salmon from the first basket. From the counts alone, we already know that there is a 0.1 chance that the fisherman will pick a Salmon and a 0.9 chance that he will pick a Tuna. So, from this example, we have two events:
- X (the event that the fisherman picks a Salmon)
- Y (the event that the fisherman picks a Tuna)
It’s widely understood that the total entropy of a variable is

H(X) = -Σ p(x) · log₂(p(x))

where p(x) represents the probability of each possible event and the sum runs over all of them.
Plugging in these probabilities, the surprise of X is -log₂(0.1) ≈ 3.32 bits, while the surprise of Y is -log₂(0.9) ≈ 0.15 bits. As expected, the surprise of X is greater than the surprise of Y, which confirms the earlier assertion: the less probable an event is, the more surprising it is.
We can also think about it in these terms if we wish: the less certain we are about an outcome, i.e., the more surprising it is, the higher the entropy will be.
Now, we are going to ask another question: what is the total entropy of the system? It is simple. Since entropy is the expected value of surprise, we weight the surprise of each event by its probability and add the results, which is the same as summing the p(x) · log₂(p(x)) terms for X and Y and negating the result:

H = 0.1 × surprise(X) + 0.9 × surprise(Y) = -(0.1 × log₂(0.1) + 0.9 × log₂(0.9)) ≈ 0.469
Since entropy measures information, it is expressed in ‘bits’. So we say the fisherman’s decision to pick a fish from the first basket carries about 0.469 bits of entropy.
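If you want to check this number yourself, here is a small Python sketch (the `entropy` helper is illustrative, not taken from any particular library) that computes the entropy of each of the fisherman’s baskets:

```python
import math

def entropy(probabilities):
    """Shannon entropy in bits: the expected value of surprise."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# Probabilities of (Tuna, Salmon) in each basket.
print(entropy([0.9, 0.1]))  # ~0.469 bits: mostly Tuna, not very uncertain
print(entropy([0.1, 0.9]))  # ~0.469 bits: mostly Salmon, same uncertainty
print(entropy([0.5, 0.5]))  # 1.0 bit: an even split, maximum uncertainty
```

Notice that the first two baskets have the same entropy: it does not matter which fish happens to be the rare one, only how lopsided the distribution is.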
All this theory is good, but how is it helpful to us? How do we apply it in our day-to-day machine learning models? To answer that, we need a high-level overview of what a decision tree is and how it works.
Walkthrough of a Decision Tree
The decision tree algorithm is a type of supervised learning used for classification problems. Classification is the process of dividing a dataset into different categories or groups by assigning labels (for example, a model that decides whether an email is spam or not spam).
It is essentially a hierarchy of if-else statements: a collection of rules, known as the splitting criteria, based on comparison operators applied to the features. The algorithm finds the relationship between the response variable and the predictors and expresses this relationship as a tree structure.
This flow chart consists of a root node, branch nodes, and leaf nodes. The root node holds the original data, the branch nodes apply the decision rules, and the leaf nodes are the outputs of those decisions; leaf nodes cannot be divided into further branches.
Hence, a decision tree is a graphical depiction of all the possible outcomes of a problem based on certain conditions, or rules. The model is trained by building the tree top-down, and the trained tree is then used on new, unseen data to classify each case into a category.
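Before we get to why entropy matters here, it helps to see how little code a decision tree takes in practice. Here is a minimal sketch, assuming scikit-learn is installed and using its bundled Iris dataset purely as a stand-in:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load a small labeled dataset and hold out a test split.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# criterion="entropy" tells the tree to choose its splits by information gain.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
tree.fit(X_train, y_train)
print(tree.score(X_test, y_test))  # accuracy on the held-out data
```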
Why entropy in decision trees?
Decision trees compute what is called the ‘information gain’ of a split, which is the reduction in entropy that the split achieves. It can be computed as the entropy of the parent node minus the weighted average entropy of the child nodes.
In decision trees, the goal is to clean and arrange the data. When you split the dataset into two parts, you want those two parts to be as tidy as possible. Here, tidy means that each of the two sets contains only elements of one category (associated with a single label), which makes it easier to make a decision. In practice, however, splitting a dataset into two pure sets rarely happens; the two sets almost always contain elements with two or more different labels. Entropy is used to quantify the purity of the sets in terms of the element labels.
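To see how a candidate split is scored, here is a small Python sketch (the `entropy` and `information_gain` helpers are illustrative names, and the ten-email split is hypothetical):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of the label distribution of a set."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy of the parent set minus the weighted average entropy of the child sets."""
    n = len(parent)
    children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - children

# A hypothetical split of 10 labeled emails into two child sets.
parent = ["spam"] * 5 + ["no spam"] * 5
left = ["spam"] * 4 + ["no spam"] * 1
right = ["spam"] * 1 + ["no spam"] * 4
print(information_gain(parent, left, right))  # ~0.278 bits of entropy removed by the split
```

The tidier the two child sets are, the lower their weighted entropy and the higher the information gain, which is exactly what the tree tries to maximize at every split.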
So entropy becomes an indicator of how messy your data is.
Imagine you have a dataset of emails labeled either Spam or No Spam. If the training dataset is pure and contains only Spam elements or only No Spam elements, the entropy is at its minimum (zero). If the dataset contains a mix of Spam and No Spam elements, the entropy rises, and it reaches its maximum (one bit) when the set contains an equal number of elements with each label.
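These boundary cases are easy to verify with another small sketch (again, `label_entropy` is just an illustrative helper, not a library function):

```python
import math
from collections import Counter

def label_entropy(labels):
    """Entropy (in bits) of the label distribution of a dataset."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

print(label_entropy(["spam"] * 10))                   # 0.0: a pure set, minimum entropy
print(label_entropy(["spam"] * 5 + ["no spam"] * 5))  # 1.0: perfectly mixed, maximum entropy
print(label_entropy(["spam"] * 8 + ["no spam"] * 2))  # ~0.72: somewhere in between
```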
Endnotes
Once you understand entropy, the decision tree explanation is pretty straightforward. The idea of entropy is to quantify the uncertainty of the probability distribution over the possible classes. Entropy is not just a mathematical formula or a theoretical curiosity; it has a simple meaning and practical applications that everyone can understand.
Entropy in ML models quantifies the amount of uncertainty (or surprise) involved in the value of a random variable or the outcome of a random process. Its significance in decision trees is that it allows us to estimate the impurity or heterogeneity of the target variable. To achieve the maximum level of correctness in the response variable, the child nodes must be arranged so that their combined (weighted) entropy is less than the entropy of the parent node.
Voilà, and that’s entropy in data science.