Excel Applications: Naive Bayes AI (Text Classification) – part 1 of 2
Artificial intelligence is pervasive in our world today. From the cars we drive to the food we eat, our world is increasingly defined by the algorithms and the formulas of which they are composed. Because of the diversity of possible applications of artificial intelligence, cottage industries have cropped up attempting to apply AI models to virtually every facet of modern life. Although the applications of these techniques are quite modern, the underlying techniques themselves predate modernity by generations.
The workhorse of many artificial intelligence algorithms derives, interestingly enough, from the work of Thomas Bayes, an 18th-century clergyman and statistician. His notes, published after his death, laid the foundations of Bayes’ Theorem, which would later find new purpose in the world of machine learning. Bayes’ Theorem also enjoyed a renaissance in the aftermath of the 2008 financial crisis, when “fat tails” and “black swans” entered common parlance in mainstream academia. The simple premise underlying the statistical jargon is that previous observations (or “Bayesian priors”) should be accounted for when making predictions about the future.
Think about recent flooding. Climatologists often proclaim, to much fanfare, that once-in-a-hundred-year floods, hurricanes, and other natural phenomena are occurring at rates that far exceed their predicted likelihood, and thus that their policy recommendations should be weighed above those of their more muted peers. Bayesian thinking, including the incorporation of “priors”, might instead suggest that when two supposedly once-in-a-century floods occur in consecutive years, it is more likely that the model is inaccurate than that a virtually impossible event actually happened. Of course, this thinking pervades many spheres. The financial meltdown of 2008 was caused in part (how many parts there actually were is the subject of a whole subset of literature) by the erroneous notion that risk in the sub-prime housing market was not systemic. Observation of prior banking crises indicates that systemic risk is, in fact, endemic to the banking sector. As we will see, Bayesian thinking has become central to the world of statistics.
Probability theory undergirds Bayes’ theorem and the naive Bayes AI applications used in Excel. John Foreman’s Data Smart gives an excellent primer on probability theory, and is worth a look for anyone wanting a quick refresher in reference to AI models. For our sake, it’s worth remembering that p(A|B) is the probability of A given B. This is known as a conditional probability, because it is the probability that A occurs on the condition that B has occurred. In the case of independent events, knowing that B occurred tells us nothing about A, so p(A|B) = p(A) and the probability that both occur is simply the product of the two. If p(A)=.5 and p(B)=.5, the chance that both A and B occur is .5 x .5 = .25. Understanding probability theory will help in working through Naive Bayes models, particularly with regard to sentiment analysis applications in Excel.
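For readers who prefer to see the arithmetic run, here is a minimal Python sketch of conditional probability and independence. The two-coin-flip sample space and the helper names (prob, cond_prob, first_heads, second_heads) are assumptions made up for illustration; they do not come from Data Smart or the Mandrill workbook.

```python
from fractions import Fraction
from itertools import product

# Toy sample space (an illustrative assumption): two fair coin flips,
# giving four equally likely outcomes such as ("H", "T").
outcomes = list(product("HT", repeat=2))

def prob(event):
    """Probability of an event, given as a true/false test on an outcome."""
    favorable = [o for o in outcomes if event(o)]
    return Fraction(len(favorable), len(outcomes))

def cond_prob(event_a, event_b):
    """p(A|B) = p(A and B) / p(B)."""
    return prob(lambda o: event_a(o) and event_b(o)) / prob(event_b)

def first_heads(outcome):
    return outcome[0] == "H"

def second_heads(outcome):
    return outcome[1] == "H"

p_a = prob(first_heads)                                       # 1/2
p_b = prob(second_heads)                                      # 1/2
p_both = prob(lambda o: first_heads(o) and second_heads(o))   # 1/4
p_a_given_b = cond_prob(first_heads, second_heads)            # 1/2

print(p_a, p_b, p_both, p_a_given_b)
```

Because the two flips are independent, p(A|B) comes out equal to p(A), and the probability of both equals the product .5 x .5 = .25.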
The model used for our purposes will allow us to perform text classification. Text classification allows us to rapidly and effectively decide which class a piece of text belongs to. Sentiment analysis, a subset of text classification, is often used by advertisers to gauge Twitter reactions and customer reviews. Text classification spans a broad range of applications, however, from military intelligence to targeted advertising to data mining.
For our purposes, we will be working with the Mandrill data set from John Foreman’s Data Smart.
Our data set focuses on classifying tweets about either a (fictional) person named “Mandrill” or a (fictional) app named “Mandrill.” Of course, as we go along you will see how this technique can be applied to a limitless set of possible naive Bayes problems. In order to create our maximum a posteriori rule (MAP rule), we first have to determine whether a given tweet (from the Mandrill Excel sheet) is more likely to be about the app than not. If the probability that the tweet in question is about the app is greater than the probability that it is not, we will classify it as a tweet about Mandrill the app. If it is not, it will be classified as a tweet about Mandrill the person. The beauty of ‘idiot Bayes’ is twofold: the formula is naive insofar as it treats each word in a tweet as independent of every other word, and it is simple enough that even an idiot can get it right.
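The MAP rule described above can be sketched in a few lines of Python. The word probabilities below are invented purely for illustration and are not taken from the Mandrill data set; the function names class_score and classify are likewise hypothetical. The sketch assumes the equal prior discussed later in this post and simply ignores words that are missing from the tables.

```python
# Hypothetical p(word | class) tables; the numbers are made up for this sketch.
P_WORD_GIVEN_APP = {"email": 0.05, "api": 0.04, "hello": 0.002, "said": 0.001}
P_WORD_GIVEN_PERSON = {"email": 0.002, "api": 0.001, "hello": 0.03, "said": 0.02}

def class_score(tokens, word_probs, prior=0.5):
    """Prior times the product of per-word probabilities (the naive assumption)."""
    score = prior
    for token in tokens:
        if token in word_probs:  # words missing from the table are skipped
            score *= word_probs[token]
    return score

def classify(tokens):
    """MAP rule: pick the class with the larger score."""
    app_score = class_score(tokens, P_WORD_GIVEN_APP)
    person_score = class_score(tokens, P_WORD_GIVEN_PERSON)
    return "app" if app_score > person_score else "person"

print(classify(["email", "api"]))   # app
print(classify(["hello", "said"]))  # person
```

Multiplying the per-word probabilities together is exactly where the "naive" independence assumption enters: each word contributes to the score without regard to the words around it.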
We will assume (wrongly) that the prevalence of the two kinds of tweets is 1:1. Although this isn’t the case, adding in Bayesian priors of the actual prevalence (4:1) messes with the formula in ways that complicate the analysis. For simplicity’s sake, a 50-50 split assumption works well. This is why, in real applications, practitioners will often begin their analysis under the assumption of an equal split, knowing this to be incorrect. To read up on the statistics, buy the book; it also has great in-depth analysis.
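To see what the prior actually does, here is a small sketch of Bayes' theorem for two classes. The likelihood values are invented, the function name posterior_app is hypothetical, and the direction of the 4:1 split (four app tweets for every other tweet) is assumed only for the sake of the example.

```python
from fractions import Fraction

def posterior_app(lik_app, lik_person, prior_app):
    """Bayes' theorem for two classes: p(app | tweet)."""
    prior_person = 1 - prior_app
    evidence = lik_app * prior_app + lik_person * prior_person
    return lik_app * prior_app / evidence

# Invented likelihoods p(tweet | class) for a single hypothetical tweet.
lik_app = Fraction(1, 1000)
lik_person = Fraction(3, 1000)

equal = posterior_app(lik_app, lik_person, Fraction(1, 2))    # 1/4
skewed = posterior_app(lik_app, lik_person, Fraction(4, 5))   # 4/7

print(equal, skewed)
```

With an equal split the prior cancels out and only the likelihoods matter, which is what keeps the spreadsheet simple. With an unequal prior the same tweet can tip the other way, so the 50-50 shortcut is a convenience rather than a free lunch.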
Two other issues must be dealt with before we can begin the actual analysis in Excel: rare words and floating-point underflow. Rare words, quite simply, are words that occur so infrequently in our training set that they could throw off our classification. Floating-point underflow occurs when many minuscule probabilities are multiplied together: the product can become too small for the computer to represent and gets rounded to zero. As such, we will take the ln() of our probabilities and add the results, which gives us manageable negative numbers rather than increasingly small products. With those two points in mind, we can begin manipulating the Excel sheet.
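Both points can be illustrated with a short Python sketch. It uses add-one smoothing for rare words, which is one common remedy and is assumed here rather than taken from this post, along with a tiny invented training set. The helper names (count_words, log_probs, log_score) are hypothetical.

```python
import math
from collections import Counter

# Underflow: multiplying 100 probabilities of 0.00001 rounds to exactly zero...
product = 1.0
for _ in range(100):
    product *= 1e-5
# ...while the sum of their natural logs is an ordinary negative number.
log_sum = 100 * math.log(1e-5)
print(product, log_sum)

def count_words(tweets):
    """Count tokens across a list of tokenized tweets."""
    counts = Counter()
    for tokens in tweets:
        counts.update(tokens)
    return counts

def log_probs(counts, vocab):
    """ln p(word | class), adding 1 to every count so no word gets probability 0."""
    total = sum(counts[w] for w in vocab) + len(vocab)
    return {w: math.log((counts[w] + 1) / total) for w in vocab}

def log_score(tokens, logp):
    """Sum of log probabilities; tokens outside the vocabulary are ignored."""
    return sum(logp[t] for t in tokens if t in logp)

# Invented toy training data, not the Mandrill tweets.
app_tweets = [["send", "email", "api"], ["email", "api", "key"]]
person_tweets = [["mandrill", "said", "hello"], ["hello", "mandrill"]]

vocab = {w for tweet in app_tweets + person_tweets for w in tweet}
app_logp = log_probs(count_words(app_tweets), vocab)
person_logp = log_probs(count_words(person_tweets), vocab)

tweet = ["email", "api"]
label = "app" if log_score(tweet, app_logp) > log_score(tweet, person_logp) else "person"
print(label)  # app
```

Note that "email" never appears in the person tweets, yet smoothing still gives it a small non-zero probability, so a single unseen word cannot zero out an entire class.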
In part 2 of our look at modeling Naive Bayes AI algorithms in Excel, we will return to the Mandrill dataset and begin working through text classification.