Principal Component Analysis with Python (Intro)

Principal Component Analysis (PCA) is a statistical technique that lets data science practitioners reduce the many variables in a dataset to a predefined number of ‘principal components.’ In practice, this makes otherwise unwieldy data easier to visualize and work with.

Take a moment to look at the figure below, which comes from Jose Portilla’s Udemy course on machine learning. The upper left graph shows ordinary data with two features, or ‘components.’ This graph is the eventual output of a PCA transformation. The bottom left graph shows all of the data points plotted on a single axis: the y value (‘Feature 2’) is dropped, so only the values along the x axis remain. The bottom right graph works the same way, but uses the other ‘principal component’ (‘Feature 2’) as the axis.

PCA.png
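To make the single-axis idea concrete, here is a minimal sketch using NumPy. It is not taken from the course or the figure: the data is synthetic, and the variable names are made up for illustration. It contrasts simply dropping one feature with projecting the points onto the first principal direction, which is the single axis that preserves the most variance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, illustrative data: 200 points with two correlated features.
feature_1 = rng.normal(size=200)
feature_2 = 0.8 * feature_1 + rng.normal(scale=0.3, size=200)
X = np.column_stack([feature_1, feature_2])

# Collapsing the data onto one axis by dropping the other feature.
only_feature_1 = X[:, 0]
only_feature_2 = X[:, 1]

# PCA instead looks for the direction of greatest variance.
# Center the data, then take the SVD; the rows of vt are the
# principal directions, ordered from most to least variance.
X_centered = X - X.mean(axis=0)
_, _, vt = np.linalg.svd(X_centered, full_matrices=False)
first_pc_scores = X_centered @ vt[0]

var_feature_1 = only_feature_1.var()
var_feature_2 = only_feature_2.var()
var_first_pc = first_pc_scores.var()

print("variance kept using Feature 1 only:", var_feature_1)
print("variance kept using Feature 2 only:", var_feature_2)
print("variance kept using first principal component:", var_first_pc)
```

The first principal component should retain at least as much variance as either original feature on its own, which is the sense in which it is the best single axis for this data.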

At its root, PCA requires understanding the theory behind the X and Y axes, which normally goes unnoticed when looking at plotted data. Below is a traditional number line like the one presented to primary school students across the country. Normally, each value maps one-to-one to a point on the line. Take stock returns, for instance. If the S&P 500 returned 6% in year one, that value would be marked with a dot at 6. If it returned -2% the following year, that value would be marked with a dot at -2. Put simply, most of the graphs we see in everyday life are two-dimensional representations of the relationship between two variables.

[Figure: Number line as shown to students learning mathematics fundamentals]

In finance, statistics, epidemiology, and elsewhere, a line like this is referred to as an ‘axis.’ Plotting two axes at a 90-degree angle to one another yields the scatterplots, bar charts, and trendlines commonly seen in the professional and academic worlds. PCA is in principle no different, except that we start with n dimensions and pare them down until we can visualize them on a two-dimensional graph.
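As a rough sketch of paring n dimensions down to two, the snippet below uses scikit-learn. The choice of the iris dataset (four features) is an assumption made purely for illustration and is not necessarily the dataset the full post works with; standardizing the features first is a common practice rather than something stated above.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assumption: iris (150 rows, 4 features) is used only as a stand-in dataset.
X = load_iris().data

# Standardize so each feature contributes on a comparable scale.
X_scaled = StandardScaler().fit_transform(X)

# Reduce the four original dimensions to two principal components.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

print("original shape:", X.shape)
print("reduced shape:", X_2d.shape)
print("share of variance kept per component:", pca.explained_variance_ratio_)
```

The two columns of `X_2d` can then be used as the x and y axes of an ordinary scatterplot, and `explained_variance_ratio_` gives a sense of how much of the original spread survives the reduction.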

When datasets get complex and more than two variables are needed to capture the essence of the data, PCA can be used to visualize the data and capture information about its structure. In the following example, using Python, we will move through Principal Component Analysis on a built-in dataset.