Principal Component Analysis (PCA) is a statistical technique that allows data science practitioners to pare down a large number of variables in a dataset to a chosen number of ‘principal components.’ Essentially, this method allows statisticians to visualize and manipulate otherwise unwieldy data.
For a moment, take a look at the graph below, which comes from Jose Portilla’s Udemy course on machine learning. The upper left panel shows what would be considered normal data with two features, or ‘components’; this panel is the eventual output of a PCA transformation. The bottom left panel shows all of the data points graphed on a single axis, with the y value (‘Feature 2’) dropped so as to display only the values along a single x axis. The bottom right panel works in a similar way, but uses the other ‘principal component’ (‘Feature 2’) as its single axis.
At its root, PCA requires understanding the theory behind the X and Y axes that normally goes unnoticed when looking at plotted data. Below, you see a traditional number line like the ones presented to primary school students across the country. Normally, any value is plotted in a one-to-one relationship to a point on the graph. For instance, take stock returns. If in year one returns for the S&P 500 were 6%, that number would be plotted with a dot at 6. If in the following year returns for the S&P 500 were -2%, that number would be represented with a dot at -2. Simply put, all graphs that we see in everyday life are representations in two-dimensional space of the intersection between two variables.
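To make the number-line idea concrete, here is a minimal sketch that plots the two hypothetical S&P 500 return figures from the example above as one-dimensional points on a single axis (the figure values and axis label are just the illustrative numbers from the text, not real market data):

```python
import matplotlib.pyplot as plt

# hypothetical yearly S&P 500 returns from the example above (in percent)
returns = [6, -2]

# plot each value as a dot on a single number line (y is fixed at 0)
fig, ax = plt.subplots(figsize=(6, 1.5))
ax.scatter(returns, [0] * len(returns))
ax.axhline(0, color='gray', linewidth=0.5)
ax.set_yticks([])  # hide the unused y axis entirely
ax.set_xlabel('S&P 500 return (%)')
fig.savefig('number_line.png')  # or plt.show() in a notebook
```

Dropping one axis like this is exactly what the bottom two panels of the graph above do: each keeps a single feature and discards the other.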
In finance, statistics, epidemiology, and elsewhere, each of these number lines is typically referred to as an ‘axis.’ Plotting any two of them at a 90 degree angle yields the scatterplots, bar charts, or trendlines typically seen in the professional and academic worlds. PCA is principally no different, except that it starts from an arbitrary number of dimensions and pares them down until we can visualize the data on a two-dimensional graph.
When datasets get complex and more than two variables are used to capture the essence of the data, PCA can be used as a tool to visualize and capture information about the data structure. In the following example, using Python, we will move through Principal Component Analysis on a dataset built into the sklearn library:
# begin by importing the necessary libraries; this example is being run in a Jupyter notebook
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
# then import the dataset from Python’s sklearn library titled ‘load_breast_cancer’
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
# get a feel for the dataset and look at the available keys
cancer.keys()
# create the variable ‘df’ to manipulate the cancer dataset
df = pd.DataFrame(cancer['data'], columns=cancer['feature_names'])
# look at the head of the data frame to get a sense of the variables present
df.head()
# doing this yields 30 columns of data, most of which have little meaning on their own in determining what is cancerous and what isn’t; as such, we will transform the dataset into 2 components
# to do this, we will use sklearn’s built-in ‘StandardScaler’, fit it to the dataset, and then transform the data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(df)
scaled_data = scaler.transform(df)
# now we will use sklearn’s built-in PCA commands
from sklearn.decomposition import PCA
# then we will create the variable ‘pca’ using a PCA transformation with 2 components, fit it, and transform the scaled data
pca = PCA(n_components=2)
pca.fit(scaled_data)
x_pca = pca.transform(scaled_data)
# looking at the scaled data shape versus the ‘x_pca’ data shape, we can see that we have reduced the number of variables from 30 to 2
scaled_data.shape
# output: (569, 30)
x_pca.shape
# output: (569, 2)
# finally, to visualize the data, we use the commands shown below
plt.figure(figsize=(8, 6))
# color code each point by the binary cancer variable (malignant vs. benign)
plt.scatter(x_pca[:, 0], x_pca[:, 1], c=cancer['target'], cmap='plasma')
# plot labels
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
# the commands above yield a scatterplot of the two principal components
It is worth taking a moment to discuss what just happened. Each of the 30 initial variables from the cancer dataset was assigned a weight (a ‘loading’) between -1 and 1, and the weighted sum of the variables forms each component; the weights are chosen to be mathematically optimal, in the sense that they minimize the squared error between the data and its lower-dimensional representation.
The second component (‘Second Principal Component’) is the one orthogonal to the first (‘First Principal Component’). While the statistics behind orthogonal transformations are complex, for simplicity’s sake the second component can be treated as uncorrelated with the first. This is partly why the graph (and findings) presented by PCA is optimal: it both minimizes the squared error of the components and chooses components that are uncorrelated with one another. PCA does this through an unsupervised algorithm, whereby the projection with the lowest squared error is selected. This finding can be presented in the heatmap below:
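These orthogonality and uncorrelatedness claims are easy to check numerically. The sketch below is a self-contained rerun of the same steps (so it does not depend on any notebook state) that fits the same two-component PCA and verifies both properties:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

cancer = load_breast_cancer()
scaled = StandardScaler().fit_transform(cancer['data'])

pca = PCA(n_components=2)
x_pca = pca.fit_transform(scaled)

# the two loading vectors are orthogonal: their dot product is ~0
dot = np.dot(pca.components_[0], pca.components_[1])
print(abs(dot) < 1e-9)  # True

# consequently, the two projected columns are (near-)uncorrelated
corr = np.corrcoef(x_pca[:, 0], x_pca[:, 1])[0, 1]
print(abs(corr) < 1e-6)  # True

# each component also reports the share of total variance it captures
print(pca.explained_variance_ratio_)
```

The `explained_variance_ratio_` attribute is also a useful sanity check when deciding how many components to keep: if two components capture only a small share of the variance, the two-dimensional picture is discarding a lot of information.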
# heatmap visual for principal component characteristics
df_comp = pd.DataFrame(pca.components_, columns=cancer['feature_names'])
plt.figure(figsize=(12, 6))
sns.heatmap(df_comp, cmap='plasma')
The heatmap above displays the two principal components, 0 and 1, in relation to the 30 variables in the cancer dataset. While this Python introduction focused on a single dataset, the same method can be applied to an endless variety of problems.
Take, for example, wealth. What factors might be associated with a high net worth? Education level? Years of experience? Age? Income? Yearly expenditure? Rather than looking at a laundry list of variables and trying to determine the most valuable predictive ones (imagine a 30-dimensional scatterplot), we can simply pare the dataset down from x variables to 2, allowing us to visualize the data and make predictions about the net worth of the people in our dataset. Although the statistics are considerably more complex, and there are nuances to picking how many components to use, the guide above should (hopefully) give you a quick and dirty introduction to the concept of Principal Component Analysis.
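As a closing sketch, the snippet below applies the same recipe to the hypothetical wealth example. The feature names and the generated numbers are invented purely for illustration (there is no real dataset behind them); the point is that the standardize-then-project steps are identical regardless of the domain:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
n = 500

# invented, loosely related predictors of net worth (not real data)
age = rng.uniform(25, 65, n)
education_years = rng.uniform(10, 22, n)
experience = np.clip(age - education_years - 6, 0, None)
income = 20000 + 3000 * experience + 2500 * education_years + rng.normal(0, 10000, n)
expenditure = 0.6 * income + rng.normal(0, 5000, n)

df = pd.DataFrame({
    'age': age,
    'education_years': education_years,
    'experience': experience,
    'income': income,
    'expenditure': expenditure,
})

# same recipe as above: standardize, then project onto 2 components
scaled = StandardScaler().fit_transform(df)
pca = PCA(n_components=2)
reduced = pca.fit_transform(scaled)

print(reduced.shape)  # (500, 2)
print(pca.explained_variance_ratio_)
```

Because several of these invented features are built to be correlated (income depends on experience and education, expenditure depends on income), the first component soaks up most of the variance, which is exactly the situation where PCA is useful.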