Training a Model with PCA-Extracted Features

chum · Nov 1, 2020 · 3 min read

Principal component analysis, or PCA, is a dimensionality reduction technique for unsupervised learning. It compresses a dataset into a lower-dimensional space with fewer features while retaining as much of the original information as possible. As the number of dimensions increases, things like distance calculations become less and less meaningful, so we want techniques that achieve roughly the same results as the full feature set while using far fewer features. PCA is one tool for handling the curse of dimensionality.

Every feature or variable you add to a model adds a dimension to the space that contains your data. The distance between two points that were initially close together grows roughly with the square root of the number of dimensions, so the more features you add, the harder a prediction is to classify, because the variance in the data increases.
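To see why (a small sketch of my own, not from the original post): take two points whose coordinates differ by a fixed amount in every dimension and watch the Euclidean distance between them grow with the square root of the dimensionality.

import numpy as np

for d in (2, 10, 100, 1000):
    a = np.zeros(d)
    b = np.full(d, 0.1)              # every coordinate differs by 0.1
    print(d, np.linalg.norm(a - b))  # equals 0.1 * sqrt(d)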

Another reason to use PCA is that it makes the features uncorrelated, untangling multicollinearity. We might want none of our features to be correlated so that our model's coefficients are easier to interpret. PCA lets us construct a set of features that holds as much of the information as possible while being linearly independent.
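As a quick illustration (my own sketch, not part of the walkthrough that follows): build two strongly correlated features, run PCA on them, and check that the resulting components are uncorrelated.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
x1 = rng.normal(size=500)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=500)   # x2 nearly duplicates x1
X_corr = np.column_stack([x1, x2])

components = PCA().fit_transform(X_corr)
print(np.corrcoef(X_corr.T)[0, 1])       # close to 1: strongly correlated inputs
print(np.corrcoef(components.T)[0, 1])   # close to 0: decorrelated components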

In this example, I’ll apply the unsupervised learning technique of Principal Components Analysis to a wine dataset from sklearn: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_wine.html

I’ll use the principal components of the dataset as features in a machine learning model. Then I’ll use the extracted features to train a vanilla Random Forest Classifier, and compare model performance to a model trained without PCA-extracted features.

import pandas as pd
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

# Load the wine dataset once and wrap it in labelled pandas structures
wine = load_wine()
X = pd.DataFrame(wine.data, columns=wine.feature_names)
y = pd.Series(wine.target, name='class')

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

1) Fit PCA to the training data

Call the PCA instance you create wine_pca, set n_components=0.9, and use random_state=42. Standardize the features first, since PCA is sensitive to feature scale.

from sklearn.preprocessing import StandardScaler

# Standardize the features so each one contributes on the same scale
ss = StandardScaler()
X_train_sc = ss.fit_transform(X_train)
X_test_sc = ss.transform(X_test)


from sklearn.decomposition import PCA

# Keep enough components to explain at least 90% of the variance
wine_pca = PCA(n_components=0.9, random_state=42)
wine_pca.fit(X_train_sc)

2) Find how many principal components there are in the fitted PCA object. (you should end up with 8 components)

len(wine_pca.components_)
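Equivalently (a quick check of my own, not in the original notebook), the fitted object's n_components_ attribute reports the same count, and the explained variance ratios confirm those components cover at least 90% of the variance.

print(wine_pca.n_components_)                    # 8
print(wine_pca.explained_variance_ratio_.sum())  # >= 0.90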

Next, I’ll reduce the dimensionality of the training data to the number of components that explain at least 90% of the variance in the data, and then I’ll use this transformed data to fit a Random Forest classification model.

I’ll also compare the performance of the model trained on the PCA-extracted features to the performance of a model trained using all features without feature extraction.

3) Transform the training features into an array of reduced dimensionality using the wine_pca PCA object fit earlier.

I’ll call this array X_train_pca.

X_train_pca = wine_pca.transform(X_train_sc)
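As a quick sanity check (my own addition, not one of the original steps), the reduced array keeps every training row but only the eight retained components as columns.

print(X_train_pca.shape)  # (number of training rows, 8)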

Next, create a dataframe from this array of transformed features and inspect the first five rows of the dataframe.

X_train_pca = pd.DataFrame(X_train_pca)

# Inspect the first five rows of the transformed features dataset
X_train_pca.head()
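If you prefer more readable column names (a hypothetical tweak, not part of the original steps), you can label the components before inspecting them.

X_train_pca.columns = [f'PC{i+1}' for i in range(X_train_pca.shape[1])]
X_train_pca.head()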

4) Instantiate a Random Forest Classifier (call it rfc) and fit it to the transformed training data.

Set n_estimators=10, random_state=42, and make sure you include the relevant import(s).

from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(n_estimators=10, random_state=42)
rfc.fit(X_train_pca, y_train)

5) Evaluate model performance on the test data and place model predictions in a variable called y_pca_pred.

Make sure to transform the test data the same way as you transformed the training data.

X_test_sc = ss.transform(X_test)
X_test_pca = wine_pca.transform(X_test_sc)

y_pca_pred = rfc.predict(X_test_pca)

Print the classification report for the model performance on the test data.

from sklearn.metrics import classification_report
print(classification_report(y_test, y_pca_pred))

Run the cell below to fit a vanilla Random Forest Classifier to the untransformed training data, evaluate its performance on the untransformed test data, and print the classification report for the model.

vanilla_rfc = RandomForestClassifier(n_estimators=10, random_state=42)
vanilla_rfc.fit(X_train, y_train)

y_pred = vanilla_rfc.predict(X_test)

print(classification_report(y_test, y_pred))

6) Compare model performance.

Did the overall accuracy of the model improve when using the transformed features? Yes:

The overall accuracy of the model improved from 93% to 98% with the transformed features.
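To read those numbers off directly rather than from the printed reports (a small sketch of my own), compare the two sets of predictions with accuracy_score.

from sklearn.metrics import accuracy_score
print('PCA features:', accuracy_score(y_test, y_pca_pred))
print('All features:', accuracy_score(y_test, y_pred))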
