# Python tutorial using scikit-learn for Principal Component Analysis (PCA) on the digits dataset.
# Principal component analysis (PCA) is used to explain the variance-covariance structure of a set of variables through linear combinations.
# It is often used as a dimensionality-reduction technique.
# Python is an interpreted, high-level, general-purpose programming language.
# scikit-learn (sklearn) is a high-level machine learning library for Python.
# NumPy is a library for the Python programming language, adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.
# Matplotlib is a plotting library for the Python programming language and its numerical mathematics extension NumPy.
# Import python libraries
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
import numpy as np
import matplotlib.pyplot as plt
# load the digits dataset
theData = load_digits()
# Analyze the dataset's shape
print("The Data's shape: ", theData.data.shape, "\n")
# Create the PCA model and specify the number of components
theModel = PCA(n_components=2)
# Fit the PCA model to the data and apply the dimensionality reduction
Transformed = theModel.fit_transform(theData.data)
# Analyze the dataset before the PCA transformation
print("Before PCA transformation: ", theData.data.shape, "\n")
# Analyze the dataset after the PCA transformation
print("After PCA transformation: ", Transformed.shape, "\n")
# Analyze the PCA's component values
print("The PCA's components: ", theModel.components_, "\n")
# Analyze the PCA's explained variance
print("The PCA's Explained Variance: ", theModel.explained_variance_, "\n")
# Analyze the PCA's explained variance ratio
print("The PCA's Explained Variance Ratio: ", theModel.explained_variance_ratio_, "\n")
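# Sanity check (standalone illustrative addition, not part of the original
# tutorial): re-fit the same 2-component PCA and confirm two properties that
# follow from how PCA works — the component rows are orthonormal (their Gram
# matrix is the identity), and the explained variance ratios sum to at most 1.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
import numpy as np
checkModel = PCA(n_components=2).fit(load_digits().data)
gram = checkModel.components_ @ checkModel.components_.T
print("Components orthonormal: ", np.allclose(gram, np.eye(2)), "\n")
print("Total variance retained by 2 components: ", checkModel.explained_variance_ratio_.sum(), "\n")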
# Create a scatter plot showing the data after the PCA transformation
plt.scatter(Transformed[:, 0], Transformed[:, 1], c=theData.target, edgecolor='none', alpha=0.5, cmap=plt.get_cmap('nipy_spectral_r', 10))
plt.xlabel('component 1')
plt.ylabel('component 2')
plt.colorbar();
#plt.show()
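# Reconstruction demo (standalone illustrative addition): project the data
# down to 2 components and back with inverse_transform. The nonzero mean
# squared reconstruction error shows how much information the discarded
# components carried.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
import numpy as np
reconData = load_digits().data
reconModel = PCA(n_components=2)
reconstructed = reconModel.inverse_transform(reconModel.fit_transform(reconData))
print("Mean squared reconstruction error: ", np.mean((reconData - reconstructed) ** 2), "\n")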
#################### Number of Components Selection ####################
# Fit a new PCA model that retains all components, then plot the cumulative explained variance to see how many components are needed to describe the data
pca = PCA().fit(theData.data)
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance');
#plt.show()
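# Shortcut for component selection (standalone illustrative addition): passing
# a float between 0 and 1 as n_components asks scikit-learn to keep just
# enough components to explain that fraction of the total variance, which
# automates the cumulative-variance reading from the plot above.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
pca95 = PCA(n_components=0.95).fit(load_digits().data)
print("Components needed for 95% of the variance: ", pca95.n_components_, "\n")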