Principal component analysis (PCA) is an unsupervised technique, so no target feature is needed. It is used to reduce the dimensionality of datasets that exhibit multicollinearity, and the new variables it produces (the principal components) are orthogonal to one another.
Now, what is dimensionality reduction?
Reducing the number of dimensions, up to a point, does not mean that the data loses its important features. Let's look at an example:
Here, the data is arranged in an ellipse that runs at 45 degrees to the axes.
Here, the axes have been moved so that the data now runs along the x-axis and is centered on the origin.
The potential for dimensionality reduction lies in the fact that the y dimension now shows little variability, so it may be possible to ignore it and use only the x-axis values. Dropping that dimension may also reduce the noise in the data.
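The effect of moving the axes can be sketched numerically. The example below is a minimal illustration on synthetic data, not the dataset from the figures: the spreads (3.0 and 0.5), the offset, and all variable names are assumptions chosen for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: wide along one direction, narrow along the other
n = 500
along = rng.normal(scale=3.0, size=n)   # spread along the long axis of the ellipse
across = rng.normal(scale=0.5, size=n)  # spread along the short axis

# Rotate by 45 degrees and shift, so the ellipse is tilted and off-centre
theta = np.deg2rad(45)
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
tilted = np.column_stack([along, across]) @ R.T + np.array([5.0, 2.0])

# "Move the axes": centre on the origin, then undo the 45 degree rotation
centred = tilted - tilted.mean(axis=0)
aligned = centred @ R

var_tilted = tilted.var(axis=0)
var_aligned = aligned.var(axis=0)
print("variance per axis (tilted): ", var_tilted)
print("variance per axis (aligned):", var_aligned)
```

In the tilted data both axes carry a similar share of the variance, while in the aligned data almost all of it sits on the first axis, which is why the second axis becomes a candidate for dropping.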
NOTE: The PC technique must be applied to scaled (standardized) data; otherwise, features measured on larger scales dominate the variance.
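A small sketch of why scaling matters, assuming two made-up features on very different scales (height in metres and income in dollars; both are hypothetical and only serve the illustration). Standardization here means subtracting each column's mean and dividing by its standard deviation.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical features on very different scales
height_m = rng.normal(1.7, 0.1, size=300)
income = rng.normal(50_000, 10_000, size=300)
X = np.column_stack([height_m, income])

# Without scaling, the first principal direction is dominated by income
evals_raw, evecs_raw = np.linalg.eigh(np.cov(X, rowvar=False))
first_raw = evecs_raw[:, -1]  # eigh returns eigenvalues in ascending order

# Standardise: zero mean and unit variance for every column
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
scaled_var = X_scaled.var(axis=0, ddof=1)

print("first direction on raw data:", first_raw)
print("variances after scaling:   ", scaled_var)
```

On the raw data the first direction points almost entirely along the income axis, simply because its numbers are larger. After scaling, both features have unit variance and contribute on equal terms.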
How can we find the principal components?
The principal components (PCs) are the eigenvectors of the covariance matrix. We will look at each of these in turn.
Covariance Matrix Formulation:
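$$\mathrm{Cov}(x, y) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$
Eigenvector equation (covariance matrix × vector = eigenvalue × vector):
$$C v = \lambda v$$
where C is the covariance matrix, v is an eigenvector, and λ is the corresponding eigenvalue.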
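Both relations can be checked numerically. The sketch below uses synthetic `x` and `y` (an assumption for illustration), computes the covariance by hand with the n-1 denominator, compares it with `np.cov`, and then verifies that each eigenvector satisfies the eigenvector equation.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical correlated variables
x = rng.normal(size=100)
y = 0.8 * x + 0.3 * rng.normal(size=100)
n = len(x)

# Covariance from the formula, with the (n - 1) denominator
cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)

# Full 2x2 covariance matrix; np.cov also divides by (n - 1) by default
C = np.cov(x, y)

# Eigen-decomposition; eigh is suited to symmetric matrices such as C
evals, evecs = np.linalg.eigh(C)

# Check (covariance matrix) x (vector) = (eigenvalue) x (vector) per eigenvector
for lam, v in zip(evals, evecs.T):
    print("max |Cv - lambda v| =", np.max(np.abs(C @ v - lam * v)))
```

The off-diagonal entry `C[0, 1]` matches the hand-computed `cov_xy`, and the printed residuals are zero up to floating-point error.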
Computing the principal components of the 2D dataset on the left and using only the first one to reconstruct it produces the line of data shown on the right, which lies along the principal axis of the ellipse that the data was sampled from.
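import numpy as np

def pca(data, nRedDim=0, normalise=1):
    # Centre the data (on a copy, so the caller's array is not modified)
    m = np.mean(data, axis=0)
    data = data - m

    # Covariance matrix (np.cov expects one variable per row)
    C = np.cov(np.transpose(data))

    # Compute eigenvalues and sort into descending order
    # (eigh is used because C is symmetric, so the results are real)
    evals, evecs = np.linalg.eigh(C)
    indices = np.argsort(evals)[::-1]
    evecs = evecs[:, indices]
    evals = evals[indices]

    # Keep only the first nRedDim components, if requested
    if nRedDim > 0:
        evecs = evecs[:, :nRedDim]

    # Rescale each eigenvector (column) to unit length
    if normalise:
        for i in range(np.shape(evecs)[1]):
            evecs[:, i] = evecs[:, i] / np.linalg.norm(evecs[:, i])

    # Produce the new data matrix (one row per component)
    x = np.dot(np.transpose(evecs), np.transpose(data))
    # Compute the original data again from the retained components
    y = np.transpose(np.dot(evecs, x)) + m
    return x, y, evals, evecs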
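The claim that reconstructing from only the first component collapses 2D data onto a line can be checked with a short, self-contained sketch. It does not call the function above. The synthetic data and the names `centred`, `first`, and `recon` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical 2D data stretched along a diagonal direction
t = rng.normal(size=200)
data = np.column_stack([t, t + 0.2 * rng.normal(size=200)])
n = data.shape[0]

mean = data.mean(axis=0)
centred = data - mean

# Eigen-decomposition of the covariance matrix (eigenvalues in ascending order)
evals, evecs = np.linalg.eigh(np.cov(centred, rowvar=False))
first = evecs[:, [-1]]  # first principal component, kept as a 2x1 column

# Project onto the first component, then map back to the original space
scores = centred @ first
recon = scores @ first.T + mean

# The reconstructed points are collinear: the centred reconstruction has rank 1
rank = np.linalg.matrix_rank(recon - mean)
# What was lost is exactly the variance along the discarded component
residual_var = np.sum((data - recon) ** 2) / (n - 1)

print("rank of centred reconstruction:", rank)
print("lost variance:", residual_var, "smallest eigenvalue:", evals[0])
```

The variance lost by the reconstruction equals the eigenvalue of the component that was dropped, so a small eigenvalue means little information is discarded.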
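Eigenvectors and eigenvalues are used to understand the variance (the information) in the data: each eigenvector gives a direction, and its eigenvalue gives the amount of variance along that direction, so the eigenvector with the largest eigenvalue points along the direction of maximum variance.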
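As a rough illustration of reading variance off the eigenvalues, the sketch below computes the share of total variance explained by each component. The three-feature dataset is hypothetical: the third feature is deliberately built as a near copy of the first, to mimic multicollinearity.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical data: the third feature is almost a copy of the first
a = rng.normal(size=400)
b = rng.normal(size=400)
X = np.column_stack([a, b, a + 0.05 * rng.normal(size=400)])

C = np.cov(X, rowvar=False)
evals, evecs = np.linalg.eigh(C)

# Sort components by eigenvalue, largest first
order = np.argsort(evals)[::-1]
evals, evecs = evals[order], evecs[:, order]

# Share of the total variance carried by each component
explained = evals / evals.sum()
cumulative = np.cumsum(explained)

# Project the centred data onto the components and look at their covariance
scores = (X - X.mean(axis=0)) @ evecs
score_cov = np.cov(scores, rowvar=False)

print("explained variance ratio:", explained)
print("cumulative:", cumulative)
```

Two components account for nearly all of the variance here, because the redundant third feature adds almost nothing new. The covariance of the projected data is diagonal, with the eigenvalues on the diagonal, which is the sense in which the new variables are uncorrelated.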