Understanding Data and Machine Learning Models with Visualizations (Part 1)

Examining Feature-PC Correlation

On the non-interactive side, the tool also generates heatmaps with additional information about the principal components. The figure below shows the correlation matrix between each principal component and each feature in our original set:

iris: Z-normalized (μ=0, σ=1) correlation matrix, examining correlation of each feature (y-axis) and each principal component (x-axis)
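
For readers who want to reproduce a matrix like this one, here is a minimal sketch. It is not the tool's own code: it assumes scikit-learn's bundled copy of iris, its StandardScaler for the Z-normalization, and a plain Pearson correlation between each normalized feature and each PC's scores. The variable names are mine.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

iris = load_iris()
feature_names = iris.feature_names

# Z-normalize each feature (mean 0, standard deviation 1), then fit PCA.
X = StandardScaler().fit_transform(iris.data)
pca = PCA().fit(X)
scores = pca.transform(X)  # one column of scores per principal component

# np.corrcoef stacks the feature columns and the PC-score columns; the
# top-right block of the result holds corr(feature i, PC j).
n_features = X.shape[1]
full_corr = np.corrcoef(X, scores, rowvar=False)
feature_pc_corr = full_corr[:n_features, n_features:]

# Rows are features (y-axis), columns are PC1..PC4 (x-axis).
for name, row in zip(feature_names, feature_pc_corr):
    print(f"{name:>18}: " + "  ".join(f"{v:+.2f}" for v in row))
```

The resulting 4x4 array can be passed to any heatmap routine. Note that the sign of a PC is arbitrary, so an entire column may come out negated relative to another implementation.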

At first glance, it’s hard to find meaning in just the correlation between the original features and our PCs. Often, when I showed just this heatmap to others, they wondered what to look for.

After some discussion with others, I decided it would be more helpful to normalize the matrix further, so that the meaningful information stands out immediately.

Take 2: below, we see the dot product of the explained variance per principal component with the absolute value of the previous correlation matrix. Essentially, we weight each correlation by the explained variance of its PC, which better highlights the features that contribute to variance in the data and PCs.

iris: The normalized correlation matrix (|Z-normalized(features*PCs)|*explained_variance) gives a much clearer view of the features that contribute to variance in the dataset.
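
The weighting described above can be sketched as follows. This is one possible reading of the caption, not the tool's confirmed implementation: I assume the absolute correlations are scaled column by column by scikit-learn's explained_variance_ratio_, and that the row sums (which equal the dot product with the explained-variance vector) are what determine the y-axis order.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X = StandardScaler().fit_transform(iris.data)
pca = PCA().fit(X)
scores = pca.transform(X)

# Feature-vs-PC correlation block, as in the first heatmap.
n_features = X.shape[1]
corr = np.corrcoef(X, scores, rowvar=False)[:n_features, n_features:]

# Assumed reading of the caption: |correlation| scaled, column by column,
# by the fraction of total variance each PC explains.
ratio = pca.explained_variance_ratio_   # shape (n_components,)
weighted = np.abs(corr) * ratio         # broadcasts across columns

# Summing each row equals the dot product |corr| @ ratio: one score per feature.
feature_score = weighted.sum(axis=1)

# Stack-rank features from highest to lowest score (assumed y-axis order).
order = np.argsort(feature_score)[::-1]
ranked_features = [iris.feature_names[i] for i in order]
for i in order:
    print(f"{iris.feature_names[i]:>18}: {feature_score[i]:.3f}")
```

Because the weights shrink the columns of low-variance PCs toward zero, strong correlations with PC3 or PC4 no longer compete visually with those of PC1. The exact ranking depends on these assumptions and on the copy of the dataset used, so it may not match the figure.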

Now, we can find interesting aspects of the data more easily. As we saw in the earlier interactive plot, PC1 explains a majority of the variance in the dataset (~72.8%). We can now easily see that its contributors are primarily petal length (cm) and sepal width (cm): features that correlate with the PCs that explain the most variance are now stack-ranked on the y-axis.

Let’s say we want to engineer features from our original dataset to better classify each iris category. Following from the PCA, we may conclude that petal length (cm) and sepal width (cm) are the primary features of interest. We can then iterate and engineer additional features from them (e.g. the petal length/sepal width ratio, or the petal length*sepal width product) that may improve any classifier we train.
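
As a rough illustration of that iteration loop, the sketch below builds the two example features and compares cross-validated accuracy with and without them. The classifier (a scaled logistic regression), the 5-fold setup, and the helper name mean_cv_accuracy are my own choices for the example, not part of the original tool, and whether the new features actually help is something to check rather than assume.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X, y = iris.data, iris.target
names = list(iris.feature_names)

petal_length = X[:, names.index("petal length (cm)")]
sepal_width = X[:, names.index("sepal width (cm)")]

# Two candidate engineered features from the text. Sepal width is strictly
# positive in iris, so the ratio is well defined.
ratio = petal_length / sepal_width
product = petal_length * sepal_width
X_aug = np.column_stack([X, ratio, product])


def mean_cv_accuracy(features, labels):
    """Mean 5-fold cross-validated accuracy of a scaled logistic regression."""
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    return cross_val_score(model, features, labels, cv=5).mean()


baseline = mean_cv_accuracy(X, y)
augmented = mean_cv_accuracy(X_aug, y)
print(f"original features:        {baseline:.3f}")
print(f"with engineered features: {augmented:.3f}")
```

Iris is small and already easy to classify, so any difference between the two scores is likely to be slight. The same pattern is more informative on harder datasets.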