Dimensionality Reduction Using t-Distributed Stochastic Neighbor Embedding (t-SNE) on the MNIST Dataset
It is easy for us to visualize two- or three-dimensional data, but once we go beyond three dimensions, it becomes much harder to see what high-dimensional data looks like.
Today we often need to analyze and find patterns in datasets with thousands or even millions of dimensions, which makes visualization a challenge. One tool that can help us better understand such data is dimensionality reduction.
In this post, I will discuss t-SNE, a popular non-linear dimensionality reduction technique, and how to implement it in Python using sklearn. The dataset I have chosen here is the popular MNIST dataset.
Table of Curiosities
What is t-SNE and how does it work?
How is t-SNE different with PCA?
How can we improve upon t-SNE?
What are the limitations?
What can we do next?
Overview
T-Distributed Stochastic Neighbor Embedding, or t-SNE, is a machine learning algorithm that is often used to embed high-dimensional data in a low-dimensional space [1].
In simple terms, the approach of t-SNE can be broken down into two steps. The first step is to represent the high-dimensional data by constructing a probability distribution P, in which the probability of similar points being picked is high, whereas the probability of dissimilar points being picked is low. The second step is to create a low-dimensional space with another probability distribution Q that preserves the properties of P as closely as possible.
In step 1, we compute the similarity between two data points using a conditional probability p. For example, the conditional probability of j given i is the probability that x_i would pick x_j as its neighbor, if neighbors were picked in proportion to their probability density under a Gaussian distribution centered at x_i [1]. In step 2, we let y_i and y_j be the low-dimensional counterparts of x_i and x_j, respectively. We then define q as the analogous conditional probability of y_j being picked by y_i, this time using a Student's t-distribution in the low-dimensional map. The locations of the low-dimensional data points are determined by minimizing the Kullback–Leibler divergence of the distribution P from Q.
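Concretely, the similarities defined in the original paper [1] are

p_{j|i} = \frac{\exp(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2)}, \qquad q_{ij} = \frac{(1 + \lVert y_i - y_j \rVert^2)^{-1}}{\sum_{k \neq l} (1 + \lVert y_k - y_l \rVert^2)^{-1}}

where each bandwidth \sigma_i is chosen so that the conditional distribution has a user-specified perplexity. The map points y_i are then found by gradient descent on the cost

C = \mathrm{KL}(P \parallel Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}, \qquad p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}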
For more technical details of t-SNE, check out this paper.
I have chosen the MNIST dataset from Kaggle (link) as the example here because it is a simple computer vision dataset, with 28x28 pixel images of handwritten digits (0–9). We can think of each instance as a data point embedded in a 784-dimensional space.
To see the full Python code, check out my Kaggle kernel.
Without further ado, let’s get to the details!
Exploration
Note that in the original Kaggle competition, the goal is to build an ML model, using the training images with their true labels, that can accurately predict the labels of the test set. For our purposes here, we will only use the training set.
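First, a minimal loading sketch (the file path is an assumption; point it at your local copy of the competition's train.csv):

import pandas as pd

# Path is an assumption; adjust to wherever the Kaggle train.csv lives
train = pd.read_csv("train.csv")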
As usual, we check its shape first:
train.shape
--------------------------------------------------------------------
(42000, 785)
There are 42K training instances. The 785 columns are the 784 pixel values plus the ‘label’ column.
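To build intuition for what a point in this 784-dimensional space looks like, we can reshape a row back into a 28x28 image (a quick sketch; the row index 0 is arbitrary):

import matplotlib.pyplot as plt

# Take the first instance, drop the label, and reshape the 784
# pixel values back into a 28x28 grayscale image
pixels = train.drop("label", axis=1).iloc[0].values.reshape(28, 28)
plt.imshow(pixels, cmap="gray")
plt.title(f"Label: {train['label'].iloc[0]}")
plt.show()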
We can check the label distribution as well:
label = train["label"]
label.value_counts()
--------------------------------------------------------------------
1 4684
7 4401
3 4351
9 4188
2 4177
6 4137
0 4132
4 4072
8 4063
5 3795
Name: label, dtype: int64
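The classes are roughly balanced. For a visual check, a small seaborn sketch:

import seaborn as sns
import matplotlib.pyplot as plt

# Bar chart of the label counts across the ten digits
sns.countplot(x=label)
plt.show()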
Principal Component Analysis (PCA)
Before we implement t-SNE, let’s try PCA, a popular linear method for dimensionality reduction.
After we standardize the data, we can transform our data using PCA (specify ‘n_components’ to be 2):
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Drop the label column so that only the 784 pixel features are transformed
train = train.drop("label", axis=1)
train = StandardScaler().fit_transform(train)
pca = PCA(n_components=2)
pca_res = pca.fit_transform(train)
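As a quick sanity check (not part of the original walkthrough), we can look at how much variance the two components actually retain:

# The two leading components capture only a small fraction of the
# total variance, a hint that a 2D linear projection discards most
# of the structure in the data
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())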
Let’s make a scatter plot to visualize the result:
import seaborn as sns

sns.scatterplot(x = pca_res[:,0], y = pca_res[:,1], hue = label, palette = sns.hls_palette(10), legend = 'full');

[Figure: 2D scatter plot of MNIST data after applying PCA]

As shown in the scatter plot, PCA with two components does not sufficiently separate the different labels or reveal meaningful patterns. We know that one drawback of PCA is that its linear projection can’t capture non-linear dependencies. Let’s try t-SNE now.
T-SNE with sklearn
We will implement t-SNE using sklearn.manifold (documentation):
from sklearn.manifold import TSNE

tsne = TSNE(n_components = 2, random_state=0)
tsne_res = tsne.fit_transform(train)

sns.scatterplot(x = tsne_res[:,0], y = tsne_res[:,1], hue = label, palette = sns.hls_palette(10), legend = 'full');

[Figure: 2D scatter plot of MNIST data after applying t-SNE]
Now we can see that the different clusters are more separable compared with the result from PCA. Here are a few observations on this plot:
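One practical observation that applies to any t-SNE plot: the layout depends strongly on the perplexity hyperparameter (sklearn's default is 30), so it is worth re-running with a few values before reading too much into cluster shapes. A quick sketch (the values chosen here are arbitrary):

from sklearn.manifold import TSNE

# Re-embed with several perplexity values; each run can take a while
for perplexity in [5, 30, 50]:
    embedding = TSNE(n_components=2, perplexity=perplexity, random_state=0).fit_transform(train)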
An Approach that Combines Both
It is generally recommended to use PCA or TruncatedSVD to reduce the number of dimensions to a reasonable amount (e.g. 50) before applying t-SNE [2].
Doing so can reduce the level of noise as well as speed up the computations.
Let’s try PCA (50 components) first and then apply t-SNE, as sketched below.
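A minimal sketch of the combined pipeline, reusing the standardized train array and the label series from above:

from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import seaborn as sns

# Step 1: linear reduction to 50 components to cut noise and
# shrink the input that t-SNE has to process
pca_50 = PCA(n_components=50)
pca_res_50 = pca_50.fit_transform(train)

# Step 2: t-SNE down to 2 dimensions on the 50-component data
tsne = TSNE(n_components=2, random_state=0)
tsne_res_50 = tsne.fit_transform(pca_res_50)

sns.scatterplot(x = tsne_res_50[:,0], y = tsne_res_50[:,1], hue = label, palette = sns.hls_palette(10), legend = 'full');

Here is the resulting scatter plot: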
[Figure: 2D scatter plot of MNIST data after applying PCA (50 components) and then t-SNE]

Compared with the previous scatter plot, we can now separate out the 10 clusters better. Here are a few observations:
In addition, the runtime of this approach decreased by over 60%.
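To check the speedup on your own machine, a simple timing harness (a sketch; the exact numbers will vary with hardware and sklearn version):

import time
from sklearn.manifold import TSNE

# t-SNE on the raw standardized 784-dimensional data
start = time.perf_counter()
TSNE(n_components=2, random_state=0).fit_transform(train)
raw_seconds = time.perf_counter() - start

# t-SNE on the 50-component PCA output from the sketch above
start = time.perf_counter()
TSNE(n_components=2, random_state=0).fit_transform(pca_res_50)
reduced_seconds = time.perf_counter() - start

print(f"raw: {raw_seconds:.0f}s, pca50 + t-SNE: {reduced_seconds:.0f}s")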
For more interactive 3D scatter plots, check out this post.
Limitations
Here are a few limitations of t-SNE:
Next Steps
Here are a few things that we can try as next steps:
Try some of the other non-linear techniques, such as Uniform Manifold Approximation and Projection (UMAP), a generalization of t-SNE that is grounded in Riemannian geometry.
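As a starting point (this assumes the third-party umap-learn package, which is not part of sklearn, is installed):

import umap

# UMAP exposes a fit_transform API similar to sklearn's TSNE;
# n_neighbors plays a role loosely analogous to t-SNE's perplexity
reducer = umap.UMAP(n_components=2, n_neighbors=15, random_state=42)
umap_res = reducer.fit_transform(pca_res_50)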
Summary
Let’s quickly recap.
We implemented t-SNE using sklearn on the MNIST dataset. We compared the visualized output with that of PCA, and lastly, we tried a mixed approach that applies PCA first and then t-SNE.
I hope you enjoyed this blog post and please share any thoughts that you may have :)
Check out my other post on the Chi-square test for independence:
Translated from: https://towardsdatascience.com/dimensionality-reduction-using-t-distributed-stochastic-neighbor-embedding-t-sne-on-the-mnist-9d36a3dd4521