Dimensionality reduction methods

What is dimension reduction?

Dimension reduction enables visualization of high-dimensional omics data in 2D. The main goal of any dimension reduction method is to retain the most important information in the data while reducing the number of dimensions. The result is usually interpreted by examining how the samples cluster.

If two points (samples) are close in the 2D representation, they are generally also similar in the high-dimensional space, e.g. they have a similar transcriptome or proteome profile.

Dimensionality reduction methods available in PanHunter

The algorithms for dimensionality reduction available in PanHunter are:

  • PCA – Principal component analysis. A linear deterministic dimensionality reduction method. For more information watch this 5-minute explanation and detailed video tutorial or read this article.

  • UMAP – Uniform manifold approximation and projection. A nonlinear stochastic dimensionality reduction method. UMAP has two hyperparameters, number of neighbors and minimum distance, that can be set within PanHunter. The number of neighbors balances the local and global structure in the data: low values focus on very local structure, while high values focus on global structure. The minimum distance controls how tightly points can be packed together in the embedding space, and thereby influences the degree of separation between clusters. It is also possible to set a seed, which ensures reproducibility by fixing the starting point of the random number generator. Check out this video tutorial or read the article for more explanations.

  • t-SNE – t-distributed stochastic neighbor embedding. A nonlinear stochastic dimensionality reduction method. t-SNE has one hyperparameter, perplexity, that can be adjusted in PanHunter. As described by L. van der Maaten & G. Hinton (2008), perplexity can be interpreted as a smooth measure of the effective number of neighbors considered for each point, with typical values between 5 and 50. It therefore determines how large a local neighborhood is taken into account when the embedding is built. It is also possible to set a seed, which ensures reproducibility by fixing the starting point of the random number generator. Check out this video explanation or read this article for more explanations.

  • PHATE – Potential of heat-diffusion for affinity-based trajectory embedding. A nonlinear stochastic dimensionality reduction method that preserves both global and local structure. It is run with the default parameters set in scanpy. For more information, see this article or learn more here.
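The reading of perplexity as an "effective number of neighbors" in the t-SNE entry above can be made concrete with a small calculation. The sketch below is illustrative only and is not PanHunter's implementation: the function names `perplexity` and `gaussian_neighbor_probs`, the distances, and the bandwidth values `sigma` are all assumptions chosen for the example. It uses the definition from van der Maaten & Hinton (2008), perplexity = 2^H(P), where H is the Shannon entropy (in bits) of a point's neighbor distribution.

```python
import math


def perplexity(probabilities):
    """Return 2**H(P), where H is the Shannon entropy of P in bits."""
    entropy = -sum(p * math.log2(p) for p in probabilities if p > 0)
    return 2 ** entropy


def gaussian_neighbor_probs(distances, sigma):
    """Turn distances to neighbors into probabilities using a Gaussian kernel.

    Illustrative stand-in for t-SNE's input similarities; sigma is the
    kernel bandwidth (a hypothetical value picked by hand here).
    """
    weights = [math.exp(-(d ** 2) / (2 * sigma ** 2)) for d in distances]
    total = sum(weights)
    return [w / total for w in weights]


# A point with 30 equally likely neighbors has a perplexity of about 30.
uniform = [1 / 30] * 30
uniform_perplexity = perplexity(uniform)

# Made-up distances from one sample to six other samples.
distances = [0.5, 1.0, 1.5, 2.0, 4.0, 8.0]

# A narrow kernel concentrates probability on the closest samples
# (low perplexity); a wide kernel spreads it over more samples.
narrow = perplexity(gaussian_neighbor_probs(distances, sigma=0.5))
wide = perplexity(gaussian_neighbor_probs(distances, sigma=5.0))

print(f"uniform over 30 neighbors: {uniform_perplexity:.2f}")
print(f"narrow kernel: {narrow:.2f}, wide kernel: {wide:.2f}")
```

In t-SNE itself the direction is reversed: the user chooses the perplexity, and the algorithm searches for a bandwidth per point that produces it. A higher perplexity therefore means that each point takes a larger neighborhood into account.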

📝 Setting seeds: Users are encouraged to try multiple random seeds to validate the consistency and robustness of the patterns observed in the resulting 2D space.
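One possible way to quantify this kind of consistency is to check whether each sample keeps the same nearest neighbors in embeddings obtained with different seeds. The following is a minimal sketch under stated assumptions, not a PanHunter feature: the helper names `nearest_neighbors` and `knn_overlap` are hypothetical, and the coordinates are made up and stand in for 2D embeddings exported from two runs. Neighbor sets are used because they do not change when an embedding is rotated or mirrored, which often happens between runs of stochastic methods.

```python
import math


def nearest_neighbors(embedding, k):
    """For each point, return the set of indices of its k nearest neighbors."""
    neighbors = []
    for i, p in enumerate(embedding):
        # Distance from point i to every other point, paired with its index.
        dists = sorted(
            (math.dist(p, q), j) for j, q in enumerate(embedding) if j != i
        )
        neighbors.append({j for _, j in dists[:k]})
    return neighbors


def knn_overlap(emb_a, emb_b, k=2):
    """Mean Jaccard overlap of k-nearest-neighbor sets between two embeddings.

    1.0 means every sample has the same neighbors in both embeddings;
    values near 0 mean the neighborhoods differ strongly.
    """
    nn_a = nearest_neighbors(emb_a, k)
    nn_b = nearest_neighbors(emb_b, k)
    scores = [len(a & b) / len(a | b) for a, b in zip(nn_a, nn_b)]
    return sum(scores) / len(scores)


# Made-up embedding of six samples forming two groups of three.
emb_a = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 5.2), (5.2, 5.1)]

# Stand-in for a second run: the same layout rotated by 90 degrees.
emb_b = [(-y, x) for x, y in emb_a]

# Stand-in for an inconsistent run: samples 1, 2, 3 and 4 end up in other groups.
emb_c = [(0.0, 0.0), (5.0, 5.0), (0.1, 0.2), (5.1, 5.2), (0.2, 0.1), (5.2, 5.1)]

print(knn_overlap(emb_a, emb_b, k=2))  # rotated copy: neighborhoods unchanged
print(knn_overlap(emb_a, emb_c, k=2))  # regrouped samples: lower overlap
```

A high overlap across several seeds suggests that the observed grouping is robust. A low overlap suggests that the pattern depends on the random initialization and should be interpreted with caution.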

For a comparison of the PCA, UMAP and t-SNE methods, and guidance on which one to use, check out the following resources: