Exploratory analysis

What is exploratory analysis?

The goal of exploratory analysis is to determine which of the associated metadata best explains the clustering of the abundance data. In PanHunter, exploratory analysis is performed in the New Comparison app and can be run on categorical variables, numerical variables, or feature abundances.

How is exploratory analysis done in PanHunter?

The calculation is done in the following way:

1. First, a distance matrix between samples is computed using the metric that was selected for dimension reduction (default is Euclidean).

2. Second, p-values are calculated with different statistical tests depending on the metadata used for the exploratory analysis:

For categorical variables, the Mann-Whitney test is applied twice.

  • The first test checks whether the distances between samples that share a level of the analyzed category are smaller than the distances between samples that fall into different levels of this category. In other words, it tests whether the category as a whole drives the sample clustering. For example, take the category sex with the three levels ‘Male’, ‘Female’ and ‘Unknown’. The test checks whether the pairwise distances within ‘Male’, within ‘Female’ and within ‘Unknown’ samples tend to be smaller than the pairwise distances of the ‘Male’ and ‘Female’, ‘Male’ and ‘Unknown’, and ‘Female’ and ‘Unknown’ combinations.
  • The second test checks whether the distances between the samples of one particular level of a category are smaller than the distances between all other pairs of samples. In other words, it tests whether this level of the category drives the sample clustering. For example, take again the category sex with the three levels ‘Male’, ‘Female’ and ‘Unknown’. To test whether ‘Male’ drives the clustering, the pairwise distances between all ‘Male’ samples are tested for being smaller than the pairwise distances of the ‘Male’ and ‘Female’, ‘Female’ and ‘Female’, ‘Male’ and ‘Unknown’, ‘Female’ and ‘Unknown’, and ‘Unknown’ and ‘Unknown’ combinations.
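
The two tests above can be sketched in Python with SciPy. This is a minimal illustration under stated assumptions, not PanHunter's actual implementation: the function name categorical_tests and its arguments are hypothetical, X is assumed to be a samples-by-features abundance matrix, and the per-level test compares each level against all remaining sample pairs, as in the ‘Male’ example.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import mannwhitneyu


def categorical_tests(X, labels, metric="euclidean"):
    """Hypothetical helper: one-sided Mann-Whitney tests on pairwise distances."""
    labels = np.asarray(labels)
    # Condensed pairwise distances; the order matches np.triu_indices(n, k=1).
    d = pdist(np.asarray(X, dtype=float), metric=metric)
    i, j = np.triu_indices(len(labels), k=1)
    same = labels[i] == labels[j]  # True where both samples share a level

    def one_sided_p(mask):
        # Assumption: return NaN when either group is empty,
        # e.g. for a level that contains a single sample.
        if mask.sum() == 0 or (~mask).sum() == 0:
            return float("nan")
        # alternative="less": distances in `mask` tend to be smaller.
        return float(mannwhitneyu(d[mask], d[~mask], alternative="less").pvalue)

    # Test 1: within-level pairs vs. between-level pairs (whole category).
    p_category = one_sided_p(same)

    # Test 2: pairs within one level vs. all remaining pairs (per level).
    p_levels = {}
    for level in np.unique(labels):
        p_levels[str(level)] = one_sided_p(same & (labels[i] == level))
    return p_category, p_levels
```

Calling categorical_tests(X, sex_labels) would return one p-value for the category and one p-value per level; small values suggest that samples sharing a level lie closer together than other sample pairs.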

For numerical variables and features, Spearman’s rank correlation (ρ) is calculated between the pairwise sample distances and the Manhattan distances between the corresponding values of the variable being analyzed. The p-value for the null hypothesis ρ = 0 against the alternative ρ > 0 is computed via an asymptotic t approximation.
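
The correlation test for a numerical variable might look as follows. Again, this is only a sketch under assumptions: numerical_test is a hypothetical name, X is assumed to be a samples-by-features matrix, and SciPy 1.7 or later is assumed for the alternative argument of spearmanr, which derives its p-value from a t distribution.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr


def numerical_test(X, values, metric="euclidean"):
    """Hypothetical helper: Spearman correlation between two sets of distances."""
    values = np.asarray(values, dtype=float)
    # Pairwise distances between samples in abundance space.
    sample_d = pdist(np.asarray(X, dtype=float), metric=metric)
    # Manhattan ("cityblock") distances between the variable's values;
    # for a single variable this is the absolute difference |v_i - v_j|.
    value_d = pdist(values.reshape(-1, 1), metric="cityblock")
    # One-sided test: H0 rho = 0 against H1 rho > 0.
    rho, p_value = spearmanr(sample_d, value_d, alternative="greater")
    return float(rho), float(p_value)
```

A large positive ρ with a small p-value indicates that samples with similar values of the variable also tend to be close to each other in the abundance data.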

3. As a last step, a statistical score (StatScore) is calculated for each variable. It is equal to -log10 of the p-value, capped at 100. For numerical variables and features, the correlation is reported alongside the StatScore.
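
The StatScore transformation can be written in a few lines. The function name stat_score is hypothetical, and treating a p-value of exactly 0 as the capped score is an assumption made here to avoid taking the logarithm of zero.

```python
import math


def stat_score(p_value, cap=100.0):
    """Hypothetical helper: -log10(p-value), capped at `cap`."""
    # Assumption: a p-value of 0 (numerical underflow) maps to the cap.
    if p_value <= 0.0:
        return cap
    return min(-math.log10(p_value), cap)
```

For example, a p-value of 0.001 corresponds to a StatScore of about 3, and any p-value below 1e-100 is reported as 100.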