Title: Exploratory Factor Analysis for Data on a Sphere
Abstract: Many modern applications, such as those involving text, image, or genetic data, yield observations on a unit sphere after processing. The projected normal distribution, which results from scaling a multivariate Gaussian random vector to have unit norm, allows us to conveniently characterize relationships between variables through the variance-covariance matrix. We incorporate latent factor models into the projected normal distribution to obtain lower-dimensional representations of the variability in data on a sphere. Maximum likelihood estimation is proposed via a novel alternating expectation profile conditional maximization algorithm that incorporates implicitly restarted Lanczos and SQUAREM steps to enable fast exploratory analysis.
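As a rough illustration of the model described above, the following Python sketch simulates data from a projected normal distribution whose underlying Gaussian has a factor-structured covariance. The symbols `mu`, `Lambda`, and `psi`, the function name, and the dimensions are illustrative assumptions and do not come from the talk; the sketch covers only data generation, not the estimation algorithm.

```python
import numpy as np

def simulate_projected_normal_factor(n, p, q, seed=0):
    """Simulate n points on the unit sphere in R^p (hypothetical helper).

    Assumed model: Y = mu + Lambda Z + E, with Z ~ N(0, I_q) and
    E ~ N(0, diag(psi)), followed by X = Y / ||Y||.
    """
    rng = np.random.default_rng(seed)
    mu = rng.normal(size=p)                  # mean of the latent Gaussian
    Lambda = rng.normal(size=(p, q))         # factor loadings
    psi = rng.uniform(0.5, 1.5, size=p)      # uniquenesses (diagonal variances)

    Z = rng.normal(size=(n, q))              # latent factor scores
    E = rng.normal(size=(n, p)) * np.sqrt(psi)
    Y = mu + Z @ Lambda.T + E                # Gaussian with factor covariance

    # Scaling each Gaussian vector to unit norm gives a projected normal draw.
    X = Y / np.linalg.norm(Y, axis=1, keepdims=True)

    # Implied covariance of the unprojected Gaussian vector.
    Sigma = Lambda @ Lambda.T + np.diag(psi)
    return X, Sigma

X, Sigma = simulate_projected_normal_factor(n=500, p=10, q=2)
```

Here every row of `X` lies on the unit sphere, and `Sigma` is a rank-`q` matrix plus a diagonal, which is the kind of lower-dimensional structure a factor model imposes on the variance-covariance matrix.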
Four separate applications are presented. In the first, we explore 43,247 #MeToo tweets from December 3-13, 2018 and find support for the eight underlying latent variables displayed through word clouds in Figure 1. Our second example finds that two separate time series can characterize the variability of a typically developing child's brain in the resting state and that the related factor scores correspond to different functional areas of the brain. Application to the 70,000 28x28 images of digits in the MNIST database shows that 13 factors are adequate to explain the variability of handwritten digits and correspond to different handwriting styles (Figure 2). In the fourth application, we apply our method to RNA-seq data from 775 breast cancer samples. We discover 12 latent factors, of which the sixth is significantly associated with survival. Using the loadings on the sixth factor as the gene-level statistic, gene-set enrichment analysis (GSEA) finds that this factor is enriched for 31 biological processes, including protein expression and the immune response.
This work is joint with Fan Dai of Michigan Technological University and Ranjan Maitra and Karin Dorman of Iowa State University.