Principal component analysis of binary genomics data

Brief Bioinform. 2019 Jan 18;20(1):317-329. doi: 10.1093/bib/bbx119.

Abstract

Motivation: Genome-wide measurements of genetic and epigenetic alterations are generating ever more high-dimensional binary data. Because of the special mathematical characteristics of binary data, it is not obvious how to apply the classical principal component analysis (PCA) model directly to explore low-dimensional structure. Although several PCA alternatives for binary data exist in the psychometric, data analysis and machine learning literature, they are not well known in the bioinformatics community.

Results: In this article, we introduce the motivation and rationale of several parametric and nonparametric versions of PCA specifically geared to binary data. Using both realistic simulations of binary data and the mutation, CNA and methylation data of the Genomic Determinants of Sensitivity in Cancer 1000 (GDSC1000), we explored the performance of the methods with respect to finding the correct number of components, overfitting, recovering the correct low-dimensional structure and assessing variable importance, among other criteria. The results show that if a low-dimensional structure exists in the data, most of the methods can find it. When a probabilistic generating process is assumed to underlie the data, we recommend the parametric logistic PCA model; when such an assumption is not valid and the data are considered as given, we recommend the nonparametric Gifi model.
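To make the parametric option concrete, the following is a minimal sketch of a logistic PCA fit. It is not the authors' code (that is available from the URL given under Availability). It assumes the common formulation in which the log-odds of each binary entry are modelled as a column offset plus a low-rank term, and it uses plain gradient descent on the Bernoulli negative log-likelihood, which is a simple stand-in for whatever estimation algorithm the article uses. The function names, the learning rate, the iteration count and the simulated data are all illustrative assumptions.

```python
import numpy as np


def sigmoid(t):
    # Numerically stable logistic function.
    return 0.5 * (1.0 + np.tanh(0.5 * t))


def bernoulli_nll(X, theta):
    # Negative log-likelihood of binary X given log-odds theta.
    return float(np.sum(np.logaddexp(0.0, theta) - X * theta))


def logistic_pca(X, k=2, n_iter=500, lr=0.01, seed=0):
    """Fit logit P(x_ij = 1) = mu_j + (A B^T)_ij by gradient descent.

    Hypothetical helper for illustration: A holds sample scores (n x k),
    B holds variable loadings (p x k), mu holds column offsets (p,).
    No regularisation or convergence check is included.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    mu = np.zeros(p)
    A = 0.01 * rng.standard_normal((n, k))
    B = 0.01 * rng.standard_normal((p, k))
    losses = []
    for _ in range(n_iter):
        theta = mu + A @ B.T
        losses.append(bernoulli_nll(X, theta))
        G = sigmoid(theta) - X  # gradient of the loss w.r.t. theta
        grad_mu = G.sum(axis=0)
        grad_A = G @ B
        grad_B = G.T @ A
        mu -= lr * grad_mu
        A -= lr * grad_A
        B -= lr * grad_B
    return mu, A, B, losses


# Simulated binary data with a rank-2 structure on the log-odds scale
# (an assumption made for this example, not data from the article).
rng = np.random.default_rng(1)
n, p, k = 60, 20, 2
true_theta = rng.standard_normal((n, k)) @ rng.standard_normal((k, p))
X = (rng.random((n, p)) < sigmoid(true_theta)).astype(float)

mu, A, B, losses = logistic_pca(X, k=k)
print(f"negative log-likelihood: {losses[0]:.1f} -> {losses[-1]:.1f}")
```

The rows of A can be plotted as a score plot in the same way as classical PCA scores. Because this sketch is unregularised, the fitted log-odds can grow without bound when the data are close to separable, which is one form of the overfitting the abstract mentions.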

Availability: The codes to reproduce the results in this article are available at the homepage of the Biosystems Data Analysis group (www.bdagroup.nl).

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms
  • Computational Biology / methods
  • Computational Biology / statistics & numerical data
  • Computer Simulation
  • DNA Copy Number Variations
  • DNA Methylation
  • Databases, Genetic / statistics & numerical data
  • Genomics / statistics & numerical data*
  • Humans
  • Logistic Models
  • Machine Learning
  • Neoplasms / genetics
  • Nonlinear Dynamics
  • Principal Component Analysis*
  • Software
  • Statistics, Nonparametric