Unsupervised methods in LC-MS data treatment: Application for potential chemotaxonomic markers search

J Pharm Biomed Anal. 2021 Nov 30:206:114382. doi: 10.1016/j.jpba.2021.114382. Epub 2021 Sep 21.

Abstract

The combination of Liquid Chromatography and Mass Spectrometry (LC-MS) is commonly used to determine and characterize biologically active compounds because of its high resolution and sensitivity. In this work we explore the interpretation of LC-MS data using multivariate statistical analysis algorithms to extract useful chemical information and identify clusters of similar samples. Samples of leaves from 19 plants belonging to the Apiaceae family were analyzed under unified LC conditions by high- and low-resolution mass spectrometry in wide-range scan mode. After preprocessing, the LC-MS data were analyzed statistically using one of three approaches: tensor decomposition in the form of Parallel Factor Analysis (PARAFAC); matrix factorization following tensor unfolding, with principal component analysis (PCA), independent component analysis (ICA), or non-negative matrix factorization (NMF); or unsupervised feature selection (UFS). The optimal number of components for each of these methods was determined, and the results were compared using four metrics: silhouette score, Davies-Bouldin index, computational time, and number of noisy components. PCA, ICA and UFS gave the best results across the majority of the criteria for both low- and high-resolution data. An algorithm for biomarker signal selection is suggested, and 23 potential chemotaxonomic markers were tentatively identified using MS2 data. Dendrograms constructed by each method were compared with the molecular phylogenetic tree by calculating the pixel-wise mean square error (MSE). The suggested approach can therefore support chemotaxonomic studies and yield valuable chemical information for biomarker discovery.
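As an illustration of the "unfold, factorize, score" route described above, the following is a minimal sketch and not the authors' pipeline. It assumes a three-way array laid out as samples × retention time × m/z, uses synthetic data with two artificial sample groups, computes PCA through a plain SVD, and evaluates the Davies-Bouldin index by hand. The function names (`unfold_samples`, `pca_scores`, `davies_bouldin`), the array sizes, and the use of known group labels instead of a clustering step are all assumptions made for the example.

```python
import numpy as np


def unfold_samples(tensor):
    """Unfold a (samples, retention_time, mz) array into (samples, features)."""
    return tensor.reshape(tensor.shape[0], -1)


def pca_scores(matrix, n_components):
    """Project mean-centered rows onto the leading principal components via SVD."""
    centered = matrix - matrix.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_components] * s[:n_components]


def davies_bouldin(scores, labels):
    """Davies-Bouldin index: lower values mean tighter, better separated clusters."""
    clusters = np.unique(labels)
    centroids = np.array([scores[labels == c].mean(axis=0) for c in clusters])
    # Mean distance of each cluster's points to its own centroid.
    scatter = np.array([
        np.linalg.norm(scores[labels == c] - centroids[i], axis=1).mean()
        for i, c in enumerate(clusters)
    ])
    worst_ratios = []
    for i in range(len(clusters)):
        ratios = [
            (scatter[i] + scatter[j]) / np.linalg.norm(centroids[i] - centroids[j])
            for j in range(len(clusters)) if j != i
        ]
        worst_ratios.append(max(ratios))
    return float(np.mean(worst_ratios))


# Synthetic stand-in for preprocessed LC-MS data (assumption: two sample
# groups, each sharing one intensity profile plus small Gaussian noise).
rng = np.random.default_rng(0)
n_per_group, n_rt, n_mz = 5, 20, 15
profile_a = rng.random((n_rt, n_mz))
profile_b = rng.random((n_rt, n_mz))
tensor = np.concatenate([
    profile_a + 0.05 * rng.standard_normal((n_per_group, n_rt, n_mz)),
    profile_b + 0.05 * rng.standard_normal((n_per_group, n_rt, n_mz)),
])
labels = np.array([0] * n_per_group + [1] * n_per_group)

matrix = unfold_samples(tensor)            # shape (10, 300)
scores = pca_scores(matrix, n_components=2)
dbi = davies_bouldin(scores, labels)
print(f"unfolded shape: {matrix.shape}, Davies-Bouldin index: {dbi:.3f}")
```

In practice a library implementation would likely replace the hand-written parts; scikit-learn, for example, provides `PCA`, `FastICA`, `NMF`, `silhouette_score` and `davies_bouldin_score`. Repeating the scoring step over a range of component counts is one plausible way to look for an optimal number of components, although the abstract does not state how that search was carried out.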

Keywords: Apiaceae; Liquid chromatography; Machine learning; Mass spectrometry; Multi-way data.

MeSH terms

  • Algorithms*
  • Biomarkers
  • Chromatography, Liquid
  • Principal Component Analysis
  • Tandem Mass Spectrometry*

Substances

  • Biomarkers