Machine Learning Prediction of Biomarkers from SNPs and of Disease Risk from Biomarkers in the UK Biobank

Genes (Basel). 2021 Jun 29;12(7):991. doi: 10.3390/genes12070991.

Abstract

We use UK Biobank (UKB) data to train predictors for 65 blood and urine markers, such as HDL, LDL, lipoprotein A, and glycated haemoglobin, from SNP genotype. For example, our Polygenic Score (PGS) predictor correlates ∼0.76 with lipoprotein A level, which is highly heritable and an independent risk factor for heart disease. This may be the most accurate genomic prediction of a quantitative trait that has yet been produced (specifically, for European ancestry groups). We also train predictors of common disease risk using blood and urine biomarkers alone (no DNA information); we call these predictors biomarker risk scores (BMRS). Individuals who are at high risk (e.g., odds ratio of >5× population average) can be identified for conditions such as coronary artery disease (AUC∼0.75), diabetes (AUC∼0.95), hypertension, liver and kidney problems, and cancer using biomarkers alone. Our atherosclerotic cardiovascular disease (ASCVD) predictor uses ∼10 biomarkers and performs in UKB evaluation as well as or better than the American College of Cardiology ASCVD Risk Estimator, which uses quite different inputs (age, diagnostic history, BMI, smoking status, statin usage, etc.). We compare polygenic risk scores (PRS: risk conditional on genotype) for common diseases to the risk predictors that result from composing the learned functions BMRS and PGS, i.e., applying the BMRS predictors to the PGS output.
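
The idea of applying a BMRS predictor to PGS output can be illustrated with a minimal sketch on synthetic data. Everything below is an assumption made for illustration: the linear form of the PGS, the logistic form of the BMRS, the weights, the sample sizes, and the function names (pgs, bmrs, composed_risk, auc) are hypothetical and are not taken from the paper or from UK Biobank.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes (not from the paper): individuals, SNPs, biomarkers.
n_people, n_snps, n_markers = 500, 40, 3

# Synthetic genotypes coded as minor-allele counts (0, 1, 2), then centred per SNP.
genotypes = rng.integers(0, 3, size=(n_people, n_snps)).astype(float)
genotypes -= genotypes.mean(axis=0)

# Hypothetical sparse SNP effect sizes, one column per biomarker.
snp_weights = rng.normal(0.0, 0.3, size=(n_snps, n_markers))
snp_weights[rng.random(snp_weights.shape) < 0.7] = 0.0


def pgs(g):
    """Assumed linear polygenic score: genotype -> predicted biomarker levels."""
    return g @ snp_weights


def bmrs(biomarkers, coef, intercept):
    """Assumed logistic biomarker risk score: biomarker levels -> disease probability."""
    return 1.0 / (1.0 + np.exp(-(biomarkers @ coef + intercept)))


def composed_risk(g, coef, intercept):
    """Genotype-only risk: apply the BMRS to the PGS output, BMRS(PGS(g))."""
    return bmrs(pgs(g), coef, intercept)


def auc(scores, labels):
    """AUC as the probability that a random case outscores a random control."""
    pos, neg = scores[labels == 1], scores[labels == 0]
    diff = pos[:, None] - neg[None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())


# Simulated measured biomarkers: genetic component plus non-genetic noise.
measured = pgs(genotypes) + rng.normal(0.0, 1.0, size=(n_people, n_markers))

# Hypothetical BMRS parameters and disease labels drawn from the measured-biomarker risk.
coef, intercept = np.array([0.8, -0.5, 0.3]), -1.0
labels = (rng.random(n_people) < bmrs(measured, coef, intercept)).astype(int)

auc_measured = auc(bmrs(measured, coef, intercept), labels)
auc_composed = auc(composed_risk(genotypes, coef, intercept), labels)
print(f"AUC, BMRS on measured biomarkers: {auc_measured:.3f}")
print(f"AUC, BMRS applied to PGS output:  {auc_composed:.3f}")
```

In this toy setup the composed predictor sees only the genetically predicted part of each biomarker, so comparing the two AUC values shows how much discriminative signal is carried by the genotype-derived component alone.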

Keywords: atherosclerotic cardiovascular disease; biomarkers; disease risk; machine learning; polygenic scores.

MeSH terms

  • Adult
  • Atherosclerosis / blood
  • Atherosclerosis / epidemiology*
  • Atherosclerosis / urine
  • Biological Specimen Banks
  • Biomarkers / blood*
  • Biomarkers / urine*
  • Calcium / blood
  • Calcium / urine
  • Cardiovascular Diseases / blood
  • Cardiovascular Diseases / epidemiology*
  • Female
  • Heart Disease Risk Factors
  • Hemoglobins / genetics
  • Humans
  • Lipoprotein(a) / blood*
  • Lipoproteins, HDL / blood
  • Lipoproteins, LDL / blood
  • Machine Learning
  • Male
  • Middle Aged
  • Multifactorial Inheritance / genetics
  • Risk Assessment
  • United Kingdom / epidemiology
  • United States / epidemiology

Substances

  • Biomarkers
  • Hemoglobins
  • Lipoprotein(a)
  • Lipoproteins, HDL
  • Lipoproteins, LDL
  • Calcium