Multinomial logistic regression with missing outcome data: An application to cancer subtypes

Stat Med. 2020 Oct 30;39(24):3299-3312. doi: 10.1002/sim.8666. Epub 2020 Jul 6.

Abstract

Many diseases, such as cancer and heart disease, are heterogeneous, and it is of great interest to study subtype-specific disease risk in relation to genetic and environmental risk factors. However, for logistical and cost reasons, the subtype information is missing for some subjects. In this article, we investigate methods for multinomial logistic regression with missing outcome data, including bootstrap hot deck multiple imputation (BHMI), simple inverse probability weighted (SIPW), augmented inverse probability weighted (AIPW), and expected estimating equation (EEE) estimators. These methods are important approaches for missing data regression. The BHMI modifies the standard hot deck multiple imputation method so that it provides valid confidence interval estimation. When the covariates are discrete, the SIPW, AIPW, and EEE estimators are numerically identical. When the covariates are continuous, nonparametric smoothers can be applied to estimate the selection probabilities and the estimating scores, and the methods perform similarly. Extensive simulations show that all of these methods yield unbiased estimators, whereas the complete-case (CC) analysis can be biased if the missingness depends on the observed data. Our simulations also demonstrate that these methods can gain substantial efficiency compared with the CC analysis. The methods are applied to a colorectal cancer study in which cancer subtype data are missing for some study participants.
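
As a rough illustration of the SIPW idea in the discrete-covariate case, the following Python sketch weights each subject with an observed outcome by the inverse of an estimated selection probability and then fits a weighted multinomial logistic model. It is not the authors' implementation. The simulated data, the always-observed auxiliary variable z that drives the missingness, the true coefficients, and the helper names (softmax_probs, fit_weighted_multinomial) are all assumptions made for this example. The selection probabilities are estimated by the observed fraction within each (x, z) cell.

```python
import numpy as np


def softmax_probs(X, B):
    """Category probabilities, with category 0 as the reference."""
    eta = np.column_stack([np.zeros(X.shape[0]), X @ B])  # (n, K)
    eta -= eta.max(axis=1, keepdims=True)                 # numerical stability
    P = np.exp(eta)
    return P / P.sum(axis=1, keepdims=True)


def fit_weighted_multinomial(X, y, w, n_cat, n_iter=50, tol=1e-10):
    """Newton-Raphson for a weighted multinomial logit (hypothetical helper).

    Returns B with shape (p, n_cat - 1); column k-1 holds the coefficients
    for category k versus the reference category 0.
    """
    n, p = X.shape
    m = n_cat - 1
    B = np.zeros((p, m))
    Y = np.eye(n_cat)[y][:, 1:]            # indicators of non-reference categories
    for _ in range(n_iter):
        P = softmax_probs(X, B)[:, 1:]
        score = X.T @ (w[:, None] * (Y - P))              # (p, m)
        info = np.zeros((p * m, p * m))                   # category-major blocks
        for j in range(m):
            for k in range(m):
                v = w * P[:, j] * ((j == k) - P[:, k])
                info[j * p:(j + 1) * p, k * p:(k + 1) * p] = X.T @ (v[:, None] * X)
        step = np.linalg.solve(info, score.T.reshape(-1))
        B = B + step.reshape(m, p).T
        if np.max(np.abs(step)) < tol:
            break
    return B


# --- Simulated data (all values below are illustrative assumptions) ---
rng = np.random.default_rng(0)
n, n_cat = 20000, 3
x = rng.integers(0, 2, size=n)                 # binary covariate
X = np.column_stack([np.ones(n), x])           # intercept + covariate
B_true = np.array([[-0.5, -1.0],               # intercepts for categories 1, 2
                   [0.8, 0.3]])                # slopes for categories 1, 2

# Draw the outcome (0 = reference, 1 and 2 = subtypes) from the model.
cum = np.cumsum(softmax_probs(X, B_true), axis=1)
y = (rng.random(n)[:, None] > cum[:, :-1]).sum(axis=1)

# Auxiliary variable z is related to y and always observed; the outcome is
# observed with a probability that depends on z only (missing at random).
z = rng.random(n) < np.array([0.2, 0.5, 0.8])[y]
r = rng.random(n) < np.where(z, 0.4, 0.9)      # r = True: outcome observed

# Nonparametric selection probabilities: observed fraction in each (x, z) cell.
pi_hat = np.empty(n)
for a in (0, 1):
    for b in (False, True):
        cell = (x == a) & (z == b)
        pi_hat[cell] = r[cell].mean()

y_obs = np.where(r, y, 0)                      # placeholder where missing (weight 0)
w_sipw = r / pi_hat                            # inverse probability weights
w_cc = r.astype(float)                         # complete-case: unit weights

B_sipw = fit_weighted_multinomial(X, y_obs, w_sipw, n_cat)
B_cc = fit_weighted_multinomial(X, y_obs, w_cc, n_cat)

print("true:\n", B_true)
print("SIPW:\n", B_sipw)
print("complete-case:\n", B_cc)
```

Because the missingness in this simulation depends on a variable related to the outcome, the complete-case fit targets a distorted outcome distribution, while the weighted fit is intended to recover the full-data coefficients. Variance estimation (for example, a sandwich estimator that accounts for the estimated weights) is deliberately left out of the sketch.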

Keywords: hot deck multiple imputation; inverse probability weighting; missing at random.

Publication types

  • Research Support, N.I.H., Extramural

MeSH terms

  • Data Interpretation, Statistical
  • Humans
  • Logistic Models
  • Models, Statistical*
  • Neoplasms* / epidemiology
  • Probability