Classifying injury narratives of large administrative databases for surveillance-A practical approach combining machine learning ensembles and human review

Accid Anal Prev. 2017 Jan:98:359-371. doi: 10.1016/j.aap.2016.10.014. Epub 2016 Nov 15.

Abstract

Injury narratives are now available in real time and include useful information for injury surveillance and prevention. However, manual classification of the cause or events leading to injury found in large batches of narratives, such as workers' compensation claims databases, can be prohibitive. In this study we compare the utility of four machine learning algorithms (Naïve Bayes single-word and bi-gram models, Support Vector Machine, and Logistic Regression) for classifying narratives into Bureau of Labor Statistics Occupational Injury and Illness event-leading-to-injury classifications for a large workers' compensation database. These algorithms are known to do well classifying narrative text and are fairly easy to implement with off-the-shelf software such as Python. We propose human-machine learning ensemble approaches that maximize the power and accuracy of the algorithms for machine-assigned codes and allow for strategic filtering of rare, emerging, or ambiguous narratives for manual review. We compare human-machine approaches based on filtering on the prediction strength of the classifier vs. agreement between algorithms. Regularized Logistic Regression (LR) was the best-performing algorithm alone. Using this algorithm and filtering out the bottom 30% of predictions for manual review resulted in high accuracy (overall sensitivity/positive predictive value of 0.89) of the final machine-human coded dataset. The best pairings of algorithms included Naïve Bayes with Support Vector Machine, whereby the triple ensemble NB_SW = NB_BI-GRAM = SVM had very high performance (0.93 overall sensitivity/positive predictive value, with high sensitivity and positive predictive values across both large and small categories), leaving 41% of the narratives for manual review. Integrating LR into this ensemble mix improved performance only slightly.
For large administrative datasets, we propose incorporating methods based on human-machine pairings, as we have done here, that use readily available off-the-shelf machine learning techniques and leave only a fraction of narratives requiring manual review. Human-machine ensemble methods are likely to improve performance over total manual coding.
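
The two filtering strategies compared in the abstract (prediction strength vs. agreement between algorithms) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the example labels, and the strength values are hypothetical, and the abstract does not specify how prediction strength was computed or how ties were handled. The sketch assumes each classifier has already produced one predicted code per narrative, plus a numeric strength score for the single-classifier case. `None` marks a narrative routed to manual review.

```python
from typing import List, Optional, Sequence


def filter_by_strength(
    preds: Sequence[str],
    strengths: Sequence[float],
    review_fraction: float = 0.30,
) -> List[Optional[str]]:
    """Keep machine codes except for the weakest predictions.

    The `review_fraction` of narratives with the lowest strength
    scores is replaced by None (i.e., sent to manual review).
    """
    n = len(preds)
    n_review = round(n * review_fraction)
    # Indices ordered from weakest to strongest prediction.
    order = sorted(range(n), key=lambda i: strengths[i])
    review = set(order[:n_review])
    return [None if i in review else preds[i] for i in range(n)]


def filter_by_agreement(*model_preds: Sequence[str]) -> List[Optional[str]]:
    """Keep a machine code only when every classifier assigns the same code.

    Narratives on which the classifiers disagree are replaced by None
    (i.e., sent to manual review).
    """
    return [
        codes[0] if all(c == codes[0] for c in codes) else None
        for codes in zip(*model_preds)
    ]


# Hypothetical example data: labels are illustrative, not actual BLS codes.
lr_preds = ["fall", "struck", "fall", "overexertion", "fall",
            "struck", "fall", "fall", "struck", "fall"]
lr_strengths = [0.95, 0.40, 0.88, 0.35, 0.99, 0.70, 0.55, 0.91, 0.60, 0.30]
by_strength = filter_by_strength(lr_preds, lr_strengths, review_fraction=0.30)

nb_sw = ["fall", "struck", "fall"]
nb_bi = ["fall", "struck", "overexertion"]
svm = ["fall", "fall", "overexertion"]
by_agreement = filter_by_agreement(nb_sw, nb_bi, svm)
```

In this toy data, the strength filter sends the three weakest of ten predictions to manual review, while the agreement filter keeps only the first narrative, on which all three classifiers assign the same code. In practice the review fraction is a tuning choice that trades manual workload against the accuracy of the machine-assigned codes.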

Keywords: Cause of injury; Injury; Injury surveillance; Machine learning; Narrative text.

MeSH terms

  • Accidents, Occupational / statistics & numerical data*
  • Algorithms*
  • Bayes Theorem
  • Clinical Coding / methods
  • Databases, Factual / statistics & numerical data*
  • Humans
  • Logistic Models
  • Machine Learning
  • Models, Theoretical
  • Narration
  • Public Health Surveillance / methods*
  • Workers' Compensation / statistics & numerical data
  • Wounds and Injuries / epidemiology*