PMC Article Datasets
Interested in automated retrieval of PubMed Central articles in machine-readable formats? PubMed Central and the NCBI Bookshelf offer several large datasets of journal articles and other scientific publications. These are made available for retrieval under license terms (e.g., Creative Commons licenses) that generally allow for more liberal redistribution and reuse than a traditional copyrighted work.
A few items to keep in mind before you begin:
- Not all articles in PMC are available for text mining and other reuse. The AWS Registry of Open Data (RODA), the PMC OAI service, the FTP service, E-Utilities, and the BioC API are the only services that may be used for automated retrieval of PMC content. Systematic (bulk) retrieval of articles through any other automated process is prohibited.
- License terms vary. Please refer to the license statement in each article for specific terms of use.
- Users are directly and solely responsible for compliance with copyright restrictions (see the Restrictions on Systematic Downloading of Articles section of the PMC Copyright page for details).
About the Datasets
Dataset | Content | License Terms
---|---|---
PMC Open Access Subset | The PMC Open Access Subset (or PMC OA Subset) contains millions of full-text open access article files made available under Creative Commons or similar license terms or with publisher permission. This dataset includes retractions, corrections, and expressions of concern*. This subset includes articles collected under the Public Health Emergency COVID-19 Initiative with Creative Commons licenses as well as those with timebound license statements that allow for secondary analysis and reuse for the duration of the global pandemic. | Broken down by license type. Commercial use allowed: CC0, CC BY, CC BY-SA, CC BY-ND. Non-commercial use only: CC BY-NC, CC BY-NC-SA, CC BY-NC-ND. Other: no machine-readable Creative Commons license, no license tagged, or a custom license.
Author Manuscript Dataset | Full-text files of hundreds of thousands of accepted author manuscripts (AAMs) that have been made available in PMC under a partner funder's policy. This dataset includes retractions, corrections, and expressions of concern*. | Default license: "This file is available for text mining. It may also be used consistent with the principles of fair use under the copyright law." AAMs that include a Creative Commons license are also available via the Open Access Subset.
Historical OCR Dataset | Full-text files of OCR'd text from articles published in the 18th, 19th, and 20th centuries added to PMC as part of an NLM Digitization Project. | Files are generally made available for text mining. Articles added more recently may also include a Creative Commons license and therefore will also be available via the Open Access Subset.
LitArch Open Access Subset | The LitArch Open Access Subset contains the full text of thousands of books and documents in the NLM Literature Archive. | Creative Commons or similar license
* Retractions, corrections, and expressions of concern can be identified in the downloadable XML files by looking for the attribute article-type="retraction", "correction", or "expression-of-concern" in the <article> element. In plain text files, look for Retraction, Correction, or Expression of Concern in the Front section. Retractions, corrections, and expressions of concern can also be found using search filters with values of retraction[filter], correction[filter], or expression of concern[filter], respectively.
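
The two checks described in the footnote can be sketched in Python. This is a minimal illustration, not an official NCBI tool: the function names `notice_type` and `notice_search_url` are hypothetical, the sketch assumes the downloaded XML file has `<article>` as its root element, and it assumes the filters are queried through the E-Utilities ESearch endpoint against the `pmc` database. The second function only builds the URL and sends no request.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# article-type values named in the footnote above
NOTICE_TYPES = {"retraction", "correction", "expression-of-concern"}

# E-Utilities ESearch endpoint (assumed here to be the service used for the filters)
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def notice_type(xml_text):
    """Return the article-type if the root <article> marks a notice, else None."""
    root = ET.fromstring(xml_text)
    # Drop a namespace prefix such as "{uri}article" if one is present
    tag = root.tag.rsplit("}", 1)[-1]
    if tag != "article":
        return None
    value = root.get("article-type")
    return value if value in NOTICE_TYPES else None


def notice_search_url(filter_name, retmax=20):
    """Build (but do not send) an ESearch URL for one of the search filters."""
    params = {
        "db": "pmc",
        "term": f"{filter_name}[filter]",
        "retmax": retmax,
        "retmode": "json",
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


# Hypothetical, heavily trimmed article file used only for illustration
sample = '<article article-type="retraction"><front/></article>'
print(notice_type(sample))
print(notice_search_url("expression of concern"))
```

Keeping the URL construction separate from any network call makes it easier to review the query and to add rate limiting before requests are sent through one of the permitted services.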