Accessing PMC Article Datasets Using Amazon Web Services

As part of our Cloud Service, PMC makes the datasets described below available on Amazon Web Services (AWS) free of charge. You can retrieve them through either an HTTPS or S3 URL, and no login is required (see AWS Command Line Interface (CLI) below). The National Library of Medicine works with the AWS Open Data Sponsorship Program to provide this access. Read on to learn why and how you may access these datasets from our AWS Cloud Service.

Description and Location of PMC Article Datasets on AWS

Resource type: S3 Bucket, world-readable

Amazon Resource Name (ARN): arn:aws:s3:::pmc-oa-opendata

AWS Region: us-east-1

AWS CLI (Command Line Interface) Access (No AWS account required if you use --no-sign-request)

aws s3 ls s3://pmc-oa-opendata --no-sign-request
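
The same objects can also be fetched over HTTPS. The following is a minimal sketch that assumes the standard S3 virtual-hosted-style endpoint (https://[bucket].s3.amazonaws.com/[key]); this page does not spell out the HTTPS URL form, and the helper name pmc_https_url is hypothetical.

```shell
# Hypothetical helper: build an HTTPS URL for an object key in the PMC bucket.
# Assumption: the standard S3 virtual-hosted-style endpoint applies here.
PMC_BUCKET="pmc-oa-opendata"

pmc_https_url() {
    # $1 is an object key such as oa_comm/xml/all/PMC1043859.xml
    printf 'https://%s.s3.amazonaws.com/%s\n' "$PMC_BUCKET" "$1"
}

# Print the URL for the sample article used elsewhere on this page.
pmc_https_url "oa_comm/xml/all/PMC1043859.xml"

# Example download (needs network access and curl):
#   curl -O "$(pmc_https_url oa_comm/xml/all/PMC1043859.xml)"
```

Keeping the bucket name in one variable means the same helper can serve any key taken from the file lists described below.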

PMC Open Access (OA) Subset on AWS

The PMC Open Access Subset in PMC's S3 bucket on AWS is divided into three top-level directories: oa_comm, oa_noncomm, and phe_timebound. For commercial use, you are limited to articles in the oa_comm directory, which contains articles licensed under CC BY and CC0, and to articles in the phe_timebound directory, all of which carry a standard PHE COVID-19 Initiative timebound license statement. For non-commercial use, you may access articles in all three directories: oa_comm, phe_timebound, and oa_noncomm, which contains articles under all Creative Commons license types other than CC BY and CC0.
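
The directory rule above can be captured in a small lookup. This is only a sketch: the helper name pmc_dirs_for_use and its two argument values are hypothetical, and the license statement in each article remains the authority on terms of use.

```shell
# Hypothetical helper: list the top-level directories permitted for a given
# kind of use, following the commercial / non-commercial rule described above.
pmc_dirs_for_use() {
    case "$1" in
        commercial)
            echo "oa_comm phe_timebound" ;;
        noncommercial)
            echo "oa_comm oa_noncomm phe_timebound" ;;
        *)
            echo "usage: pmc_dirs_for_use commercial|noncommercial" >&2
            return 1 ;;
    esac
}

# Print the S3 prefixes that a commercial user may read from.
for dir in $(pmc_dirs_for_use commercial); do
    echo "s3://pmc-oa-opendata/$dir/"
done
```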

The license terms on articles are not all identical. Please refer to the license statement in each article for specific terms of use. The oa_comm/, oa_noncomm/, and phe_timebound/ directories follow similar structures:

|_ txt/
   |_ all/
      individual plain text files for each article, named
      PMC[accession_id].txt, e.g. PMC1043859.txt
   |_ metadata/
      |_ csv/
         [oa_comm or oa_noncomm or phe_timebound].filelist.csv
      |_ txt/
         [oa_comm or oa_noncomm or phe_timebound].filelist.txt
|_ xml/
   |_ all/
      individual XML files for each article, named
      PMC[accession_id].xml, e.g. PMC1043859.xml
   |_ metadata/
      |_ csv/
         [oa_comm or oa_noncomm or phe_timebound].filelist.csv
      |_ txt/
         [oa_comm or oa_noncomm or phe_timebound].filelist.txt

Note that on AWS, we limit distribution of open access articles to those that have a machine-readable Creative Commons license. Articles that publishers have identified as open access but that lack a machine-readable Creative Commons license tag are available via the PMC FTP Service.

File lists are updated daily. Each contains a row per article with a number of metadata fields. Below is a sample header and sample row for the CSV formatted file list. The plain text file list uses tabs to separate the fields.

Key,ETag,Article Citation,AccessionID,Last Updated UTC (YYYY-MM-DD HH:MM:SS),PMID,License,Retracted
oa_comm/xml/all/PMC1043859.xml,801ba4a4c2d48ad98149e4e481a55b06,PLoS Biol. 2005 Apr 22; 3(4):e60,PMC1043859,2021-06-17 18:35:10,15736975,CC BY,no
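
A file list can be filtered with standard tools before anything is downloaded. The sketch below reads the tab-separated list rather than the CSV, because citation fields in the CSV may contain commas. It assumes that the tab-separated list has a header row and the same column order as the CSV header above; the helper name pmc_keys_by_license is hypothetical.

```shell
# Hypothetical helper: print the keys of non-retracted articles that carry a
# given license, reading a tab-separated file list on standard input.
# Assumed columns (same order as the CSV header): 1=Key, 7=License, 8=Retracted.
# Assumed: the first line is a header row, so it is skipped.
pmc_keys_by_license() {
    awk -F '\t' -v lic="$1" 'NR > 1 && $7 == lic && $8 == "no" { print $1 }'
}

# Demonstration using the header and the sample row shown above.
{
    printf 'Key\tETag\tArticle Citation\tAccessionID\tLast Updated UTC (YYYY-MM-DD HH:MM:SS)\tPMID\tLicense\tRetracted\n'
    printf 'oa_comm/xml/all/PMC1043859.xml\t801ba4a4c2d48ad98149e4e481a55b06\tPLoS Biol. 2005 Apr 22; 3(4):e60\tPMC1043859\t2021-06-17 18:35:10\t15736975\tCC BY\tno\n'
} | pmc_keys_by_license "CC BY"
```

The printed keys can be passed to aws s3 cp, or turned into HTTPS URLs, to retrieve only the selected articles.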

PMC Author Manuscript Dataset on AWS

The PMC Author Manuscript Dataset in PMC's S3 bucket on AWS is found in the author_manuscript directory. Articles in this directory are accepted author manuscripts that have been collected under a funder policy in PMC. They are available in XML and plain text for text mining purposes.

The author_manuscript/ directory is organized as follows:

|_ txt/
   |_ all/
      individual plain text files for each author manuscript,
      named PMC[accession_id].txt, e.g. PMC1249490.txt
   |_ metadata/
      |_ csv/
         author_manuscript.filelist.csv
      |_ txt/
         author_manuscript.filelist.txt
|_ xml/
   |_ all/
      individual XML files for each author manuscript,
      named PMC[accession_id].xml, e.g. PMC1043859.xml
   |_ metadata/
      |_ csv/
         author_manuscript.filelist.csv
      |_ txt/
         author_manuscript.filelist.txt

File lists are updated daily. Each contains a row per manuscript with a number of metadata fields. Below is a sample header and sample row for the CSV formatted file list. The plain text file list uses tabs to separate the fields.

Key,ETag,AccessionID,Last Updated UTC (YYYY-MM-DD HH:MM:SS),PMID,MID
author_manuscript/xml/all/PMC8218989.xml,c9090970ef2d0ab762ef473a18eac2ef,PMC8218989,2021-06-24 07:31:23,32914184,NIHMS1703867
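
The file list also works as an index from PMID to object key. The following sketch assumes that no field in the author manuscript CSV contains an embedded comma (true of the sample row above, but not verified for the whole file) and that the first line is a header; the helper name pmc_am_key_for_pmid is hypothetical.

```shell
# Hypothetical helper: look up the object key for a PMID in the author
# manuscript CSV file list, read on standard input.
# Columns, per the header above: 1=Key, 2=ETag, 3=AccessionID,
# 4=Last Updated UTC, 5=PMID, 6=MID.
# Assumption: no field contains an embedded comma.
pmc_am_key_for_pmid() {
    awk -F ',' -v pmid="$1" 'NR > 1 && $5 == pmid { print $1 }'
}

# Demonstration using the header and the sample row shown above.
printf '%s\n' \
    'Key,ETag,AccessionID,Last Updated UTC (YYYY-MM-DD HH:MM:SS),PMID,MID' \
    'author_manuscript/xml/all/PMC8218989.xml,c9090970ef2d0ab762ef473a18eac2ef,PMC8218989,2021-06-24 07:31:23,32914184,NIHMS1703867' |
    pmc_am_key_for_pmid 32914184
```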

Retrieval from AWS

Retrieving files from PMC's S3 bucket on AWS does not require an AWS account. In addition, there are no transfer fees to users for downloading or transferring files, because these costs are covered through PMC's participation in the AWS Open Data Sponsorship Program. There are several methods available to retrieve files as described in the Downloading an object documentation from AWS.

AWS Command Line Interface (CLI)

First, install the AWS Command Line Interface (CLI) by following the AWS installation instructions.

Because the PMC S3 bucket is world-readable, you do not need an AWS account to read or download these files. If you choose to access the data anonymously, however, you will need to add the --no-sign-request option to each of the examples below. If you wish to copy these data into your own S3 bucket or use AWS services such as Amazon Elastic Compute Cloud (EC2) or Amazon Athena on these data, you will need an AWS account and you will need to supply your AWS credentials.
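
For anonymous access, a small wrapper can append the option automatically so that it is not forgotten on individual commands. This is a sketch only; the function name pmc_aws is hypothetical.

```shell
# Hypothetical wrapper: run any aws command anonymously by appending
# --no-sign-request to the arguments it is given.
pmc_aws() {
    aws "$@" --no-sign-request
}

# Examples (need the AWS CLI and network access):
#   pmc_aws s3 ls s3://pmc-oa-opendata/
#   pmc_aws s3 cp s3://pmc-oa-opendata/oa_comm/xml/all/PMC1043859.xml .
```

If you do have AWS credentials configured and want requests to be signed, call aws directly instead of the wrapper.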

The following examples take advantage of the bucket-, prefix-, and object-level s3 commands. Read more about s3 commands.

Using AWS CLI to access and retrieve objects: Examples

There are several methods available to download files as described in the AWS Downloading an object documentation.

Download everything in a directory using sync

Let's say you want to download everything under the prefix oa_comm/xml/all/. In this example, we've already created a local directory called pmc-test to copy these objects into. aws s3 sync copies everything in a source bucket into the directory you designate. Note that sync does not have a --prefix option as list-objects-v2 does. However, since every object's key includes its prefixes, you can use --exclude and --include filters to choose which prefixes to sync.

Note that filters accept several pattern types and are applied in the order given, with later filters taking precedence over earlier ones. This is why --exclude "*" comes before --include. Read more about include and exclude filters.

aws s3 sync s3://pmc-oa-opendata ./pmc-test/ --exclude "*" --include "oa_comm/xml/all/*"
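
As an alternative to filters, the prefix can be given as part of the source URL, and the --dryrun option of aws s3 sync lists the operations it would perform without copying anything. The sketch below wraps both ideas and uses anonymous access; the helper name pmc_sync_prefix is hypothetical.

```shell
# Hypothetical helper: sync one prefix of the PMC bucket into a local
# directory using anonymous access. Any extra options (for example --dryrun)
# can be passed after the two required arguments.
pmc_sync_prefix() {
    prefix=$1    # e.g. oa_comm/xml/all/
    dest=$2      # e.g. ./pmc-test/
    shift 2
    aws s3 sync "s3://pmc-oa-opendata/${prefix}" "$dest" --no-sign-request "$@"
}

# Examples (need the AWS CLI and network access):
#   pmc_sync_prefix oa_comm/xml/all/ ./pmc-test/ --dryrun   # preview only
#   pmc_sync_prefix oa_comm/xml/all/ ./pmc-test/            # actual download
```

Running the preview first is a cheap way to confirm that the prefix matches what you expect, since this prefix holds a very large number of objects.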

Download new or updated files in a directory using sync

A common use case is that you will want to only download new or updated data. Per sync documentation, "a s3 object will require downloading if the size of the s3 object differs from the size of the local file, the last modified time of the s3 object is newer than the last modified time of the local file, or the s3 object does not exist in the local directory". So after you've used sync once to get everything, you can continue to use it whenever you want to retrieve only new or updated files.

Read the official aws s3 sync documentation.

Download a subset using cp

If you only want a subset of data to work with and don't want to keep the entirety of a bucket in your own storage, you can also use aws s3 cp. By default, cp copies a single object, so to copy everything under a bucket or prefix you'll want to add the --recursive option.

Copy all files

aws s3 cp s3://pmc-oa-opendata ./pmc-test/ --recursive

Copy files within a certain prefix

This example also downloads data, but it adds --exclude and --include filters to limit cp to files under a certain prefix.

aws s3 cp s3://pmc-oa-opendata ./pmc-test/ --exclude "*" --include "oa_comm/xml/all/*" --recursive

Explore the official aws s3 cp documentation.
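
cp has no option for selecting objects by timestamp. One way to fetch only recently updated articles is to select keys from a file list by its Last Updated column and copy each one. The sketch below assumes that the tab-separated file list has a header row and the same column order as the CSV header shown earlier (Last Updated in column 5 for the Open Access lists); the helper name pmc_keys_updated_after is hypothetical.

```shell
# Hypothetical helper: print keys whose "Last Updated" value is later than a
# cutoff, reading a tab-separated file list on standard input.
# Assumed columns (Open Access lists): 1=Key, 5=Last Updated UTC.
# Timestamps in YYYY-MM-DD HH:MM:SS form order correctly as plain strings;
# appending "" forces awk to compare them as strings.
pmc_keys_updated_after() {
    awk -F '\t' -v cutoff="$1" 'NR > 1 && ($5 "") > (cutoff "") { print $1 }'
}

# Demonstration using the header and the sample row shown earlier.
{
    printf 'Key\tETag\tArticle Citation\tAccessionID\tLast Updated UTC (YYYY-MM-DD HH:MM:SS)\tPMID\tLicense\tRetracted\n'
    printf 'oa_comm/xml/all/PMC1043859.xml\t801ba4a4c2d48ad98149e4e481a55b06\tPLoS Biol. 2005 Apr 22; 3(4):e60\tPMC1043859\t2021-06-17 18:35:10\t15736975\tCC BY\tno\n'
} | pmc_keys_updated_after "2021-01-01 00:00:00"

# Example with the real file list (needs the AWS CLI and network access):
#   aws s3 cp s3://pmc-oa-opendata/oa_comm/xml/metadata/txt/oa_comm.filelist.txt . --no-sign-request
#   pmc_keys_updated_after "2021-06-01 00:00:00" < oa_comm.filelist.txt |
#       while IFS= read -r key; do
#           aws s3 cp "s3://pmc-oa-opendata/$key" "./pmc-test/$key" --no-sign-request
#       done
```

For the author manuscript list, the Last Updated value is in column 4 according to its header, so the column number would need to change.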

Engage

NCBI wants your feedback on accessing PMC Article Datasets using AWS. Contact pubmedcentral@ncbi.nlm.nih.gov with feedback and questions.

Last modified: Fri June 17 2022