Consensus coding sequence (CCDS) database: a standardized set of human and mouse protein-coding regions supported by expert curation

Shashikant Pujar; Nuala A O'Leary; Catherine M Farrell; Jane E Loveland; Jonathan M Mudge; Craig Wallin; Carlos G Girón; Mark Diekhans; If Barnes; Ruth Bennett; Andrew E Berry; Eric Cox; Claire Davidson; Tamara Goldfarb; Jose M Gonzalez; Toby Hunt; John Jackson; Vinita Joardar; Mike P Kay; Vamsi K Kodali; Fergal J Martin; Monica McAndrews; Kelly M McGarvey; Michael Murphy; Bhanu Rajput; Sanjida H Rangwala; Lillian D Riddick; Ruth L Seal; Marie-Marthe Suner; David Webb; Sophia Zhu; Bronwen L Aken; Elspeth A Bruford; Carol J Bult; Adam Frankish; Terence Murphy; Kim D Pruitt

doi:10.1093/nar/gkx1031

Nucleic Acids Res. 2018 Jan 4;46(D1):D221-D228. doi: 10.1093/nar/gkx1031.

Authors

Shashikant Pujar¹, Nuala A O'Leary¹, Catherine M Farrell¹, Jane E Loveland², Jonathan M Mudge², Craig Wallin¹, Carlos G Girón², Mark Diekhans³, If Barnes², Ruth Bennett², Andrew E Berry², Eric Cox¹, Claire Davidson², Tamara Goldfarb¹, Jose M Gonzalez², Toby Hunt², John Jackson¹, Vinita Joardar¹, Mike P Kay², Vamsi K Kodali¹, Fergal J Martin², Monica McAndrews⁴, Kelly M McGarvey¹, Michael Murphy¹, Bhanu Rajput¹, Sanjida H Rangwala¹, Lillian D Riddick¹, Ruth L Seal⁵, Marie-Marthe Suner², David Webb¹, Sophia Zhu⁴, Bronwen L Aken², Elspeth A Bruford⁵, Carol J Bult⁴, Adam Frankish², Terence Murphy¹, Kim D Pruitt¹

Affiliations

¹ National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA.
² European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK.
³ University of California Santa Cruz Genomics Institute, Santa Cruz, CA 95064, USA.
⁴ Mouse Genome Informatics, The Jackson Laboratory, Bar Harbor, ME 04609, USA.
⁵ HUGO Gene Nomenclature Committee, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK.

Abstract

The Consensus Coding Sequence (CCDS) project provides a dataset of protein-coding regions that are identically annotated on the human and mouse reference genome assemblies in genome annotations produced independently by NCBI and the Ensembl group at EMBL-EBI. This dataset is the product of an international collaboration that includes NCBI, Ensembl, HUGO Gene Nomenclature Committee, Mouse Genome Informatics and University of California, Santa Cruz. Identically annotated coding regions, which are generated using an automated pipeline and pass multiple quality assurance checks, are assigned a stable and tracked identifier (CCDS ID). Additionally, coordinated manual review by expert curators from the CCDS collaboration helps maintain the integrity and high quality of the dataset. The CCDS data are available through an interactive web page (https://www.ncbi.nlm.nih.gov/CCDS/CcdsBrowse.cgi) and an FTP site (ftp://ftp.ncbi.nlm.nih.gov/pub/CCDS/). In this paper, we outline the ongoing work, growth and stability of the CCDS dataset and provide updates on new collaboration members and new features added to the CCDS user interface. We also present expert curation scenarios, with specific examples highlighting the importance of an accurate reference genome assembly and the crucial role played by input from the research community.
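
The idea of an "identically annotated" coding region can be illustrated with a small sketch. This is not the CCDS pipeline itself, which the abstract does not describe in detail; it only assumes that each annotation set can be reduced to a mapping from transcript ID to a coding region given as chromosome, strand and CDS exon intervals. The function names, the data layout and the transcript IDs below are all hypothetical.

```python
from typing import Dict, List, Tuple

# Assumed model: a coding region is (chromosome, strand, sorted CDS exon intervals).
Interval = Tuple[int, int]
CodingRegion = Tuple[str, str, Tuple[Interval, ...]]


def normalize(chrom: str, strand: str, exons: List[Interval]) -> CodingRegion:
    """Build a hashable key that does not depend on the order exons were listed in."""
    return (chrom, strand, tuple(sorted(exons)))


def group_by_region(annotation: Dict[str, CodingRegion]) -> Dict[CodingRegion, List[str]]:
    """Collect the transcript IDs that share each coding region."""
    grouped: Dict[CodingRegion, List[str]] = {}
    for transcript_id, region in annotation.items():
        grouped.setdefault(region, []).append(transcript_id)
    return grouped


def consensus_regions(
    set_a: Dict[str, CodingRegion],
    set_b: Dict[str, CodingRegion],
) -> Dict[CodingRegion, Tuple[List[str], List[str]]]:
    """Return coding regions found in both sets, with the supporting IDs from each."""
    grouped_a = group_by_region(set_a)
    grouped_b = group_by_region(set_b)
    matching = grouped_a.keys() & grouped_b.keys()
    return {
        region: (sorted(grouped_a[region]), sorted(grouped_b[region]))
        for region in matching
    }


# Toy data with made-up transcript IDs and coordinates.
toy_a = {
    "a_tx1": normalize("chr1", "+", [(100, 200), (300, 400)]),
    "a_tx2": normalize("chr1", "+", [(100, 200), (300, 450)]),
}
toy_b = {
    "b_tx1": normalize("chr1", "+", [(300, 400), (100, 200)]),
    "b_tx2": normalize("chr1", "+", [(100, 200), (300, 400)]),
    "b_tx3": normalize("chr2", "-", [(50, 90)]),
}

shared = consensus_regions(toy_a, toy_b)
for region, (ids_a, ids_b) in shared.items():
    print(region, ids_a, ids_b)
```

In this sketch only exact coordinate matches count as consensus; a region that differs by a single exon boundary, such as `a_tx2` above, is excluded. The quality assurance checks and manual review mentioned in the abstract are outside its scope.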

Published by Oxford University Press on behalf of Nucleic Acids Research 2017.

Publication types

Research Support, N.I.H., Extramural
Research Support, N.I.H., Intramural
Research Support, Non-U.S. Gov't

MeSH terms

Animals
Consensus Sequence*
Data Curation / methods
Data Curation / standards
Databases, Genetic* / standards
Guidelines as Topic
Humans
Mice
Molecular Sequence Annotation
National Library of Medicine (U.S.)
Open Reading Frames*
United States
User-Computer Interface
