GEO Platform content guidelines

Introduction

A GEO Platform table is a tab-delimited table containing the array definition. Platforms in GEO are submitted by the scientific community and represent a very diverse range of technologies, molecule types, and annotation conventions. In order that the community may properly evaluate data, GEO submitters are required to provide meaningful, trackable, sequence identifier information for each feature on the array using one or more of the standard platform headers described below.

General points for submitters:

It may not be necessary to submit a Platform record if your experiments are performed using commercial arrays (e.g., Affymetrix GeneChips). Official versions of many commercial array templates have already been deposited with GEO. To locate a commercial array, use the FIND PLATFORM tool and reference the appropriate Platform accession number (GPLxxx) when prompted during the submission process. If you use a commercial array but cannot locate its template in GEO, please proceed with Platform submission. If we can verify the content of the commercial Platform you submit, the contact information presented on that record will be changed from yours to that of the vendor so that other users may easily locate the Platform and submit Sample data corresponding to it.
The Platform content guidelines on this page apply regardless of the submission route you use.
The Platform data table should only contain information that pertains to the content and design of the array. No expression measurement or hybridization intensity data should be included in the Platform table.
Each row of the Platform table must be represented by its own unique identifier (ID). Keep in mind that the ID column you provide in your Platform table corresponds to the ID_REF column you provide in accompanying Sample data tables - there should be a 1:1 correspondence. Sample data tables should contain normalized data. This means, for example, if your normalization strategy requires taking the average of replicate array features, or removing control spots, your Platform table should reflect the condensed template. In this case, please e-mail or FTP the full array design file to us and we will attach it to your Platform record as a supplementary file - this ensures that your submission remains in compliance with MIAME standards.
The Platform table must include meaningful, trackable sequence identifiers (e.g., GenBank/RefSeq accessions, locus tags, clone IDs, oligo sequences, or chromosome locations; see the table below for the full list). This information enables other users to comprehensively interpret your data in compliance with MIAME standards, and allows GEO to retrieve up-to-date annotation for your Platform when it is incorporated into our downstream data query tools. References to in-house databases or top BLAST hits are not sufficient.
The principal reason many journals require deposit of microarray data to a public repository is so that the scientific community has the ability to comprehensively evaluate or reanalyze the entire dataset. While we understand the various reasons and difficulties some researchers have with sharing data and array designs, the demand from users and journal editors together with our need to maintain a useful and transparent database has led to our policy of only accepting well-annotated datasets. If you have any questions or concerns regarding this issue, please e-mail us.
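
The ID uniqueness and ID/ID_REF correspondence described in the points above can be checked before submission. The following is a minimal sketch, not a GEO tool: the function names, the VALUE column, and the feature IDs f1, f2 and f3 are assumptions made for illustration, and both tables are assumed to be plain tab-delimited text with a single header line.

```python
import csv
import io
from collections import Counter


def read_column(table_text, column):
    """Return the values of one named column from a tab-delimited table."""
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    return [row[column] for row in reader]


def check_ids(platform_text, sample_text):
    """Report duplicate Platform IDs and any ID / ID_REF mismatches."""
    platform_ids = read_column(platform_text, "ID")
    sample_refs = read_column(sample_text, "ID_REF")
    duplicates = sorted(i for i, n in Counter(platform_ids).items() if n > 1)
    return {
        "duplicate_ids": duplicates,
        "missing_from_sample": sorted(set(platform_ids) - set(sample_refs)),
        "missing_from_platform": sorted(set(sample_refs) - set(platform_ids)),
    }


# Hypothetical tables: the Sample omits f2 and refers to an unknown f3.
platform = "ID\tGB_ACC\nf1\tNM_022975.1\nf2\tNM_022975.1\n"
sample = "ID_REF\tVALUE\nf1\t0.5\nf3\t1.2\n"
report = check_ids(platform, sample)
```

An empty list under every key of the report would indicate that the two tables satisfy the 1:1 correspondence. Here the report flags f2 as missing from the Sample table and f3 as missing from the Platform table.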

Standard Platform Headers

The first row in the Platform table must be a header line that identifies the content of each column. Column headers may be standard or non-standard. It is expected that at least one standard column (other than ID) will be supplied with each Platform submission.

In addition to these standard columns, your data table may include any number of non-standard columns. Examples of non-standard columns include array coordinate information, gene symbol or description, gene ontology terms, quality indicators, etc. Columns may appear in any order after the ID column. In this way, GEO is a flexible and open system, allowing you to provide all information necessary to thoroughly annotate your array.
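
The header-line rules above (ID first, ID used only once, at least one standard column besides ID) can be expressed as a simple check. This is a sketch under stated assumptions rather than GEO's own validation: the function name check_header is hypothetical, the set of names is copied from the standard headers listed in the table on this page, and exact, case-sensitive matching of header names is assumed.

```python
# Standard headers other than ID, as listed in the table on this page.
STANDARD_HEADERS = {
    "SEQUENCE", "GB_ACC", "GB_LIST", "GB_RANGE", "RANGE_GB", "RANGE_START",
    "RANGE_END", "RANGE_STRAND", "GI", "GI_LIST", "GI_RANGE", "CLONE_ID",
    "CLONE_ID_LIST", "ORF", "ORF_LIST", "GENOME_ACC", "SNP_ID", "SNP_ID_LIST",
    "miRNA_ID", "miRNA_ID_LIST", "SPOT_ID", "ORGANISM", "PT_ACC", "PT_LIST",
    "PT_GI", "PT_GI_LIST", "SP_ACC", "SP_LIST",
}


def check_header(header_line):
    """Return a list of problems found in a Platform table header line."""
    columns = header_line.rstrip("\r\n").split("\t")
    problems = []
    if columns[0] != "ID":
        problems.append("ID must be the first column")
    if columns.count("ID") > 1:
        problems.append("ID may be used only once")
    if not STANDARD_HEADERS.intersection(columns):
        problems.append("no standard column other than ID")
    return problems


# "Gene Symbol" stands in for any non-standard column (hypothetical name).
ok = check_header("ID\tGB_ACC\tGene Symbol\n")
bad = check_header("ID\tGene Symbol\n")
```

In this sketch non-standard columns are simply ignored, which mirrors the statement that any number of them may be included.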

Standard column headers and their content are as follows:

HEADER	CONTENT
ID	(Required) An identifier that unambiguously identifies each row on your Platform table. Each ID within a Platform table must be unique. This column heading should appear first and may be used only once in the data table. Keep in mind that the ID column you provide in your Platform data table corresponds with the ID_REF column you provide in accompanying Sample data tables. Sample data tables should contain normalized data. If your normalization strategy requires taking the average of replicate array features, your Platform should reflect the condensed template. In this case, please e-mail or FTP the full template file to us and we will attach it to your Platform record as a supplementary file.
SEQUENCE	The nucleotide sequence of each oligo, clone or PCR product.
GB_ACC	GenBank accession - identifies a biological sequence through the GenBank sequence accession number assigned to the sequence, or the representative GenBank or RefSeq accession number upon which your sequence was designed. It is recommended that you include the version number of the accessions upon which your sequences were designed (e.g., NM_022975.1 rather than NM_022975). This is particularly important for RefSeq accessions which are updated frequently. GenBank accessions representing the top BLAST hits for your sequences are not acceptable. Also, chromosome, genome and contig accession numbers are generally not acceptable as they are not specific enough to accurately identify the portion of the sequence printed on arrays (use GB_RANGE instead).
GB_LIST	GenBank accession list - as for GB_ACC, but allows more than one GenBank accession number to be presented. For example, your sequences may have GenBank accession numbers representing both the 5' and 3' ends of your clones. Multiple accession numbers should be separated using commas or spaces. Alternatively, more than one GB_ACC column may be supplied.
GB_RANGE	GenBank accession range - specifies a particular sequence position within a GenBank accession number. Use format ACCESSION.VERSION[start..end]. Useful for tiling arrays.
RANGE_GB	Use format ACCESSION.VERSION. Should be used in conjunction with RANGE_START and RANGE_END. Useful for tiling arrays.
RANGE_START	Use in conjunction with RANGE_GB. Indicates the start position (relative to the RANGE_GB accession). Useful for tiling arrays.
RANGE_END	Use in conjunction with RANGE_GB. Indicates the end position (relative to the RANGE_GB accession). Useful for tiling arrays.
RANGE_STRAND	Use in conjunction with RANGE_GB. Indicates the strand represented. Use + or - or empty. Useful for tiling arrays.
GI	GenBank identifier - as for GB_ACC, but specify the GenBank identifier number rather than the GenBank accession number.
GI_LIST	GenBank identifier list - as for GI, but allows more than one GenBank identifier to be presented. Multiple GIs should be separated using commas or spaces. Alternatively, more than one GI column may be supplied.
GI_RANGE	GenBank identifier range - specifies a particular sequence position on a GenBank identifier number. Use format GI[start..end].
CLONE_ID	Clone identifier - identifies a biological sequence through a standard clone identifier. Only CLONE_IDs that can be used to identify the sequence through an NCBI or other public-database query should be provided in this column. Examples include FlyBase IDs, RIKEN clone IDs and IMAGE clone numbers.
CLONE_ID_LIST	CLONE_ID list - as for CLONE_ID, but allows more than one clone identifier to be presented. Multiple Clone IDs should be separated using commas or spaces. Alternatively, more than one CLONE_ID column may be supplied.
ORF	Open reading frame designator - identifies a biological sequence through an experimentally or computationally derived open reading frame identifier. The ORF designator is intended to represent a known or predicted DNA coding region or locus_tag identified in NCBI's Entrez Genomes division. It may be appropriate to include a GENOME_ACC column to reference the GenBank accession from which the ORF names are derived.
ORF_LIST	ORF list - as for ORF, but allows more than one open reading frame designator to be presented. Multiple ORFs should be separated using commas or spaces. Alternatively, more than one ORF column may be supplied.
GENOME_ACC	Genome accession number - specifies the GenBank or RefSeq genome accession number from which ORF identifiers are derived. It is important to include the version number of the genome accession upon which your sequences were generated (e.g., NC_004721.1 rather than NC_004721) because updates to the genome sequence may render your ORF designations incorrect.
SNP_ID	SNP identifier - typically specifies a dbSNP refSNP ID with format rsXXXXXXXX.
SNP_ID_LIST	SNP identifier list - as for SNP_ID, but allows more than one SNP_ID to be presented. Multiple SNP_IDs should be separated using commas or spaces. Alternatively, more than one SNP_ID column may be supplied.
miRNA_ID	microRNA identifier - typically has a format such as hsa-let-7a or MIRNLET7A2.
miRNA_ID_LIST	microRNA identifier list - as for miRNA_ID, but allows more than one miRNA_ID to be presented. Multiple miRNA_IDs should be separated using commas or spaces. Alternatively, more than one miRNA_ID column may be supplied.
SPOT_ID	Alternative spot identifier - use only when no identifier or sequence tracking information is available. This column is useful for designating control and empty features.
ORGANISM	The organism source of each feature on your array. This is most useful when your array contains sequences derived from multiple organisms.
PT_ACC	Protein accession - identifies any GenBank or RefSeq protein accession number. Protein accession numbers should only be supplied for protein arrays. Nucleotide accession numbers should be supplied for nucleotide arrays.
PT_LIST	Protein accession list - as for PT_ACC, but allows more than one protein accession number to be presented. Multiple accession numbers should be separated using commas or spaces. Alternatively, more than one PT_ACC column may be supplied. Protein accession numbers should only be supplied for protein arrays. Nucleotide accession numbers should be supplied for nucleotide arrays.
PT_GI	Protein GenBank or RefSeq identifier. Protein identifiers should only be supplied for protein arrays or proteomic mass spectrometry Platforms. Nucleotide identifiers should be supplied for nucleotide arrays.
PT_GI_LIST	Protein identifier list - as for PT_GI, but allows more than one protein identifier to be presented. Multiple identifiers should be separated using commas or spaces. Alternatively, more than one PT_GI column may be supplied. Protein identifiers should only be supplied for protein arrays. Nucleotide identifiers should be supplied for nucleotide arrays.
SP_ACC	SwissProt accession. SwissProt accession numbers should only be supplied for protein arrays. Nucleotide accession numbers should be supplied for nucleotide arrays.
SP_LIST	SwissProt accession list - as for SP_ACC, but allows more than one SwissProt accession number to be presented. Multiple accession numbers should be separated using commas or spaces. Alternatively, more than one SP_ACC column may be supplied. SwissProt accession numbers should only be supplied for protein arrays. Nucleotide accession numbers should be supplied for nucleotide arrays.
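
Several of the formats in the table above lend themselves to a quick check before submission: the GB_RANGE form ACCESSION.VERSION[start..end], the recommended version suffix on accessions, and the comma- or space-separated _LIST columns. The sketch below is illustrative only. The function names are hypothetical, the regular expression for what counts as an accession is an assumption (letters, digits and underscores followed by .VERSION), and the positions 100..160 are invented for the example.

```python
import re

# ACCESSION.VERSION[start..end]; the accession pattern is an assumption.
GB_RANGE_RE = re.compile(r"^([A-Za-z_0-9]+\.\d+)\[(\d+)\.\.(\d+)\]$")


def parse_gb_range(value):
    """Split a GB_RANGE value into (accession.version, start, end)."""
    match = GB_RANGE_RE.match(value.strip())
    if match is None:
        raise ValueError(f"not in ACCESSION.VERSION[start..end] form: {value!r}")
    accession, start, end = match.groups()
    return accession, int(start), int(end)


def has_version(accession):
    """True if the accession carries a .VERSION suffix, as in NM_022975.1."""
    return re.fullmatch(r"[A-Za-z_0-9]+\.\d+", accession.strip()) is not None


def split_list(value):
    """Split a _LIST cell on commas and/or spaces, dropping empty items."""
    return [item for item in re.split(r"[,\s]+", value.strip()) if item]


# Positions are hypothetical; the accessions are the ones used as examples above.
parsed = parse_gb_range("NC_004721.1[100..160]")
versioned = has_version("NM_022975.1")
unversioned = has_version("NM_022975")
items = split_list("NM_022975.1, NM_022975.2 NM_022975.3")
```

The same parsed triple could equally be written into separate RANGE_GB, RANGE_START and RANGE_END columns, since those carry the same information as a single GB_RANGE value.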