Series GSE32931
Status Public on Oct 13, 2011
Title ENCODE Cold Spring Harbor Labs Long RNA-seq (hg18)
Project ENCODE
Organism Homo sapiens
Experiment type Expression profiling by high throughput sequencing
Non-coding RNA profiling by high throughput sequencing
Summary This track depicts high throughput sequencing of long RNAs (>200 nt) from RNA samples from tissues or subcellular compartments from ENCODE cell lines. The overall goal of the ENCODE project is to identify and characterize all functional elements in the sequence of the human genome.
For data usage terms and conditions, please refer to http://www.genome.gov/27528022 and http://www.genome.gov/Pages/Research/ENCODE/ENCODEDataReleasePolicyFinal2008.pdf
 
Overall design Cells were grown according to the approved ENCODE cell culture protocols.
Sample preparation and sequencing:
K562 and GM12878 total cell, total RNA: Standard Illumina Paired-end kit, with the sole exception that a "tagged" random hexamer was used to prime the 1st strand synthesis: 5'-ACTGTAGGN6-3'. The addition of this tag is what permits us to make strand assignments for the reads. The sequence of the tag is reported at the 5' end of the read. Asymmetric PCR can place the tag on either the 1st or 2nd read, depending on which strand it used as a template. Strand assignments are made by looking for the tag at the 5' end of either read 1 or read 2. Read 1 is physically linked to read 2; therefore, if a tag is present on one end, strand assignments are made for both ends. We noted during analysis that the tags are generally 5' truncated. We only "strand" reads that contain ACTGTAGG, CTGTAGG, TGTAGG, or GTAGG at their 5' end. Between 63% and 68% of reads could be stranded in these libraries. It is possible to cull additional stranded reads that contain non-templated TAGG, AGG, GG, or G sequences at their 5' end. The peak in the insert size distribution is between 200 and 250 nucleotides.
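The tag lookup described above can be sketched roughly as follows. This is a minimal illustration, not the submitters' code: the function names `find_tag` and `tagged_mate` are hypothetical, and treating a pair with a tag on both mates as ambiguous is an assumption that the record does not state. The sketch only reports which mate carries the tag; how that maps to a genomic strand is not specified in the record and is left out.

```python
# Full tag and the 5'-truncated forms that the record says are accepted.
TAG_VARIANTS = ("ACTGTAGG", "CTGTAGG", "TGTAGG", "GTAGG")


def find_tag(read):
    """Return the accepted tag variant found at the 5' end of `read`, or None."""
    for tag in TAG_VARIANTS:
        if read.startswith(tag):
            return tag
    return None


def tagged_mate(read1, read2):
    """Return (mate_number, tag) for the mate that carries the tag, or None.

    Because the two mates are physically linked, a tag on either mate is
    enough to strand the whole pair.
    Assumption (not in the record): a tag on both mates is called ambiguous.
    """
    tag1, tag2 = find_tag(read1), find_tag(read2)
    if tag1 and not tag2:
        return (1, tag1)
    if tag2 and not tag1:
        return (2, tag2)
    return None  # untagged pair, or ambiguous under the assumption above
```

Pairs that begin only with the shorter TAGG, AGG, GG, or G forms return `None` here, matching the stricter rule in the text.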
K562 cytosol, polyA+ RNA: Oligo-dT selected poly-A+ RNA was RiboMinus-treated according to the manufacturer's protocol (Invitrogen). The RNA was treated with tobacco alkaline pyrophosphatase to eliminate any 5' cap structures and hydrolyzed to ~200 bases via alkaline hydrolysis. The 3' end was repaired using calf intestinal alkaline phosphatase, and poly-A polymerase was used to catalyze the addition of Cs to the 3' end. The 5' end was phosphorylated using T4 PNK, and an RNA linker was ligated onto the 5' end. Reverse transcription was carried out using a poly-G oligo with a defined 5' extension. The inserts were then amplified using oligos targeting the 5' linker and poly-G extension. This cloning protocol generated stranded reads that were read from the 5' ends of the inserts. The library was sequenced on a Solexa platform for a total of 36 cycles; however, the reads underwent post-processing, resulting in trimming of their 3' ends. Consequently, the mapped read lengths are variable.
Analysis:
K562 and GM12878 total cell, total RNA: Tags were removed from the 5' ends of the reads according to their lengths, and strand assignments were made. Subsequently, the reads were trimmed from their 3' ends to a final length of 50 nucleotides and mapped using Nexalign, a program developed by Timo Lassmann, RIKEN. We allowed up to 2 mismatches across the entire length and report only reads that mapped to a single, unique locus in the assembled hg18 genome.
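As a rough sketch of the read preprocessing described above (tag removal followed by 3' trimming), assuming reads are plain strings: `strip_tag_and_trim` is a hypothetical helper name, and passing reads shorter than 50 nucleotides through unchanged is an assumption, since the record does not say how such reads were handled. The mapping step itself is not shown.

```python
# Accepted tag forms (full tag and 5'-truncated variants) from the record.
TAG_VARIANTS = ("ACTGTAGG", "CTGTAGG", "TGTAGG", "GTAGG")
FINAL_LENGTH = 50  # final read length stated in the record


def strip_tag_and_trim(read, final_length=FINAL_LENGTH):
    """Remove a 5' tag if present, then trim the 3' end to `final_length`.

    Assumption: reads already shorter than `final_length` are returned as-is.
    """
    for tag in TAG_VARIANTS:
        if read.startswith(tag):
            read = read[len(tag):]  # drop exactly as many bases as the tag has
            break
    return read[:final_length]  # keep the 5'-most bases, trimming the 3' end
```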
K562 cytosol, polyA+ RNA: Reads were mapped to the human (hg18, March 2006) assembly using Nexalign. Only uniquely mapping (one locus), exactly matching (no mismatches) aligned reads are reported in the processed files. The procedure was as follows:
1) Collect the read sequences from Illumina non-filtered output files.
2) Filter out all reads that contain undefined nucleotides ('N').
3) Perform the iterative alignment/C-tail chopping algorithm (below). On each alignment step, the reads are aligned to the genome with 100% identity. All reads that align to a single locus are withdrawn from the alignment pool, and only the reads that could not be aligned continue to the next step.
a) Align to the hg18 genome using Nexalign 1.3.3 (© Timo Lassmann) without chopping off any nucleotides.
b) Chop off any C-blocks (until the first non-C) at the ends of the reads.
c) Align to the genome -> remove and save those that align.
d) Chop off any non-Cs until the next C.
e) Chop off the C-block until the next non-C.
f) Align to the genome -> remove and save those that align.
g) Repeat steps d, e, and f until the reads align to the genome, or chopping reduces the reads' lengths to below 16 (default), or there are no non-Cs left.
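The iterative alignment/C-tail chopping loop above might look something like the following sketch for a single read. Several things here are assumptions rather than facts from the record: chopping is applied at the 3' end of the read only (the record says "at the ends"), the aligner is represented by a caller-supplied callback `aligns_uniquely` that stands in for an exact-match, single-locus Nexalign query (Nexalign itself is not invoked), and the length check is applied before every alignment after the first. All function names are hypothetical.

```python
MIN_LENGTH = 16  # default minimum read length stated in the record


def chop_c_block(seq):
    """Remove a trailing run of Cs (assumed to sit at the 3' end)."""
    return seq.rstrip("C")


def chop_non_c_block(seq):
    """Remove trailing non-C bases until the next C (or the read is empty)."""
    end = len(seq)
    while end > 0 and seq[end - 1] != "C":
        end -= 1
    return seq[:end]


def iterative_align(read, aligns_uniquely, min_length=MIN_LENGTH):
    """Return the (possibly chopped) sequence that aligned, or None.

    `aligns_uniquely(seq) -> bool` is a stand-in for an exact, single-locus hit.
    """
    if "N" in read:                      # step 2: drop reads with undefined bases
        return None
    if aligns_uniquely(read):            # step a: align the unchopped read
        return read
    seq = chop_c_block(read)             # step b: chop the trailing C-block
    if len(seq) >= min_length and aligns_uniquely(seq):  # step c
        return seq
    while len(seq) >= min_length:        # step g: repeat d, e, f
        seq = chop_c_block(chop_non_c_block(seq))        # steps d and e
        if len(seq) < min_length:        # too short, or nothing but Cs was left
            return None
        if aligns_uniquely(seq):         # step f
            return seq
    return None
```

Each pass through the loop removes at least one base, so the loop always terminates; a read made up only of Cs is reduced to an empty string and rejected by the length check.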
Web link http://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg18&g=wgEncodeCshlLongRnaSeq
http://www.ncbi.nlm.nih.gov/geo/info/ENCODE.html
 
Contributor(s) Gingeras T, Davis C
Citation(s) 22019781
BioProject PRJNA30709
Submission date Oct 12, 2011
Last update date May 15, 2019
Contact name ENCODE DCC
E-mail(s) encode-help@lists.stanford.edu
Organization name ENCODE DCC
Street address 300 Pasteur Dr
City Stanford
State/province CA
ZIP/Postal code 94305-5120
Country USA
 
Platforms (1)
GPL9052 Illumina Genome Analyzer (Homo sapiens)
Samples (3)
GSM646522 Gingeras_GM12878_cell_total
GSM646523 Gingeras_K562_cell_total
GSM646524 Gingeras_K562_cytosol_longPolyA
This SubSeries is part of SuperSeries:
GSE26284 ENCODE Cold Spring Harbor Labs Long RNA-seq
Relations
SRA SRP005098

Download family Format
SOFT formatted family file(s) SOFT
MINiML formatted family file(s) MINiML
Series Matrix File(s) TXT

Supplementary file Size Download File type/resource
GSE32931_RAW.tar 1.8 Gb (http)(custom) TAR (of BB)
Raw data are available in SRA
Processed data provided as supplementary file
