BAM files of RNA-seq in TCGA

Question

TANRIC: lncRNA and methylation interaction

7.5 years ago

Shicheng Guo ★ 9.4k

Hi All,

I am trying to analyze the interaction between methylation and lncRNA expression using TCGA data. However, the TCGA project itself does not provide a lncRNA expression dataset. Fortunately, TANRIC provides lncRNA expression for TCGA cancer samples, but its sample size is quite limited and it has not been updated as the number of samples in TCGA has grown.

My question is: is there an existing pipeline to quantify lncRNA expression from the RNA-seq BAM files (aligned with BWA) of the TCGA project?

Thanks.

TANRIC contains read counts for Ensembl-defined lncRNAs, and also allows users to define their own lncRNAs by entering genomic coordinates. It also includes various analyses, such as survival analysis, and allows its data to be downloaded.

BAM files of RNA-seq in TCGA

Illumina paired-end RNA sequencing reads were aligned to the GRCh37-lite genome-plus-junctions reference using BWA version 0.5.7. This reference combined genomic sequences in the GRCh37-lite assembly with exon-exon junction sequences whose coordinates were defined from the transcript annotations in Ensembl (v59), RefSeq and the known genes from the UCSC genome browser, which were downloaded on August 19 2010, August 8 2010, and August 19 2010, respectively. Reads that mapped to junction regions were then repositioned back to the genome and marked with 'ZJ:Z' tags. BWA was run using default parameters, except that the option (-s) was included to disable Smith-Waterman alignment. Finally, reads failing the Illumina chastity filter were flagged with a custom script, and duplicated reads were flagged with Picard's MarkDuplicates.
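To see how many reads in one of these BAM files carry the junction tag or were flagged, something like the following could be run on SAM text (for example the output of `samtools view`). This is only a minimal sketch: it assumes that chastity-filter failures are recorded with the standard SAM QC-fail bit (0x200), which the description above does not state, and the function name `summarize_sam_lines` is made up for illustration.

```python
from collections import Counter

QC_FAIL = 0x200    # SAM flag bit: read fails platform/vendor quality checks
DUPLICATE = 0x400  # SAM flag bit: PCR or optical duplicate


def summarize_sam_lines(lines):
    """Count junction-repositioned, QC-failed and duplicate reads in SAM text.

    Assumption: chastity-filter failures use the 0x200 bit; the source only
    says they were flagged "with a custom script".
    """
    counts = Counter()
    for line in lines:
        # Skip blank lines and SAM header lines (which start with '@').
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        counts["total"] += 1
        # Optional tags start at the 12th column of a SAM record.
        if any(tag.startswith("ZJ:Z:") for tag in fields[11:]):
            counts["junction"] += 1
        if flag & QC_FAIL:
            counts["qc_fail"] += 1
        if flag & DUPLICATE:
            counts["duplicate"] += 1
    return counts
```

It could be fed from standard input, e.g. `summarize_sam_lines(sys.stdin)` after piping in `samtools view sample.bam`, where `sample.bam` is a placeholder file name.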

lncRNA methylation • 2.5k views

score 0 · Answer 1 · 2016-10-27

Thanks. I may have found a solution; I am sharing it here in case it is helpful.

TCGA RNA-seq processing

RNA-seq sequence libraries in BAM format (2x75 nt paired-end reads) for 412 primary HGS-OvCa tumors were downloaded from cgHub (http://cghub.ucsc.edu, data available on Oct 9 2012). The BAM files were produced by the BCCA Genome Science Center TCGA RNA-seq pipeline, which briefly uses BWA [50] for alignment to the Hg18 genome assembly and to exon junctions derived from Ensembl/GENCODE, UCSC genes and RefSeq. Low-quality alignments (mapping quality 0) were removed and sequences were name-sorted and converted to SAM format using SAMtools [51]. We used TopHat [52] with default parameters to realign a subset of the samples to enable unbiased study of splicing patterns in the AXI region. TCGA endometrial RNA-seq data in BAM format (76 nt single-end reads) for 321 tumors were obtained from cgHub (downloaded on Oct 18 2012).

These BAM files are not directly useful for quantifying GENCODE lncRNAs as they were generated by alignment to a limited transcriptome database. They were therefore converted to FASTQ format and realigned to the Hg19 assembly with TopHat using the “-G” option with known splice junctions from GENCODE. Read counts for individual GENCODE genes were subsequently determined using HTSeq-count (http://www-huber.embl.de/users/anders/HTSeq) in “intersection-strict” mode, by considering only uniquely mapped reads. RPKM expression levels for lncRNAs (n = 10,419) and other GENCODE genes were finally calculated by normalizing for mRNA length and library size as determined by the number of GENCODE-mapped reads. For analyses requiring log2-scale values, a pseudo value of 0.01 was added before conversion to avoid log of zero [53].

HGS-OvCa samples with fewer than 20 million GENCODE-mapped read pairs and without matching copy-number data were excluded, resulting in a final set of 407 tumor expression profiles with on average 63.1 million GENCODE-mapped read pairs each (25.7 billion in total). For endometrial samples, 10 million GENCODE-mapped reads were required, for a resulting final set of 293 tumors with on average 19.1 million GENCODE-mapped reads (5.6 billion in total).

Expression coverage plots were generated by dividing the genome into partially overlapping 1000 nt tiles spaced 500 nt apart. Tile read counts were determined using BEDTools (coverageBed utility) [54], and these were normalized based on the median of the top 5% expressed tiles in each sample.
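The normalization steps in the passage above can be sketched in a few lines of Python. This is a rough illustration, not the authors' code: all function names are hypothetical, the RPKM formula is the usual reads-per-kilobase-per-million definition, tile starts are assumed to be 0-based with a trailing partial tile dropped, and "top 5% expressed tiles" is read as the 5% of tiles with the highest read counts.

```python
import math
import statistics


def rpkm(read_count, transcript_length_nt, total_mapped_reads):
    """Reads per kilobase of transcript per million GENCODE-mapped reads."""
    return read_count * 1e9 / (transcript_length_nt * total_mapped_reads)


def log2_expression(value, pseudo=0.01):
    """log2 transform with the pseudo value of 0.01 mentioned in the text."""
    return math.log2(value + pseudo)


def tile_starts(chrom_length, tile_size=1000, step=500):
    """Start positions of overlapping tiles (1000 nt wide, 500 nt apart).

    Assumption: 0-based starts, and a final tile that would run past the
    end of the chromosome is dropped.
    """
    last_start = max(chrom_length - tile_size, 0)
    return list(range(0, last_start + 1, step))


def normalize_tiles(tile_counts, top_fraction=0.05):
    """Divide tile counts by the median of the top-expressed tiles."""
    ranked = sorted(tile_counts, reverse=True)
    n_top = max(1, int(len(ranked) * top_fraction))
    scale = statistics.median(ranked[:n_top])
    if scale == 0:
        raise ValueError("top tiles all have zero reads; cannot normalize")
    return [count / scale for count in tile_counts]
```

For example, a gene of 2000 nt with 100 reads in a library of 10 million GENCODE-mapped reads would get `rpkm(100, 2000, 10_000_000)`, and an unexpressed gene would map to `log2_expression(0.0)` rather than failing on log of zero.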