Why does the TCGA RNA-seq pipeline start from a BAM file?
2.3 years ago
シン ▴ 10

I am trying to reproduce exactly the RNA-seq processing pipeline that TCGA uses. Usually, a sequencing service delivers FASTQ files, which contain the read sequences and their base qualities, and the alignment step comes next. However, according to the TCGA pipeline documentation (https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/Expression_mRNA_Pipeline/), BAM files appear to be one of the inputs. Why did they use BAM files as input, and how can I repeat what they did? In addition, why does there seem to be no adaptor trimming step?

1: TCGA mRNA-seq pipeline schematic (https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/Expression_mRNA_Pipeline/)

RNA-Seq TCGA • 1.8k views
2.3 years ago
dsull ★ 5.8k

https://gdc.cancer.gov/about-gdc/gdc-faqs - Read the answer to "How can I access GDC sequencing data in FASTQ format?"

Level 1 (aka the raw fastq) data is restricted; to request access to it, see instructions here: https://gdc.cancer.gov/access-data/obtaining-access-controlled-data

Adaptor trimming is generally unnecessary for RNA-seq read mapping and gene-level quantification; this has been examined in the literature (e.g. https://academic.oup.com/nargab/article/2/3/lqaa068/5901066 ).

Thank you Delaney, these links answered many of my questions.

2.1 years ago
Zhenyu Zhang ★ 1.2k

As for why GDC sometimes uses BAM files as input: GDC is not a sequencing center. It only analyzes the data that submitters provide. If submitters provide only BAM files, GDC has no choice but to start from BAM files.
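
If you also start from a submitted BAM, one way to get back to the usual FASTQ-first workflow is to convert the BAM to FASTQ before re-aligning. The sketch below is only an illustration under several assumptions: it uses samtools (a stand-in chosen here, not necessarily the tool GDC runs; check the pipeline documentation linked in the question for their actual conversion step), it assumes paired-end reads and a recent samtools on your PATH, and the file names and the helper names `build_bam_to_fastq_commands` and `bam_to_fastq` are hypothetical.

```python
import shlex
import subprocess


def build_bam_to_fastq_commands(bam_path, out_prefix):
    """Return (collate_cmd, fastq_cmd) as argument lists for a paired-end BAM.

    Assumption: samtools is used purely for illustration; this is not
    claimed to be the conversion tool used by GDC.
    """
    # Group mates next to each other; -u = uncompressed, -O = write to stdout.
    collate_cmd = ["samtools", "collate", "-u", "-O", str(bam_path)]
    # Read the collated stream from stdin ("-") and split it into R1/R2 files.
    fastq_cmd = [
        "samtools", "fastq",
        "-1", f"{out_prefix}_R1.fastq.gz",
        "-2", f"{out_prefix}_R2.fastq.gz",
        "-0", "/dev/null",   # discard reads flagged as neither/both mates
        "-s", "/dev/null",   # discard singletons
        "-n",                # keep read names as they are (no /1, /2 suffix)
        "-",
    ]
    return collate_cmd, fastq_cmd


def bam_to_fastq(bam_path, out_prefix):
    """Run `samtools collate | samtools fastq` and raise if either step fails."""
    collate_cmd, fastq_cmd = build_bam_to_fastq_commands(bam_path, out_prefix)
    collate = subprocess.Popen(collate_cmd, stdout=subprocess.PIPE)
    fastq = subprocess.Popen(fastq_cmd, stdin=collate.stdout)
    # Close our copy of the pipe so collate sees SIGPIPE if fastq exits early.
    collate.stdout.close()
    fastq_rc = fastq.wait()
    collate_rc = collate.wait()
    if collate_rc != 0 or fastq_rc != 0:
        raise RuntimeError(
            f"conversion failed (collate={collate_rc}, fastq={fastq_rc})"
        )


if __name__ == "__main__":
    # Hypothetical file names; print the commands instead of running them.
    for cmd in build_bam_to_fastq_commands("sample.bam", "sample"):
        print(shlex.join(cmd))
```

Calling `bam_to_fastq("sample.bam", "sample")` would write `sample_R1.fastq.gz` and `sample_R2.fastq.gz`, which can then go into whichever aligner you choose. Keep in mind that reads the submitter filtered out before making the BAM cannot be recovered this way.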
