Question

Why does the TCGA RNA-seq pipeline start from a BAM file?

1

2.3 years ago

シン ▴ 10

I am trying to reproduce exactly the RNA-seq processing pipeline that TCGA uses. Usually a sequencing service delivers FASTQ files, which contain the read sequences and their base quality scores, and alignment is the next step. However, the TCGA pipeline documentation (https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/Expression_mRNA_Pipeline/) suggests that BAM files are used as one of the inputs. Why do they use BAM files as input, and how can I repeat what they did? In addition, why does there seem to be no adapter trimming step?

1: TCGA mRNA-seq pipeline schematic https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/Expression_mRNA_Pipeline/

RNA-Seq TCGA • 1.8k views

0

2.1 years ago

Zhenyu Zhang ★ 1.2k

As for why GDC sometimes uses BAM files as input: GDC is not a sequencing center. It only analyzes the data that submitters provide. If submitters provide only BAMs, GDC has no choice but to use BAMs.
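If you also start from a BAM, one way to get back to the usual FASTQ-then-align workflow is to convert the BAM to FASTQ and realign. Below is a minimal sketch, assuming paired-end reads and using samtools and STAR. The function names, file names, thread count and the particular STAR options are my own illustrative choices, not the GDC's exact tools or parameters; check the pipeline documentation for those. The code only builds the commands so you can inspect them before running anything.

```python
import shlex
from typing import List


def bam_to_fastq_cmds(bam: str, out_prefix: str, threads: int = 4) -> List[List[str]]:
    """Build (not run) two samtools commands meant to be piped together.

    collate groups mates by read name; fastq then writes R1/R2 files.
    Output file names are hypothetical.
    """
    collate = ["samtools", "collate", "-u", "-O", "-@", str(threads), bam]
    fastq = [
        "samtools", "fastq", "-@", str(threads),
        "-1", f"{out_prefix}_R1.fastq.gz",
        "-2", f"{out_prefix}_R2.fastq.gz",
        "-0", "/dev/null",   # discard reads flagged as neither/both mates
        "-s", "/dev/null",   # discard singletons
        "-n",                # keep read names as-is
        "-",                 # read the collated BAM from stdin
    ]
    return [collate, fastq]


def star_align_cmd(genome_dir: str, r1: str, r2: str,
                   out_prefix: str, threads: int = 4) -> List[str]:
    """Build a STAR two-pass alignment command (illustrative options only)."""
    return [
        "STAR",
        "--runThreadN", str(threads),
        "--genomeDir", genome_dir,           # pre-built STAR index (assumed to exist)
        "--readFilesIn", r1, r2,
        "--readFilesCommand", "zcat",        # inputs are gzip-compressed
        "--twopassMode", "Basic",
        "--outSAMtype", "BAM", "SortedByCoordinate",
        "--quantMode", "GeneCounts",
        "--outFileNamePrefix", out_prefix,
    ]


def render_pipeline(cmds: List[List[str]]) -> str:
    """Render a list of commands as a single shell pipeline string."""
    return " | ".join(shlex.join(c) for c in cmds)


# Print the commands for a hypothetical sample so they can be reviewed.
print(render_pipeline(bam_to_fastq_cmds("sample.bam", "sample")))
print(shlex.join(star_align_cmd("star_index", "sample_R1.fastq.gz",
                                "sample_R2.fastq.gz", "sample_")))
```

Building argument lists instead of shell strings keeps quoting safe, and the printed pipeline can be pasted into a terminal once the paths are adjusted.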

score 2 · Accepted Answer · 2022-01-18

2

2.3 years ago

dsull ★ 5.8k

https://gdc.cancer.gov/about-gdc/gdc-faqs - Read the answer to "How can I access GDC sequencing data in FASTQ format?"

Level 1 data (i.e. the raw FASTQ files) is access-controlled; to request access, see the instructions here: https://gdc.cancer.gov/access-data/obtaining-access-controlled-data

Adapter trimming is generally unnecessary for RNA-seq read mapping; this has been examined in the literature (e.g. https://academic.oup.com/nargab/article/2/3/lqaa068/5901066 ).
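One way to see this in your own data, assuming an aligner that soft-clips read ends it cannot align (STAR does local alignment by default): leftover adapter sequence tends to appear as `S` operations at the ends of the CIGAR string instead of preventing the read from mapping. The helper below is a small sketch for counting soft-clipped bases in a CIGAR string; the function name is hypothetical and this is an illustration, not part of any pipeline.

```python
import re
from typing import Tuple

# One CIGAR operation: a length followed by an operation code from the SAM spec.
_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def soft_clipped_bases(cigar: str) -> Tuple[int, int]:
    """Return (left, right) soft-clipped base counts for a SAM CIGAR string."""
    ops = [(int(n), op) for n, op in _CIGAR_OP.findall(cigar)]
    # Hard clips (H) can only sit outside soft clips, so drop them first.
    ops = [o for o in ops if o[1] != "H"]
    left = ops[0][0] if ops and ops[0][1] == "S" else 0
    right = ops[-1][0] if len(ops) > 1 and ops[-1][1] == "S" else 0
    return left, right


print(soft_clipped_bases("85M15S"))          # 15 bases clipped at the 3' end of the record
print(soft_clipped_bases("60M2000N25M15S"))  # spliced read, still clipped at the end
print(soft_clipped_bases("100M"))            # no clipping
```

A consistently high clipped fraction at one read end would be a hint that adapters are present, which is the case where it may still be worth checking whether trimming changes your results.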