Question

Does TCGA use only host reads to generate its pre-processed data (CNV, mutation status, RNA-seq)?

1


10 months ago

yahn ▴ 10

I have a question regarding the data available on TCGA, including copy number variation, mutation status and RNA-seq transcriptome profiling data. The bioinformatics pipelines used for pre-processing are:

Copy Number Variation: ASCAT3 pipeline
Mutation status: MuTect2 pipeline
RNA-seq: STAR pipeline

My question is: are these pre-processed data (not the raw reads) derived only from host genomic/transcriptomic reads? If so, what kind of filter is used to remove non-host reads (e.g. viral reads) from the BAM files?

Your replies will be valuable in helping me understand the bioinformatics pipelines of data pre-processing. Thank you very much.

rna-seq copy-number-variation mutation-status TCGA • 904 views

updated 9 months ago by Zhenyu Zhang ★ 1.3k • written 10 months ago by yahn ▴ 10

0


There is no filtering for host or non-host reads in GDC. All GDC analyses use all reads that passed the quality criteria. The GDC reference genome also contains 200 virus subtypes. You should read this paper if you need more details: https://www.nature.com/articles/s41467-021-21254-9
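One way to check this on a given BAM is to tally reads per contig. The following is a minimal sketch, assuming you have the text output of `samtools idxstats` (tab-separated: contig name, length, mapped count, unmapped count) and that viral contigs can be recognised by name prefix. The function name and the prefix list are illustrative assumptions, not an official GDC list, so check them against the header of your own BAM.

```python
# Sketch: summarise `samtools idxstats` text into human / viral / other / unmapped.
# VIRAL_PREFIXES is an assumed, illustrative list; verify it against your BAM header.
import re

# Primary human chromosomes only (chr1..chr22, chrX, chrY, chrM).
HUMAN_RE = re.compile(r"^chr([0-9]{1,2}|X|Y|M)$")
VIRAL_PREFIXES = ("chrEBV", "HPV", "HBV", "HCV", "HIV", "CMV", "KSHV", "HTLV", "MCV", "SV40")


def summarize_idxstats(text, viral_prefixes=VIRAL_PREFIXES):
    """Return read counts per contig class from idxstats output."""
    counts = {"human": 0, "viral": 0, "other": 0, "unmapped": 0}
    prefixes = tuple(viral_prefixes)
    for line in text.strip().splitlines():
        name, _length, mapped, unmapped = line.split("\t")
        mapped, unmapped = int(mapped), int(unmapped)
        # Column 4 holds unmapped reads; on the "*" line these are the
        # unplaced ones, on named contigs they sit next to a mapped mate.
        counts["unmapped"] += unmapped
        if name == "*":
            continue
        if HUMAN_RE.match(name):
            counts["human"] += mapped
        elif name.startswith(prefixes):
            counts["viral"] += mapped
        else:
            # Alt contigs, unplaced scaffolds, decoys, and anything unrecognised.
            counts["other"] += mapped
    return counts


# Made-up numbers, only to show the expected input shape.
EXAMPLE = "chr1\t248956422\t1000\t3\nchrEBV\t171823\t12\t0\n*\t0\t0\t40\n"
print(summarize_idxstats(EXAMPLE))
```

In practice the input would come from something like `samtools idxstats sample.bam` on an indexed BAM, where `sample.bam` is a placeholder file name.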

9 months ago by Zhenyu Zhang ★ 1.3k

score 0 · Answer 1 · 2024-12-18

Sequencing runs do contain viral reads, but they don't align to the human reference, so the alignment step itself is the filter you are asking about. Those reads remain in the BAM files as unmapped reads.

https://pmc.ncbi.nlm.nih.gov/articles/PMC5828528/
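To see how many reads ended up unmapped, you can inspect the SAM FLAG field, where bit 0x4 marks an unmapped segment. This is a minimal sketch, assuming the input is SAM text lines (for example from `samtools view -h`); the function name `count_unmapped` is hypothetical.

```python
# Sketch: count mapped vs unmapped records in SAM-format text lines.
UNMAPPED = 0x4  # SAM FLAG bit meaning "segment unmapped"


def count_unmapped(sam_lines):
    """Return (mapped, unmapped) record counts, skipping header lines."""
    mapped = unmapped = 0
    for line in sam_lines:
        # Skip blank lines and header lines, which start with "@".
        if not line.strip() or line.startswith("@"):
            continue
        flag = int(line.split("\t")[1])  # FLAG is the second SAM column
        if flag & UNMAPPED:
            unmapped += 1
        else:
            mapped += 1
    return mapped, unmapped
```

The unmapped records themselves can be extracted with `samtools view -f 4`, which keeps only reads that have that flag bit set.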

The human genome does contain endogenous retroviruses (ERVs), which are basically viral fossils, but I don't think most viral sequences would align to them.

Some reference genomes include decoy sequences, such as the Epstein-Barr virus (EBV) genome, to intercept viral reads that have significant homology to human genes.