Does TCGA use only host reads to generate pre-processed data (CNV, mutation status, RNA-seq)?
8 months ago
yahn ▴ 10

I have a question regarding data available on TCGA, including copy number variation, mutation status and RNA-seq transcriptome profiling data. The following are the bioinformatics pipelines used for pre-processing.

  • Copy Number Variation: ASCAT3 pipeline
  • Mutation status: MuTect2 pipeline
  • RNA-seq: STAR pipeline

My question is: are these pre-processed data (not raw reads) generated only from host genomic/transcriptomic reads? If so, what kind of filter is used to remove non-host reads (e.g. viral reads) from the BAM files?

Your replies will be valuable in helping me understand the bioinformatics pipelines of data pre-processing. Thank you very much.

rna-seq copy-number-variation mutation-status TCGA • 806 views

There is no filtering for host or non-host reads in GDC. All GDC analyses use all reads that passed the quality criteria. The GDC reference genome also contains sequences for 200 virus subtypes. You should read this paper if you need more details: https://www.nature.com/articles/s41467-021-21254-9
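
If it helps, here is a minimal sketch of how you could check for yourself how many primary alignments in a BAM land on viral contigs versus human ones. It reads SAM text lines (for example, the output of `samtools view -h`) and relies only on the FLAG and RNAME columns. The contig names in `VIRAL_CONTIGS` are placeholders I made up for illustration, so replace them with the names from the `@SQ` lines of your own BAM header. The example records are also invented.

```python
from collections import Counter

# Hypothetical viral contig names, for illustration only.
# Check the @SQ header lines of your BAM for the real names.
VIRAL_CONTIGS = {"chrEBV", "HPV16", "HBV"}


def classify_reads(sam_lines, viral_contigs=VIRAL_CONTIGS):
    """Count primary records as 'host', 'viral', or 'unmapped'."""
    counts = Counter()
    for line in sam_lines:
        # Skip blank lines and header lines (headers start with '@').
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        flag, rname = int(fields[1]), fields[2]
        # Skip secondary (0x100) and supplementary (0x800) alignments
        # so that each read is counted once.
        if flag & 0x900:
            continue
        if flag & 0x4 or rname == "*":  # 0x4 = read unmapped
            counts["unmapped"] += 1
        elif rname in viral_contigs:
            counts["viral"] += 1
        else:
            counts["host"] += 1
    return counts


# Made-up SAM records: one host, one viral, one unmapped, one secondary.
example = [
    "@SQ\tSN:chr1\tLN:248956422",
    "r1\t0\tchr1\t100\t60\t5M\t*\t0\t0\tACGTA\tIIIII",
    "r2\t16\tchrEBV\t50\t60\t5M\t*\t0\t0\tACGTA\tIIIII",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGTA\tIIIII",
    "r1\t256\tchr2\t100\t0\t5M\t*\t0\t0\tACGTA\tIIIII",
]
print(dict(classify_reads(example)))
```

For a quick per-contig summary on an indexed BAM, `samtools idxstats` gives similar information without any scripting.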

8 months ago

Sequencing runs do contain viral reads, but they don't align to the human reference, so the alignment itself is the filter you are asking about. Those reads are left unmapped in the BAM files.

https://pmc.ncbi.nlm.nih.gov/articles/PMC5828528/
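
To look at the unmapped reads mentioned above, one option is to pull out every record whose SAM FLAG has the 0x4 (unmapped) bit set and write it as FASTA for a separate viral screen. This is only a rough sketch working on SAM text lines; the function name and the example records are made up for illustration, and it does not handle anything beyond the basic columns.

```python
def unmapped_to_fasta(sam_lines):
    """Yield FASTA records for reads flagged as unmapped (FLAG bit 0x4)."""
    for line in sam_lines:
        # Skip blank lines and header lines (headers start with '@').
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        name, flag, seq = fields[0], int(fields[1]), fields[9]
        if flag & 0x4 and seq != "*":
            # Tag mates so paired reads keep distinct names:
            # 0x40 = first in pair, 0x80 = second in pair.
            if flag & 0x40:
                suffix = "/1"
            elif flag & 0x80:
                suffix = "/2"
            else:
                suffix = ""
            yield f">{name}{suffix}\n{seq}"


# Made-up records: one mapped read, one unmapped single read,
# and one unmapped pair (flags 77 and 141).
example_sam = [
    "@SQ\tSN:chr1\tLN:248956422",
    "r1\t0\tchr1\t100\t60\t5M\t*\t0\t0\tACGTA\tIIIII",
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\tGGGTT\tIIIII",
    "r3\t77\t*\t0\t0\t*\t*\t0\t0\tTTTAA\tIIIII",
    "r3\t141\t*\t0\t0\t*\t*\t0\t0\tCCCGG\tIIIII",
]
for record in unmapped_to_fasta(example_sam):
    print(record)
```

On a real file, `samtools view -h -f 4 file.bam` selects the same records directly, and its output can be fed to a function like this one.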

There are endogenous retroviruses (ERVs) in the human genome that are basically viral fossils but I don't think most viral sequences would align with them.

Some reference genomes include decoy sequences to intercept reads from viruses, such as the Epstein-Barr virus (EBV), that have significant homology to human genes.
