I have a question regarding data available on TCGA, including copy number variation, mutation status and RNA-seq transcriptome profiling data. The following are the bioinformatics pipelines used for pre-processing.
- Copy Number Variation: ASCAT3 pipeline
- Mutation status: MuTect2 pipeline
- RNA-seq: STAR pipeline
My question is: are these pre-processed data (not raw reads) derived only from host genomic/transcriptomic reads? If so, I am curious what kind of filter is used to remove non-host reads (e.g. viral reads) from the BAM files.
Your replies will be valuable in helping me understand the bioinformatics pipelines of data pre-processing. Thank you very much.
There is no filtering of host versus non-host reads in GDC. All GDC analyses use every read that passes the quality criteria. The GDC reference genome also contains 200 virus subtypes, so viral reads can align to those sequences rather than being discarded. If you need more details, you should read this paper: https://www.nature.com/articles/s41467-021-21254-9
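If you want to see how many reads in a given BAM landed on non-human sequences, one rough way is to summarize the per-contig counts that `samtools idxstats` prints (contig name, length, mapped reads, unmapped reads, tab-separated). The sketch below is only an illustration: the function name `summarize_idxstats`, the set of "primary human" contig names, and the example counts are assumptions, and the viral contig names in the sample text are placeholders that may not match the names in your BAM header.

```python
# Assumption: primary human contigs use the "chr" naming shown here.
PRIMARY_HUMAN = {f"chr{i}" for i in range(1, 23)} | {"chrX", "chrY", "chrM"}


def summarize_idxstats(text, primary=PRIMARY_HUMAN):
    """Split mapped-read counts from `samtools idxstats` output into
    primary human contigs versus every other contig (decoy, viral, ...)."""
    human = 0
    other = {}
    for line in text.strip().splitlines():
        contig, _length, mapped, _unmapped = line.split("\t")
        if contig == "*":
            # The "*" row holds reads with no reference assigned; skip it.
            continue
        mapped = int(mapped)
        if contig in primary:
            human += mapped
        elif mapped > 0:
            other[contig] = mapped
    return human, other


# Made-up idxstats text for illustration only (not real TCGA counts).
example = (
    "chr1\t248956422\t1000\t0\n"
    "chrEBV\t171823\t12\t0\n"
    "HPV16\t7906\t40\t0\n"
    "*\t0\t0\t5\n"
)
human, other = summarize_idxstats(example)
print(human, other)
```

In practice you would generate the input with `samtools idxstats your.bam` on an indexed BAM and pass the captured text to the function. Anything reported in `other` still needs to be checked against the BAM header to tell decoy contigs apart from viral ones.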