Coverage data for TCGA BAM files
1
0
Entering edit mode
16 months ago
enho ▴ 40

Hi everyone,

I am currently performing an analysis on pancan TCGA wxs data (>10000 normal-tumor pairs) . For my analysis I need to have the total coverage of each BAM file, so I can perform a depth normalization for on tumor vs matched normal sample.

My initial idea was to find the .bam.bai files and use them to find the total number of reads, but I couldn't find those files either.

exome number tcga copy dna whole coverage • 874 views
1
Entering edit mode

If you already granted access to TCGA raw files, create a manifest for those samples, then remove bams and keep .bai files in the manifest file, then this:

gdc-client download -m gdc_manifest_XXXXX.txt  -t gdc-user-token.XXXXX.txt


will download .bai files. Not sure what info you might get on read number from a bai file though.

0
Entering edit mode

Thanks for your respond Hamid. I am going to make a dummy bam file and then use samtools idxstats to retrieve the number of reads from index file. The pitfall of this method is that you can't separate reads based on their flag/MAPQ, so for example you might count some of the reads that are non-uniquely mapped twice.

1
Entering edit mode
16 months ago
enho ▴ 40

For anyone reading this later, I figured out how to do it: (Scripts are in R) (package "GenomicDataCommons" in bioconductor is used)

1. First get a Manifest for BAM files. Initially you can't get a manifest of .BAI files, you can only get BAM files manifest
manifest = GenomicDataCommons::files() %>%
GenomicDataCommons::filter(~ cases.project.project_id == "TCGA-KICH" &
experimental_strategy == "WXS" &
data_format == "BAM") %>%   GenomicDataCommons::manifest()

1. Using UUID of BAM files, run this query (according to here)
manifest.bai = lapply(manifest$id, function(uuid) { con = curl::curl(paste0("https://api.gdc.cancer.gov/files/", uuid, "?pretty=true&expand=index_files")) tbl = jsonlite::fromJSON(con) bai = data.table(id = tbl$data$index_files$file_id,
filename = tbl$data$index_files$file_name, md5 = tbl$data$index_files$md5sum,
size = tbl$data$index_files$file_size, state = tbl$data$index_files$state)
return(bai)
})

2. Make a dummy BAM file (or any BAM file for this matter)
3. Use samtools idxstats DUMMY.BAM to find the coverage info from each individual bam.bai file
    Note: if you are running a script to count them one by one, at each step you should change the name of dummy.bam to the name of bam.bai file, so idxstats can read it!

4. Sum up the results from third column (mapped reads) for whichever sequence name you like (usually chr1:chrY)