                    3.5 years ago
        Thanh
        
    
    I was following the DNA-seq analysis workflow from NCI GDC, and I was wondering what the rationale is behind merging the BAM files. Specifically, I am dealing with paired-end WES data for 4 samples. Does merging BAM files here mean that I have to merge all 4 BAM files and then mark duplicates in the merged BAM file?
This is the relevant section:
Basically, if the same sample library was run on multiple lanes/flowcells, then those BAM files are merged.
Can you add a link to the relevant section? Thank you!
A follow-up question: if I have 4 ERR files for the same tumor sample (i.e. CDX18), does that mean these 4 runs belong to the same library, CDX18, and that I should merge the 4 BAM files before marking duplicates? I'm quite confused about what exactly an aliquot is and where I can find that information. Could you point me to an article explaining this term?
By TCGA/GDC definition, a portion is the piece of physical tissue used for extraction of DNA/RNA/protein/etc., an analyte is the extracted material (i.e. DNA/RNA/protein), and an aliquot is simply a fraction of an analyte.
The aliquot is then used in the library preparation step to make a library (a homogeneous solution of DNA with adaptors ligated to it, ready for sequencing). When you sequence, each library × lane combination is a read group, and if a library is sequenced in multiple read groups, people normally merge them into one BAM. If multiple libraries have been made from the same analyte/aliquot, people sometimes merge those as well.
Only reads from the same aliquot are merged. There is no merging across aliquots. This is standard practice at almost all sequencing and analysis centers.
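To make the grouping rule concrete, here is a minimal Python sketch of how one might decide which BAMs get merged. It is not the GDC pipeline itself: the run records, the aliquot labels, and the file names (`ERR_run1.bam` and so on) are hypothetical placeholders, and it assumes each per-run BAM is already sorted and carries its own read group. It only builds `samtools merge` command lines; it does not run them. Duplicate marking would then be done once on each merged BAM.

```python
from collections import defaultdict


def plan_merges(runs):
    """Group per-run BAM paths by aliquot. Returns {aliquot: [bam, ...]}."""
    groups = defaultdict(list)
    for run in runs:
        groups[run["aliquot"]].append(run["bam"])
    return dict(groups)


def merge_command(aliquot, bams):
    """Build a `samtools merge` argument list for one aliquot.

    Returns None when there is only one BAM, since nothing needs merging.
    """
    if len(bams) < 2:
        return None
    # Output name is a made-up convention for this sketch.
    return ["samtools", "merge", f"{aliquot}.merged.bam", *sorted(bams)]


# Hypothetical example: four runs of one tumor aliquot, one run of another.
runs = [
    {"aliquot": "CDX18", "bam": "ERR_run1.bam"},
    {"aliquot": "CDX18", "bam": "ERR_run2.bam"},
    {"aliquot": "CDX18", "bam": "ERR_run3.bam"},
    {"aliquot": "CDX18", "bam": "ERR_run4.bam"},
    {"aliquot": "CDX19", "bam": "ERR_run5.bam"},
]

plan = plan_merges(runs)
for aliquot, bams in plan.items():
    cmd = merge_command(aliquot, bams)
    if cmd is None:
        print(f"{aliquot}: single BAM, no merge needed")
    else:
        print(" ".join(cmd))
```

In this sketch the four CDX18 runs end up in one merge command, while CDX19 is left alone; BAMs from different aliquots never appear in the same command.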
Irrespective of this, the sort option would let you get the reads in chromosome order.