Question

WES data analysis: the rationale behind merging BAM files and marking duplicates

0

2.0 years ago

Thanh • 0

I was following the DNA-seq analysis workflow from the NCI GDC and was wondering what the rationale is behind merging the BAM files. Specifically, I am dealing with paired-end WES data for 4 samples. Does merging BAM files here mean that I have to merge all 4 BAM files and then mark duplicates in the merged BAM file?

sequencing whole-exome WES BAM merge Picard MarkDuplicates • 1.6k views

ADD COMMENT • link updated 2.0 years ago by Zhenyu Zhang ★ 1.2k • written 2.0 years ago by Thanh • 0

1

This is the relevant section:

Each read group is aligned to the reference genome separately and all read group alignments that belong to a single aliquot are merged using Picard Tools SortSam and MergeSamFiles. Duplicate reads, which may persist as PCR artifacts, are then flagged to prevent downstream variant call errors.

Basically, if the same sample library was run on multiple lanes/flowcells, then those BAM files are merged.
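
The sort, merge, then mark-duplicates order described in that section can be sketched as a small Python helper that only builds the Picard command lines for one aliquot. This is a minimal sketch, not the GDC implementation: the function name `picard_plan`, the jar path, and the output file names are hypothetical, and the `KEY=value` arguments follow Picard's legacy argument style.

```python
from typing import List


def sorted_name(bam: str) -> str:
    """Derive a hypothetical output name for the coordinate-sorted BAM."""
    stem = bam[:-4] if bam.endswith(".bam") else bam
    return stem + ".sorted.bam"


def picard_plan(aliquot: str, read_group_bams: List[str],
                picard_jar: str = "picard.jar") -> List[List[str]]:
    """Build (but do not run) the Picard commands for one aliquot.

    Order: SortSam per read group, MergeSamFiles across the read groups
    of this aliquot, then MarkDuplicates on the merged BAM.
    """
    merged = f"{aliquot}.merged.bam"               # hypothetical file name
    marked = f"{aliquot}.markdup.bam"              # hypothetical file name
    metrics = f"{aliquot}.markdup_metrics.txt"     # hypothetical file name

    commands: List[List[str]] = []
    sorted_bams: List[str] = []
    for bam in read_group_bams:
        out = sorted_name(bam)
        commands.append(["java", "-jar", picard_jar, "SortSam",
                         f"I={bam}", f"O={out}", "SORT_ORDER=coordinate"])
        sorted_bams.append(out)

    # One merge per aliquot: every read-group BAM becomes an I= argument.
    merge = ["java", "-jar", picard_jar, "MergeSamFiles"]
    merge += [f"I={b}" for b in sorted_bams]
    merge.append(f"O={merged}")
    commands.append(merge)

    # Duplicates are flagged only after the merge.
    commands.append(["java", "-jar", picard_jar, "MarkDuplicates",
                     f"I={merged}", f"O={marked}", f"M={metrics}"])
    return commands
```

Each returned list can be passed to `subprocess.run` once the paths have been checked. Marking duplicates after the merge matters because duplicates of the same library fragment can end up in different lanes.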

ADD REPLY • link 2.0 years ago by GenoMax 141k

0

Can you add a link to the relevant section? Thank you!

ADD REPLY • link 2.0 years ago by Thanh • 0

0

A follow-up question: if I have 4 ERR files for the same tumor sample (i.e. CDX18), does that mean these 4 runs belong to the same library, CDX18, and that I should merge the 4 BAM files before marking duplicates? I'm quite confused about what exactly an aliquot is and where I can find that information. Please link any article explaining this term.

ADD REPLY • link 2.0 years ago by Thanh • 0

1

By the TCGA/GDC definition, a portion is the piece of physical tissue used for extraction of DNA/RNA/protein/etc., an analyte is the extracted material (i.e. DNA/RNA/protein), and an aliquot is simply a fraction of an analyte.

An aliquot is then used in the library preparation step to make a library (a homogeneous solution of DNA with adapters ligated to it, ready for sequencing). When you are sequencing, each library x lane combination is a read group, and if a library is sequenced in multiple read groups, people normally merge them into one BAM. If multiple libraries have been made from the same analyte/aliquot, people sometimes merge those as well.
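
To decide which BAM files go into one merge, the runs can be grouped by sample and library, so that read groups of the same library land in the same group. The following is a minimal sketch: the `Run` record, its field names, and the `merge_groups` function are hypothetical, and the sample and library values would have to come from your own run metadata.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class Run(NamedTuple):
    run_id: str    # run accession, e.g. an ERR identifier
    sample: str    # sample / aliquot name
    library: str   # library name from the run metadata
    bam: str       # aligned BAM file for this read group


def merge_groups(runs: List[Run]) -> Dict[Tuple[str, str], List[str]]:
    """Group BAM files by (sample, library).

    Each group holds the read-group BAMs that would be merged into one
    BAM before duplicates are marked. Different samples never share a group.
    """
    groups: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    for run in runs:
        groups[(run.sample, run.library)].append(run.bam)
    return dict(groups)
```

If four runs all report the same sample and the same library, they fall into a single group and yield one merged BAM. If the metadata lists different samples, each sample gets its own group and nothing is merged across them.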

ADD REPLY • link 2.0 years ago by Zhenyu Zhang ★ 1.2k

0

Only reads from the same aliquot are merged. There is no merging across aliquots. This is standard practice at almost all sequencing and analysis centers.

ADD REPLY • link 2.0 years ago by Zhenyu Zhang ★ 1.2k

0

Irrespective of this, the sort option lets you get the reads ordered by chromosome.
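
As an illustration of what ordering by chromosome means, here is a minimal sketch of a coordinate sort on plain tuples, assuming this comment refers to coordinate sort order. The `Read` tuple layout and the `coordinate_sort` function are hypothetical stand-ins, not part of any tool; the point is that reads are ordered by the reference sequence order given in the BAM header and then by position, not alphabetically by chromosome name.

```python
from typing import List, Tuple

# (reference name, 1-based leftmost position, read name)
Read = Tuple[str, int, str]


def coordinate_sort(reads: List[Read], reference_order: List[str]) -> List[Read]:
    """Sort reads by reference order (as listed in the header), then position."""
    rank = {name: i for i, name in enumerate(reference_order)}
    return sorted(reads, key=lambda r: (rank[r[0]], r[1]))
```

With a header order of chr1, chr2, chr10, a read on chr2 sorts before a read on chr10, even though "chr10" comes first alphabetically.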

ADD REPLY • link 2.0 years ago by Prash ▴ 280