I ran an experiment in which I designed a sequencing library targeting a specific genomic region of interest. To sequence very deep and avoid PCR-duplicate issues, I added a UMI (Unique Molecular Identifier) to my reads.
I used UMI-tools (https://github.com/CGATOxford/UMI-tools/blob/master/QUICK_START.md#read-grouping) to group my reads according to their UMI (a 9 bp region in my read 2) and their alignment position. Reads that share the same UMI sequence and the same alignment position are thus flagged as PCR duplicates and assigned the same UG tag in the BAM file.
Here's the UMI-tools command I used (after a standard bwa mem alignment of the reads):
umi_tools group -I mapped.bam --paired --group-out=groups.tsv --output-bam -S mapped_grouped.bam
Now, how can I process my BAM file to take each group of reads with the same UG tag and report their consensus sequence? My goal is to discard potential sequencing errors.
One example (reads 1, 2, 3 and 4 have the same UMI and the same alignment position, so they are PCR duplicates):

reference  AAATTTGGGAAATTTGGG
read1      ------------A-----
read2      ------------A-----
read3      ---G--------A-----
read4      ------C-----------
consensus  AAATTTGGGAAAATTGGG
So in the consensus sequence there is a T->A change. The other modifications, in read 3 (T->G) and read 4 (G->C), are not included in the consensus because they are in the minority.
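The per-position majority vote above can be sketched in plain Python. This is only a toy matching the example's notation (where `-` means "same as the reference"), not a full solution: real reads would need CIGAR-aware alignment and indel handling.

```python
# Majority-vote consensus over a group of PCR-duplicate reads.
# Toy version matching the example above: '-' means "same as reference".
from collections import Counter

def consensus(reference, reads):
    cols = []
    for i, ref_base in enumerate(reference):
        # Resolve '-' placeholders to the reference base, then vote.
        bases = [r[i] if r[i] != '-' else ref_base for r in reads]
        cols.append(Counter(bases).most_common(1)[0][0])
    return ''.join(cols)

ref   = "AAATTTGGGAAATTTGGG"
reads = ["------------A-----",
         "------------A-----",
         "---G--------A-----",
         "------C-----------"]
print(consensus(ref, reads))  # AAATTTGGGAAAATTGGG
```

Note that `most_common` breaks ties arbitrarily; for real data you would want an explicit tie-breaking rule (e.g. fall back to the reference base or emit N).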
Any advice ?
EDIT: One idea would be to split the BAM file by UG tag and run Picard MarkDuplicates on every split BAM file, then merge all the processed BAM files back into one. What do you think?
EDIT2: The split-MarkDuplicates approach works, but it's very, very (very) slow: Picard takes ~1-2 seconds per group and I have ~10,000 groups to process. Any other ideas?
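One way to avoid launching Picard once per group is a single pass over the grouped BAM: bucket read sequences by their UG tag in memory, then vote a consensus per bucket. A rough sketch under stated assumptions: it uses pysam for BAM reading (shown only as a comment so the voting logic stays self-contained), and the consensus step is deliberately naive (reads truncated to the shortest sequence, no CIGAR/indel handling).

```python
# Single-pass alternative to the split/MarkDuplicates approach:
# bucket sequences by UG tag, then take a per-column majority vote.
from collections import Counter, defaultdict

def consensus_by_group(tagged_reads):
    """tagged_reads: iterable of (ug_tag, sequence) pairs."""
    groups = defaultdict(list)
    for ug, seq in tagged_reads:
        groups[ug].append(seq)
    out = {}
    for ug, seqs in groups.items():
        n = min(len(s) for s in seqs)  # naive: truncate to shortest read
        out[ug] = ''.join(Counter(s[i] for s in seqs).most_common(1)[0][0]
                          for i in range(n))
    return out

# Streaming the real file would look roughly like this (requires pysam):
# import pysam
# with pysam.AlignmentFile("mapped_grouped.bam", "rb") as bam:
#     pairs = ((r.get_tag("UG"), r.query_sequence)
#              for r in bam if r.has_tag("UG"))
#     print(consensus_by_group(pairs))
```

With ~10,000 groups this stays in one process and one file scan, so the per-group cost is a dictionary append and a vote rather than a JVM startup.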