Question: Extract consensus sequence reads (collapse PCR duplicates) from bam
Nicolas Rosewick (Brussels, Belgium) wrote, 21 months ago:

Hi,

I did an experiment where I designed a sequencing library focused on a specific genomic region of interest. To sequence very deeply and avoid PCR-duplicate issues, I added a UMI (Unique Molecular Identifier) to my reads.

I used UMI-tools (https://github.com/CGATOxford/UMI-tools/blob/master/QUICK_START.md#read-grouping) to group my reads by their UMI (a 9 bp region in my read 2) and alignment position. Reads that share the same UMI sequence and the same alignment position are flagged as PCR duplicates and assigned the same UG tag in the BAM file.

Here's the UMI-tools command I used (after a standard bwa mem alignment of the reads):

umi_tools group -I mapped.bam --paired --group-out=groups.tsv --output-bam -S mapped_grouped.bam

Now, how can I process my BAM file to take each group of reads with the same UG tag and report their consensus sequence? My goal is to discard potential sequencing errors.

One example (reads 1, 2, 3 and 4 have the same UMI and the same alignment position, so they are PCR duplicates):

reference AAATTTGGGAAATTTGGG
read1     ------------A-----
read2     ------------A-----
read3     ---G--------A-----
read4     ------C-----------
consensus AAATTTGGGAAAATTGGG
                      |

So in the consensus sequence there is a T->A change. The other changes, in read 3 (T->G) and read 4 (G->C), are not included in the consensus because they are minority calls.
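The majority-vote logic of the example above can be sketched in plain Python. This is only an illustrative toy (the function name and the '-' convention for "matches the reference" are mine, taken from the diagram), not an implementation over an actual BAM file:

```python
from collections import Counter

def consensus(reference, reads):
    """Majority-vote consensus; '-' in a read means 'same as the reference'."""
    out = []
    for i, ref_base in enumerate(reference):
        # Resolve each read's call at this column, falling back to the reference.
        calls = [r[i] if r[i] != '-' else ref_base for r in reads]
        base, _count = Counter(calls).most_common(1)[0]
        out.append(base)
    return ''.join(out)

reference = "AAATTTGGGAAATTTGGG"
reads = [
    "------------A-----",
    "------------A-----",
    "---G--------A-----",
    "------C-----------",
]
print(consensus(reference, reads))  # AAATTTGGGAAAATTGGG
```

The T->A change supported by three of four reads makes it into the consensus; the singleton changes in reads 3 and 4 are voted out.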

Any advice?

Thanks

EDIT: One idea would be to split the BAM file by UG tag and run Picard MarkDuplicates on every split BAM file, then merge all the processed BAM files back into one. What do you think?
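A rough sketch of that split-by-UG idea, assuming the grouped BAM from the umi_tools command above. File names are placeholders, and `samtools view -d` requires samtools >= 1.12; check your picard invocation style (jar vs. wrapper) against your install:

```shell
set -e
mkdir -p split
# Loop over the distinct UG tag values present in the grouped BAM.
for ug in $(samtools view mapped_grouped.bam | grep -o 'UG:i:[0-9]*' | cut -d: -f3 | sort -nu); do
    # Extract one duplicate group, then mark duplicates within it.
    samtools view -b -d UG:"$ug" -o split/ug_"$ug".bam mapped_grouped.bam
    picard MarkDuplicates I=split/ug_"$ug".bam O=split/ug_"$ug".dedup.bam M=split/ug_"$ug".metrics.txt
done
# Merge the per-group results back into a single BAM.
samtools merge -f dedup_merged.bam split/*.dedup.bam
```

Launching one picard JVM per group is what makes this loop so slow on thousands of groups.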


EDIT2: The split-then-MarkDuplicates approach works, but it's very very (very) slow: Picard takes ~1-2 seconds per group and I have ~10,000 groups to process. Any other ideas?


I am going to suggest that you take a look at clumpify.sh from BBMap (Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files. And remove duplicates.) to address the duplication issue. It also lets you compress the data (e.g. read 1 and read 2 will be deduplicated, leaving one copy). Clumpify is fast and, unlike Picard, does not need the reads to be aligned first.
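A typical clumpify.sh deduplication call, to make the suggestion concrete; file names here are placeholders, and the exact flag set is worth checking against the BBMap documentation:

```shell
# dedupe collapses identical read pairs into a single copy;
# subs controls how many substitutions are tolerated between "duplicates".
clumpify.sh in=reads_R1.fastq.gz in2=reads_R2.fastq.gz \
    out=dedup_R1.fastq.gz out2=dedup_R2.fastq.gz \
    dedupe subs=0
```

With subs=0 only exact duplicates are collapsed; raising subs also merges near-duplicates that differ by sequencing errors.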

(reply by genomax)

Thank you, but I don't think it will solve my question. It may speed up the workflow a bit, but the number of Picard instances will not change.

(reply by Nicolas Rosewick)

You will not need to use Picard at all, since you will have taken care of the duplicates beforehand.

(reply by genomax)

OK, just to be sure: if I have an example like

read1 ===========T========>---------<============ATGC
read2 ===========T========>---------<============ATGC
read3 ===========T========>---------<============ATGC
read4 ===========T========>---------<============ATGC
read5 ===========A========>---------<============ATGC
will it remove the duplicates and report this consensus sequence:
readC ===========T========>---------<============ATGC
or report two reads, since read 5 has one mismatch compared to reads 1, 2, 3 and 4:
readC1 ===========T========>---------<============ATGC
readC2 ===========A========>---------<============ATGC

Let's say the first behaviour is correct (i.e. it reports the consensus sequence even if mismatches occur): what happens when only two reads are in the same group?

read6 ===========T========>---------<============AATT
read7 ===========A========>---------<============AATT
|
readC ===========?========>---------<============AATT
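One common convention for a tie like the '?' above is to emit N unless one base has a clear majority (tools may instead break ties with base qualities). A toy sketch of that rule, with names of my own choosing:

```python
from collections import Counter

def consensus_base(calls, min_frac=0.5):
    """Return the majority base, or 'N' when no base exceeds min_frac of the calls."""
    base, count = Counter(calls).most_common(1)[0]
    return base if count / len(calls) > min_frac else 'N'

print(consensus_base(['T', 'A']))       # N  (50/50 tie, no majority)
print(consensus_base(['T', 'T', 'A']))  # T  (2 of 3 calls agree)
```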

Thanks

(reply by Nicolas Rosewick)

Reads 1 to 4 can be represented by only one read after you dedupe the sample.

See @Brian's answer regarding consensus with Clumpify here: C: Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files. And remove duplicates.

(reply by genomax)

I found this paper: https://academic.oup.com/bioinformatics/article-lookup/doi/10.1093/bioinformatics/bts123 . The tool is named Fulcrum, but I cannot find it; the web link seems dead...

(reply by Nicolas Rosewick)

Is clumpify not working for you? Just curious.

(reply by genomax)