Question: Consensus reads from UMIs
Asked 14 months ago by graeme.thorn50 (London, United Kingdom):

I have DNA-seq data from a 30-gene capture panel, with UMIs in the FASTQ header for each read. The panel is used for variant detection in tissue and cell-free DNA samples at high coverage (>1000×), and the UMIs will help remove duplicates arising from sequencing the same DNA fragment multiple times, giving a better estimate of the allele fractions. However, we want to reduce false variant calls from sequencing errors as much as possible, so we need to generate a consensus sequence for each DNA fragment from its multiple duplicates (reads with the same UMI and position). This is similar to "Extract consensus sequence reads (collapse PCR duplicates) from bam", but not exactly the same, as the UMIs are in the FASTQ read id rather than in the read itself.

In a similar situation (RNA-seq with UMIs), I have successfully used UMI-tools to deduplicate the mapped reads: UMI-tools dedup retains the read with the highest mapping quality or the lowest position, or chooses one at random. That is fine for RNA-seq, but not for variant calling, where the sequence of the retained read matters.

There is also clumpify (Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files. And remove duplicates.), but this appears to work on the (FASTQ) reads only, which for high-depth capture-seq would mean that all reads matching at a position would be collapsed into a single read even if their UMIs differ (if I've understood the documentation correctly).

Are there any other tools which can work with UMIs to deduplicate and generate a consensus sequence from the duplicates per deduplicated read?
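For illustration, here is a minimal sketch of pulling the UMI out of a FASTQ read id. It assumes the common convention of appending the UMI to the read name after the last underscore (as umi_tools extract does); header layouts vary between pipelines, so the parsing would need adjusting to the actual data.

```python
def umi_from_read_id(read_id: str) -> str:
    """Return the UMI appended to a FASTQ read id after the last underscore.

    Assumes the umi_tools-style convention "@NAME_UMI"; this is an
    illustrative sketch, not a universal parser.
    """
    name = read_id.lstrip("@").split()[0]  # drop '@' and any comment field
    return name.rsplit("_", 1)[1]

print(umi_from_read_id("@NS500:7:FC:1:101:1000:2000_ACGTACGT"))  # -> ACGTACGT
```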

Tags: deduplication • dna-seq • umi

I can't give an answer, but can you split the FASTQ on the UMIs? If you can, you could maybe do something with a clustering or de novo assembly tool.

Some clustering tools have a consensus option.

Written 14 months ago by gb1.9k

graeme.thorn: You can just clump the reads together with clumpify based on how strict you want the sequence identity to be; you don't have to compress/de-dupe them. Depending on how many reads you have, you can then use a pileup/usearch to generate a consensus.

Written 14 months ago by genomax91k
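To illustrate the pileup/consensus idea, here is a toy per-position majority vote over duplicate reads that share a UMI and position. This is a minimal sketch, not the algorithm of any specific tool, and it assumes the duplicates are already grouped and of equal length.

```python
from collections import Counter

def consensus(reads):
    """Majority-vote consensus across equal-length duplicate reads.

    Positions without a strict majority are emitted as 'N'.
    Toy sketch: real consensus callers also weigh base qualities.
    """
    assert reads and len({len(r) for r in reads}) == 1, "need equal-length reads"
    out = []
    for column in zip(*reads):
        (base, count), *_ = Counter(column).most_common()
        out.append(base if count * 2 > len(column) else "N")
    return "".join(out)

print(consensus(["ACGT", "ACGA", "ACGT"]))  # -> ACGT
```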
Answered 14 months ago by finswimmer13k:

In the past I've sometimes worked with fgbio. You can see an example of how to work with it in the presentation.

I modified the pipeline shown in the presentation to fit my needs. I replaced some tools because I'm not a big fan of the Picard stuff. I also introduced a step using bbmerge to do error correction where reads overlap.

The example workflow was shown as an image in the original post (not reproduced here).
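As a rough sketch of what such an fgbio-based consensus workflow typically looks like (not finswimmer's exact pipeline, which replaced some tools and added a bbmerge step): file names are placeholders, and the exact flags should be checked against the fgbio documentation for your version.

```shell
# Hedged sketch of a typical fgbio UMI-consensus workflow.
# All file names are placeholders; verify options for your fgbio version.

# 1. Map the raw reads (the UMI is still in the read name).
bwa mem -t 8 ref.fa reads_R1.fq.gz reads_R2.fq.gz | \
    samtools sort -o mapped.bam -

# 2. Move the UMI from the read name into a SAM tag (RX),
#    where fgbio expects it.
fgbio CopyUmiFromReadName -i mapped.bam -o mapped_rx.bam

# 3. Group reads sharing a position and (near-identical) UMI.
fgbio GroupReadsByUmi -i mapped_rx.bam -o grouped.bam --strategy adjacency

# 4. Call one consensus read per UMI family.
fgbio CallMolecularConsensusReads -i grouped.bam -o consensus.ubam --min-reads 3

# 5. Re-map the consensus reads and proceed to variant calling.
```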


Of course, I will have to modify it slightly, but this does look like the most consistent way of doing it, particularly given that I need high sensitivity and low false positives for my purposes.

Written 13 months ago by graeme.thorn50