Question

Consensus reads from UMIs

4.7 years ago

graeme.thorn ▴ 100

I have DNA-seq data from a 30-gene capture panel, with UMIs in the FASTQ header for each read. The panel is for variant detection in tissue and cell-free DNA samples at high coverage (>1000X), and the UMIs will help remove duplicates that arise from sequencing the same fragment of DNA multiple times, giving a better estimate of the allele fractions. However, we want to reduce false variant calls caused by sequencing errors as much as possible, so we need to generate a consensus sequence for each DNA fragment from its duplicates (reads with the same UMI and position). This is similar to "Extract consensus sequence reads (collapse PCR duplicates) from bam", but not exactly the same, as the UMIs are in the FASTQ read ID rather than in the read itself.

In a similar situation (RNA-seq with UMIs), I have successfully used UMI-tools to deduplicate the mapped reads. UMI-tools dedup retains the read with the highest mapping quality or the lowest position, or chooses one at random. That is fine for RNA-seq, but not for variant calling, where the sequence of the mapped read matters.

There is also clumpify ("Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files. And remove duplicates."), but this appears to work on the (FASTQ) read sequences only. For high-depth capture-seq, that would mean all reads matching the same position are compressed into a single read, even if their UMIs differ (if I've understood the documentation correctly).

Are there any other tools that can use UMIs to deduplicate reads and generate a consensus sequence from each group of duplicates?
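To make the goal concrete, here is a minimal Python sketch of the idea described above: reads sharing a UMI and mapping position are treated as one family, and a per-base majority vote gives the consensus. It is only an illustration, not how any particular tool works. The `(umi, chrom, pos, seq)` tuples, the `min_fraction` threshold, and the assumption that reads in a family are already aligned to the same start and have the same length are all hypothetical simplifications; base qualities and UMI sequencing errors are ignored.

```python
from collections import Counter, defaultdict


def consensus(seqs, min_fraction=0.6):
    """Column-wise majority vote over sequences that share a start position.

    Emits 'N' where no base reaches min_fraction of the family.
    zip() stops at the shortest sequence, so equal lengths are assumed.
    """
    out = []
    for column in zip(*seqs):
        base, count = Counter(column).most_common(1)[0]
        out.append(base if count / len(column) >= min_fraction else "N")
    return "".join(out)


def collapse(reads, min_fraction=0.6):
    """Group (umi, chrom, pos, seq) tuples into families and call a consensus for each."""
    families = defaultdict(list)
    for umi, chrom, pos, seq in reads:
        families[(umi, chrom, pos)].append(seq)
    return {key: consensus(seqs, min_fraction) for key, seqs in families.items()}


# Toy input: three duplicates of one molecule plus one read from another molecule.
reads = [
    ("ACGT", "chr1", 100, "TTGACA"),
    ("ACGT", "chr1", 100, "TTGACA"),
    ("ACGT", "chr1", 100, "TTGCCA"),  # disagrees at the 4th base (likely an error)
    ("GGCA", "chr1", 100, "TTGTCA"),  # same position, different UMI: kept separate
]
families = collapse(reads)
for key, seq in sorted(families.items()):
    print(key, seq)
```

In the toy input, the single discordant base in the first family is outvoted, while the read with a different UMI at the same position stays a separate molecule, which is the behaviour a position-only deduplicator would not give.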

deduplication UMI DNA-seq • 3.8k views

updated 4.7 years ago by finswimmer 16k • written 4.7 years ago by graeme.thorn ▴ 100

I can't give an answer, but can you split the FASTQ on the UMIs? If you can, you could maybe do something with a clustering or de novo assembly tool.

Some clustering tools have a consensus option: https://drive5.com/usearch/manual/output_files.html
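Splitting the FASTQ on the UMIs, as suggested in this comment, could look roughly like the Python sketch below. The header layout is an assumption: the UMI is taken to be the last underscore-separated field of the read name (`@<read name>_<UMI>`), which is how UMI-tools extract writes it; the original post does not say which layout the data uses, so `umi_from_header` is a hypothetical helper that would need adapting. Grouping by UMI alone ignores mapping position, so groups would still need to be split by position after alignment.

```python
import io
from collections import defaultdict


def read_fastq(handle):
    """Yield (header, sequence, quality) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip()
        if not header:
            break
        seq = handle.readline().rstrip()
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().rstrip()
        yield header, seq, qual


def umi_from_header(header):
    """Assumed layout: '@<read name>_<UMI> <optional comment>'.

    Raises IndexError if the read name contains no underscore.
    """
    read_id = header[1:].split()[0]
    return read_id.rsplit("_", 1)[1]


def group_by_umi(handle):
    """Collect FASTQ records into lists keyed by their UMI."""
    groups = defaultdict(list)
    for header, seq, qual in read_fastq(handle):
        groups[umi_from_header(header)].append((header, seq, qual))
    return groups


# Toy in-memory FASTQ with made-up read names and UMIs.
example = io.StringIO(
    "@r1_ACGT 1:N:0\nTTGACA\n+\nIIIIII\n"
    "@r2_ACGT 1:N:0\nTTGCCA\n+\nIIIIII\n"
    "@r3_GGCA 1:N:0\nTTGTCA\n+\nIIIIII\n"
)
groups = group_by_umi(example)
for umi, records in sorted(groups.items()):
    print(umi, len(records))
```

For real data, the `io.StringIO` object would be replaced by an open file handle (for example from `gzip.open(path, "rt")`), and each group could then be written out or passed to a clustering tool.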

4.7 years ago by gb ★ 2.2k

graeme.thorn: You can just clump the reads together with clumpify, based on how strict you want the sequence identity to be. You don't have to compress or de-dupe them. Depending on how many reads you have, you can then use a pileup or usearch to generate a consensus.

4.7 years ago by GenoMax 141k

score 4 · Accepted Answer · 2019-08-16

4.7 years ago

finswimmer 16k

I've sometimes worked with fgbio in the past. You can see an example of how to use it in the presentation.

I modified the pipeline shown in the presentation to fit my needs. I replaced some tools because I'm not a big fan of the Picard stuff. I also introduced a step that uses bbmerge to do an error correction if reads overlap.

The example workflow looks like this:

4.7 years ago by finswimmer 16k

Of course, I will have to modify it slightly, but this does look like the most consistent way of doing it, particularly given that I need high sensitivity and a low false-positive rate for my purposes.

4.6 years ago by graeme.thorn ▴ 100