Error correction based on PCR duplicates using clumpify.sh
2.3 years ago
lechu ▴ 20

Hi!

I am looking for a tool to error-correct my RNA-seq data using PCR duplicates. I know UMIs would be helpful here, but I already have the data, and UMIs were not included. This question is a bit similar to (Extract consensus sequence reads (collapse PCR duplicates) from bam), but the devil is in the details.

I can only error-correct using read pairs that are duplicates (same start of both mates), so my first question is: would clumpify use only duplicates (identical start positions of both mates) for error correction? It is not entirely clear to me whether the error-correction and advanced error-correction parameters apply to correction within duplicates ONLY, but I was assuming so. Am I wrong here?

Regarding error correction within duplicates (i.e., collapsing duplicates to a consensus sequence): say we have a few read pairs (8 in the example below) with the same start and end positions, which would normally be considered duplicates by Picard MarkDuplicates and similar tools. Since it is not unequivocally true that such reads are actual duplicates, I would first like to define clumps (call them sub-clumps) within groups (clumps) of duplicates. Sub-clumps could be defined based on edit distance. With the duplicate reads in the example, two sub-clumps can be defined with an edit distance >= 2 between them: reads 1, 2, 3, 4, 8 form one sub-clump, and reads 5, 6, 7 form another. Only reads within such sub-clumps would be considered duplicates and collapsed to a majority consensus, so the 8 pairs in the example would be collapsed to two (majority) consensus sequences. There will always be a majority consensus, because otherwise a new sub-clump could always be defined, and in my understanding only unique ("orphan") sequences would be corrected.

ground truth 1 AAATTTGGGAAATTTGGG
ground truth 2 AAGTTTGGGAACTTTGGG
         read1 ------------------
         read2 ---------C--------
         read3 ------------------
         read4 ------------------
         read5 --G--------C------
         read6 --G--------C----A-
         read7 --G--------C------
         read8 ------C-----------
    consensus1 AAATTTGGGAAATTTGGG
    consensus2 AAGTTTGGGAACTTTGGG
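To make the idea above concrete, here is a small sketch of the described "sub-clump" procedure: greedy single-linkage clustering of same-position duplicates by substitution (Hamming) distance, followed by a per-column majority vote within each sub-clump. This is not how clumpify.sh works internally — it is just the algorithm from my description, run on the example reads above (with max_subs=1, so that the two sub-clumps at distance 2 stay separate):

```python
from collections import Counter

def hamming(a, b):
    """Substitution distance between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))

def sub_clumps(reads, max_subs=1):
    """Greedily merge reads into clusters; a read joins a cluster if it is
    within max_subs substitutions of at least one existing member."""
    clusters = []
    for r in reads:
        for c in clusters:
            if any(hamming(r, m) <= max_subs for m in c):
                c.append(r)
                break
        else:
            clusters.append([r])
    return clusters

def majority_consensus(cluster):
    """Per-column majority vote across all reads in one sub-clump."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*cluster))

gt1 = "AAATTTGGGAAATTTGGG"  # ground truth 1
gt2 = "AAGTTTGGGAACTTTGGG"  # ground truth 2
reads = [
    gt1,                   # read1
    "AAATTTGGGCAATTTGGG",  # read2: error at position 9
    gt1,                   # read3
    gt1,                   # read4
    gt2,                   # read5
    "AAGTTTGGGAACTTTGAG",  # read6: error at position 16
    gt2,                   # read7
    "AAATTTCGGAAATTTGGG",  # read8: error at position 6
]

clumps = sub_clumps(reads, max_subs=1)
consensi = [majority_consensus(c) for c in clumps]
print(consensi)  # recovers the two ground-truth sequences
```

Note that the threshold matters: with max_subs=2, read5 would sit within distance 2 of read1 and the two sub-clumps would chain together into one.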

Possibly clumpify already does this and I just find the description confusing. In any case, I would very much appreciate feedback on what parameters to use. I should mention that the goal is to measure the frequencies of random mutations (very rare substitutions) caused by a combination of biochemical treatments.

best wishes, Lech

RNAseq bbmap duplicates clumpify

You can add the count of reads of each type to the read header:

addcount=f          Append the number of copies to the read name

You can set this parameter to true. This is not a consensus per se, since you are simply counting reads of each type.

I have never used clumpify for error correction, so you may need to play with those parameters to see what it does in practice. My hunch is that you are not going to get the two consensus sequences you want in theory.
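For reference, a minimal deduplication run that keeps the copy counts could look like the command below. The dedupe and addcount flag names are from the clumpify.sh help text; the file names are hypothetical, and the command is echoed rather than executed since BBTools may not be on PATH:

```shell
# clumpify.sh ships with BBMap/BBTools.
# dedupe=t removes duplicates; addcount=t appends the copy count to the read name.
cmd="clumpify.sh in=reads.fq.gz out=clumped.fq.gz dedupe=t addcount=t"
echo "$cmd"
```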

You must have already seen @Brian's explanation here: Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files. And remove duplicates.


I've seen Brian's explanation. Maybe the way I put it made it sound more complicated than it is. I was hoping that the consensus option (consensus=f  Output consensus sequence instead of clumps.) combined with subs (subs=2 (s)  Maximum substitutions allowed between duplicates.) would do what I described, subs=2 being the edit distance that I mentioned.

In fact, I would be happy with any type of error correction, as long as it is done independently for every clump of duplicates, and not the error correction done by, e.g., tadpole. I will definitely play around with clumpify, as it seems to do something very close to what I need; I was just hoping that maybe someone already knew something I don't.
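For what it's worth, the invocation I would try first is sketched below. The flag names are taken from the clumpify.sh help text quoted above; whether consensus=t actually performs the per-sub-clump collapse I described needs to be verified on real data. File names are hypothetical, and the command is echoed rather than executed since BBTools may not be installed:

```shell
# Paired-end consensus attempt: deduplicate with up to 2 substitutions
# allowed between duplicates, emitting consensus sequences.
cmd="clumpify.sh in1=sample_R1.fq.gz in2=sample_R2.fq.gz \
out1=consensus_R1.fq.gz out2=consensus_R2.fq.gz \
dedupe=t subs=2 consensus=t"
echo "$cmd"
```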
