Question

Question about deduplication with UMI-tools

0

Entering edit mode

5.8 years ago

GSAENZDEPIP ▴ 30

Hello, I am using UMI-tools for deduplicating some RNA-seq samples. I have high duplication levels due to the protocol used for library generation. However, after using UMI-tools the duplication levels are still higher than 20% in all the samples. Is this normal or I should change some UMI-tools parametres?

Many thanks, Goren

P.D: In this image you can see that, for example, sample "CCI-121" has 96% duplicates before using UMI-tools and 76% after deduplication. I measure duplication levels using Picard.

Script for UMI-tools:

umi_tools dedup -I /input.bam --output-stats=/output/directory/stats -S /output.bam  --buffer-whole-contig

Script for Picard:

picard MarkDuplicates REFERENCE_SEQUENCE=/reference/genome INPUT=/input.bam OUTPUT=output.txt REMOVE_SEQUENCING_DUPLICATES=FALSE ASSUME_SORTED=TRUE METRICS_FILE=output.txt CREATE_INDEX=TRUE

RNA-Seq UMI • 6.6k views

ADD COMMENT • link updated 5.8 years ago by GenoMax 141k • written 5.8 years ago by GSAENZDEPIP ▴ 30

0

Entering edit mode

There seems to be something strange about this or I am not understanding what you are doing here. You used UMI's to label individual RNA's then made the libraries etc. Now after you deduped the data using umitools dedupe (did you do umitools counts also) the actual sequence of the reads is still showing high duplication? As if identical RNA species got labelled with unique UMI's?

ADD REPLY • link 5.8 years ago by GenoMax 141k

0

Entering edit mode

I did not use "UMI-tools count" because I want to use HTSeq for counting. I simply use the output BAM file (from STAR aligner) as the input for "UMI-tools dedup" in order to obtain a BAM file without duplicates. Then I want to use that BAM file without duplicates for counting with HTSeq.

Thank you for your help!

ADD REPLY • link 5.8 years ago by GSAENZDEPIP ▴ 30

0

Entering edit mode

"UMI-tools dedup" in order to obtain a BAM file without duplicates

So that is removal of duplicates for UMI's where the actual sequence of the read has not been taken into account. Correct?

Is the % duplication in the table above referring to sequences? If picard markdups is saying that you still have high sequence duplication then you need to find out why that happened? Was the RNA amplified before UMI's were added?

ADD REPLY • link 5.8 years ago by GenoMax 141k

2

Entering edit mode

This was the answer from Tom Smith (Github):

Hi @gsaenzdepip. Your picard MarkDuplicates command ignores UMIs. Thus, two reads with the same alignment coordinates with be considered duplicates even if they have different UMIs. The duplication rates you have obtained are therefore over-estimates. There's actually no need to quantify duplication levels post umi_tools dedup since this tool is explicitly designed to remove all duplicate reads. If you want to determine the duplication rate according to umi_tools, you can use

1 - (# reads post-UMI-tools / # reads pre)

The right hand side of the above is what you have in parentheses in the "No reads (%) After deduplication" column. For example, your CC-121 sample has duplication rate of:

1 - (1309295 / 9547864) = 0.86 = 86%.

Your results actually look positive to me. Had you used a library preparation without UMIs, you would have an apparent duplication rate of 92-96% per sample, thus only 4-8% of your reads would be retained for downstream analyses. Since you used UMIs, it's possible to more accurately estimate the duplication rates (86-91%) and 9-14% of your reads can be retained.

P.s If you want to compare the UMI-tools duplication rate estimate with picard and your BAM contains the UMI in a read tag, you can provide the BARCODE_TAG=[tag] option to get picard's estimate, taking into account UMIs. Note, it's not possible to accurately estimate the duplicate rate in the deduplicated BAM with any tool since we have removed reads which will invalidate the assumptions underlying any method to estimate duplication rate.

ADD REPLY • link 5.8 years ago by GSAENZDEPIP ▴ 30

0

Entering edit mode

There's actually no need to quantify duplication levels post umi_tools dedup since this tool is explicitly designed to remove all duplicate reads.

and

Since you used UMIs, it's possible to more accurately estimate the duplication rates (86-91%) and 9-14% of your reads can be retained.

Since you have self identified this as a RNAseq dataset consider this. It is not advisable to remove sequence duplicates since you would be messing with distribution of data. Since you used UMI's as an extra insurance against PCR duplicates it would be fine to do umitools dedupe.

Note, it's not possible to accurately estimate the duplicate rate in the deduplicated BAM with any tool since we have removed reads which will invalidate the assumptions underlying any method to estimate duplication rate.

I am not sure I understand this. A tool like clumpify.sh from BBMap suite looks exclusively at sequence to estimate duplication (in PCR dup mode but can also look for optical duplicates based on cluster coordinates, A: Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files ). Again you would not want to do this with an RNAseq dataset.

By default multi-mappers are not counted by htseq-count and featureCounts (correct setting). I have a feeling that you are going to hit that hurdle next.

ADD REPLY • link 5.8 years ago by GenoMax 141k