How to deal with UMIs in exome sequencing

Question

How to take UMIs into consideration in a WES pipeline?

5.9 years ago

jonessara770 ▴ 240

Hi,

I have received WES data prepared with the SureSelectXT HS reagent kit, which adds UMIs. How should I analyse this data? What should I change in my WES pipeline to take the UMIs into account? Is there a tutorial for analysing WES data with UMIs?

Could you please share your experiences?

Thanks, Sara

WES • 3.0k views

Updated 5.9 years ago by i.sudbery 19k • written 5.9 years ago by jonessara770 ▴ 240

TIL: UMIs are "unique molecular identifiers", more widely known as barcodes

Comment • 5.9 years ago by Jeremy Leipzig 22k

score 4 · Answer 1 · 2018-05-30

UMIs are taken into account during the de-duplication stage of your pipeline. Normally during a WES analysis you will collapse all reads that have the same start and end coordinates into a single read. This is on the assumption that reads that start and end at the same location are PCR duplicates.
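To make the conventional (UMI-free) approach concrete, here is a minimal Python sketch of coordinate-based de-duplication. The `Read` tuple and the "keep the highest mapping quality" rule are assumptions made for illustration only; real duplicate-marking tools work on BAM records and use their own selection criteria.

```python
from collections import namedtuple

# Hypothetical, simplified read record (not a real BAM entry).
Read = namedtuple("Read", ["name", "chrom", "start", "end", "mapq"])

def dedup_by_coordinates(reads):
    """Keep one read per (chrom, start, end); here, the one with the highest MAPQ."""
    best = {}
    for read in reads:
        key = (read.chrom, read.start, read.end)
        if key not in best or read.mapq > best[key].mapq:
            best[key] = read
    return list(best.values())

reads = [
    Read("r1", "chr1", 100, 250, 60),
    Read("r2", "chr1", 100, 250, 40),  # same coordinates as r1: treated as a duplicate
    Read("r3", "chr1", 100, 260, 60),  # different end: kept
]
kept = dedup_by_coordinates(reads)
print([r.name for r in kept])  # ['r1', 'r3']
```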

However, reads that start and end at the same location may instead come from genuinely independent molecules. We can distinguish the two possibilities by adding a random barcode (the UMI) to each molecule before PCR: PCR duplicates inherit the same UMI, whereas independent molecules will almost always carry different ones.

There are basically two ways to deal with this. The first is to mark any two reads that start at the same position and have the same UMI as duplicates of each other, and to keep only one of them according to some criterion.
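As a rough sketch of this first approach, the only change from coordinate-based de-duplication is that the UMI becomes part of the grouping key. The `UmiRead` tuple and the MAPQ-based choice of survivor are illustrative assumptions, and this toy version ignores UMI sequencing errors.

```python
from collections import namedtuple

# Hypothetical, simplified read record carrying its UMI.
UmiRead = namedtuple("UmiRead", ["name", "chrom", "start", "umi", "mapq"])

def dedup_by_position_and_umi(reads):
    """Keep one read per (chrom, start, UMI); here, the one with the highest MAPQ."""
    best = {}
    for read in reads:
        key = (read.chrom, read.start, read.umi)
        if key not in best or read.mapq > best[key].mapq:
            best[key] = read
    return list(best.values())

umi_reads = [
    UmiRead("r1", "chr1", 100, "ACGT", 60),
    UmiRead("r2", "chr1", 100, "ACGT", 40),  # same position and UMI: PCR duplicate of r1
    UmiRead("r3", "chr1", 100, "TTGA", 60),  # same position, different UMI: independent molecule
]
umi_kept = dedup_by_position_and_umi(umi_reads)
print([r.name for r in umi_kept])  # ['r1', 'r3']
```

Note that r3 would have been discarded by a purely coordinate-based approach.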

The second is to group together all reads that have the same mapping position and UMI and "average" them to find the consensus sequence of the reads.
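The "averaging" can be pictured as a per-base majority vote within each (position, UMI) family. The sketch below assumes all reads in a family have the same length and align without indels, and it ignores base qualities; dedicated consensus callers are considerably more careful than this.

```python
from collections import Counter, defaultdict

def consensus(seqs):
    """Per-column majority vote over equal-length sequences; ties become 'N'."""
    out = []
    for column in zip(*seqs):
        counts = Counter(column).most_common()
        if len(counts) > 1 and counts[0][1] == counts[1][1]:
            out.append("N")  # no clear majority at this position
        else:
            out.append(counts[0][0])
    return "".join(out)

def consensus_per_group(reads):
    """Group (position, umi, sequence) tuples by (position, umi) and call a consensus."""
    groups = defaultdict(list)
    for pos, umi, seq in reads:
        groups[(pos, umi)].append(seq)
    return {key: consensus(seqs) for key, seqs in groups.items()}

family_reads = [
    (100, "ACGT", "GATTACA"),
    (100, "ACGT", "GATTACA"),
    (100, "ACGT", "GATCACA"),  # one read carries a likely PCR/sequencing error
    (100, "TTGA", "GATTGCA"),  # a different molecule at the same position
]
result = consensus_per_group(family_reads)
print(result)  # {(100, 'ACGT'): 'GATTACA', (100, 'TTGA'): 'GATTGCA'}
```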

Our software, UMI-Tools, will do the first for you and help with the second. In all modes it starts by grouping together all reads that have the same start co-ordinates, mate inner distance and UMI sequence. It then further groups these where it believes that one UMI could have arisen from another as a PCR or sequencing error and determines which UMI represents the "parent" sequence.
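To give a feel for the error-aware grouping step, here is a small sketch for the UMIs observed at a single mapping position. It is loosely modelled on a "directional" rule: a less abundant UMI is absorbed into a more abundant one if they differ by at most one base and the abundant one has at least 2n - 1 reads, where n is the count of the rarer UMI. Treat the exact rule and all names here as assumptions of this sketch, not as the tool's actual code.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length UMIs."""
    return sum(x != y for x, y in zip(a, b))

def group_umis(counts, max_dist=1):
    """Return {parent_umi: [member_umis]} for the UMI counts at one position."""
    # Visit UMIs from most to least abundant (ties broken alphabetically).
    ordered = sorted(counts, key=lambda u: (-counts[u], u))
    assigned = set()
    groups = {}
    for parent in ordered:
        if parent in assigned:
            continue
        assigned.add(parent)
        members = [parent]
        frontier = [parent]
        while frontier:
            current = frontier.pop()
            for other in ordered:
                if other in assigned:
                    continue
                close = hamming(current, other) <= max_dist
                dominant = counts[current] >= 2 * counts[other] - 1
                if close and dominant:
                    assigned.add(other)
                    members.append(other)
                    frontier.append(other)
        groups[parent] = members
    return groups

umi_counts = {"ACGT": 100, "ACGA": 3, "TCGA": 1, "GGGG": 50, "GGGC": 45}
umi_groups = group_umis(umi_counts)
print(umi_groups)
# {'ACGT': ['ACGT', 'ACGA', 'TCGA'], 'GGGG': ['GGGG'], 'GGGC': ['GGGC']}
```

In this toy example ACGA and TCGA are folded into ACGT as probable errors, while GGGG and GGGC stay separate because their counts are too similar for one to plausibly be an error copy of the other.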

In dedup mode it will then select one read from each group as the representative, based on various metrics and a good dose of randomness, and output that read and its mate.

In group mode it assigns a unique group code to each group of reads, adding it as a new tag on the BAM entry for each read. You could then take all the reads assigned to the same group and determine the consensus sequence. I believe others have done this before.
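For orientation, the two modes might be invoked along these lines. This Python sketch only builds and prints the command lines rather than running them; the file names are placeholders, the helper function is hypothetical, and it assumes a coordinate-sorted, indexed BAM in which the tool can find the UMI for each read (by default it looks in the read name). Check `umi_tools dedup --help` and `umi_tools group --help` for the options that fit your data.

```python
def umi_tools_command(mode, in_bam, out_bam, paired=True, group_table=None):
    """Build (but do not run) a umi_tools command line as a list of arguments."""
    if mode not in ("dedup", "group"):
        raise ValueError("mode must be 'dedup' or 'group'")
    cmd = ["umi_tools", mode, "-I", in_bam, "-S", out_bam]
    if paired:
        cmd.append("--paired")  # paired-end data: also output the mate
    if mode == "group":
        cmd.append("--output-bam")  # write a BAM with the group tags added
        if group_table:
            cmd.append(f"--group-out={group_table}")  # optional per-read table
    return cmd

# Placeholder file names, for illustration only.
dedup_cmd = umi_tools_command("dedup", "mapped.bam", "deduplicated.bam")
group_cmd = umi_tools_command("group", "mapped.bam", "grouped.bam",
                              group_table="groups.tsv")
print(" ".join(dedup_cmd))
print(" ".join(group_cmd))
```

The resulting lists could be passed to `subprocess.run` once the paths point at real files.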