UMIs are taken into account during the de-duplication stage of your pipeline. Normally during a WES analysis you will collapse all reads that have the same start and end coordinates into a single read. This is on the assumption that reads that start and end at the same location are PCR duplicates.
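However, it is possible that reads that start and end at the same location represent genuinely independent biological entities. We can distinguish the two possibilities by adding a random barcode (the UMI) to each molecule before PCR. PCR copies of one molecule then share the same UMI, whereas independent molecules will almost always carry different ones.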
How to deal with UMIs in exome sequencing
There are basically two ways you could deal with this. The first is to mark any two reads that start at the same position and have the same UMI as duplicates of each other, and keep only one of them based on some criterion.
The second is to group together all reads that have the same mapping position and UMI and "average" them to find the consensus sequence of the reads.
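To make the two options concrete, here is a rough Python sketch. It is only an illustration: the read records are plain dicts with made-up keys (`name`, `chrom`, `start`, `end`, `umi`, `mapq`, `seq`), the "keep the highest mapping quality" rule is just one possible criterion, and the majority-vote consensus assumes all reads in a family have the same length.

```python
from collections import Counter, defaultdict

def families(reads):
    """Group reads that share mapping coordinates and UMI."""
    fam = defaultdict(list)
    for read in reads:
        key = (read["chrom"], read["start"], read["end"], read["umi"])
        fam[key].append(read)
    return fam

def pick_representative(family):
    """Option 1: keep a single read (assumed criterion: highest MAPQ)."""
    return max(family, key=lambda r: r["mapq"])

def consensus(sequences):
    """Option 2: per-position majority vote over equal-length sequences."""
    if not sequences:
        raise ValueError("need at least one sequence")
    if len({len(s) for s in sequences}) != 1:
        raise ValueError("sequences must have equal length")
    called = []
    for column in zip(*sequences):
        base, count = Counter(column).most_common(1)[0]
        # Call N when no base has a strict majority in this column.
        called.append(base if 2 * count > len(column) else "N")
    return "".join(called)

# Toy data: three reads from one family, one read with a different UMI.
reads = [
    {"name": "r1", "chrom": "chr1", "start": 100, "end": 250,
     "umi": "ACGT", "mapq": 60, "seq": "ACGTAC"},
    {"name": "r2", "chrom": "chr1", "start": 100, "end": 250,
     "umi": "ACGT", "mapq": 40, "seq": "ACCTAC"},
    {"name": "r3", "chrom": "chr1", "start": 100, "end": 250,
     "umi": "ACGT", "mapq": 50, "seq": "ACGTAC"},
    {"name": "r4", "chrom": "chr1", "start": 100, "end": 250,
     "umi": "TTGA", "mapq": 60, "seq": "ACGTAC"},
]

for key, family in families(reads).items():
    rep = pick_representative(family)
    cons = consensus([r["seq"] for r in family])
    print(key, rep["name"], cons)
```

Coordinate-only de-duplication would collapse all four reads into one. With the UMI in the key, r4 survives as a separate molecule.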
Our software, UMI-Tools, will do the first for you and help with the second. In all modes it starts by grouping together all reads that have the same start co-ordinates, mate inner distance and UMI sequence. It then further groups these where it believes that one UMI could have arisen from another as a PCR or sequencing error and determines which UMI represents the "parent" sequence.
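The error-aware grouping step can be sketched roughly as follows. This is a simplified illustration and not the tool's actual code: the function names are made up, and the linking rule (one mismatch apart, with the more abundant UMI having at least `2 * n - 1` reads, where `n` is the count of the rarer UMI) is my assumption of a reasonable "could have arisen by error" test. The input is the read count per UMI at a single mapping position.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length UMIs."""
    return sum(x != y for x, y in zip(a, b))

def group_umis(counts, threshold=1):
    """Cluster UMIs seen at one mapping position.

    counts: dict mapping UMI sequence -> number of reads.
    Returns a dict mapping each "parent" UMI -> list of UMIs in its group.
    """
    # Visit abundant UMIs first so that they become the parents.
    ordered = sorted(counts, key=lambda u: (-counts[u], u))
    assigned = set()
    groups = {}
    for parent in ordered:
        if parent in assigned:
            continue
        assigned.add(parent)
        members = [parent]
        queue = [parent]
        while queue:
            node = queue.pop()
            for other in ordered:
                if other in assigned:
                    continue
                # Assumed rule: close in sequence and much rarer than `node`.
                if (hamming(node, other) <= threshold
                        and counts[node] >= 2 * counts[other] - 1):
                    assigned.add(other)
                    members.append(other)
                    queue.append(other)
        groups[parent] = members
    return groups

counts = {"ACGT": 100, "ACGA": 3, "TTTT": 50, "TTTA": 40}
groups = group_umis(counts)
print(groups)
```

Here `ACGA` is absorbed into `ACGT` because it is one mismatch away and far rarer, while `TTTT` and `TTTA` stay separate because their counts are too similar for one to be a plausible error copy of the other.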
In dedup mode, it then selects one read from each group as the representative, based on various metrics and a good dose of randomness, and outputs that read and its mate.
In group mode, it assigns a unique group code to each group of reads, adding it as a new tag on the BAM entry for each read. You could then take all the reads assigned to the same group and determine their consensus sequence. I believe others have done this before.
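Once every read carries a group tag, collecting the reads for each group is straightforward. A minimal sketch, assuming the reads have already been parsed into dicts with a `tags` mapping and a `seq` string; the tag name `UG` used here is an assumption on my part, so check the output of your own run for the tag actually written:

```python
from collections import defaultdict

def sequences_by_group(reads, tag="UG"):
    """Bucket read sequences by the value of their group tag.

    reads: iterable of dicts with hypothetical keys "seq" and "tags".
    tag:   name of the group tag (assumed, not verified, to be "UG").
    """
    buckets = defaultdict(list)
    for read in reads:
        buckets[read["tags"][tag]].append(read["seq"])
    return dict(buckets)

tagged = [
    {"seq": "ACGTAC", "tags": {"UG": 0}},
    {"seq": "ACCTAC", "tags": {"UG": 0}},
    {"seq": "GGGTAC", "tags": {"UG": 1}},
]
by_group = sequences_by_group(tagged)
print(by_group)
```

Each bucket can then be handed to whatever consensus caller you prefer.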