Question: How to take UMIs into consideration in a WES pipeline?
jonessara770 wrote, 10 months ago:

Hi

I have received WES data prepared with the SureSelectXT HS reagent kit, which includes UMIs. How should I analyze this data? What should I change in my WES pipeline to take the UMIs into account? Is there a tutorial for analyzing WES data with UMIs?

Could you please share your experiences?

Thanks, Sara


TIL: UMIs are "unique molecular identifiers", more widely known as barcodes

Comment written 10 months ago by Jeremy Leipzig
i.sudbery (Sheffield, UK) wrote, 10 months ago:

UMIs are taken into account during the de-duplication stage of your pipeline. Normally during a WES analysis you will collapse all reads that have the same start and end coordinates into a single read. This is on the assumption that reads that start and end at the same location are PCR duplicates.

However, it is possible that reads that start and end at the same location represent genuinely independent biological molecules. We can distinguish the two possibilities by adding a random barcode (the UMI) to each molecule before PCR: PCR duplicates share a UMI, whereas independent molecules will almost always carry different ones.
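To make the difference concrete, here is a minimal sketch that counts distinct molecules with and without the UMI in the duplicate key. The read tuples and the `count_molecules` helper are made up for illustration and are not part of any real tool.

```python
# Hypothetical reads: (chrom, start, end, umi). Values are invented for illustration.
reads = [
    ("chr1", 100, 250, "ACGT"),
    ("chr1", 100, 250, "ACGT"),  # same position and UMI: likely a PCR duplicate
    ("chr1", 100, 250, "TTAG"),  # same position, different UMI: independent molecule
    ("chr1", 300, 450, "ACGT"),
]


def count_molecules(reads, use_umi):
    """Count distinct molecules, keyed by position alone or by position plus UMI."""
    keys = set()
    for chrom, start, end, umi in reads:
        key = (chrom, start, end, umi) if use_umi else (chrom, start, end)
        keys.add(key)
    return len(keys)


position_only = count_molecules(reads, use_umi=False)
with_umi = count_molecules(reads, use_umi=True)
print(position_only, with_umi)
```

With position alone, the third read would be discarded as a duplicate; with the UMI in the key it is kept as a separate molecule.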

How to deal with UMIs in exome sequencing

There are basically two ways in which you could deal with this. The first is to mark any two reads that start at the same position and have the same UMI as duplicates of each other, and to keep only one of them according to some criterion.

The second is to group together all reads that have the same mapping position and UMI and "average" them to find the consensus sequence of the reads.
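As a rough illustration of the "averaging" idea, the sketch below takes a per-position majority vote over reads from one group. It assumes the reads are already aligned to the same coordinates and have equal length, and it ignores base qualities, which real consensus callers would normally use. The `consensus` function is a hypothetical helper, not part of UMI-Tools.

```python
from collections import Counter


def consensus(sequences):
    """Per-position majority vote over equal-length sequences; ties become 'N'."""
    if not sequences:
        raise ValueError("need at least one sequence")
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("sequences must have equal length")
    result = []
    for column in zip(*sequences):
        ranked = Counter(column).most_common()
        # Call N when the two most frequent bases are equally common.
        if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
            result.append("N")
        else:
            result.append(ranked[0][0])
    return "".join(result)


# Three reads from one hypothetical UMI group; the third has a single error.
group = ["ACGTAC", "ACGTAC", "ACCTAC"]
print(consensus(group))
```

A single read with a PCR or sequencing error is outvoted by the other members of its group, which is the point of building a consensus rather than picking one read.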

Our software, UMI-Tools, will do the first for you and help with the second. In all modes it starts by grouping together all reads that have the same start co-ordinates, mate inner distance and UMI sequence. It then further groups these where it believes that one UMI could have arisen from another as a PCR or sequencing error and determines which UMI represents the "parent" sequence.
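The error-aware grouping step can be sketched roughly as follows. This is a simplified illustration, not the UMI-Tools implementation: it assumes all UMIs come from reads at one mapping position, that they have equal length, and that a UMI is absorbed into a more abundant neighbour when they differ by at most one base and the neighbour's count is at least `2 * count - 1`. That threshold is an assumption made for this example, and `cluster_umis` is a hypothetical name.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def cluster_umis(umi_counts, max_dist=1):
    """Group UMIs seen at one position; returns {parent_umi: [member_umis]}."""
    # Visit UMIs from most to least abundant so parents are the common ones.
    ordered = sorted(umi_counts, key=lambda u: (-umi_counts[u], u))
    assigned = set()
    clusters = {}
    for parent in ordered:
        if parent in assigned:
            continue
        assigned.add(parent)
        members = [parent]
        frontier = [parent]
        while frontier:
            node = frontier.pop()
            for other in ordered:
                if other in assigned:
                    continue
                close = hamming(node, other) <= max_dist
                # Assumed rule: the child must be much rarer than the node it joins.
                rarer = umi_counts[node] >= 2 * umi_counts[other] - 1
                if close and rarer:
                    assigned.add(other)
                    members.append(other)
                    frontier.append(other)
        clusters[parent] = members
    return clusters


# Hypothetical UMI counts at a single mapping position.
counts = {"ACGT": 10, "ACGA": 2, "TCGA": 1, "GGCC": 5}
clusters = cluster_umis(counts)
print(clusters)
```

Here `ACGA` and `TCGA` are treated as errors derived from the abundant `ACGT`, while `GGCC` is too different and stays a separate molecule.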

In dedup mode it will then select one read from each group as the representative, based on various metrics and a good dose of randomness, and output that read and its mate.
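A toy version of that selection step might look like the sketch below. It assumes mapping quality is the only metric and breaks ties at random; the metrics UMI-Tools actually uses are not spelled out here, so treat the `pick_representative` helper and the read dictionaries as purely illustrative.

```python
import random


def pick_representative(group, rng):
    """Return a read with the highest mapping quality, breaking ties at random."""
    best = max(read["mapq"] for read in group)
    candidates = [read for read in group if read["mapq"] == best]
    return rng.choice(candidates)


# Hypothetical reads belonging to one position/UMI group.
group = [
    {"name": "r1", "mapq": 60},
    {"name": "r2", "mapq": 37},
    {"name": "r3", "mapq": 60},
]
# A seeded generator keeps the random tie-break reproducible between runs.
chosen = pick_representative(group, random.Random(0))
print(chosen["name"])
```

Passing in a seeded `random.Random` instance is a design choice for reproducibility: the same input and seed always give the same representative.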

In group mode it assigns a unique group code to each group of reads, adding it as a new tag on the BAM entry for each read. You could then take all the reads assigned to the same group and determine the consensus sequence. I believe others have done this before.
