Consensus deduplication issues with non-matching/'slippy' primers?
1
0
Entering edit mode
5 months ago
graeme.thorn ▴ 60

I have an issue with some mapped sequencing with 9-base UMIs (the UMI is at the end of the read ID). A lot of the time, the individual molecules are (consensus) deduplicated correctly, but where there are very slight location differences between two or more sequences of the same molecule, such as a slightly shorter read,

NB500905:50:HVY23AFXY:2:21206:18601:14328:AAAAAAAAA 97  chr4    76426801    60  66M =   76426799    -68 TCAGTTTCCCGAGTAGCTGGGACTACAGGTGCCCGCCACCATGCCCGGCTAATTTTTTTTTTTTTT  AAAEEEEEEEEEEEEEEEEEE6EEEEEEEEEEEEEEE6EEEEEEEEEEEEEEEEEEEEEEEEEEEE  AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:66 YS:i:0  YT:Z:DP RG:Z:10-B   NH:i:1  
NB500905:50:HVY23AFXY:2:21206:18601:14328:AAAAAAAAA 145 chr4    76426799    60  57M =   76426801    68  CCTCAGTTTCCCGAGTAGCTGGGACTACAGGTGCCCGCCACCATGCCCGGCTAATTT   AE/EAE/E/EEEE/EA/EE/EEEAE/EEAAE/EEEEE/E//EE/EAEEAEEEEEEEE   AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:57 YS:i:0  YT:Z:DP RG:Z:10-B   NH:i:1  
NB500905:50:HVY23AFXY:4:21412:17448:14097:AAAAAAAAA 97  chr4    76426801    60  67M =   76426799    0   TCAGTTTCCCGAGTAGCTGGGACTACAGGTGCCCGCCACCATGCCCGGCTAATTTTTTTTTTTTTTT AAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEA AS:i:-5 ZS:i:-10    XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:1  MD:Z:66A0   YT:Z:UP RG:Z:10-B   NH:i:1  
NB500905:50:HVY23AFXY:4:21412:17448:14097:AAAAAAAAA 145 chr4    76426799    60  58M =   76426801    0   CCTCAGTTTCCCGAGTAGCTGGGACTACAGGTGCCCGCCACCATGCCCGGCTAATTTT  6E/////A/6EE//////A//E////A///E//EEE/EE//E//6EEE/E//AEEEEE  AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:58 YT:Z:UP RG:Z:10-B   NH:i:1  

the molecule is being retained, duplicated in the deduplicated BAM files. So my first question is this, is there an algorithm to correctly deduplicate this molecule and return one consensus sequence rather than two?

Secondly, it appears that other sequencing issues with the first read in a pair has resulted in the first read being different, with the second being the same, such as in this example (12 pairs of what is presumably the same molecule - the 9-bp UMI and the locations of the second read are very similar):

NB500905:50:HVY23AFXY:1:11204:12926:7292:AAAAAAAAA  83  chr17   39699570    60  69M =   39699476    -163    GGTCCTGGAAGCCACAAGGTAAACACAACACATCCCCCTCCTTGACTATCAATTTTACTAGAGGATGTG   EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEAAA   AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:69 YS:i:0  YT:Z:CP RG:Z:10-B   NH:i:1  
NB500905:50:HVY23AFXY:1:11204:12926:7292:AAAAAAAAA  163 chr17   39699476    60  59M =   39699570    163 TCCTTTCGATGTGACTGTCTCCTCCCAAATTTGTAGACCCTCTTAAGATCATGCTTTTC EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE/EEEEEEEEEE/EE6EEEEEEEE AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:59 YS:i:0  YT:Z:CP RG:Z:10-B   NH:i:1  
NB500905:50:HVY23AFXY:1:11210:13179:6528:AAAAAAAAA  83  chr17   39699492    60  67M =   39699473    -86 GTCTCCTCCCAAATTTGTAGACCCTCTTAAGATCATGCTTTTCAGATACTTCAAAGATTCCAGAAGA EEEEEEEEAEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAA AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:67 YS:i:0  YT:Z:CP RG:Z:10-B   NH:i:1  
NB500905:50:HVY23AFXY:1:11210:13179:6528:AAAAAAAAA  163 chr17   39699473    60  60M =   39699492    86  AAGTCCTTTCGATGTGACTGTCTCCTCCCAAATTTGTAGACCCTCTTAAGATCATGCTTT    E66/A/EE6//6E6E/A/EA///E//EE/EEAEEE/E///EA///EEEE/E//<A</EEE    AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:60 YS:i:0  YT:Z:CP RG:Z:10-B   NH:i:1  
NB500905:50:HVY23AFXY:2:11111:19485:18118:AAAAAAAAA 83  chr17   39699525    60  69M =   39699474    -120    CATGCTTTTCAGATACTTCAAAGATTCCAGAAGATATGCCCCGGGGGTCCTGGAAGCCACAAGGTAAAC   EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAA   AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:69 YS:i:-3 YT:Z:CP RG:Z:10-B   NH:i:1  
NB500905:50:HVY23AFXY:2:11111:19485:18118:AAAAAAAAA 163 chr17   39699474    60  60M =   39699525    120 AGTCCTTTCGATGTGACTGTCTCCTCCCAAATTTGTAGACCCTCTTAAGATCATGCTTTT    EE/EAEEEEEEEEAE6/EE/E/EE/EEEEEEEEEEEEEAEEE/EEEEEEEEEAAE/EEEE    AS:i:-3 XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:0  MD:Z:38A21  YS:i:0  YT:Z:CP RG:Z:10-B   NH:i:1  
NB500905:50:HVY23AFXY:2:11211:18464:14540:AAAAAAAAA 83  chr17   39699484    60  69M =   39699477    -76 ATGTGACTGTCTCCTCCCAAATTTGTAGACCCTCTTAAGATCATGCTTTTCAGATACTTCAAAGATTCC   EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAA   AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:69 YS:i:0  YT:Z:CP RG:Z:10-B   NH:i:1  
NB500905:50:HVY23AFXY:2:11211:18464:14540:AAAAAAAAA 163 chr17   39699477    60  58M =   39699484    76  CCTTTCGATGTGACTGTCTCCTCCCAAATTTGTAGACCCTCTTAAGATCATGCTTTTC  EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEE  AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:58 YS:i:0  YT:Z:CP RG:Z:10-B   NH:i:1  
NB500905:50:HVY23AFXY:2:21111:8435:13671:AAAAAAAAA  83  chr17   39699534    60  68M =   39699476    -126    CAGATACTTCAAAGATTCCAGAAGATATGCCCCGGGGGTCCTGGAAGCCACAAGGTAAACACAACACA    EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAA    AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:68 YS:i:0  YT:Z:CP RG:Z:10-B   NH:i:1  
NB500905:50:HVY23AFXY:2:21111:8435:13671:AAAAAAAAA  163 chr17   39699476    60  59M =   39699534    126 TCCTTTCGATGTGACTGTCTCCTCCCAAATTTGTAGACCCTCTTAAGATCATGCTTTTC EEEEEEEEEEEEEAEAEEEEEEEEEEEEEEEEEEEE/EEEEEEEEEEE/EEEEEEEEEA AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:59 YS:i:0  YT:Z:CP RG:Z:10-B   NH:i:1  
NB500905:50:HVY23AFXY:2:21112:12742:5719:AAAAAAAAA  83  chr17   39699531    60  67M =   39699476    -122    TTTCAGATACTTCAAAGATTCCAGAAGATATGCCCCGGGGGTCCTGGAAGCCACAAGGTAAACACAA EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEAAA AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:67 YS:i:0  YT:Z:CP RG:Z:10-B   NH:i:1  
NB500905:50:HVY23AFXY:2:21112:12742:5719:AAAAAAAAA  163 chr17   39699476    60  59M =   39699531    122 TCCTTTCGATGTGACTGTCTCCTCCCAAATTTGTAGACCCTCTTAAGATCATGCTTTTC EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEE AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:59 YS:i:0  YT:Z:CP RG:Z:10-B   NH:i:1  
NB500905:50:HVY23AFXY:2:21308:6819:11408:AAAAAAAAA  83  chr17   39699519    60  67M =   39699477    -109    TAAGATCATGCTTTTCAGATACTTCAAAGATTCCAGAAGATATGCCCCGGGGGTCCTGGAAGCCACA EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAA AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:67 YS:i:0  YT:Z:CP RG:Z:10-B   NH:i:1  
NB500905:50:HVY23AFXY:2:21308:6819:11408:AAAAAAAAA  163 chr17   39699477    60  58M =   39699519    109 CCTTTCGATGTGACTGTCTCCTCCCAAATTTGTAGACCCTCTTAAGATCATGCTTTTC  EA/<EAAE//EE/EE/EEEAE<EAAE/EAE</EEEEE/EEEEEEEAEE/A<<EEEAEE  AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:58 YS:i:0  YT:Z:CP RG:Z:10-B   NH:i:1  
NB500905:50:HVY23AFXY:2:21309:11545:15754:AAAAAAAAA 83  chr17   39699486    60  69M =   39699477    -78 GTGACTGTCTCCTCCCAAATTTGTAGACCCTCTTAAGATCATGCTTTTCAGATACTTCAAAGATTCCAG   EEEAEE6/EE/EEA/A/EEE/EEEEEEAEEE/EAEEEEEAEE/EEEE6EEEE/EEEEEEEAEEE/AAAA   AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:69 YS:i:0  YT:Z:CP RG:Z:10-B   NH:i:1  
NB500905:50:HVY23AFXY:2:21309:11545:15754:AAAAAAAAA 163 chr17   39699477    60  58M =   39699486    78  CCTTTCGATGTGACTGTCTCCTCCCAAATTTGTAGACCCTCTTAAGATCATGCTTTTC  //E/EEAEEEEE6EE/EAE////EEEE/EAEEEAEA///<E/<EE/EEEEEE/A6AA6  AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:58 YS:i:0  YT:Z:CP RG:Z:10-B   NH:i:1  
NB500905:50:HVY23AFXY:3:11405:26640:4281:AAAAAAAAA  83  chr17   39699480    60  68M =   39699476    -72 TTCGATGTGACTGTCTCCTCCCAAATTTGTAGACCCTCTTAAGATCATGCTTTTCAGATACTTCAAAG    EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAA    AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:68 YS:i:0  YT:Z:CP RG:Z:10-B   NH:i:1  
NB500905:50:HVY23AFXY:3:11405:26640:4281:AAAAAAAAA  163 chr17   39699476    60  59M =   39699480    72  TCCTTTCGATGTGACTGTCTCCTCCCAAATTTGTAGACCCTCTTAAGATCATGCTTTTC EEEEEEEEEEEEE/EEEEEEEEEEEEEEEEEEEEE//EEEEEEAEEEEEAEEEEEEEAA AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:59 YS:i:0  YT:Z:CP RG:Z:10-B   NH:i:1  
NB500905:50:HVY23AFXY:3:11406:13328:7002:AAAAAAAAA  83  chr17   39699533    60  68M =   39699477    -124    TCAGATACTTCAAAGATTCCAGAAGATATGCCCCGGGGGTCCTGGAAGCCAAAAGGTAAACACAACAC    EEEEEEE/AE</EEEAAEEEAEEAAEE/EEE/EEEEEEEEEEEAEE//EEEEEEAEEEEEEE/AE6AA    AS:i:-5 XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:1  MD:Z:51C16  YS:i:-4 YT:Z:CP RG:Z:10-B   NH:i:1  
NB500905:50:HVY23AFXY:3:11406:13328:7002:AAAAAAAAA  163 chr17   39699477    60  58M =   39699533    124 CCTTTCGATGTGACTGTCTCCTCCCAAAATTGTAGACCCTCTTAAGATCATGCTTTTC  E/EE/EEEEEEEEEEEEEEEEEEEEEEE6EEEEEAEE/A6EEEEEEEEEEEEAEEAEE  AS:i:-4 XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:1  MD:Z:28T29  YS:i:-5 YT:Z:CP RG:Z:10-B   NH:i:1  
NB500905:50:HVY23AFXY:3:21610:12320:10534:AAAAAAAAA 83  chr17   39699525    60  69M =   39699476    -118    CATGCTTTTCAGATACTTCAAAGATTCCAGAAGATATGCCCCGGGGGTCCTGGAAGCCACAAGGTAAAC   EEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAA   AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:69 YS:i:0  YT:Z:CP RG:Z:10-B   NH:i:1  
NB500905:50:HVY23AFXY:3:21610:12320:10534:AAAAAAAAA 163 chr17   39699476    60  59M =   39699525    118 TCCTTTCGATGTGACTGTCTCCTCCCAAATTTGTAGACCCTCTTAAGATCATGCTTTTC EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEAEAEEEEEEEEEEEE AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:59 YS:i:0  YT:Z:CP RG:Z:10-B   NH:i:1  
NB500905:50:HVY23AFXY:4:21402:24172:19713:AAAAAAAAA 83  chr17   39699535    60  69M =   39699474    -130    AGATACTTCAAAGATTCCAGAAGATATGCCCCGGGGGTCCTGGAAGCCACAAGGTAAACACAACACATC   /EEEEEAA/AAEE/AEEEAEA//EEEE<AEEAEE</EEEEAE<E/EAEEAEE/E6E/EAEEEE/EEAAA   AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:69 YS:i:-6 YT:Z:CP RG:Z:10-B   NH:i:1  
NB500905:50:HVY23AFXY:4:21402:24172:19713:AAAAAAAAA 163 chr17   39699474    60  60M =   39699535    130 AGTCCTTTCGATGTGACTGTCTCCTCCCAAATTTGTAGACCCTCTTAAGATCATGCTTTT    E6/EAEE////EE///A///A/EE////EE/E6</<A/AE//!//<E66E/!E///6/E/    AS:i:-6 XN:i:0  XM:i:2  XO:i:0  XG:i:0  NM:i:0  MD:Z:42T8C8 YS:i:0  YT:Z:CP RG:Z:10-B   NH:i:1  

The obvious non-deduplication works to bias anything downstream, such as variant calling. This is a more difficult molecule to consensus deduplicate, but a method which would work with this case would work in the first case.

To check that this is not a mapping issue, I have blatted the two sets of sequences on the genome browser, and it is very clear that one of the two reads in each pair is effectively fixed, with the other being started at different positions on the molecule (my seuqencing collaborator suggests that the Illumina primer for read 1 did not bind effectively, and that it is 'slippy' in some sense):

BLAT of first set of non-dedup reads

BLAT of 12 pairs of non-dedup reads from same molecule

So my question is, is there a robust consensus deduplication method that can deal with both cases, and derive a single pair of consensus reads from each set?

DNA-seq deduplication • 390 views
ADD COMMENT
0
Entering edit mode
5 months ago

The following may or may not be useful.

UMI-tools only uses the outside coordinates of an alignment pair, not the both the start and end coordinates. This might help with your first problem.

UMI-tools can be told to ignore read2 when doing deduplication, and only focus on read1 (--ignore-tlen). At the moment your problem is that you want to focus only on read2. The easy way to sort that out, would be to map the reads the other way around. This might help with your second problem.

Unfortunately, UMI-tools does not have a facility for calculating read consensus sequences. It can either deduplicate - keeping only a single read, selected according to a set of fairly basic heuristics, or it can group - assigning an identifier to reads that are duplicates of each other.

If you want a consensus via umi-tools, you'd need to group and then get all the reads in one group and write custom code to calculate the consensus.

ADD COMMENT
0
Entering edit mode

I've been using gencore here at https://github.com/OpenGene/gencore, which looks at the start points of both R1 and R2, and treats molecules (with the same UMI) as different if they don't match those locs or if the reads are different lengths. It doesn't consensus deduplicate the whole set, just a subset that match, leaving the remainder, which would bias everything.

I have used UMI-tools before with similar sequencing chemistries, and it works fine for RNA-seq, where the counts are important, but the consensus sequence is important here as I want to call variants on it.

ADD REPLY
0
Entering edit mode

It shouldn't be that difficult to take the output from umi_tools group and write some custom code that calculates a consensus sequence, depending on how sophisticated you want that consensus to be - if you just want a majority call, it should be fairly straight forward.

ADD REPLY
0
Entering edit mode

Which version of UMI-tools is this? The one currently documented on readthedocs.io (1.0.0) does not have this option, and the one I currently have installed (0.5.5) does not have this either.

ADD REPLY
0
Entering edit mode

It was in 1.1.0. I don't know why the documentation is still showing 1.0.0.

ADD REPLY

Login before adding your answer.

Traffic: 2804 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6