Question

which options to select with UMI-tools and BCLconvert for deduplication of reads

11 months ago

ben.vanderveer • 0

Hi,

I have paired-end RRBS data with a 6 bp UMI encoded in the middle of the header, made by our sequencing core using BCLconvert. It looks like this (the UMI is TAGCGC, the index is CGTCTAAC):

@A01685:159:H2YHFDSX7:4:2463:10655:16971:TAGCGC 1:N:0:CGTCTAAC

After aligning (with bismark) and sorting (samtools sort), the header looks like this:

A01685:159:H2YHFDSX7:4:2463:10655:16971:TAGCGC_1:N:0:CGTCTAAC

I would like to deduplicate these reads using UMI-tools dedup software, but in the documentation it is stated that the UMI needs to be encoded at the end of the readname.

I tried running UMI-tools dedup on these reads, but it does not recognize the UMI. How do I specify the location of the UMI in the header here? I cannot seem to figure it out based on the documentation.

Cheers!

deduplication • 532 views

updated 11 months ago by i.sudbery 19k • written 11 months ago by ben.vanderveer • 0

score 0 · Answer 1 · 2023-05-23

This is due to unusual behaviour by bismark, which handles read names differently from most aligners. Most aligners simply discard any part of the read name after a space; they do this to ensure that the read name for read1 and read2 is the same, whereas you'll notice that the bit after the space differs between read1 and read2. Bismark instead keeps that part and joins it on with an underscore, so the UMI is no longer at the end of the read name.

The best way to deal with this is probably to alter the read names before running the alignment. The easiest way is probably:

$ zcat fastq_1.fq.gz | sed -E 's/ 1:N:.+//' | gzip > fastq_1.processed.fq.gz
$ zcat fastq_2.fq.gz | sed -E 's/ 2:N:.+//' | gzip > fastq_2.processed.fq.gz
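
A slightly more defensive variant of the same idea, as a sketch rather than a tested pipeline: it only touches the header line of each four-line FASTQ record and strips everything from the first space onwards, so it does not depend on the exact `1:N:` / `2:N:` text. The function names and file names here are made up for illustration. After this, the UMI is the last colon-separated field of the read name, so `umi_tools dedup` would presumably need `--umi-separator=":"` (its default separator is `_`), plus `--paired` for paired-end data. I have not checked how well `umi_tools dedup` copes with bismark BAM files beyond that.

```shell
# Strip the comment part (everything from the first space) from FASTQ headers.
# Reads uncompressed FASTQ on stdin, writes to stdout.
# Assumes strict four-line records, so headers are lines 1, 5, 9, ...
strip_header_comment() {
    awk 'NR % 4 == 1 { sub(/ .*/, "") } { print }'
}

# Hypothetical wrapper: dedup a sorted, indexed BAM whose read names end in ":UMI".
# $1 = input BAM, $2 = output BAM.
dedup_colon_umi() {
    umi_tools dedup --paired --umi-separator=":" -I "$1" -S "$2"
}

# Example usage (file names are placeholders):
#   zcat fastq_1.fq.gz | strip_header_comment | gzip > fastq_1.processed.fq.gz
#   zcat fastq_2.fq.gz | strip_header_comment | gzip > fastq_2.processed.fq.gz
#   ... align with bismark, sort and index ...
#   dedup_colon_umi aligned.sorted.bam aligned.dedup.bam
```

With the header from the question, the first function turns `@A01685:159:H2YHFDSX7:4:2463:10655:16971:TAGCGC 1:N:0:CGTCTAAC` into `@A01685:159:H2YHFDSX7:4:2463:10655:16971:TAGCGC`, which is identical for read1 and read2.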