which options to select with UMI-tools and BCLconvert for deduplication of reads
11 months ago

Hi,

I have paired-end RRBS data with a 6 bp UMI encoded in the middle of the header, produced by our sequencing core using BCLconvert. The headers look like this (the UMI is TAGCGC, the index is CGTCTAAC):

@A01685:159:H2YHFDSX7:4:2463:10655:16971:TAGCGC 1:N:0:CGTCTAAC

After aligning (with bismark) and sorting (samtools sort), the header looks like this:

A01685:159:H2YHFDSX7:4:2463:10655:16971:TAGCGC_1:N:0:CGTCTAAC

I would like to deduplicate these reads using UMI-tools dedup software, but in the documentation it is stated that the UMI needs to be encoded at the end of the readname.

I tried running UMI-tools dedup on these reads, but it does not recognize the UMI. How do I specify the location of the UMI in the header here? I cannot figure it out from the documentation.

Cheers!

11 months ago

This is due to unusual behaviour by bismark, which handles read names differently from most aligners. Most aligners simply discard any part of the read name after a space. They do this to ensure that the read name for read 1 and read 2 is the same, whereas you'll notice that the part after the space differs between read 1 and read 2. Bismark instead keeps that part and joins it to the name with an underscore, as your post-alignment header shows, so the UMI is no longer at the end of the read name.

The best way to deal with this is probably to alter the read names before running the alignment. The easiest way is probably:

$ zcat fastq_1.fq.gz | sed -E 's/ 1:N:.+//' | gzip > fastq_1.processed.fq.gz
$ zcat fastq_2.fq.gz | sed -E 's/ 2:N:.+//' | gzip > fastq_2.processed.fq.gz
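
As a quick sanity check before reprocessing whole files, the same sed expression can be tried on a single header. This is only a sketch using the read name from the question; the variable names `header`, `cleaned` and `umi` are made up for illustration.

```shell
# Header copied from the question (read 1).
header='@A01685:159:H2YHFDSX7:4:2463:10655:16971:TAGCGC 1:N:0:CGTCTAAC'

# Apply the same substitution as above: drop the space and everything after it.
cleaned=$(printf '%s\n' "$header" | sed -E 's/ 1:N:.+//')
echo "$cleaned"

# The UMI should now be the last colon-separated field of the read name.
umi=${cleaned##*:}
echo "$umi"
```

Once the UMI is the last field of the read name, umi_tools dedup still needs to be told that the separator is a colon, since its default is an underscore. Assuming a reasonably recent UMI-tools version, that would be the `--umi-separator=":"` option, together with `--paired` for paired-end data, but check this against the documentation for the version you have installed.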