Question

How to remove duplicate reads or create consensus reads based on UMI tags in a second FASTQ file?

4.3 years ago

William ★ 5.3k

In a targeted sequencing dataset, I have an r1.fastq.gz file and a umi.fastq.gz file for each sample.

How can I use the short UMI tags in the umi.fastq.gz file to remove duplicates, or to create consensus sequences, from the reads in the r1.fastq.gz file? Or do I need to go back to, or recreate, the original FASTQ files?

The r1.fastq.gz file and the umi.fastq.gz file contain the same number of reads. I also expect the reads and the UMI tags to be in the same order in both files.
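The expectation that both files are in sync can be checked directly before relying on it. The following is a minimal Python sketch, assuming plain four-line FASTQ records (no wrapped sequences); the function names `read_names` and `files_in_sync` are made up for illustration.

```python
import gzip
from itertools import zip_longest


def read_names(path):
    """Yield the read name (header text up to the first space) of each record."""
    with gzip.open(path, "rt") as handle:
        for i, line in enumerate(handle):
            if i % 4 == 0:  # first line of each 4-line FASTQ record
                yield line.rstrip("\n").split(" ", 1)[0]


def files_in_sync(r1_path, umi_path):
    """Return True if both files list the same read names in the same order."""
    # zip_longest pads the shorter file with None, so a length mismatch
    # also counts as "not in sync".
    pairs = zip_longest(read_names(r1_path), read_names(umi_path))
    return all(r1 == umi for r1, umi in pairs)
```

For the files shown in this post, the call would be `files_in_sync("sample_1.R1.fastq.gz", "sample_1.umi.fastq.gz")`.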

I found fastp, but it is not clear to me which option I should use, or whether I need a different tool or script for data set up this way. https://github.com/OpenGene/fastp#unique-molecular-identifier-umi-processing

The files look like this (actual sequence masked with X characters):

$ zcat sample_1.R1.fastq.gz | head
@HVVFKAFXY:1:11101:10004:17048
AATXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

+
AA/A/EA/EEAA/E//EA/EEEEEE/A/EE</<//EEEEEE/EE//EAEEA///66E//EE/AEE/EEAEEE/AA
@HVVFKAFXY:1:11101:10009:11939
CTAXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
+
AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEE6EEEEEEEEEEEEAE/EEEEEEEEEEAE<E<EEEE
@HVVFKAFXY:1:11101:10010:19964
TTCXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

$ zcat sample_1.umi.fastq.gz | head
@HVVFKAFXY:1:11101:10004:17048
GTAGGGACACTT
+
AAA///EE//EE
@HVVFKAFXY:1:11101:10009:11939
TGCTGCATTTTC
+
AAAAAEEEEEEE
@HVVFKAFXY:1:11101:10010:19964
CTAATCTAGTAA

umi • 1.5k views


Can you show a read example (or two) from each of these files? zcat file.gz | head -8.

You may be able to use umi-tools, but you will likely need to align the data first (https://github.com/CGATOxford/UMI-tools/blob/master/doc/QUICK_START.md).

4.3 years ago by GenoMax 141k

I use umi-tools, but you'll have to put the UMI in the read name. bcl2fastq can likely do this for you if you have access to the BCL files.
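If the BCL files are not available, the UMI can also be moved into the read name with a short script. Below is a rough Python sketch, assuming four-line FASTQ records and that both files list the same reads in the same order. The function name `add_umi_to_read_names` and the output file name are hypothetical, and the `_` separator is an assumption that should be checked against the read-name format your deduplication tool expects.

```python
import gzip


def add_umi_to_read_names(r1_path, umi_path, out_path, sep="_"):
    """Write a copy of r1_path with each read renamed to <name><sep><UMI>.

    Returns the number of records written. Raises ValueError if the read
    names in the two files do not match record by record.
    """
    n = 0
    with gzip.open(r1_path, "rt") as r1, \
            gzip.open(umi_path, "rt") as umi, \
            gzip.open(out_path, "wt") as out:
        while True:
            # Read one 4-line record from each file.
            r1_rec = [r1.readline().rstrip("\n") for _ in range(4)]
            umi_rec = [umi.readline().rstrip("\n") for _ in range(4)]
            if not r1_rec[0] and not umi_rec[0]:
                break  # both files ended together
            name, _, comment = r1_rec[0].partition(" ")
            if name != umi_rec[0].split(" ", 1)[0]:
                raise ValueError(f"read names out of sync at record {n + 1}")
            # The UMI sequence is the second line of the UMI record.
            header = f"{name}{sep}{umi_rec[1]}"
            if comment:
                header += " " + comment
            out.write("\n".join([header] + r1_rec[1:]) + "\n")
            n += 1
    return n
```

With the example files from the question, `add_umi_to_read_names("sample_1.R1.fastq.gz", "sample_1.umi.fastq.gz", "sample_1.R1.umi_in_name.fastq.gz")` would turn the first header into `@HVVFKAFXY:1:11101:10004:17048_GTAGGGACACTT`. The renamed reads can then be aligned and deduplicated on the resulting BAM file.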

4.3 years ago by swbarnes2 14k