In a targeted sequencing dataset, I have for each sample an r1.fastq.gz file and a umi.fastq.gz file.
How can I use the short UMI tags in the umi.fastq.gz file to remove duplicates or build a consensus sequence from the reads in the r1.fastq.gz file? Or do I need to go back to, or recreate, the original FASTQ files?
The r1.fastq.gz file and the umi.fastq.gz file have the same number of reads.
I also expect that the reads and the UMI tags are in the same order in both files.
I found fastp, but it is not clear to me which option I should use, or whether I need a different tool or script for data laid out this way: https://github.com/OpenGene/fastp#unique-molecular-identifier-umi-processing
The files look like this (actual sequences masked with X characters):
$ zcat sample_1.R1.fastq.gz | head
@HVVFKAFXY:1:11101:10004:17048
AATXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
+
AA/A/EA/EEAA/E//EA/EEEEEE/A/EE</<//EEEEEE/EE//EAEEA///66E//EE/AEE/EEAEEE/AA
@HVVFKAFXY:1:11101:10009:11939
CTAXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
+
AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEE6EEEEEEEEEEEEAE/EEEEEEEEEEAE<E<EEEE
@HVVFKAFXY:1:11101:10010:19964
TTCXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
$ zcat sample_1.umi.fastq.gz | head
@HVVFKAFXY:1:11101:10004:17048
GTAGGGACACTT
+
AAA///EE//EE
@HVVFKAFXY:1:11101:10009:11939
TGCTGCATTTTC
+
AAAAAEEEEEEE
@HVVFKAFXY:1:11101:10010:19964
CTAATCTAGTAA
Can you show a read example (or two) from each of these files?
zcat file.gz | head -8
You may be able to use umi-tools, but you will likely need to align the data first (https://github.com/CGATOxford/UMI-tools/blob/master/doc/QUICK_START.md).
I use umi-tools, but you'll have to put the UMI in the read name. bcl2fastq can likely do this for you if you have access to the BCL files.
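If the BCL files are not available, moving the UMI into the read name can also be done directly from the two FASTQ files. Below is a minimal Python sketch, not a tested pipeline: it walks both files in lockstep, checks that the read names agree, and appends each UMI to the matching R1 read name with an underscore, which is the separator umi-tools looks for by default. The function names (`read_fastq`, `tag_reads_with_umi`) and the output file name are made up for illustration, and it assumes plain four-line FASTQ records.

```python
import gzip
from itertools import islice, zip_longest


def read_fastq(handle):
    """Yield (header, sequence, quality) from an open FASTQ text handle."""
    while True:
        record = list(islice(handle, 4))
        if not record:
            return
        if len(record) < 4:
            raise ValueError("truncated FASTQ record")
        header, seq, _plus, qual = (line.rstrip("\n") for line in record)
        yield header, seq, qual


def tag_reads_with_umi(r1_path, umi_path, out_path, sep="_"):
    """Write a copy of R1 with each read's UMI appended to its read name.

    Hypothetical helper: raises ValueError if the two files fall out of sync.
    Returns the number of reads written.
    """
    count = 0
    with gzip.open(r1_path, "rt") as r1, \
            gzip.open(umi_path, "rt") as umi, \
            gzip.open(out_path, "wt") as out:
        for rec1, rec2 in zip_longest(read_fastq(r1), read_fastq(umi)):
            if rec1 is None or rec2 is None:
                raise ValueError("files contain different numbers of reads")
            header1, seq, qual = rec1
            header2, umi_seq, _umi_qual = rec2
            # The read name is the first whitespace-delimited token.
            name1, _, comment = header1.partition(" ")
            name2 = header2.split(" ", 1)[0]
            if name1 != name2:
                raise ValueError(f"read names differ: {name1} vs {name2}")
            # Append the UMI to the name itself so it survives alignment.
            new_header = f"{name1}{sep}{umi_seq}"
            if comment:
                new_header += " " + comment
            out.write(f"{new_header}\n{seq}\n+\n{qual}\n")
            count += 1
    return count
```

Calling `tag_reads_with_umi("sample_1.R1.fastq.gz", "sample_1.umi.fastq.gz", "sample_1.R1.tagged.fastq.gz")` would turn the first read name into `@HVVFKAFXY:1:11101:10004:17048_GTAGGGACACTT`. After aligning the tagged reads and sorting and indexing the BAM, something like `umi_tools dedup -I aligned.bam -S deduplicated.bam` (see the quick start linked above) should then be able to pick the UMI up from the read name.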