Hi everyone,
I'm looking for some advice regarding UMI processing.
I'm currently setting up the processing of panel data in a previously built pipeline. I have access to the raw sequencing data (FASTQ) and the collapsed BAM file from the DRAGEN pipeline.
In your experience, what are the best methods for working from the FASTQ file with molecular barcodes (Illumina UMIs)? How do you collapse the molecular barcode families? Are there any well-documented tools for it?
When starting from the collapsed BAM file, do you run your workflow without any changes? Or do you remove the UMIs from the reads using UMI-tools and then continue the workflow?
Any tips are appreciated!
UMI-tools is an established package for dealing with UMIs. You will also find a detailed usage guide linked here: https://github.com/CGATOxford/UMI-tools
If your data already contains collapsed UMIs, then that may limit the usage of umi-tools.
My main issue with UMI-tools is that, as I understand it, the dedup and group functions don't work on FASTQ files, only on BAM files. I will look at their documentation again.
Thanks!
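To illustrate why grouping and deduplication are normally done on aligned reads, here is a minimal sketch of the general idea: reads are put in the same family when they share both a mapping position and a UMI, and the position is only known after alignment. This is not how UMI-tools is implemented; the function name, the dictionary fields, and the toy reads are all made up for the example. It only does exact UMI matching, with no correction for sequencing errors in the UMI.

```python
from collections import defaultdict

def group_umi_families(reads):
    """Group aligned reads into families keyed by (chrom, pos, strand, umi).

    Hypothetical helper for illustration: exact UMI matching only,
    no error correction of the UMI sequence.
    """
    families = defaultdict(list)
    for read in reads:
        key = (read["chrom"], read["pos"], read["strand"], read["umi"])
        families[key].append(read["name"])
    return dict(families)

# Toy aligned reads (made-up values, not real data).
reads = [
    {"name": "r1", "chrom": "chr1", "pos": 100, "strand": "+", "umi": "ACGT"},
    {"name": "r2", "chrom": "chr1", "pos": 100, "strand": "+", "umi": "ACGT"},
    {"name": "r3", "chrom": "chr1", "pos": 100, "strand": "+", "umi": "TTGA"},
    {"name": "r4", "chrom": "chr2", "pos": 100, "strand": "+", "umi": "ACGT"},
]

families = group_umi_families(reads)
for key, names in families.items():
    print(key, names)
```

In this toy input, r1 and r2 end up in one family because they share position and UMI, while r3 (different UMI) and r4 (different chromosome) each form their own family. Without the alignment coordinates, r4 could not be told apart from r1 and r2.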
In that case you can try fastp instead: "Use fastp to preprocess FASTQ data with unique molecular identifier (UMI) integrated"
There is some advice from @Ian.Sudbery here: "De-duplicate UMI at FASTQ level"
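For anyone unsure what UMI preprocessing at the FASTQ level amounts to, here is a rough sketch of the idea: cut the UMI off the start of the read and store it in the read name so that it survives alignment. It assumes an inline UMI of known length at the 5' end of the read and uses an underscore as the separator; the function name and the example record are invented, and the exact read-name format written by fastp or UMI-tools may differ from this.

```python
def move_umi_to_name(record, umi_len):
    """Move the first `umi_len` bases of a read into its name.

    `record` is a (name, sequence, separator, quality) tuple for one
    FASTQ entry. Hypothetical helper: assumes the UMI sits at the
    5' end of the read sequence.
    """
    name, seq, plus, qual = record
    umi = seq[:umi_len]
    # Keep any comment after the first space (e.g. the Illumina "1:N:0:..." part).
    read_id, _, comment = name.partition(" ")
    new_name = f"{read_id}_{umi}" + (f" {comment}" if comment else "")
    # Trim the UMI bases and their quality scores from the read.
    return (new_name, seq[umi_len:], plus, qual[umi_len:])

# Made-up FASTQ record with a 4-base UMI at the start of the read.
record = ("@read1 1:N:0:ATCACG", "ACGTACGGTTCA", "+", "IIIIIIIIIIII")
trimmed = move_umi_to_name(record, umi_len=4)
print("\n".join(trimmed))
```

After this step the reads can be aligned as usual, and the UMI can be recovered from the read name when grouping or deduplicating the BAM file.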
Cool, I didn't know this tool! I'll delve into it!
Thanks!
Is there any particular reason you want to work with the FASTQ files rather than aligning the reads and then collapsing them?
Not really! I want to provide the user of my pipeline with the flexibility to start with either BAM or FASTQ.
However, I believe that when starting from the collapsed-read BAM file, UMIs shouldn't be an issue anymore. Am I right?