Hi,
I have 149 bulk RNA-Seq samples (100 bp, paired-end, Illumina) which have come from sequencing in the form of fastq triplets, i.e. pairs of reads plus a third fastq which contains only UMIs.
My first question is - do I need to use the UMIs at all or just ignore them?
So far I've ignored them, and used this workflow (on just 10 samples to begin with):
- FastQC on raw reads
- Align to full ref human genome using STAR (in: fastq, out: BAMs, sortedByCoord)
- Produce counts using featureCounts
- MultiQC on results produced so far
- Analyse using DESeq2
This works, but ignores the UMIs completely. The MultiQC results show, from STAR:
and from featureCounts:
(note the fairly large percentages of unassigned multimapping reads there).
Alternatively, I've tried to incorporate the UMIs with this changed workflow:
- FastQC on raw reads
- Align to full ref human genome using STAR (in: fastq, out: BAMs, sortedByCoord)
- Add the UMIs from the fastq files to the BAMs produced by STAR, using fgbio’s AnnotateBamWithUmis
but I'm getting lost down a rabbit hole now, adding more and more steps to this pipeline just to get past the various errors I'm getting from downstream tools, e.g.
- fgbio SortBam
- fgbio SetMateInformation
- fgbio GroupReadsByUmi
- fgbio CallMolecularConsensusReads
- samtools reheader, to add SM tag to BAMs
- fgbio FilterConsensusReads (results in vastly reduced BAM file sizes)
For the moment I've stopped here. Maybe I can use these BAM files, but this workflow is starting to feel over-complicated and I'm not confident it's the correct way to go.
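To make concrete what is at stake in using or ignoring the UMIs, here is a toy sketch in Python of position-only versus position-plus-UMI de-duplication. The reads, coordinates and UMIs are invented for illustration, and the `dedup` helper is hypothetical, not part of any tool mentioned above. It uses exact UMI matching only, whereas real tools such as umi_tools also allow for sequencing errors within the UMI.

```python
# Toy alignments: (read_name, chromosome, start, strand, umi).
# All values are invented for illustration.
reads = [
    ("r1", "chr1", 1000, "+", "ACGT"),
    ("r2", "chr1", 1000, "+", "ACGT"),  # same position and UMI as r1: likely PCR duplicate
    ("r3", "chr1", 1000, "+", "TTAG"),  # same position, different UMI: distinct molecule
    ("r4", "chr2", 500, "-", "ACGT"),
]


def dedup(reads, use_umi):
    """Keep the first read seen for each grouping key (hypothetical helper)."""
    kept = {}
    for name, chrom, start, strand, umi in reads:
        if use_umi:
            key = (chrom, start, strand, umi)
        else:
            key = (chrom, start, strand)
        # setdefault only stores the name if the key has not been seen before
        kept.setdefault(key, name)
    return sorted(kept.values())


position_only = dedup(reads, use_umi=False)
position_and_umi = dedup(reads, use_umi=True)

print(position_only)     # ['r1', 'r4']
print(position_and_umi)  # ['r1', 'r3', 'r4']
```

Without the UMI, r3 is discarded along with r2 even though it comes from a different molecule; with the UMI, only the true duplicate r2 is removed.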
So to summarise:
- Do I need to incorporate the UMIs at all?
- If so, could anybody suggest a workflow?
Many thanks, Steve
I can't see the images.
Apologies, corrected now.
Have you tried de-duplicating reads using the UMIs alone, or in combination with read alignment starts, using umi_tools?
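By default umi_tools dedup reads the UMI from the end of the read name, after the last underscore, so with a separate UMI fastq the UMIs need to be moved into the read names before alignment. A minimal Python sketch of that step follows. The function names are hypothetical, and it assumes the read and UMI fastq files list the same reads in the same order. The same tagging would have to be applied to both R1 and R2.

```python
import io
from itertools import islice, zip_longest


def fastq_records(handle):
    """Yield (header, sequence, plus, quality) tuples from a text-mode FASTQ handle."""
    while True:
        record = [line.rstrip("\n") for line in islice(handle, 4)]
        if not record:
            return
        if len(record) < 4:
            raise ValueError("truncated FASTQ record")
        yield tuple(record)


def tag_reads_with_umis(reads_in, umis_in, reads_out, separator="_"):
    """Append each UMI sequence to the matching read name; return the record count."""
    count = 0
    pairs = zip_longest(fastq_records(reads_in), fastq_records(umis_in))
    for read, umi in pairs:
        if read is None or umi is None:
            raise ValueError("read and UMI files have different numbers of records")
        header, seq, plus, qual = read
        name, _, comment = header.partition(" ")
        umi_name = umi[0].partition(" ")[0]
        if name != umi_name:
            raise ValueError(f"name mismatch: {name} vs {umi_name}")
        # The UMI goes on the name itself, before any comment field
        new_header = f"{name}{separator}{umi[1]}"
        if comment:
            new_header += " " + comment
        reads_out.write("\n".join((new_header, seq, plus, qual)) + "\n")
        count += 1
    return count


# Small in-memory demonstration with invented records
demo_reads = io.StringIO("@read1 1:N:0\nACGTACGT\n+\nIIIIIIII\n")
demo_umis = io.StringIO("@read1 2:N:0\nAACCGG\n+\nIIIIII\n")
demo_out = io.StringIO()
n_tagged = tag_reads_with_umis(demo_reads, demo_umis, demo_out)
print(demo_out.getvalue().splitlines()[0])  # @read1_AACCGG 1:N:0
```

For real files the handles could be opened with `gzip.open(path, "rt")`. After aligning the tagged reads with STAR and indexing the sorted BAM, something like `umi_tools dedup -I in.bam -S out.bam --paired` would then give de-duplicated BAMs for featureCounts.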