Bulk RNA-Seq pipeline suggestion incorporating UMIs
3
1
22 months ago
steveh ▴ 40

Hi,

I have 149 bulk RNA-Seq samples (100 bp, paired-end, Illumina) which have come from sequencing in the form of fastq triplets, i.e. pairs of reads plus a third fastq which contains only UMIs.

My first question is - do I need to use the UMIs at all or just ignore them?

So far I've ignored them, and used this workflow (on just 10 samples to begin with):

2. Align to full ref human genome using STAR (in: fastq, out: BAMs, sortedByCoord)
3. Produce counts using featureCounts
4. MultiQC on results produced so far
5. Analyse using DESeq2

This works, but ignores the UMIs completely. Results from MultiQC show, from STAR:

and from featureCounts:

(note the fairly large percentages of unassigned multimapping reads there).

Alternatively, I've tried to incorporate the UMIs with this changed workflow:

2. Align to full ref human genome using STAR (in: fastq, out: BAMs, sortedByCoord)
3. Add the UMIs from the fastq files to the BAMs produced by STAR, using fgbio’s AnnotateBamWithUmis

but I'm getting lost down a rabbit hole now, adding more and more steps to this pipeline just to satisfy various errors I'm getting from downstream tools, e.g.

1. fgbio SortBam
2. fgbio SetMateInformation
6. fgbio FilterConsensusReads (results in vastly reduced BAM file sizes)

For the moment I've stopped here. Maybe I can use these BAM files, but this workflow is starting to feel over-complicated and I don't have confidence it's the correct way to go.

So to summarise:

• Do I need to incorporate the UMIs at all?
• If so, could anybody suggest a workflow?

Many thanks, Steve

RNA-Seq bulk UMI • 1.2k views
0

Cannot see the images.

0

apologies, corrected now

0

Have you tried using umi_tools to de-duplicate reads, either by UMIs alone or by UMIs in combination with read alignment start positions?

4
22 months ago

Here is what I would recommend with umi-tools.

Extract the UMIs from the fastqs before mapping. You'll need to do this once for each of the two non-UMI read files (R1 and R2).

umi_tools extract --bc-pattern=NNNNNNNNNN -I umi_reads.fastq.gz --read2-in=reads_R1.fastq.gz --read2-stdout | gzip > reads_R1.extracted.fastq.gz


where the number of Ns in the bc-pattern matches the number of bases in the UMI.
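
As a rough sketch of running the extraction once per mate, the helper below builds the `--bc-pattern` from the UMI length and wraps the `umi_tools extract` call. The file names, the 10-base UMI length, and the `DRY_RUN` switch are assumptions for illustration only; the option is spelled `--read2-in` here, which is the name I believe the UMI-tools documentation uses.

```shell
# Sketch only: file names and the 10-base UMI length are assumptions.

# Build a --bc-pattern of N's, one per UMI base (10 -> NNNNNNNNNN).
umi_pattern() (
    pattern=""
    i=0
    while [ "$i" -lt "$1" ]; do
        pattern="${pattern}N"
        i=$((i + 1))
    done
    printf '%s\n' "$pattern"
)

# Copy the UMI from the UMI-only fastq onto the read names of one mate file.
# With DRY_RUN=1 the command is printed instead of executed.
extract_umis() (
    umi_fastq="$1"; reads_fastq="$2"; out_fastq="$3"; umi_len="$4"
    pattern="$(umi_pattern "$umi_len")"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "umi_tools extract --bc-pattern=${pattern} -I ${umi_fastq} --read2-in=${reads_fastq} --read2-stdout | gzip > ${out_fastq}"
    else
        umi_tools extract --bc-pattern="$pattern" -I "$umi_fastq" \
            --read2-in="$reads_fastq" --read2-stdout | gzip > "$out_fastq"
    fi
)

# Run once per mate:
# for mate in R1 R2; do
#     extract_umis umi_reads.fastq.gz "reads_${mate}.fastq.gz" \
#         "reads_${mate}.extracted.fastq.gz" 10
# done
```

Setting `DRY_RUN=1` first is a cheap way to check the generated commands for all samples before committing compute time.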

You can then proceed to map these reads using STAR as before.
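
For completeness, a minimal sketch of the mapping step on the extracted files might look like the following. The index directory, file names, output prefix, thread count, and the `run`/`DRY_RUN` helper are all placeholders, not taken from the original workflow; only the coordinate-sorted BAM output matches what is described in the question.

```shell
# Sketch only: index path, file names, and thread count are placeholders.

# Run a command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Align both UMI-extracted mates and write a coordinate-sorted BAM to
# <prefix>Aligned.sortedByCoord.out.bam.
# Arguments: genome_dir r1.fastq.gz r2.fastq.gz out_prefix [threads]
star_align() {
    run STAR --runThreadN "${5:-8}" \
        --genomeDir "$1" \
        --readFilesIn "$2" "$3" \
        --readFilesCommand zcat \
        --outSAMtype BAM SortedByCoordinate \
        --outFileNamePrefix "$4"
}

# Example:
# star_align /path/to/star_index reads_R1.extracted.fastq.gz \
#     reads_R2.extracted.fastq.gz sample1_
```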

Once the reads are mapped, sorted and indexed, deduplicate the BAMs with umi_tools dedup:

umi_tools dedup -I mapped_reads.bam -S deduplicated_reads.bam --paired
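
Since `umi_tools dedup` expects an indexed, coordinate-sorted BAM, the index and dedup steps can be wrapped together. This is only a sketch: the BAM names and the `run`/`DRY_RUN` helper are hypothetical, and it assumes `samtools` is available for indexing.

```shell
# Sketch only: BAM names are placeholders; assumes samtools is installed.

# Run a command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Index a coordinate-sorted BAM, then deduplicate read pairs by UMI.
# Arguments: input.bam output.bam
dedup_bam() {
    run samtools index "$1" &&
    run umi_tools dedup -I "$1" -S "$2" --paired
}

# Example:
# dedup_bam mapped_reads.bam deduplicated_reads.bam
```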


Now you can proceed to quantify with featureCounts and analyse with DESeq2 as before.

0

Thanks Ian - would that be sorted by coordinate? (asking because the fgbio workflow seems to require re-sorting by Queryname)

0

Yes, sorted by coordinate.

0

thanks - and for the dedup step, do I need the --paired option or is that assumed?

0

Ooops. Yes, you will need the paired option, I'll edit the post.

0

great, thanks so much for taking the time to answer at this time of year!

0

Just to update after lots of testing - this is the method I settled on, although adding the UMIs to the already-aligned BAMs and then using umi_tools dedup also works fine.

I don't recommend the method mentioned in my original post, using fgbio.

0

Not to be pedantic, but does "this" mean the method/answer suggested by @i.sudbery above? If so, I can move that comment to an answer, which you can then accept to provide closure to this thread.

0

yes that's correct, the @i.sudbery answer. The general pointer to umi_tools is also useful, but Ian's answer is very specific.

0

You are able to accept more than one answer. Ian's comment has been moved to an answer now.

1
22 months ago

Have you looked at umi-tools?

https://github.com/CGATOxford/UMI-tools

0
22 months ago

Hi, check what proportion of your reads are duplicates according to the UMIs. If it's less than 1 or 2% of total reads, you don't have to worry about it at all. But if it's more than 10%, then you have to do something about it.
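
One way to put a number on this, assuming the figure of interest is the share of reads removed as UMI duplicates (my reading of the suggestion above), is to compare primary mapped read counts before and after deduplication. The helper name and BAM names below are hypothetical, and the count is only approximate because it ignores how multimapped and unpaired reads are handled.

```shell
# Sketch only: reports the percentage of reads removed by deduplication.
# Arguments: reads_before reads_after
percent_removed() {
    awk -v total="$1" -v kept="$2" 'BEGIN {
        if (total <= 0) { print "NA"; exit }
        printf "%.1f\n", 100 * (total - kept) / total
    }'
}

# Example (needs samtools; -F 0x904 skips unmapped, secondary,
# and supplementary alignments):
# total=$(samtools view -c -F 0x904 mapped_reads.bam)
# kept=$(samtools view -c -F 0x904 deduplicated_reads.bam)
# percent_removed "$total" "$kept"
```

The result can then be held against the 1 to 2% and 10% thresholds mentioned above.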