Question: Bulk RNA-Seq pipeline suggestion incorporating UMIs
steveh20 wrote:

Hi,

I have 149 bulk RNA-Seq samples (100 bp, paired-end, Illumina) which have come from sequencing in the form of fastq triplets, i.e. pairs of reads plus a third fastq which contains only UMIs.

My first question is - do I need to use the UMIs at all or just ignore them?

So far I've ignored them, and used this workflow (on just 10 samples to begin with):

  1. FastQC on raw reads
  2. Align to full ref human genome using STAR (in: fastq, out: BAMs, sortedByCoord)
  3. Produce counts using featureCounts
  4. MultiQC on results produced so far
  5. Analyse using DESeq2

This works, but ignores the UMIs completely. Results from multiQC show, from STAR:

[Image: STAR alignment plot]

and from featureCounts:

[Image: featureCounts assignment plot]

(note the fairly large percentages of unassigned multimapping reads there).

Alternatively, I've tried to incorporate the UMIs with this changed workflow:

  1. FastQC on raw reads
  2. Align to full ref human genome using STAR (in: fastq, out: BAMs, sortedByCoord)
  3. Add the UMIs from the fastq files to the BAMs produced by STAR, using fgbio’s AnnotateBamWithUmis

but I'm getting lost down a rabbit hole now, adding more and more steps to this pipeline just to satisfy various errors I'm getting from downstream tools, e.g.

  1. fgbio SortBam
  2. fgbio SetMateInformation
  3. fgbio GroupReadsByUmi
  4. fgbio CallMolecularConsensusReads
  5. samtools reheader, to add SM tag to BAMs
  6. fgbio FilterConsensusReads (results in vastly reduced BAM file sizes)

For the moment I've stopped here. Maybe I can use these BAM files, but this workflow is starting to feel over-complicated and I don't have confidence that it's the correct way to go.

So to summarise:

  • Do I need to incorporate the UMIs at all?
  • If so, could anybody suggest a workflow?

Many thanks, Steve

rna-seq umi bulk • 282 views
modified 5 months ago by swbarnes27.8k • written 5 months ago by steveh20

Cannot see the images.

written 5 months ago by MatthewP620

apologies, corrected now

modified 5 months ago • written 5 months ago by steveh20

Have you tried de-duplicating the reads with umi_tools, using the UMIs alone or in combination with read alignment starts?

modified 5 months ago • written 5 months ago by genomax84k
swbarnes27.8k wrote:

Have you looked at umi-tools?

https://github.com/CGATOxford/UMI-tools

written 5 months ago by swbarnes27.8k
i.sudbery7.8k wrote:

Here is what I would recommend with umi-tools.

Extract the UMIs from the fastqs before mapping. You'll need to do this once for each of the non-UMI reads.

umi_tools extract --bc-pattern=NNNNNNNNNN -I umi_reads.fastq.gz --read2-in=reads_R1.fastq.gz --read2-stdout | gzip > reads_R1.extracted.fastq.gz
umi_tools extract --bc-pattern=NNNNNNNNNN -I umi_reads.fastq.gz --read2-in=reads_R2.fastq.gz --read2-stdout | gzip > reads_R2.extracted.fastq.gz

where the number of Ns in the bc-pattern matches the number of bases in the UMI.
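If you are not sure how long the UMIs are, a rough sketch like the following can build the pattern from the UMI fastq itself. The function name umi_bc_pattern is made up for illustration, and it assumes a gzipped, four-line-per-record fastq in which (nearly) all UMI reads have the same length; it only looks at the first 1000 records.

```shell
# Hypothetical helper: print a string of Ns as long as the most common
# read length among the first 1000 records of a gzipped UMI fastq.
umi_bc_pattern() {
    gzip -dc "$1" | head -n 4000 | awk '
        NR % 4 == 2 { n[length($0)]++ }      # sequence lines only
        END {
            max = 0; best = 0
            for (l in n) if (n[l] > max) { max = n[l]; best = l + 0 }
            for (i = 0; i < best; i++) printf "N"
            print ""
        }'
}

# Usage (file name is a placeholder):
#   umi_tools extract --bc-pattern="$(umi_bc_pattern umi_reads.fastq.gz)" ...
```

It is still worth checking the result against what the sequencing provider says about the UMI design.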

You can then proceed to map these reads using STAR as before.

Once the reads are mapped, sorted and indexed, deduplicate the BAMs with umi_tools dedup:

umi_tools dedup -I mapped_reads.bam -S deduplicated_reads.bam --paired

Now you can proceed to quantify with featureCounts and analyse with DESeq2 as before.
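With 149 samples it may help to generate the commands per sample and look them over before running anything. The sketch below only prints the commands. The function name print_umi_workflow and the file naming (<sample>_R1/_R2/_UMI.fastq.gz) are assumptions for illustration, the STAR options shown are just the ones needed for coordinate-sorted BAM output from gzipped input, and it uses --read2-in, which is the option name in the umi_tools documentation.

```shell
# Dry run: print the per-sample commands instead of executing them.
# Assumed (hypothetical) file layout: <sample>_R1/_R2/_UMI.fastq.gz
print_umi_workflow() {
    s=$1        # sample name
    genome=$2   # STAR genome index directory
    pattern=$3  # UMI pattern, e.g. NNNNNNNNNN
    # Extract the UMI into the read names, once for each non-UMI read file
    for r in R1 R2; do
        echo "umi_tools extract --bc-pattern=$pattern -I ${s}_UMI.fastq.gz --read2-in=${s}_${r}.fastq.gz --read2-stdout | gzip > ${s}_${r}.extracted.fastq.gz"
    done
    # Map with STAR, writing a coordinate-sorted BAM
    echo "STAR --genomeDir $genome --readFilesIn ${s}_R1.extracted.fastq.gz ${s}_R2.extracted.fastq.gz --readFilesCommand zcat --outSAMtype BAM SortedByCoordinate --outFileNamePrefix ${s}."
    # Index, then deduplicate on UMI plus mapping position
    echo "samtools index ${s}.Aligned.sortedByCoord.out.bam"
    echo "umi_tools dedup -I ${s}.Aligned.sortedByCoord.out.bam -S ${s}.dedup.bam --paired"
}

print_umi_workflow sampleA /path/to/star_index NNNNNNNNNN
```

Redirecting the output to a file gives a script you can inspect and then run, or split across jobs on a cluster.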

modified 5 months ago • written 5 months ago by i.sudbery7.8k

Thanks Ian - would that be sorted by coordinate? (asking because the fgbio workflow seems to require re-sorting by Queryname)

written 5 months ago by steveh20

Yes, sorted by coordinate.
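A quick way to confirm the sort order before running dedup is to look at the @HD line of the BAM header. This is a minimal sketch; the function name is_coordinate_sorted is made up, and it assumes the tool that wrote the BAM recorded the sort order in the header, which is not guaranteed for every tool.

```shell
# Reads SAM header text on stdin; succeeds if @HD declares SO:coordinate.
is_coordinate_sorted() {
    grep -q '^@HD.*SO:coordinate'
}

# Usage (file name is a placeholder):
#   samtools view -H mapped_reads.bam | is_coordinate_sorted && echo "coordinate sorted"
```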

written 5 months ago by i.sudbery7.8k

thanks - and for the dedup step, do I need the --paired option or is that assumed?

written 5 months ago by steveh20

Ooops. Yes, you will need the paired option, I'll edit the post.

modified 5 months ago • written 5 months ago by i.sudbery7.8k

great, thanks so much for taking the time to answer at this time of year!

written 5 months ago by steveh20

Just to update after lots of testing - this is the method I settled on, although adding the UMIs to the already-aligned BAMs and then using umi_tools dedup also works fine.

I don't recommend the method mentioned in my original post, using fgbio.

written 4 months ago by steveh20

Not to be pedantic but "this" meaning the method/answer suggested by @i.sudbery above? If so I can move that comment to an answer, which you can then accept to provide closure to this thread.

modified 4 months ago • written 4 months ago by genomax84k

yes that's correct, the @i.sudbery answer. The general pointer to umi_tools is also useful, but Ian's answer is very specific.

written 4 months ago by steveh20

You are able to accept more than one answer. Ian's comment has been moved to an answer now.

written 4 months ago by genomax84k
padwalmk90 wrote:

Hi, check what fraction of your reads the UMIs identify as duplicates. If it's less than 1 or 2% of total reads, you don't need to worry at all. But if it's more than 10%, then you have to do something about it.
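One way to put a number on this, assuming the percentage is read as the share of reads that deduplication removes, is to compare read counts before and after umi_tools dedup. The function name dup_percent is made up for illustration. Note that samtools view -c counts alignment records, so multimapped reads with several records are counted more than once.

```shell
# Percentage of reads removed by deduplication, given counts before and after.
dup_percent() {
    awk -v before="$1" -v after="$2" \
        'BEGIN { printf "%.1f\n", (before - after) / before * 100 }'
}

# Usage (file names are placeholders):
#   dup_percent "$(samtools view -c mapped_reads.bam)" \
#               "$(samtools view -c deduplicated_reads.bam)"
```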

written 5 months ago by padwalmk90