Question

Downsampling UMI reads

4.2 years ago

Bedirhan • 0

Hi,

I am currently reading this paper (https://www.ncbi.nlm.nih.gov/pubmed/30214446) and am using the same protocol to build a bioinformatics pipeline to look at T cell clonality. I am unsure how they downsampled the UMI reads.

"To control this over-sequencing error in the UMI and CDR3 sequences, we randomly discard the reads until the remaining reads contain about 8 reads per UMI."

I have used umi-tools to extract the UMI information, but I am unsure how to approach this step. My understanding is that they performed the downsampling on the FASTQ files, not on the mapped reads.

Any help or suggestions are appreciated.

Thank you

RNA-Seq UMI next-gen • 2.2k views

ADD COMMENT • link updated 4.2 years ago by Kevin Blighe 87k • written 4.2 years ago by Bedirhan • 0

So what is the actual question? Do you need a tool for downsampling fastq?

ADD REPLY • link 4.2 years ago by ATpoint 82k

Yes, I need to downsample fastq files based on UMI. I couldn't find any tools out there to do it.

ADD REPLY • link 4.2 years ago by Bedirhan • 0

I do not think they used a dedicated tool. More likely they counted the average number of reads per UMI in the full dataset and then randomly downsampled the total reads by whatever fraction brings that average to roughly the target (about 8 reads per UMI). For the random downsampling itself, see the thread "Downsampling dataset with more than 60 million reads".
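To make that concrete, here is a minimal Python sketch of the idea, not the paper's actual code. It assumes the UMI is the text after the last underscore in the read name (which is where umi-tools extract appends it), and the function names `read_fastq`, `umi_of`, `sampling_fraction` and `downsample` are made up for illustration. It computes the mean number of reads per UMI, derives the fraction of reads to keep, and then keeps each read at random with that probability.

```python
import random
from collections import Counter
from itertools import islice


def read_fastq(lines):
    """Yield (header, seq, plus, qual) records from an iterable of FASTQ lines."""
    it = iter(lines)
    while True:
        record = list(islice(it, 4))
        if len(record) < 4:
            return
        yield tuple(line.rstrip("\n") for line in record)


def umi_of(header):
    """Return the UMI, ASSUMED to be the text after the last '_' in the read name."""
    read_name = header.split()[0]
    return read_name.rsplit("_", 1)[-1]


def sampling_fraction(records, target_reads_per_umi=8):
    """Fraction of reads to keep so the mean reads per UMI is about the target."""
    counts = Counter(umi_of(header) for header, _, _, _ in records)
    if not counts:
        return 1.0
    mean_reads_per_umi = sum(counts.values()) / len(counts)
    # Never try to keep more than everything.
    return min(1.0, target_reads_per_umi / mean_reads_per_umi)


def downsample(records, fraction, seed=0):
    """Keep each read independently with probability `fraction` (seeded for reproducibility)."""
    rng = random.Random(seed)
    return [rec for rec in records if rng.random() < fraction]
```

The result is only approximate: UMIs that lose all of their reads drop out of the count, so the mean over the surviving UMIs can end up somewhat above the target. For large files the fraction can also be computed this way and then handed to a general-purpose FASTQ subsampler instead of filtering in Python.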

ADD REPLY • link 4.2 years ago by ATpoint 82k

Thank you for the explanation, I will try out seqtk mentioned in the link.

ADD REPLY • link 4.2 years ago by Bedirhan • 0

score 2 · Answer 1 · 2020-01-29

reformat.sh from the BBMap suite also has downsampling options.

Sampling parameters:

reads=-1                Set to a positive number to only process this many INPUT reads (or pairs), then quit.
skipreads=-1            Skip (discard) this many INPUT reads before processing the rest.
samplerate=1            Randomly output only this fraction of reads; 1 means sampling is disabled.
sampleseed=-1           Set to a positive number to use that prng seed for sampling (allowing deterministic sampling).
samplereadstarget=0     (srt) Exact number of OUTPUT reads (or pairs) desired.
samplebasestarget=0     (sbt) Exact number of OUTPUT bases desired.
                        Important: srt/sbt flags should not be used with stdin, samplerate, qtrim, minlength, or minavgquality.
upsample=f              Allow srt/sbt to upsample (duplicate reads) when the target is greater than input.
prioritizelength=f      If true, calculate a length threshold to reach the target, and retain all reads of at least that length (must set srt or sbt).

I doubt there is a tool that can downsample taking into account the UMIs.
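If a UMI-aware variant is really wanted, it is not hard to script. The sketch below is a different technique from the random discarding quoted from the paper: instead of aiming for an average of about 8 reads per UMI, it caps every UMI at a maximum number of randomly chosen reads, using one reservoir sample per UMI. The function name `cap_reads_per_umi` and the `(umi, read)` input format are assumptions for illustration only.

```python
import random
from collections import defaultdict


def cap_reads_per_umi(records, max_reads=8, seed=0):
    """Keep at most `max_reads` randomly chosen reads for every UMI.

    `records` is an iterable of (umi, read) pairs. One reservoir
    (Algorithm R) is maintained per UMI, so the input is read only once.
    """
    rng = random.Random(seed)
    reservoirs = defaultdict(list)  # umi -> reads currently kept
    seen = defaultdict(int)         # umi -> number of reads seen so far
    for umi, read in records:
        seen[umi] += 1
        bucket = reservoirs[umi]
        if len(bucket) < max_reads:
            # Reservoir not full yet: always keep the read.
            bucket.append(read)
        else:
            # Replace a kept read with probability max_reads / seen[umi].
            j = rng.randrange(seen[umi])
            if j < max_reads:
                bucket[j] = read
    return dict(reservoirs)
```

UMIs with fewer reads than the cap keep all of them, so this changes the read-count distribution differently from uniform random downsampling; whether that is acceptable depends on the downstream analysis.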