Downsampling UMI reads
4.2 years ago
Bedirhan • 0

Hi,

I am currently reading this paper (https://www.ncbi.nlm.nih.gov/pubmed/30214446) and am using the same protocol to build a bioinformatics pipeline to look at T cell clonality. I am unsure how they downsampled the UMI reads.

"To control this over-sequencing error in the UMI and CDR3 sequences, we randomly discard the reads until the remaining reads contain about 8 reads per UMI."

I have used UMI-tools to extract the UMI information, but I am unsure how to approach this step. My understanding is that they downsampled the FASTQ files rather than the mapped reads.

Any help or suggestions are appreciated.

Thank you

RNA-Seq UMI next-gen • 2.2k views

So what is the actual question? Do you need a tool for downsampling fastq?


Yes, I need to downsample FASTQ files based on UMI. I couldn't find any tool that does this.


I do not think they used a dedicated tool. More likely they counted the average number of reads per UMI in the full dataset and then randomly downsampled the total reads by whatever fraction roughly gives the expected number. For tools that do the read-level downsampling, see the thread "Downsampling dataset with more than 60 million reads".
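As a rough illustration of that estimate, here is a minimal Python sketch. It assumes the UMI has already been moved into the read name as the last "_"-separated field (the layout I would expect after UMI-tools extract, but check your own headers), and the function names are made up for this example. The resulting fraction could then be passed to a generic read sampler such as seqtk sample.

```python
from collections import Counter


def umi_from_header(header):
    # Assumption: the UMI is the last "_"-separated field of the read ID,
    # e.g. "@READ1_ACGTACGT 1:N:0" -> "ACGTACGT".
    return header.split()[0].rsplit("_", 1)[-1]


def mean_reads_per_umi(headers):
    # Count reads per UMI and return the average over all observed UMIs.
    counts = Counter(umi_from_header(h) for h in headers)
    return sum(counts.values()) / len(counts) if counts else 0.0


def sampling_fraction(mean_per_umi, target=8.0):
    # Fraction of reads to keep so the average drops to roughly `target`.
    # Returns 1.0 (keep everything) if the data is already at or below it.
    if mean_per_umi <= target:
        return 1.0
    return target / mean_per_umi
```

Note that this is only approximate: uniform subsampling removes some low-count UMIs entirely, so the realised mean among the surviving UMIs tends to end up somewhat above the target.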


Thank you for the explanation, I will try out seqtk mentioned in the link.

4.2 years ago
GenoMax 141k

reformat.sh from the BBMap suite also has downsampling options.

Sampling parameters:

reads=-1                Set to a positive number to only process this many INPUT reads (or pairs), then quit.
skipreads=-1            Skip (discard) this many INPUT reads before processing the rest.
samplerate=1            Randomly output only this fraction of reads; 1 means sampling is disabled.
sampleseed=-1           Set to a positive number to use that prng seed for sampling (allowing deterministic sampling).
samplereadstarget=0     (srt) Exact number of OUTPUT reads (or pairs) desired.
samplebasestarget=0     (sbt) Exact number of OUTPUT bases desired.
                        Important: srt/sbt flags should not be used with stdin, samplerate, qtrim, minlength, or minavgquality.
upsample=f              Allow srt/sbt to upsample (duplicate reads) when the target is greater than input.
prioritizelength=f      If true, calculate a length threshold to reach the target, and retain all reads of at least that length (must set srt or sbt).

I doubt there is a tool that can downsample taking into account the UMIs.


Agreed, it would have to be a custom script used in conjunction with UMI-tools and/or BBMap. The giveaway word in the methods is 'about':

"To control this over-sequencing error in the UMI and CDR3 sequences, we randomly discard the reads until the remaining reads contain about 8 reads per UMI."

So, there's nothing exact about what they are doing.
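For what it is worth, a custom script along these lines could look like the sketch below. This is one reading of the quoted sentence, not the authors' actual method. It assumes the UMI is the last "_"-separated field of the read ID and that the FASTQ fits in memory; all function names are hypothetical.

```python
import random
from collections import Counter


def umi_from_header(header):
    # Assumption: the UMI is the last "_"-separated field of the read ID.
    return header.split()[0].rsplit("_", 1)[-1]


def read_fastq(lines):
    # Group an iterable of FASTQ lines into 4-line records (tuples).
    stripped = [ln.rstrip("\n") for ln in lines]
    return [tuple(stripped[i:i + 4]) for i in range(0, len(stripped) - 3, 4)]


def downsample_by_umi(records, target=8, seed=0):
    # Discard reads in random order until reads / distinct UMIs <= target.
    rng = random.Random(seed)
    counts = Counter(umi_from_header(rec[0]) for rec in records)
    total = len(records)
    n_umis = len(counts)
    order = list(range(total))
    rng.shuffle(order)
    dropped = set()
    for idx in order:
        if n_umis == 0 or total <= target * n_umis:
            break  # mean reads per remaining UMI has reached the target
        umi = umi_from_header(records[idx][0])
        counts[umi] -= 1
        if counts[umi] == 0:
            n_umis -= 1  # this UMI has no reads left
        total -= 1
        dropped.add(idx)
    # Keep the surviving records in their original file order.
    return [rec for i, rec in enumerate(records) if i not in dropped]
```

Usage would be something like `kept = downsample_by_umi(read_fastq(open("extracted.fastq")))`, where `extracted.fastq` is a placeholder file name. For paired-end data the same indices would have to be dropped from both files.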
