Question: How to down-sample a full dataset
2.2 years ago by
XBria60 wrote:


I need to analyze down-sampled versions of a couple of full RNA-Seq datasets (samples, paired-end, FASTQ). The sub-sampling method should work the same way for all samples. In the end I want to compare, for example, how 5% of the full data differs from 10%, 20% and 40% of it (one sample is ERR188044). The final graph will show how the amount of data affects the result.

The question is: how do I obtain the data in these four forms? Should I first download the full data and then down-sample, or can I download down-sampled data directly? How do I sub-sample the data to keep only a few specific chromosomes? How do I sub-sample the data to keep only a given percentage of all paired-end reads?

What do you suggest I do?

Your advice is appreciated.


rna-seq • 1.7k views
2.2 years ago by
Devon Ryan94k
Freiburg, Germany
Devon Ryan94k wrote:

FYI, the term you're looking for to describe this is a "rarefaction curve".

Yes, you will need to download the full dataset and then subsample it (typically a few times) for each percentage you want to plot on the X axis. You can use things like seqtk or even samtools view to subsample files. Depending on what you need to do, aligning the whole dataset once and then subsampling that will likely turn out to be the quickest strategy.
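
For readers who want to see what fraction-based subsampling of paired FASTQ files amounts to, here is a minimal pure-Python sketch. It is not how seqtk or samtools work internally; the function names and file paths are made up for illustration, and it assumes uncompressed, four-line FASTQ records with both mates in the same order in the two files.

```python
import random
from itertools import islice


def read_fastq(handle):
    """Yield one FASTQ record (4 lines) at a time as a tuple of strings."""
    while True:
        record = tuple(islice(handle, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record


def subsample_pairs(r1_in, r2_in, r1_out, r2_out, fraction, seed):
    """Keep each read pair with probability `fraction`; return (kept, total)."""
    rng = random.Random(seed)  # fixed seed -> reproducible subsample
    kept = total = 0
    with open(r1_in) as f1, open(r2_in) as f2, \
            open(r1_out, "w") as o1, open(r2_out, "w") as o2:
        # One random draw decides the fate of both mates, so pairs stay in sync.
        for rec1, rec2 in zip(read_fastq(f1), read_fastq(f2)):
            total += 1
            if rng.random() < fraction:
                o1.writelines(rec1)
                o2.writelines(rec2)
                kept += 1
    return kept, total
```

Calling it with `fraction=0.05`, `0.10`, `0.20` and `0.40` would give the four depths from the question. The number of pairs kept is only approximately `fraction` times the total, because each pair is kept or dropped independently.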


Unless you are doing a 2-pass alignment, I'd say that reads are aligned independently. Wouldn't it then be easier/more efficient to downsample the read counts table? See for example subsample.
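
To make the idea of downsampling a counts table concrete, here is a rough Python sketch. It is not the subsample tool mentioned above; the function name and the gene names in the usage example are invented. It treats every counted read as equally likely to be drawn and samples without replacement.

```python
import random
from bisect import bisect_right
from itertools import accumulate


def downsample_counts(counts, target, seed):
    """Draw `target` reads without replacement from a gene -> count mapping."""
    genes = list(counts)
    cumulative = list(accumulate(counts[g] for g in genes))
    total = cumulative[-1] if cumulative else 0
    if target > total:
        raise ValueError("target exceeds the total number of counted reads")
    rng = random.Random(seed)
    sampled = dict.fromkeys(genes, 0)
    # Pick `target` distinct read indices in [0, total) and map each index
    # back to the gene whose cumulative range contains it.
    for idx in rng.sample(range(total), target):
        sampled[genes[bisect_right(cumulative, idx)]] += 1
    return sampled
```

For example, `downsample_counts({"geneA": 600, "geneB": 0, "geneC": 400}, 100, seed=7)` returns a table whose counts sum to 100. This only makes sense if the counts are all you need downstream, as the reply below points out.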

written 2.2 years ago by WouterDeCoster43k

If you just need counts, then yes. It's not clear to me that that's exactly what's going on here, though.

written 2.2 years ago by Devon Ryan94k

Thanks, Devon! Could you please explain what you mean by subsampling "typically a few times"? I understand that it is better to subsample several times for robustness, but I don't know how many times is enough, or how to do it with seqtk. For instance, if I need to downsample 10 M PE reads to 2 M PE reads, should I subsample 500,000 PE reads, say, 4 times and then merge them? But then I have a problem: with seqtk this would lead to repeated reads, because every time I subsample randomly from the same original file. Could you recommend anything to look into to help me decide? Thank you!

written 4 weeks ago by dhlsl0

2 or 3 times per read number should be fine to produce a smooth enough curve. So if you start with 10 million reads, then produce 2 or 3 datasets each of 1, 3, 5 and 7 million reads. You can just rerun seqtk with a different seed each time, since otherwise you'll end up with the same subsampled reads again and again.
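
As a sketch of that replicate scheme, the small Python helper below just builds the `seqtk sample` command lines. The helper name, the output naming pattern and the starting seed are assumptions for illustration. The point it shows is that both mates of a replicate use the same `-s` seed so the pairs stay matched, while each replicate gets a new seed.

```python
def seqtk_commands(r1, r2, depths, replicates, first_seed=100):
    """Return shell command strings for `seqtk sample`, one pair per replicate."""
    commands = []
    seed = first_seed
    for depth in depths:
        for rep in range(1, replicates + 1):
            for mate, path in (("R1", r1), ("R2", r2)):
                out = f"sub_{depth}_rep{rep}_{mate}.fastq"
                # Same seed for both mates keeps pairs matched;
                # a new seed per replicate gives a different random subset.
                commands.append(f"seqtk sample -s{seed} {path} {depth} > {out}")
            seed += 1
    return commands


if __name__ == "__main__":
    # Hypothetical input file names; 1, 3, 5 and 7 million reads, 3 replicates each.
    depths = [1_000_000, 3_000_000, 5_000_000, 7_000_000]
    for cmd in seqtk_commands("full_R1.fastq", "full_R2.fastq", depths, 3):
        print(cmd)
```

Each replicate is drawn independently from the full file, so there is no need to merge smaller subsamples, and no read is duplicated within a replicate.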

written 4 weeks ago by Devon Ryan94k
2.2 years ago by
United States
genomax80k wrote:

You can use a tool from the BBMap suite to down-sample data.

Sampling parameters:

reads=-1                Set to a positive number to only process this many INPUT reads (or pairs), then quit.
skipreads=-1            Skip (discard) this many INPUT reads before processing the rest.
samplerate=1            Randomly output only this fraction of reads; 1 means sampling is disabled.
sampleseed=-1           Set to a positive number to use that prng seed for sampling (allowing deterministic sampling).
samplereadstarget=0     (srt) Exact number of OUTPUT reads (or pairs) desired.
samplebasestarget=0     (sbt) Exact number of OUTPUT bases desired.

@Brian also has a tool that plots library uniqueness as you add more reads.

2.2 years ago by
Alex Reynolds29k
Seattle, WA USA
Alex Reynolds29k wrote:

You can use sample (via Github) to downsample a FASTQ file, e.g.:

$ sample -l 4 -k 1234 -o reads.fastq > randomSample.fastq
  • The -l 4 option groups every four lines into one record for sampling. If you have paired reads that follow each other, you can use -l 8, to sample every eight lines that form a pair of reads.
  • The -k 1234 option specifies the number of samples to draw: replace 1234 with the number of samples you want to draw.
  • The -o option samples without replacement; replace with -r to sample with replacement.

You can run sample --help to see a full description of all options.

This tool has the advantage that it can draw random samples from very large, whole-genome scale files. Many other tools will run into memory problems.
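
If you want to see the general idea of drawing a fixed number of records from a file too large to load, here is a small reservoir-sampling sketch in Python. This is not how sample itself is implemented; it is a different, textbook technique, and the function name is made up. It assumes the file's line count is a multiple of `lines_per_record` (4 for single FASTQ reads, 8 for interleaved pairs, mirroring the -l option above).

```python
import random
from itertools import islice


def reservoir_sample(path, k, lines_per_record=4, seed=None):
    """Return k records drawn uniformly without replacement from `path`."""
    rng = random.Random(seed)
    reservoir = []
    with open(path) as handle:
        seen = 0
        while True:
            record = list(islice(handle, lines_per_record))
            if not record:
                break
            seen += 1
            if len(reservoir) < k:
                reservoir.append(record)
            else:
                # Replace a stored record with probability k / seen.
                j = rng.randrange(seen)
                if j < k:
                    reservoir[j] = record
    return reservoir
```

Only the `k` selected records are held in memory at any time, so memory use depends on the sample size rather than the file size. The returned records are not in their original file order.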
