Question

How to down-sample a full dataset

2

6.3 years ago

XBria ▴ 90

Hi,

I need to analyze down-sampled versions of a couple of full RNA-Seq datasets (paired-end FASTQ samples). The sub-sampling method should work the same way for all samples. In the end I want to compare, for example, how 5% of the full data differs from 10%, 20%, and 40% of it (one example sample is ERR188044). The final graph will show how the amount of data affects the result.

My questions are: How do I obtain the data in these four forms? Should I first download the full data and then down-sample it, or can I download down-sampled data directly? How do I sub-sample the data to keep only a few specific chromosomes? How do I sub-sample the data to keep only a given percentage of all paired-end reads?

What do you suggest I do?

Your advice is appreciated.

Thanks.

RNA-Seq • 5.0k views

ADD COMMENT • link updated 6.3 years ago by Alex Reynolds 35k • written 6.3 years ago by XBria ▴ 90

3

6.3 years ago

GenoMax 141k

You can use reformat.sh from the BBMap suite to down-sample data.

Sampling parameters:

reads=-1                Set to a positive number to only process this many INPUT reads (or pairs), then quit.
skipreads=-1            Skip (discard) this many INPUT reads before processing the rest.
samplerate=1            Randomly output only this fraction of reads; 1 means sampling is disabled.
sampleseed=-1           Set to a positive number to use that prng seed for sampling (allowing deterministic sampling).
samplereadstarget=0     (srt) Exact number of OUTPUT reads (or pairs) desired.
samplebasestarget=0     (sbt) Exact number of OUTPUT bases desired.

@Brian also has a tool that plots library uniqueness as you add more reads.
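As a rough illustration of how the samplerate and sampleseed parameters listed above could be combined for the four fractions in the question, here is a small Python sketch that only builds the reformat.sh command lines (it does not run them). The input and output file names are assumptions based on the ERR188044 accession, and the seed value is arbitrary.

```python
import shlex


def reformat_commands(r1, r2, fractions, seed=42):
    """Return one reformat.sh command line per sampling fraction.

    The output file names (subNN_1.fastq.gz, subNN_2.fastq.gz) are
    hypothetical; change them to suit your own layout.
    """
    commands = []
    for frac in fractions:
        pct = round(frac * 100)
        args = [
            "reformat.sh",
            f"in={r1}",
            f"in2={r2}",
            f"out=sub{pct}_1.fastq.gz",
            f"out2=sub{pct}_2.fastq.gz",
            f"samplerate={frac}",
            # A fixed positive seed makes the sampling deterministic.
            f"sampleseed={seed}",
        ]
        commands.append(" ".join(shlex.quote(a) for a in args))
    return commands


# Assumed file names for the paired-end sample mentioned in the question.
for cmd in reformat_commands(
    "ERR188044_1.fastq.gz", "ERR188044_2.fastq.gz", [0.05, 0.10, 0.20, 0.40]
):
    print(cmd)
```

Giving both mates to a single reformat.sh call via in= and in2= should keep the pairs in sync, but it is worth checking the read counts of the two output files after a run.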

ADD COMMENT • link 6.3 years ago by GenoMax 141k

2

6.3 years ago

Alex Reynolds 35k

You can use sample (via Github) to downsample a FASTQ file, e.g.:

$ sample -l 4 -k 1234 -o reads.fastq > randomSample.fastq

The -l 4 option groups every four lines into one record for sampling. If you have paired reads that follow each other, you can use -l 8, to sample every eight lines that form a pair of reads.
The -k 1234 option specifies the number of samples to draw: replace 1234 with the number of samples you want to draw.
The -o option samples without replacement; replace with -r to sample with replacement.

You can run sample --help to see a full description of all options.

This tool has the advantage that it can draw random samples from very large, whole-genome scale files. Many other tools will run into memory problems.
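To make the idea of sampling whole records (groups of four or eight lines) without holding the full file in memory more concrete, here is a minimal Python sketch. It uses reservoir sampling, which is a stand-in technique and not necessarily how the sample tool works internally; the function name and the synthetic reads are made up for illustration.

```python
import random


def reservoir_sample_records(lines, k, lines_per_record=4, seed=None):
    """Draw k records without replacement from an iterable of lines.

    Only k records are kept in memory at any time. A trailing,
    incomplete record (fewer than lines_per_record lines) is ignored.
    """
    rng = random.Random(seed)
    reservoir = []
    record = []
    n_seen = 0
    for line in lines:
        record.append(line)
        if len(record) < lines_per_record:
            continue
        n_seen += 1
        if len(reservoir) < k:
            # Fill the reservoir with the first k records.
            reservoir.append(record)
        else:
            # Replace a random slot with probability k / n_seen.
            j = rng.randrange(n_seen)
            if j < k:
                reservoir[j] = record
        record = []
    return reservoir


# Synthetic FASTQ with 10 single-end reads (4 lines per record).
fastq_lines = []
for i in range(10):
    fastq_lines += [f"@read{i}\n", "ACGT\n", "+\n", "IIII\n"]

picked = reservoir_sample_records(fastq_lines, k=3, lines_per_record=4, seed=1)
for rec in picked:
    print("".join(rec), end="")
```

For a real file, pass an open file handle as lines. With interleaved paired reads, lines_per_record=8 plays the same role as -l 8 above. Note that the output order is not the input order.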

ADD COMMENT • link 6.3 years ago by Alex Reynolds 35k

score 4 · Accepted Answer · 2018-01-15

4

6.3 years ago

Devon Ryan 104k

FYI, the term you're looking for to describe this is a "rarefaction curve".

Yes, you will need to download the full dataset and then subsample it (typically a few times) for each percentage you want to plot on the X axis. You can use things like seqtk or even samtools view to subsample files. Depending on what you need to do, aligning the whole dataset once and then subsampling that will likely turn out to be the quickest strategy.
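For the align-once-then-subsample route, samtools view takes a -s value whose integer part is the seed and whose fractional part is the fraction of reads to keep. The Python sketch below only builds that argument and the command line; the BAM file names are placeholders, and the helper names are invented for this example.

```python
def samtools_subsample_arg(seed, fraction):
    """Format the value for `samtools view -s` as SEED.FRACTION."""
    rounded = round(fraction, 4)
    if not 0 < rounded < 1:
        raise ValueError("fraction must be strictly between 0 and 1")
    # Keep four decimal places, e.g. 0.05 -> "0500", giving "42.0500".
    frac_digits = f"{rounded:.4f}".split(".")[1]
    return f"{int(seed)}.{frac_digits}"


def samtools_subsample_command(in_bam, out_bam, seed, fraction):
    """Build (but do not run) a samtools command for one subsample."""
    return [
        "samtools", "view", "-b",
        "-s", samtools_subsample_arg(seed, fraction),
        "-o", out_bam,
        in_bam,
    ]


# Placeholder file names; one command per fraction from the question.
for frac in (0.05, 0.10, 0.20, 0.40):
    out = f"sub{round(frac * 100)}.bam"
    print(" ".join(samtools_subsample_command("full.bam", out, 42, frac)))
```

As far as I know, samtools applies the same keep-or-drop decision to both mates of a pair, so pairs should stay intact, but that is worth verifying on your own data.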

ADD COMMENT • link 6.3 years ago by Devon Ryan 104k

1

Unless you are doing a 2-pass alignment, I'd say that reads are aligned independently. Wouldn't it then be easier and more efficient to downsample the read counts table? See, for example, subsample.
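A minimal sketch of what downsampling a counts table could look like: each read counted for a gene is kept independently with probability equal to the target fraction (binomial thinning). This is a generic stand-in, not the method of the subsample tool mentioned above, and the gene names and counts are invented.

```python
import random


def thin_counts(counts, fraction, seed=None):
    """Keep each counted read independently with probability `fraction`.

    `counts` maps gene -> integer read count. This pure-Python loop is
    slow for very deep libraries; a vectorised binomial draw (for
    example with NumPy) would be the usual choice for real data.
    """
    if not 0 <= fraction <= 1:
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)
    return {
        gene: sum(rng.random() < fraction for _ in range(n))
        for gene, n in counts.items()
    }


# Invented counts for three hypothetical genes.
full = {"geneA": 1000, "geneB": 50, "geneC": 0}
for frac in (0.05, 0.10, 0.20, 0.40):
    print(frac, thin_counts(full, frac, seed=1))
```

This only applies when the downstream analysis needs counts alone, and it assumes each read's alignment does not depend on the other reads.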

ADD REPLY • link 6.3 years ago by WouterDeCoster 47k

0

If you just need counts, then yes. It's not clear to me that that's exactly what's going on here, though.

ADD REPLY • link 6.3 years ago by Devon Ryan 104k

0

Thanks, Devon! Could you please explain what you mean by subsampling "typically a few times"? I understand that subsampling several times makes the result more robust, but I don't know how many times is enough, or how to do it with seqtk. For instance, if I need to downsample 10 M PE reads to 2 M PE reads, should I subsample 500,000 PE reads, say, 4 times and then merge them? But then I have a problem: with seqtk this would lead to repeated reads, because every time I would be subsampling randomly from the same original file. Could you recommend anything to look into to help me decide? Thank you!

ADD REPLY • link 4.2 years ago by dhlsl • 0

0

2 or 3 times per read number should be fine to produce a smooth enough curve. So if you start with 10 million reads, then produce 2 or 3 datasets each of 1, 3, 5 and 7 million reads. You can just rerun seqtk with a different seed each time, since otherwise you'll end up with the same subsampled reads again and again.
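To spell that out, here is a small Python sketch that prints one seqtk sample command per depth, replicate, and mate. It gives every replicate its own seed but uses the same seed for both mates of a pair so that the pairs stay matched. The file names and the starting seed are assumptions, and the commands are only printed, not executed.

```python
def seqtk_plan(r1, r2, depths, n_replicates=3, base_seed=100):
    """Return seqtk commands for replicated paired-end subsampling.

    Each (depth, replicate) combination gets a unique seed; both mate
    files of that combination share it so the same reads are drawn.
    """
    commands = []
    seed = base_seed
    for depth in depths:
        for rep in range(1, n_replicates + 1):
            tag = f"{depth // 1_000_000}M_rep{rep}"
            for mate, path in (("1", r1), ("2", r2)):
                commands.append(
                    f"seqtk sample -s{seed} {path} {depth} "
                    f"> sub_{tag}_{mate}.fastq"
                )
            seed += 1
    return commands


# Hypothetical input names; depths follow the 1, 3, 5 and 7 million example.
plan = seqtk_plan(
    "reads_1.fastq.gz",
    "reads_2.fastq.gz",
    [1_000_000, 3_000_000, 5_000_000, 7_000_000],
)
for cmd in plan:
    print(cmd)
```

Each replicate is drawn directly from the full file at the target depth, so there is no need to merge several smaller subsamples.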

ADD REPLY • link 4.2 years ago by Devon Ryan 104k