Subsetting synthetic reads to different total read counts
Jonathan ▴ 20 • 23 months ago

I have generated synthetic reads using wgsim as follows:

wgsim -1 100 -2 100 -r 0 -R 0 -X 0 -e 0 GCA_900186885_1.fa reads_1.fastq reads_2.fastq

This created paired-end reads of length 100 bp with no errors, mutations, or indels. I would now like to subset those reads so as to produce reads with variable levels of coverage (i.e., depth of coverage). In other words, I would like some reads to have 1x coverage, some to have 2x, some 3x and so on until 30x coverage. I would also like to randomly choose which reads get which level of coverage.

My ultimate goal is to produce a synthetic set of reads with variable levels of coverage that should, in theory, map perfectly to the reference genome from which they were derived. I have searched far and wide, to no avail, for a program that can do the kind of subsetting I described. Does anyone know of some simple bash, R or python coding that could pull this off? Any insight would be much appreciated.
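For reference, the arithmetic relating read count to depth is simple. A sketch in bash (the genome length below is a placeholder, not the actual length of GCA_900186885):

# Placeholder values; substitute the real genome length.
genome_len=5000000    # assumed genome length in bp (placeholder)
read_len=100
depth=30
# depth = pairs * 2 * read_len / genome_len, so:
pairs=$(( depth * genome_len / (2 * read_len) ))
echo "${pairs} pairs give ${depth}x"    # 750000 pairs for the values above
# wgsim can generate exactly this many pairs with its -N option:
# wgsim -N "$pairs" -1 100 -2 100 -e 0 -r 0 -R 0 -X 0 GCA_900186885_1.fa reads_1.fastq reads_2.fastq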

python subsetting R fastq bash • 982 views

Map your reads to a BAM file, get the coverage for the raw BAM, and then run samtools view -s according to your needs: http://www.htslib.org/doc/samtools-view.html

--subsample FLOAT

    Output only a proportion of the input alignments, as specified by 0.0 ≤ FLOAT ≤ 1.0, which gives the fraction of templates/pairs to be kept. This subsampling acts in the same way on all of the alignment records in the same template or read pair, so it never keeps a read but not its mate. 
--subsample-seed INT

    Subsampling seed used to influence which subset of reads is kept. When subsampling data that has previously been subsampled, be sure to use a different seed value from those used previously; otherwise more reads will be retained than expected. [0] 
-s FLOAT

    Subsampling shorthand option: -s INT.FRAC is equivalent to --subsample-seed INT --subsample 0.FRAC.
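
For example, to keep roughly a quarter of the read pairs with a fixed seed (the fraction and seed here are arbitrary):

# seed 42, fraction 0.25; -b writes BAM output
samtools view -b -s 42.25 raw.bam > subsampled.bam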

I'm not sure mapping before subsetting will work if unmapped reads get thrown out. I'd like to avoid that step if at all possible. Surely there's an easier way to produce synthetic reads with variable (known) levels of coverage.

GenoMax 142k • 23 months ago

Use reformat.sh from the BBMap suite with the sampling options.

reads=-1                Set to a positive number to only process this many INPUT reads (or pairs), then quit.
skipreads=-1            Skip (discard) this many INPUT reads before processing the rest.
samplerate=1            Randomly output only this fraction of reads; 1 means sampling is disabled.
sampleseed=-1           Set to a positive number to use that prng seed for sampling (allowing deterministic sampling).
samplereadstarget=0     (srt) Exact number of OUTPUT reads (or pairs) desired.
samplebasestarget=0     (sbt) Exact number of OUTPUT bases desired.
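
A minimal sketch of one sampling run (file names and the target are placeholders):

# Randomly keep exactly 100000 reads (pairs, for paired input), reproducibly.
reformat.sh in=reads_1.fastq in2=reads_2.fastq \
    out=sub_1.fastq out2=sub_2.fastq \
    samplereadstarget=100000 sampleseed=42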

I like this idea, but I suspect an issue with resampling. Perhaps I am missing something. This is what I did to produce reads with 30x coverage. Note that there are 1000000 reads in each fastq file. I used the reads=31250 parameter to restrict the input to a fraction of the original fastq file, then the samplereadstarget=937500 parameter to output new fastq files with 30 times the number of input reads, and finally the upsample=TRUE parameter to duplicate reads when the target is greater than the input.

reformat.sh in=$read1 in2=$read2 out=$OUT out2=$OUT2 reads=31250 samplereadstarget=937500 upsample=TRUE

I could then vary the value of the samplereadstarget parameter to produce fastq files with different levels of coverage (which I would concatenate in the end), but there is no way to avoid resampling the same input reads each time I repeat this.

Am I making this far too complicated?
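
One workaround I'm considering, sketched below: use skipreads/reads to walk through the input in non-overlapping slices, so no read is sampled twice (assuming skipreads= and reads= count pairs for paired input, as the help text suggests; the 31250-pair slice size is just carried over from the command above). 30 slices of 31250 pairs need 937500 pairs, which fits within the 1000000 available.

# Each depth gets its own disjoint slice of the input, upsampled to depth * slice.
slice=31250
skip=0
for depth in $(seq 1 30); do
    reformat.sh in=reads_1.fastq in2=reads_2.fastq \
        out=cov${depth}_1.fastq out2=cov${depth}_2.fastq \
        skipreads=$skip reads=$slice \
        samplereadstarget=$(( depth * slice )) upsample=TRUE
    skip=$(( skip + slice ))
done
# cat cov*_1.fastq > all_1.fastq; cat cov*_2.fastq > all_2.fastq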

Take a look at bbnorm.sh from the BBMap suite; it can normalize reads to a target depth of coverage.

This looks promising, thank you! I'm going to try the following command to reduce areas with coverage above the target down to the target, while leaving areas below the target untouched.

bbnorm.sh in=$IN1 in2=$IN2 out=$OUT1 out2=$OUT2 target=30 mindepth=1

In theory this should give me fastq files with a range of coverage between 1 and 30x. Correct?

This means that any reads with coverage below 1x will be thrown out, and any reads with coverage above 30x will be subsampled down to 30x.

The program's main purpose is metagenomic samples, where rare reads that can't be assembled (below mindepth) are discarded, while other low-abundance reads are preserved at their original coverage. Very abundant reads are subsampled.

If I understand what you did correctly, there is a single genome in your reads and all of the reads are presumably at the same depth. That most likely means you can ignore the mindepth switch and simply run it with different target values for each depth of coverage desired.
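
If you go that route, the loop is trivial (a sketch; the targets and file names are placeholders):

# One normalized read set per desired depth of coverage.
for t in 1 5 10 20 30; do
    bbnorm.sh in=reads_1.fastq in2=reads_2.fastq \
        out=norm${t}x_1.fastq out2=norm${t}x_2.fastq target=$t
done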

To be honest I don't actually know if all of my synthetic reads produced by wgsim are at the same depth. I'll look into this now.
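
One way to check, assuming bwa and samtools are available (a sketch; the awk line is just a quick mean):

# Map, sort, then summarize per-base depth across all positions.
bwa index GCA_900186885_1.fa
bwa mem GCA_900186885_1.fa reads_1.fastq reads_2.fastq | samtools sort -o mapped.bam -
samtools depth -a mapped.bam | awk '{sum += $3; n++} END {print "mean depth:", sum / n}'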
