Question: How can I create a dummy dataset from my NGS whole-exome data?
1
gravatar for olavur
3.1 years ago by
olavur100
Tórshavn, Faroe Islands
olavur100 wrote:

I am developing a pipeline (with GATK v3.7) for data from whole-exome sequencing from Illumina NextSeq 500. I have FASTQ files, and I want to create a very small dummy dataset from this, so that I can easily test tweaks to my pipeline, without having to wait for hours every time I tweak something.

Can I simply use, say, the first 1000 reads from a file, and expect that to work as a test dataset?

alignment fastq next-gen • 1.5k views
ADD COMMENTlink modified 3.1 years ago by Brian Bushnell17k • written 3.1 years ago by olavur100
1

you can do that or you can random sample your fastq.

For first 1000 sequences:

seqkit head -n 1000 input.fq.gz -o output.fq.gz

For random sampling 1000 reads from your fastq:

seqkit sample -n 1000 -o input.fq.gz -o output.fq.gz

Download seqkit from here: http://bioinf.shenwei.me/seqkit/download/

ADD REPLYlink modified 3.1 years ago • written 3.1 years ago by cpad011214k
2
gravatar for genomax
3.1 years ago by
genomax89k
United States
genomax89k wrote:

Can I simply use, say, the first 1000 reads from a file

Don't do that especially since the first reads in an illumina dataset may not be of good quality. You can use reformat.sh from BBMap suite to sample reads (options below) in various ways (by number of reads, bases, a certain fraction (which may be best)).

reads=-1                Set to a positive number to only process this many INPUT reads (or pairs), then quit.
skipreads=-1            Skip (discard) this many INPUT reads before processing the rest.
samplerate=1            Randomly output only this fraction of reads; 1 means sampling is disabled.
sampleseed=-1           Set to a positive number to use that prng seed for sampling (allowing deterministic sampling).
samplereadstarget=0     (srt) Exact number of OUTPUT reads (or pairs) desired.
samplebasestarget=0     (sbt) Exact number of OUTPUT bases desired.

BTW: This would not be a dummy dataset but a representative one.

ADD COMMENTlink modified 3.1 years ago • written 3.1 years ago by genomax89k
1
gravatar for Brian Bushnell
3.1 years ago by
Walnut Creek, USA
Brian Bushnell17k wrote:

For testing a variant-calling pipeline, you may want sufficient coverage to actually call a variant. To do so I would suggest selecting some 100kbp chunk of the genome (which should be enough for perhaps 100 variants, including indels and substitutions) and pulling out all the reads that map to it. Then use those reads as your representative dataset (and you can use the 100kbp genome chunk as your representative reference); it will have sufficient coverage to actually call variants normally, but will still be a very small file.

ADD COMMENTlink modified 3.1 years ago • written 3.1 years ago by Brian Bushnell17k

Very good point! Getting a VCF file with no variants would probably not be very useful for testing the pipeline.

However, I don't see how I might do this, as there is no way of extracting a particular region from short-read data.

ADD REPLYlink written 3.1 years ago by olavur100
1

I don't see how I might do this

You will start with your aligned bam file and then use (Extract Reads From A Bam File That Fall Within A Given Region ) to extract reads from a particular region you are interested in.

ADD REPLYlink written 3.1 years ago by genomax89k

Ah, I see. Using: samtools view -b input.bam "chr10:1-10000000" > output.bam, for example. This is a good idea.

It presents some challenges in my particular case, however. I need separate FASTQs for paired-end reads and four lanes to test my pipeline, and while I can extract the four lanes using the read group information, I can't do this for the paired-end reads.

ADD REPLYlink modified 3.1 years ago • written 3.1 years ago by olavur100

Fear not. Both samtools or BBMap can help. Take a look at samtools fastq command or use BBMap, reformat.sh in=output.bam out1=read1.fq out2=read2.fq should do what you need.

ADD REPLYlink modified 3.1 years ago • written 3.1 years ago by genomax89k

Thanks, will try this.

ADD REPLYlink written 3.1 years ago by olavur100

What I typically do is open a fasta file in a text editor and manually select a bunch of text and paste it into a new fasta file (so that I can make sure I'm not selecting 100kbp of Ns or something). Then:

bbmap.sh in1=r1.fq in2=r2.fq outm1=mapped1.fq outm2=mapped2.fq ref=100kbp.fa
ADD REPLYlink modified 3.1 years ago • written 3.1 years ago by Brian Bushnell17k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1050 users visited in the last hour