How can I create a dummy dataset from my NGS whole-exome data?
2
1
Entering edit mode
6.7 years ago
olavur ▴ 150

I am developing a pipeline (with GATK v3.7) for data from whole-exome sequencing from Illumina NextSeq 500. I have FASTQ files, and I want to create a very small dummy dataset from this, so that I can easily test tweaks to my pipeline, without having to wait for hours every time I tweak something.

Can I simply use, say, the first 1000 reads from a file, and expect that to work as a test dataset?

next-gen FASTQ alignment • 3.4k views
ADD COMMENT
2
Entering edit mode

you can do that or you can random sample your fastq.

For first 1000 sequences:

seqkit head -n 1000 input.fq.gz -o output.fq.gz

For random sampling 1000 reads from your fastq:

seqkit sample -n 1000 -o input.fq.gz -o output.fq.gz

Download seqkit from here: http://bioinf.shenwei.me/seqkit/download/

ADD REPLY
0
Entering edit mode

small typo in the second command

seqkit sample -n 1000 input.fq.gz -o output.fq.gz

ADD REPLY
2
Entering edit mode
6.7 years ago
GenoMax 141k

Can I simply use, say, the first 1000 reads from a file

Don't do that especially since the first reads in an illumina dataset may not be of good quality. You can use reformat.sh from BBMap suite to sample reads (options below) in various ways (by number of reads, bases, a certain fraction (which may be best)).

reads=-1                Set to a positive number to only process this many INPUT reads (or pairs), then quit.
skipreads=-1            Skip (discard) this many INPUT reads before processing the rest.
samplerate=1            Randomly output only this fraction of reads; 1 means sampling is disabled.
sampleseed=-1           Set to a positive number to use that prng seed for sampling (allowing deterministic sampling).
samplereadstarget=0     (srt) Exact number of OUTPUT reads (or pairs) desired.
samplebasestarget=0     (sbt) Exact number of OUTPUT bases desired.

BTW: This would not be a dummy dataset but a representative one.

ADD COMMENT
1
Entering edit mode
6.7 years ago

For testing a variant-calling pipeline, you may want sufficient coverage to actually call a variant. To do so I would suggest selecting some 100kbp chunk of the genome (which should be enough for perhaps 100 variants, including indels and substitutions) and pulling out all the reads that map to it. Then use those reads as your representative dataset (and you can use the 100kbp genome chunk as your representative reference); it will have sufficient coverage to actually call variants normally, but will still be a very small file.

ADD COMMENT
0
Entering edit mode

Very good point! Getting a VCF file with no variants would probably not be very useful for testing the pipeline.

However, I don't see how I might do this, as there is no way of extracting a particular region from short-read data.

ADD REPLY
1
Entering edit mode

I don't see how I might do this

You will start with your aligned bam file and then use (Extract Reads From A Bam File That Fall Within A Given Region ) to extract reads from a particular region you are interested in.

ADD REPLY
0
Entering edit mode

Ah, I see. Using: samtools view -b input.bam "chr10:1-10000000" > output.bam, for example. This is a good idea.

It presents some challenges in my particular case, however. I need separate FASTQs for paired-end reads and four lanes to test my pipeline, and while I can extract the four lanes using the read group information, I can't do this for the paired-end reads.

ADD REPLY
0
Entering edit mode

Fear not. Both samtools or BBMap can help. Take a look at samtools fastq command or use BBMap, reformat.sh in=output.bam out1=read1.fq out2=read2.fq should do what you need.

ADD REPLY
0
Entering edit mode

Thanks, will try this.

ADD REPLY
0
Entering edit mode

What I typically do is open a fasta file in a text editor and manually select a bunch of text and paste it into a new fasta file (so that I can make sure I'm not selecting 100kbp of Ns or something). Then:

bbmap.sh in1=r1.fq in2=r2.fq outm1=mapped1.fq outm2=mapped2.fq ref=100kbp.fa
ADD REPLY

Login before adding your answer.

Traffic: 1914 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6