Question

Input Files In FASTA Format Are Required For Comparing Various DNA Sequencing Tools

score 2

14.3 years ago

Higherdefender ▴ 160

Hi,

I am new to the field of bioinformatics. While reading about sequence alignment I came across tools such as BWA, Velvet, MAQ, SOAP, etc.

I would like to compare the performance of these tools by profiling them myself (not in depth, just enough to get a rough idea).

For this I need benchmark data sets of unaligned human DNA (preferably short sequences) in FASTA format. I don't want sequences of one specific kind, but data general enough to compare relative average performance.

After a lot of searching on Google, Bing, and DuckDuckGo, I wasn't able to find anything. Could you tell me:

What kind of data set (reference DNA, short DNA, etc.) is required for the job?
How many of them would suffice for the job?

Please keep in mind that I want an average, general performance analysis, so the data sets have to be general (not confined to a particular task) and should also differ from one another (variety is needed for a complete performance analysis).

P.S. Please give somewhat greater emphasis to BWA and BWA-SW.

Thanks

next-gen sequencing alignment fasta • 12k views

Updated 14.3 years ago by biobot 0.0.77.a.1099 6.2k • written 14.3 years ago by Higherdefender ▴ 160

You most likely need FASTQ, not FASTA.

Comment • 14.3 years ago by Michael 56k

Pierre Lindenbaum · Answer 1 · 2011-03-10

score 5

14.3 years ago

Pierre Lindenbaum 166k

The samtools package contains a tool named wgsim.

This tool generates a set of random short reads from a reference file.

Program: wgsim (short read simulator)
Version: 0.2.3
Contact: Heng Li <lh3@sanger.ac.uk>

Usage:   wgsim [options] <in.ref.fa> <out.read1.fq> <out.read2.fq>

Options: -e FLOAT      base error rate [0.020]
         -d INT        outer distance between the two ends [500]
         -s INT        standard deviation [50]
         -N INT        number of read pairs [1000000]
         -1 INT        length of the first read [70]
         -2 INT        length of the second read [70]
         -r FLOAT      rate of mutations [0.0010]
         -R FLOAT      fraction of indels [0.10]
         -X FLOAT      probability an indel is extended [0.30]
         -c            generate reads in color space (SOLiD reads)
         -C            show mismatch info in comment rather than read name
         -h            haplotype mode

Note: For SOLiD reads, the first read is F3 and the second is R3

It generates FASTQ files that you can easily convert to FASTA.
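Converting FASTQ to FASTA only means dropping the "+" and quality lines. Here is a minimal Python sketch, assuming every record occupies exactly four lines (no wrapped sequences). The function names and file paths are made up for illustration and are not part of samtools or wgsim.

```python
from typing import Iterable, Iterator


def fastq_to_fasta(lines: Iterable[str]) -> Iterator[str]:
    """Yield FASTA lines from four-line FASTQ records (header, bases, '+', qualities)."""
    record = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line and not record:
            continue  # ignore blank lines between records
        record.append(line)
        if len(record) == 4:
            header, bases, plus, _qualities = record
            if not header.startswith("@") or not plus.startswith("+"):
                raise ValueError(f"malformed FASTQ record: {header!r}")
            yield ">" + header[1:]  # '@name' becomes '>name'
            yield bases             # quality string is discarded
            record = []
    if record:
        raise ValueError("truncated FASTQ record at end of input")


def convert_file(fastq_path: str, fasta_path: str) -> None:
    """Stream a FASTQ file to a FASTA file without loading it into memory."""
    with open(fastq_path) as src, open(fasta_path, "w") as dst:
        for out_line in fastq_to_fasta(src):
            dst.write(out_line + "\n")
```

For example, convert_file("reads_1.fq", "reads_1.fa") would write one header line and one sequence line per read (both file names are placeholders).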

For the reference genome you can use the Human Genome reference.
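If you want several simulated data sets (for instance with different read lengths or error rates), the wgsim call can be scripted. This is a rough sketch that uses only options from the usage text above (-N, -1, -2, -e). The helper names and file names are hypothetical, and it assumes wgsim is installed and on the PATH.

```python
import subprocess
from typing import List


def wgsim_command(reference_fasta: str, out_read1: str, out_read2: str,
                  read_pairs: int = 1000000, read_length: int = 70,
                  base_error_rate: float = 0.02) -> List[str]:
    """Build the argument list for one wgsim run (same length for both reads)."""
    return [
        "wgsim",
        "-N", str(read_pairs),       # number of read pairs
        "-1", str(read_length),      # length of the first read
        "-2", str(read_length),      # length of the second read
        "-e", str(base_error_rate),  # base error rate
        reference_fasta, out_read1, out_read2,
    ]


def simulate_reads(reference_fasta: str, out_read1: str, out_read2: str, **options) -> None:
    """Run wgsim; raises CalledProcessError if the tool exits with an error."""
    subprocess.run(wgsim_command(reference_fasta, out_read1, out_read2, **options), check=True)
```

A call such as simulate_reads("chr21.fa", "sim_1.fq", "sim_2.fq", read_pairs=100000, read_length=50) would then produce one paired data set per parameter combination.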

Updated 5.8 years ago by Ram 45k • written 14.3 years ago by Pierre Lindenbaum 166k

Thanks a lot for the information. As a follow-up, could you also tell me which chromosomes are preferable (which files to download) for a general analysis? I don't have enough time to analyze all the data sets, so 4-7 files would suffice.

Comment • 14.3 years ago by Higherdefender ▴ 160

Hmm, I'm not sure that Life has a favorite chromosome... but you can try downloading the smaller chromosomes: chr19, chr21 and chr22.

Comment • 14.3 years ago by Pierre Lindenbaum 166k

Thanks. That is all I need for short reads. Could you suggest something for long reads too? :P

Comment • 14.3 years ago by Higherdefender ▴ 160

biobot 0.0.77.a.1099 · Answer 2 · 2011-03-10

score 2

14.3 years ago

biobot 0.0.77.a.1099 6.2k

See the answers to "What Ngs Read Simulators Are Available For Paired-End Data?" and to "Where Can I Find Fastq Data (Ngs Raw Data) And Published Results?". Note that the SRA may go away, but alternatives exist; see the answer to "Sra Replacement".

Updated 5.8 years ago by Ram 45k • written 14.3 years ago by biobot 0.0.77.a.1099 6.2k

Michael · Answer 3 · 2011-03-10

score 1

14.3 years ago

Michael 56k

You can (still) download from the Short Read Archive (SRA): http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=download_reads

Hurry and grab what you can, because NCBI is closing the SRA down due to lack of funding (which I find disappointing, but sure, the US needs to cut its spending, and research is always a good place to start, lol)!

Updated 12.4 years ago by Aleksandr Levchuk 3.2k • written 14.3 years ago by Michael 56k

score 1 · Answer 4 · 2011-03-10

I can speak about bwa.

To perform an alignment you need a FASTQ file containing the query reads and a genome (in FASTA format) as the reference.

Before doing the alignment, you need to index the genome using bwa index.
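For a rough timing comparison, something along these lines could serve as a starting point. It is only a sketch: it assumes bwa is installed and on the PATH, it uses the bwa index, bwa aln and bwa bwasw subcommands with default options, and the helper functions and file names are hypothetical. Large genomes may need extra indexing options, so check the bwa index help for your version.

```python
import subprocess
import time
from typing import List, Optional


def bwa_index_command(reference_fasta: str) -> List[str]:
    """Index the reference; has to be done once before any alignment."""
    return ["bwa", "index", reference_fasta]


def bwa_aln_command(reference_fasta: str, reads_fastq: str) -> List[str]:
    """Short-read alignment; bwa aln writes its .sai output to stdout."""
    return ["bwa", "aln", reference_fasta, reads_fastq]


def bwa_bwasw_command(reference_fasta: str, reads_fastq: str) -> List[str]:
    """BWA-SW alignment for longer reads; SAM output goes to stdout."""
    return ["bwa", "bwasw", reference_fasta, reads_fastq]


def time_command(command: List[str], stdout_path: Optional[str] = None) -> float:
    """Run a command and return its wall-clock time in seconds."""
    start = time.perf_counter()
    if stdout_path is None:
        subprocess.run(command, check=True)
    else:
        # Redirect stdout to a file so the aligner output is kept.
        with open(stdout_path, "wb") as out:
            subprocess.run(command, check=True, stdout=out)
    return time.perf_counter() - start
```

Usage would look like time_command(bwa_index_command("ref.fa")) followed by time_command(bwa_aln_command("ref.fa", "reads.fq"), "reads.sai"). Wall-clock time is a crude measure, so repeating each run a few times and comparing the averages gives a fairer picture.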

The FASTA file can be downloaded, for instance, here (you probably want chromFa.tar.gz; unpack it and then cat the chromosomes together into one big file). For the FASTQ file, you can download from the 1000 Genomes Project. Read and understand which file you need.
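The "cat together" step can also be done in Python if a shell is not handy. This is a small sketch, assuming the archive has already been unpacked into one FASTA file per chromosome. The function name and paths are placeholders.

```python
from typing import Iterable


def concatenate_fasta(chromosome_paths: Iterable[str], combined_path: str) -> None:
    """Append each chromosome FASTA file to one combined reference file."""
    with open(combined_path, "wb") as combined:
        for path in chromosome_paths:
            last_byte = b"\n"
            with open(path, "rb") as chromosome:
                while True:
                    chunk = chromosome.read(1 << 20)  # copy in 1 MiB pieces
                    if not chunk:
                        break
                    combined.write(chunk)
                    last_byte = chunk[-1:]
            # Make sure the next '>' header starts on its own line.
            if last_byte != b"\n":
                combined.write(b"\n")
```

For example, concatenate_fasta(["chr19.fa", "chr21.fa", "chr22.fa"], "small_ref.fa") would build a reduced reference from three chromosomes, which keeps indexing and alignment times manageable for a first benchmark.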

I have never used the other programs, but you probably need a FASTQ file and a reference FASTA to start in all cases.

Have fun and let us know the benchmark results!