Question: FASTQ to SAM File converter
0
3 months ago, inkprs (60) wrote:

Hi,

I have a FASTQ file and a reference genome file in FASTA format.

What are the fastest tools available to convert it into a SAM file?

I also use Hadoop and Spark. Are there any tools available in the big data world?

My final goal is to create a VCF file after creating a SAM file.

sequencing big data fastq fasta • 444 views
modified 3 months ago by EagleEye (4.7k) • written 3 months ago by inkprs (60)
2

You should figure out exactly what you are trying to do and understand the whole process before trying to find "the fastest tools".

written 3 months ago by igor (4.5k)
3
3 months ago, WouterDeCoster (21k, Belgium) wrote:

"Converting" the data to sam/bam requires alignment. Although you haven't specified anything about your data, I would first suggest having a look at bwa mem.

modified 3 months ago • written 3 months ago by WouterDeCoster (21k)
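
For readers who want a concrete starting point, here is a minimal sketch of the alignment route suggested in this answer, carried through to the VCF the question asks for. Only `bwa mem` comes from the thread. The file names (`ref.fa`, `reads.fq`, `aln.sam`, and so on), the thread count, and the use of samtools and bcftools for sorting and variant calling are assumptions made for illustration, and the right options depend on the data. By default the script only prints the commands; set `DRY_RUN=0` to execute them.

```shell
#!/usr/bin/env bash
# Sketch: FASTQ + reference FASTA -> SAM -> sorted BAM -> VCF.
# All file names are placeholders (assumptions), not taken from the thread.
REF=${REF:-ref.fa}
READS=${READS:-reads.fq}
THREADS=${THREADS:-4}
DRY_RUN=${DRY_RUN:-1}   # 1 = only print the commands, 0 = execute them

# Print a command, and run it only when DRY_RUN=0.
run() {
  echo "+ $*"
  if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

fastq_to_vcf() {
  run bwa index "$REF" &&                                         # build the BWA index once per reference
  run bwa mem -t "$THREADS" -o aln.sam "$REF" "$READS" &&         # align reads, write SAM
  run samtools sort -@ "$THREADS" -o aln.sorted.bam aln.sam &&    # coordinate-sort into BAM
  run samtools index aln.sorted.bam &&                            # index the sorted BAM
  run bcftools mpileup -f "$REF" -Ou -o pileup.bcf aln.sorted.bam &&  # per-position pileup
  run bcftools call -mv -Ov -o calls.vcf pileup.bcf               # call variants, write VCF
}

fastq_to_vcf
```

The `&&` chain stops at the first failing step, so a failed alignment does not silently produce an empty VCF.
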
1

"Converting" the data to sam/bam requires alignment.

No it doesn't! Using the BBMap package:

reformat.sh in=file.fastq out=file.sam

Completely valid sam file, really fast! :) Also that will ensure the final VCF file is extremely small and easy to work with.

*Disclaimer: I do not recommend this approach.

modified 3 months ago • written 3 months ago by Brian Bushnell (14k)
2

Yes, I'm aware of unmapped SAM/BAM, but I decided not to add that to my answer; it seems the OP is already sufficiently confused :-)

written 3 months ago by WouterDeCoster (21k)
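
To make the "unmapped SAM" in this exchange concrete, the sketch below shows roughly what such a conversion amounts to: each FASTQ record becomes one SAM line with flag 4 (unmapped) and placeholder values in every alignment field. This is not how `reformat.sh` is implemented; the function name `fastq_to_unmapped_sam` is hypothetical, and the sketch assumes single-end, four-line FASTQ records and writes no header.

```shell
# Hypothetical helper: turn four-line FASTQ records on stdin into
# unmapped SAM records on stdout. No alignment is performed.
fastq_to_unmapped_sam() {
  awk 'BEGIN { OFS = "\t" }
       NR % 4 == 1 { name = substr($1, 2) }   # read name: drop "@" and any comment
       NR % 4 == 2 { seq = $0 }               # bases
       NR % 4 == 0 {                          # quality line completes the record
         # QNAME FLAG RNAME POS MAPQ CIGAR RNEXT PNEXT TLEN SEQ QUAL
         print name, 4, "*", 0, 0, "*", "*", 0, 0, seq, $0
       }'
}
```

Usage: `fastq_to_unmapped_sam < file.fastq > file.sam`. Every record has `*` as its reference name and 0 as its position, which is why a variant caller has nothing to work with.
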
1

Definitely faster than some of those fancy aligners.

written 3 months ago by igor (4.5k)

My final goal is to create a VCF file after creating a SAM file.

Won't work for stated final goal :)

modified 3 months ago • written 3 months ago by genomax (33k)

Just to make sure, I tried it, and it worked (meaning it created a SAM file and a VCF file):

reformat.sh in=ATTPA.fq.gz out=foo.sam reads=100
java -ea -Xmx200m -cp /global/projectb/sandbox/gaag/bbtools/jgi-bbtools/current/ jgi.ReformatReads in=ATTPA.fq.gz out=foo.sam reads=100
Executing jgi.ReformatReads [in=ATTPA.fq.gz, out=foo.sam, reads=100]

Input is being processed as paired
Input:                          200 reads               30200 bases
Output:                         200 reads (100.00%)     30200 bases (100.00%)

Time:                           0.095 seconds.
Reads Processed:         200    2.10k reads/sec
Bases Processed:       30200    0.32m bases/sec

callvariants.sh in=foo.sam ref=P.heparinus.fa out=foo.vcf
java -ea -Xmx206018m -Xms206018m -cp /global/projectb/sandbox/gaag/bbtools/jgi-bbtools/current/ var2.CallVariants in=foo.sam ref=P.heparinus.fa out=foo.vcf
Executing var2.CallVariants [in=foo.sam, ref=P.heparinus.fa, out=foo.vcf]

Loading reference.
Time:   0.097 seconds.
Processing input files.
Time:   0.018 seconds.
Memory: max=207024m, free=194063m, used=12961m

Processing variants.
Time:   0.002 seconds.

Writing output.
Time:   0.019 seconds.

0 of 0 variants passed filters (NaN%).

Substitutions:  0       NaN%
Deletions:      0       NaN%
Insertions:     0       NaN%
Variation Rate: 0/5167383
Homozygous:     0       NaN%

Time:                           0.195 seconds.
Reads Processed:         200    1.03k reads/sec
Bases Processed:       30200    0.15m bases/sec

As predicted, the resulting VCF was really small and had zero false positives.

written 3 months ago by Brian Bushnell (14k)
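
The "0 of 0 variants" result above follows from the fact that none of the reads in the SAM file are mapped. A quick sanity check before variant calling is to count mapped and unmapped records. The sketch below does this with plain awk by testing bit 4 of the FLAG column; the function name `count_mapped` is hypothetical.

```shell
# Hypothetical helper: count mapped vs. unmapped records in a SAM stream.
count_mapped() {
  awk -F '\t' '
    /^@/ { next }   # skip header lines
    # FLAG bit 4 set means the read is unmapped
    { if (int($2 / 4) % 2 == 1) unmapped++; else mapped++ }
    END { printf "mapped=%d unmapped=%d\n", mapped, unmapped }'
}
```

Usage: `count_mapped < foo.sam`. If samtools is installed, `samtools view -c -f 4 foo.sam` should give the unmapped count directly.
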

But, is it suitable for machine learning?!?

written 3 months ago by WouterDeCoster (21k)
1

I think the machine might eventually learn that this is not a very good approach.

written 3 months ago by Brian Bushnell (14k)
2
3 months ago, Pierre Lindenbaum (98k, France/Nantes/Institut du Thorax - INSERM UMR1087) wrote:

I also use Hadoop and Spark. Are there any tools available in the big data world?

http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0155461

SparkBWA: Speeding Up the Alignment of High-Throughput DNA Sequencing Data

Next-generation sequencing (NGS) technologies have led to a huge amount of genomic data that need to be analyzed and interpreted. This fact has a huge impact on the DNA sequence alignment process, which nowadays requires the mapping of billions of small DNA sequences onto a reference genome. In this way, sequence alignment remains the most time-consuming stage in the sequence analysis workflow. To deal with this issue, state of the art aligners take advantage of parallelization strategies. However, the existent solutions show limited scalability and have a complex implementation. In this work we introduce SparkBWA, a new tool that exploits the capabilities of a big data technology as Spark to boost the performance of one of the most widely adopted aligner, the Burrows-Wheeler Aligner (BWA). The design of SparkBWA uses two independent software layers in such a way that no modifications to the original BWA source code are required, which assures its compatibility with any BWA version (future or legacy). SparkBWA is evaluated in different scenarios showing noticeable results in terms of performance and scalability. A comparison to other parallel BWA-based aligners validates the benefits of our approach. Finally, an intuitive and flexible API is provided to NGS professionals in order to facilitate the acceptance and adoption of the new tool. The source code of the software described in this paper is publicly available at https://github.com/citiususc/SparkBWA, with a GPL3 license.

written 3 months ago by Pierre Lindenbaum (98k)
1
3 months ago, EagleEye (4.7k, Sweden) wrote:

Check my previous answer at the link below and let us know if this is what you want to do; otherwise, please explain in more detail.

A: FASTQs to the VCF

modified 3 months ago • written 3 months ago by EagleEye (4.7k)