Question

FASTQ to SAM File converter

0

8.1 years ago

inkprs ▴ 70

Hi,

I have a FASTQ file and a reference genome file in FASTA format.

What are the fastest tools available to convert it into a SAM file?

I also use Hadoop and Spark. Are there any tools available in the big data world?

My final goal is to create a VCF file after creating a SAM file.

fastq sequencing big data fasta • 11k views

ADD COMMENT • link updated 8.1 years ago by EagleEye 7.6k • written 8.1 years ago by inkprs ▴ 70

2

You should figure out exactly what you are trying to do and understand the whole process before trying to find "the fastest tools".

ADD REPLY • link 8.1 years ago by igor 13k

score 3 · Answer 1 · 2017-06-14

3

8.1 years ago

WouterDeCoster 48k

"Converting" the data to sam/bam requires alignment. Although you haven't specified anything about your data, I would first suggest having a look at bwa mem.
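For orientation, a typical alignment-then-calling workflow might look like the sketch below. It assumes single-end short reads and that bwa, samtools, and bcftools are installed. bcftools is used here only as an illustrative caller (the thread does not name one), and the function name, file names, and thread count are placeholders.

```shell
# Sketch only: align reads, sort and index the alignments, then call variants.
# align_and_call is a hypothetical helper name, not part of any tool.
align_and_call() {
  local ref=$1 reads=$2 prefix=$3 threads=${4:-4}

  # Build the BWA index (needed once per reference).
  bwa index "$ref" || return 1

  # Align the reads; bwa mem writes SAM to standard output.
  bwa mem -t "$threads" "$ref" "$reads" > "$prefix.sam" || return 1

  # Coordinate-sort and index the alignments.
  samtools sort -@ "$threads" -o "$prefix.sorted.bam" "$prefix.sam" || return 1
  samtools index "$prefix.sorted.bam" || return 1

  # Assumption: bcftools as the variant caller (pileup, then call variants only).
  bcftools mpileup -f "$ref" "$prefix.sorted.bam" |
    bcftools call -mv -Ov -o "$prefix.vcf" || return 1
}

# Example call (placeholder file names):
# align_and_call ref.fa reads.fq sample 8
```

The intermediate SAM file is kept here because the question asks for one; in practice the bwa mem output is often piped straight into samtools sort.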

ADD COMMENT • link 8.1 years ago by WouterDeCoster 48k

2

"Converting" the data to sam/bam requires alignment.

No it doesn't! Using the BBMap package:

reformat.sh in=file.fastq out=file.sam

Completely valid sam file, really fast! :) Also that will ensure the final VCF file is extremely small and easy to work with.

*Disclaimer: I do not recommend this approach.
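For anyone wondering what such a file contains: each read becomes an unmapped record (FLAG 4, RNAME *, POS 0, CIGAR *), so there are no alignments for a variant caller to work with. The sketch below mimics that with awk for single-end, four-line FASTQ records. It is an illustration only, not how reformat.sh is implemented, and fastq_to_unaligned_sam is a made-up name.

```shell
# Sketch only: write each FASTQ record as an unmapped SAM record.
# Assumes uncompressed, single-end FASTQ with exactly four lines per read.
fastq_to_unaligned_sam() {
  # Minimal header; there are no @SQ lines because nothing is aligned.
  printf '@HD\tVN:1.6\tSO:unsorted\n'
  awk '
    NR % 4 == 1 { name = substr($1, 2) }   # read name without the leading "@"
    NR % 4 == 2 { seq = $0 }               # bases
    NR % 4 == 0 {                          # qualities: emit the record
      # QNAME FLAG RNAME POS MAPQ CIGAR RNEXT PNEXT TLEN SEQ QUAL
      printf "%s\t4\t*\t0\t0\t*\t*\t0\t0\t%s\t%s\n", name, seq, $0
    }
  ' "$1"
}

# Example call (placeholder file names):
# fastq_to_unaligned_sam file.fastq > file.sam
```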

ADD REPLY • link 8.1 years ago by Brian Bushnell 20k

2

Yes, I'm aware of unmapped sam/bam but decided not to add that to my answer, as the OP already seems sufficiently confused :-)

ADD REPLY • link 8.1 years ago by WouterDeCoster 48k

1

Definitely faster than some of those fancy aligners.

ADD REPLY • link 8.1 years ago by igor 13k

0

My final goal is to create a VCF file after creating a SAM file.

Won't work for stated final goal :)

ADD REPLY • link 8.1 years ago by GenoMax 152k

0

Just to make sure, I tried it, and it worked (meaning it created a sam file and VCF file):

reformat.sh in=ATTPA.fq.gz out=foo.sam reads=100
java -ea -Xmx200m -cp /global/projectb/sandbox/gaag/bbtools/jgi-bbtools/current/ jgi.ReformatReads in=ATTPA.fq.gz out=foo.sam reads=100
Executing jgi.ReformatReads [in=ATTPA.fq.gz, out=foo.sam, reads=100]

Input is being processed as paired
Input:                          200 reads               30200 bases
Output:                         200 reads (100.00%)     30200 bases (100.00%)

Time:                           0.095 seconds.
Reads Processed:         200    2.10k reads/sec
Bases Processed:       30200    0.32m bases/sec

callvariants.sh in=foo.sam ref=P.heparinus.fa out=foo.vcf
java -ea -Xmx206018m -Xms206018m -cp /global/projectb/sandbox/gaag/bbtools/jgi-bbtools/current/ var2.CallVariants in=foo.sam ref=P.heparinus.fa out=foo.vcf
Executing var2.CallVariants [in=foo.sam, ref=P.heparinus.fa, out=foo.vcf]

Loading reference.
Time:   0.097 seconds.
Processing input files.
Time:   0.018 seconds.
Memory: max=207024m, free=194063m, used=12961m

Processing variants.
Time:   0.002 seconds.

Writing output.
Time:   0.019 seconds.

0 of 0 variants passed filters (NaN%).

Substitutions:  0       NaN%
Deletions:      0       NaN%
Insertions:     0       NaN%
Variation Rate: 0/5167383
Homozygous:     0       NaN%

Time:                           0.195 seconds.
Reads Processed:         200    1.03k reads/sec
Bases Processed:       30200    0.15m bases/sec

As predicted, the resulting VCF was really small and had zero false positives.
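A quick sanity check that would catch this before variant calling is to count the mapped reads in the SAM file. The sketch below does it with awk by testing the "unmapped" bit (0x4) of the FLAG column; count_mapped_reads is a made-up name. With samtools installed, samtools view -c -F 4 foo.sam should give the same kind of count.

```shell
# Sketch only: count SAM records whose FLAG does not have the 0x4 (unmapped) bit set.
count_mapped_reads() {
  awk -F '\t' '
    # Skip header lines; int(FLAG / 4) % 2 is 1 when the 0x4 bit is set.
    !/^@/ && int($2 / 4) % 2 == 0 { n++ }
    END { print n + 0 }
  ' "$1"
}

# Example call:
# count_mapped_reads foo.sam
```

For a file made by reformatting FASTQ directly, every record is unmapped, so the count is zero and the caller has nothing to work with.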

ADD REPLY • link 8.1 years ago by Brian Bushnell 20k

0

But, is it suitable for machine learning?!?

ADD REPLY • link 8.1 years ago by WouterDeCoster 48k

1

I think the machine might eventually learn that this is not a very good approach.

ADD REPLY • link 8.1 years ago by Brian Bushnell 20k

score 2 · Answer 2 · 2017-06-14

I also use Hadoop and Spark. Are there any tools available in the big data world?

http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0155461

SparkBWA: Speeding Up the Alignment of High-Throughput DNA Sequencing Data

Next-generation sequencing (NGS) technologies have led to a huge amount of genomic data that need to be analyzed and interpreted. This fact has a huge impact on the DNA sequence alignment process, which nowadays requires the mapping of billions of small DNA sequences onto a reference genome. In this way, sequence alignment remains the most time-consuming stage in the sequence analysis workflow. To deal with this issue, state of the art aligners take advantage of parallelization strategies. However, the existent solutions show limited scalability and have a complex implementation. In this work we introduce SparkBWA, a new tool that exploits the capabilities of a big data technology as Spark to boost the performance of one of the most widely adopted aligner, the Burrows-Wheeler Aligner (BWA). The design of SparkBWA uses two independent software layers in such a way that no modifications to the original BWA source code are required, which assures its compatibility with any BWA version (future or legacy). SparkBWA is evaluated in different scenarios showing noticeable results in terms of performance and scalability. A comparison to other parallel BWA-based aligners validates the benefits of our approach. Finally, an intuitive and flexible API is provided to NGS professionals in order to facilitate the acceptance and adoption of the new tool. The source code of the software described in this paper is publicly available at https://github.com/citiususc/SparkBWA, with a GPL3 license.

score 1 · Answer 3 · 2017-06-14

1

8.1 years ago

EagleEye 7.6k

Check my previous answer at the link below and let us know if this is what you want to do; otherwise, please explain in more detail.

A: FASTQs to the VCF

ADD COMMENT • link 8.1 years ago by EagleEye 7.6k