Hadoop InputFormat for FASTA files?


9.4 years ago

alex.rubinsteyn ▴ 190

I'm interested in analyzing large FASTA files (like the human genome and proteome) in parallel using Spark or pydoop. Is there a library that implements FASTA parsing as a Hadoop InputFormat?

hadoop fasta • 2.6k views

updated 2.2 years ago by Ram 43k • written 9.4 years ago by alex.rubinsteyn ▴ 190


Maybe the "Hadoop FASTA reader" gist at gist.github.com/jflatow/45551?

9.4 years ago by Pierre Lindenbaum 161k


This looks like it works well for a FASTA file with many small records (since it seeks locally on each worker). However, for a FASTA file with large contigs (like the human genome), it wouldn't perform very well.
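To make the concern concrete, here is a rough pure-Python sketch of how a split-based reader might assign FASTA records to workers. It is not the code from the gist, and `read_fasta_split` and `split_ranges` are hypothetical names made up for illustration. The assumption is that each worker gets a byte range, skips ahead to the first `>` header that starts inside its range, and then reads every record whose header starts there, even if the sequence runs past the end of the range.

```python
import io


def read_fasta_split(stream, start, end):
    """Yield (header, sequence) for each record whose '>' line starts in [start, end).

    `stream` is a seekable binary file object. Hypothetical helper, not a Hadoop API.
    """
    pos = start
    stream.seek(start)
    if start > 0:
        # Check the previous byte: if it is not a newline we are mid-line,
        # so the partial line belongs to an earlier split and is skipped.
        stream.seek(start - 1)
        if stream.read(1) != b"\n":
            pos += len(stream.readline())
    header, chunks = None, []
    while True:
        line_start = pos
        line = stream.readline()
        pos += len(line)
        if not line or line.startswith(b">"):
            # End of file or a new header closes the record in progress.
            if header is not None:
                yield header, b"".join(chunks).decode("ascii")
            # Stop at EOF, or when the next header belongs to a later split.
            if not line or line_start >= end:
                return
            header, chunks = line[1:].strip().decode("ascii"), []
        elif header is not None:
            # Sequence line of a record owned by this split; it is read
            # here even when it lies beyond `end`.
            chunks.append(line.strip())


def split_ranges(total_size, split_size):
    """Return consecutive (start, end) byte ranges covering the file."""
    return [(s, min(s + split_size, total_size))
            for s in range(0, total_size, split_size)]


# Tiny in-memory example: three records, split into 10-byte ranges.
demo = b">a\nACGT\nAC\n>b\nGG\n>c\nTTTT\nTT\n"
records = []
for start, end in split_ranges(len(demo), 10):
    records.extend(read_fasta_split(io.BytesIO(demo), start, end))
```

In this sketch every record is read exactly once, by whichever split contains the start of its header line. That is fine for many small records, but a chromosome-sized contig would be read end to end by a single worker while the splits that cover the rest of it yield nothing, which is the performance problem described above.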

updated 2.2 years ago by Ram 43k • written 9.4 years ago by alex.rubinsteyn ▴ 190
