Question

Data management (Hadoop/Spark) and sequencing data

1

5.2 years ago

Carambakaracho ★ 3.2k

Hi all

I have read about Hadoop and Spark at a high level. A still-distant goal is to handle large amounts of sequencing data efficiently for a future project. By large I mean more than 10k fastq files from single-cell sequencing. Though the initial goal is of course the primary analysis (which is not completely specified afaik), the data should remain accessible to the department for future research projects.

In my experience, efficient data handling can be quite a struggle even for much smaller projects, so I was wondering whether anyone already has experience applying Hadoop or Spark to the efficient management and handling of fastq and/or bam or any comparable data. Is it actually applicable, or complete nonsense? My reasoning is that fastq, bam, etc. can be considered inherently unstructured data.

To be specific, I'm not after a solution, just fishing a bit for opinions.

EDIT: For future reference, I missed an old thread here on biostars

next-gen RNA-Seq • 1.4k views

5.2 years ago by Carambakaracho ★ 3.2k

2

The thing with using Hadoop/Spark is that few bioinformatics tools support them. So you can use stuff like HDFS (the Hadoop file system) but you're then not really using Hadoop for processing. I wouldn't invest too much effort on this front at this point in time unless you really do have big data (like 1000 genomes project size). Otherwise I worry that the time spent will be wasted due to the underlying technology being obsolete by the time you actually do have a need for it.

For scRNA-seq, the trick is to not make 10k fastq files. We routinely sequence around that many cells, but they're then in a single fastq file (and BAM file), since (A) that's still not that large, (B) all tools support it and (C) cell barcodes and UMIs can be tracked and handled with off-the-shelf tools.
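To illustrate why a single fastq is enough, here is a minimal Python sketch that counts reads per cell straight from one file. It assumes a hypothetical naming convention in which each read name ends in `_<cell barcode>_<UMI>`; your barcode-extraction tool may use a different layout, so treat the parsing as a placeholder.

```python
import io
from collections import Counter


def reads_per_cell(fastq_handle):
    """Count reads per cell barcode in a single FASTQ stream.

    Assumption (hypothetical convention): every read name ends in
    _<cell barcode>_<UMI>, e.g. "@read1_AAAC_GGT".
    """
    counts = Counter()
    for i, line in enumerate(fastq_handle):
        if i % 4 != 0:  # only the first line of each 4-line record is the header
            continue
        read_name = line.strip().split()[0]  # drop any comment after whitespace
        _, cell_barcode, _umi = read_name.rsplit("_", 2)
        counts[cell_barcode] += 1
    return counts


# Tiny in-memory example with three reads from two cells.
example = io.StringIO(
    "@read1_AAAC_GGT\nACGT\n+\nIIII\n"
    "@read2_AAAC_TTA\nTGCA\n+\nIIII\n"
    "@read3_CCGT_GGT\nGGCC\n+\nIIII\n"
)
print(reads_per_cell(example))
```

The point is only that cell identity travels with each read, so one file (or one BAM) can be split or summarised per cell on demand instead of being stored as 10k separate files.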

5.2 years ago by Devon Ryan 104k

1

We did some analysis on Hadoop. It turned out that the RNA-seq analysis ran quicker on 4 servers (think Cassandra?) than on one, something like 16 versus 23 minutes, but the cost of chunking and copying the data onto the workers (that is, the map reduce step), then reconstituting the results, was about 7 minutes. That is, we put in a lot of dev effort and gained no reward.
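The arithmetic behind "no reward" can be written out as a small sketch. It assumes the 23 minutes refer to the single server and the 16 minutes to the 4-server run, which is how the numbers read here; the function name is made up for illustration.

```python
def net_speedup(single_node_min, cluster_min, overhead_min):
    """Speedup of the cluster run once data-shuffling overhead is included.

    A value of 1.0 means the cluster gained nothing over a single node.
    """
    return single_node_min / (cluster_min + overhead_min)


single_node = 23  # minutes on one server (assumed reading of the figures)
cluster = 16      # minutes of compute on 4 servers
overhead = 7      # minutes for chunking, copying and reconstituting results

print(cluster + overhead)                          # total wall time on the cluster
print(net_speedup(single_node, cluster, overhead)) # net speedup factor
```

With these figures the cluster's total wall time equals the single-server time, so the overhead would have to drop below the 7 minutes saved in compute before distribution pays off.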

Looking into the literature, I think the tipping point where it makes sense to use Hadoop map-reduce strategies is around 400 GB of input. To be clear, that means one 200 GB R1 file and one 200 GB R2 file.

In most bioinformatics use cases, we are a long way from that since we can divide by sample efficiently.

I haven't looked into Spark, but similar constraints might apply. It seems to me that big data as seen in, e.g., massive Twitter dumps (i.e. 1 TB+ input files) has not really arrived in NGS bioinformatics yet?

Also Devon's point about software being lacking is quite true, but I think there are not many Hadoop clusters accessible and available to the NGS community either.

5.2 years ago by colindaven 6.4k

1

I'd guess that Hadoop would still be quicker if you had smaller files (<200 GB) BUT had a bunch of them AND could make use of data locality in the cluster AND the output were either small or could also be kept local to the compute. That'd at least avoid flooding the network with IO. That's a lot of conditions to meet though.

5.2 years ago by Devon Ryan 104k

1

Agreed, that matches my feeling after everything I've read so far. Mapping raw fastq doesn't seem to be the best match for map/reduce strategies. In any case, the lack of tools and the required investment in infrastructure and custom development don't match the expected (and observed) benefit.

I'm cautiously following the ADAM project. They report outstanding performance increases on... MarkDuplicates and flagstat (see their 2013 tech report; admittedly it's not the only thing). Yay. Both operations seem quite well suited to parallelization of any type, and don't suffer much from merging chunks of results.
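To show what I mean by cheap merging, here is a rough Python sketch of a flagstat-like count done per chunk and then summed. It is a simplified illustration, not ADAM's or samtools' implementation; only three SAM flag bits are tallied and the chunking is hypothetical.

```python
from collections import Counter
from functools import reduce

# Bit values from the SAM specification's FLAG field.
FLAG_PAIRED = 0x1
FLAG_UNMAPPED = 0x4
FLAG_DUPLICATE = 0x400


def count_flags(flags):
    """Map step: tally a few flagstat-style categories for one chunk of reads."""
    counts = Counter()
    for flag in flags:
        counts["total"] += 1
        if flag & FLAG_PAIRED:
            counts["paired"] += 1
        if not flag & FLAG_UNMAPPED:
            counts["mapped"] += 1
        if flag & FLAG_DUPLICATE:
            counts["duplicates"] += 1
    return counts


def merge_counts(a, b):
    """Reduce step: per-chunk tallies combine by plain addition."""
    return a + b


# Two hypothetical chunks of SAM FLAG values, as if read on different workers.
chunks = [[99, 147, 4], [1123, 77, 0]]
merged = reduce(merge_counts, map(count_flags, chunks), Counter())
print(merged)
```

Because each read is counted independently and the merge is a sum, the result does not depend on how the reads were split, which is why this kind of statistic parallelizes so easily.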

I can't say I'm not impressed, quite the contrary. On the other hand, I don't see myself justifying investments in Hadoop/Spark hardware infrastructure and adapted code just to get that insane performance gain in MarkDuplicates or flagstat.

5.2 years ago by Carambakaracho ★ 3.2k

0

Meh, just use sambamba for duplicate marking and claim you've saved the company a boat-load of money. If you get a bonus from that you owe us a cut :)

5.2 years ago by Devon Ryan 104k

0

Deal ;-) Though duplicate marking wasn't exactly the big problem I hoped to solve in the first place.

5.2 years ago by Carambakaracho ★ 3.2k

0

Thanks Devon, that explains why Hadoop/Spark are rarely mentioned.

With my limited experience, I sort of expected rather full-sized fastq files for scRNA-seq. Barcoding sounds like a good trick; it is similar to something I did with metagenome WGS fastq files prior to assembly/analysis, only there the size got pretty big with just a few dozen time points and/or individuals.

5.2 years ago by Carambakaracho ★ 3.2k