I have read about Hadoop and Spark at a high level. A still-distant goal is to handle large amounts of sequencing data efficiently for a future project. By large I mean more than 10k fastq files from single-cell sequencing. The initial goal is of course the primary analysis (which is not completely specified yet, afaik), but the data should remain accessible to the department for future research projects.
In my experience, efficient data handling can be quite a struggle even for much smaller projects, so I was wondering whether anyone already has experience using Hadoop or Spark to manage and process fastq, bam, or comparable data. Is it actually applicable, or complete nonsense? My reasoning is that fastq, bam, etc. can be considered inherently unstructured data.
To be specific, I am not looking for a solution, just fishing for opinions.
EDIT: For future reference, I missed an old thread here on Biostars.