Question: Storing genome data in Hadoop at the chromosome level
I am an IT student working with Hadoop on a human genome project. My first problem is how to store the genome data in a Hadoop cluster. How can I store the data chromosome-wise?
We have a cluster of 30 machines running Hadoop, and we are planning to use it to process human genome data. The data is in the form of BAM files. I know that if I load the data into HDFS, it will automatically be split into blocks and stored on the data nodes. That is the problem: I can't have the data split arbitrarily like that. I need it split chromosome-wise so that we can run bio algorithms on each chromosome.
An example of such a bio algorithm: bisulfite methylation extraction.
Currently we use bismap (a Python tool). Is there a way to store the data chromosome-wise on Hadoop and run the bismap tool command as MapReduce jobs?
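To make clear what I mean by "chromosome-wise", here is a minimal sketch of the grouping I am after. It assumes the BAM records have already been converted to SAM text lines (for example with `samtools view`), where the third tab-separated column (RNAME) holds the chromosome name. The function names `key_by_chromosome` and `group_by_chromosome` are made up for illustration and are not part of Hadoop or any bio tool. My understanding is that a Hadoop Streaming mapper would emit the same (chromosome, record) pairs as key/value lines, so that all records for one chromosome reach the same reducer, but I have not tried this on the cluster.

```python
from collections import defaultdict


def key_by_chromosome(sam_lines):
    """Yield (chromosome, record) pairs from SAM-formatted text lines."""
    for line in sam_lines:
        line = line.rstrip("\n")
        # Skip blank lines and SAM header lines, which start with "@".
        if not line or line.startswith("@"):
            continue
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # not a well-formed alignment record
        # Column 3 (RNAME) is the reference sequence name, e.g. "chr1".
        yield fields[2], line


def group_by_chromosome(sam_lines):
    """Collect alignment records into one list per chromosome."""
    groups = defaultdict(list)
    for chrom, record in key_by_chromosome(sam_lines):
        groups[chrom].append(record)
    return dict(groups)


if __name__ == "__main__":
    # Tiny hand-written example records, not real data.
    example = [
        "@HD\tVN:1.6\tSO:coordinate",
        "r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
        "r2\t0\tchr2\t200\t60\t4M\t*\t0\t0\tTTGA\tIIII",
        "r3\t16\tchr1\t300\t60\t4M\t*\t0\t0\tGGCC\tIIII",
    ]
    for chrom, records in sorted(group_by_chromosome(example).items()):
        print(chrom, len(records))
```

What I want is for each per-chromosome group to be stored and processed as a unit, instead of being cut at arbitrary HDFS block boundaries.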