I had to write some code last week to solve this small problem, and thought it might make another nice code golf challenge.
Here are the requirements:
Given a fasta file (potentially huge), return a fasta file containing smaller sequences of length 'n' that represent all non-overlapping subsequences of the original sequences. The last fragment of each sequence may be shorter than 'n'. Rename the new sequences according to the scheme 'oldSeqName_00001' and up, where 'oldSeqName' is the original name of the sequence.
Test file: please use http://hgdownload.cse.ucsc.edu/goldenPath/hg18/chromosomes/chr1.fa.gz
example run : digest_fasta(fasta_in, 60, fasta_out)
inputs:
- a number of nucleotides for each sub-sequence (n)
- a fasta file

>XYZ_gene
CGTGAAGACGCAGTACCAGCAGAGATGGCTGGCCATCGACGGCAACGCTCGCCGCGAGAT
CAAGAACTACGTTTTACAGACTCTGGGCACGGAGACGTACCGGCCCAGCTCGGCGTCGCA
GTGCGTCGCCGGCATCGCCTGCGCTGAGATCCCCGTTAACCAGTGGCCCGAGCTGATCCC
ACAGCTGGTGGCCAACGTCACGGACCCGTCCAGCACCGAACACATGAAGGAGTCCACGTT
GGAGGCCATCGGGTACATCTGCCAGGACATCGACCCGGAGCAGCTGCAGGAGAACGCCAA
CCAGATCCTGACGGCCATCATCCAGGGCATGAGGAAGGAGGAGCCCAGTAACAACGTGAA
GCTGGCCGCGACTAACGCTCTGCTCAACTCGCTGGAGTTCACTAAAGCCAACTTTGACAA
GGAGACGGAGAGACACTTCATCATGCAGG
outputs: - a fasta file
EXAMPLE (after digesting into 60 bp long fragments):

>XYZ_gene_00001
CGTGAAGACGCAGTACCAGCAGAGATGGCTGGCCATCGACGGCAACGCTCGCCGCGAGAT
>XYZ_gene_00002
CAAGAACTACGTTTTACAGACTCTGGGCACGGAGACGTACCGGCCCAGCTCGGCGTCGCA
>XYZ_gene_00003
GTGCGTCGCCGGCATCGCCTGCGCTGAGATCCCCGTTAACCAGTGGCCCGAGCTGATCCC
>XYZ_gene_00004
ACAGCTGGTGGCCAACGTCACGGACCCGTCCAGCACCGAACACATGAAGGAGTCCACGTT
>XYZ_gene_00005
GGAGGCCATCGGGTACATCTGCCAGGACATCGACCCGGAGCAGCTGCAGGAGAACGCCAA
>XYZ_gene_00006
CCAGATCCTGACGGCCATCATCCAGGGCATGAGGAAGGAGGAGCCCAGTAACAACGTGAA
>XYZ_gene_00007
GCTGGCCGCGACTAACGCTCTGCTCAACTCGCTGGAGTTCACTAAAGCCAACTTTGACAA
>XYZ_gene_00008
GGAGACGGAGAGACACTTCATCATGCAGG
And so on for all the sequences in the 'fasta_in' file!
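To make the expected behaviour concrete, here is a minimal, non-golfed Python sketch of one possible digest_fasta. It is only an illustration, not a reference solution, and it rests on a few assumptions of mine: the input is an uncompressed fasta file (gunzip the test file first), the sequence name is the first whitespace-delimited token of the header line, and the counter is zero-padded to five digits as in the example above. It streams the file line by line, so the whole chromosome never has to sit in memory.

```python
def digest_fasta(fasta_in, n, fasta_out):
    """Write non-overlapping fragments of length n for every record in fasta_in.

    Assumptions (not part of the original spec): plain-text input, and the
    sequence name is the first whitespace-delimited token of the header.
    """
    if n < 1:
        raise ValueError("n must be a positive integer")

    with open(fasta_in) as src, open(fasta_out, "w") as dst:
        name = None   # name of the record currently being read
        count = 0     # number of fragments written for this record
        buf = ""      # bases read but not yet written

        def flush(final):
            # Emit every complete fragment; on the final call for a record,
            # also emit the shorter leftover fragment, if any.
            nonlocal count, buf
            while len(buf) >= n or (final and buf):
                count += 1
                dst.write(f">{name}_{count:05d}\n{buf[:n]}\n")
                buf = buf[n:]

        for line in src:
            line = line.strip()
            if line.startswith(">"):
                if name is not None:
                    flush(final=True)
                fields = line[1:].split()
                name = fields[0] if fields else ""
                count, buf = 0, ""
            elif line and name is not None:
                buf += line
                flush(final=False)

        if name is not None:
            flush(final=True)
```

With the test file decompressed to a local chr1.fa, a call would look like digest_fasta("chr1.fa", 60, "chr1_60.fa"). The buffer never grows much beyond n plus one input line, which is the main reason for streaming instead of reading each sequence whole.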
For this week's requirements, I'll try to be clearer: the program should run under Linux.
Use whatever language, approach, library you favor! Submit multiple answers if they are significantly different!
I'll accept the most-voted answer on Thursday around 16:00 Eastern Canada time. Do vote for any answers you find interesting, even if they come in late!
Just like last week, diversity will be very appreciated!
Can't wait to see your code ;)
CONCLUSION: Thanks to all who participated with their answers or discussions! It's a pleasure to see all the different approaches and learn new tricks :)