8.9 years ago
maghaee
Hi,
I am a noob and totally out of my depth but I still want to try this. So I have a file that lists different species by their gene sequences which are roughly 4000 bp long.
Example:
Species A ACGGTTGACGTTAAA. . . .etc
Species B ACGGGTTTTTTTGGGGGGGGCCCCCCAAAATT . . .etc
What I need to do is split each species' sequence into separate genes by position, so that I end up with a new alignment per gene across all taxa. For example, gene 1: 0-451, gene 2: 452-987, etc.
My question is how do I do this? I have been taught the basics of Python, with splitting, slicing, dictionaries, loops and stuff, but I have no idea how to start tackling this.
It seems you have the tools you need to do this, so I will describe how I would proceed:
1. write (and test) the code for reading and parsing the sequences into an appropriate data structure (I would choose a hash, i.e. a Python dictionary, with {key=>sequence_name, value=>sequence}; an array would work too)
2. write (and test) the code for splitting the sequences into appropriate sizes and storing them in a new data structure
3. write (and test) the code for writing the output.
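A minimal sketch of those three steps in Python, under a few assumptions that are not in the question: each line is a name followed by the sequence as its last whitespace-separated token, and the gene positions are 0-based and inclusive (so 0-451 covers 452 bases). The function names and `GENE_RANGES` are made up for illustration.

```python
import io

# Gene boundaries from the question, read here as 0-based inclusive positions
# (an assumption; adjust if your coordinates are 1-based or end-exclusive).
GENE_RANGES = {"gene1": (0, 451), "gene2": (452, 987)}


def read_sequences(handle):
    """Step 1: parse 'name sequence' lines into {name: sequence}."""
    sequences = {}
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # The sequence is assumed to be the last token; the rest is the name.
        name, seq = line.rsplit(None, 1)
        sequences[name] = seq
    return sequences


def split_by_gene(sequences, gene_ranges):
    """Step 2: slice every sequence into {gene: {name: fragment}}."""
    by_gene = {}
    for gene, (start, end) in gene_ranges.items():
        # end + 1 because Python slices exclude the end index.
        by_gene[gene] = {name: seq[start:end + 1] for name, seq in sequences.items()}
    return by_gene


def write_gene(by_gene, gene, handle):
    """Step 3: write one gene's fragments, one 'name fragment' line per species."""
    for name, fragment in by_gene[gene].items():
        handle.write(f"{name} {fragment}\n")


# Toy data with short ranges so the result is easy to check by eye.
toy_input = io.StringIO("Species A ACGGTTGACG\nSpecies B ACGGGTTTTT\n")
toy_ranges = {"gene1": (0, 3), "gene2": (4, 9)}
toy_by_gene = split_by_gene(read_sequences(toy_input), toy_ranges)
```

With a real file you would pass `open("your_file.txt")` instead of the `io.StringIO` object, and open one output file per gene for `write_gene`.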
Voilà, you have your small working script. At this point, I would start to think about how to do this better, possible problems (e.g. file sizes loaded into memory, etc), and maybe how to improve the script, but most likely I would just promise myself next time I will do better.
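On the memory point, one possible improvement is to avoid storing anything: read one line, slice it, write the fragments, and move on. This is only a sketch under the same assumptions as before (sequence is the last token on each line, positions are 0-based and inclusive); `split_streaming` and its arguments are hypothetical names.

```python
import io


def split_streaming(in_handle, gene_ranges, out_handles):
    """Slice each input line as it is read and write the fragments immediately.

    Only one sequence is held in memory at a time.
    out_handles maps each gene name to an open, writable file-like object.
    """
    for line in in_handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        name, seq = line.rsplit(None, 1)
        for gene, (start, end) in gene_ranges.items():
            # end + 1 because the ranges are treated as inclusive.
            out_handles[gene].write(f"{name} {seq[start:end + 1]}\n")


# Toy run using in-memory buffers in place of real files.
toy_ranges = {"gene1": (0, 3), "gene2": (4, 9)}
toy_outputs = {gene: io.StringIO() for gene in toy_ranges}
split_streaming(io.StringIO("Species A ACGGTTGACG\nSpecies B ACGGGTTTTT\n"),
                toy_ranges, toy_outputs)
```

For a 4000 bp alignment the dictionary version is perfectly fine; this only starts to matter when the input is too large to load comfortably.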