Splitting dataset by taxon and gene alignments
8.9 years ago
maghaee • 0

Hi,

I am a noob and totally out of my depth but I still want to try this. I have a file that lists different species with their gene sequences, each roughly 4000 bp long.

Example:

Species A ACGGTTGACGTTAAA. . . .etc

Species B ACGGGTTTTTTTGGGGGGGGCCCCCCAAAATT . . .etc

What I need to do is split the sequences into a new alignment per gene, across all taxa. For gene 1: 0-451, gene 2: 452-987, etc.

My question is how do I do this? I have been taught the basics of Python, with splitting, slicing, dictionaries, loops and stuff, but I have no idea how to start tackling this.

python alignment sequencing • 1.3k views

It seems you have the tools you need to do this, so I will describe how I would proceed:

1. write (and test) the code for reading and parsing the sequences into an appropriate data structure (I would choose a hash, i.e. a dict in Python, with {key=>sequence_name, value=>sequence}; an array would work too)

2. write (and test) the code for splitting the sequences into appropriate sizes and storing them in a new data structure

3. write (and test) the code for writing the output.
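The three steps above could be sketched roughly like this. It is only a sketch under assumptions: it assumes each line of the input is "name sequence" with the sequence as the last whitespace-separated token, and that the gene coordinates are 0-based with an inclusive end (as in "gene1: 0-451"). The function names, the `gene_ranges` dict and the output file naming are made up for illustration.

```python
def parse_sequences(lines):
    """Step 1: read "name sequence" lines into a dict {name: sequence}."""
    sequences = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        # Assumption: the sequence is the last token, the rest is the name.
        name, seq = line.rsplit(None, 1)
        sequences[name] = seq
    return sequences


def split_by_gene(sequences, gene_ranges):
    """Step 2: build {gene: {name: piece}} from inclusive (start, end) ranges."""
    alignments = {}
    for gene, (start, end) in gene_ranges.items():
        # end is inclusive in the question, so slice up to end + 1.
        alignments[gene] = {
            name: seq[start:end + 1] for name, seq in sequences.items()
        }
    return alignments


def format_alignment(records):
    """Step 3a: turn one gene's records into text, one "name sequence" per line."""
    return "".join(f"{name} {seq}\n" for name, seq in records.items())


def write_alignments(alignments, prefix="gene_"):
    """Step 3b: write one file per gene (hypothetical naming: gene_<name>.txt)."""
    for gene, records in alignments.items():
        with open(f"{prefix}{gene}.txt", "w") as handle:
            handle.write(format_alignment(records))


# Toy data standing in for the real ~4000 bp sequences.
toy_lines = ["Species A ACGGTTGACG", "Species B ACGGGTTTTT"]
toy_ranges = {"gene1": (0, 3), "gene2": (4, 9)}

toy_alignments = split_by_gene(parse_sequences(toy_lines), toy_ranges)
print(format_alignment(toy_alignments["gene1"]))
```

With a real file you would pass the open file handle to `parse_sequences` instead of `toy_lines`, then call `write_alignments` on the result. If the real file is FASTA or some other format, only step 1 needs to change.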

Voilà, you have your small working script. At this point, I would start to think about how to do this better, possible problems (e.g. file sizes loaded into memory, etc), and maybe how to improve the script, but most likely I would just promise myself next time I will do better.
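On the memory point: one possible improvement is to process the file one line at a time instead of loading everything into a dict first. A minimal sketch, again assuming "name sequence" lines and 0-based inclusive gene ranges; `iter_gene_slices`, `split_file` and the output file names are hypothetical.

```python
from contextlib import ExitStack


def iter_gene_slices(lines, gene_ranges):
    """Yield (gene, name, piece) one input line at a time."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        # Assumption: the sequence is the last whitespace-separated token.
        name, seq = line.rsplit(None, 1)
        for gene, (start, end) in gene_ranges.items():
            # end is inclusive, so slice up to end + 1.
            yield gene, name, seq[start:end + 1]


def split_file(in_path, gene_ranges, prefix="gene_"):
    """Stream in_path and append each piece to its gene's output file."""
    with ExitStack() as stack:
        source = stack.enter_context(open(in_path))
        # One open output file per gene, all closed when the block exits.
        outputs = {
            gene: stack.enter_context(open(f"{prefix}{gene}.txt", "w"))
            for gene in gene_ranges
        }
        for gene, name, piece in iter_gene_slices(source, gene_ranges):
            outputs[gene].write(f"{name} {piece}\n")
```

Only one sequence is held in memory at any moment. For a few species at about 4000 bp each this makes no practical difference, so the simpler load-everything version is perfectly fine there.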
