Question

Splitting dataset by taxon and gene alignments

8.9 years ago

maghaee • 0

Hi,

I am a noob and totally out of my depth, but I still want to try this. I have a file that lists different species with their gene sequences, each roughly 4000 bp long.

Example:

Species A ACGGTTGACGTTAAA. . . .etc

Species B ACGGGTTTTTTTGGGGGGGGCCCCCCAAAATT . . .etc

What I need to do is split the sequence of every taxon into new alignments, one per gene. For example, gene 1: 0-451, gene 2: 452-987, etc.

My question is: how do I do this? I have been taught the basics of Python (splitting, slicing, dictionaries, loops and so on), but I have no idea how to start tackling this.

python alignment sequencing • 1.3k views

updated 15 months ago by Ram 43k • written 8.9 years ago by maghaee • 0

It seems you have the tools you need to do this, so I will describe how I would proceed:

1. Write (and test) the code for reading and parsing the sequences into an appropriate data structure (I would choose a hash, i.e. a Python dictionary, with {key=>sequence_name, value=>sequence}; an array would work too).

2. Write (and test) the code for splitting the sequences into the appropriate gene segments and storing them in a new data structure.

3. Write (and test) the code for writing the output.
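The three steps above could be sketched roughly as follows. This is only a minimal sketch under several assumptions that are not stated in the question: each input line holds a name followed by whitespace and then the sequence, the gene coordinates are 0-based and inclusive, and the output is one FASTA file per gene. The function names, `GENE_RANGES`, and the output file names are made up for illustration.

```python
# Assumed gene boundaries (0-based, inclusive), copied from the question's example.
GENE_RANGES = {"gene1": (0, 451), "gene2": (452, 987)}


def parse_sequences(lines):
    """Step 1: map each sequence name to its sequence."""
    sequences = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Assumption: the sequence is the last whitespace-separated field,
        # so names containing spaces ("Species A") stay intact.
        name, seq = line.rsplit(None, 1)
        sequences[name] = seq
    return sequences


def split_by_gene(sequences, gene_ranges):
    """Step 2: build one {name: subsequence} dictionary per gene."""
    genes = {}
    for gene, (start, end) in gene_ranges.items():
        # end + 1 because the assumed coordinates are inclusive.
        genes[gene] = {name: seq[start:end + 1] for name, seq in sequences.items()}
    return genes


def write_fasta(genes, prefix=""):
    """Step 3: write one FASTA file per gene (hypothetical file naming)."""
    for gene, records in genes.items():
        with open(f"{prefix}{gene}.fasta", "w") as handle:
            for name, seq in records.items():
                handle.write(f">{name}\n{seq}\n")
```

Each function can be tested on its own with two or three short made-up sequences before being run on the real file, which is the point of the "write (and test)" advice. If the coordinates turn out to be 1-based or end-exclusive, only the slice in `split_by_gene` needs to change.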

Voilà, you have your small working script. At this point, I would start to think about how to do this better, about possible problems (e.g. large files loaded entirely into memory), and maybe about how to improve the script, but most likely I would just promise myself that next time I will do better.
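As an illustration of the memory concern, one possible improvement is to process the input one line at a time and write each gene slice straight to its output file, so that only a single sequence is held in memory. The sketch below assumes the same things as before (name, whitespace, then sequence on each line; 0-based inclusive coordinates; FASTA output), and `split_streaming` and its parameters are hypothetical names.

```python
def split_streaming(in_path, gene_ranges, out_prefix=""):
    """Read one sequence at a time and write each gene slice to its own file."""
    # One open output handle per gene, e.g. "gene1.fasta" (hypothetical naming).
    handles = {gene: open(f"{out_prefix}{gene}.fasta", "w") for gene in gene_ranges}
    try:
        with open(in_path) as infile:
            for line in infile:
                line = line.strip()
                if not line:
                    continue  # skip blank lines
                # Assumption: the sequence is the last whitespace-separated field.
                name, seq = line.rsplit(None, 1)
                for gene, (start, end) in gene_ranges.items():
                    # end + 1 because the assumed coordinates are inclusive.
                    handles[gene].write(f">{name}\n{seq[start:end + 1]}\n")
    finally:
        # Close every output file even if parsing fails part-way through.
        for handle in handles.values():
            handle.close()
```

For a few hundred sequences of about 4000 bp, the in-memory dictionary approach is perfectly adequate; streaming only starts to matter when the input is too large to load comfortably.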

8.9 years ago by h.mon 35k