Hello guys, I'm a beginner in bioinformatics and I have the following problem:
I have a input file like this:
Sequence_1,RT_RH_A17 Sequence_1,RH_A17 Sequence_1,INT_PHD Sequence_1,PR_Ap2 Sequence_1,Gag Sequence_1,RH_A17 Sequence_1,INT_rve Sequence_2,INT_rve Sequence_3,INT_rve Sequence_4,INT_rve Sequence_5,INT_rve Sequence_6,RT_like Sequence_6,RH_A17
It represents an annotation of transposon domain per sequence, as we can see, have several lines for the same sequence (e.g. sequence_1). I want to create an output file like this:
Sequence_1,1,1,1,1,1 Sequence_2,0,0,0,0,1 Sequence_3,0,0,0,0,1 Sequence_4,0,0,0,0,1 Sequence_5,0,0,0,0,1 Sequence_6,0,0,1,1,0
The output represents a model to use in iTol (a tool to annotate phylogenetic trees). So I want the order of annotation as Gag,PR,RT,RH,INT, sequences that have one of those domains receive '1' in domain position or '0' if doesn't have.
I wrote a python script to try resolve this:
# open the table domains=open('bel_domains.csv','r') # create a output file output = open('output.txt','w+') # function to detect domains in lines def found_domain(a,b): if a in b: output.write('1'+',') else: output.write('0'+',') for line in domains: line_split = line.split(',') output.write(line_split+',') found_domain('Gag',line_split) found_domain('PR',line_split) found_domain('RT',line_split) found_domain('RH',line_split) found_domain('INT',line_split) output.close()
I know that I have a huge problem in logic and maybe a hugest problem in programming skills, because my output its:
Briefly, I need to detect wich lines are equal in the position of sequence name, and read&annotate domains in the following order (Gag,PR,RT,RH,INT), but I have no idea how I can implement this in my script.
Can someone help me?