I have a augustus output file, which i further edited and now it looks like this.
gdna.0.1 gnl|Hsal_3.3|scaffold48 AUGUSTUS gene 3912 10875 0.01 - . ID=g1 gnl|Hsal_3.3|scaffold48 AUGUSTUS transcript 3912 10875 0.01 - . ID=g1.t1;Parent=g1 gnl|Hsal_3.3|scaffold48 AUGUSTUS transcription_end_site 3912 3912 . - . Parent=g1.t1 gdna.0.2 gnl|Hsal_3.3|scaffold48 AUGUSTUS exon 3912 5899 . - . Parent=g1.t1 gnl|Hsal_3.3|scaffold48 AUGUSTUS stop_codon 4604 4606 . - 0 Parent=g1.t1 gnl|Hsal_3.3|scaffold48 AUGUSTUS CDS 4604 5656 0.89 - 0 ID=g1.t1.cds;Parent=g1.t1 gnl|Hsal_3.3|scaffold48 AUGUSTUS start_codon 5654 5656 . - 0 Parent=g1.t1 gdna.0.3 gnl|Hsal_3.3|scaffold48 AUGUSTUS exon 9919 10875 . - . Parent=g1.t1 gnl|Hsal_3.3|scaffold48 AUGUSTUS transcription_start_site 10875 10875 . - . Parent=g1.t1 gdna.0.4 gnl|Hsal_3.3|scaffold112 AUGUSTUS gene 9196 31470 0.01 + . ID=g2 gnl|Hsal_3.3|scaffold112 AUGUSTUS transcript 9196 31470 0.01 + . ID=g2.t1;Parent=g2 gnl|Hsal_3.3|scaffold112 AUGUSTUS transcription_start_site 9196 9196 . + . Parent=g2.t1 gdna.0.5 gnl|Hsal_3.3|scaffold112 AUGUSTUS exon 9196 9505 . + . Parent=g2.t1 gnl|Hsal_3.3|scaffold112 AUGUSTUS exon 9623 10706 . + . Parent=g2.t1 gnl|Hsal_3.3|scaffold112 AUGUSTUS exon 10757 11547 . + . Parent=g2.t1 gnl|Hsal_3.3|scaffold112 AUGUSTUS exon 12892 13859 . + . Parent=g2.t1
I have a list of gene ids like this:
gdna.0.2 gdna 0.4
I want to get the annotations of these gene ids alone into a new file from the augustus output file i specified above, i initially wanted to use a awk to do this, but since the input file is huge and also the number of gene ids for which i needed the annotation entries were also huge, i understood its best to write a script.
the output should look like this
gdna.0.2 gnl|Hsal_3.3|scaffold48 AUGUSTUS exon 3912 5899 . - . Parent=g1.t1 gnl|Hsal_3.3|scaffold48 AUGUSTUS stop_codon 4604 4606 . - 0 Parent=g1.t1 gnl|Hsal_3.3|scaffold48 AUGUSTUS CDS 4604 5656 0.89 - 0 ID=g1.t1.cds;Parent=g1.t1 gnl|Hsal_3.3|scaffold48 AUGUSTUS start_codon 5654 5656 . - 0 Parent=g1.t1 gdna.0.4 gnl|Hsal_3.3|scaffold112 AUGUSTUS gene 9196 31470 0.01 + . ID=g2 gnl|Hsal_3.3|scaffold112 AUGUSTUS transcript 9196 31470 0.01 + . ID=g2.t1;Parent=g2 gnl|Hsal_3.3|scaffold112 AUGUSTUS transcription_start_site 9196 9196 . + . Parent=g2.t1
I tried using dictionaries for this purpose, but have no luck in moving further after this.
from collections import defaultdict #this will make your life simpler f = open('test.txt','r') list=defaultdict(str) name = '' for line in f: if line.startswith('gdna.'): name = line[1:-1] continue list[name]+=line.strip() print list
Can anyone suggest if there is a more effective way to do this.
Thanks