Question: parsing alignment file
3
gravatar for a.rex
2.7 years ago by
a.rex190
a.rex190 wrote:

I am having some difficulty producing a script for parsing an alignment file in the following format generated from RepeatMasker. For example:

665  28.45  2.93  5.02  g5129s420  7350  7882  (1924)  C  MIR#SINE/MIR  (1)  261  28  3

  g5129s420         7350 ATCATAACAAACATTTAT--GGTGCCTCCTATGGAGCAGGGATTTTGCTT 7397   
                           v     v           i i  i v     viv    v i v v  v
C MIR#SINE/MIR       261 ATAATAACCAACATTTATTGAGCGCTTACTATGTGCCAGGCACTGTTCTA 212    

  g5129s420         7398 AGGACTCTGAACTATAT---CTTACTT-GTCTTCATTAAAAACCTTATGA 7443   
                           vi  i iv   i        i i   i  i    i  v    i     
C MIR#SINE/MIR       211 AGCGCTTTACA-TGTATTAACTCATTTAATCCTCA-CAACAACCCTATGA 164    

  g5129s420         7444 AAAAGGTACTATTATTAACTGGGGXTGGGTTGTTTAACAGATAAGAAAGC 7787   
                         iiv              v i         iii   v      i  i  i 
C MIR#SINE/MIR       163 GGTAGGTACTATTATTATCC---------CCATTTTACAGATGAGGAAAC 123    

  g5129s420         7788 TTAAGAATTAGAGAGATAAATTATCTTGCTTAAGGTAACACAGTTAACAA 7837   
                          v i v  i      i v  v  v     ii     v      i  ii  
C MIR#SINE/MIR       122 TGAGGCA-CAGAGAGGTTAAGTAACTTGCCCAAGGTCACACAGCTAGTAA 74     

  g5129s420         7838 GCATTAG-GTCAAAGTTTGAACTCGGGCAGTCTGACTACAGAGCCC 7882   
                          iivi    i iiii  i    i i         i  v     i  
C MIR#SINE/MIR        73 GTGGCAGAGCCGGGATTCGAACCCAGGCAGTCTGGCTCCAGAGTCC 28     

Transitions / transversions = 1.96 (45 / 23)
Gap_init rate = 0.03 (8 / 234), avg. gap size = 2.38 (19 / 8)

I would like to parse the file using BioPython such that I obtain the chromosome/scaffold name g5129s420, start + end (7350 7882), and the Transitions/transversions. Any ideas on how to write this script would be most welcome, as I am a complete novice to scripting.

parsing python genome • 900 views
ADD COMMENTlink modified 2.7 years ago • written 2.7 years ago by a.rex190
2
gravatar for ramesh.8v
2.7 years ago by
ramesh.8v200
United States
ramesh.8v200 wrote:

There may be some BioPython modules or some other parser exists for this job. Here is my solution based on Python 3.x if you know chromosome/scaffold name

seqname = ['g5129s420','g5129s421'] #list sequence names here
dic, trans = {}, {}
for names in seqname:
    dic[names] = []
    trans[names] = []
with open("file.txt") as f:
    for line in f:
        line = line.rstrip().lstrip().split(' ')
        line = [item for item in line if item is not '']
        if len(line)>0:
            #print(line)
            if line[0] in seqname:
                dic[line[0]].append(line[1])
                dic[line[0]].append(line[-1])
                seqname1 = line[0]
            elif line[0] == 'Transitions':
                trans[seqname1].append(line[4])
                trans[seqname1].append(line[5].strip('('))
                trans[seqname1].append(line[7].strip(')'))

for key, values in dic.items():
    if len(dic[key])>0:
        print(key,':', min(values), max(values), trans[key][0], trans[key][1], trans[key][2])

Output:

g5129s420 : 7350 7882 1.96 45 23
ADD COMMENTlink modified 2.7 years ago • written 2.7 years ago by ramesh.8v200
0
gravatar for a.rex
2.7 years ago by
a.rex190
a.rex190 wrote:

Thank you so much for this - helpful to see examples of how to parse this file. Unfortunately, I wish to parse out all sequences in the file and do not have the names of particular scaffolds. Is there a simpler way of doing this using BioPython?

ADD COMMENTlink written 2.7 years ago by a.rex190

You can use above script to parse all the sequences in the file but you need sequence names. It would be very difficult to parse without seq names, some BioPython module may do.

ADD REPLYlink written 2.7 years ago by ramesh.8v200

Please use ADD COMMENT to reply to earlier posts, as such this thread remains logically structured and easy to follow.

ADD REPLYlink written 2.7 years ago by WouterDeCoster39k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1126 users visited in the last hour