Hi there,
I'm trying to parse this fna file (found here) into a bed file.
I understand that the fna coordinates are 1-based and inclusive. So by joining the following coordinates:
>lcl|NC_029649.1_cds_XP_015810653.1_3 [db_xref=GeneID:107382813] [protein=ras-related protein Rab-33A-like] [protein_id=XP_015810653.1] [location=join(70623..70994,88563..88703,95063..95143,95215..95439)] [gbkey=CDS]
ATGACCAAGGATTCCCCGGAGGAGAGCCGCGCTGCGAGCGGAGGAGGAGGAAAAATCAGGAGAAACCGAGCCGACGACAA
TGTGACCATCCTGACATCCTCCATGGACTTCCACAGAGCCAGCAGGAGCCGAGCCAGCAGCGGCACTGATGCCGCCCCTA
GCGTCACCTCCTCCGTGGACCTGAGCACCTCCTCCCTGGAGATGAGCATCCAGACGCGGATCTTTAAGATCATCGTCATC
GGGGACTCCAACGTGGGGAAGACTTGCCTAACCTTCCGCTTCACCGGGGGAAGCTTCCCTGACAAGACCGAGGCCACGAT
CGGAGTGGATTTCAGGGAGAAGGCGGTGGAGATAGAAGGAGAAACCATTAAGGTGCAGGTGTGGGACACAGCAGGCCAGG
AACGCTTTCGGAAGTCCATGGTGGAACACTACTACCGGAATGTCCATGCTGTGGTCTTCGTGTACGATGTTACTAAGATG
GCTTCCTTCCGCAACCTGCAGACATGGATAGAGGAGTGTAACGGCCATCGGGTGTCTGCGTCTGTGCCTCGGGTTCTTGT
GGGAAACAAGTGTGACCTCGTGGATCAAATACAGGTGCCTTCCAACATGGCGCTGAAGTTCGCCGACGCCCACAACATGC
TGCTGTTCGAGACGTCCGCGAAGGACCCAAAGGAGACCCAGAATGTGGATTCCATCTTCATGTCGCTGGCCTGCCGCCTG
AAAGCCCAGAAGTCTCTGCTCTACAGAGACGTGGAGCGGGAGGACGGGAGGGTCAGGCTCTCACAGGAGACTGAAACCAA
AAGTAACTGTCCCTGTTGA
I can extract an identical sequence.
But lines like this fail me:
>lcl|NC_029649.1_cds_XP_015810238.1_10 [db_xref=GeneID:107382541] [protein=m7GpppX diphosphatase] [frame=2] [partial=5'] [exception=annotated by transcript or proteomic data] [protein_id=XP_015810238.1] [location=join(<612582..612786,613676..613826,613918..614063,614140..614253,614642..614752,614833..615099)] [gbkey=CDS]
GTGAGAGAAGGCGCTACAAACATGGCGGACGCTGTAGAAAGTGTTTATAGAGAAAAAGACGAGTTCTGTCAGGAGGCCAA
GAGATCGAAACCCGCCGACAGGGACGGACCAGAGTCTGAGAGTGAAAATATTTTAGCTGGATTTAAAACACCCAACGTGT
TGAGCGATTCTGCCCGGGAGAAAATCATCTTCATCCATGGAAAGATTGCAGATCAGGATGCTGTGGTCATCCTGGAGAAG
ACCCCCATCAGAGAAGATACGCTCCCTGAGCTCTTCAGTTGTTCCTCGCTCAGACTGGAGACGAGAAACGACATCTACGG
CTCCTATCGTCTCCAGGCCCCGCCCCACCTAAATGAGATTAAGACCACCGTTATTTATCCAGCCACAGAAAAGCACGTCA
AGAAGTATCAGCGTCAGGAGAACTTCCTGGTGGAGGAGACGGGAGAGGACTACGAGTCCATCACGCTGCCTTATATTCAG
CAGCAGAGTTTGAGCCTGCAGTGGGTTTACAACATCCTGGACAAGAAGGCTGAGGCCGACCGCGTCGTTTATGAAGACCC
AGACCCAAAGCTCGGCTTTGTCCTTCTCCCTGATTTAAAGTGGAACCAAAAGCAGGTGGACGACTTGTATCTGATTGCCG
TTGCTCATCAGAGAGACGTCAGAAGTCTTCGTGACCTGACGTCAGAGCACCTACCTTTGCTGCAGAACATCTTCCAGAAA
GGAAAGGAAGCCATCCTGCAGCGCTACAACCTTCCAAGCAGCAAGCTGAGGGTCTACCTGCACTACCAGCCCTCCTACTA
CCATCTCCACGTCCACTTCACCAAGTTGGGCTACGAGGCACCGGGCTGCGGCGTGGAGCGAGCCCACCTTCTGTCAGACG
TCATCCAGAACCTCCAGGCCAACCCGCAGTTCTACAAAACCCGAACAATGTACTTCCCTCTGAGGGCCGACGACGGGCTG
CTCGACAGGTTCAGAGAGGCGGGCAGGATGTGA
If (hypothetically, of course) I ignore the left angle bracket preceding the starting base in the first coordinates, <612582..612786, I get an extracted sequence which is one base longer than the reported one.
So maybe frame=2 in the header means that the reported sequence starts one base downstream, at 612583.
But then, there are also cases when frame=3, or when there's a right angle bracket following the ending base, or when there are angle brackets but no annotation of frame different than 1.
So, how should coordinates and angle brackets be read in an fna file?
Thanks in advance.
Please use the formatting bar (especially the code option) to present your post better. You can use backticks for inline code (`text`), or select a chunk of text and use the highlighted button to format it as a code block. I've done it for you this time.

An educated guess is that the brackets are telling you that either the start or the stop codon is missing. Other than that, the second sequence has the expected length and translates into the annotated protein. You could try ignoring the brackets and see whether that creates a frameshift in your codons. If not, chances are that the brackets are only telling you that the expected codons are missing at either end.
Pretty educated guess :)
Today I found this documentation in the `README.txt` file in the above-mentioned folder: