Python: Parsing a FASTA file
7.6 years ago
Am.A ▴ 20

Hi all

How can I parse a FASTA file to get each gene's location (i.e. the start and end coordinates of the gene)?

 >lcl|NC_000913.3_cds_NP_414542.1_1 [gene=thrL] [protein=thr operon leader peptide] [protein_id=NP_414542.1] [location=190..255]
ATGAAACGCATTAGCACCACCATTACCACCACCATCACCATTACCACAGGTAACGGTGCGGGCTGA

>lcl|NC_000913.3_cds_NP_414547.1_6 [gene=yaaA] [protein=peroxide resistance protein, lowers intracellular iron] [protein_id=NP_414547.1] [location=complement(5683..6459)]
ATGCTGATTCTTATTTCACCTGCGAAAACGCTTGATTACCAAAGCCCGTTGACCACCACGCGCTATACGC
TGCCGGAGCTGTTAGACAATTCCCAGCAGTTGATCCATGAGGCGCGGAAACTGACGCCTCCGCAGATTAG

You can find exactly what you need in a previous question: Correct Way To Parse A Fasta File In Python

Bonus: read this:

https://github.com/mdshw5/pyfaidx


Okay, you have my permission to do so.

But what is the question? Have you tried googling?


But you don't have my permission to give OP permission :-)

Unless OP edited the question after you wrote your comment, it does appear to have a reasonably clear description. On a serious note, can we have more of what @Medhat did and fewer of these comments?


Indeed, the post was edited; it didn't contain a question at all when I posted my comment asking what the question was. I realize that, next to the edited original post, my reply makes me look like a douche.


@Am.a: It generally helps to be explicit about the output you want when you write the original post. For example, in this case, do you only need the following?

thrL       190..255
yaaA     5683..6459
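
A table like the one above could be produced along these lines. This is only a sketch: it assumes every header carries `[key=value]` tags laid out as in the two example records, and the names `header_tags` and `gene_table` are made up for illustration.

```python
import re

# Matches one "[key=value]" tag in a header line.
TAG = re.compile(r"\[(\w+)=([^\]]*)\]")
# Matches the first "start..end" pair inside a location value.
SPAN = re.compile(r"\d+\.\.\d+")


def header_tags(header):
    """Return the [key=value] tags of a FASTA header line as a dict."""
    return dict(TAG.findall(header))


def gene_table(lines):
    """Yield (gene, 'start..end') for each header line that has both tags."""
    for line in lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        tags = header_tags(line)
        span = SPAN.search(tags.get("location", ""))
        if "gene" in tags and span:
            yield tags["gene"], span.group(0)


# Lines copied from the question; a real run would iterate over an open file instead.
headers = [
    ">lcl|NC_000913.3_cds_NP_414542.1_1 [gene=thrL] [protein=thr operon leader peptide] "
    "[protein_id=NP_414542.1] [location=190..255]",
    "ATGAAACGCATTAGCACCACCATTACCACCACCATCACCATTACCACAGGTAACGGTGCGGGCTGA",
    ">lcl|NC_000913.3_cds_NP_414547.1_6 [gene=yaaA] [protein=peroxide resistance protein, "
    "lowers intracellular iron] [protein_id=NP_414547.1] [location=complement(5683..6459)]",
]

for gene, span in gene_table(headers):
    print(f"{gene}\t{span}")
```

Since `gene_table` only iterates over lines, passing an open file object instead of the list should work the same way.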
7.6 years ago
second_exon ▴ 210

If I understood your question correctly, this solution with Python 3.x might help you:

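with open("seq.fa") as f:
    for line in f:
        line = line.rstrip()
        if line.startswith('>'):  # header lines only
            fields = line.split()
            # The location tag is assumed to be the last whitespace-separated field.
            # str.strip() removes any of the listed characters from both ends
            # (not a literal prefix), so add any other characters you want to strip.
            print(": ".join([fields[0], fields[-1].strip('[location=complement()]')]))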

Output:

>lcl|NC_000913.3_cds_NP_414542.1_1: 190..255
>lcl|NC_000913.3_cds_NP_414547.1_6: 5683..6459
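
Because `str.strip()` removes a set of characters rather than a literal prefix, a regular expression may be a more robust way to pull out the numbers, and it also lets you keep the strand. A minimal sketch, assuming locations are either `start..end` or `complement(start..end)` as in the two example headers (multi-part locations are not handled); `parse_location` is a hypothetical helper name:

```python
import re

# Optional "complement(" prefix, then start..end, then an optional ")".
LOCATION = re.compile(r"\[location=(complement\()?(\d+)\.\.(\d+)\)?\]")


def parse_location(header):
    """Return (start, end, strand) from a header line, or None if no location tag is found."""
    match = LOCATION.search(header)
    if match is None:
        return None
    strand = "-" if match.group(1) else "+"
    return int(match.group(2)), int(match.group(3)), strand


print(parse_location(">lcl|NC_000913.3_cds_NP_414542.1_1 [gene=thrL] [location=190..255]"))
print(parse_location(">lcl|NC_000913.3_cds_NP_414547.1_6 [gene=yaaA] [location=complement(5683..6459)]"))
```

Returning integers makes it easy to compute things such as the gene length (`end - start + 1`) afterwards.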

Don't write a parser if one already exists. In this case, the answer is SeqIO from Biopython.
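
A sketch of that approach, assuming Biopython is installed. `SeqIO.parse` takes care of the FASTA records themselves, but the `[location=...]` tag is just part of the free-text `record.description`, so it still has to be extracted separately. The regular expression and the `locations` function name are illustrative assumptions, not part of Biopython.

```python
import io
import re

from Bio import SeqIO  # third-party package: Biopython

# Skips any non-digit prefix such as "complement(" before the first coordinate.
SPAN = re.compile(r"\[location=\D*(\d+)\.\.(\d+)")


def locations(handle):
    """Yield (record id, start, end) for each FASTA record with a location tag."""
    for record in SeqIO.parse(handle, "fasta"):
        match = SPAN.search(record.description)
        if match:
            yield record.id, int(match.group(1)), int(match.group(2))


# In-memory copy of the first record from the question; a file name or open file works too.
example = io.StringIO(
    ">lcl|NC_000913.3_cds_NP_414542.1_1 [gene=thrL] [protein=thr operon leader peptide] "
    "[protein_id=NP_414542.1] [location=190..255]\n"
    "ATGAAACGCATTAGCACCACCATTACCACCACCATCACCATTACCACAGGTAACGGTGCGGGCTGA\n"
)
for row in locations(example):
    print(row)
```

The benefit over a hand-written loop is that multi-line sequences are already joined in `record.seq`, so the coordinates and the sequence can be handled together.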
