Question: get organism name from fasta file header (PYTHON)
3 days ago by
biohacker_tobe40 wrote:

Is it possible to extract the organism name from a FASTA header, in the same way one gets the sequence and gene ID? For example:

>sp|Q09305|AAR2_CAEEL Protein AAR2 homolog OS=Caenorhabditis elegans GN=F10B5.2 PE=3 SV=1
MGGALPPEIVDYMYRNGAFLLFLGFPQASEFGIDYKSWKTGEKFMGLKMIPPGVHFVYCS
IKSAPRIGFFHNFKAGEILVKKWNTESETFEDEEVPTDQISEKKRQLKNMDSSLAPYPYE
NYRSWYGLTDFITADTVERIHPILGRITSQAELVSLETEFMENAEKEHKDSHFRNRVDRE

>sp|Q18007|ACM1_CAEEL Probable muscarinic acetylcholine receptor gar-1 OS=Caenorhabditis elegans GN=gar-1 PE=2 SV=3
MPNYTVPPDPADTSWDSPYSIPVQIVVWIIIIVLSLETIIGNAMVVMAYRIERNISKQVS
NRYIVSLAISDLIIGIEGFPFFTVYVLNGDRWPLGWVACQTWLFLDYTLCLVSILTVLLI
TADRYLSVCHTAKYLKWQSPTKTQLLIVMSWLLPAIIFGIMIYGWQAMTGQSTSMSGAEC
SAPFLSNPYVNMGMYVAYYWTTLVAMLILYKGIHQAAKNLEKKAKAKERRHIALILSQRL

Running the following code, I can get the sequence and protein ID but not the organism name. How can this be done? :)

from Bio import SeqIO
import re
import pandas as pd

input_file = "Streptomyces_Uniprot.fasta"
# Capture the accession between the first pair of pipes,
# e.g. "Q09305" in "sp|Q09305|AAR2_CAEEL"
pattern = r"\|(.*?)\|"
sequence_list = []
id_list = []

# SeqIO.parse accepts a filename and handles opening/closing the file
fasta_sequences = SeqIO.parse(input_file, 'fasta')
for fasta in fasta_sequences:
    fasta_id, sequence, description = fasta.id, str(fasta.seq), fasta.description
    fasta_id = re.search(pattern, fasta_id).group(1)
    print(fasta_id)

How would I pull out the value of OS=ORGANISM_NAME, for example, from the description?


Hello biohacker_tobe!

It appears that your post has been cross-posted to another site: https://stackoverflow.com/questions/63922667/

This is typically not recommended as it runs the risk of annoying people in both communities.

ADD REPLYlink written 3 days ago by Pierre Lindenbaum130k
3 days ago by
Joe17k
United Kingdom
Joe17k wrote:

If the OS and GN identifiers are reliable, you could try this:

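from Bio import SeqIO
import sys, re

new_recs = []
for rec in SeqIO.parse(sys.argv[1], 'fasta'):
    # Keep only the text between "OS=" and " GN=" as the new description
    rec.description = re.search(r'OS=(.*?) GN=', rec.description).group(1)
    new_recs.append(rec)

for new_rec in new_recs:
    # Convert the Seq object to a plain string before concatenating
    print(">" + new_rec.description + "\n" + str(new_rec.seq))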

Output:

>Caenorhabditis elegans
MGGALPPEIVDYMYRNGAFLLFLGFPQASEFGIDYKSWKTGEKFMGLKMIPPGVHFVYCSIKSAPRIGFFHNFKAGEILVKKWNTESETFEDEEVPTDQISEKKRQLKNMDSSLAPYPYENYRSWYGLTDFITADTVERIHPILGRITSQAELVSLETEFMENAEKEHKDSHFRNRVDRE
>Caenorhabditis elegans
MPNYTVPPDPADTSWDSPYSIPVQIVVWIIIIVLSLETIIGNAMVVMAYRIERNISKQVSNRYIVSLAISDLIIGIEGFPFFTVYVLNGDRWPLGWVACQTWLFLDYTLCLVSILTVLLITADRYLSVCHTAKYLKWQSPTKTQLLIVMSWLLPAIIFGIMIYGWQAMTGQSTSMSGAECSAPFLSNPYVNMGMYVAYYWTTLVAMLILYKGIHQAAKNLEKKAKAKERRHIALILSQRL
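
If the GN identifier is not always present, the pattern above returns no match and .group(1) raises an AttributeError. Current UniProt FASTA headers also place an OX= (taxonomy ID) tag between OS= and GN=, which would end up inside the captured name. A minimal sketch of a more tolerant pattern follows; extract_organism is a hypothetical helper name, the second header is an invented variant for illustration, and the code assumes every tag is two capital letters followed by "=":

```python
import re

# Capture everything after "OS=" up to the next " XX=" tag or the end of the line.
# Assumption: tags are always two capital letters followed by "=".
OS_PATTERN = re.compile(r"OS=(.+?)(?= [A-Z]{2}=|$)")


def extract_organism(description):
    """Return the organism name from a UniProt-style header, or None if absent."""
    match = OS_PATTERN.search(description)
    return match.group(1) if match else None


headers = [
    # Header from the question
    "sp|Q09305|AAR2_CAEEL Protein AAR2 homolog OS=Caenorhabditis elegans GN=F10B5.2 PE=3 SV=1",
    # Hypothetical variant: OX= present, GN= missing
    "sp|Q09305|AAR2_CAEEL Protein AAR2 homolog OS=Caenorhabditis elegans OX=6239 PE=3 SV=1",
    # Hypothetical header with no OS= tag at all
    "some_id no organism tag here",
]

for header in headers:
    print(extract_organism(header))
```

The same function can be applied to rec.description inside the SeqIO.parse loop. Returning None instead of raising lets you decide how to handle records without an OS= tag.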

Thanks, this is exactly what I was looking for :D

ADD REPLYlink written 2 days ago by biohacker_tobe40
3 days ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum130k wrote:

Assuming all the fields are in the same order:

awk -F= '/^>/ {X=$2;gsub(/[^ ]*$/,"",X);print X}' in.fasta

I know how to do it on Linux, but is it possible directly in Python?

ADD REPLYlink written 3 days ago by biohacker_tobe40

Since there is no formal specification for FASTA headers (beyond > and some identifier that follows), Biopython's FASTA parser does not split out fields such as OS= for you. You will likely have to do this with a pattern search after you grab the header.
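
As a rough illustration of that pattern-search approach in plain Python (no Biopython), the sketch below reads only the header lines and splits every XX=value tag into a dict. The names parse_header and read_headers are hypothetical, the in-memory demo stands in for a real file handle, and the code assumes UniProt-style headers where each tag is two capital letters followed by "=":

```python
import io
import re

# Match each "XX=value" pair; a value runs until the next " XX=" tag or end of line.
# Assumption: UniProt-style tags (OS, OX, GN, PE, SV) made of two capital letters.
TAG_PATTERN = re.compile(r"([A-Z]{2})=(.+?)(?= [A-Z]{2}=|$)")


def parse_header(header):
    """Split one header line into a dict holding the ID and every tagged field."""
    header = header.lstrip(">").strip()
    fields = {"id": header.split(None, 1)[0]}
    fields.update(TAG_PATTERN.findall(header))  # list of (tag, value) pairs
    return fields


def read_headers(handle):
    """Parse every non-empty line that starts with '>' in an open FASTA handle."""
    return [parse_header(line) for line in handle if line.startswith(">") and line.strip() != ">"]


# In-memory stand-in for open("in.fasta"), using the headers from the question
demo = io.StringIO(
    ">sp|Q09305|AAR2_CAEEL Protein AAR2 homolog OS=Caenorhabditis elegans GN=F10B5.2 PE=3 SV=1\n"
    "MGGALPPEIVDYMYRNGAFLLFLGFPQASEFGIDYKSWKTGEKFMGLKMIPPGVHFVYCS\n"
    ">sp|Q18007|ACM1_CAEEL Probable muscarinic acetylcholine receptor gar-1 "
    "OS=Caenorhabditis elegans GN=gar-1 PE=2 SV=3\n"
    "MPNYTVPPDPADTSWDSPYSIPVQIVVWIIIIVLSLETIIGNAMVVMAYRIERNISKQVS\n"
)

records = read_headers(demo)
for rec in records:
    # Accession is the middle part of "sp|ACCESSION|NAME"
    print(rec["id"].split("|")[1], rec.get("OS"))
```

Because each record is a plain dict, the list can be handed to pandas.DataFrame(records) if a table of IDs and organisms is the goal. Headers from other databases will not follow this tag layout, so the pattern would need adjusting for them.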

ADD REPLYlink modified 3 days ago • written 3 days ago by genomax89k

Fair enough, thanks for the pointer :)

ADD REPLYlink written 3 days ago by biohacker_tobe40