Question: Find Sequences In Fasta File From List Of Id'S
1
gravatar for arronslacey
5.7 years ago by
arronslacey230
United Kingdom
arronslacey230 wrote:

Hi - I am using biopython (open to any other method however) to find sequences in a fasta file that match ID's from a .txt file. Just by looking at the two files I can see there are matches, but the following script will not find them:

#!/usr/bin/env python

import sys
from Bio import SeqIO

input_file = sys.argv[1]
id_file = sys.argv[2]
output_file = sys.argv[3]

wanted = set(line.rstrip("\n").split(None,1)[0] for line in open(id_file))
print "Found %i unique identifiers in %s" % (len(wanted), id_file)

records = (r for r in SeqIO.parse(input_file, "fasta") if r.id in wanted)
count = SeqIO.write(records, output_file, "fasta")

print "Saved %i records from %s to %s" % (count, input_file, output_file)
if count < len(wanted):
    print "Warning %i IDs not found in %s" % (len(wanted)-count, input_file)

here is a snippet of the fasta file (.fasta):

>gnl|TC-DB|P0A334 1.A.1.1.1 Voltage-gated potassium channel OS=Streptomyces lividans GN=kcsA PE=1 SV=1
MPPMLSGLLARLVKLLLGRHGSALHWRAAGAATVLLVIVLLAGSYLAVLAERGAPGAQLI
TYPRALWWSVETATTVGYGDLYPVTLWGRLVAVVVMVAGITSFGLVTAALATWFVGREQE
RRGHFVRHSEKAAEEAYTRTTRALHERFDRLERMLDDNRR
>gnl|TC-DB|P06550 1.A.1.1.2 LctB protein - Bacillus stearothermophilus.
MDYAFLGVVTAVLLGSITSLWTASVQATHRLSLDSLWVLVQWYGTMLLGFAMIYMILQAN
GHHVFTPSPSSAAGNRLSMLEDSLYLSGMTLLSVGYGDVTPVGIGRWIAIAEALLGYIMP
AVIVTRTVFDSDRR

and the id file (.txt)

Q09666 
Q7PMK5 
P98161 
A0CX44 
Q23K98 
P29993 
Q9VLS3 
Q14643 
Q9H5I5 
Q4E330 
P29995

I don't think there is a problem with the python script, but maybe the fasta file. can anyone spot anything strange. I know for a fact all id's match.

Thanks very much,

Arron.

bioperl biopython • 3.2k views
ADD COMMENTlink modified 10 months ago by santhoshhegde2780 • written 5.7 years ago by arronslacey230
3
gravatar for Xtof
5.7 years ago by
Xtof170
Paris
Xtof170 wrote:

Hi Arron

I don't use Biopython but I think that for the first sequence of your fasta file for example r.id gives you "gnl|TC-DB|P0A334". But into the id file only the last part is written. So I think you have to split your r.id if you want it to match to the id file

records = (r for r in SeqIO.parse(input_file, "fasta") if r.id.split('|')[2] in wanted)

Also I wonder why you split the id into the line

wanted = set(line.rstrip("\n").split(None,1)[0] for line in open(id_file))

this code should work as well

wanted = set(line.rstrip("\n") for line in open(id_file))

Hope that helps

Chris

ADD COMMENTlink written 5.7 years ago by Xtof170

thanks chris, yes the website i downloaded the fast files from (TCDB) has included this in the id which is annoying. thanks for pointing this out.

ADD REPLYlink written 5.7 years ago by arronslacey230
2
gravatar for Jeroen Van Goey
5.7 years ago by
Jeroen Van Goey2.2k
Ghent, Belgium
Jeroen Van Goey2.2k wrote:

r.id in the line `

records = (r for r in SeqIO.parse(input_file, "fasta") if r.id in wanted)

gives back something from the form 'gnl|TC-DB|P0A334'. So, just changing that line to

records = (r for r in SeqIO.parse(input_file, "fasta") if r.id.split('|')[2] in wanted)

will work. I will assume that your snippet of the fasta file is badly chosen, because none of the ids is anywhere to be found in that snippet. But if you add P0A334 to your list of ids, the record will now be saved.

ADD COMMENTlink written 5.7 years ago by Jeroen Van Goey2.2k

thanks very much, worked like a charm.

ADD REPLYlink written 5.7 years ago by arronslacey230
0
gravatar for santhoshhegde278
10 months ago by
santhoshhegde2780 wrote:

I have a genome in fasta format and I have a sequence of length 5000bp I want to find if the sequence is present in that genome or not? If it present then what is the exact position of that sequence in genome or overlapping gene id? How to get it in python? please can anyone suggest me

ADD COMMENTlink written 10 months ago by santhoshhegde2780
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1442 users visited in the last hour