Hi - I am using biopython (open to any other method however) to find sequences in a fasta file that match ID's from a .txt file. Just by looking at the two files I can see there are matches, but the following script will not find them:
#!/usr/bin/env python import sys from Bio import SeqIO input_file = sys.argv id_file = sys.argv output_file = sys.argv wanted = set(line.rstrip("\n").split(None,1) for line in open(id_file)) print "Found %i unique identifiers in %s" % (len(wanted), id_file) records = (r for r in SeqIO.parse(input_file, "fasta") if r.id in wanted) count = SeqIO.write(records, output_file, "fasta") print "Saved %i records from %s to %s" % (count, input_file, output_file) if count < len(wanted): print "Warning %i IDs not found in %s" % (len(wanted)-count, input_file)
here is a snippet of the fasta file (.fasta):
>gnl|TC-DB|P0A334 1.A.1.1.1 Voltage-gated potassium channel OS=Streptomyces lividans GN=kcsA PE=1 SV=1 MPPMLSGLLARLVKLLLGRHGSALHWRAAGAATVLLVIVLLAGSYLAVLAERGAPGAQLI TYPRALWWSVETATTVGYGDLYPVTLWGRLVAVVVMVAGITSFGLVTAALATWFVGREQE RRGHFVRHSEKAAEEAYTRTTRALHERFDRLERMLDDNRR >gnl|TC-DB|P06550 1.A.1.1.2 LctB protein - Bacillus stearothermophilus. MDYAFLGVVTAVLLGSITSLWTASVQATHRLSLDSLWVLVQWYGTMLLGFAMIYMILQAN GHHVFTPSPSSAAGNRLSMLEDSLYLSGMTLLSVGYGDVTPVGIGRWIAIAEALLGYIMP AVIVTRTVFDSDRR
and the id file (.txt)
Q09666 Q7PMK5 P98161 A0CX44 Q23K98 P29993 Q9VLS3 Q14643 Q9H5I5 Q4E330 P29995
I don't think there is a problem with the python script, but maybe the fasta file. can anyone spot anything strange. I know for a fact all id's match.
Thanks very much,