If I've understood the actual request properly (I think this is an XY
question), my general purpose fasta fetching script will work, with some slight modification (we need to test seq.descriptions
instead of seq.id
since yours have spaces in). If I am correct though, essentially this is probably the #2 most asked question on the forum (maybe even tied with "reformatting fasta headers").
#!/usr/bin/env python
from Bio import SeqIO
import sys, six
import argparse
def getKeys(args):
"""Turns the input key file into a list."""
with open(args.keyfile, "r") as kfh:
keys = [line.rstrip('\n').lstrip('>') for line in kfh]
return keys
def main():
"""Takes a string or list of strings in a text file (one per line) and retreives them and their sequences from a provided multifasta."""
# Parse arguments from the commandline:
try:
parser = argparse.ArgumentParser(description='Retrieve one or more fastas from a given multifasta.')
parser.add_argument(
'-f',
'--fasta',
action='store',
required=True,
help='The multifasta to search.')
parser.add_argument(
'-k',
'--keyfile',
action='store',
help='A file of header strings to search the multifasta for. Must be exact. Must be one per line.')
parser.add_argument(
'-s',
'--string',
action='store',
help='Provide a string to look for directly, instead of a file.')
parser.add_argument(
'-o',
'--outfile',
action='store',
default=None,
help='Output file to store the new fasta sequences in. Just prints to screen by default.')
parser.add_argument(
'-v',
'--verbose',
action='store_true',
help='Set whether to print the key list out before the fasta sequences. Useful for debugging.')
parser.add_argument(
'-i',
'--invert',
action='store_true',
help='Invert the search, and retrieve all sequences NOT specified in the keyfile.')
args = parser.parse_args()
except:
print('An exception occured with argument parsing. Check your provided options.')
sys.exit(1)
# Main code:
# Call getKeys() to create the list of keys from the provided file:
if not (args.keyfile or args.string):
print('No key source provided. Exiting.')
sys.exit(1)
elif not args.keyfile:
keys = args.string
else:
keys = getKeys(args)
if args.verbose is not False:
if args.invert is False:
print('Fetching the following keys: ' + args.keyfile)
if isinstance(keys, six.string_types):
print(keys)
elif isinstance(keys, (list,)):
for key in keys:
print(key)
else:
print('Ignoring the following keys, and retreiving everything else from: ' + args.fasta)
for key in keys:
print(key)
# Parse in the multifasta and assign an iterable variable:
seqIter = SeqIO.parse(args.fasta, 'fasta')
if args.verbose is not False:
print("Found Sequences with Descriptors:")
for seq in seqIter:
print(seq.description)
# For each sequence in the multifasta, check if it's in the keys[] tuple. If so, print it out:
for seq in seqIter:
if args.invert is False:
if seq.description in keys:
print(seq.format("fasta"))
if args.outfile is not None:
SeqIO.write(seq, outfile, "fasta")
else:
# If the --invert/-i flag is given, print all fastas NOT listed in the keyfile
if seq.id not in keys:
print(seq.format("fasta"))
if args.outfile is not None:
SeqIO.write(seq, outfile, "fasta")
if __name__ == "__main__":
main()
Call it as follows:
python fastafetcher.py -f fileB.fasta -k fileA.txt
It will find all the entries in the sequence file which have a corresponding entry in the header file. In the case of your test data above, that's just the first one as I said before:
$ python ~/bin/bioinfo-tools/fastafetcher.py -f fileB.fasta -k fileA.txt
>TR1_c0_g1_i1_1 [5 - 106] len=249 path=[478:0-82 479:83-106 480:107-248] [-1, 478, 479, 480, -2]
MRMRMIKMLLWRSISVRDYSMAKTKPVNCRFTLR
It sounds like you actually have 0 FASTA files, since FASTA is a specific (though loose) format, and neither of those seems like it fits the bill - at least as you've described.
Can you show us the result of
head fasta_a
andhead fasta_b
so we know what the data looks like?Why are you trying to write a .csv if you want the output to be appended fastas (which would mean fasta format no?)?
Are you sure this question isn't better phrased as "I want to find all the fasta entries in file B, that have matching headers in fileA"? Or am I misinterpreting something? I'm not sure why you want to put the sequences in file A?
If file B has:
That's the same as the first header in file A:
So all we need to do is grab that entry and put it in a new file?