Question

Merge fasta files

0

Entering edit mode

5.7 years ago

js8226 • 0

Hello! I have two fasta files; 'fasta_a' contains gene names, and 'fasta_b' contains gene names, sequences, lengths, paths, etc. If the gene name from 'fasta_b' exists in 'fasta_a' I would like to append 'fasta_a' to contain the sequence found in 'fasta_b'.

So far, I have created two dictionaries using SeqIO and have attempted to append fasta_a with the sequences of fasta_b but my output is giving me a string of all the values from fasta_b.

I need to append fasta_a with the sequences from fasta_b. Please let me know if you know of a better way to do this or if you can figure out what I am doing wrong....thank you in advance for your help!

fasta_a_dict = SeqIO.to_dict(SeqIO.parse(fasta_a_path, "fasta"))
fasta_b_dict = {rec.id : rec.seq for rec in SeqIO.parse(fasta_b_path, "fasta")}

for key in fasta_a_dict:
    if key in fasta_b_dict:
        fasta_a_dict[key].features.append(fasta_b_dict[key])

with open(output_path, "w") as output:
    df = pd.DataFrame.from_dict([fasta_a_dict])
    df.to_csv(output, header=False, index=True, mode='a')

python fasta biopython • 3.0k views

ADD COMMENT • link updated 5.7 years ago by finswimmer 16k • written 5.7 years ago by js8226 • 0

2

Entering edit mode

I have two fasta files; 'fasta_a' contains gene names, and 'fasta_b' contains gene names, sequences, lengths, paths, etc.

It sounds like you actually have 0 FASTA files, since FASTA is a specific (though loose) format, and neither of those seems like it fits the bill - at least as you've described.

Can you show us the result of head fasta_a and head fasta_b so we know what the data looks like?

Why are you trying to write a .csv if you want the output to be appended fastas (which would mean fasta format no?)?

ADD REPLY • link 5.7 years ago by Joe 21k

0

Entering edit mode

head fasta_a:

    >TR1_c0_g1_i1_1 [5 - 106] len=249 path=[478:0-82 479:83-106 480:107-248] [-1, 478, 479, 480, -2] 

    >TR4_c0_g1_i1_2 [303 - 428] len=477 path=[958:0-213 959:214-237 960:238-476] [-1, 958, 959, 960, -2]

    >TR5_c0_g1_i1_2 [117 - 509] len=509 path=[1022:0-361 1023:362-385 1024:386-508] [-1, 1022, 1023, 1024, -2]

    >TR7_c0_g1_i1_2 [315 - 37] (REVERSE SENSE) len=380 path=[786:0-319 787:320-343 788:344-344 789:345-379] [-1, 786, 787, 788, 789, -2]

ADD REPLY • link updated 5.7 years ago by GenoMax 142k • written 5.7 years ago by js8226 • 0

0

Entering edit mode

head fasta_b:

>TR1_c0_g1_i1_1 [5 - 106] len=249 path=[478:0-82 479:83-106 480:107-248] [-1, 478, 479, 480, -2]
MRMRMIKMLLWRSISVRDYSMAKTKPVNCRFTLR
>TR1_c0_g1_i1_2 [147 - 248] len=249 path=[478:0-82 479:83-106 480:107-248] [-1, 478, 479, 480, -2]
MAAGAYEDKKSRRKKEKKKKKKKERKGGKRKEKK
>TR1_c0_g1_i1_3 [115 - 249] len=249 path=[478:0-82 479:83-106 480:107-248] [-1, 478, 479, 480, -2]
MTVRYILSVIKWLRVLTKTKKAEERKKRKRRKRKKERGGKGRKRK

>TR2_c0_g1_i1_1 [264 - 404] len=404 path=[816:0-319 817:320-403] [-1, 816, 817, -2]
MFRSFIAYNDLYLKVKKHEQNIWQQKSGFHIKINCGGFSYLQRKNLG

>TR3_c0_g1_i1_1 [535 - 443] (REVERSE SENSE) len=651 path=[1282:0-484 1283:485-508 1284:509-650] [-1, 1282, 1283, 1284, -2]
MLNFPTCQTLSAHVRTHQILRKLLARTEYFP

ADD REPLY • link updated 5.7 years ago by GenoMax 142k • written 5.7 years ago by js8226 • 0

0

Entering edit mode

Are you sure this question isn't better phrased as "I want to find all the fasta entries in file B, that have matching headers in fileA"? Or am I misinterpreting something? I'm not sure why you want to put the sequences in file A?

If file B has:

>TR1_c0_g1_i1_1 [5 - 106] len=249 path=[478:0-82 479:83-106 480:107-248] [-1, 478, 479, 480, -2]
MRMRMIKMLLWRSISVRDYSMAKTKPVNCRFTLR

That's the same as the first header in file A:

>TR1_c0_g1_i1_1 [5 - 106] len=249 path=[478:0-82 479:83-106 480:107-248] [-1, 478, 479, 480, -2]

So all we need to do is grab that entry and put it in a new file?

ADD REPLY • link 5.7 years ago by Joe 21k

score 1 · Answer 1 · 2018-08-23

If I've understood the actual request properly (I think this is an XY question), my general purpose fasta fetching script will work, with some slight modification (we need to test seq.descriptions instead of seq.id since yours have spaces in). If I am correct though, essentially this is probably the #2 most asked question on the forum (maybe even tied with "reformatting fasta headers").

#!/usr/bin/env python
from Bio import SeqIO
import sys, six
import argparse

def getKeys(args):
    """Turns the input key file into a list."""

    with open(args.keyfile, "r") as kfh:
        keys = [line.rstrip('\n').lstrip('>') for line in kfh]

    return keys

def main():
    """Takes a string or list of strings in a text file (one per line) and retreives them and their sequences from a provided multifasta."""
    # Parse arguments from the commandline:
    try:
        parser = argparse.ArgumentParser(description='Retrieve one or more fastas from a given multifasta.')
        parser.add_argument(
            '-f',
            '--fasta',
            action='store',
            required=True,
            help='The multifasta to search.')
        parser.add_argument(
            '-k',
            '--keyfile',
            action='store',
            help='A file of header strings to search the multifasta for. Must be exact. Must be one per line.')
        parser.add_argument(
            '-s',
            '--string',
            action='store',
            help='Provide a string to look for directly, instead of a file.')
        parser.add_argument(
            '-o',
            '--outfile',
            action='store',
            default=None,
            help='Output file to store the new fasta sequences in. Just prints to screen by default.')
        parser.add_argument(
            '-v',
            '--verbose',
            action='store_true',
            help='Set whether to print the key list out before the fasta sequences. Useful for debugging.')
        parser.add_argument(
            '-i',
            '--invert',
            action='store_true',
            help='Invert the search, and retrieve all sequences NOT specified in the keyfile.')
        args = parser.parse_args()

    except:
        print('An exception occured with argument parsing. Check your provided options.')
        sys.exit(1)
# Main code:
    # Call getKeys() to create the list of keys from the provided file:

    if not (args.keyfile or args.string):
        print('No key source provided. Exiting.')
        sys.exit(1)
    elif not args.keyfile:
        keys = args.string
    else:
        keys = getKeys(args)

    if args.verbose is not False:
        if args.invert is False:
            print('Fetching the following keys: ' + args.keyfile)
            if isinstance(keys, six.string_types):
                print(keys)
            elif isinstance(keys, (list,)):
                for key in keys:
                    print(key)
        else:
            print('Ignoring the following keys, and retreiving everything else from: ' + args.fasta)
            for key in keys:
                print(key)

    # Parse in the multifasta and assign an iterable variable:
        seqIter = SeqIO.parse(args.fasta, 'fasta')
    if args.verbose is not False:
        print("Found Sequences with Descriptors:")
        for seq in seqIter:
            print(seq.description)
    # For each sequence in the multifasta, check if it's in the keys[] tuple. If so, print it out:
    for seq in seqIter:
        if args.invert is False:
            if seq.description in keys:
                print(seq.format("fasta"))
            if args.outfile is not None:
                SeqIO.write(seq, outfile, "fasta")
        else:
    # If the --invert/-i flag is given, print all fastas NOT listed in the keyfile
            if seq.id not in keys:
                print(seq.format("fasta"))
            if args.outfile is not None:
                SeqIO.write(seq, outfile, "fasta")

if __name__ == "__main__":
    main()

Call it as follows:

python fastafetcher.py -f fileB.fasta -k fileA.txt

It will find all the entries in the sequence file which have a corresponding entry in the header file. In the case of your test data above, that's just the first one as I said before:

$ python ~/bin/bioinfo-tools/fastafetcher.py -f fileB.fasta -k fileA.txt
>TR1_c0_g1_i1_1 [5 - 106] len=249 path=[478:0-82 479:83-106 480:107-248] [-1, 478, 479, 480, -2]
MRMRMIKMLLWRSISVRDYSMAKTKPVNCRFTLR

score 0 · Answer 2 · 2018-08-27

Hello,

here is a non-python solution. If I understood correct your fasta_a contains only the header lines of the sequences that you want to match in fasta_b.

1. Create a file that have the header lines without the > at the beginning:

$ cut -c2- fasta_a > names.txt

2. Grep the sequences in fasta_b which header line matches those in names.txt using seqkit:

$ seqkit grep -n -f names.txt fasta_b

fin swimmer