Question

How to concatenate FASTA files in a specific order based on sequence ID

1

3.1 years ago

Apex92 ▴ 280

I've got a problem that I don't have the scripting skills to solve (nor the time to gain them at the moment).

I have two FASTA files that I want to merge so that records with the same sequence ID are printed back to back - and any sequence ids in one of the files that do not exist in another file then those should not be printed.

cat f1.fa

>seq1
ATCGTCA
>seq2
AAAAACT
>seq3
AACATCA
>seq71
CCCGA

cat f2.fa

>seq1
AAAATCGCGCGCATG
>seq1
AAATAAAAACGCTCGGG
>seq2
TTAGCGCTAGCCCGCGCTCAGC
>seq71
AACGCGCATG
>seq81
AAACCCAGCGCATGCA

So the desired output should look like:

>seq1
ATCGTCA
>seq1
AAAATCGCGCGCATG
>seq1
AAATAAAAACGCTCGGG
>seq2
AAAAACT
>seq2
TTAGCGCTAGCCCGCGCTCAGC
>seq71
CCCGA
>seq71
AACGCGCATG

I'd prefer to use Python, as that is the language I'm learning, but any solution will suffice.

Thanks for helping.

sequence fasta awk python pipeline • 3.3k views

ADD COMMENT • link 3.1 years ago by Apex92 ▴ 280

score 3 · Answer 1 · 2021-03-30

3

3.1 years ago

Pierre Lindenbaum 161k

using linearizefasta.awk

join -t $'\t' -1 1 -2 1 \
         <(awk -f linearizefasta.awk in1.fa | sort -t $'\t' -k1,1) \
         <(awk -f linearizefasta.awk in2.fa | sort -t $'\t' -k1,1) |\
awk '{printf("%s\n%s\n%s\n%s\n",$1,$2,$1,$3);}'

ADD COMMENT • link 3.1 years ago by Pierre Lindenbaum 161k

0

Nice one Pierre Lindenbaum, but for some reason it always gives a 'duplication' of the first entry of the first file?

ADD REPLY • link 3.1 years ago by lieven.sterck 15k

1

Nice catch. It's because seq1 is present twice in the second file, so a uniq should be added.
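To see where the extra copy comes from, here is a small Python sketch of what join does (an illustration only, not the awk pipeline above; inner_join is a made-up helper name). Every record of the first file is paired with every record of the second file that has the same ID, so seq1 from f1.fa shows up once per seq1 record in f2.fa:

```python
def inner_join(rows1, rows2):
    """Pair each (id, seq) row of rows1 with every row of rows2 sharing its ID."""
    return [(id1, s1, s2)
            for id1, s1 in rows1
            for id2, s2 in rows2
            if id1 == id2]


# Linearized records (ID, sequence) taken from the example files in the question.
f1 = [("seq1", "ATCGTCA"), ("seq2", "AAAAACT")]
f2 = [("seq1", "AAAATCGCGCGCATG"),
      ("seq1", "AAATAAAAACGCTCGGG"),
      ("seq2", "TTAGCGCTAGCCCGCGCTCAGC")]

joined = inner_join(f1, f2)
for seq_id, seq_from_f1, seq_from_f2 in joined:
    # seq1 yields two joined rows, and each row carries the f1 sequence again.
    print(seq_id, seq_from_f1, seq_from_f2)
```

Printing columns 1, 2, 1, 3 of each joined row therefore emits the f1 copy of seq1 twice.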

ADD REPLY • link 3.1 years ago by Pierre Lindenbaum 161k

0

@Pierre thank you for your help - could you please also share how I can run it? I am very new to this stuff, sorry.

ADD REPLY • link 3.1 years ago by Apex92 ▴ 280

score 2 · Answer 2 · 2021-03-30

The following script should do what you want as long as you don't have duplicate sequence names in the same file (as you do in f2.fa in your example but hopefully not in your real data). You will need to install the pyfaidx module (e.g. with pip3 install pyfaidx --user).

#!/usr/bin/env python3
import sys
from pyfaidx import Fasta


def cat_fastas(*fasta_files):
    fas = list()
    all_contigs = set()
    for f in fasta_files:
        faidx = Fasta(f)
        fas.append(faidx)
        contigs = set(faidx.keys())
        all_contigs.update(contigs)
    for c in sorted(all_contigs):
        for f in fas:
            if c in f:
                print(">" + f[c].long_name)
                print(f[c])


if __name__ == '__main__':
    if len(sys.argv) < 3:
        sys.exit("Usage: {} f1.fa f2.fa ...fN.fa".format(sys.argv[0]))
    cat_fastas(*sys.argv[1:])

Because you technically shouldn't have two fasta records with the same sequence name in the same file you might want to consider checking for duplicates and adding a suffix to any duplicate sequences too.
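A minimal sketch of that duplicate check, assuming you already have the list of sequence names in file order. The function name and the _2, _3 suffix scheme are just illustrative choices, and the sketch does not guard against a suffixed name clashing with a name that already exists in the file:

```python
from collections import Counter


def add_suffix_to_duplicates(names):
    """Return the names with _2, _3, ... appended to repeated entries."""
    seen = Counter()
    unique = []
    for name in names:
        seen[name] += 1
        # The first occurrence keeps its name; later ones get a numeric suffix.
        unique.append(name if seen[name] == 1 else f"{name}_{seen[name]}")
    return unique


# Names from f2.fa in the question, where seq1 appears twice.
print(add_suffix_to_duplicates(["seq1", "seq1", "seq2", "seq71", "seq81"]))
# ['seq1', 'seq1_2', 'seq2', 'seq71', 'seq81']
```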

EDIT:

I missed the part of your question

and any sequence ids in one of the files that do not exist in another file then those should not be printed

In which case:

#!/usr/bin/env python3
import sys
from pyfaidx import Fasta


def cat_fastas(f1, f2):
    fas1 = Fasta(f1)
    fas2 = Fasta(f2)
    overlapping_contigs = set(fas1.keys()).intersection(set(fas2.keys()))
    for c in sorted(overlapping_contigs):
        for f in (fas1, fas2):
            print(">" + f[c].long_name)
            print(f[c])


if __name__ == '__main__':
    if len(sys.argv) != 3:
        sys.exit("Usage: {} f1.fa f2.fa".format(sys.argv[0]))
    cat_fastas(*sys.argv[1:])

Should do what you want.
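If your real data does contain repeated names within a file (like seq1 in f2.fa), a plain-Python variant may be easier. This is only a sketch using the standard library, not pyfaidx; read_fasta and merge_shared are names I made up, and it assumes the sequence ID is the first word of the header line:

```python
#!/usr/bin/env python3
import sys
from collections import defaultdict


def read_fasta(lines):
    """Yield (header, sequence) pairs; wrapped sequences are joined."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def merge_shared(lines1, lines2):
    """Return FASTA lines for IDs present in both inputs, grouped by ID."""
    groups = []
    for lines in (lines1, lines2):
        by_id = defaultdict(list)
        for header, seq in read_fasta(lines):
            # Assumption: the ID is the first whitespace-delimited word.
            seq_id = (header.split() or [""])[0]
            by_id[seq_id].append((header, seq))
        groups.append(by_id)
    out = []
    # Dicts keep insertion order, so IDs come out in first-file order.
    for seq_id in groups[0]:
        if seq_id in groups[1]:
            for by_id in groups:
                for header, seq in by_id[seq_id]:
                    out.extend([">" + header, seq])
    return out


if __name__ == "__main__":
    if len(sys.argv) == 3:
        with open(sys.argv[1]) as f1, open(sys.argv[2]) as f2:
            print("\n".join(merge_shared(f1, f2)))
```

Saved as, say, merge_shared.py (a hypothetical file name), it would be run as python3 merge_shared.py f1.fa f2.fa > merged.fa. On the example files it should reproduce the desired output from the question, and it needs neither pyfaidx nor any GNU-specific command-line options.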

score 1 · Answer 3 · 2021-03-30

If the order of the elements with the same name does not really matter, you could do the following:

sort -V <(cat <yourFile1> |paste - -) <(cat <yourFile2> |paste - -) | sed 's/\t/\n/g'

[replace the <yourFile1> and <yourFile2> parts with your input FASTA files]

The parts inside <( ... ) are bash process substitutions; each one turns a FASTA file into one line per record, with the ID and the sequence separated by a tab (this assumes every sequence sits on a single line). sort then orders the combined records from both files, and finally the sed part converts the output back to FASTA format (the ID on one line and the sequence on the next).

OK, I missed this part as well:

and any sequence ids in one of the files that do not exist in another file then those should not be printed

Then it's all for Pierre Lindenbaum's solution :)

score 1 · Answer 4 · 2021-03-30

1

3.1 years ago

cpad0112 21k

with OP data

$ cat seq1.fa seq2.fa | paste - - | sort -sk1,1 | tr "\t" "\n"

>seq1
ATCGTCA
>seq1
AAAATCGCGCGCATG
>seq1
AAATAAAAACGCTCGGG
>seq2
AAAAACT
>seq2
TTAGCGCTAGCCCGCGCTCAGC
>seq3
AACATCA
>seq71
CCCGA
>seq71
AACGCGCATG
>seq81
AAACCCAGCGCATGCA

If fasta files are not linearized, try with seqkit:

$ seqkit fx2tab seq1.fa seq2.fa | sort -sk1,1 | seqkit tab2fx

This is a roundabout way, needed because of a limitation in seqkit sort: seqkit doesn't allow duplicate headers while sorting.

ADD COMMENT • link 3.1 years ago by cpad0112 21k

0

cpad0112, seq3 and seq81 are not common between the two files and should not be printed - could you please help with that?

ADD REPLY • link 3.1 years ago by Apex92 ▴ 280

1

$ cat seq1.fa seq2.fa | paste - - | sort -sk1,1 | awk -v OFS="\t" '{print $2,$1}' | uniq -D -f 1 | awk -v OFS="\n" '{print $2,$1}'

>seq1
ATCGTCA
>seq1
AAAATCGCGCGCATG
>seq1
AAATAAAAACGCTCGGG
>seq2
AAAAACT
>seq2
TTAGCGCTAGCCCGCGCTCAGC
>seq71
CCCGA
>seq71
AACGCGCATG

seq71 is common.

ADD REPLY • link 3.1 years ago by cpad0112 21k

0

uniq: illegal option -- D
usage: uniq [-c | -d | -u] [-i] [-f fields] [-s chars] [input [output]]

ADD REPLY • link 3.1 years ago by Apex92 ▴ 280

0

What OS are you using? uniq -D is a GNU extension; I am using GNU coreutils 8.30 on Ubuntu 20.04 (focal).

ADD REPLY • link 3.1 years ago by cpad0112 21k

0

My system is a Mac.

ADD REPLY • link 3.1 years ago by Apex92 ▴ 280

0

I am not sure how I can fix this -D problem - is there another way of doing it? I am short on time, otherwise I would consider installing the GNU tools.

ADD REPLY • link 3.1 years ago by Apex92 ▴ 280

0

see if this works (try it on example files in OP):

$ join -1 1 -2 1 <(paste - - <seq1.fa) <(paste - - <seq2.fa) -o 1.1 | uniq  | grep -A 1 --no-group-separator -hf - seq1.fa seq2.fa | paste - -  | sort -sk1,1 | awk -v OFS="\n" '{print $1,$2}'

>seq1
ATCGTCA
>seq1
AAAATCGCGCGCATG
>seq1
AAATAAAAACGCTCGGG
>seq2
AAAAACT
>seq2
TTAGCGCTAGCCCGCGCTCAGC
>seq71
CCCGA
>seq71
AACGCGCATG

The join step picks out the IDs that are common to seq1.fa and seq2.fa. If you instead want every ID that is present in seq1.fa:

$ grep ">" seq1.fa  | grep -A 1 --no-group-separator -hf - seq1.fa seq2.fa | paste - -  | sort -sk1,1 | awk -v OFS="\n" '{print $1,$2}'

ADD REPLY • link 3.1 years ago by cpad0112 21k

0

No, unfortunately it is not working - I get errors with the options of join and grep.

ADD REPLY • link 3.1 years ago by Apex92 ▴ 280

0

error with --no-group-separator

ADD REPLY • link 3.1 years ago by Apex92 ▴ 280