Question: Combining two fasta sequences into one
Lille My30 wrote:

I have two FASTA files with the same sequence headers/names but different sequences. I would like to combine them into one file, so that each record keeps its name and its sequence is the concatenation of the sequences from both files. My preferred language is Bash, but I'm open to other suggestions. Thanks.

ADD COMMENTlink modified 2.5 years ago by Pierre Lindenbaum121k • written 2.5 years ago by Lille My30

with the same headers/names for the sequences but different sequences

uhh ?

would like to combine them into one file, s

an example is needed

ADD REPLYlink written 2.5 years ago by Pierre Lindenbaum121k

Like this?

File_1:

>Seq_1
ACGCTAGCTA
>Seq_2
CGCTAGCTC

File_2:

>Seq_1
GCTGAT
>Seq_2
TTACTC

File_1 + File_2 = File_3

>Seq_1
ACGCTAGCTAGCTGAT
>Seq_2
CGCTAGCTCTTACTC
ADD REPLYlink modified 2.5 years ago • written 2.5 years ago by genomax69k

yes, your example is exactly what I need to do.

ADD REPLYlink written 2.5 years ago by Lille My30

Does this make biological sense?

ADD REPLYlink written 2.5 years ago by WouterDeCoster40k

Sometimes it does, depends on what kind of sequences you have.

ADD REPLYlink written 2.5 years ago by Lille My30
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum121k wrote:

Assuming there are only two lines per sequence (title/DNA) and the records are in the same order in both files:

paste  f1.fa f2.fa | sed -e 's/\t>.*//' -e 's/\t//'
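
The command above relies on two assumptions: two lines per record, and the same record order in both files. A minimal sketch for checking and preparing the input could look like this. The helper names `linearize` and `same_header_order` are hypothetical, and the sketch assumes each file starts with a ">" header line.

```shell
# linearize: hypothetical helper, not part of the command above.
# Prints each record as one header line followed by one sequence line.
# Assumes the file starts with a ">" header line.
linearize() {
  awk '/^>/ { if (NR > 1) print seq; print; seq = ""; next }
       { seq = seq $0 }
       END { if (NR > 0) print seq }' "$1"
}

# same_header_order: hypothetical helper.
# Succeeds (exit status 0) only if both files list identical
# header lines in the same order.
same_header_order() {
  [ "$(grep '^>' "$1")" = "$(grep '^>' "$2")" ]
}

# Usage sketch (f1.fa and f2.fa as in the command above):
#   same_header_order f1.fa f2.fa || echo "headers differ" >&2
#   linearize f1.fa > f1.flat.fa
#   linearize f2.fa > f2.flat.fa
#   paste f1.flat.fa f2.flat.fa | sed -e 's/\t>.*//' -e 's/\t//'
```

If `same_header_order` fails, the paste approach would silently glue unrelated sequences together, so an ID-based join is the safer route in that case.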
ADD COMMENTlink modified 2.5 years ago • written 2.5 years ago by Pierre Lindenbaum121k

My files have more than two lines per sequence, but I can convert them to single-line records. I will try this out. Thanks!

ADD REPLYlink written 2.5 years ago by Lille My30
China
shenwei3564.7k wrote:

A solution using seqkit, csvtk and shell sed.

Sample files (not in same order, can be multiple lines):

$ cat 1.fa
>seq1
aaa
aa
>seq2
ccc
cc
>seq3
ggg
gg

$ cat 2.fa
>seq3
TTT
TT
>seq2
GGG
GG
>seq1
CCC
CC

Just one command:

$ seqkit concat 1.fa 2.fa
>seq1
aaaaaCCCCC
>seq2
cccccGGGGG
>seq3
gggggTTTTT

Step 1. Convert each FASTA file to a tab-delimited file with 3 columns. The 3rd column (quality) is blank because FASTA has no quality values:

$ seqkit fx2tab 1.fa > 1.fa.tsv
$ seqkit fx2tab 2.fa > 2.fa.tsv

$ cat -A 1.fa.tsv 
seq1^Iaaaaa^I$
seq2^Iccccc^I$
seq3^Iggggg^I$

Step 2. Merge two table files:

$ csvtk join -H -t 1.fa.tsv 2.fa.tsv | cat -A
seq1^Iaaaaa^I^ICCCCC^I$
seq2^Iccccc^I^IGGGGG^I$
seq3^Iggggg^I^ITTTTT^I$

Step 3. Note that there are two TABs between the two sequences, so we can remove them to join the sequences:

$ csvtk join -H -t 1.fa.tsv 2.fa.tsv | sed 's/\t\t//'
seq1    aaaaaCCCCC
seq2    cccccGGGGG
seq3    gggggTTTTT

Step 4. Convert tab-delimited file back to FASTA file:

$ csvtk join -H -t 1.fa.tsv 2.fa.tsv | sed 's/\t\t//' | seqkit tab2fx
>seq1
aaaaaCCCCC
>seq2
cccccGGGGG
>seq3
gggggTTTTT

All in one command:

$ csvtk join -H -t <(seqkit fx2tab 1.fa) <(seqkit fx2tab 2.fa) | sed 's/\t\t//' | seqkit tab2fx
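
If seqkit and csvtk are not available, the same join-by-name idea can be sketched with awk alone. This is only an illustration: `concat_by_id` is a hypothetical helper, it matches records on the whole header line (not just the first word), it assumes both files are non-empty, and it drops headers that appear only in the second file.

```shell
# concat_by_id: hypothetical helper, an awk-only sketch of the join above.
# Prints one record per header of the first file, in the first file's order,
# with the sequence from file 1 followed by the sequence from file 2.
# Multi-line records and differing record order are both tolerated.
concat_by_id() {
  awk '
    FNR == 1 { fileno++ }              # count which input file we are in
    /^>/ {
      id = $0                          # the whole header line is the key
      if (fileno == 1 && !(id in seen)) { seen[id] = 1; order[++n] = id }
      next
    }
    { if (fileno == 1) a[id] = a[id] $0; else b[id] = b[id] $0 }
    END {
      for (i = 1; i <= n; i++) {
        print order[i]
        print a[order[i]] b[order[i]]
      }
    }
  ' "$1" "$2"
}

# Usage sketch with the sample files above:
#   concat_by_id 1.fa 2.fa > merged.fa
```

Both files are held in memory, which should be fine for small inputs but may not suit very large ones.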

ADD COMMENTlink modified 22 months ago • written 2.5 years ago by shenwei3564.7k

thanks! I will try this out.

ADD REPLYlink written 2.5 years ago by Lille My30