Question: How to concatenate clustalw aligned fasta sequence?
0
gravatar for Dineshkumar K
8 months ago by
Kasaragod, Kerala, India
Dineshkumar K40 wrote:

I need to concatenate clustalw aligned multifasta alignment file. I have tried many concatenating tools, which is not recognising the gaps (---) of the alignment and ultimately end up with error stating that sequence dissimilar length. Is there any way to do the same. For example I have alignment file a.fasta and b.fasta as given below,

a.fasta
>or1
ATGTCT-----TGA
>or2
-----TGATAG-----
>or3
TGATA-----TAGTT

b.fasta
>or2
ATGATGTATGATGATA
>or1
GTAGATAGATAGAG
>or3
ATGCTAGATAGATAG

The expected output is given below,

c.fasta
>or1
ATGTCT-----TGAGTAGATAGATAGAG
>or2
-----TGATAG-----ATGATGTATGATGATA
>or3
TGATA-----TAGTT ATGCTAGATAGATAG
ADD COMMENTlink modified 8 months ago by shenwei3565.5k • written 8 months ago by Dineshkumar K40
1

are all those sequences (aligned ones) on a single line ?

if so then someting like this should work:

paste <(cat a.fasta | paste - - | sort) <(cat b.fasta | paste - - | sort) | awk '{ print $1,$2$4 }' | sed 's/ /\n/g'
ADD REPLYlink modified 8 months ago • written 8 months ago by lieven.sterck8.7k

Dear @ lieven.sterck, Here, I showed an example sequence. But, in reality, I have multiple lined fasta file. All the sequences are not in single line.

ADD REPLYlink written 8 months ago by Dineshkumar K40

you can first linearise them if that does not hinder the downstream analysis? eg using this:

https://gist.github.com/lindenb/2c0d4e11fd8a96d4c345

but I must admit that shenwei356 solution using the seqkit concat is much neater ;)

ADD REPLYlink modified 8 months ago • written 8 months ago by lieven.sterck8.7k

Thank you @lieven.sterck for your valuable suggestion.

ADD REPLYlink written 8 months ago by Dineshkumar K40

-update: I didn't realize that b.fasta was aligned. The following method only works if the second fasta file is unaligned. If you have the raw sequences for b.fasta you can use this method. Specially if you repeatedly want to align sequences with a.fasta you can have a profile alignment of a.fasta (a.hmm) and use hmmalign to align other sequences to a.fasta.

You can use hmmbuild to build a profile from a.fasta and then use hmmalign to align b.fasta to the hmm profile.

hmmbuild a.hmm a.fasta

Or even better: Convert aligned fasta to stockholm

hmmbuild a.hmm a.sto
hmmalign with -mapali option
hmmalign -o c.sto --mapali a.sto a.hmm b.fasta
hmmalign --outformat A2M -o c.fasta --mapali a.fasta a.hmm b.fasta

I think hmmalign works best with stockholm format but you can try it with aligned fasta (a.fasta) and see if it works.

ADD REPLYlink modified 8 months ago • written 8 months ago by Fatima730

Thank you @Fatima, However it shows following error,

ga@pls:~/Documents/data/xap95/amphora_anal/clustal_codon/gblock_withgab/gblock_gapfree/test$ hmmbuild a.hmm a.fasta 
# hmmbuild :: profile HMM construction from multiple sequence alignments
# HMMER 3.1b2 (February 2015); http://hmmer.org/
# Copyright (C) 2015 Howard Hughes Medical Institute.
# Freely distributed under the GNU General Public License (GPLv3).
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# input alignment file:             a.fasta
# output HMM file:                  a.hmm
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

# idx name                  nseq  alen  mlen     W eff_nseq re/pos description
#---- -------------------- ----- ----- ----- ----- -------- ------ -----------
Alignment input parse error:
   sequence or2 has alen 16; expected 14
   while reading aligned FASTA file a.fasta
   at or near line 5
ADD REPLYlink modified 8 months ago by genomax91k • written 8 months ago by Dineshkumar K40

All sequences in an aligned fasta should have the same length. or1 is 14 but or2 is 16.

ADD REPLYlink written 8 months ago by Fatima730

hi Fatima ,

can you elaborate on how (or why) this is answering the original question?

if the b.fasta alignments do not cover the same region as a.fasta, then the hmm-align will also not align them, no?

ADD REPLYlink written 8 months ago by lieven.sterck8.7k

My bad. Because b.fasta didn't have any gaps I thought it's an unaligned fasta.

ADD REPLYlink written 8 months ago by Fatima730
4
gravatar for shenwei356
8 months ago by
shenwei3565.5k
China
shenwei3565.5k wrote:

seqkit concat

seqkit concat a.fasta b.fasta > c.fasta
ADD COMMENTlink written 8 months ago by shenwei3565.5k

Thank you @shenwei365. It works fine. But, I could not output the result in new file. It shows result in terminal only, when I output the result using this syntax > c.fasta, which show following error.

ga@pls:~/Documents/data/xap95/amphora_anal/clustal_codon/gblock_withgab/gblock_gapfree/test$ ~/Documents/tools/seqkit/seqkit concat a.fasta b.fasta > c.fasta 
[INFO] read file: a.fasta
[INFO] 3 records loaded
[INFO] read file: b.fasta
[INFO] 3 records loaded

And, when I am not specifying the output, the results showed on the terminal as follows,

ga@pls:~/Documents/data/xap95/amphora_anal/clustal_codon/gblock_withgab/gblock_gapfree/test$ ~/Documents/tools/seqkit/seqkit concat a.fasta b.fasta 
[INFO] read file: a.fasta
[INFO] 3 records loaded
[INFO] read file: b.fasta
[INFO] 3 records loaded
>or1
ATGTCT-----TGAGTAGATAGATAGAG
>or2
-----TGATAG-----ATGATGTATGATGATA
>or3
TGATA-----TAGTATGCTAGATAGATAG
ADD REPLYlink modified 8 months ago by shenwei3565.5k • written 8 months ago by Dineshkumar K40
2

Did you look to see if c.fasta was made? In first case that is all you should see on screen. That would be the default behavior of the program when things work.

ADD REPLYlink modified 8 months ago • written 8 months ago by genomax91k

Sorry for my misinterpretation. @shenwei356 suggestion works well.

ADD REPLYlink written 8 months ago by Dineshkumar K40
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1199 users visited in the last hour