Question

Software for Unaligning Sequences

0

Entering edit mode

5.5 years ago

printingdots • 0

The dataset I have consists of aligned sequences in fasta format. However, I am developing a certain multiple sequence alignment method, for which I need to run my program on unaligned sequences. Is there any software for converting from aligned fasta sequences to unaligned sequences in fasta format?

alignment • 2.9k views

ADD COMMENT • link updated 5.5 years ago by Joe 21k • written 5.5 years ago by printingdots • 0

0

Entering edit mode

please post an example of what you mean "aligned sequences". FASTA as a format is not an alignment format, hence it is not clear what you are looking for

ADD REPLY • link 5.5 years ago by Istvan Albert 100k

0

Entering edit mode

This is part of what my aligned fasta file contains:

>>
> SDEA
ATT-------AG-------------------------GAGGGT-----------------
-----ACCATTGT--------------ACCGGTTGGT-T---------------------
---------ATCAAA---------------------TCG------CCG------------
-GA-----------------------------------------AACCTA----T-----
------------------GTGC----------CCG-------CAGCG-------------
----TGATTTC----A-----------------------CGATGGT------C-------
---------------------CC---T--------------------ATGTGGGC-----
--CA---------------------------A----TCCCGA---------------CAA
AGGCCGAAAA------A-ATT-GAC-G---AC---------TATGT--------------
---------------GAG-----GTGC--------------------GTC-----AA---
-----------------C------------TGT------------------A--------
-----TG------CCTCAAAGTAGCC-----A---------------TGAGTT-------
------------------------CC----------TT----CGCAGGC-----------
--------------------------TTCCCCCCATAGGGGTTT--GTA--TTAC-----
-----TGAA------GATCT----CTAT----AC-------AA-----------------
--GGGAAGTTTTC-----------------------------------------------
G--GT-------------AAGG---CG----------TC------------------CTG
C------------------------------------------G-A-T----------G-
--G------G--------------------CG--------------------------CC
GATTGA--------TGCT------------------------------------------
-------CATG---------------------------------T---------------
---T--------------------------------------AGCAGCGG---------A
--------AATTG--TCTCCG-----------------CGC--------------GTC--
---------A------GAGTGAAAACTCCTC------------CA---------------
------CTATT-----------------------------AGCCCCA-------------
--------------------------AGGCGGT-------AG---A--------------
------TGT-------------------------C-----GATG-------TGG------
---------------------------------------------------------GT-
------------TAAAT-------------------------------------------
--------------TGA---C-CG-----AA-----------------------------
----------GC

ADD REPLY • link updated 5.5 years ago by Istvan Albert 100k • written 5.5 years ago by printingdots • 0

1

Entering edit mode

Run it through awk and delete - gap characters from the sequence string.

ADD REPLY • link 5.5 years ago by Alex Reynolds 35k

4

Entering edit mode

5.5 years ago

cpad0112 21k

$ cat test.fa 
>seq-1-1
a-t-g----c
>seq2
--g--c---
>seq_3
-gatga-tg

$ sed '/>/! s/-//g' test.fa 
>seq-1-1
atgc
>seq2
gc
>seq_3
gatgatg

ADD COMMENT • link 5.5 years ago by cpad0112 21k

2

Entering edit mode

5.5 years ago

Joe 21k

In practice I’d use a lot of the solutions above, but a ‘safe’ option would be to use a dedicated parser like BioPython. For instance, Istvan’s solution (I think) would edit your headers if they had - in:

from Bio import AlignIO
import sys

aln = AlignIO.read(sys.argv[1], 'fasta')
for s in aln:
    print('>{}\n{}'.formats.id, str(s.seq).replace('-',''))

ADD COMMENT • link 5.5 years ago by Joe 21k

1

Entering edit mode

To be further safe, onr can use either degap from scikitbio or ungap from Bio.seq module instead of python for replacing the gaps. In addition, there is an extra ) at the end of the code.

from Bio import SeqIO
import sys

for record in SeqIO.parse(sys.argv[1], "fasta"):
    print(">", record.id, "\n", record.seq.ungap("-"), sep="")

output:

$ python3 test.py test.fa 
>seq-1-1
atgc
>seq2
gcatgc
>seq_3
gatgatgatgcctg

input:

$ cat test.fa
>seq-1-1
a-t-g----c
>seq2
--g--c---
atgc
>seq_3
-gatga-tg
----atgc---
-ctg

ADD REPLY • link 5.5 years ago by cpad0112 21k

0

Entering edit mode

Ah that's good. I was sure I'd seen a method for gap removal somewhere before but couldn't find it again earlier.

Incidentally, casting to str() and then running a replace is exactly what ungap does internally ;) but always nice to use the designed methods.

From the source:

1245          if hasattr(self.alphabet, "gap_char"): 
1246              if not gap: 
1247                  gap = self.alphabet.gap_char 
1248              elif gap != self.alphabet.gap_char: 
1249                  raise ValueError( 
1250                      "Gap {0!r} does not match {1!r} from alphabet".format( 
1251                          gap, self.alphabet.gap_char)) 
1252              alpha = Alphabet._ungap(self.alphabet) 
1253          elif not gap: 
1254              raise ValueError("Gap character not given and not defined in " 
1255                               "alphabet") 
1256          else: 
1257              alpha = self.alphabet  # modify! 
1258          if len(gap) != 1 or not isinstance(gap, str): 
1259              raise ValueError("Unexpected gap character, {0!r}".format(gap)) 
1260          return Seq(str(self).replace(gap, ""), alpha)

ADD REPLY • link 5.5 years ago by Joe 21k

1

Entering edit mode

5.5 years ago

Istvan Albert 100k

Delete the - characters then reformat to "nice" fasta with:

cat seq.fa | tr -d -  | seqtk seq -A

ADD COMMENT • link 5.5 years ago by Istvan Albert 100k

score 2 · Accepted Answer · 2018-10-20

2

Entering edit mode

5.5 years ago

Alex Reynolds 35k

$ awk '{ if ($0 !~ /^>/) { gsub(/-/, "", $0); } print $0; }' in.fa > out.fa

ADD COMMENT • link 5.5 years ago by Alex Reynolds 35k

0

Entering edit mode

Error:

$ awk '{ if ($0 !~ /^>/) { gsub(//, '-', $0); } print $0; }' test.fa
awk: cmd. line:1: { if ($0 !~ /^>/) { gsub(//, -, $0); } print $0; }
awk: cmd. line:1:                               ^ syntax error

Modified:

 $ awk '{ if ($0 !~ /^>/) { gsub("-", "",$0); } print $0; }' test.fa 
>seq-1-1
atgc
>seq2
gc
>seq_3
gatgatg

awk version:

 $ awk --version
    GNU Awk 4.1.4, API: 1.1 (GNU MPFR 4.0.1, GNU MP 6.1.2)

input example fasta:

$ cat test.fa 
>seq-1-1
a-t-g----c
>seq2
--g--c---
>seq_3
-gatga-tg