Software for Unaligning Sequences
4
0
Entering edit mode
5.5 years ago

The dataset I have consists of aligned sequences in fasta format. However, I am developing a certain multiple sequence alignment method, for which I need to run my program on unaligned sequences. Is there any software for converting from aligned fasta sequences to unaligned sequences in fasta format?

alignment • 2.9k views
ADD COMMENT
0
Entering edit mode

please post an example of what you mean "aligned sequences". FASTA as a format is not an alignment format, hence it is not clear what you are looking for

ADD REPLY
0
Entering edit mode

This is part of what my aligned fasta file contains:

>>
> SDEA
ATT-------AG-------------------------GAGGGT-----------------
-----ACCATTGT--------------ACCGGTTGGT-T---------------------
---------ATCAAA---------------------TCG------CCG------------
-GA-----------------------------------------AACCTA----T-----
------------------GTGC----------CCG-------CAGCG-------------
----TGATTTC----A-----------------------CGATGGT------C-------
---------------------CC---T--------------------ATGTGGGC-----
--CA---------------------------A----TCCCGA---------------CAA
AGGCCGAAAA------A-ATT-GAC-G---AC---------TATGT--------------
---------------GAG-----GTGC--------------------GTC-----AA---
-----------------C------------TGT------------------A--------
-----TG------CCTCAAAGTAGCC-----A---------------TGAGTT-------
------------------------CC----------TT----CGCAGGC-----------
--------------------------TTCCCCCCATAGGGGTTT--GTA--TTAC-----
-----TGAA------GATCT----CTAT----AC-------AA-----------------
--GGGAAGTTTTC-----------------------------------------------
G--GT-------------AAGG---CG----------TC------------------CTG
C------------------------------------------G-A-T----------G-
--G------G--------------------CG--------------------------CC
GATTGA--------TGCT------------------------------------------
-------CATG---------------------------------T---------------
---T--------------------------------------AGCAGCGG---------A
--------AATTG--TCTCCG-----------------CGC--------------GTC--
---------A------GAGTGAAAACTCCTC------------CA---------------
------CTATT-----------------------------AGCCCCA-------------
--------------------------AGGCGGT-------AG---A--------------
------TGT-------------------------C-----GATG-------TGG------
---------------------------------------------------------GT-
------------TAAAT-------------------------------------------
--------------TGA---C-CG-----AA-----------------------------
----------GC
ADD REPLY
1
Entering edit mode

Run it through awk and delete - gap characters from the sequence string.

ADD REPLY
2
Entering edit mode
5.5 years ago
$ awk '{ if ($0 !~ /^>/) { gsub(/-/, "", $0); } print $0; }' in.fa > out.fa
ADD COMMENT
0
Entering edit mode

Error:

$ awk '{ if ($0 !~ /^>/) { gsub(//, '-', $0); } print $0; }' test.fa
awk: cmd. line:1: { if ($0 !~ /^>/) { gsub(//, -, $0); } print $0; }
awk: cmd. line:1:                               ^ syntax error

Modified:

 $ awk '{ if ($0 !~ /^>/) { gsub("-", "",$0); } print $0; }' test.fa 
>seq-1-1
atgc
>seq2
gc
>seq_3
gatgatg

awk version:

 $ awk --version
    GNU Awk 4.1.4, API: 1.1 (GNU MPFR 4.0.1, GNU MP 6.1.2)

input example fasta:

$ cat test.fa 
>seq-1-1
a-t-g----c
>seq2
--g--c---
>seq_3
-gatga-tg
ADD REPLY
0
Entering edit mode

Thanks, fixed the order of substitution.

ADD REPLY
4
Entering edit mode
5.5 years ago
$ cat test.fa 
>seq-1-1
a-t-g----c
>seq2
--g--c---
>seq_3
-gatga-tg

$ sed '/>/! s/-//g' test.fa 
>seq-1-1
atgc
>seq2
gc
>seq_3
gatgatg
ADD COMMENT
2
Entering edit mode
5.5 years ago
Joe 21k

In practice I’d use a lot of the solutions above, but a ‘safe’ option would be to use a dedicated parser like BioPython. For instance, Istvan’s solution (I think) would edit your headers if they had - in:

from Bio import AlignIO
import sys

aln = AlignIO.read(sys.argv[1], 'fasta')
for s in aln:
    print('>{}\n{}'.formats.id, str(s.seq).replace('-',''))
ADD COMMENT
1
Entering edit mode

To be further safe, onr can use either degap from scikitbio or ungap from Bio.seq module instead of python for replacing the gaps. In addition, there is an extra ) at the end of the code.

from Bio import SeqIO
import sys

for record in SeqIO.parse(sys.argv[1], "fasta"):
    print(">", record.id, "\n", record.seq.ungap("-"), sep="")

output:

$ python3 test.py test.fa 
>seq-1-1
atgc
>seq2
gcatgc
>seq_3
gatgatgatgcctg

input:

$ cat test.fa
>seq-1-1
a-t-g----c
>seq2
--g--c---
atgc
>seq_3
-gatga-tg
----atgc---
-ctg
ADD REPLY
0
Entering edit mode

Ah that's good. I was sure I'd seen a method for gap removal somewhere before but couldn't find it again earlier.

Incidentally, casting to str() and then running a replace is exactly what ungap does internally ;) but always nice to use the designed methods.

From the source:

1245          if hasattr(self.alphabet, "gap_char"): 
1246              if not gap: 
1247                  gap = self.alphabet.gap_char 
1248              elif gap != self.alphabet.gap_char: 
1249                  raise ValueError( 
1250                      "Gap {0!r} does not match {1!r} from alphabet".format( 
1251                          gap, self.alphabet.gap_char)) 
1252              alpha = Alphabet._ungap(self.alphabet) 
1253          elif not gap: 
1254              raise ValueError("Gap character not given and not defined in " 
1255                               "alphabet") 
1256          else: 
1257              alpha = self.alphabet  # modify! 
1258          if len(gap) != 1 or not isinstance(gap, str): 
1259              raise ValueError("Unexpected gap character, {0!r}".format(gap)) 
1260          return Seq(str(self).replace(gap, ""), alpha)
ADD REPLY
1
Entering edit mode
5.5 years ago

Delete the - characters then reformat to "nice" fasta with:

cat seq.fa | tr -d -  | seqtk seq -A
ADD COMMENT

Login before adding your answer.

Traffic: 1411 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6