What is the most suitable file format for storing sequence alignments and associated annotations?
10
2
Entering edit mode
5.6 years ago
Joseph Hughes ★ 2.9k

There are many file formats for storing sequence alignments. I am trying to figure out which is the most frequently used file format for storing an alignment and associated annotations such as features, taxa sets and character sets. Imagine having an alignment of multiple genomes from two different locations, each genome having multiple genes and features specific to certain genomes. Which alignment file format would be best to store all this information and to be sure that a reasonable number of software programs would be able to interpret the information?m

I have listed as responses all the different file formats I could think of that store sequence alignments. Please vote for the ones you use the most and find the most useful. Please add any additional alignment file formats I haven't thought of.

PS: We really need a polling option in biostar because it occurs to me that people are probably not going to vote for any of these options.

PPS: Anybody have any ideas on how I can get feedback from the bioinformatics community on this question?

nexus fasta phylip pir msf • 4.4k views
2
Entering edit mode

Well, technically, each "answer" is an answer to your poll, and users can add their own answers, and they can comment on an answer, so in many ways it's superior to a standard website poll :P

As for getting feedback on the bioinformatics community, asking questions in forums like this is the only way. That and IRC. There is no directory of bioinformaticians like there are for plumbers, so you can't do a proper randomised sample. In fact many people disagree what a bioinformatician even is. So I think this is as good as you're going to get :)

Finally, i should say that looking for the most popular of anything in Bioinformatics is generally a bad idea. You should use what is most suitable. If you want to support whatever your users are using, first build the tool then listen to the feedback. Otherwise you're going to have to add another 10+ formats to this list, such as FASTQ (and the far superior uQ which is superior because i wrote it and i'm totally biased). Also CRAM, BAM, THANKYOUMAM, and whatever else is the publication du jour.

3
Entering edit mode
5.6 years ago

SAM/BAM is probably the most common format for storing alignments (at least with NGS data).

2
Entering edit mode
5.6 years ago
Joseph Hughes ★ 2.9k

FASTA is a great file format that can store an alignment. The sequence is sometimes wrapped over multiple lines and overtimes not. It is sometimes called FASTA align or A2M in emboss as explained by @ddiez in the comments below.

The following phylogenetic programs accept FASTA as input although they expect PHYLIP: PhyML and RaxML. Most alignment programs accept FASTA as input: Jalview, ClustalX, Clustalw, t-coffee, Clustalo, MAFFT.

It looks like this when wrapped over multiple lines

>seqid1
ATGCTGATCGTAGCTGACTG
AAAAATGGGTTGGTTGGTG
>seqid2
ATG--GATCGTAGCTGACTT
AAAAATGGGTTGGTTGGTG
>seqid3
ATG--GATCGTAGCTGACTT
AA---TGGGTTGGTTGGTG


Or on a single line which makes more sense

>seqid1
ATGCTGATCGTAGCTGACTGAAAAATGGGTTGGTTGGTG
>seqid2
ATG--GATCGTAGCTGACTTAAAAATGGGTTGGTTGGTG
>seqid3
ATG--GATCGTAGCTGACTTAA---TGGGTTGGTTGGTG

0
Entering edit mode

????

FASTA format is designed to store sequence strings, not alignments.

3
Entering edit mode

One can have pair-wise or multiple sequence alignments in aligned FASTA format. They used to be very common in past.

1
Entering edit mode

I'm aware of this format but it's not FASTA, rather a non-standard variant. It requires that all sequences be the same length (which is not a requirement of FASTA), it tolerates any non-alphanumeric character in unmatched positions (so their meanings have to be defined independently), and it doesn't even adhere to the FASTA format for gaps (the link uses '.' instead of '-').

Only the '>identifier' line seems to be the same ;-).

2
Entering edit mode

I use to call it FASTA align- I think it is referred like that in the documentation of some bioinformatics tool I used but cannot find which one it is. In some other places it seems to be called A2M, for example in emboss. For me this format feels like a natural extension of the FASTA format: simple and easy to parse. Oh well, unless people include complicated annotations in the header without following consistent rules...

2
Entering edit mode
5.6 years ago
Joseph Hughes ★ 2.9k

The JSON format which can be very rich. I am not sure as to whether there is a standard JSON format for sequence alignments. The following has been exported from Jalview using the BioJSON file format. All the same annotation as the above NEXUS format are included. Presumably it would also be possible to associated other information such as a newick tree. It would be interesting to find out from the community which programs use this as input/output and how frequently it is used.

    {
"seqs":${ "name":"seqid1/1-39", "start":1, "end":39, "id":"367323668", "seq":"ATGCTGATCGTAGCTGACTGAAAAATGGGTTGGTTGGTG", "order":1 }, { "name":"seqid2/1-37", "start":1, "end":37, "id":"931822522", "seq":"ATG--GATCGTAGCTGACTTAAAAATGGGTTGGTTGGTG", "order":2 }, { "name":"seqid3/1-34", "start":1, "end":34, "id":"636867707", "seq":"ATG--GATCGTAGCTGACTTAA---TGGGTTGGTTGGTG", "order":3 }$,
"appSettings":{
"globalColorScheme":"None",
"webStartUrl":"http://www.jalview.org/services/launchApp",
"application":"Jalview",
"showSeqFeatures":"true",
"version":"2.10.0"
},
"seqGroups":${ "displayText":true, "startRes":0, "sequenceRefs":\[ "367323668", "931822522"$,
"groupName":"GroupAAA",
"endRes":38,
"colourText":false,
"description":"Group with homopolymer insert",
"showNonconserved":false,
"displayBoxes":true,
"colourScheme":"None"
}
\],
"alignAnnotation":,
"svid":"1.0",
"seqFeatures":${ "fillColor":"#8c25cd", "score":0, "sequenceRef":"367323668", "featureGroup":"homopolymer", "description":"nucleotide repeat", "xStart":20, "xEnd":25, "type":"Homopolymer region" }, { "fillColor":"#8c25cd", "score":0, "sequenceRef":"931822522", "featureGroup":"homopolymer", "description":"nucleotide repeat", "xStart":20, "xEnd":25, "type":"Homopolymer regien" }, { "fillColor":"#8c25cd", "score":0, "sequenceRef":"636867707", "featureGroup":"homopolymer", "description":"nucleotide repeat", "xStart":20, "xEnd":22, "type":"Homopolymer regien" }$
}

0
Entering edit mode

I think this is a really good format although not very nice on the eye, it enables extensive annotations and is easy to parse.

1
Entering edit mode
5.6 years ago
Joseph Hughes ★ 2.9k

The NEXUS file format is much richer. It has many different blocks but the most important is the MATRIX block. The example block below shows how different sequences can be grouped together in is taxa sets (taxset) and different sites can be grouped in character sets (charset). This for example would allow for the annotation of different genes in an alignment thus providing different partitions which could be analysed using different models in a phylogenetic reconstruction.

The following programs take NEXUS as an input: BEAST, Mesquite, ModelTest, MrBayes, PAUP*, GARLI, GDA, MacClade.

  #nexus

begin data;
dimensions ntax=3 nchar=39;
format datatype=dna missing=? gap=-;
matrix
seqid1   ATGCTGATCGTAGCTGACTGAAAAATGGGTTGGTTGGTG
seqid2   ATG--GATCGTAGCTGACTTAAAAATGGGTTGGTTGGTG
seqid3   ATG--GATCGTAGCTGACTTAA---TGGGTTGGTTGGTG
[----+--10|----+--20|----+--30|----+----]
;
end;

begin sets;
charset homopolymer = 21-25;
charset allsites = 1-39;
charset nohomopolymer = 1-20 26-39;
taxset GroupAAA = seqid1 seqid2;
end;


The NEXUS format also allows for tree blocks, making it easy to keep the alignment and phylogenetic tree together in one file.

1
Entering edit mode

If you have assembled each genome from reads, so that you have one sequence string per genome, and you have performed multiple sequence alignment on all assembled genomes, then the nexus format is best suited for downstream purposes, such as tree inference and annotation correlations. However, from reads to assembly is one problem, from assembly to multiple alignment is another problem, from m.al. to annotations specific to certain genome or common across genomes yet another. I don't believe a single file format captures uncertainty in all of these steps. Nexus is best in terms of being standard as input to many complex phylogenetic programs.

1
Entering edit mode
5.6 years ago
ddiez ★ 1.9k

Possibly not very popular outside its domain but the Stockholm format is the MSA file format used for alignments in the Pfam and Rfam databases and the HMMER tool. This format enables annotation of residues and can include embedded trees. An brief example from the wikipedia article:

# STOCKHOLM 1.0
#=GF ID    UPSK
#=GF SE    Predicted; Infernal
#=GF SS    Published; PMID 9223489
#=GF RN    [1]
#=GF RM    9223489
#=GF RT    The role of the pseudoknot at the 3' end of turnip yellow mosaic
#=GF RT    virus RNA in minus-strand synthesis by the viral RNA-dependent RNA
#=GF RT    polymerase.
#=GF RA    Deiman BA, Kortlever RM, Pleij CW;
#=GF RL    J Virol 1997;71:5990-5996.

AF035635.1/619-641             UGAGUUCUCGAUCUCUAAAAUCG
M24804.1/82-104                UGAGUUCUCUAUCUCUAAAAUCG
J04373.1/6212-6234             UAAGUUCUCGAUCUUUAAAAUCG
M24803.1/1-23                  UAAGUUCUCGAUCUCUAAAAUCG
#=GC SS_cons                   .AAA....<<<<aaa....>>>>
//

0
Entering edit mode

Thanks for your contribution. How would you embed a tree in the Stockholm format? Do you know whether you can annotate groups of sequences?

1
Entering edit mode

According to the wikipedia page (which is referred in the Pfam/Rfam web site, so I guess it is somewhat trustable), this are the fields used for embedding trees (to be included in #=GF areas, which allow free text):

For embedding trees:
----------------
NH  New Hampshire                A tree in New Hampshire eXtended format.
TN  Tree ID                      A unique identifier for the next tree.


So, I guess the tree is just included in the NH line. Something like #=GC NH ((A,B),C);

1
Entering edit mode

For the group annotations the following description is in the wikipedia page:

#=GF <feature> <Generic per-File annotation, free text>
#=GC <feature> <Generic per-Column annotation, exactly 1 char per column>
#=GS <seqname> <feature> <Generic per-Sequence annotation, free text>
#=GR <seqname> <feature> <Generic per-Residue annotation, exactly 1 char per residue>


There are per-file, per-column, per-sequence and per-residue annotations. So I guess the answer is no, you need to include the annotation line for each sequence (using GS/GR). But I am not 100% sure about this.

0
Entering edit mode

To be honest, I am not completely sure about how with the trees. Just show it in the wikipedia page and remembered reading about it in the documentation. But it has been a long time since the last time I used this format. For groups of sequences you mean giving an annotation line for a group of sequences rather than including the same line for each sequence individually?

0
Entering edit mode

I mean setting taxa set like in the NEXUS file format.

0
Entering edit mode
5.6 years ago
Joseph Hughes ★ 2.9k

The NBRF and PIR format, more detail about the format can be found here: http://www.bioinformatics.nl/tools/crab_pir.html. I think there is probably other information that can be encoded in PIR.

 >N1;seqid1
seqid1 39 bases
ATGCTGATCGTAGCTGACTGAAAAATGGGTTGGTTGGTG*
>N1;seqid2
seqid2 37 bases
ATG--GATCGTAGCTGACTTAAAAATGGGTTGGTTGGTG*
>N1;seqid3
seqid3 34 bases
ATG--GATCGTAGCTGACTTAA---TGGGTTGGTTGGTG*

0
Entering edit mode
5.6 years ago
igor 12k

I don't think there are popular file format for storing sequence alignments and associated annotations because those are very different data types and should be separate.

For example, let's say you are aligning reads. Lots of reads will align to a single gene. If each of those reads has annotation info, that would be wasteful. You can have another file (BED or GTF, for example) that specifies which regions correspond to a particular gene. That structure additionally allows for easy update of annotations without modifying the alignment file.

0
Entering edit mode

I generally agree with your point, but I would argue that aligned SAM/BAM contains some annotation (chromosome, start, and stop positions). It could even be argued that the CIGAR string is annotation, in the sense that it reflects the degree of similarity to the reference. And RNA-Seq data that's aligned to the transcriptome contains gene information as well. So the answer depends on the user's annotation requirements.

0
Entering edit mode

Sure, you could think of reads alignment info as a type of annotation. For me, annotation is equivalent to genome browser tracks. For a single reference sequence, you can have infinite annotation types. The reads are just another layer. If each layer is a separate file, you can easily mix and match them. You can also easily add and remove them. If everything is in one file, that process becomes much more difficult. Also, the file becomes extremely large.

0
Entering edit mode

I believe the Stockholm file format (see my answer) was developed precisely to be able to include sequence annotations together with multiple sequence alignments. But it was designed for MSA of protein families in the Pfam database, which is a different problem than the case of read alignments you mention (and in which case I agree with your answer).

0
Entering edit mode
5.6 years ago
Joseph Hughes ★ 2.9k

I think that the .aln format from CLUSTAL is probably is very frequently used. It is an output format for many multiple sequence aligners such as CLUSTALW, CLUSTALX, CLUSTALO, MAFFT, t-Coffee and probably many more. It looks like this but I believe it can also be found wrapped over multiple lines (interleaved).

CLUSTAL

seqid1/1-39     ATGCTGATCGTAGCTGACTGAAAAATGGGTTGGTTGGTG
seqid2/1-37     ATG--GATCGTAGCTGACTTAAAAATGGGTTGGTTGGTG
seqid3/1-34     ATG--GATCGTAGCTGACTTAA---TGGGTTGGTTGGTG

0
Entering edit mode
5.6 years ago
Joseph Hughes ★ 2.9k

The bcl file format where the sequences are vertical. Designed for easy concatenation and subsetting the alignment from the command line. I don't think this allows for the annotations of features or annotation of groups of sequences. I don't believe many programs use import/export this format type but I am trying to be as complete as possible.

>seqid1
>seqid2
>seqid3
* iteration 1
AAA
TTT
GGG
C--
T--
GGG
AAA
TTT
CCC
GGG
TTT
AAA
GGG
CCC
TTT
GGG
AAA
CCC
TTT
GTT
AAA
AAA
AA-
AA-
AA-
TTT
GGG
GGG
GGG
TTT
TTT
GGG
GGG
TTT
TTT
GGG
GGG
TTT
GGG
*

0
Entering edit mode
5.6 years ago
Joseph Hughes ★ 2.9k

The PHYLIP file format is probably quite popular. It is the expected input for PhyML, RAxML and obviously the suit of PHYLIP tools. Sometimes this format can be interleaved (wrapped over multiple lines). In my experience I have always had lots of problems with compatibility between programs with this file format as there are many different flavours of it that are not always catered for.

3 39
seqid1    ATGCTGATCGTAGCTGACTGAAAAATGGGTTGGTTGGTG
seqid2    ATG--GATCGTAGCTGACTTAAAAATGGGTTGGTTGGTG
seqid3    ATG--GATCGTAGCTGACTTAA---TGGGTTGGTTGGTG