Question

Clustal Processing Massive Dataset

1

Entering edit mode

2.2 years ago

Laura ▴ 50

Hello wonderful beings of bioinformatics!

I'm new to this world and could use some help.

My job is to run multiple sequence alignment on a large dataset. I am looking into the L1 family of genes and wanting to compare 7,525 elements of full length sequences. Each sequence is ~6,000 base pairs in length. I was hoping to create a giant visual output of 7,525 x 6,000. I have all of the sequences in a fasta file, but it seems to be too large to use in any of the multiple sequence alignment platforms. I've tried using mafft, clustal omega, and clustalw so far.

My complete fasta file is ~46,000kb. Is it really too big for processing?

I figured the easy solution was to split up the data, so I divided my 1 fasta file into 10 fasta files, all under 5000kb each (the maximum on clustalw). This produces an output, but it is not what I expected. I want to get an old-school looking output like:

chr10_35478-41492          --------AAATGGAATAGGAACAGCTCTGGTCTACA-GCTCCCAGCGTG
chr10_1221417-1227555      TTTATTATTATTATTATACTTTAAGTTTTAGGGTACATGTGCACAACGTG
                                    * *   ***     ** * * *  **** *  * ** ****

(except much larger)

I can generate this output when I copy/paste a handful of samples, but when I use a bigger fasta file, there is no old-school visual output for comparison. I get this:

 CLUSTAL 2.1 Multiple Sequence Alignments


Sequence type explicitly set to DNA
Sequence format is Pearson
Sequence 1: chr10_100406331-100412500           6169 bp
Sequence 2: chr10_100454491-100460565           6074 bp
Sequence 3: chr10_10069375-10075458             6083 bp

etc.

I've looked through the arguments and options, but am not understanding what it is I need to do to get the output I want.

The slogan for clustal omega got me excited: "the last alignment program you'll ever need", but the file size limit here is even less than clustalw. I'm confused, because it says it can "allow hundreds of thousands of sequences to be aligned in only a few hours". Am I using it wrong?

I think the most reasonable answer is probably to subset my data even further. However, then I'll have the problem of trying to merge it all back together to compare.

Overall, I am hoping to find out if 1) what I am doing wrong and stupid, 2) what parameters I need to generate the old-school clustal "plots", 3) what program I should be using, if I am using one that isn't a good fit for my purposes, 4) any suggestions on how to visualize this many regions of this length. I have spent almost 25 hours on this already, and probably should have asked for help sooner!

THANK YOU <3

clustal multiple sequence alignment clustalomega mafft • 2.9k views

ADD COMMENT • link 2.2 years ago by Laura ▴ 50

score 1 · Answer 1 · 2022-02-17

1

Entering edit mode

2.2 years ago

M__ ▴ 200

No 46000 is not a lot of sequences and there is nothing wrong with aligning the whole lot. Furthermore, 6000bp is not large.

Your idea is correct, but you are inferring a file size of 4600 sequences at 6000bp. The limitation is either the multiprocessing or the computer, e.g. low RAM. You shouldn't need to shrink the data sets so much. How many cores do you have on your machine?

I used the Clustal family for sometime but switched for the sort of issues you are encountering. It is possible Clustal O is part an issue, there are better (below).

Anyway, simply google FAMSA. You need to keep in mind that both speed and accuracy are criteria.

My personal advice is align under MAFFT or FAMSA and switch the formats to Clustal. You can do this via coding (Biopython) or simply load and save in Clustal to get the correct format.

Clustal Omega was upgraded in 2018 and that publication is worth reading, it was fairly astonishing that not until that point the clustalo implement specific aligning for DNA, previously it was shoe-horned from proteins. They also euloguise the developments of Muscle for alignment accuracy which they implemented.

MAFFT is the aligner that is generally recognised for both speed and accuracy now, if you look at the FAMSA and Clustal Omega publication of their benchmarks you'll find that MAFFT performs very well.

FAMSA is built for speed, it is however for coded for proteins and DNA is shoe-horned. FAMSA is fast and multiprocesses very well. It claims to retain accuracy over large alignments, whereas its benchmarks show other alignment algorithms including MAFFT and Clustal O drop in performance accuracy.

I would say its down to two alignment programs. There is still room for improvement however, I code around this area at present.

If you are sticking with Clustal O, then its maximising the margins, thus multiprocessing is essential using a machine with maximum cores. Its certainly doable, I would look at increasing the number of sequences per file however, like 2 files at 6000bp. Minimising 'redundancy' depends on what your question is, because the frequency of a given allele might be part of the investigation

ADD COMMENT • link 2.2 years ago by M__ ▴ 200

1

Entering edit mode

Thank you much for your help!

I'm happy to hear my dataset isn't big enough to break the programs! I really appreciate the logical questioning about working with the data the way I have it. My boss is interested in looking at really subtle changes, which I imagine is why the request included so much redundancy. We'll see how far I get with this, I guess!

I was successful (I think?) using clustalo from the terminal. The GUIs all gave me size limit problems. I used:

clustalo --in=merged.fa --out=mergedout.aln --force --outfmt=clustal --threads=6 --verbose

resutling in

Using 6 threads
Read 7525 sequences (type: DNA) from merged.fa
Using 165 seeds (chosen with constant stride from length sorted seqs) for mBed (from a total of 7525 sequences)
Calculating pairwise ktuple-distances...
Ktuple-distance calculation progress done. CPU time: 14863.37u 14.81s 04:07:58.18 Elapsed: 00:46:20
mBed created 117 cluster/s (with a minimum of 1 and a soft maximum of 100 sequences each)
Distance calculation within sub-clusters done. CPU time: 3274.19u 3.45s 00:54:37.64 Elapsed: 00:10:29
Guide-tree computation (mBed) done.
Progressive alignment progress done. CPU time: 147495.51u 240.46s 41:02:15.97 Elapsed: 12:48:02
Alignment written to mergedout.aln

and got a giant output that looks mostly like:

CLUSTAL O(1.2.4) multiple sequence alignment


chr10:100406331-100412500                  ------------------------------------------------------------
chr10:100454491-100460565                  ------------------------------------------------------------
chr10:10069375-10075458                    ------------------------------------------------------------

About 4 million pages later (I kid, somewhat) I can see some of the dna:

chr1:221383065-221389208                   ------------------------------------------------------------
chr1:221396164-221403863                   tggtagaatacaaagctcc------actgattgctt-----------------tccctcc
chr1:221503243-221509958                   ------------------------------------------------------------

but I am unsure how to interpret all of this. I have read that "*" means that the residues or nucleotides are identical in all sequences in the alignment, ":" means that conserved substitutions are observed, and "." means that semi-conserved substitutions have been observed.

but I have dashes! Does that mean these are all gaps? Is it something about my inputs? Does it have something to do with the output:

mBed created 117 cluster/s (with a minimum of 1 and a soft maximum of 100 sequences each)

?

Any suggestions or directions are so appreciated! Thank you so much.

ADD REPLY • link 2.2 years ago by Laura ▴ 50

1

Entering edit mode

The '-' are gaps or 'indels' insertion/deletions. That means the alignment has worked. The next thing is to get the alignment inframe so you can translate it - that is certainly what your boss wants. However, this might prove difficult because alignment errors will result in frameshifts.

What you might want to do is supplement this with an amino acid alignment separately, because translating and getting everything inframe isn't always straight-forward.

Thanks for the thanks, but upvotes welcome.

ADD REPLY • link 2.2 years ago by M__ ▴ 200

1

Entering edit mode

Forgive me, I am not sure what "get the alignment inframe" means. What would supplementing with an amino acid alignment look like? Is it similar to what Mensur suggested below about trimming the non-conserved columns? Or does "translating" mean something different? TY!!

ADD REPLY • link 2.2 years ago by Laura ▴ 50

0

Entering edit mode

Yes this is different, it is possible that you are using amino acids already, but you definitely said 'DNA'.

So I mean is using the triplet codon table to convert the sequence to amino acids so you can examine the protein mutations in the gene. It absolutely the next step your supervisor will want.

The issue is you probably have 'junk' given all the gaps at the beginning of the alignment and Mansaur is suggesting a tool to remove these, to clean up the "noisy ends"

ADD REPLY • link 2.2 years ago by M__ ▴ 200

1

Entering edit mode

and got a giant output

Are you saying you didn't believe my math when I told you that your output would be tens of thousands of pages?

Kidding aside, you got exactly the expected output. And yes, that is how it should look like, and how voluminous it should be when you have 7+ thousand sequences with 6 thousand letters each. I honestly don't know what kind of subtle changes one can see by studying this kind of output, but I think you have done the part your boss asked for.

One thing that may help is trimming out the non-conserved columns, which will at least somewhat reduce the output size. I still think it will be beyond what a person can reasonably analyze, but who knows. This trimming can be done using multiple programs, one of which is trimAl.

ADD REPLY • link 2.2 years ago by Mensur Dlakic ★ 27k

0

Entering edit mode

Hahaha, no doubt! Loving your sense of humor.

I am glad to hear that I got the exact expected output! I looked into trimming the non-conserved columns with trimAl, but am getting an error:

./trimal -in /mnt/d/2bit/mergedout.aln -out /mnt/d/2bit/trimalout.aln -automated1

ERROR: The sequence "chr10" (24701) does not have the same number of residues fixed by the alignment (27607).
ERROR: Alignment not loaded: "/mnt/d/2bit/mergedout.aln" Check the file's content.

I searched for the error but didn't find any more information other than what the error already says. Do you have any idea what I might do here? Should this be its own post, perhaps?

ADD REPLY • link 2.2 years ago by Laura ▴ 50

1

Entering edit mode

It is difficult to know for sure without seeing the alignment - and please don't post it here. I am guessing that trimAl is reading only the part of sequence name up to the column, so many of your sequences end up having the same name (chr10, chr1, etc). If my guess is correct, this could be solved by making all sequence names unique.

I suggest you copy your alignment to unique.aln and run the following command:

perl -pi -e 's/chr10\:/chr10_/g' unique.aln

This will convert chr10:100406331-100412500 into chr10_100406331-100412500 and similarly all other chr10: occurrences into chr10_, which I think will make sequence names unique. The thing is that you will need to run the above command for all chromosome numbers, and then maybe the file will work with trimAl. If not, you may need to start a new thread with more details.

ADD REPLY • link 2.2 years ago by Mensur Dlakic ★ 27k

0

Entering edit mode

I ran the perl command for each chromosome and checked that the name changed worked. However, I'm getting a similar error.

./trimal -in /mnt/d/2bit/unique.aln -out /mnt/d/2bit/trimalout.aln -automated1

ERROR: The sequence "chr10_GL383545v1_alt" (15983) does not have the same number of residues fixed by the alignment (87176).

I understand it may be time to ask about this in a new post, but I thought I'd see if you had any suggestions regarding the "number of residues fixed by the alignment."

EDIT: I see that some of these names have additional colons, like chr10_GL383545v1_alt:13600-19716 , so I'll edit the perl line to swap : for _ after the word "alt" - right?

ADD REPLY • link 2.2 years ago by Laura ▴ 50

1

Entering edit mode

so I'll edit the perl line to swap : for _ after the word "alt" - right?

Right.

ADD REPLY • link 2.2 years ago by Mensur Dlakic ★ 27k

0

Entering edit mode

Ok, it worked! Thank you so much for your help.

ADD REPLY • link 2.2 years ago by Laura ▴ 50

1

Entering edit mode

Thank you very much for your help! Turns out I have 6 cores on my PC, and 12 on my laptop (that's probably normal, but it was surprising to me). I have updated this adventure in my post below. I'd love to hear any additional thoughts you have about it. Thanks again!

ADD REPLY • link 2.2 years ago by Laura ▴ 50

0

Entering edit mode

Beyond this you are into quite specialist approaches - you need to discuss your research objectives with your supervisor

ADD REPLY • link 2.2 years ago by M__ ▴ 200

score 1 · Answer 2 · 2022-02-17

ClustalO should be able to handle 7525 x 6000 output, but I would ask myself whether I needed all those sequences. It is unlikely that you have 7525 completely unique L1s, or that you truly want to inspect 7525 L1s even if they were unique. Think about it: how would you visualize an alignment that is 7525 x 6000 characters? Assuming that you have short sequence headers and that there are not very many indels, and let's say 3000 characters per page, that would still be in excess of 15000 pages. Even if you were to convert this into a phylogenetic tree, the output would still be in hundreds of pages.

But don't worry, many people ask about aligning thousands or tens of thousands of sequences that are even longer than yours, without thinking how exactly they are going to look at it. My point is that a smaller number of carefully chosen sequences is usually easier to inspect, and carries the same information. That means clustering at various identity thresholds, such as 99%, 95% or even 90%. I suspect that the latter would cut down your sequence number at least in half, which is probably still too many. CD-HIT will quickly cluster both protein and DNA sequences and help remove the redundancy.

Starting from a redundant dataset, I suggest you run clustalo with all the threads you have, and make sure to specify --seqtype=DNA --infmt=fa --outfmt=clu --threads=8 (use actual number) --verbose in addition to input and output switches. The verbose switch will give you an update as it goes so hopefully you have some idea when it gets close to the end. It will help if you have lots of RAM, many threads and a speedy computer.