MSA and distance matrix in R
2.3 years ago
anasjamshed ▴ 120

I have a FASTA file with 6,660,787 sequences. I want to load it in R, perform an MSA, and then generate a distance matrix. After getting the distance matrix, I want to apply t-SNE.

Is it possible?

tSNE R • 2.4k views
2.3 years ago
Mensur Dlakic ★ 27k

I don't think this can be done in a reasonable amount of time, and that goes for all the steps. So even if this is possible, I would advise against it.

Getting a multiple alignment or distance matrix for 6+ million proteins sounds at least very difficult and time-consuming, even if possible.

To the best of my knowledge, there is no t-SNE implementation at present that can handle a 6+ million x 6+ million matrix. Even if there were one, running it would be very difficult and extremely time-consuming. There are ways to reduce the matrix width, such as truncated SVD, but the result would still have 6+ million data points. UMAP could potentially handle a sparsified matrix, but again, 6+ million data points will be a challenge.
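To put a number on the matrix size, here is a minimal Python sketch that estimates the storage for a dense distance matrix over the 6,660,787 sequences from the question. The 8 bytes per value is an assumption (one 64-bit float per distance); the function names are illustrative only.

```python
# Back-of-the-envelope size of a dense pairwise distance matrix.
N_SEQUENCES = 6_660_787   # sequence count from the original question
BYTES_PER_VALUE = 8       # assumption: one 64-bit float per distance


def dense_matrix_bytes(n, bytes_per_value=BYTES_PER_VALUE):
    """Bytes needed to store a full n x n matrix."""
    return n * n * bytes_per_value


def unique_pairs(n):
    """Number of distinct sequence pairs, n * (n - 1) / 2."""
    return n * (n - 1) // 2


full_tb = dense_matrix_bytes(N_SEQUENCES) / 1e12
pairs = unique_pairs(N_SEQUENCES)

print(f"full matrix: ~{full_tb:.0f} TB")
print(f"pairwise distances to compute: ~{pairs:.2e}")
```

Under these assumptions the full matrix comes to roughly 355 TB, and storing only one triangle would still need about half of that.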

If you are still not sure, my suggestion is to do this on a smaller scale, say for 100,000 proteins. I don't think t-SNE will work with a symmetric matrix of that size either. In case it does, it may give you a clue as to what you are up against.
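One way to draw such a smaller test set without loading the whole file into memory is reservoir sampling. The sketch below is only an illustration in Python: the helper names `iter_fasta` and `reservoir_sample` are made up for this example, and the tiny in-memory `demo` list stands in for a real FASTA file.

```python
import random


def iter_fasta(lines):
    """Yield (header, sequence) tuples from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def reservoir_sample(records, k, seed=0):
    """Keep a uniform random sample of k records in a single pass."""
    rng = random.Random(seed)
    sample = []
    for i, record in enumerate(records):
        if i < k:
            sample.append(record)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = record
    return sample


# Tiny stand-in for a FASTA file; record "b" spans two lines.
demo = [">a", "ACGT", ">b", "GG", "CC", ">c", "TTTT", ">d", "A"]
subset = reservoir_sample(iter_fasta(demo), k=2)
print(subset)
```

With a real file, the same call would take an open file handle, for example `reservoir_sample(iter_fasta(fh), k=100_000)`, where `fh` comes from opening a hypothetical `sequences.fasta`.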


Can I use Google Cloud to do this?


It doesn't seem like you have a clear understanding of this problem's magnitude, and it also doesn't seem like my responses have been sobering enough. I will try one last time with a more illustrative example.

I just did t-SNE on a 200,000 x 136 matrix with 20 CPUs, and that took about 55 minutes. Now, let's pretend that t-SNE is possible for any matrix size (unlikely), and that embedding time scales linearly (it doesn't). I will let you calculate the exact ratio of a 6 million x 6 million matrix versus the one I used, but let's say that it is about a million times larger. That means even if you had 2000 CPUs available (100x more than I used), it would still take you ~10,000x as long with your matrix as it took with mine. That roughly translates into 9200 hours, with all the rosy assumptions I made, and without factoring in the alignment time (also non-trivial for 6+ million proteins). Hopefully this answers your question.
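The estimate above can be reproduced with a few lines of Python. This sketch uses only the figures given in the reply and keeps its optimistic assumptions (linear scaling with matrix size, perfect scaling across CPUs); the variable names are illustrative.

```python
# Reproduce the rough runtime estimate under the stated (optimistic) assumptions.
ref_rows, ref_cols = 200_000, 136   # matrix used in the timed t-SNE run
ref_minutes = 55                    # wall time of that run
ref_cpus = 20

target_rows = target_cols = 6_000_000
target_cpus = 2_000

size_ratio = (target_rows * target_cols) / (ref_rows * ref_cols)
cpu_ratio = target_cpus / ref_cpus  # 100x more CPUs

# Using the exact size ratio.
hours_exact = ref_minutes * size_ratio / cpu_ratio / 60
# Using the rounded "about a million times larger" figure from the reply.
hours_rounded = ref_minutes * 1_000_000 / cpu_ratio / 60

print(f"size ratio: {size_ratio:,.0f}")
print(f"hours (exact ratio): {hours_exact:,.0f}")
print(f"hours (ratio rounded to 1e6): {hours_rounded:,.0f}")
```

The exact ratio is about 1.3 million, which pushes the estimate to roughly 12,100 hours; rounding the ratio down to a million gives the ~9200 hours quoted above.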

To be extra clear, even though I have already said this in my response to your initial query about this topic: even if you could surmount all the time and resources issues, there is absolutely no guarantee that t-SNE with 6+ million data points would give a biologically meaningful embedding. In fact, I strongly suspect that the embedding would be continuous, with many clusters not clearly defined.


Thanks, but now I am trying to use just 38 sequences: code

But after running the alignment function, it got stuck. I have a Core i7 processor with 8 GB of RAM. How long should it take?


It appears that you have DNA sequences, and that they are prokaryotic genome-sized. Assuming that it can even be done, it would take you approximately 234 lifetimes to align 6+ million genome-sized DNA sequences, let alone to do t-SNE afterwards. Is it too much to ask to provide some of these details when asking questions, so I don't waste my time explaining something that absolutely can't be done?

Embedding a 38 x 38 matrix probably takes a couple of seconds. Aligning 38 genome-sized sequences takes a lot longer. This is not the way to do t-SNE of genomic DNA sequences, but I am really in no position to explain this procedure to you step by step. Below are some links that describe the proper way of doing it. You may need to find equivalent R programs on your own.
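The links are not preserved in this copy of the thread, so the following is not necessarily what they describe. As an assumption, one common alignment-free route is to summarize each genome as a vector of k-mer frequencies and embed those vectors instead of an alignment-derived distance matrix. A minimal Python sketch with toy sequences (the names `kmer_profile` and `euclidean`, the choice of k = 4, and the `genomes` dictionary are all illustrative):

```python
from collections import Counter
from itertools import product
import math


def kmer_profile(seq, k=4):
    """Return normalized k-mer frequencies over the ACGT alphabet."""
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Only k-mers made of A, C, G, T are counted; others (e.g. with N) are ignored.
    total = sum(counts[km] for km in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[km] / total for km in kmers]


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Toy stand-ins for genomes; real input would be one long sequence per genome.
genomes = {
    "g1": "ACGTACGTACGTAAAC",
    "g2": "ACGTACGTACGTAAAT",
    "g3": "GGGGCCCCGGGGCCCC",
}

profiles = {name: kmer_profile(seq) for name, seq in genomes.items()}
names = sorted(profiles)
dist = [[euclidean(profiles[a], profiles[b]) for b in names] for a in names]

for name, row in zip(names, dist):
    print(name, [round(d, 3) for d in row])
```

For 38 genomes this gives a 38 x 256 profile matrix, which is small enough to pass to any t-SNE implementation directly. With so few points, the perplexity would have to be set well below the number of genomes.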


What do you mean by prokaryotic genome-sized? These are only 38 genomes.


I am trying to align 38 sequences, but an hour has passed and nothing has happened.
