Dimensionality reduction on data sets with variable dimensions
4 months ago
rtrende ▴ 80

My lab has RNA secondary structure data for a number of viral RNA segments, obtained with a method called SHAPE-MaP, which assigns each nucleotide in a sequence a reactivity value. We would like to make a 2D plot in which segments with similar reactivity profiles cluster together. However, each segment has a different number of nucleotides, so we can't figure out how to use standard tools like PCA or t-SNE: each segment's profile has a different number of dimensions. Is there a way to perform dimensionality reduction on a data set where each point has a different number of dimensions?

PCA dimensionality_reduction RNA_secondary_structure • 484 views
4 months ago
Mensur Dlakic ★ 27k

I'm not sure that what you want to do will be informative. That aside, UMAP works with sparse data. Simply pad the shorter sequences with missing values so that every profile is as long as the longest sequence.
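
A minimal sketch of the padding step, using only the Python standard library. The function name `pad_profiles`, the toy reactivity values, and the choice of NaN as the missing-value marker are all assumptions made for illustration; the answer does not specify them.

```python
import math

def pad_profiles(profiles, fill=math.nan):
    """Right-pad each reactivity profile with `fill` up to the longest length."""
    width = max(len(p) for p in profiles)
    padded = [list(p) + [fill] * (width - len(p)) for p in profiles]
    # mask[i][j] is True where position j of profile i is a real measurement
    mask = [[j < len(p) for j in range(width)] for p in profiles]
    return padded, mask

# Toy reactivity values, invented for illustration only.
profiles = [[0.1, 0.8, 0.3], [0.5, 0.2], [0.9, 0.4, 0.7, 0.1]]
padded, mask = pad_profiles(profiles)
```

Every row of `padded` now has the same length, and `mask` records which positions are real. Some dimensionality reduction implementations may reject NaN input, so the padded positions might need to be imputed or otherwise handled before fitting; that depends on the tool and is not covered here.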


I see what you mean: this method isn’t a perfect way to compare segments. But I’ve had a very hard time finding other similarity metrics to use. Any idea what might be a better way to quantify or visualize how similar our data sets are?


You can try DNA/RNA language models and extract embeddings for your sequences. Those should all be of the same length and can presumably be used for dimensionality reduction.
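
One way the embeddings could end up the same length is sketched below. It assumes the model returns one vector per nucleotide (so longer sequences give more vectors) and that these are averaged into a single fixed-length vector; mean pooling is a common choice but it is an assumption here, not something the reply specifies. The function `mean_pool` and the stand-in numbers are hypothetical, and no particular language model is implied.

```python
def mean_pool(token_vectors):
    """Average per-nucleotide vectors (L x d) into one length-d vector."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    dim = len(token_vectors[0])
    if any(len(v) != dim for v in token_vectors):
        raise ValueError("all token vectors must have the same dimension")
    n = len(token_vectors)
    return [sum(v[k] for v in token_vectors) / n for k in range(dim)]

# Stand-in embeddings with d = 2; a real model would supply these.
short_seq = [[1.0, 2.0], [3.0, 4.0]]             # 2 nucleotides
long_seq = [[0.0, 3.0], [3.0, 3.0], [6.0, 0.0]]  # 3 nucleotides
pooled = [mean_pool(short_seq), mean_pool(long_seq)]
```

Both entries of `pooled` have length 2 regardless of sequence length, so they can be stacked into an ordinary matrix for PCA, t-SNE, or UMAP.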
