Why Is Next-Generation Sequencing Data High-Dimensional (Or Ultra-High-Dimensional)?
3
1
Entering edit mode
8.7 years ago
lavya3 ▴ 40

Hi all,

I'm Lavya. I'm working on dimensionality reduction techniques.

I have previously worked on computer vision problems, but I am willing to explore bioinformatics data. In vision, the number of dimensions can stretch up to about one million, because a typical 1024×1024 image can be reshaped into a single 10^6-element vector, so each image is represented by a single point in a 10^6-dimensional space.

I cannot carry the same analogy over to NGS data.

1. Why do we have high dimensionality? It would help me understand if you could give a mathematical illustration like the one I gave for computer vision. For simplicity, let's assume the number of short reads is 50 million.

2. How can each read be represented in a high-dimensional vector space?

next-gen ngs • 3.5k views
1
8.7 years ago

The number of reads doesn't have anything to do with the dimensionality of the data. The reads only represent the measurements which have been made. In order to analyze quantitative sequencing data (be it from DNA, like ChIP-Seq, MethylCap-Seq, etc., or from RNA-Seq), you first need to define the variables you have measured. For RNA-Seq this is relatively simple, as the transcriptome of the organism (i.e. all known sequences of the genome which are transcribed to RNA) can be well defined. Each transcript can therefore represent a variable. For the human transcriptome there are about 32,000 "curated" transcripts in the RefSeq database. If your 50 million reads come from an RNA-Seq experiment, you would first have to determine from which transcript each and every one of these reads originates; only then will you be able to estimate your expression levels.
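To make the distinction concrete, here is a toy sketch (with made-up transcript IDs and read assignments, not real data): the reads are only measurements, and the dimensionality comes from the number of transcript variables, not the number of reads.

```python
from collections import Counter

# Toy stand-ins for RefSeq transcript IDs; in reality there would be
# ~32,000 of these, so each sample becomes a point in ~32,000 dimensions.
transcripts = ["NM_000014", "NM_000015", "NM_000016"]

# Hypothetical output of the read-assignment step: the transcript of
# origin for each aligned read. Four reads here; 50 million in practice.
read_assignments = ["NM_000014", "NM_000016", "NM_000014", "NM_000014"]

# The expression vector has one entry per transcript (variable),
# regardless of how many reads were sequenced.
counts = Counter(read_assignments)
expression_vector = [counts[t] for t in transcripts]
print(expression_vector)  # → [3, 0, 1]
```

Note that adding more reads only changes the values in the vector, never its length.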

For quantitative DNA-Seq methods this problem is slightly more complex, as the human genome in total is perhaps 40x the size of the transcriptome (exons/introns/ncRNAs/etc.) and there is no strict way to organize it into discrete regions/variables. One approach is, e.g., to arbitrarily partition the genome into windows of a given size (say 500 bp), use a sliding window (with, say, 250 bp steps), and count the occurrences of aligned qDNA-Seq reads in each window. With settings like this, you would end up with more than 12 million variables for the human genome...
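A minimal sketch of that sliding-window scheme, using a toy chromosome length and made-up read start positions (real code would work per chromosome from an alignment file):

```python
# 500 bp windows, advanced in 250 bp steps, as described above.
chrom_length = 2000   # toy value; chr1 alone is ~249 Mb
window_size, step = 500, 250

# Hypothetical alignment start positions of reads on this chromosome.
read_starts = [10, 60, 300, 700, 710, 1800]

# Each window is one variable; its value is the read count inside it.
windows = []
for start in range(0, chrom_length - window_size + 1, step):
    end = start + window_size
    count = sum(1 for pos in read_starts if start <= pos < end)
    windows.append((start, end, count))

for start, end, count in windows:
    print(f"{start:>5}-{end:<5} {count} reads")
```

With a ~3.1 Gb genome and a 250 bp step, this yields roughly 3.1e9 / 250 ≈ 12 million windows, which is where the "more than 12 million variables" figure above comes from.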

1
8.7 years ago

You are correct. Statements about dimensionality make sense in the context of vector spaces, where dimensionality can be defined as the maximum number of linearly independent vectors. Therefore, to make any statement about the dimensionality of a set of strings, which sequencing data ultimately are (NGS or not), they need to be transformed into a vector space, possibly like in this paper. If I understood that correctly, they select n prototypes and map each string to the vector space by computing the edit distance to each prototype, yielding a real vector of n elements.

Formally, if we denote a set of strings (over an alphabet A) by X ⊆ A* and a set of prototypes by P = {p_1, ..., p_n} ⊆ X, the transformation t_P : X → R^n is defined as a (not necessarily injective) function, where t_P(x) = (d(x, p_1), ..., d(x, p_n)) and d(x, p_i) is the edit distance between the strings x and p_i. Obviously, the dimension of the vector space equals the number of prototypes.
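A small sketch of that prototype embedding t_P, with toy DNA prototypes of my own choosing (the paper's actual prototype-selection strategy is not reproduced here):

```python
def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance d(a, b) via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (ca != cb)))  # substitution / match
        prev = curr
    return prev[-1]

def embed(x: str, prototypes: tuple) -> tuple:
    """t_P(x) = (d(x, p_1), ..., d(x, p_n)); the dimension equals len(prototypes)."""
    return tuple(edit_distance(x, p) for p in prototypes)

prototypes = ("ACGT", "TTTT")  # toy prototypes p_1, p_2
print(embed("ACGT", prototypes))  # → (0, 3)
```

Every string, whatever its length, lands in the same R^n, so the number of prototypes n, not the number or length of the reads, fixes the dimensionality.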

I am not sure, but there seems to be no rigorous proof that the result really is a vector space (I doubt that this holds for arbitrary prototypes). However, if we assume that it is, then we could say we need a large number of prototypes to fully represent sequencing data. If we take this position, we can conclude that the dimensionality of NGS data should be less than or equal to that of the set of all possible substrings of the genome (possibly of a certain length) (edit: is this true, by the way??), because all NGS reads are indeed substrings of the genome (plus some errors). That again would indicate that there is nothing special about the dimensionality of NGS data if one only wishes to be able to represent every possible outcome, but that whoever made this statement wanted to point out that the data is simply "large" or "of high volume", ignoring the meaning of dimensionality in the context of linear algebra.

Another concept to consider in the context of machine learning techniques on strings is string kernels, which also have some applications in bioinformatics (e.g. Leslie et al.).

0
8.7 years ago

High-dimensional data in NGS is usually associated with RNA-Seq experiments in which multiple samples are sequenced. You can consider each sample as a dimension, or each gene as a dimension. Treating each gene as a dimension will typically produce tens of thousands of dimensions.

So, as an analogy to image processing: an image where each pixel is a dimension and the RGBA values are the data, versus an experiment where each sample is a dimension and the data are vectors of expression values per gene, versus an experiment where each gene is a dimension and the data are vectors of expression values per sample for that gene.
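The two views above are just a transpose of the same count matrix. A toy illustration with made-up gene names, sample names, and counts:

```python
genes = ["geneA", "geneB", "geneC"]
samples = ["s1", "s2"]

# counts[gene][sample]: rows are genes, columns are samples (toy numbers).
counts = [
    [10, 12],  # geneA
    [0,  3],   # geneB
    [7,  7],   # geneC
]

# Genes as dimensions: each sample is one point in 3-dimensional gene-space.
sample_points = list(zip(*counts))
print(sample_points)  # → [(10, 0, 7), (12, 3, 7)]

# Samples as dimensions: each gene is one point in 2-dimensional sample-space,
# which is just the rows of the matrix as they stand.
gene_points = [tuple(row) for row in counts]
print(gene_points)  # → [(10, 12), (0, 3), (7, 7)]
```

Which orientation you pick depends on what you want to cluster or project: samples (points in gene-space) or genes (points in sample-space).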

1

I don't think your answer is quite correct. In principle, every variable is a dimension; in your RNA-Seq example, e.g., every identified transcript represents a variable, with the associated read count being its value.

2

You are right. I had it the opposite way around, actually. I've fixed my post.