Question

Best way to normalize occurance of a nucleotide subsequence

1

Entering edit mode

7.4 years ago

snamjoshi87 ▴ 40

If I have a set of nucleotide sequences of varying lengths and I want to find the number of times a certain subsequences occurs for each sequence in this set, what is the best way to normalize so I can compare the number of hits each sequence has? I feel that longer sequences will (maybe) be more likely to contain more hits so I need to take gene length into account. My first instinct is to just divide by gene length (or by kilobases) to get a frequency per x comparison but I don't know if this is meaningful. How is this kind of comparison normally done?

A related question I have is how I can normalize by position. Say I want to see if in general more hits tend to occur at the beginning of the coding sequence or end of the 5'UTR. How can I normalize so each gene is say, of length 1-100 (including UTRs) that way I can plot a distribution of sequence occurrence by (scaled) position? I know how this could be done but I just want to make sure I do it in a way that is generally considered acceptable.

sequence normalization • 2.3k views

ADD COMMENT • link updated 7.4 years ago by Macherki M E ▴ 120 • written 7.4 years ago by snamjoshi87 ▴ 40

1

Entering edit mode

There is many type of normalization you do. See seqinr package in R. It can provide some tools like z score,rho... For analyse I think that you have use X2 to test independancy or simply you can use Ecludic or correlation distance

ADD REPLY • link 7.4 years ago by Macherki M E ▴ 120

0

Entering edit mode

If I were to use Rho, how might I approach cases where there are multiple options? The Rho() function will give me over- and under-representation for combinations of a specified size. But what if I want to know the Rho for two different sequence? Or ones with N's? For example, what if my sequence is "CCWGA?. This one is actually "CCAGA" or "CCTGA". For another case, what about "RACNNNGC"? How can combine Rho values for multiple cases?

ADD REPLY • link 7.4 years ago by snamjoshi87 ▴ 40

0

Entering edit mode

The Rho is absolute calculation of independency between a sequence with alphabet z. As i know ,you can edit the alphabet in the sequence. Missing value takes absolutly p=1. For other upac use the sum of letters frequency. Try then to modify the rho function to get setting options.

ADD REPLY • link 7.4 years ago by Macherki M E ▴ 120

score 2 · Accepted Answer · 2016-11-28

2

Entering edit mode

7.4 years ago

Macherki M E ▴ 120

Rho index is a generic function calculation in bioinformatics. kralin S. report that the abundance shame of a subsequence can be calculated using the Probability of independent evens:

P(ab)=P(a)*P(b)  ===> P(ab….n)=P(a)*P(b)*…P(n)

Rho is set as the ratio between join evens "ab..n" and individual evens a,b,…n Then Rho=P(ab..n)/P(a)*P(b)*P(n) P(a)*P(b)*…P(n) denoted the expected value of the event ab..n

Example : we have the target sequence of size k=11"atwnnnvcggg"P(a)=`the frequency of 'a', the same manner for t,c and g

P(w)=P(a)+P(t),P(v)=1-P(t) and P(n)=1

Expected (P(atwnnnvcggg))=P(a)*P(t)* P(w)*P(n)* P(n)* P(n)*P(v)*P(c)*P(g) *P(g) *P(g)

frequency(atwnnnvcggg)=number of occurrence of atwnnnvcggg /N-k-1

where N denote the whole sequence size,then;

Rho(atwnnnvcggg)=frequency(atwnnnvcggg)/ Expected (P(atwnnnvcggg))

In seqinr package, rho function doesn't take care the IUPAC coding then we have to edit the function to do the work. This action is not recommended because the number of all possible kmers combinations is too big. To this work, we have to calculate separately: - The number of occurrence of the target sequence: gregexpr2 biostrongs package - Their expected value as denoted previously See also Computational Genome Analysis, Richard C. Deonier, Simon Tavaré, Michael S. Waterman

ADD COMMENT • link 7.4 years ago by Macherki M E ▴ 120

1

Entering edit mode

Thank you! I'm beginning to understand how this works. For non IUPAC coding, can I do something like this: CCWGA means either "CCAGA" OR "CCTGA". Would that be:

freq(CCWGA) = freq(CCAGA) + freq(CCTGA)

expected(p(CCWGA) = p(CCAGA) + p(CCTGA)

Rho = freq(CCWGA) / expected(p(CCWGA)

Is that valid?

ADD REPLY • link 7.4 years ago by snamjoshi87 ▴ 40

0

Entering edit mode

A think yes. Is just a simple algebra. You can also use the reverse complement to your sequence,your sample will more extinded see karlin s. works.

ADD REPLY • link 7.4 years ago by Macherki M E ▴ 120

0

Entering edit mode

Great! I went ahead and made a function that computes Rho for two separate motifs to account for other IUPAC coding. This will get around my problem and I can move forward. Thanks for your help!

ADD REPLY • link 7.4 years ago by snamjoshi87 ▴ 40