Question: How Should I Encode DNA Into A Piddle?
8.0 years ago, Flies (100) wrote:

I'm going to be doing some non-linear regression (with a huge and messy residual function), and I am thinking of using PDL::Fit::LM (I had some trouble getting Levmar to install).

The explanatory variables for my fit are DNA sequences (which I'm feeding into a position-specific weight matrix). What's the easiest way to put a DNA sequence into a piddle? Given that the function I'm working with is a big mess, performance is a consideration.

Since my weight matrix is constrained so that the weights at a given position sum to zero, my current plan is to represent each nucleotide as a vector of three elements: A -> [1,0,0], C -> [0,1,0], G -> [0,0,1], T -> [-1,-1,-1]. This way I can take a subsequence of my total sequence, multiply it with my weight matrix, and get the score.
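For reference, here is a minimal sketch of that plan in plain PDL. The helper names encode_dna and score_window and the weight values are made up for illustration, and "multiply" is taken here to mean an elementwise product summed over the window, which is an assumption about the intended scoring.

```perl
use strict;
use warnings;
use PDL;

# Sum-to-zero coding from the question: T is minus the sum of the others.
my %code = (
    A => [ 1,  0,  0],
    C => [ 0,  1,  0],
    G => [ 0,  0,  1],
    T => [-1, -1, -1],
);

# encode_dna (hypothetical helper): DNA string -> piddle with dims (3, length).
sub encode_dna {
    my ($seq) = @_;
    my @rows;
    for my $base (split //, uc $seq) {
        die "unexpected base '$base'\n" unless exists $code{$base};
        push @rows, $code{$base};
    }
    return pdl(\@rows);
}

# score_window (hypothetical helper): elementwise product of a window of the
# encoded sequence with the weight matrix (dims (3, width)), summed up.
sub score_window {
    my ($enc, $w, $offset) = @_;
    my $last = $offset + $w->dim(1) - 1;
    return sum( $enc->slice(":,$offset:$last") * $w );
}

# Made-up weights for a motif of width 2: one (A, C, G) triple per position.
# The T weight at each position is implied by the sum-to-zero constraint.
my $w   = pdl([ [0.5, -0.2, 0.1], [0.3, 0.3, -0.1] ]);
my $enc = encode_dna('ACGT');

print "dims: ", join(' x ', $enc->dims), "\n";
print "AC:   ", score_window($enc, $w, 0), "\n";
print "GT:   ", score_window($enc, $w, 2), "\n";
```

If this goes into a fit, it is probably worth encoding each sequence once, outside the residual function, so that only the slice and the product are repeated on every iteration.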

perl
modified 8.0 years ago by Dr. Mabuse (47k) • written 8.0 years ago by Flies (100)

+1 for the most amusing BioStar title to date.

written 8.0 years ago by Casey Bergman (18k)

What's your question? Seems like you've answered it yourself.

written 8.0 years ago by Qdjm (1.9k)

Others have successfully used PDL for encoding alignments and other DNA-related stuff. Too bad the PDL documentation is terrible.

written 8.0 years ago by Martin A Hansen (3.0k)

@qdjm I'm just guessing that I'm not the first person to do this, and I'm wondering what solutions people have come up with. I mention my current idea as a point of reference.

written 8.0 years ago by Flies (100)
8.0 years ago, Dr. Mabuse (47k), Bergen, Norway, wrote:

If I understood your question correctly, you want to store the nucleotide sequence in a PDL data structure; is that correct? If not, please update your question, as it is a bit confusing. I do not immediately see the advantage of doing this instead of sticking with a normal string, so the question is: why would you want to do this? Anyway, you could use PDL::Char,

something along these lines:

use PDL;
use PDL::Char;
# Store the three sequences in a byte piddle, one character per element.
my $pchar = PDL::Char->new( ['ACGT', 'ATGT', 'TGAA'] );

As you don't have control over the storage size of a variable (one could think of using a 2-bit encoded format, but there is no bit-pdl), this might already be the most efficient way memory-wise.

modified 8.0 years ago • written 8.0 years ago by Dr. Mabuse (47k)

As to the reason why, it's because I have a quantitative model that uses DNA sequence as input, and I want to do the calculation as efficiently as possible.

written 8.0 years ago by Flies (100)