Question

How does one construct the RNASeq expression (count) matrix in general?

0

Entering edit mode

9.6 years ago

Amusement Pork ▴ 60

I have RNASeq data from a de novo transcriptome assembly. The data is not given to me in SAM/BAM files, but instead in some kind of proprietary format where each contig is represented nucleotide by nucleotide, and for each nucleotide you are given the number of hits for this nucleotide over the samples in the experiment. So to sketch it:

Contig6

Nuc     Sample1     Sample 2     Sample 3     Sample 4
A       1           0            3            4
C       21          15           18           17

This describes Contig6, for which the assembled sequence was "AC". In order to get from here to a matrix which describes the number of reads per contig for each sample, my first reaction is to sum each column and normalize by the sequence length, then round to nearest integer, yielding

rowname     Sample 1     Sample 2     Sample 3     Sample 4
Contig6     11           7            10           10

This example is made up, the real contigs are several hundreds to thousands of nucleotides long, but that doesn't change the approach.

Is there any consensus about how to do this correctly? I'm fairly sure it's introducing some sort of bias into the counts on several occasions.

Thanks a lot!

RNA-Seq • 3.0k views

ADD COMMENT • link updated 2.3 years ago by Ram 43k • written 9.6 years ago by Amusement Pork ▴ 60

0

Entering edit mode

Is it possible for you to just get the raw data? That'd be vastly easier to deal with than some made-up that someone dreamed up.

ADD REPLY • link 9.6 years ago by Devon Ryan 104k

0

Entering edit mode

Hey Devon, this is literally what my data looks like, I have no access to the actual assembly.

ADD REPLY • link updated 2.3 years ago by Ram 43k • written 9.6 years ago by Amusement Pork ▴ 60

Ram · Answer 1 · 2014-09-22

Whoever gave this to you owes you copious amounts of beer.

If you know anything about the average read length then you'll be best off just dividing the column sums by that. If there's any soft-clipping or deletions then you'll still get some fractional counts, but that's OK. I wouldn't recommend rounding things, just use limma::voom instead, since you can treat these as expected counts.

If you don't know anything about the average read length, then you can probably guesstimate it from the data that you have (it'll give the highest number of integers using the aforementioned method).