Question: How to TMM after HTSeq-count?
0
gravatar for tinmad
23 months ago by
tinmad10
Korea, Republic Of
tinmad10 wrote:

I work on RNA sequencing at the moment.
I've used tophat and cufflinks before, so now I'd like to do it in different way.
My new RNA-seq workflow is like below

BWA -> HTSeq-count -> TMM

However, I have no idea how to TMM after I finished BWA and HTSeq-count.
I've found that I have to use edgeR package in R and the function for TMM is calcNormFactors().
I've tried it already but there were some errors as I'm not much familiar with it.

Now I have about 10 bam files as a result of BWA, and 10 HTSeq-count results from that 10 bam files.
Each HTSeq-count result has two columns, one for gene symbols and the other for the counts.
(and last 4 rows are values about no feature, ambiguous, too low aQual, not aligned, alignment not unique)

I wonder

1) How to TMM with HTSeq-count results.

2) What is the input format that I can make from those 10 HTSeq-count results for calcNormFactors() in R.

tmm edger rna-seq htseq • 1.5k views
ADD COMMENTlink modified 23 months ago by h.mon9.6k • written 23 months ago by tinmad10
2

You should use a splice-aware aligner for RNA-Seq data (e.g. STAR, Tophat, HiSat,...) 

ADD REPLYlink written 23 months ago by Nicolas Rosewick5.8k

Do you mean that I sould use Tophat or something else than BWA?

ADD REPLYlink written 23 months ago by tinmad10

Yes because bwa does not support reads alignment over splice junction

ADD REPLYlink written 23 months ago by Nicolas Rosewick5.8k

Do you mean "TPM" or "CPM"? edgeR has a function for for calculating CPM.

ADD REPLYlink written 23 months ago by Michael Dondrup43k
1

As I know, I can put some options like calcNormFactors(object, method="TMM"). Is it TMM normalization? 

ADD REPLYlink written 23 months ago by tinmad10

Don't specify anything and you'll get TMM normalization.
 

ADD REPLYlink written 23 months ago by Devon Ryan73k

If you want to perform TMM normalization then use edgeR, since that's where TMM comes from. If that gives you an error then post it.

calcNormFactors() takes a matrix of counts.

ADD REPLYlink modified 23 months ago • written 23 months ago by Devon Ryan73k

I tried to put 10 HTSeq-count results together, so I made one matrix with them.
It has 11 columns; 1st column for gene symbols and other columns are about count values of 11 HTSeq-count results.
First error I got before was about that column(x): 'x' must be numeric.

ADD REPLYlink written 23 months ago by tinmad10
1

Try putting the gene names into rownames, e.g. if your data is in my.counts:

rownames(my.counts) <- my.counts[,1]

mycounts <- my.counts[,-1]

The count matrix must be numeric.

ADD REPLYlink written 23 months ago by Michael Dondrup43k

Thank you for your help, I tried it, and got 1 column with 10 rows as a result like below.

"x"
"1"  1.00593967930787
"2"  0.993971788680494
"3"  0.999906336660367
 ...

It's quite confusing because I expected that the result will be a matrix which has the same number of rows and columns with input data (about 30,000 rows and 11 rows, in this case) with normalized values... Is this result right? or am I mistaken?..

ADD REPLYlink written 23 months ago by tinmad10

No, something is wrong here. What does dim(x) and head(x) give before these operations?

ADD REPLYlink written 23 months ago by Michael Dondrup43k

before these operations, dim(x) gives [1] 33126   10 and head(x) gives
                sample1    sample2    sample3    .....
A1BG
             27           42            53         .....
A1BG.AS1       19           25            21         .....
A1CF              8            21            17         .....
  .....               .....           .....            .....         .....

ADD REPLYlink written 23 months ago by tinmad10

Another little issue, and I am not sure how HTseq count deals with it, but did you check that the count files were all compatible, that means that the rownames are identical and in the same order in all files? You are probably fine, but better to check...

 

ADD REPLYlink written 23 months ago by Michael Dondrup43k

This is the default, isn't it? Btw, TMM (trimmed mean of M-values) normalization is

Robinson, M.D. and Oshlack, A. (2010). A scaling normalization method for differential
expression analysis of RNA-seq data. Genome Biology 11, R25.

I guess I have never changed the defaults...

ADD REPLYlink modified 23 months ago • written 23 months ago by Michael Dondrup43k

Did you follow through edgeR's manual? Did you create a DGEList?

I've tried it already but there were some errors as I'm not much familiar with it.

What are the errors you are getting?

Finally, NicoBxl is right, use STAR (which will give you counts) or Subread+featureCounts or HISAT2 to do the mapping.

ADD REPLYlink modified 23 months ago • written 23 months ago by h.mon9.6k
1
gravatar for h.mon
23 months ago by
h.mon9.6k
Brazil
h.mon9.6k wrote:

Create the DGEList as per edgeR manual:

x <- read.delim( "fileofcounts.txt", row.names = "Symbol" )
# With group you specify your experimental factors
group <- factor( c( 1, 1, 2, 2 ) )  
y <- DGEList( counts = x, group = group )
y <- calcNormFactors( y )

Here x deserves a bit more of explanation: there is one header line, the first column is named Symbol and has gene names, the other columns have sample names and counts.

Symbol    Sample1    Sample2    Sample3    Sample4
RCAN1    10    14    40    50
KCNE1    11    12    41    51
...

 

ADD COMMENTlink written 23 months ago by h.mon9.6k

I just want to normalize the counts, not want to find DEGs. Should I do the grouping by DGEList?

What should I do if there is no reason to do the grouping? (should I make 10 groups for 10 samples?)

Also, I got the result after create DGEList (as a trial) but

1) there is no change in values. Is it possible to happen?

2) when I try to write the result, I got the error; Error in as.data.frame.default(x): cannot coerce class "structure("DGEList", package="edgeR")" to a data.frame. How can I save the result?

ADD REPLYlink written 23 months ago by tinmad10

No, you do not need to do grouping, you may create the DGEList as y <- DGEList( counts = x ). But you can easily test if different groups change the scaling factor, if you are curious.

You ought to read edgeR manual more carefully:

We call the product of the original library size and the scaling factor the effective library size. The effective library size replaces the original library size in all downsteam analyses.

Raw counts are still raw counts, they are left unchanged, what changes is y$samples$norm.factors.

A DGEList is a list, it is not a data frame nor it is easily coerced into one. You can save some of its elements (e.g. y$counts) with write.table - which is what you tried I assume.

ADD REPLYlink modified 23 months ago • written 23 months ago by h.mon9.6k

Thank you for your help.

I just checked norm.factors and fouund that it has just one row with 10 values.

It's quite confusing as I expected that the result will be a matrix which has the same number of rows and columns with input data (about 30,000 rows and 11 rows, in this case) with normalized values.
What I am really interested in is the 'normalized vlaues' by TMM.
For example, if I want to compare a plot from the original count with a plot from the normalized values,
what element should I extract from the result of calcNormFactors()? Is norm.factors the right thing for this case?

 

 

ADD REPLYlink written 23 months ago by tinmad10

Each sample has a normalization factor, there's no reason for it to output a matrix.
 

ADD REPLYlink written 23 months ago by Devon Ryan73k

okay... I think I've been mistaken about it..! thank you for the great help!

Then if I want to have a matrix with normalized values which is generated using normalization factors from TMM, should I use cpm()?

Also, a result from HTSeq-count includes the number of 'no feature', 'ambiguous', 'too low aQual', 'not aligned', and 'alignment not unique' in its last 5 rows. Do I have to exclude them from the matrix to be normalized?

ADD REPLYlink written 23 months ago by tinmad10

I have no idea what you want to use the values for, so cpm may or may not be appropriate.

Yes, remove the last 5 rows from the count table. Actually, make sure to do that before computing the normalization factors.

ADD REPLYlink written 23 months ago by Devon Ryan73k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 957 users visited in the last hour