How to TMM after HTSeq-count?
1
0
Entering edit mode
5.8 years ago

I work on RNA sequencing at the moment.
I've used tophat and cufflinks before, so now I'd like to do it in different way.
My new RNA-seq workflow is like below

BWA -> HTSeq-count -> TMM

However, I have no idea how to TMM after I finished BWA and HTSeq-count.
I've found that I have to use edgeR package in R and the function for TMM is calcNormFactors().
I've tried it already but there were some errors as I'm not much familiar with it.

Now I have about 10 bam files as a result of BWA, and 10 HTSeq-count results from that 10 bam files.
Each HTSeq-count result has two columns, one for gene symbols and the other for the counts.
(and last 4 rows are values about no feature, ambiguous, too low aQual, not aligned, alignment not unique)

I wonder

1) How to TMM with HTSeq-count results.

2) What is the input format that I can make from those 10 HTSeq-count results for calcNormFactors() in R.

RNA-Seq TMM edgeR HTSeq • 4.5k views
2
Entering edit mode

You should use a splice-aware aligner for RNA-Seq data (e.g. STAR, Tophat, HiSat,...)

0
Entering edit mode

Do you mean that I sould use Tophat or something else than BWA?

0
Entering edit mode

Yes because bwa does not support reads alignment over splice junction

0
Entering edit mode

Do you mean "TPM" or "CPM"? edgeR has a function for for calculating CPM.

1
Entering edit mode

As I know, I can put some options like

calcNormFactors(object, method="TMM")


Is it TMM normalization?

0
Entering edit mode

Don't specify anything and you'll get TMM normalization.

0
Entering edit mode

If you want to perform TMM normalization then use edgeR, since that's where TMM comes from. If that gives you an error then post it.

calcNormFactors() takes a matrix of counts.

0
Entering edit mode

I tried to put 10 HTSeq-count results together, so I made one matrix with them.
It has 11 columns; 1st column for gene symbols and other columns are about count values of 11 HTSeq-count results.
First error I got before was about that column(x): 'x' must be numeric.

1
Entering edit mode

Try putting the gene names into rownames, e.g. if your data is in my.counts:

rownames(my.counts) <- my.counts[,1]
mycounts <- my.counts[,-1]


The count matrix must be numeric.

0
Entering edit mode

Thank you for your help, I tried it, and got 1 column with 10 rows as a result like below.

"x"
"1"  1.00593967930787
"2"  0.993971788680494
"3"  0.999906336660367
...


It's quite confusing because I expected that the result will be a matrix which has the same number of rows and columns with input data (about 30,000 rows and 11 rows, in this case) with normalized values... Is this result right? or am I mistaken?..

0
Entering edit mode

No, something is wrong here. What does dim(x) and head(x) give before these operations?

0
Entering edit mode

before these operations, dim(x) gives [1] 33126 10 and head(x) gives

                sample1    sample2    sample3    .....
A1BG            27         42         53         .....
A1BG.AS1        19         25         21         .....
A1CF            8          21         17         .....
.....           .....      .....      .....      .....

0
Entering edit mode

Another little issue, and I am not sure how HTseq count deals with it, but did you check that the count files were all compatible, that means that the rownames are identical and in the same order in all files? You are probably fine, but better to check...

0
Entering edit mode

This is the default, isn't it? Btw, TMM (trimmed mean of M-values) normalization is

Robinson, M.D. and Oshlack, A. (2010). A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biology 11, R25.

I guess I have never changed the defaults...

0
Entering edit mode

Did you follow through edgeR's manual? Did you create a DGEList?

I've tried it already but there were some errors as I'm not much familiar with it.

What are the errors you are getting?

Finally, NicoBxl is right, use STAR (which will give you counts) or Subread+featureCounts or HISAT2 to do the mapping.

1
Entering edit mode
5.8 years ago
h.mon 33k

Create the DGEList as per edgeR manual:

x <- read.delim( "fileofcounts.txt", row.names = "Symbol" )
# With group you specify your experimental factors
group <- factor( c( 1, 1, 2, 2 ) )
y <- DGEList( counts = x, group = group )
y <- calcNormFactors( y )


Here x deserves a bit more of explanation: there is one header line, the first column is named Symbol and has gene names, the other columns have sample names and counts.

Symbol    Sample1    Sample2    Sample3    Sample4
RCAN1    10    14    40    50
KCNE1    11    12    41    51
...

0
Entering edit mode

I just want to normalize the counts, not want to find DEGs. Should I do the grouping by DGEList?

What should I do if there is no reason to do the grouping? (should I make 10 groups for 10 samples?)

Also, I got the result after create DGEList (as a trial) but

1. there is no change in values. Is it possible to happen?
2. when I try to write the result, I got the error:

Error in as.data.frame.default(x): cannot coerce class "structure("DGEList", package="edgeR")" to a data.frame.


How can I save the result?

0
Entering edit mode

No, you do not need to do grouping, you may create the DGEList as y &lt;- DGEList( counts = x ). But you can easily test if different groups change the scaling factor, if you are curious.

You ought to read edgeR manual more carefully:

We call the product of the original library size and the scaling factor the effective library size. The effective library size replaces the original library size in all downsteam analyses.

Raw counts are still raw counts, they are left unchanged, what changes is y$samples$norm.factors.

A DGEList is a list, it is not a data frame nor it is easily coerced into one. You can save some of its elements (e.g. y\$counts) with write.table - which is what you tried I assume.

0
Entering edit mode

I just checked norm.factors and found that it has just one row with 10 values.

It's quite confusing as I expected that the result will be a matrix which has the same number of rows and columns with input data (about 30,000 rows and 11 rows, in this case) with normalized values.

What I am really interested in is the normalized values by TMM.

For example, if I want to compare a plot from the original count with a plot from the normalized values,

what element should I extract from the result of calcNormFactors()? Is norm.factors the right thing for this case?

0
Entering edit mode

Each sample has a normalization factor, there's no reason for it to output a matrix.

0
Entering edit mode

okay... I think I've been mistaken about it..! thank you for the great help!

Then if I want to have a matrix with normalized values which is generated using normalization factors from TMM, should I use cpm()?

Also, a result from HTSeq-count includes the number of 'no feature', 'ambiguous', 'too low aQual', 'not aligned', and 'alignment not unique' in its last 5 rows. Do I have to exclude them from the matrix to be normalized?

0
Entering edit mode

I have no idea what you want to use the values for, so cpm may or may not be appropriate.

Yes, remove the last 5 rows from the count table. Actually, make sure to do that before computing the normalization factors.