Question

How to use htseq-count with several samples?

0

4.8 years ago

takoyaki ▴ 120

Does anyone know how to use htseq-count with several samples?

We can use htseq-count like this: htseq-count sample1.sam reference.gtf > result.count.txt

The command above gives us the count data for sample1. But we usually have more than two samples, so we have to run htseq-count on each sample's SAM file. Do most people combine the results into a matrix after running htseq-count on each sample? Or can we build an expression matrix for several samples in a single run?

Also, I think there are differences between samples, such as total expression level or read count. How many people do any normalization or correction between samples?

Thank you.

RNA-Seq next-gen gene • 4.3k views

ADD COMMENT • link updated 2.4 years ago by VBer ▴ 200 • written 4.8 years ago by takoyaki ▴ 120

2

You have to run it separately for each sample. Once you have the counts, you can use R to combine them into a single matrix:

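library(parallel)

# List the per-sample htseq-count outputs; `pattern` is a regular expression, not a glob
files <- dir(pattern = "\\.counts$", full.names = TRUE)

res <- mclapply(files, function(fil) {
  read.delim(fil, header = FALSE, stringsAsFactors = FALSE,
             col.names = c("gene", "count"))
}, mc.cores = 16)
names(res) <- sub("\\.counts$", "", basename(files))

# HTSeq appends these summary rows to the end of every file; they are not genes
addInfo <- c("__no_feature", "__ambiguous",
             "__too_low_aQual", "__not_aligned",
             "__alignment_not_unique")

# One column per sample; assumes every file lists the same genes in the same order (same GTF)
counts <- sapply(res, function(x) x$count)
rownames(counts) <- res[[1]]$gene
counts <- counts[!rownames(counts) %in% addInfo, , drop = FALSE]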
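If you would rather not use R for this step, the same merge can be sketched in plain Python. This is only a minimal illustration, assuming each sample has a two-column, tab-separated htseq-count file ending in `.counts` (as in the pattern above); the function names `read_counts` and `merge_counts` are made up for this example.

```python
import csv
from pathlib import Path

# htseq-count's summary rows (e.g. __no_feature, __ambiguous) all start with "__".
SPECIAL_PREFIX = "__"


def read_counts(path):
    """Read one two-column htseq-count file into {gene_id: count}."""
    counts = {}
    with open(path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            # Skip blank/short lines and the summary rows at the end of the file.
            if len(row) < 2 or row[0].startswith(SPECIAL_PREFIX):
                continue
            counts[row[0]] = int(row[1])
    return counts


def merge_counts(paths):
    """Return (sample_names, {gene_id: [count for each sample]})."""
    paths = sorted(Path(p) for p in paths)
    samples = [p.stem for p in paths]  # "A.counts" -> "A"
    per_sample = [read_counts(p) for p in paths]
    # Union of gene IDs, so a gene missing from one file gets a 0 there.
    genes = sorted(set().union(*per_sample))
    matrix = {g: [c.get(g, 0) for c in per_sample] for g in genes}
    return samples, matrix
```

Calling `merge_counts(Path(".").glob("*.counts"))` would return the sample names (taken from the file names) and one row of counts per gene, which you can then write out as a tab-separated table.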

Hope this helps!

ADD REPLY • link 4.8 years ago by Lila M ★ 1.2k

0

Sorry, the last sentence of my question is wrong.

This is the correct version:

How do most people do normalization or correction between samples?

ADD REPLY • link 4.8 years ago by takoyaki ▴ 120

0

You can edit your post and correct that sentence.

ADD REPLY • link 4.8 years ago by WouterDeCoster 47k

score 6 · Accepted Answer · 2019-07-12

6

4.8 years ago

WouterDeCoster 47k

If you use htseq-count, you would run it separately for each sample. featureCounts is probably a better tool for this.

If you use htseq-count, you can import its output directly into DESeq2 (you did not tell us what your goal is, but I'll assume differential expression analysis). See here in the documentation.

Also, I think there are differences between samples, such as total expression level or read count. How many people do any normalization or correction between samples?

Again, it depends on what you want to achieve, but I would say everyone should normalize. If you go on to use DESeq2, though, you don't have to worry about it, as DESeq2 will take care of normalizing your samples.
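To give an idea of what that normalization does: DESeq2's default is the median-of-ratios method, which estimates one size factor per sample and divides that sample's counts by it. The following is only a rough sketch of the idea in plain Python, not DESeq2's actual code; the function names and the `{gene_id: [count per sample]}` layout are assumptions made for this example.

```python
import math
from statistics import median


def size_factors(matrix):
    """Median-of-ratios size factors for {gene_id: [count per sample]}."""
    n_samples = len(next(iter(matrix.values())))
    log_ratios = [[] for _ in range(n_samples)]
    for counts in matrix.values():
        # Genes with a zero in any sample have a geometric mean of 0; skip them.
        if min(counts) <= 0:
            continue
        log_geo_mean = sum(math.log(c) for c in counts) / n_samples
        for j, c in enumerate(counts):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    if not log_ratios[0]:
        raise ValueError("no gene has a non-zero count in every sample")
    # Size factor = median, over genes, of count / geometric mean across samples.
    return [math.exp(median(r)) for r in log_ratios]


def normalize(matrix, factors):
    """Divide every sample's counts by that sample's size factor."""
    return {gene: [c / f for c, f in zip(counts, factors)]
            for gene, counts in matrix.items()}
```

A sample sequenced twice as deeply as another should end up with a size factor about twice as large, so the normalized counts become comparable. For a real analysis, let DESeq2 do this for you rather than relying on a sketch like this one.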

ADD COMMENT • link 4.8 years ago by WouterDeCoster 47k

0

Oh is this the case?

I ran htseq-count like this:

htseq-count -f bam A.bam B.bam C.bam Mus_musculus.GRCm39.104.gtf  >counts.txt

And the output looks fine:

ENSMUSG00000000001      3       2       0      
ENSMUSG00000000003      0       0       0       
ENSMUSG00000000028      0       0       0       
ENSMUSG00000000031      30      23      10      
ENSMUSG00000000037      0       0       0      
ENSMUSG00000000049      9       6       1

Now I'm wondering if the program worked as intended. I will run the program individually and check if there is any difference.

Edit: Just ran the first sample all by itself, same values were generated.
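For anyone who wants to repeat this check on every gene instead of eyeballing it, here is a small Python sketch. It assumes whitespace-separated output without a header line, as shown above; the function names and any file names you pass in are hypothetical.

```python
def read_table(path):
    """Read a count table into {gene_id: [counts...]}, one list entry per sample column."""
    table = {}
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            if fields:
                table[fields[0]] = [int(x) for x in fields[1:]]
    return table


def pick(table, gene, column):
    """Return the count in the given column, or None if the gene or column is absent."""
    counts = table.get(gene)
    if counts is None or column >= len(counts):
        return None
    return counts[column]


def mismatching_genes(combined_path, single_path, column):
    """List genes whose count in `column` of the combined table differs from the single-sample run."""
    combined = read_table(combined_path)
    single = read_table(single_path)
    genes = combined.keys() | single.keys()
    return sorted(g for g in genes
                  if pick(combined, g, column) != pick(single, g, 0))
```

An empty list from `mismatching_genes("counts.txt", "A.counts.txt", 0)` would mean the first column of the combined run matches the run on A.bam alone, including the `__` summary rows.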

ADD REPLY • link 2.4 years ago by VBer ▴ 200