Error in extracting Counts from the bam file using htseq-count
1
2
Entering edit mode
3.7 years ago
Biologist ▴ 220

Hi,

I'm trying to extract counts for each gene from the sorted.bam file using htseq-count. But I end with an error. Command I used:

htseq-count –s reverse –f bam –q sample.sorted.bam gencode.v27.primary_assembly.annotation.gtf > sample1_counts.txt


I have this error:

/path/soft/apps/HTSeq/0.6.1p1-goolf-1.7.20-Python-2.7.11/bin/htseq-count: Error: Please provide two arguments.

Call with '-h' to get usage information.


For aligning reads to the genome I used Hisat2 tool. I didn't use any gtf file here. Following is the hisat2 command I used:

hisat2 -p 8 --dta --rna-strandness RF -x /grch38_snp_tran/genome_snp_tran -1 sample.1.fastq.gz -2 sample.2.fastq.gz | samtools view -Sb - > sample.bam; samtools sort -T /tmp/sample.sorted -o sample.sorted.bam sample.bam


Did I go wrong with the alignment step? Why htseq-count is showing the Error?

RNA-Seq htseq counts bam • 2.4k views
0
Entering edit mode

No, you definitely didn't do anything wrong in the alignment. The problem is arising because the command line interpreter is not parsing correctly the BAM and the GFF argument that you provide. Sorry to ask but, is that the FULL command that you ran? Or did you rewrite it to show it to us?

Also, who installed the program? The sysadmin or is it installed by you with some kind of wrapper? Perhaps that is the reason why the arguments are not parsed despite being declared.

Another reason might be that you are specifying filenames that do not exist or require additional path information.

0
Entering edit mode

I didn't mention the paths here. The command for htseq-count I gave above is the right one. I didn't install the program.

0
Entering edit mode

So if for:

• sample.sorted.bam
• gencode.v27.primary_assembly.annotation.gtf

you check the path, the file is there? This is the cause of 90% of the issues, regardless of the experience of the researcher.

1
Entering edit mode
3.7 years ago
h.mon 33k

If your htseq command above is exactly the one you are running, the error stems from using paragraph marks –s instead of hyphens -s to pass the command-line arguments. Did you copy-pasted the command from some internet page?

htseq-count -s reverse -f bam -q sample.sorted.bam gencode.v27.primary_assembly.annotation.gtf > sample1_counts.txt

0
Entering edit mode

I actually Gave this way and it worked but I don't see any counts. htseq-count –s reverse -q –f bam sample1.sorted.bam gencode.v27.primary_assembly.annotation.gtf > sample1_counts.txt

ENSG00000000003.14      0
ENSG00000000005.5       0
ENSG00000000419.12      0
ENSG00000000457.13      0
ENSG00000000460.16      0
ENSG00000000938.12      0
ENSG00000000971.15      0
ENSG00000001036.13      0
ENSG00000001084.10      0
ENSG00000001167.14      0
ENSG00000001460.17      0


Everything is 0. In the end of the file I see:

__no_feature    148352155
__ambiguous     344
__too_low_aQual 1986678
__not_aligned   2543185
__alignment_not_unique  91979962

1
Entering edit mode

Are chromosome names matching in your alignment file and the GTF file you are using?

0
Entering edit mode

You are right. They look different in the sorted.abm file it is without "chr" and in gtf file I see "chr". I gave it like this cat gencode.v27.primary_assembly.annotation.gtf | sed 's/chr//' > gencode.v27.primary_assembly.annotation_nochr.gtf Using htseq-count now.

1
Entering edit mode

Be careful with all of this. Make sure your GTF file is matching the genome build of the genome used for the alignments.

0
Entering edit mode

Basically I haven't used any gtf file in the alignment step. I'm using a standard genome index from hisat2 website. And now I stripped the "chr" from the gtf file. I will use this gtf file for stringtie and other steps. Is this ok? Or should I build own genome index?

1
Entering edit mode

I was only referring to the counting step.

0
Entering edit mode

small question. Using "htseq-count" command for each sample it is taking 3 hours runtime to extract counts. Why is taking so much time? Any idea?

1
Entering edit mode

htseq-count is comparatively slow. You should use featureCounts (http://bioinf.wehi.edu.au/featureCounts/ ). You will find it much faster.

0
Entering edit mode

Sure Thankyou. When I used htseq-count First I see Ensembl Ids with number of read counts and in that last I see some assignments like following. Do I need to bother anything about this for differential analysis steps?

Assigned: 99428670 (40.6%)
Ambiguous: 8859783 (3.6%)
Alignment not unique: 91979962 (37.6%)
No Feature: 40066091 (16.4%)
Too Low aQual: 1986678 (0.8%)
Not Aligned: 2543185 (1.0%)

0
Entering edit mode

Hi, Could you please tell me about my previous comment. Thank you !!

1
Entering edit mode

Not much to say. Looks like you have multimappers (I assume htseq has not counted those) that are about the same size as the assigned read (do you have rRNA in your data?).

Another plug for using featureCounts is you don't need to sort your BAM files and you can feed all of them in at one time (in the order you want them in matrix). You will then get a ready to use matrix of counts for all of your data.

0
Entering edit mode

No I don't have rRNA data. Lets say After alignment I have bam files and for Stringtie I use sorted.bam file. So, for featureCounts should I use sorted or unsorted bam?

1
Entering edit mode

You can use either. If your files are sorted then there is an option to let featureCounts know that they are.

0
Entering edit mode

Thankyou very much for the information !!

0
Entering edit mode

-s: are your data "reverse"? What happens when you use "no", or "yes"?

-t: maybe you should provide the right feature type?

-r: is the order by name or by position?