Question: Error in extracting Counts from the bam file using htseq-count
0
gravatar for Biologist
18 months ago by
Biologist150
Biologist150 wrote:

Hi,

I'm trying to extract counts for each gene from the sorted.bam file using htseq-count. But I end with an error. Command I used:

htseq-count –s reverse –f bam –q sample.sorted.bam gencode.v27.primary_assembly.annotation.gtf > sample1_counts.txt

I have this error:

/path/soft/apps/HTSeq/0.6.1p1-goolf-1.7.20-Python-2.7.11/bin/htseq-count: Error: Please provide two arguments.

Call with '-h' to get usage information.

For aligning reads to the genome I used Hisat2 tool. I didn't use any gtf file here. Following is the hisat2 command I used:

hisat2 -p 8 --dta --rna-strandness RF -x /grch38_snp_tran/genome_snp_tran -1 sample.1.fastq.gz -2 sample.2.fastq.gz | samtools view -Sb - > sample.bam; samtools sort -T /tmp/sample.sorted -o sample.sorted.bam sample.bam

Did I go wrong with the alignment step? Why htseq-count is showing the Error?

rna-seq counts bam htseq • 1.3k views
ADD COMMENTlink modified 18 months ago by h.mon27k • written 18 months ago by Biologist150

No, you definitely didn't do anything wrong in the alignment. The problem is arising because the command line interpreter is not parsing correctly the BAM and the GFF argument that you provide. Sorry to ask but, is that the FULL command that you ran? Or did you rewrite it to show it to us?

Also, who installed the program? The sysadmin or is it installed by you with some kind of wrapper? Perhaps that is the reason why the arguments are not parsed despite being declared.

Another reason might be that you are specifying filenames that do not exist or require additional path information.

ADD REPLYlink modified 18 months ago • written 18 months ago by Macspider2.9k

I didn't mention the paths here. The command for htseq-count I gave above is the right one. I didn't install the program.

ADD REPLYlink written 18 months ago by Biologist150

So if for:

  • sample.sorted.bam
  • gencode.v27.primary_assembly.annotation.gtf

you check the path, the file is there? This is the cause of 90% of the issues, regardless of the experience of the researcher.

ADD REPLYlink modified 18 months ago • written 18 months ago by Macspider2.9k
1
gravatar for h.mon
18 months ago by
h.mon27k
Brazil
h.mon27k wrote:

If your htseq command above is exactly the one you are running, the error stems from using paragraph marks –s instead of hyphens -s to pass the command-line arguments. Did you copy-pasted the command from some internet page?

htseq-count -s reverse -f bam -q sample.sorted.bam gencode.v27.primary_assembly.annotation.gtf > sample1_counts.txt
ADD COMMENTlink written 18 months ago by h.mon27k

I actually Gave this way and it worked but I don't see any counts. htseq-count –s reverse -q –f bam sample1.sorted.bam gencode.v27.primary_assembly.annotation.gtf > sample1_counts.txt

ENSG00000000003.14      0
ENSG00000000005.5       0
ENSG00000000419.12      0
ENSG00000000457.13      0
ENSG00000000460.16      0
ENSG00000000938.12      0
ENSG00000000971.15      0
ENSG00000001036.13      0
ENSG00000001084.10      0
ENSG00000001167.14      0
ENSG00000001460.17      0

Everything is 0. In the end of the file I see:

__no_feature    148352155
__ambiguous     344
__too_low_aQual 1986678
__not_aligned   2543185
__alignment_not_unique  91979962
ADD REPLYlink written 18 months ago by Biologist150
1

Are chromosome names matching in your alignment file and the GTF file you are using?

ADD REPLYlink written 18 months ago by genomax70k

You are right. They look different in the sorted.abm file it is without "chr" and in gtf file I see "chr". I gave it like this cat gencode.v27.primary_assembly.annotation.gtf | sed 's/chr//' > gencode.v27.primary_assembly.annotation_nochr.gtf Using htseq-count now.

ADD REPLYlink written 18 months ago by Biologist150
1

Be careful with all of this. Make sure your GTF file is matching the genome build of the genome used for the alignments.

ADD REPLYlink written 18 months ago by genomax70k

Basically I haven't used any gtf file in the alignment step. I'm using a standard genome index from hisat2 website. And now I stripped the "chr" from the gtf file. I will use this gtf file for stringtie and other steps. Is this ok? Or should I build own genome index?

ADD REPLYlink written 18 months ago by Biologist150
1

I was only referring to the counting step.

ADD REPLYlink written 18 months ago by genomax70k

small question. Using "htseq-count" command for each sample it is taking 3 hours runtime to extract counts. Why is taking so much time? Any idea?

ADD REPLYlink written 18 months ago by Biologist150
1

htseq-count is comparatively slow. You should use featureCounts (http://bioinf.wehi.edu.au/featureCounts/ ). You will find it much faster.

ADD REPLYlink written 18 months ago by genomax70k

Sure Thankyou. When I used htseq-count First I see Ensembl Ids with number of read counts and in that last I see some assignments like following. Do I need to bother anything about this for differential analysis steps?

Assigned: 99428670 (40.6%)
Ambiguous: 8859783 (3.6%)
Alignment not unique: 91979962 (37.6%)
No Feature: 40066091 (16.4%)
Too Low aQual: 1986678 (0.8%)
Not Aligned: 2543185 (1.0%)
ADD REPLYlink modified 18 months ago • written 18 months ago by Biologist150

Hi, Could you please tell me about my previous comment. Thank you !!

ADD REPLYlink written 18 months ago by Biologist150
1

Not much to say. Looks like you have multimappers (I assume htseq has not counted those) that are about the same size as the assigned read (do you have rRNA in your data?).

Another plug for using featureCounts is you don't need to sort your BAM files and you can feed all of them in at one time (in the order you want them in matrix). You will then get a ready to use matrix of counts for all of your data.

ADD REPLYlink modified 18 months ago • written 18 months ago by genomax70k

No I don't have rRNA data. Lets say After alignment I have bam files and for Stringtie I use sorted.bam file. So, for featureCounts should I use sorted or unsorted bam?

ADD REPLYlink written 18 months ago by Biologist150
1

You can use either. If your files are sorted then there is an option to let featureCounts know that they are.

ADD REPLYlink modified 18 months ago • written 18 months ago by genomax70k

Thankyou very much for the information !!

ADD REPLYlink written 18 months ago by Biologist150

-s: are your data "reverse"? What happens when you use "no", or "yes"?

-t: maybe you should provide the right feature type?

-r: is the order by name or by position?

ADD REPLYlink modified 18 months ago • written 18 months ago by Macspider2.9k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 855 users visited in the last hour