Question: Counting number of genes using htseq-count tool.
0
gravatar for nalandaatmi
3.8 years ago by
nalandaatmi60
United States
nalandaatmi60 wrote:

I am interested in finding number of reads associated to each human genes based on the gtf file. I used the below code for running htseq-count.

htseq-count -f bam -s yes -i gene_id accepted_hits.bam genes.gtf >> SampleDH78-4_Genes_count.txt

100000 GFF lines processed.
200000 GFF lines processed.

….

869204 GFF lines processed.
Warning: Read HISEQ:137:C6W39ACXX:5:1216:16699:92182 claims to have an aligned mate which could not be found in an adjacent line.
100000 SAM alignment record pairs processed.

….

124300000 SAM alignment record pairs processed.
124400000 SAM alignment record pairs processed.
Warning: 120128572 reads with missing mate encountered.
124460592 SAM alignment pairs processed.

Sample output from htseq-count 

HBA1    6323
HBA2    68466
HBB     16677
HBBP1   1
HBD     2
HBE1    0
HBEGF   0
HBG1    840
HBG2    379
HBM     0
HBP1    119
HBQ1    0

Warning: 120128572 reads with missing mate encountered from htseq-count.

% of reads associated to globin=sum of all the above reads for globin genes divided by the total number of reads *100.

or 

% of reads associated to globin=sum of all the above reads for globin genes / 124460592 *100.

Only half of the reads pairs were processed. 

 

rna-seq rna htseq count • 6.8k views
ADD COMMENTlink modified 3.4 years ago by Biostar ♦♦ 20 • written 3.8 years ago by nalandaatmi60
0
gravatar for tiago211287
3.8 years ago by
tiago2112871.1k
USA
tiago2112871.1k wrote:

The default of HTSeq-count is for a stranded specific library, if you run the default with a unstranded library you will lose 50% of your Data. Check this and then, try again.

As in the manual:

Important: The default for strandedness is yes. If your RNA-Seq data has not been made with a strand-specific protocol, this causes half of the reads to be lost. Hence, make sure to set the option --stranded=no unless you have strand-specific data!

Also, if your data is stranded yes, check this post  that can maybe explain your warning.

 

ADD COMMENTlink modified 3.8 years ago • written 3.8 years ago by tiago2112871.1k

Thanks Tiago. I will check about the strand specific protocol with the incharge person. Will I be able to get strand details (sense and antisense) from htseq-count. In my above command, I selected for strand as yes.

Bam files are generated from Tophat. does it have to do anything with these warnings?

ADD REPLYlink modified 3.8 years ago • written 3.8 years ago by nalandaatmi60

After the alignment, did you process the bam files in any way? Merge | Sort? 

ADD REPLYlink written 3.8 years ago by tiago2112871.1k

Bam files generated by tophat are untouched. Just I provided the bam files as input to htseq.

ADD REPLYlink written 3.8 years ago by nalandaatmi60

If you used paired end data , bam file should be sorted by name. 

samtools sort -n  in.bam out.srt

 

ADD REPLYlink modified 3.8 years ago • written 3.8 years ago by leipinji10

Thanks Leipinji. Yeah I am using paired end data. As you mentioned, I am sorting my accepted_hits.bam file from tophat. After sorting, do you want me to just check HISEQ:137:C6W39ACXX:5:1216:16699:92182 in sorted file?

I have posted another question related to How to interpret the difference among these three options in strandedness from HTSeq-count

ADD REPLYlink modified 3.8 years ago • written 3.8 years ago by nalandaatmi60

yes, samtoool sort by name will sort paired end reads forward and reverse together. 

ADD REPLYlink written 3.8 years ago by leipinji10

Hello Leipinji, then do you know how to reunite .forward.bam and .reverse.bam, previously mapped, together to process to counting ?

Many thanks

ADD REPLYlink written 22 months ago by lucilepain0

Hi Leipinji, I came across this post while trying to figure out the problem for myself. Could you please explain why do we need to provide a bam file which is sorted by name in HTseq? Can we provide a coordinate-sorted file and then introduce the argument "-r name" in the HTseq command instead? Is that logically okay? If possible, could you help me resolve an error I am encountering in the htseq stage. I have posted up a question. Thank you so much.

ADD REPLYlink written 4 months ago by bnayer260

Its suggested to use uniquely mapped reads for downstream analysis.
 

ADD REPLYlink written 3.8 years ago by geek_y9.8k

Thanks Goutham. I have posted another question related to How to interpret the difference among these three options in strandedness from HTSeq-count

ADD REPLYlink modified 3.8 years ago • written 3.8 years ago by nalandaatmi60

I think, HTS-seq only counts uniq by default, or not?

ADD REPLYlink written 3.8 years ago by tiago2112871.1k
From manual:
skip all reads with alignment quality lower than the given minimum value (default: 10 — Note: the default used to be 0 until version 0.5.4.)
ADD REPLYlink modified 3.8 years ago • written 3.8 years ago by geek_y9.8k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1657 users visited in the last hour