Question

Htseq Count

0

Entering edit mode

12.0 years ago

Varun Gupta ★ 1.3k

Hello Everyone

I am working on RNA seq data. I have some questions about htseq-count tool for counting the reads.

QUESTION1. The result of htseq-count produces gene name and its count. Now htseqcount says that

In the case of RNA-Seq, the features are typically genes, where each gene is considered here as the union of all its exons.

Now i already had bam alignment file and used htseq count on it(ofcourse converting to sam first and providing hg19 gtf file). It gave me ensembl name and read count.

My question is suppose for a particular ensembl name read count is 13450. If i want to see these 13450 reads in sam/bam file, how can i do that???

Question2. i created my own cdna genome taking couple of genes and mapped using gsnap aligner(using a fastq file as input). The bam file which was already present for the whole genome for a particular fastq file (same as above fastq file ofcourse) gives very low count as compared to when i use my own cdna genome.

What could be the reason for this.

Hope to hear from you guys soon

Regards V

rna seq htseq rna-seq • 11k views

ADD COMMENT • link updated 11.7 years ago by Ying W ★ 4.2k • written 12.0 years ago by Varun Gupta ★ 1.3k

score 0 · Answer 1 · 2012-04-26

0

Entering edit mode

12.0 years ago

Ying W ★ 4.2k

Question1: I would load your bam file into a viewer such as IGV and visualise it that way. If you want to get the actual reads out of the bam file you can create a bed file of the exons for your gene of interest and use samtools view -l to output only the region specified by the bed file.

Question2: I am not sure what you mean by this, did you take the sequence around your genes of interest and used that subset to realigned with gsnap? If you did this and you are wondering why aligning to the subset of the genome gives higher coverage than aligning to whole genome then I believe the answer lies to how well your reads align to the genome. When you are only using a subset of the genome, you might get reads that align to the subset that would actually align better somewhere else in the whole genome so when you align against the whole genome, the read would show as being mapped to another location in the genome.

As a sidenote, I found out recently how important choosing the right gtf file is, if your gtf file has multiple isoforms of the same gene, when HTseq (by default) combines all exons based on a gene symbol, you will be summarising gene expression over all the isoforms.

ADD COMMENT • link 12.0 years ago by Ying W ★ 4.2k

0

Entering edit mode

hi ying About your sidenote on HTseq : So if the gtf file has multiple isoforms of same gene, and as you say it will be summarising gene expression over all the isoforms, will it mean that it would give a low read count for that particular gene.

ADD REPLY • link 12.0 years ago by Varun Gupta ★ 1.3k

0

Entering edit mode

I ment to say 'summing' instead of summarizing. So what HTseq-counts will do is take all of your features (default exons) and count then up for every id/attribute (default gene id), by changing the gene_id to something like transcript_id instead you would get the counts separate for isoforms. If you think isoforms are important you might want to change that option. Your summed gene ids wont be lower but might be less biologically meaningful, it depends on what you are interested in. For a list of options go here: http://www-huber.embl.de/users/anders/HTSeq/doc/count.html#count

some genes to check/consider (in humans):

IL9R (on both chrX and chrY)
TTL (has short and superlong isoform )
DUX2 (chr4 and chr10, gene in repeat element, ncbi only shows chr4 but my gtf file has both, ucsc also shows both)

ADD REPLY • link 12.0 years ago by Ying W ★ 4.2k