Question

HTSeq-count no feature is too high

0

Entering edit mode

8.4 years ago

sandeepmarla ▴ 10

Hello,

I'm using HTSeq-count to determine the number of transcripts for each sample. I have bam files,samtools flagstat analysis showed ~80% of the transcripts are aligned to the reference genome. Then I sorted those bam files by name and position, used the exon annotation gff3 file from sorghum phytozome v3.1 to determine the transcript counts using HTSeq-count. After analyzing my data, I realized no_feature was too high, I have a total of ~30 -36 million paired end reads and 15 million shows no_feature.

code:

htseq-count --stranded=no --type=exon --idattr=Parent --order=pos --format=bam --minaqual=30 --mode=union $bam $gff3 >$out

__no_feature    15928275
__ambiguous    3020434
__too_low_aQual    768292
__not_aligned    5910873
__alignment_not_unique    1756250

Further, I looked at the data and found that all the featured were on 6 chromosomes and 4 (chr3,4,6 and 7) were completely absent in the HTSeq counts data. I looked at the gff3 file and found that exon data was present for all the chromosomes.

Can any please help me in fixing this issue?

Thanks in advance,
Sandeep

htseq • 5.2k views

ADD COMMENT • link updated 21 months ago by Ram 43k • written 8.4 years ago by sandeepmarla ▴ 10

0

Entering edit mode

That's certainly the weirdest htseq-count issue I've heard of. Perhaps featureCounts will produce better results? It's much faster anyway.

ADD REPLY • link 8.4 years ago by Devon Ryan 104k

0

Entering edit mode

Are the chromosome names exactly the same between the gff file and the bam file ? It's the only reason I can think of to explain this issue...

ADD REPLY • link 8.4 years ago by Carlo Yague 8.7k

Ram · Answer 1 · 2015-12-11

Thanks for your responses. The problem is with the chromosome assembly in the reference genome sequence file, the newest version of the genome missed the chromosomes labeling. I'm re-doing the complete analysis with an updated assembly reference genome.

Lessons learnt from this experience, we need to make sure the analysis is correct at each step:

Check assembled genome if chromosome annotation is missing.
If sam and bam files generated contained all the chromosomes in the analysis.
Compare the gff3/gtf annotation file to validate if assembled chromosome labeling matches the annotation file.
If HTSeq-counts are too high, read the HTSeq manual to ascertain the used feature, mode and idattr match the data utilized for the analysis.