HTSeq-count no feature is too high
1
0
Entering edit mode
8.4 years ago
sandeepmarla ▴ 10

Hello,

I'm using HTSeq-count to determine the number of transcripts for each sample. I have bam files,samtools flagstat analysis showed ~80% of the transcripts are aligned to the reference genome. Then I sorted those bam files by name and position, used the exon annotation gff3 file from sorghum phytozome v3.1 to determine the transcript counts using HTSeq-count. After analyzing my data, I realized no_feature was too high, I have a total of ~30 -36 million paired end reads and 15 million shows no_feature.

code:

htseq-count --stranded=no --type=exon --idattr=Parent --order=pos --format=bam --minaqual=30 --mode=union $bam $gff3 >$out
__no_feature    15928275
__ambiguous    3020434
__too_low_aQual    768292
__not_aligned    5910873
__alignment_not_unique    1756250

Further, I looked at the data and found that all the featured were on 6 chromosomes and 4 (chr3,4,6 and 7) were completely absent in the HTSeq counts data. I looked at the gff3 file and found that exon data was present for all the chromosomes.

Can any please help me in fixing this issue?

Thanks in advance,
Sandeep

htseq • 5.2k views
ADD COMMENT
0
Entering edit mode

That's certainly the weirdest htseq-count issue I've heard of. Perhaps featureCounts will produce better results? It's much faster anyway.

ADD REPLY
0
Entering edit mode

Are the chromosome names exactly the same between the gff file and the bam file ? It's the only reason I can think of to explain this issue...

ADD REPLY
1
Entering edit mode
8.4 years ago
sandeepmarla ▴ 10

Thanks for your responses. The problem is with the chromosome assembly in the reference genome sequence file, the newest version of the genome missed the chromosomes labeling. I'm re-doing the complete analysis with an updated assembly reference genome.

Lessons learnt from this experience, we need to make sure the analysis is correct at each step:

  1. Check assembled genome if chromosome annotation is missing.
  2. If sam and bam files generated contained all the chromosomes in the analysis.
  3. Compare the gff3/gtf annotation file to validate if assembled chromosome labeling matches the annotation file.
  4. If HTSeq-counts are too high, read the HTSeq manual to ascertain the used feature, mode and idattr match the data utilized for the analysis.
ADD COMMENT

Login before adding your answer.

Traffic: 2768 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6