Question

High number of no feature while using htseq-count on Tophat2 aligned as well as STAR aligned paired-end data

0

Entering edit mode

7.4 years ago

fibi_prog_omics • 0

Dear BioStars community,

I have paired-end (150bp) data for the chicken Ileum from Illumina TruSeq. I was told that it is first stranded. I removed the adapters through Trimmomatic tool. I then aligned it to the Chicken Genome (Ensembl) galGal4 using STAR aligner as well as Tophat2.

I wanted to quantify the reads using htseq-count, so I sorted the bam file by name with samtools sort.

I used the following parameters for htseq-count :

htseq-count -f bam -r name -s reverse -t exon -i gene_id -m union Sample15_STAR.sorted.bam Gallus_gallus.Gallus_gallus-5.0.86.gtf > Sample15_STAR_HTSeq_samout

At the end approx. 33600000 SAM alignment pairs were processed. The problem is that I get therefore a very high number of no feature.

__no_feature 13631575

__ambiguous 83338

__too_low_aQual 0

__not_aligned 0

__alignment_not_unique 4679523

Can anybody tell me, what is the problem?

Thanks a lot in advance! fibi_prog_omics

RNA-Seq • 2.4k views

ADD COMMENT • link updated 7.4 years ago by michael.ante ★ 3.8k • written 7.4 years ago by fibi_prog_omics • 0

score 0 · Answer 1 · 2016-12-08

You can try to run the same code without the -s option. If the number of no_feature reads drops dramatically, then your library is probably not first-stranded.

If that number of no_feature reads is not much affected by changing the -s option, then perhaps its the .gtf file that is the cause of the issue. The file could be broken or simply, the annotation of the chicken could be not so good. If a gene is not in the annotation, then HTseq-count can not assign reads to it.

score 0 · Answer 2 · 2016-12-09

Hi fibi_prog_omics,

You can use RSEQC'S read_distribution.py to compute where your reads are mapping. If you observe for instance many reads falling closely downstream to the 3'UTR, you'll might extend your annotation. If you see many intronic reads, you'll might check for genomic DNA contamination.

In case of intronic reads, you can re-run the Htseq-count with -t gene. You'll get a bit more ambiguous reads, but you can inspect the genes exhibiting the strongest difference between the two runs visually.

Cheers,

Michael