Question: High number of no feature while using htseq-count on Tophat2 aligned as well as STAR aligned paired-end data
0
gravatar for fibi_prog_omics
2.9 years ago by
fibi_prog_omics0 wrote:

Dear BioStars community,

I have paired-end (150bp) data for the chicken Ileum from Illumina TruSeq. I was told that it is first stranded. I removed the adapters through Trimmomatic tool. I then aligned it to the Chicken Genome (Ensembl) galGal4 using STAR aligner as well as Tophat2.

I wanted to quantify the reads using htseq-count, so I sorted the bam file by name with samtools sort.

I used the following parameters for htseq-count :

htseq-count -f bam -r name -s reverse -t exon -i gene_id -m union Sample15_STAR.sorted.bam Gallus_gallus.Gallus_gallus-5.0.86.gtf > Sample15_STAR_HTSeq_samout

At the end approx. 33600000 SAM alignment pairs were processed. The problem is that I get therefore a very high number of no feature.

__no_feature 13631575

__ambiguous 83338

__too_low_aQual 0

__not_aligned 0

__alignment_not_unique 4679523

Can anybody tell me, what is the problem?

Thanks a lot in advance! fibi_prog_omics

rna-seq • 1.2k views
ADD COMMENTlink modified 2.9 years ago by michael.ante3.5k • written 2.9 years ago by fibi_prog_omics0
0
gravatar for Carlo Yague
2.9 years ago by
Carlo Yague4.8k
Canada
Carlo Yague4.8k wrote:

You can try to run the same code without the -s option. If the number of no_feature reads drops dramatically, then your library is probably not first-stranded.

If that number of no_feature reads is not much affected by changing the -s option, then perhaps its the .gtf file that is the cause of the issue. The file could be broken or simply, the annotation of the chicken could be not so good. If a gene is not in the annotation, then HTseq-count can not assign reads to it.

ADD COMMENTlink written 2.9 years ago by Carlo Yague4.8k
0
gravatar for michael.ante
2.9 years ago by
michael.ante3.5k
Austria/Vienna
michael.ante3.5k wrote:

Hi fibi_prog_omics,

You can use RSEQC'S read_distribution.py to compute where your reads are mapping. If you observe for instance many reads falling closely downstream to the 3'UTR, you'll might extend your annotation. If you see many intronic reads, you'll might check for genomic DNA contamination.

In case of intronic reads, you can re-run the Htseq-count with -t gene. You'll get a bit more ambiguous reads, but you can inspect the genes exhibiting the strongest difference between the two runs visually.

Cheers,

Michael

ADD COMMENTlink written 2.9 years ago by michael.ante3.5k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 922 users visited in the last hour