I am running a test run on HT-seq counts on a data set previously used by a member of my lab, using the galaxy tool. I am getting an unusually high amount of No Feature results. In the raw data, there are approximately 26 million forward reads and 26 million reverse. The model is Zebrafish, reference genome GRCz10.
Mode = Union, Stranded = Yes (The experiment was strand specific) , Minimum Alignment quality = 10, Feature type = Exon, ID Attribute = gene_id, Additional Bam = No, Force Sorting of BAM by name = Yes
The results of HT-seq Counts are:
no_feature 22001714, ambiguous 13105, too_low_aQual 0, not_aligned 0, alignment_not_unique 19024306,
Can anyone explain why the majority of the reads are reading as No Feature? Is this normal and can be explained by the use of the zebrafish or is something going wrong?
There are reads that are assigned to genes in the output, so I don't think there is a problem with chromosome names. Also I have not tried the reverse parameter for strandedness, I will give that a try and see what happens. Thank you.
So I tried using reverse and here are my results:
This seems to be better than before? Also there are reads that are being assigned to genes (many more actually).
It looks better but the use of
yesshould not depend on better results. You should ask the people who did the sequencing and use appropriately.
If you have a stranded protocol, it will make a world of difference. That's what stranded protocols are for, to separate the sense "signal" from antisense "noise". Changing the stranded value in htseq-count will cause it to quantitate signal, or quantitate noise.
Okay, Thanks for all the help. I will contact the person.
So I found out more information about the library. It is a pair end, strand specific assay using NEBNext First Strand Synthesis Module and NEBNext Second Strand Synthesis Module (with dUTP). My biggest question is now, in HT-seq counts do I use the stranded= Yes or Stranded = Reverse option.
If you are using the Illumina TruSeq stranded protocol, or any other first-strand-based protocol, then your alignments will always be annotated to the wrong strand. Unless you aligned the reads with Tophat using "--library-type fr-firststrand", which corrects this. Otherwise, you could edit the strand flag in the BAM, which is the best bet.
If your BAM is not corrected, then all strandwise operations will have to be reversed. HTseq-count will need "-s reverse" to get sense counts, bedtools will require "-S" instead of "-s", etc.
Thank you!! This is what i was looking for. I am rerunning tophat with fr- first strand and then using HT-seq counts and stranded=yes.
Ah hold on, I was only partially right. Even with "--library-type fr-firststrand" I think the strand will still be reversed in the bit flags. Tophat will just add the "XS" tag for strand -- which will contradict the bit flags -- but only Cufflinks reads this tag. Other software will still use the bit flags, and still get things backwards.
So you could edit the flags, or, just remember to run htseq / bedtools / etc with the "wrong" strand argument.
thanks for the update. I ended up running Ht-seq count as reverse.
Accept the answer if it helped. So that it will be a "closed" thread.