I am analysing some E. coli sRNA-seq data for a colleague. However, the featureCounts output gives 0 counts for every gene, even though 85-95% of the reads aligned. When I BLAST some of the aligned reads, I only get E. coli genomic regions. When I looked at the GTF file I used to count features, it seemed to contain only 18 non-coding RNA transcripts, so perhaps it is not surprising that I got nothing. I am not sure where to go from here. RNAcentral was suggested to me; however, coordinate files are not available there for E. coli. I have never encountered an issue like this before, so I really don't know what to do.
From the literature one could easily gather more than 18 potential sRNA loci. Have you considered assembling a larger query set, or using another method to determine what your reads actually overlap with? When you say 0 counts for every gene, do you mean that you have a full annotation of all genes of all types in E. coli, and none of those features collected a single read? The GenomicRanges library in R is a powerful and fairly easy way to investigate what overlaps with what in terms of reads and genomic features, and it lets you quantify overlaps based on various parameters (strand, degree of overlap, etc.). You can do similar things with command-line tools, but it gets a bit cumbersome.
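To make the "what overlaps with what" idea concrete, here is a minimal sketch in plain Python rather than GenomicRanges. The feature names, coordinates and reads are made-up toy values, and `count_overlaps` is a hypothetical helper, not part of any library. It only illustrates how strand and minimum overlap change the counts. Note that the sequence name is compared exactly, so a feature on a differently named sequence would never collect a read.

```python
from collections import Counter

# Toy data (hypothetical): (seqname, start, end, strand), 1-based inclusive.
features = {
    "sRNA_A": ("chr", 100, 180, "+"),
    "sRNA_B": ("chr", 150, 220, "-"),
}
reads = [
    ("chr", 110, 140, "+"),
    ("chr", 160, 190, "-"),
    ("chr", 160, 190, "+"),
    ("chr", 500, 530, "+"),
]


def overlap_length(a_start, a_end, b_start, b_end):
    """Bases shared by two 1-based inclusive intervals (0 if disjoint)."""
    return max(0, min(a_end, b_end) - max(a_start, b_start) + 1)


def count_overlaps(reads, features, stranded=True, min_overlap=1):
    """Count reads per feature; also return how many reads hit nothing."""
    counts = Counter({name: 0 for name in features})
    unassigned = 0
    for seqname, start, end, strand in reads:
        hits = [
            name
            for name, (f_seq, f_start, f_end, f_strand) in features.items()
            if f_seq == seqname  # exact match on the sequence name
            and (not stranded or f_strand == strand)
            and overlap_length(start, end, f_start, f_end) >= min_overlap
        ]
        if not hits:
            unassigned += 1
        for name in hits:  # a read overlapping two features counts for both
            counts[name] += 1
    return dict(counts), unassigned


stranded_counts, stranded_missed = count_overlaps(reads, features, stranded=True)
unstranded_counts, unstranded_missed = count_overlaps(reads, features, stranded=False)
print(stranded_counts, stranded_missed)
print(unstranded_counts, unstranded_missed)
```

Comparing the stranded and unstranded results on a handful of real reads is a quick way to see whether strand settings alone explain missing counts.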
Yep! None of the features collected a single read. I did find one or two articles where the authors discovered novel E. coli sRNAs. However, I'm not sure how to turn that into a GTF. I did look at another GTF file and tried to build a similar structure, but it didn't work. Moreover, considering how little studied this class of RNA is in E. coli, wouldn't I risk losing a lot of potentially novel transcripts?
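On turning a list of published loci into a GTF: the format is nine tab-separated columns (seqname, source, feature, start, end, score, strand, frame, attributes), with 1-based inclusive coordinates. Below is a minimal sketch; the loci, the sequence name "chr" and the helper `gtf_line` are hypothetical placeholders. It writes the feature type as "exon" with a `gene_id` attribute because, as far as I know, those are what featureCounts looks for by default; the sequence name would also have to match the one in the alignment file exactly.

```python
def gtf_line(seqname, start, end, strand, gene_id,
             source="literature", feature="exon"):
    """Format one locus as a 9-column GTF line (1-based inclusive coordinates)."""
    if start > end:
        raise ValueError(f"start {start} is greater than end {end}")
    if strand not in ("+", "-", "."):
        raise ValueError(f"unexpected strand: {strand!r}")
    # Attributes are 'key "value";' pairs separated by single spaces.
    attributes = f'gene_id "{gene_id}"; transcript_id "{gene_id}";'
    fields = [seqname, source, feature, str(start), str(end),
              ".", strand, ".", attributes]
    return "\t".join(fields)  # columns must be separated by tabs, not spaces


# Hypothetical loci, e.g. copied from a paper's supplementary table.
loci = [
    ("chr", 1200, 1290, "+", "sRNA_candidate_1"),
    ("chr", 5400, 5475, "-", "sRNA_candidate_2"),
]

gtf_text = "\n".join(gtf_line(*locus) for locus in loci) + "\n"
print(gtf_text, end="")
```

A hand-built GTF often fails because of spaces instead of tabs, a missing `gene_id`, or a sequence name that differs from the reference, so those are worth checking first.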
I considered simply counting all of the reads, even if they are not annotated anywhere. If I then find any sRNA differentially expressed between the groups, I could look into what it is afterwards. Is this doable?
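One way to picture annotation-free counting is to bin aligned reads into fixed-width, strand-specific windows and use the windows as features. This is only a sketch under stated assumptions: the reads are toy values, the 50 bp window size is arbitrary, and `window_counts` is a hypothetical helper. A real analysis might instead merge overlapping reads into clusters.

```python
from collections import Counter


def window_counts(reads, window=50):
    """Count reads per strand-specific window, assigning each read by its midpoint.

    Reads are (seqname, start, end, strand) with 1-based inclusive coordinates.
    Windows are [1, window], [window + 1, 2 * window], and so on.
    """
    counts = Counter()
    for seqname, start, end, strand in reads:
        midpoint = (start + end) // 2
        window_start = (midpoint - 1) // window * window + 1
        key = (seqname, window_start, window_start + window - 1, strand)
        counts[key] += 1
    return counts


# Toy reads (hypothetical coordinates).
reads = [
    ("chr", 110, 140, "+"),
    ("chr", 120, 150, "+"),
    ("chr", 160, 190, "-"),
    ("chr", 500, 530, "+"),
]

counts = window_counts(reads, window=50)
for (seqname, w_start, w_end, strand), n in sorted(counts.items()):
    print(seqname, w_start, w_end, strand, n, sep="\t")
```

Running this per sample gives a window-by-sample count table, and windows that differ between groups can then be looked up against the genome afterwards.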