Hello,
Background:
I am analyzing data for an RNA-seq experiment made on a bacterial strain. I have a previously assembled genome that was annotated with PGAP. I processed my raw RNA-seq reads, aligned them to the genome with STAR, and quantified the alignments with HTseq. I fed the HTseq read counts to DESeq2 and now I have a table with the results, though I end up with a gene_id in the first column of the table and DESeq2 results.
Problem: The gene_id is not very informative. I want the gene_id, the CDS start and end position (to control for gene length bias), and also the transcript product (protein) or function. I have this information in a GTF annotation file of course file but I can't find any sort of tools or code to add this information to my DESeq2 results.
I thought it would be a simple task, but either it is not, or I am not understanding something properly. I have seen some tutorials using Biomart or other tools like that, but it seems like it's just for eukaryotes and it's using online databases. I guess I don't understand why one would not simply use the GTF annotation file?
My plan going forward is to annotate with GO terms or KEGG to do an enrichment analysis. The step I am stuck on seems easy but for some reason, I can't find a solution.
Any help is welcome, Thank you.
This information should be in the GTF/GFF file that you used. Can you show us a few lines of that file?
Yes, it is in that file. I just did not know how to join it with DESeq2 results. I will try the solution from post below and see if it works.
Here's a line:
I don't know how you did the read counting but
featureCounts
can assign the gene name using the ID you specify. You should have usedName
as the id.I used HTseq which is integrated into STAR. Perhaps I could try featureCounts yes.
With
htseq-count
you could have used-i Name
or--additional-attr
to add a column with names.