I am doing an mRNA analysis (mRNA from mice) and want to use GENE-counter to get raw counts. The manual says that I need a reference genome and an annotation file in GFF3 format, which I will use to create a reference database in MySQL. I have tried the reference and GFF files from both RefSeq (NCBI) and Ensembl, and neither works. I get an error telling me that there are duplicates in the GFF file and hence it is not in GFF3 format (though according to the files themselves, it is). In the README that comes with the GFF files from Ensembl I read: "Some validators may warn about duplicated identifiers for CDS features. This is to allow split features to be grouped." OK, so this is a known phenomenon, but what can I do about it? Any advice? Sorting the file and running the uniq command did not help.
Update: I am pretty sure that the problem is that some of the CDS feature entries share the same attribute value. Does anyone know a way to make them unique, or can I just remove the attributes from that feature? Will that affect the counts I get?
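For what it's worth, sort plus uniq only drops lines that are identical in full, and the split CDS lines differ in their coordinates, so that approach would not be expected to change anything. A minimal sketch for listing which identifiers repeat, assuming the duplicates sit in the `ID=` attribute of column 9 (the function name and the example records are made up for illustration, not taken from the Ensembl file):

```python
from collections import Counter

def duplicated_ids(gff_lines):
    """Return {(feature_type, ID): count} for every ID seen more than once."""
    counts = Counter()
    for line in gff_lines:
        # Skip header/comment lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:  # not a regular 9-column GFF3 record
            continue
        for field in cols[8].split(";"):
            if field.startswith("ID="):
                counts[(cols[2], field[3:])] += 1
    return {key: n for key, n in counts.items() if n > 1}

# Hypothetical records: one gene and one CDS split over two lines.
example = [
    "##gff-version 3",
    "1\tdemo\tgene\t100\t400\t.\t+\t.\tID=gene:EXAMPLE_G1",
    "1\tdemo\tCDS\t100\t200\t.\t+\t0\tID=CDS:EXAMPLE1;Parent=transcript:EXAMPLE_T1",
    "1\tdemo\tCDS\t300\t400\t.\t+\t2\tID=CDS:EXAMPLE1;Parent=transcript:EXAMPLE_T1",
]
dups = duplicated_ids(example)
```

On a real file the same function can be fed an open file handle, e.g. `duplicated_ids(open("Mus_musculus.GRCm38.99.gff3"))`.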
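One possible workaround, sketched under assumptions: keep the first occurrence of each CDS `ID=` value and append a running suffix (`.2`, `.3`, ...) to later ones. The function name, the suffix scheme, and the example records are hypothetical. I have not verified that GENE-counter accepts the result or how it affects the counts. Note also that in GFF3 a shared ID is what groups the lines of a split feature, so renaming discards that grouping, and the sketch does not check whether a suffixed ID collides with one already in the file.

```python
from collections import defaultdict

def make_ids_unique(gff_lines, feature_type="CDS"):
    """Return GFF3 lines where repeated ID= values of one feature type get a suffix."""
    seen = defaultdict(int)
    out = []
    for line in gff_lines:
        text = line.rstrip("\n")
        cols = text.split("\t")
        # Pass through headers, malformed lines, and other feature types unchanged.
        if text.startswith("#") or len(cols) != 9 or cols[2] != feature_type:
            out.append(text)
            continue
        attrs = cols[8].split(";")
        for i, field in enumerate(attrs):
            if field.startswith("ID="):
                seen[field] += 1
                if seen[field] > 1:
                    # Second and later occurrences: ID=x becomes ID=x.2, ID=x.3, ...
                    attrs[i] = f"{field}.{seen[field]}"
        cols[8] = ";".join(attrs)
        out.append("\t".join(cols))
    return out

# Hypothetical records: one CDS split over two lines with the same ID.
example = [
    "##gff-version 3",
    "1\tdemo\tCDS\t100\t200\t.\t+\t0\tID=CDS:EXAMPLE1;Parent=transcript:EXAMPLE_T1",
    "1\tdemo\tCDS\t300\t400\t.\t+\t2\tID=CDS:EXAMPLE1;Parent=transcript:EXAMPLE_T1",
]
fixed = make_ids_unique(example)
```

The `Parent=` attribute is left untouched, so each CDS line still points at its transcript. Removing the attributes altogether would drop that link as well, which seems riskier if the tool uses it to assign reads to genes.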
Which GFF file are you using from Ensembl? If you use the chr_patch_hapl_scaff file, it will contain duplicated gene names because of the genes on the patches and haplotypes. Could this be the problem?
I am using the one called just Mus_musculus.GRCm38.99.gff3 and the primary assembly as reference genome. So I guess that is not the problem...
Try Mus_musculus.GRCm38.99.chr.gff3.gz
You mean just the same file but zipped? Does not work... Thanks anyway!
Sorry, I put in the wrong thing. Edited my comment now. I meant the one with chr in the name, that should be just the primary assembly.
No worries! No difference, I'm afraid. It gives me the same error. The weird thing is that they warn about this in the README file, but no solution is given...