Question: Make a GFF-file compatible for reference database
6 days ago, viktoria.c.karlsson wrote:

I am doing an mRNA analysis (mouse mRNA) and want to use GENE-counter to get raw counts. The manual says that I need a reference genome and an annotation file in GFF3 format, which I will use to create a reference database in MySQL. I have tried the reference and GFF files from both RefSeq (NCBI) and Ensembl, and neither works. I get an error telling me that there are duplicates in the GFF file and hence that it is not in GFF3 format (though the files themselves say it is). In the README that comes with the Ensembl GFF files I read: "Some validators may warn about duplicated identifiers for CDS features. This is to allow split features to be grouped." OK, so this is a known phenomenon, but what can I do about it? Any advice? Sorting the file and running the uniq command did not help.

Update: I am pretty sure the problem is that some of the CDS feature entries have the same attribute value. Does anyone know a way to make them unique, or can I just remove the attributes from that feature? Will that impact the counts I get?

modified 4 days ago • written 6 days ago by viktoria.c.karlsson
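
For anyone hitting the same error: sort plus uniq only removes lines that are identical as a whole, whereas the CDS segments of one coding sequence share an ID but differ in coordinates, so nothing gets dropped. A minimal Python sketch for checking which IDs are actually repeated, and on which feature types, could look like this. The function name duplicated_ids is made up for illustration, and the code only assumes the standard nine tab-separated GFF3 columns; it does not know what GENE-counter itself checks.

```python
import re
from collections import Counter


def duplicated_ids(lines):
    """Return {(feature_type, ID): count} for IDs that occur on more than one line.

    Hypothetical helper for illustration; assumes standard 9-column GFF3 lines.
    """
    counts = Counter()
    for line in lines:
        # Skip directives, comments and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue
        # Column 9 holds the attributes as ';'-separated key=value pairs.
        match = re.search(r"(?:^|;)ID=([^;]+)", cols[8])
        if match:
            counts[(cols[2], match.group(1))] += 1
    return {key: n for key, n in counts.items() if n > 1}
```

Running it on the unzipped annotation, for example duplicated_ids(open("Mus_musculus.GRCm38.99.chr.gff3")), should show whether the repeats are confined to CDS features, as the Ensembl README suggests.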

Which GFF file are you using from Ensembl? If you use the chr_patch_hapl_scaff file this will contain duplicated gene names due to the genes on the patches and haplotypes. Could this be the problem?

written 6 days ago by Emily_Ensembl

I am using the one called just Mus_musculus.GRCm38.99.gff3 and the primary assembly as reference genome. So I guess that is not the problem...

written 6 days ago by viktoria.c.karlsson

Try Mus_musculus.GRCm38.99.chr.gff3.gz

modified 6 days ago • written 6 days ago by Emily_Ensembl

You mean just the same file but zipped? Does not work... Thanks anyway!

written 6 days ago by viktoria.c.karlsson

Sorry, I put in the wrong thing. Edited my comment now. I meant the one with chr in the name, that should be just the primary assembly.

written 6 days ago by Emily_Ensembl

No worries! No difference, I'm afraid. It gives me the same error. The weird thing is that they warn about this in the README file, but no solution is given...

written 5 days ago by viktoria.c.karlsson
6 days ago, Juke-34 (Sweden) wrote:

You could try agat_sp_gxf_to_gff3.pl from AGAT; it should be able to fix your GFF file.

written 6 days ago by Juke-34

I really thought this would solve it, but I have now tried both the script you recommended and the one for removing redundant entries, and MySQL still does not accept the file. When I run the redundancy script, it says that there are no redundancies. I do not get this at all...

written 4 days ago by viktoria.c.karlsson

Still the same warning from GENE-counter?

written 4 days ago by Juke-34

Yep. I think the problem is that some of the CDS features have the same attribute value. For this script, that does not count as redundancy. However, I cannot find a way to make them unique.

written 4 days ago by viktoria.c.karlsson
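
One way to make the repeated values unique, sketched here under stated assumptions, is to append a running number to each ID on the chosen feature type. The function name uniquify_ids and the ".1", ".2" suffix scheme are invented for this example. It assumes that no other feature lists a CDS as its Parent (otherwise those references would need updating too) and that the suffixed IDs do not collide with IDs already in the file. The Parent attribute is left alone, so each segment still points at its transcript, but the segments are no longer grouped as one split feature. Whether GENE-counter then accepts the file, and whether the counts are affected, is not something this thread establishes.

```python
from collections import Counter


def uniquify_ids(lines, feature_type="CDS"):
    """Append .1, .2, ... to the ID attribute of every line of one feature type.

    Hypothetical helper for illustration; all other lines pass through unchanged.
    """
    seen = Counter()
    out = []
    for line in lines:
        text = line.rstrip("\n")
        cols = text.split("\t")
        # Leave directives, malformed lines and other feature types untouched.
        if text.startswith("#") or len(cols) != 9 or cols[2] != feature_type:
            out.append(text)
            continue
        attrs = cols[8].split(";")
        for i, attr in enumerate(attrs):
            if attr.startswith("ID="):
                seen[attr] += 1
                # e.g. ID=CDS:P1 becomes ID=CDS:P1.1, then ID=CDS:P1.2, ...
                attrs[i] = f"{attr}.{seen[attr]}"
        cols[8] = ";".join(attrs)
        out.append("\t".join(cols))
    return out
```

Writing the returned lines back out with "\n".join(...) gives a new file to test; keeping the original annotation alongside it makes it easy to compare results.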

Do you know if I can just delete the attributes for the CDS feature, or will that impact the counts I get?

written 4 days ago by viktoria.c.karlsson

Difficult to say; I don't know how GENE-counter works.

written 3 days ago by Juke-34

OK, thank you for your response :)

written 3 days ago by viktoria.c.karlsson