Question: Make a GFF-file compatible for reference database
8 months ago by
viktoria.c.karlsson0 wrote:

I am doing an mRNA analysis (mRNA from mice) and want to use GENE-counter to get raw counts. The manual says that I need a reference genome and an annotation file in GFF3 format, which I will use to create a reference database in MySQL. I have tried the reference and GFF from both RefSeq (from NCBI) and Ensembl, and neither works. I get an error telling me that there are duplicates in the GFF file and hence it is not in GFF3 format (though it is, according to the files). In the readme file that comes with the GFF files from Ensembl I read that: "Some validators may warn about duplicated identifiers for CDS features. This is to allow split features to be grouped." OK, so this is a known phenomenon, but what can I do about it? Any advice? Sorting the file and using the uniq command did not help.

Update: I am pretty sure that the problem is that some of the entries for the CDS feature have the same attribute value. Does anyone know a way to make them unique, or can I just remove the attributes from that feature? Will that impact the counts I get?
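One way to check this suspicion is to count how often each ID value occurs per feature type. The following is a minimal Python sketch, not part of GENE-counter or any other tool; the function names and the example rows are made up for illustration, and it assumes the file is tab-separated with nine columns and that the duplicated attribute is the GFF3 `ID`.

```python
from collections import Counter


def parse_attributes(column9):
    """Split a GFF3 attribute column ("key=value;key=value") into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key] = value
    return attrs


def duplicated_ids(lines):
    """Return {(feature_type, ID): count} for IDs used on more than one line."""
    counts = Counter()
    for line in lines:
        # Skip directives, comments, and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue
        feature_id = parse_attributes(cols[8]).get("ID")
        if feature_id is not None:
            counts[(cols[2], feature_id)] += 1
    return {key: n for key, n in counts.items() if n > 1}


# Made-up rows: one CDS split over two lines that share the same ID.
example = [
    "##gff-version 3\n",
    "1\tensembl\tmRNA\t100\t900\t.\t+\t.\tID=transcript:T1;Parent=gene:G1\n",
    "1\tensembl\tCDS\t100\t300\t.\t+\t0\tID=CDS:P1;Parent=transcript:T1\n",
    "1\tensembl\tCDS\t500\t900\t.\t+\t0\tID=CDS:P1;Parent=transcript:T1\n",
]

print(duplicated_ids(example))
```

For a real file, passing an open file handle (for example `open("Mus_musculus.GRCm38.99.gff3")`) in place of `example` should work the same way, since the function only iterates over lines once.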

ADD COMMENTlink modified 8 months ago • written 8 months ago by viktoria.c.karlsson0

Which GFF file are you using from Ensembl? If you use the chr_patch_hapl_scaff file this will contain duplicated gene names due to the genes on the patches and haplotypes. Could this be the problem?

ADD REPLYlink written 8 months ago by Emily_Ensembl21k

I am using the one called just Mus_musculus.GRCm38.99.gff3 and the primary assembly as reference genome. So I guess that is not the problem...

ADD REPLYlink written 8 months ago by viktoria.c.karlsson0

Try Mus_musculus.GRCm38.99.chr.gff3.gz

ADD REPLYlink modified 8 months ago • written 8 months ago by Emily_Ensembl21k

You mean just the same file but zipped? Does not work... Thanks anyway!

ADD REPLYlink written 8 months ago by viktoria.c.karlsson0

Sorry, I put in the wrong thing. Edited my comment now. I meant the one with chr in the name, that should be just the primary assembly.

ADD REPLYlink written 8 months ago by Emily_Ensembl21k

No worries! No difference, I'm afraid. It gives me the same error. The weird thing is that they warn about this in the readme file, but no solution is given...

ADD REPLYlink written 8 months ago by viktoria.c.karlsson0
8 months ago by
Juke344.8k
Sweden
Juke344.8k wrote:

You could try agat_convert_sp_gxf2gxf.pl from AGAT; it should be able to fix your GFF file.

ADD COMMENTlink modified 5 months ago • written 8 months ago by Juke344.8k

I really thought this would solve it, but I have now tried both the script you recommended and the one called remove redundants, and MySQL still does not accept the file. When I run the redundant script, it says that there are no redundancies. I do not get this at all...

ADD REPLYlink written 8 months ago by viktoria.c.karlsson0

Still the same warning from GENE-counter?

ADD REPLYlink written 8 months ago by Juke344.8k

Yep. I think the problem is that some of the CDS features have the same attribute value. For this script, that does not count as redundancy. I cannot find a way to make them unique, though.
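If the shared value is the `ID` attribute, one possible workaround is to append a running suffix to every ID that occurs on more than one line. This is only a sketch under several assumptions: the function name and sample rows are hypothetical, the renamed features have no children that refer to them through `Parent=` (usually true for CDS rows), and the generated IDs such as `CDS:P1.1` do not already exist in the file. It also discards the grouping of split CDS segments that the Ensembl readme mentions, and whether GENE-counter then accepts the file or produces the same counts has not been tested here.

```python
from collections import Counter


def make_ids_unique(lines):
    """Append .1, .2, ... to every ID that occurs on more than one line.

    `lines` must be a list, because it is read twice.
    """

    def get_id(line):
        # Return the ID attribute of a feature line, or None if there is none.
        cols = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(cols) != 9:
            return None
        for field in cols[8].split(";"):
            if field.startswith("ID="):
                return field[3:]
        return None

    # First pass: how many lines use each ID?
    totals = Counter(i for i in map(get_id, lines) if i is not None)

    # Second pass: rewrite only the IDs that are shared.
    seen = Counter()
    out = []
    for line in lines:
        feature_id = get_id(line)
        if feature_id is None or totals[feature_id] == 1:
            out.append(line)
            continue
        seen[feature_id] += 1
        cols = line.rstrip("\n").split("\t")
        fields = [
            f"ID={feature_id}.{seen[feature_id]}" if f.startswith("ID=") else f
            for f in cols[8].split(";")
        ]
        cols[8] = ";".join(fields)
        out.append("\t".join(cols) + "\n")
    return out


# Made-up rows: two CDS segments that share one ID.
rows = [
    "##gff-version 3\n",
    "1\tensembl\tmRNA\t100\t900\t.\t+\t.\tID=transcript:T1;Parent=gene:G1\n",
    "1\tensembl\tCDS\t100\t300\t.\t+\t0\tID=CDS:P1;Parent=transcript:T1\n",
    "1\tensembl\tCDS\t500\t900\t.\t+\t0\tID=CDS:P1;Parent=transcript:T1\n",
]

for row in make_ids_unique(rows):
    print(row, end="")
```

Only the ID is changed; the `Parent` attribute is kept, so each CDS segment still points at its transcript.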

ADD REPLYlink written 8 months ago by viktoria.c.karlsson0

Do you know if I can just delete the attributes for the CDS feature, or will that impact the counts I get?

ADD REPLYlink written 8 months ago by viktoria.c.karlsson0

Difficult to say; I don't know how GENE-counter works.

ADD REPLYlink written 8 months ago by Juke344.8k

OK, thank you for your response :)

ADD REPLYlink written 8 months ago by viktoria.c.karlsson0