Question: How to generate an EMBL file for each scaffold of an assembly from its GFF and FASTA files
Mehmet (Japan) wrote 17 months ago:

Dear all,

I have an assembly with ~1000 scaffolds, along with a GFF3 and a GTF file. How can I generate an EMBL file for each scaffold of the assembly?

gene sequence assembly genome • 1.2k views
Juke-34 (Sweden) wrote 17 months ago:

Have a look at EMBLmyGFF3; it is what you are looking for.


Thank you. I tried it. How can I use the scaffold names as IDs in the output file?

written 17 months ago by Mehmet

I have several questions about your script.

  1. How can I provide a protein file for the --translate option? What argument does this option accept?

  2. In the output file, how can I assign unique IDs based on the scaffold name?

In the example output file, I want scaffold00001 as the ID, not XXX. How can I do that?

written 17 months ago by Mehmet

Hi, 1 - I am not sure what you want to do, or maybe our explanations are not clear enough, so let's clarify. You should provide the FASTA file of the assembly, i.e. the DNA sequences. There is no way to pass a protein file to the tool. The --translate option adds the translation of each CDS contained in your GFF. It is a boolean flag, so it takes no argument.

2 - Actually, the tool currently uses the accession as the prefix for the ID. I am modifying that to use a locus_tag option to define it, which will be clearer. But I don't plan to implement the possibility of using the scaffold name as part of the locus tag. Only the use of the locus_tag given by ENA is mandatory. The rest is an arbitrary choice we made (locus1, locus2, locus3 ...).
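
As an illustration of point 1, here is a minimal Python sketch of what "adding the translation of the CDS" means. It is not EMBLmyGFF3's own code. It assumes the standard genetic code (the one selected by transl_table 1), forward-strand features only, and 1-based inclusive coordinates as in a location such as join(371..871,936..953). The function name translate_cds is hypothetical.

```python
# Illustrative only: a tiny CDS translator using the standard genetic code
# (NCBI translation table 1). This is NOT how EMBLmyGFF3 is implemented.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map every codon (e.g. "ATG") to its one-letter amino acid ("*" = stop).
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate_cds(scaffold_seq, segments):
    """Join CDS segments (1-based, inclusive, forward strand) and translate.

    Codons containing anything other than A, C, G, T become "X";
    a trailing partial codon is ignored.
    """
    cds = "".join(scaffold_seq[start - 1:end] for start, end in segments).upper()
    usable = len(cds) - len(cds) % 3
    return "".join(
        CODON_TABLE.get(cds[pos:pos + 3], "X") for pos in range(0, usable, 3)
    )


# Toy scaffold: two CDS segments separated by a 4-base "intron".
scaffold = "ATGGCC" + "GTAG" + "AAATAA"
print(translate_cds(scaffold, [(1, 6), (11, 16)]))  # prints MAK*
```

A different transl_table value would correspond to a different genetic code, i.e. a different amino-acid string in place of AMINO_ACIDS.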

written 17 months ago by Juke-34

Hi,

For the second question, here is an example. In my FASTA file:

>scaffold001
ATGATG
>scaffold2
ACGTA

In my GFF file: scaffold01 scaffold01

What I would like to ask is: how can the IDs in the EMBL file follow the order of the FASTA file? For example: ID scaffold01, ID scaffold02, etc.

written 17 months ago by Mehmet

Currently it's not possible, but I will see if I can implement something like that.

written 17 months ago by Juke-34

I was able to do it. Sorry about that.

written 17 months ago by Mehmet

Excellent, could you share your trick?

written 17 months ago by Juke-34

Yes, of course. The script is easy to use, and it saved my life. I had searched through many scripts and tools for converting to EMBL format. Thank you for providing this script. It took ~5 minutes to generate the EMBL file for a 75 Mbp assembly with ~17000 gene models, and it ran without any errors.

The documentation could give more information for beginners, for example an explanation of "transl_table".

The most important thing is the scaffold IDs that I mentioned before. If you could provide them, it would be much easier to split a big EMBL file into small files that can be used for downstream analyses.

Once again, thank you for writing this script.

written 17 months ago by Mehmet

Thank you very much for your feedback. I will try to improve the help. Actually, I mixed up the terms: when you said "ID" I thought "locus_tag". I just realised you were talking about the ID line... My fault. So yes, there is no way to put the contig name into the ID line; it would break the EMBL rules and the result would not be a valid EMBL flat file. Nevertheless, what I can easily do is add the contig name to the DE line. It was like that previously, and it is compatible with the format. It could help to split a big EMBL file into small files.
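
The splitting mentioned above can be sketched in a few lines of Python. This is only a sketch under assumptions: each record ends with a line containing only //, as in EMBL flat files, and the scaffold name is taken from a line of the form "AC * _scaffold00001" (as in the sample output posted in this thread), falling back to the first word of the DE line. The function names and the output file naming are hypothetical and not part of EMBLmyGFF3.

```python
import re
from pathlib import Path


def record_name(lines, index):
    """Pick a name for one record: 'AC * _<name>' first, then DE, then a counter."""
    for line in lines:
        match = re.match(r"AC\s+\*\s+_(\S+)", line)
        if match:
            return match.group(1)
    for line in lines:
        words = line.split()
        if len(words) >= 2 and words[0] == "DE":
            return words[1]
    return f"record{index}"


def split_embl_records(text):
    """Return a list of (name, record_text) pairs, one per EMBL record."""
    records, current = [], []
    for line in text.splitlines():
        current.append(line)
        if line.strip() == "//":  # end-of-record marker
            records.append(current)
            current = []
    if any(line.strip() for line in current):  # tolerate a missing final "//"
        records.append(current)
    return [
        (record_name(lines, index), "\n".join(lines) + "\n")
        for index, lines in enumerate(records, start=1)
    ]


def write_split_files(embl_path, out_dir):
    """Write each record to <out_dir>/<name>.embl (same names overwrite each other)."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for name, record in split_embl_records(Path(embl_path).read_text()):
        safe_name = re.sub(r"[^A-Za-z0-9._-]", "_", name)  # keep file names simple
        (out_dir / f"{safe_name}.embl").write_text(record)


# Small in-memory demonstration with two toy records.
sample = (
    "ID   XXX; SV 1;\nAC * _scaffold00001\nDE   XXX\n//\n"
    "ID   XXX; SV 1;\nDE   contig2 example\n//\n"
)
for name, _ in split_embl_records(sample):
    print(name)
```

If the DE line is the only place where the contig name appears, the fallback branch is the one that matters; records that yield the same name would need an extra suffix to avoid overwriting each other.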

written 17 months ago by Juke-34
ID   XXX; SV 1; linear; genomic DNA; WGS; INV; 6279 BP.
XX
AC   XXX; 
XX
AC * _scaffold00001
XX
PR   Project:PRJXXXXXXX;
XX
DT   27-Oct-2017 (Rel. 133, Created)
DT   27-Oct-2017 (Rel. 133, Last updated, Version 1)
XX
DE   XXX
XX
KW   .
XX
OS   my species (worm)
XX
OC   Life;
XX
RN   [1]
RP   1-6279
RG   MYGROUP
RT   ;
RL   Submitted (27-OCT-2017) to the INSDC.
XX
FH   Key             Location/Qualifiers
FH
FT   source          1..6279
FT                   /mol_type="genomic DNA"
FT                   /organism="my species (worm)"
FT   gene            371..953
FT                   /locus_tag="XXX_locus1"
FT                   /note="source:EVM"
FT                   /note="ID:evm.TU.scaffold00001.1"
FT                   /standard_name="EVM prediction scaffold00001.1"
FT   mRNA            join(371..871,936..953)
written 17 months ago by Mehmet

Could you format it properly? It is hard to read like that.

written 17 months ago by Juke-34
colindaven (Hannover Medical School) wrote 17 months ago:

If it's a bacterium, maybe put it through an automated annotation platform, e.g. NCBI's automated analysis platform?

MAKER et al. will only generate a GFF3.

Actually prokka will generate an EMBL or has a conversion script:

https://github.com/tseemann/prokka


No, it is a eukaryote.

written 17 months ago by Mehmet