Question

How to generate an EMBL file for each scaffold of an assembly from its GFF and FASTA files?

0

6.8 years ago

Mehmet ▴ 820

Dear all,

I have an assembly with ~1000 scaffolds, along with a GFF3 and a GTF file. How can I generate an EMBL file for each scaffold of the assembly?

Assembly genome sequence gene • 4.5k views

ADD COMMENT • link updated 6.8 years ago by Juke34 8.6k • written 6.8 years ago by Mehmet ▴ 820

0

6.8 years ago

colindaven 6.5k

If it's a bacterium, maybe put it through an automated annotation platform? E.g. NCBI's automated analysis platform.

MAKER et al. will only generate a GFF3.

Actually, Prokka will generate an EMBL file, or has a conversion script:

https://github.com/tseemann/prokka

ADD COMMENT • link 6.8 years ago by colindaven 6.5k

0

No, it is a eukaryote.

ADD REPLY • link 6.8 years ago by Mehmet ▴ 820

score 3 · Accepted Answer · 2017-10-27

3

6.8 years ago

Juke34 8.6k

Have a look at EMBLmyGFF3; it is what you are looking for.

ADD COMMENT • link 6.8 years ago by Juke34 8.6k

0

Thank you. I tried it. How can I use the scaffold names as IDs in the output file?

ADD REPLY • link 6.8 years ago by Mehmet ▴ 820

0

I have several questions about your script.

How can I provide a protein file to use the --translate option? What argument does this option accept?
In the output file, how can I give unique IDs based on the scaffold name?

In my example output file, I want scaffold00001 as the ID, not XXX. How can I do that?

ADD REPLY • link 6.8 years ago by Mehmet ▴ 820

1

Hi, 1 - I don't get what you want to do... or maybe our explanations are not clear enough. Let's clarify that. You should provide the FASTA of the assembly, i.e. DNA sequences. There is no way to pass a protein file to the tool. Using the --translate option will add the translation of the CDS contained in your GFF. It's a boolean; there is nothing else to add.

2 - Actually, the tool currently uses the accession as the prefix for the ID. I am modifying that to use a locus_tag option to define it, which will be clearer. But I don't plan to implement the possibility of using the scaffold name as part of the locus tag. Only the use of the locus_tag given by ENA is mandatory. The rest is an arbitrary choice we made (locus1, locus2, locus3 ...).
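
On point 1, to illustrate what "adding the translation of the CDS" means in principle, here is a minimal Python sketch. It is not the tool's own code: the `translate` function and the `CODON_TABLE` name are hypothetical, and it assumes the standard genetic code (NCBI translation table 1) and a forward-strand CDS that is already spliced and in frame.

```python
from itertools import product

BASES = "TCAG"
# Amino acids of the standard genetic code (NCBI table 1), listed in
# TCAG codon order: TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds):
    """Translate an in-frame, forward-strand CDS into a protein string.

    Translation stops at the first stop codon; codons containing
    characters other than A, C, G, T are reported as 'X'.
    """
    cds = cds.upper()
    protein = []
    # Ignore any trailing bases that do not form a complete codon.
    for i in range(0, len(cds) - len(cds) % 3, 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)
```

For example, `translate("ATGGCCTAA")` gives `"MA"`. A different `transl_table` value would correspond to a different amino-acid string in the table above.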

ADD REPLY • link 6.8 years ago by Juke34 8.6k

0

Hi,

For the second question, for instance, in my FASTA file:

scaffold001 ATGATG scaffold2 ACGTA

In my GFF file: scaffold01 scaffold01

What I would like to ask is: how can the IDs in the EMBL file follow the order of the FASTA file? ID scaffold01, ID scaffold02, etc.

ADD REPLY • link 6.8 years ago by Mehmet ▴ 820

0

Currently it's not possible, but I will see if I can implement something like that.

ADD REPLY • link 6.8 years ago by Juke34 8.6k

0

I was able to. Sorry for that.

ADD REPLY • link 6.8 years ago by Mehmet ▴ 820

0

Excellent, could you share your trick?

ADD REPLY • link 6.8 years ago by Juke34 8.6k

1

Yes, of course. The script is easy to use, and it saved my life. I had searched for many scripts and tools to convert to EMBL format. Thank you for providing it. It took ~5 minutes to generate the EMBL file for a 75 Mbp assembly with ~17,000 gene models, and it ran without any errors.

The explanations could give more information for beginners, for example an explanation of "transl_table".

The most important thing is the scaffold IDs that I mentioned before. If you can provide this, it would be much easier to split the big EMBL file into small files that can be used for downstream analyses.

Once again, thank you for writing this script.

ADD REPLY • link 6.8 years ago by Mehmet ▴ 820

0

Thank you very much for your feedback. I will try to improve the help. Actually, I mixed up the terms: when you said "ID" I thought you meant "locus_tag". I just realised you were talking about the ID line... My fault. So yes, there is no way to put the contig name into the ID line; it would break the EMBL rules and the result would not be a valid EMBL flat file. Nevertheless, what I can easily do is add the contig name to the DE line. It used to be like that, and it is compatible with the format. It could help to split a big EMBL file into small files.
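
As a rough sketch of that splitting step, the Python below cuts a multi-record EMBL file into one record per scaffold. It assumes each record ends with the usual `//` terminator line and that the scaffold name appears on an `AC * _name` line, as in the sample output posted in this thread. The function names `split_embl` and `write_records` are hypothetical, not part of EMBLmyGFF3.

```python
import re
from pathlib import Path

# Matches lines such as "AC * _scaffold00001" and captures the name.
AC_STAR = re.compile(r"^AC\s+\*\s+_(\S+)")


def split_embl(text):
    """Return a list of (name, record_text) pairs, one per EMBL record.

    A record is everything up to and including a "//" line. Text after
    the last "//" (an incomplete record) is ignored.
    """
    records = []
    current = []
    name = None
    for line in text.splitlines():
        current.append(line)
        match = AC_STAR.match(line)
        if match and name is None:
            name = match.group(1)
        if line.strip() == "//":
            if name is None:
                # No "AC * _name" line found: fall back to a counter.
                name = f"record_{len(records) + 1}"
            records.append((name, "\n".join(current) + "\n"))
            current, name = [], None
    return records


def write_records(records, out_dir):
    """Write each record to <out_dir>/<name>.embl."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for name, body in records:
        (out / f"{name}.embl").write_text(body)
```

Usage would look like `write_records(split_embl(Path("assembly.embl").read_text()), "per_scaffold")`, where both paths are placeholders.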

ADD REPLY • link 6.8 years ago by Juke34 8.6k

0

ID   XXX; SV 1; linear; genomic DNA; WGS; INV; 6279 BP.
XX
AC   XXX; 
XX
AC * _scaffold00001
XX
PR   Project:PRJXXXXXXX;
XX
DT   27-Oct-2017 (Rel. 133, Created)
DT   27-Oct-2017 (Rel. 133, Last updated, Version 1)
XX
DE   XXX
XX
KW   .
XX
OS   my species (worm)
XX
OC   Life;
XX
RN   [1]
RP   1-6279
RG   MYGROUP
RT   ;
RL   Submitted (27-OCT-2017) to the INSDC.
XX
FH   Key             Location/Qualifiers
FH
FT   source          1..6279
FT                   /mol_type="genomic DNA"
FT                   /organism="my species (worm)"
FT   gene            371..953
FT                   /locus_tag="XXX_locus1"
FT                   /note="source:EVM"
FT                   /note="ID:evm.TU.scaffold00001.1"
FT                   /standard_name="EVM prediction scaffold00001.1"
FT   mRNA            join(371..871,936..953)