Question: How to generate an EMBL file for each scaffold in an assembly from GFF and FASTA files
Mehmet (580), Japan, wrote 2.8 years ago:

Dear all,

I have an assembly with ~1000 scaffolds, plus GFF3 and GTF annotation files. How can I generate an EMBL file for each scaffold of the assembly?

gene sequence assembly genome • 2.0k views
modified 2.8 years ago by Juke34 (4.5k) • written 2.8 years ago by Mehmet (580)
Juke34 (4.5k), Sweden, wrote 2.8 years ago:

Have a look at EMBLmyGFF3; it is what you are looking for.

modified 2.8 years ago • written 2.8 years ago by Juke34 (4.5k)

Thank you. I tried it. How can I use the scaffold names as the IDs in the output file?

written 2.8 years ago by Mehmet (580)

I have several questions about your script.

  1. How can I provide a protein file for the --translate option? What argument does this option accept?

  2. How can I give the records in the output file unique IDs based on the scaffold names?

In my example output file, I want scaffold00001 as the ID, not XXX. How can I do that?

modified 2.8 years ago • written 2.8 years ago by Mehmet (580)

Hi,

1 - I don't get what you want to do... Or maybe our explanations are not clear enough. Let's clarify that. You should provide the FASTA of the assembly, i.e. the DNA sequences. There is no way to pass a protein file to the tool. The --translate option adds the translation of each CDS contained in your GFF. It's a boolean flag; it takes no argument.

2 - Actually, the tool currently uses the accession as the prefix for the ID. I am modifying that to use a locus_tag option to define it, which will be clearer. But I don't plan to implement the possibility of using the scaffold name as part of the locus tag. Only the use of the locus_tag given by ENA is mandatory. The rest is an arbitrary choice we made (locus1, locus2, locus3 ...).

written 2.8 years ago by Juke34 (4.5k)
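
To illustrate what a CDS translation and a "transl_table" are, here is a minimal Python sketch. It is not EMBLmyGFF3's implementation: it assumes the standard genetic code (NCBI translation table 1), a forward-strand CDS, and 1-based inclusive exon coordinates like those in a join(...) location. The toy scaffold and the function names are made up for the example.

```python
from itertools import product

# Standard genetic code (NCBI transl_table 1), codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def extract_cds(sequence, exons):
    """Join 1-based, inclusive (start, end) exon ranges, like join(a..b,c..d)."""
    return "".join(sequence[start - 1:end] for start, end in exons)


def translate(cds):
    """Translate a forward-strand CDS; codons not in the table become 'X'."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    return "".join(
        CODON_TABLE.get(cds[i:i + 3], "X") for i in range(0, usable, 3)
    )


# Toy scaffold (hypothetical): two exons separated by a 4-base intron.
scaffold = "ATGGCC" + "GTAA" + "TTTTAA"
cds = extract_cds(scaffold, [(1, 6), (11, 16)])
print(cds)             # ATGGCCTTTTAA
print(translate(cds))  # MAF*
```

A different translation table would change the AMINO_ACIDS string, which is why the table number matters for organisms or organelles that do not use the standard code.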

Hi,

For the second question: for instance, in my FASTA file I have:

scaffold001 ATGATG scaffold2 ACGTA

and in my GFF file:

scaffold01 scaffold01

What I would like to ask is: how can the ID lines in the EMBL file follow the order of the FASTA file? For example, ID scaffold01, ID scaffold02, etc.

written 2.8 years ago by Mehmet (580)
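
Since the scaffold names in the snippet above are not spelled the same way in the FASTA and the GFF, a quick consistency check before conversion may help. The following is a minimal Python sketch, not part of any tool mentioned in this thread. The function names are hypothetical, and the GFF row is invented around the scaffold name from the post.

```python
def fasta_names(lines):
    """Return sequence names in FASTA order (first word after '>')."""
    return [
        line[1:].split()[0]
        for line in lines
        if line.startswith(">") and line[1:].strip()
    ]


def gff_seqids(lines):
    """Return the distinct column-1 values of a GFF, in order of first appearance."""
    seen = {}
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip directives, comments and blank lines
        seen.setdefault(line.split("\t")[0], None)
    return list(seen)


# Toy input: names taken from the post, the rest of the GFF row is made up.
fasta = [">scaffold001", "ATGATG", ">scaffold2", "ACGTA"]
gff = [
    "##gff-version 3",
    "scaffold01\tEVM\tgene\t1\t6\t.\t+\t.\tID=gene1",
]

names = fasta_names(fasta)
missing = [seqid for seqid in gff_seqids(gff) if seqid not in names]
print(names)    # ['scaffold001', 'scaffold2']
print(missing)  # ['scaffold01']
```

With real files, an open file handle can be passed instead of a list, since both functions only iterate over lines. The sketch assumes the GFF has no embedded ##FASTA section.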

Currently it's not possible, but I will see if I can implement something like that.

written 2.8 years ago by Juke34 (4.5k)

I was able to. Sorry for that.

written 2.8 years ago by Mehmet (580)

Excellent, could you share your trick?

written 2.8 years ago by Juke34 (4.5k)

Yes, of course. The script is easy to use, and it saved my life. I had searched through many scripts and tools for converting to EMBL format. Thank you for providing this script. It took ~5 minutes to generate the EMBL file for a 75 Mbp assembly with ~17000 gene models, and it ran without any errors.

The documentation could give more information for beginners, for example an explanation of "transl_table".

The most important thing is the scaffold IDs that I mentioned before. If you can provide this, it would be much easier to split a big EMBL file into small files that can be used for downstream analyses.

Once again, thank you for writing this script.

written 2.8 years ago by Mehmet (580)

Thank you very much for your feedback. I will try to improve the help. Actually, I mixed up the different terms. When you said "ID" I thought "locus_tag". I just realised you were talking about the ID line... My fault. So, yes, there is no way to put the contig name in the ID line; it would break the EMBL rules and the result would not be a valid EMBL flat file. Nevertheless, what I can easily do is add the contig name to the DE line. It was like that previously, and it's compatible with the format. It could help to split a big EMBL file into small files.

written 2.8 years ago by Juke34 (4.5k)
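
As a rough illustration of splitting a multi-record EMBL file into one file per scaffold, here is a minimal Python sketch. It is not an EMBLmyGFF3 feature. It assumes that each record ends with a line containing only "//" and that the scaffold name sits on an "AC * _<name>" line, like the "AC * _scaffold00001" line in the output excerpt in this thread. The function names, file names and demo records are hypothetical.

```python
import os
import re


def split_embl(lines):
    """Yield (name, record_text) for each record of a multi-record EMBL file.

    A record ends at a line containing only '//'. Text after the last '//'
    (an unterminated record) is dropped.
    """
    record, name = [], None
    for line in lines:
        record.append(line if line.endswith("\n") else line + "\n")
        match = re.match(r"AC \* _(\S+)", line)
        if match:
            name = match.group(1)
        if line.strip() == "//":
            yield name, "".join(record)
            record, name = [], None


def write_records(embl_path, out_dir="."):
    """Write each record to <out_dir>/<name>.embl.

    Assumes scaffold names are safe to use as file names.
    """
    with open(embl_path) as handle:
        for index, (name, text) in enumerate(split_embl(handle), start=1):
            filename = f"{name or f'record{index}'}.embl"
            with open(os.path.join(out_dir, filename), "w") as out:
                out.write(text)


# Tiny made-up input with two records, just to show the splitting.
demo = [
    "ID   XXX; SV 1; linear; genomic DNA; WGS; INV; 6 BP.",
    "AC * _scaffold00001",
    "//",
    "ID   XXX; SV 1; linear; genomic DNA; WGS; INV; 5 BP.",
    "AC * _scaffold00002",
    "//",
]
records = list(split_embl(demo))
print([name for name, _ in records])  # ['scaffold00001', 'scaffold00002']
```

If the contig name were only available in the DE line, the regular expression would need to be adapted to that line instead.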
ID   XXX; SV 1; linear; genomic DNA; WGS; INV; 6279 BP.
XX
AC   XXX; 
XX
AC * _scaffold00001
XX
PR   Project:PRJXXXXXXX;
XX
DT   27-Oct-2017 (Rel. 133, Created)
DT   27-Oct-2017 (Rel. 133, Last updated, Version 1)
XX
DE   XXX
XX
KW   .
XX
OS   my species (worm)
XX
OC   Life;
XX
RN   [1]
RP   1-6279
RG   MYGROUP
RT   ;
RL   Submitted (27-OCT-2017) to the INSDC.
XX
FH   Key             Location/Qualifiers
FH
FT   source          1..6279
FT                   /mol_type="genomic DNA"
FT                   /organism="my species (worm)"
FT   gene            371..953
FT                   /locus_tag="XXX_locus1"
FT                   /note="source:EVM"
FT                   /note="ID:evm.TU.scaffold00001.1"
FT                   /standard_name="EVM prediction scaffold00001.1"
FT   mRNA            join(371..871,936..953)
modified 2.8 years ago • written 2.8 years ago by Mehmet (580)

Could you format it properly? It's hard to read like that.

written 2.8 years ago by Juke34 (4.5k)
colindaven (2.3k), Hannover Medical School, wrote 2.8 years ago:

If it's a bacterium, maybe put it into an automated annotation platform? E.g. NCBI's automated analysis platform.

MAKER et al. will only generate a GFF3.

Actually, Prokka will generate an EMBL file, or has a conversion script:

https://github.com/tseemann/prokka

written 2.8 years ago by colindaven (2.3k)

No, it is a eukaryote.

written 2.8 years ago by Mehmet (580)