Question: Why the outcome of my EMBL conversion file is all 'gap'?
1
gravatar for Jerryliu
3 months ago by
Jerryliu10
Jerryliu10 wrote:

Thanks for the person telling me that there is a software called EMBLmyGFF3 that can convert gff3-like file into EMBL file. However, I applied the software to my fasta file and gff3 file, the result I got were all 'Gap's!!!? initially I thought it was because the source problem, which is stuff in third column in the gff file, after I changed and made It a second run, it made error again!!! do anyone have similar experience??? this stuff is driving me crazy....!

FT   source          1..593586810                             
FT                   /mol_type="genomic DNA"                  
FT                   /organism="Wheat"                        
FT   **gap**             4503..5478                               
FT                   /estimated_length=976                    
FT   **gap**             11990..13040                             
FT                   /estimated_length=1051                   
FT   **gap**             14847..14856                             
FT                   /estimated_length=10                     
FT   **gap**             16648..16657                             
FT                   /estimated_length=10                     
FT   **gap**             19956..24520                             
FT                   /estimated_length=4565
alignment genome gene • 328 views
ADD COMMENTlink modified 11 weeks ago by WouterDeCoster35k • written 3 months ago by Jerryliu10
1

I don't see what's wrong with it .... Those gaps indicate gaps in your genomic sequence, and there likely be plenty of those indeed.

Scroll further downwards until you passed all these gaps tags and see what's printed there

ADD REPLYlink written 3 months ago by lieven.sterck3.3k

Actually they. were all gaps. I wanted to transform my transposon element in GFF format to EMBL format, so the column which is displaying 'gap' now should be 'transposon', but I don't know why it failed.

ADD REPLYlink written 3 months ago by Jerryliu10

strange, that makes very little sense ...

Can you post an extract of your GFF file so we can inspect?

ADD REPLYlink written 12 weeks ago by lieven.sterck3.3k

sure.

##gff-version 3
chr1A   RepeatMasker    Transposon  280909615   280909852   1198    ID=Denovo_TE000001;Target=chr1A_523945824_545934324_138 10287 10524;Class=LTR/Gypsy;PercDiv=12.2;PercDel=3.4;PercIns=3.4;

chr1A   RepeatMasker    Transposon  280909859   280910584   2802    ID=Denovo_TE000002;Target=chr1A_523945824_545934324_138 9135 9910;Class=LTR/Gypsy;PercDiv=18.6;PercDel=9.4;PercIns=2.3;

chr1A   RepeatMasker    Transposon  280910588   280911206   3621    ID=Denovo_TE000003;Target=chr1A_280907002_311370710_82 12495 13134;Class=LTR/Gypsy;PercDiv=15.1;PercDel=3.9;PercIns=0.5;
ADD REPLYlink modified 12 weeks ago by genomax59k • written 12 weeks ago by Jerryliu10

Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time.
code_formatting

Thank you!

ADD REPLYlink written 12 weeks ago by genomax59k

I don't understand...what's the difference ???:))

ADD REPLYlink written 12 weeks ago by Jerryliu10

Using the code option converts text to monospace font. Without that option biostars code will try to interpret some symbols like # and change the formatting of text.

ADD REPLYlink modified 12 weeks ago • written 12 weeks ago by genomax59k

okay, I got it. I will be careful to make it right next time.!!!

ADD REPLYlink written 12 weeks ago by Jerryliu10

and the EMBLmyGFF3 cmdline ?

this gff file does already not correspond to the EMBL you have given previously.

ADD REPLYlink written 12 weeks ago by lieven.sterck3.3k

it is like this:

EMBLmyGFF3 maker.gff3 maker.fa \
--topology linear \
--molecule_type 'genomic DNA' \
--transl_table 1  \
--species 'Drosophila melanogaster' \
--locus_tag LOCUSTAG \
--project_id PRJXXXXXXX \
-o result.embl

I don't really know what these tag means, so I gave them by default.

ADD REPLYlink modified 12 weeks ago by genomax59k • written 12 weeks ago by Jerryliu10

If you don't care about the header and your downstream analysis focus only on the features (FT,SQ,etc...), you could keep fake/default information. Except maybe transl_table that could be important if you ask the tool to do the CDS translation and your species don't use the table 1...

ADD REPLYlink written 12 weeks ago by Juke-341.7k

Hello 3105298904 ,

have you tried to find your organism and gene on plants.ensembl.org? If you found your gene there you can export the data directly to EMBL format without any convertion.

fin swimmer

ADD REPLYlink written 3 months ago by finswimmer8.2k

If I remember the original post, it thought it was to upload data to EBI/ENA

ADD REPLYlink written 3 months ago by lieven.sterck3.3k
1
gravatar for Juke-34
12 weeks ago by
Juke-341.7k
Sweden
Juke-341.7k wrote:

Hi, I'm one of the developers of EMBLmyGFF3. Yes, as mentioned by Lieven your output looks correct. GAP features are created for all part that are Ns in your sequence. This is totally normal and mandatory for submission.

Then you are complaining that Transposon features don't occur in your EMBL output. As it shout be raised when you run EMBLmyGFF3, Transposon is not an accepted feature type by ENA. So they are discarded. If you want to keep them, you should use the --expose_translations option and then edit the json translation consequently. It means inform the tool to translate transposon in one of the ~ 50 features accepted by ENA (your choice should reflect something logic. I'm not sure if one of the feature type from ENA could fit to describe Transposon. Have a look here).

ADD COMMENTlink written 12 weeks ago by Juke-341.7k

haha ..thanks for your answer. I just found something interesting, I edited the column in the gff file where should be transposon into one of the ENA features you mentioned(actually that are not), because I suspected that maybe it was the feature that the software could not recognize. afterward, I got the output in the format I want, like this: gene complement(280897286..280897876) FT /locus_tag="LOCUSTAG_TE6487692" FT /note="source:RepeatProteinMask" FT /note="ID:TP327664" FT gene complement(280897765..280900842) FT /locus_tag="LOCUSTAG_TE6487693" FT /note="source:RepeatProteinMask" FT /note="ID:TP327665"

I should try to make it a second run following your suggestion later, however there is another mistake that happened

actually I am doing the conversion covering the whole genome level of wheat, and my input file, I mean gff file and sequence file,include the transposons annotation and 7 chromosome sequence, so the gff file is the mixture of TEs, and the sequence file is a file that contain multiple blocks of DNA sequence(>chr1A,>chr2A........). What is strange is that I only got the correct output information of chr1A chromosome (it seemed that only the first sequence in the fasta file had been scanned) , and the left TEs and were not recognized and displayed. So the next lines after the last recognized TE were all gaps..again...And I am not sure they are TE or others(because the complement index is not correct)....following is what the output look like.

        gene            complement(280900375..280900935)
FT                   /locus_tag="LOCUSTAG_TE6487694"
FT                   /note="source:RepeatProteinMask"
FT                   /note="ID:TP327666"
FT   gene            complement(280900821..280901252)
FT                   /locus_tag="LOCUSTAG_TE6487695"
FT                   /note="source:RepeatProteinMask"
FT                   /note="ID:TP327667"                    **(the problem start from here)**
FT   gap             4503..5478
FT                   /estimated_length=976
FT   gap             11990..13040
FT                   /estimated_length=1051

do you have a thought about this..haha..so glad that I could meet you and talk about this

ADD REPLYlink written 12 weeks ago by Jerryliu10

--expose_translations is made to keep the gff intact avoiding to edit it. What you have done is the same thing but not recommended.

I think you didn't look carefully enough your output. For every sequence you should have something like that: First the header:

ID   XXX; XXX; circular; genomic DNA; XXX; PRO; 317941 BP.
XX
AC   XXX; 
XX
AC * _ERS324955|SC|contig000001
XX
PR   Project:17285;
XX
DT   07-Sep-2018 (Rel. 133, Created)
XX
DE   XXX
XX
KW   .
XX
OS   Escherichia coli
XX
OC   cellular organisms; Bacteria; Proteobacteria; Gammaproteobacteria;
OC   Enterobacterales; Enterobacteriaceae; Escherichia.
XX
RN   [1]
RP   1-317941
RG   REFERENCE_GROUP
RT   ;
RL   Submitted (07-SEP-2018) to the INSDC.
XX

Then you will have your features:

FT   gene            complement(280900821..280901252)
FT                   /locus_tag="LOCUSTAG_TE6487695"
FT                   /note="source:RepeatProteinMask"
FT                   /note="ID:TP327667"

Then the self-generated gap features are added:

FT   gap             4503..5478
FT                   /estimated_length=976
FT   gap             11990..13040
FT                   /estimated_length=1051

Then the sequence:

XX
SQ   Sequence 317941 BP; 94633 A; 70235 C; 57246 G; 95827 T; 0 other;
     AAATATTTTT CTAATTAAGT CATCCATATA AGGACCAAAT ATACCAACTA CTAAACCAAT        60
     AATAAAACTT TTAAAATCCA TAATTACCAC CAACATATTG CTGCATAGGC TACACCTCCA       120
     AGTATAGCTC CACCTGCAGC ACCAGTTACA CCTATTCCTA TAGCAAATGG TCCCAATAGA       180
     AATGTCAAAC CGTTGTTGCA CACCCATCAA TTGCGCCATA TGCAACCCCT GCTGCACAAC       240

And this for every sequence. So I guess the features/gaps for the next chromosome is far far further in you file (after the long chr1 sequence).

ADD REPLYlink modified 12 weeks ago • written 12 weeks ago by Juke-341.7k

THANK YOU!!! so I could just add "--expose_translations" before I am going to run, right?

you mean editing the feature by user to fit into the EMBL feature list is doing the same thing as running the program using " --expose_translations" mode, so the output from these two ways is the same, except the source column displaying differently, right?

Second, as you said, I mistaken the gaps from chr1A. Because the wheat genome is very big, since now, the program has been playing a whole day, and till now what I got is an unfinished chr1A mapping result, and I still have 7 sequences to go, and this is just one part of my mission, I totally have 3 genome to play. This is so sad. SO I am wondering if the program could run faster to save me some time.

Last, my work to convert the gff file into emblem format is to generate the inout file for another program, a program to further get some improvement with the classification result from Repeatmasker, a sequence alignment engine. (this you could ignore). And the correct form of EMBL is what I desired, the description of the look the of EMBL file is like this:(you could get the detailed info here)

ID unknown; SV 1; linear; unassigned DNA; STD; UNC; 1411106 BP. XX AC unknown; XX XX FT CDS join(141960..142006,142121..142147,142248..142370,142493..142739,142850..142873) FT **/locus_tag="v443_0002_EXONERATE_BLASTX_protOSA_6" FT** /blastp_file="v443_0002_EXONERATE_BLASTX_protOSA_141960_142873_Q5JM42_Match_0005_mRNA_CDS.bltp" FT **/id="v443_0002_EXONERATE_BLASTX_protOSA_141960_142873_Q5JM42_Match_0005_mRNA_joinedCDS" FT** /note="Similar_to: hypothetical_protein" FT /note="BestBlastHit: B9EZI3_ORYSJ TrEMBL databank Putative uncharacterized protein - %25id: 91.67 - hcov: 13.78 - qcov: 100.00" FT /note="Status: High Confidence" FT CDS

when you focus on the "id" part, which is not like the look in my outoutfile, which is showing "note="ID:TP327667", so what I should do is edit the part into desired form manually afterwards?

thank you!!

ADD REPLYlink written 12 weeks ago by Jerryliu10

You should first run --expose_translations without any other option. It will create local json files that I call "mapping files". You should edit the corresponding one as explained in the git repository. Then you will run EMBLmyGFF3 from the same place, it will read the local json file and translate on the fly the feature type of your gff3 (column 3) into the chosen ENA feature type.

The performance of the tool is indeed not efficient. It is planned to improve it, we have a solution in mind, but it has to be implemented... you could write a word here to show that you are interested in this enhancement. It could speed up our interest in implementing it.

Then I'm a bit sceptical about the format you are expecting. Indeed, /blastp_file, /id are not EMBL qualifiers (check here). We have an option to force to keep feature types not accepted by EMBL, but none to force to keep qualifier not accepted by EMBL. So you will not be able to create them, they have to be saved in an accepted qualifier, by default it will be the /note one.

ADD REPLYlink written 12 weeks ago by Juke-341.7k

Hey, sir, I have something to tell you. The program I started four days ago is still not finished, and the program looked like it got stuck in the outputting of the sequence, which is the step you mentioned before, like this, XX

SQ   Sequence 317941 BP; 94633 A; 70235 C; 57246 G; 95827 T; 0 other;
     AAATATTTTT CTAATTAAGT CATCCATATA AGGACCAAAT ATACCAACTA CTAAACCAAT        60
     AATAAAACTT TTAAAATCCA TAATTACCAC CAACATATTG CTGCATAGGC TACACCTCCA       120
     AGTATAGCTC CACCTGCAGC ACCAGTTACA CCTATTCCTA TAGCAAATGG TCCCAATAGA       180
     AATGTCAAAC CGTTGTTGCA CACCCATCAA TTGCGCCATA TGCAACCCCT GCTGCACAAC       240

I tracked the progress very carefully, and I found that the running of the previous steps was actually okey, which took within one day. However, From sep 20-23, the program seemed got stuck, it was just listing the sequence of chr1A in a very slow race. chr1A of the Wheat is 593,586,810 BP, and till now the listing is at the position of 25,082,520 BP. I have made another run on another more advanced server, see if this one can progress more faster.

Since I am an undergraduate students, not an very experienced programmer, so I am not sure if I could be a help in the enhancement, but I have a will give a try.

Last, I figured that the part I want is like this :

FT   mRNA            303437..303917                           
FT                   /locus_tag="LOCUSTAG_LOCUS1"             
FT                   /note="source:WEWseq_PGSB_v1"            
FT   exon            303437..303503                           
FT                   /locus_tag="LOCUSTAG_LOCUS2"             
FT                   /note="source:WEWseq_PGSB_v1"

And i think there is no way that I complete mission by this way, so may be I would do the annotation conversion manually......if the outcome of my second try is not good.

ADD REPLYlink written 12 weeks ago by Jerryliu10

Our tool is definitely not optimised for huge annotation yet. I think the tool was not stuck but just computing the next chromosome, it print by step. It computes the whole chromosome before to write the result.

I'm wondering if editing the gff file like you mentioned earlier didn't break the sanity of the file... it could as well slower even more the calcul.

ADD REPLYlink written 11 weeks ago by Juke-341.7k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1647 users visited in the last hour