Question: How can I convert an EMBL file into a multiFASTA file?
0
gravatar for Alec Watanabe
19 months ago by
Alec Watanabe60 wrote:

Hello guys,

I'm new to this community, so firstly I ask you to understand if I make any mistakes regarding how this forum works and also for my english (it's not my first language). So... I have some output files from PGAP (pan-genomes analysis pipeline), this tool allows me to verify the core genome, accessory genes and species-specific genes. I'm currently working with bacteria (3 species from genus Proteus) and I need to get all the sequences from the core genome and make it all into a multiFASTA file (to use as input to Vaxign). The problem is that PGAP doesn't generate FASTA files, instead the output files formats are EMBL, .nuc, .pep and other formats. Can I convert these EMBL files into a multiFASTA file? I tried using seqretsplit to split the EMBL file into multiple EMBL files with one sequence each (to convert them to FASTA and merge them right after), but even after reading the manual I couldn't figure out how to do it properly (it was only making a file with a single sequence). Is there any other way to create a multiFASTA file? Can I use seqretsplit to convert the EMBL file to a multiFASTA file? If so, how do I do that? I tried using online converters but I guess the files are too big. I apologize for my lack of computational skills, I'm new to the bioinformatics field. Thanks in advance!

conversion embl fasta • 963 views
ADD COMMENTlink modified 19 months ago • written 19 months ago by Alec Watanabe60

it was only making a file with a single sequence

Are you sure? Was that already in multi-fasta format or did it have just one sequence as you wrote?

ADD REPLYlink modified 19 months ago • written 19 months ago by genomax71k

The EMBL file had several CDSs, when I tried to convert to multiFASTA I checked the file and it was actually a singlefasta. The command I used was:

seqretsplit -sequence Proteus_mirabilis_ARLG2970.emblV1.embl -sformat1 embl -supper1 -outseq Proteus_mirabilis_ARLG2970 -osformat2 fasta -osdirectory2 /home/alec/Documents/Proteus/Proteus_3genomes/ -ossingle2

Maybe I'm doing this wrong?

ADD REPLYlink modified 19 months ago by genomax71k • written 19 months ago by Alec Watanabe60

I am sure you must have checked the command syntax. File format conversions are always tricky. Is it possible to post a small example of the file you are dealing with?

ADD REPLYlink written 19 months ago by genomax71k

Here are the first lines of the EMBL file:

ID   ABVP01000001.1; SV 1; linear; unassigned DNA; STD; UNC; 3747729 BP.
XX
AC   ABVP01000001.1;
XX
DE   Proteus penneri ATCC 35198
XX
XX
OS   Proteus penneri ATCC 35198
OC   Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacterales;
OC   Morganellaceae; Proteus; Proteus penneri.
XX
FH   Key             Location/Qualifiers
FH
FT   source          1..3747729
FT                   /mol_type="genomic DNA"
FT                   /db_xref="taxon:471881"
FT                   /genome_md5=""
FT                   /project="aktwatanabe_471881"
FT                   /genome_id="471881.8"
FT                   /organism="Proteus penneri ATCC 35198"
FT   CDS             complement(2907..3704)
FT                   /locus_tag="Proteus_penneri_ATCC35198_0001"
FT                   /db_xref="SEED:fig|471881.8.peg.4"
FT                   /translation="MQQTSYCTKLKNRKATNFTLSIVILRTNCNTTESHFSDKVVYCNC
FT                   DDPRVSNFFKYFAVNFDNLGLKKLIASCYVENKEGFSSSEAAKNGFYYEYHKENGKKLV
FT                   FDDISVSSFCGDGDFRSSESIDLLKKSDIVVTNPPFSLFREYLDQLIKYDKKFLIIANV
FT                   NSITYKEVFNLIKENKIWLGVHLGRGVSGFIVPEHYELYGTEARIDSNGNRIISPNNCL
FT                   WLTNLDVFIRHKDLPLTRKYFGNESSYPKYDKSEFATHFAKIE"
FT                   /product="Modification methylase EcoRI (EC 2.1.1.72)"
FT                   /EC_number="2.1.1.72"
FT                   /transl_table=11
FT   CDS             3942..4364
FT                   /locus_tag="Proteus_penneri_ATCC35198_0002"
FT                   /db_xref="SEED:fig|471881.8.peg.5"
FT                   /db_xref="GO:0008565"
FT                   /db_xref="GO:0009306"
FT                   /db_xref="GO:0016020"
FT                   /translation="MVFTGSINAVDPLIGPSRTVQVQAVLPNTDNKLKAGMFARVQVTS
FT                   PDSPMVLTVPETAVTYTAYGDTVFVAQSDNQKNLIAKRVSVKVGLRYNGMIEIKEGLTL
FT                   GDQIVSSGQIKLSDGISIEPIEQDTLALAQSATTKP"
FT                   /product="Probable Co/Zn/Cd efflux system membrane fusion
FT                   protein"
FT                   /transl_table=11
ADD REPLYlink modified 19 months ago by genomax71k • written 19 months ago by Alec Watanabe60

You mention you also have .nuc and .pep files. Are those then not the multi-fasta files you're looking for? I'm not familiar with the output of PGAP but it could the EMBL file is a single entry (eg 1 genome or chomosome) with multiple CDS on it. so then you can't split the EMBL file since it is already a single entry. (you could check if there are multiple lines in the PGAP EMBL output file that only contain ' // ' == the record separator).

ADD REPLYlink written 19 months ago by lieven.sterck5.8k

The .pep files have amino acid sequences but I need a FASTA with nucleic acid sequences, and both .nuc and .pep have just the complete set of genes of each genome I've put to analysis in PGAP, therefore they do not present the core genome. I took the easy way out and asked for a friend's help, he wrote a script to extract and merge some PGAP output files:

1.Orthologs_Cluster.txt (which has the information about the core genome and also duplications);

.nuc files (which has the sequence I need for making a multiFASTA file).

It worked great for extracting the specific infos and merging them into a single file, but the file ended up being deformatted. But I guess I can edit the column size and also edit the header. I'm trying to figure it out. This is a way that I've found to try to make a multiFASTA, but it still takes a lot of time. Unfortunately the output files from PGAP aren't really good for visualization.

ADD REPLYlink modified 19 months ago • written 19 months ago by Alec Watanabe60

To get the nucleotide sequenes convert the CDS coordinates to GTF then extract these coordinates from reference with one of the many options bedtools getfasta or samtools faidx etc.

ADD REPLYlink written 19 months ago by Istvan Albert ♦♦ 81k

Does your EMBL file contain the genomic sequence as well, or do you have those in a separate file?

seqretsplit will not help you, that's to split a (multi) file in single files. What you want to do is to extract the cds sequences described in the embl file.

ah, and depending on what you want to visualise, EMBL is a fine format (you can eg. easily load in a genome browser tool to visualise the gene annotation) ;-) That's btw one way (slow and cumbersome though, but if you only need to process a few?) to convert it to fasta format. Load the embl file and the sequence in a genomebrowser tool (eg. igv, genomeview, ... ) and then export the CDS in fasta format

ADD REPLYlink written 19 months ago by lieven.sterck5.8k

Looks like these are reference genomes from human microbiome project? Why not get the fasta format sequences from NCBI entries. e.g. ABVP00000000.1 Proteus penneri ATCC 35198

ADD REPLYlink written 19 months ago by genomax71k
2
gravatar for Alec Watanabe
19 months ago by
Alec Watanabe60 wrote:

Hello guys,

This is just an update for this post and also in respect for those who contributed to this question. After some thought, I opened the EMBL file (output file from PGAP) using Artemis. In Artemis, we can select all CDS features, then write all amino acids from the selected features into a new file (that we can just type .fasta in the end). This creates a file with all sequences in amino acids and also in multifasta format. I tried to use this file in MEDpipe and the webtool accepted it, so I presume it's okay. I didn't test BioPython as jrj.healey previously suggested, so I can't say if it works or not in this particular situation.

ADD COMMENTlink modified 19 months ago • written 19 months ago by Alec Watanabe60

What about this: C: How can I convert an EMBL file into a multiFASTA file?

ADD REPLYlink written 19 months ago by genomax71k

Thanks for following up.

ADD REPLYlink written 19 months ago by Istvan Albert ♦♦ 81k

nice to hear you were able to take this hurdle ;-)

(and even more that I apparently pointed you in the right direction :-)

ADD REPLYlink written 19 months ago by lieven.sterck5.8k
0
gravatar for Joe
19 months ago by
Joe14k
United Kingdom
Joe14k wrote:

You can convert to/from FASTA and to/from EMBL formats with the SeqIO module in Biopython:

http://biopython.org/wiki/SeqIO

Something like this should work (though I've not tested it - just written off the top of my head:

#/usr/bin/env python

from Bio import SeqIO
import sys

# Run me like so:
#   $ python x2y.py infile.ext in_format outfile.ext out_format

infile = sys.argv[1]
intype = sys.argv[2]
outfile = sys.argv[3]
outtype = sys.argv[4]

SeqIO.convert(infile, intype, outfile, outtype)

So in your case you might do:

SeqIO.convert(infile, "embl", outfile, "fasta")
ADD COMMENTlink written 19 months ago by Joe14k

Not sure if that is going to help. if it is anything similar to the BioPERL one it will not parse out the CDS (or other) features from the EMBL file without specifically processing the file. The syntax you provide will just transform the sequence from embl to fasta format (not make a multifasta file with all the CDS sequences of proteins or such)

ADD REPLYlink modified 19 months ago • written 19 months ago by lieven.sterck5.8k

If there is 1 EMBL entry for each sequence of interest, this approach should work, though the OP might have to use SeqIO.parse and and a re-write, instead of just convert in that case.

ADD REPLYlink written 19 months ago by Joe14k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1771 users visited in the last hour