How can I convert an EMBL file into a multiFASTA file?
2
0
Entering edit mode
3.2 years ago

Hello guys,

I'm new to this community, so firstly I ask you to understand if I make any mistakes regarding how this forum works and also for my english (it's not my first language). So... I have some output files from PGAP (pan-genomes analysis pipeline), this tool allows me to verify the core genome, accessory genes and species-specific genes. I'm currently working with bacteria (3 species from genus Proteus) and I need to get all the sequences from the core genome and make it all into a multiFASTA file (to use as input to Vaxign). The problem is that PGAP doesn't generate FASTA files, instead the output files formats are EMBL, .nuc, .pep and other formats. Can I convert these EMBL files into a multiFASTA file? I tried using seqretsplit to split the EMBL file into multiple EMBL files with one sequence each (to convert them to FASTA and merge them right after), but even after reading the manual I couldn't figure out how to do it properly (it was only making a file with a single sequence). Is there any other way to create a multiFASTA file? Can I use seqretsplit to convert the EMBL file to a multiFASTA file? If so, how do I do that? I tried using online converters but I guess the files are too big. I apologize for my lack of computational skills, I'm new to the bioinformatics field. Thanks in advance!

EMBL Fasta Conversion • 2.1k views
0
Entering edit mode

it was only making a file with a single sequence

Are you sure? Was that already in multi-fasta format or did it have just one sequence as you wrote?

0
Entering edit mode

The EMBL file had several CDSs, when I tried to convert to multiFASTA I checked the file and it was actually a singlefasta. The command I used was:

seqretsplit -sequence Proteus_mirabilis_ARLG2970.emblV1.embl -sformat1 embl -supper1 -outseq Proteus_mirabilis_ARLG2970 -osformat2 fasta -osdirectory2 /home/alec/Documents/Proteus/Proteus_3genomes/ -ossingle2


Maybe I'm doing this wrong?

0
Entering edit mode

I am sure you must have checked the command syntax. File format conversions are always tricky. Is it possible to post a small example of the file you are dealing with?

0
Entering edit mode

Here are the first lines of the EMBL file:

ID   ABVP01000001.1; SV 1; linear; unassigned DNA; STD; UNC; 3747729 BP.
XX
AC   ABVP01000001.1;
XX
DE   Proteus penneri ATCC 35198
XX
XX
OS   Proteus penneri ATCC 35198
OC   Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacterales;
OC   Morganellaceae; Proteus; Proteus penneri.
XX
FH   Key             Location/Qualifiers
FH
FT   source          1..3747729
FT                   /mol_type="genomic DNA"
FT                   /db_xref="taxon:471881"
FT                   /genome_md5=""
FT                   /project="aktwatanabe_471881"
FT                   /genome_id="471881.8"
FT                   /organism="Proteus penneri ATCC 35198"
FT   CDS             complement(2907..3704)
FT                   /locus_tag="Proteus_penneri_ATCC35198_0001"
FT                   /db_xref="SEED:fig|471881.8.peg.4"
FT                   /translation="MQQTSYCTKLKNRKATNFTLSIVILRTNCNTTESHFSDKVVYCNC
FT                   DDPRVSNFFKYFAVNFDNLGLKKLIASCYVENKEGFSSSEAAKNGFYYEYHKENGKKLV
FT                   NSITYKEVFNLIKENKIWLGVHLGRGVSGFIVPEHYELYGTEARIDSNGNRIISPNNCL
FT                   WLTNLDVFIRHKDLPLTRKYFGNESSYPKYDKSEFATHFAKIE"
FT                   /product="Modification methylase EcoRI (EC 2.1.1.72)"
FT                   /EC_number="2.1.1.72"
FT                   /transl_table=11
FT   CDS             3942..4364
FT                   /locus_tag="Proteus_penneri_ATCC35198_0002"
FT                   /db_xref="SEED:fig|471881.8.peg.5"
FT                   /db_xref="GO:0008565"
FT                   /db_xref="GO:0009306"
FT                   /db_xref="GO:0016020"
FT                   /translation="MVFTGSINAVDPLIGPSRTVQVQAVLPNTDNKLKAGMFARVQVTS
FT                   PDSPMVLTVPETAVTYTAYGDTVFVAQSDNQKNLIAKRVSVKVGLRYNGMIEIKEGLTL
FT                   GDQIVSSGQIKLSDGISIEPIEQDTLALAQSATTKP"
FT                   /product="Probable Co/Zn/Cd efflux system membrane fusion
FT                   protein"
FT                   /transl_table=11

0
Entering edit mode

You mention you also have .nuc and .pep files. Are those then not the multi-fasta files you're looking for? I'm not familiar with the output of PGAP but it could the EMBL file is a single entry (eg 1 genome or chomosome) with multiple CDS on it. so then you can't split the EMBL file since it is already a single entry. (you could check if there are multiple lines in the PGAP EMBL output file that only contain ' // ' == the record separator).

0
Entering edit mode

The .pep files have amino acid sequences but I need a FASTA with nucleic acid sequences, and both .nuc and .pep have just the complete set of genes of each genome I've put to analysis in PGAP, therefore they do not present the core genome. I took the easy way out and asked for a friend's help, he wrote a script to extract and merge some PGAP output files:

1.Orthologs_Cluster.txt (which has the information about the core genome and also duplications);

.nuc files (which has the sequence I need for making a multiFASTA file).

It worked great for extracting the specific infos and merging them into a single file, but the file ended up being deformatted. But I guess I can edit the column size and also edit the header. I'm trying to figure it out. This is a way that I've found to try to make a multiFASTA, but it still takes a lot of time. Unfortunately the output files from PGAP aren't really good for visualization.

0
Entering edit mode

To get the nucleotide sequenes convert the CDS coordinates to GTF then extract these coordinates from reference with one of the many options bedtools getfasta or samtools faidx etc.

0
Entering edit mode

Does your EMBL file contain the genomic sequence as well, or do you have those in a separate file?

seqretsplit will not help you, that's to split a (multi) file in single files. What you want to do is to extract the cds sequences described in the embl file.

ah, and depending on what you want to visualise, EMBL is a fine format (you can eg. easily load in a genome browser tool to visualise the gene annotation) ;-) That's btw one way (slow and cumbersome though, but if you only need to process a few?) to convert it to fasta format. Load the embl file and the sequence in a genomebrowser tool (eg. igv, genomeview, ... ) and then export the CDS in fasta format

0
Entering edit mode

Looks like these are reference genomes from human microbiome project? Why not get the fasta format sequences from NCBI entries. e.g. ABVP00000000.1 Proteus penneri ATCC 35198

2
Entering edit mode
3.2 years ago

Hello guys,

This is just an update for this post and also in respect for those who contributed to this question. After some thought, I opened the EMBL file (output file from PGAP) using Artemis. In Artemis, we can select all CDS features, then write all amino acids from the selected features into a new file (that we can just type .fasta in the end). This creates a file with all sequences in amino acids and also in multifasta format. I tried to use this file in MEDpipe and the webtool accepted it, so I presume it's okay. I didn't test BioPython as jrj.healey previously suggested, so I can't say if it works or not in this particular situation.

0
Entering edit mode
0
Entering edit mode

Thanks for following up.

0
Entering edit mode

nice to hear you were able to take this hurdle ;-)

(and even more that I apparently pointed you in the right direction :-)

0
Entering edit mode
3.2 years ago
Joe 19k

You can convert to/from FASTA and to/from EMBL formats with the SeqIO module in Biopython:

http://biopython.org/wiki/SeqIO

Something like this should work (though I've not tested it - just written off the top of my head:

#/usr/bin/env python

from Bio import SeqIO
import sys

# Run me like so:
#   \$ python x2y.py infile.ext in_format outfile.ext out_format

infile = sys.argv[1]
intype = sys.argv[2]
outfile = sys.argv[3]
outtype = sys.argv[4]

SeqIO.convert(infile, intype, outfile, outtype)


So in your case you might do:

SeqIO.convert(infile, "embl", outfile, "fasta")

0
Entering edit mode

Not sure if that is going to help. if it is anything similar to the BioPERL one it will not parse out the CDS (or other) features from the EMBL file without specifically processing the file. The syntax you provide will just transform the sequence from embl to fasta format (not make a multifasta file with all the CDS sequences of proteins or such)

0
Entering edit mode

If there is 1 EMBL entry for each sequence of interest, this approach should work, though the OP might have to use SeqIO.parse and and a re-write, instead of just convert in that case.