Question: what is the best way to cut a lot of embl file at given position in a way to obtain a smaller embl file ?
0
gravatar for CrLs
4.9 years ago by
CrLs10
France
CrLs10 wrote:

Hello,

i have a lot of Embl files and i would like to get some sequence from it (for example 'from position : 19944 to 27500 in genome DC000456)). I know i can do it from NCBI but it will be file by file. There is any python/perl script who can do that automatically or some other site ?

 

Thanks in advance.

 

 

sequence • 2.0k views
ADD COMMENTlink modified 4.8 years ago • written 4.9 years ago by CrLs10

Is DC000456 a real EMBL accession ID for a genome? I don't find it using a web search.

ADD REPLYlink written 4.9 years ago by Neilfws48k
1
gravatar for Neilfws
4.9 years ago by
Neilfws48k
Sydney, Australia
Neilfws48k wrote:

There are libraries for both Python and Perl which include a subsequence method, so you can use them to write your own script.

For BioPython - Bio.SeqIO and Bio.SeqRecord; for BioPerl - search the beginners HOWTO for subseq.

ADD COMMENTlink written 4.9 years ago by Neilfws48k

Hey, thank you for you answer.

I know i can write it myself. I just wanted to know if there was a know tool already out. I found the .extract() tool on biopython. I'll try it this way.

Charles

ADD REPLYlink written 4.9 years ago by CrLs10

Generally if you search the web for solutions to these common sequence processing tasks, you'll find libraries for writing your own code rather than ready-made solutions. The closest thing to an existing tool might be biopieces, which I have not used (due to the many dependencies for installation), but which might go something iike this:

read_embl -i myfile.embl | extract_seq -b 19944 -e 27500
ADD REPLYlink written 4.9 years ago by Neilfws48k

Yeah, i'm aware of that. i'll write it my self. But i guess, those kind of script have been made a lot of time by scientist. It's just a waste of time overall.

Anyway, thank you for your answer (and time) !

ADD REPLYlink written 4.9 years ago by CrLs10
1
gravatar for bioslayer
4.8 years ago by
bioslayer50
New Zealand
bioslayer50 wrote:

The sfetch utility from biosquid can allow you to do that and specify the format you want to extract your sequences in, simply,

 

sfetch -d file.embl -f 19944 -t 27500 -F fasta .

 

Note the "."  at the end.

You can as well rename the seq you've extracted

sfetch -d CP001581.1.embl -f 3 -t 20 -F fasta -r "extracted_19944_27500" .
ADD COMMENTlink written 4.8 years ago by bioslayer50

Hello,

Thank you for answering. I just post a python script i did, i wasn't aware of biosquid. I'll think about it next time.

 

C.

ADD REPLYlink written 4.8 years ago by CrLs10
0
gravatar for CrLs
4.8 years ago by
CrLs10
France
CrLs10 wrote:

Hey hello,

Thx for you answer.

I wrote this litlle python script to do the job. Forgot to post it.

Fell free to use.

 

from Bio import Entrez, SeqIO
Entrez.email = "Youremail@email.com"
for i, j, start, end in [('AE009948','Nameoftheregion1', 80000, 90000),
                      ('AE009948','Nameoftheregion2', 100000, 110000)]:
    handle = Entrez.efetch(db="nucleotide", id=i, seq_start=start,
            seq_stop=end, rettype="gb")
    Fasta = open(j + '.fasta', 'a')
    for seq_record in SeqIO.parse(handle, "gb"):
        for feature in seq_record.features:
            if feature.type == "CDS":
                if 'translation' in feature.qualifiers:
                    CDS_seq = feature.qualifiers['translation'][0]
                    protid = feature.qualifiers['protein_id'][0]
                    Fasta.write(">" + protid + "_" + j + "\n" + str(CDS_seq) + "\n")
    Fasta.close()

 

This one, get a multifasta prot. not a fasta from the region.

ADD COMMENTlink modified 4.8 years ago • written 4.8 years ago by CrLs10
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1576 users visited in the last hour