Question: Slicing Genbank File by 'gene_id' range
0
gravatar for b.d237
11 months ago by
b.d237160
National University of Ireland, Galway
b.d237160 wrote:

Hi guys,

I have searched quite extensively for this, to no avail.

I have no problem extracting Gene_ids from a genbank file when given a list of gene_ids to extract:

for seq_record in SeqIO.parse(gbk_file, "genbank"):
    for feat in seq_record.features:
        if feat.type == "CDS":
            for gene in genes:
                if gene in feat.qualifiers['gene_id']:
                    seq = feat.qualifiers['translation']
                    print(gene)
                    print(seq)

Where "gene in genes" is a variable storing each desired Gene_id in a list.

However

I would like to provide a start gene, and an end gene as my range so I can extract a Gene Cluster in fasta format. Is this possible?

I will be looping through the start and end gene_id's in a for loop as follows:

for x, y, z in zip(start_list, end_list, cluster_list):

Once the gene_id range is captured (along with its Amino Acid sequence), I will output them to a unique file using:

with open("%s" % (z), "w") as outfile:
outfile.write(">%s\n%s" % (gene, sequence))

Any help in capturing a range of gene_ids would be greatly appreciated.

genbank slicing python • 512 views
ADD COMMENTlink modified 11 months ago by Joe14k • written 11 months ago by b.d237160

Would it be possible to use start and end coordinates range instead of gene IDs to extract genes within a particular slice?

ADD REPLYlink modified 11 months ago • written 11 months ago by Sej Modha4.4k

yep thats my plan, use the gene_id to then capture the corresponding location :)

Check out my answer to jrj.healey

ADD REPLYlink modified 11 months ago • written 11 months ago by b.d237160
4
gravatar for Joe
11 months ago by
Joe14k
United Kingdom
Joe14k wrote:

It shouldn't be too difficult to do. BioPython allows you to slice SeqRecords using standard python notation.

The important bits are that sequence features have an accessible location you can use, and then chopping out the bit you need is as simple as subrecord = record[start:end]. Biopython does extra magic, by keeping the subrecord in the same format as the original record, so when you come to write out the data, you still have all the features/translations etc. If you wanted to write out the genes only, you could just loop over the features in the subrecord and write them as fastas.

This essentially borrows the same idea that i used for a script some time ago: https://github.com/jrjhealey/bioinfo-tools/blob/master/Genbank_slicer.py

(except that that script is designed to use either known indices, or a complementary FASTA sequence to figure out what sequence span to subset)

Edit: Updated code, now handles multigenbanks, and the reverse strand logic should be correct.

Edit II: Code updated again. Realised the strand switching logic actually wasn't needed as biopython stores everything relative to the forward strand anyway. Added argparse, and also fixed an error in how multigenbanks are processed (it was slicing all records not just the record under iteration at that moment).

Invoke the code as per the usage comments:

$ python slice_genes.py -i infile.gbk -r GeneA:GeneB

I was using locus tags to test this, you will want to change the 4 occurrences of 'locus_tag' back to 'gene_id' (or whatever other qualifier you like

I'm fairly sure this is behaving correctly now. The script will print to screen though, so I'll leave writing a file to you. I've used sys.stderr specifically so you can pipe/redirect STDOUT if you'd prefer to do that instead of writing a file

ADD COMMENTlink modified 11 months ago • written 11 months ago by Joe14k
import pandas as pd 
from Bio import SeqIO
from Bio import SeqFeature

df = pd.read_csv('27SEPTtest.txt', sep='\t', names=['Cluster', 'Start', 'Stop'])
start_gene = df["Start"].tolist()
end_gene = df["Stop"].tolist()
cluster_name = df["Cluster"].tolist()

gbk = "antiSMASH.gbk"
record = next(SeqIO.parse("antiSMASH.gbk", "genbank"))

for x, y, z in zip(start_gene, end_gene, cluster_name):
    for seq_record in SeqIO.parse(gbk, "genbank"):
        for feat in seq_record.features:
            if feat.type == "CDS":
                if x in feat.qualifiers["gene_id"]:

                    cluster_start = feat.location.start

                else:

                    if y in feat.qualifiers["gene_id"]:

                        cluster_end = feat.location.end

                        subrecord = record[cluster_start:cluster_end]

                        filename = "%s.gbk" % z

                        SeqIO.write(subrecord, filename, "gb")

Here is my attempt. I'd place myself at "just above noob level" and not used to using your style of for loops -- so I used this way instead.

I'm at a loss because the "cluster_start" and "cluster_end" strings correspond to the correct location in the genbank file. However when I call "record[cluster_start:cluster_end]" and write it to a file, it generates a genbank file where the captured genes are about 100 genes downstream of my actual target.

I feel like I am very close to solving this >:(

Output can be genbank or Fasta (protein seq) because its going to be used for multigeneblast.

Perhaps useful to note, I am parsing a Genbank file created from antiSMASH. Yes I know they have cluster'XX'.gbk files that can be downloaded, but I am trying to extract different Clusters (predicted by SMIPS/CASSIS).

ADD REPLYlink modified 11 months ago • written 11 months ago by b.d237160
1

Yep I’m familiar with both tools. Give me a little more time and I’ll make something properly functional :)

Is your input file a multi-genbank?

ADD REPLYlink modified 11 months ago • written 11 months ago by Joe14k

I believe it is a multigenbank file. It has each Contig in it seperated by //.

ADD REPLYlink written 11 months ago by b.d237160
1

Ok, I thought that might be the case, that does complicate things slightly but should still be doable.

(Alternatively you could merge the genbanks)

ADD REPLYlink written 11 months ago by Joe14k

Sorry I got confused by the terminology. I am parsing the "final.gbk" file generated by antismash. One file, that I'm guessing is merged together by antismash during analysis.

ADD REPLYlink written 11 months ago by b.d237160
1

I've updated my post with some properly functional code. I still didn't test it very thoroughly, but the slicing seems to work for both strands, and it handled a multigenbank of 2 sequences.

You might have to tweak it to read from your dataframe and/or write a file for each record (I haven't done this because you will have to append an identifier or iterator to the file name (or allow appending) else the loop will cause your file to be overwritten for each input sequence).

ADD REPLYlink written 11 months ago by Joe14k
1

Also just be aware, if you copy and paste the code, that there is a known formatting bug on the forum that stops some brackets rendering properly.

The line:

sys.stderr.write('Operating on {}.\n'.formatrecord.id))

Should have a ( between format and record.id.

ADD REPLYlink written 11 months ago by Joe14k
1

Thanks a million for your time. I'll give it a whirl when I get home. Get off the laptop and enjoy a pint 🤙🏻

ADD REPLYlink written 11 months ago by b.d237160
1

Comment content no longer relevant, main post updated.

ADD REPLYlink modified 11 months ago • written 11 months ago by Joe14k

FYI, I've updated the code again as I realised the strand logic actually isnt necessary, and there actually was an error in how multigenbanks were processed still.

I think the code is now fully correct.

ADD REPLYlink written 11 months ago by Joe14k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1999 users visited in the last hour