Parsing a genebank file and tsv file - extracting and outputting specific data to a single csv file
2
0
Entering edit mode
8 weeks ago
Zach G • 0

Hi,

I am fairly new to python and am working on writing a script that parses two input files (one in gbk format and the other in tsv format) and then outputs select data to a single csv formatted file.

"infile1"is a gbk formatted file that looks like the following:

LOCUS       scaffold_31            38809 bp    DNA              UNK 01-JAN-1980
DEFINITION  scaffold_31.
ACCESSION   scaffold_31
VERSION     scaffold_31
KEYWORDS    .
SOURCE      .
ORGANISM  .
COMMENT     ##antiSMASH-Data-START##
Version      :: 6.1.1
Run date     :: 2022-09-21 11:09:55
##antiSMASH-Data-END##
FEATURES             Location/Qualifiers
protocluster    26198..38809
/aStool="rule-based-clusters"
/category="terpene"
/contig_edge="True"
/core_location="[36197:37079](-)"
/cutoff="20000"
/detection_rule="(Terpene_synth or Terpene_synth_C or
phytoene_synt or Lycopene_cycl or terpene_cyclase or NapT7
or fung_ggpps or fung_ggpps2 or trichodiene_synth or TRI5)"
/neighbourhood="10000"
/product="terpene"
/protocluster_number="1"
/tool="antismash"


So far, I have code that parses the file and is able to extract seq_id, category and product qualifiers from the protocluster feature:

    for record in SeqIO.parse(infile1, 'genbank'): # parse through records in genebank input file
for feature in record.features: # parse through record features in input file to find protocluster feature
if feature.type == "protocluster":
category = 'Unknown'
product = 'Unknown'
if 'category' in feature.qualifiers: # pull category value from protocluster feature
category = feature.qualifiers['category'][0]
if 'product' in feature.qualifiers: # pull product value from protocluster feature
product = feature.qualifiers['product'][0]
# Extract nucl start and nucl end "ie. 26198..38809"
# calculate difference between these values to find nucl_length
line = f'{record.id},{category},{product},{nucl_start},{nucl_end},{nucl_length}\n'


Now in addition to the category, product and record.id in the protocluster feature, I also need to extract the sequence start/end which is the 26198 and 38809 values in the gbk file and then calculate the difference between these to find the sequence length. However, I am unsure how to extract this data as I did for the record.id, product, and category data.

Now the second file, infile2, is a tsv file where the tab separated columns have the following field names with data:

sequence_id           detector  detector_version    detector_label  bgc_candidate_id    nucl_start  nucl_end    product_class   nucl_length num_proteins    num_domains num_bio_domains deepbgc_score   product_activity    antibacterial   cytotoxic   inhibitor   antifungal  product_class   Alkaloid    NRP Other   Polyketide  RiPP    Saccharide  Terpene protein_ids bio_pfam_ids    pfam_ids


Here is a photo of the file with data to make it more clear what exactly it looks like:

I would like to extract the data present in the following columns of the file:

sequence_id
nucl_start
nucl_end
length
proteins
activity
class
score


The code I have for that looks like this. I am using the csv package and am reading in a second infile. I want to open the same outfile and add the data to it (which is what I am mainly unsure of at the moment)

    with open(infile2, newline=''), open(outfile, 'w', newline=''):
tsv_reader = csv.DictReader(infile2, delimiter='\t')
csv_writer = csv.DictWriter(outfile, delimter=',', fieldnames= ['sequence_id','nucl_start','nucl_end','nucl_length','num_proteins','product_activity','product_class','deepbgc_score'],extrasaction='ignore')
csv_writer.writeheader()


So the final output file should have the following columns with data in comma separated format. From the the gbk file, it should have the seq_id, category, product, nucl_start, nucl_end, and the calculated nucl_length. From the tsv file, it should also have the seq_id information, nucl_start, nucl_end, nucl_length in addition to these columns only present in the tsv file: num_proteins, product_activity, product_class, deepbgc_score. Example code for implementing this would be greatly appreciated. Thanks!

parsing tsv csv biopython genbank • 1.1k views
ADD COMMENT
1
Entering edit mode

Could you provide a couple of more lines in infile2.tsv (e.g., first five lines)?

ADD REPLY
0
Entering edit mode

Hi Andrzej Zielezinski here is what the infile2 looks like with data present in the columns

sequence_id  detector version   label   bgc_candidate_id    nucl_start nucl_end class length proteins domains bio_domains score activity antibacterial  cytotoxic inhibitor antifungal class    Alkaloid    NRP Other   Polyketide  RiPP    Saccharide  Terpene protein_ids bio_pfam_ids    pfam_ids
scaffold_1  deepbgc 0.1.0   deepbgc scaffold_1_25226-26807.1  25226  26807      1581    1   1   0   0.50636 antibacterial   0.87    0.06    0.17    0.26        0   0.09    0.07    0.33    0.21    0.03    0.32    scaffold_1_scaffold_1_20        PF16257
scaffold_1  deepbgc 0.1.0   deepbgc scaffold_1_28614-29856.1  28614 29856       1242    1   1   1   0.74074 antibacterial   0.63    0.07    0.4 0.23        0   0.46    0.09    0.23    0.06    0.03    0.18    scaffold_1_scaffold_1_22    PF00668 PF00668
scaffold_1  deepbgc 0.1.0   deepbgc scaffold_1_30349-30628.1  30349 30628   Polyketide  279 1   1   0.54142 antibacterial   0.8 0.04    0.16    0.46         Polyketide 0   0   0.01    0.92    0.01    0.01    0.05    scaffold_1_scaffold_1_24        PF07690
scaffold_2  deepbgc 0.1.0   deepbgc scaffold_2_11916-12357.1. 11916 12357       441 1   1   0   0.5283  antibacterial   0.87    0.06    0.17    0.26        0   0.09    0.07    0.32    0.24    0.01    0.3 scaffold_2_scaffold_2_13        PF04397
scaffold_2  deepbgc 0.1.0   deepbgc scaffold_2_12966-13482.1  12966 13482       516 1   1   1   0.56301 antibacterial   0.86    0.07    0.17    0.28        0   0.09    0.11    0.31    0.21    0.01    0.3 scaffold_2_scaffold_2_15    PF00857 PF00857

ADD REPLY
0
Entering edit mode

Would a link to the actual file be more clear? Its difficult to get the formatting jus right using code chunks

ADD REPLY
0
Entering edit mode

It would be great if you could share two input files (gbk and tsv). They don't have to be complete. I would then return you the working code. My last question: sequence_id maps two files (gbk and tsv). However, there are multiple lines in tsv for a given sequence_id (e.g., scaffold_1 is present three times). How do I know which line to choose (i.e., to get nucl_start, nucl_end, etc.)?

ADD REPLY
0
Entering edit mode

So for the tsv file, every line with a scaffold ID should be extracted (even the duplicated ones). So every “version” of scaffold 1, 2 etc. should be in the output file with the corresponding other metadata (ie. nucl_start, nucl_end, etc.). Let me know if that makes sense. Also, what would be the best way to send you those example files? Is there a way to attach them to a post? Or perhaps I should send the link? I'm rather new to this site. Thanks!

ADD REPLY
0
Entering edit mode

Here are the links to the files. Let me know if they work

ADD REPLY
1
Entering edit mode

Also the sequence record in the gbk is incomplete. Please provide the full sequence record from the beginning to the end (//) or more than one record.

ADD REPLY
0
Entering edit mode

Thanks, I downloaded the files. However, there are no sequence ids common between the files (could you add data to the tsv file for scaffold31?).

ADD REPLY
0
Entering edit mode

Great. So that is actually purposeful that there are no sequence ids common between the files. The idea is to add the sequence id's from both files into the same merged output file. An example of what the output should look like is this:

https://docs.google.com/spreadsheets/d/1vtO4Oq-8yJw91QutKJRjBcrGe7uwcpMJQsW37Cp2cjs/edit?usp=sharing

The columns highlighted blue represent the overlap of information (ie. types of information common among both files), the highlighted green represents information present only in the genebank file and the highlighted orange represents information present in only the tsv file. Let me know if this is clear.

Ps. here is the updated link for the complete gbk file. Hopefully this is helpful.

https://drive.google.com/file/d/1jsKDFny9-9uAsj8T1Q_edUjkCh8owCjX/view?usp=sharing

ADD REPLY
0
Entering edit mode

Thanks - now I understood. Please see my answer.

ADD REPLY
1
Entering edit mode
8 weeks ago

to get the sequence for a feature use the extract method of the SeqFeature as described in the API:

to add more columns you need to transform the iterator to have more elements per row

def add(row):
other = [ 1, 2, 3 ]
return row + other

rows = map(add, rows)


then use the elems iterator to write the rows:

csvwriter.writerows(rows)

ADD COMMENT
1
Entering edit mode
8 weeks ago

The following code produces the output you provided. Hope you find this code helpful.

from Bio import SeqIO

GBK_FILE = 'inputfile.gbk'
TSV_FILE = 'inputfile2.tsv'
OUTPUT = 'output.csv'

header = [
'sequence_id',
'Category',
'Product',
'start',
'end',
'nucl_length',
'num_proteins',
'product_activity',
'product_class',
'deepbgc_score',
]

oh = open(OUTPUT, 'w')
oh.write(",".join(header))
oh.write('\n')

# Parse GenBank
for record in SeqIO.parse(GBK_FILE, 'genbank'):
for feature in record.features:
if feature.type == "protocluster":
category = 'Unknown'
product = 'Unknown'
if 'category' in feature.qualifiers:
category = feature.qualifiers['category'][0]
if 'product' in feature.qualifiers:
product = feature.qualifiers['product'][0]
start = feature.location.start + 1
end = feature.location.end
length = end - start + 1
line = ",".join([
record.id,
category,
product,
f'{start}',
f'{end}',
f'{length}',
'',
'',
'',
''
])
oh.write(f'{line}\n')

# Parse TSV
fh = open(TSV_FILE)
fh.readline() # Skip header
for line in fh:
cols = line.rstrip().split('\t')
sequence_id = cols[0]
start = int(cols[5])
end = int(cols[6])
length = int(cols[8])
num_proteins = int(cols[9])
product_activity = cols[13]
product_class = cols[18]
deepbgc_score = cols[12]
line = ",".join([
sequence_id,
'',
'',
f'{start}',
f'{end}',
f'{length}',
f'{num_proteins}',
product_activity,
product_class,
deepbgc_score
])
oh.write(f'{line}\n')

oh.close()

ADD COMMENT
0
Entering edit mode

Thank you very much. This is helpful. Now I tried implementing the code and it did give me a NameError: name 'category' not defined referring to the category feature qualifier in the genebank parser. However, I have defined and implemented it as you have above:

# Parse genebank formatted file
for record in SeqIO.parse(GenBank_Input, 'genbank'):
for feature in record.features:
if feature.type == "protocluster":
category = 'Unknown'
product = 'Unknown'
if 'category' in feature.qualifiers:
category = feature.qualifiers['category'][0]
if 'product' in feature.qualifiers:
product = feature.qualifiers['product'][0]
start = feature.location.start + 1
end = feature.location.end
length = end - start + 1
line = ','.join([
record.id,
category,
product,
f'{start}',
f'{end}',
f'{end}',
'',
'',
'',
''
])
oh.write(f'{line}\n')


Any thoughts on why this might be?

Edit: I now see the issue is probably with my indentations

ADD REPLY
0
Entering edit mode

Yes, the issue is with your identations. In particular, the if statements are incorrectly indented.

ADD REPLY

Login before adding your answer.

Traffic: 1205 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6