Hi,
I am fairly new to python and am working on writing a script that parses two input files (one in gbk format and the other in tsv format) and then outputs select data to a single csv formatted file.
"infile1"is a gbk formatted file that looks like the following:
LOCUS scaffold_31 38809 bp DNA UNK 01-JAN-1980
DEFINITION scaffold_31.
ACCESSION scaffold_31
VERSION scaffold_31
KEYWORDS .
SOURCE .
ORGANISM .
COMMENT ##antiSMASH-Data-START##
Version :: 6.1.1
Run date :: 2022-09-21 11:09:55
##antiSMASH-Data-END##
FEATURES Location/Qualifiers
protocluster 26198..38809
/aStool="rule-based-clusters"
/category="terpene"
/contig_edge="True"
/core_location="[36197:37079](-)"
/cutoff="20000"
/detection_rule="(Terpene_synth or Terpene_synth_C or
phytoene_synt or Lycopene_cycl or terpene_cyclase or NapT7
or fung_ggpps or fung_ggpps2 or trichodiene_synth or TRI5)"
/neighbourhood="10000"
/product="terpene"
/protocluster_number="1"
/tool="antismash"
So far, I have code that parses the file and is able to extract seq_id, category and product qualifiers from the protocluster feature:
for record in SeqIO.parse(infile1, 'genbank'): # parse through records in genebank input file
for feature in record.features: # parse through record features in input file to find protocluster feature
if feature.type == "protocluster":
category = 'Unknown'
product = 'Unknown'
if 'category' in feature.qualifiers: # pull category value from protocluster feature
category = feature.qualifiers['category'][0]
if 'product' in feature.qualifiers: # pull product value from protocluster feature
product = feature.qualifiers['product'][0]
# Extract nucl start and nucl end "ie. 26198..38809"
# calculate difference between these values to find nucl_length
line = f'{record.id},{category},{product},{nucl_start},{nucl_end},{nucl_length}\n'
Now in addition to the category, product and record.id in the protocluster feature, I also need to extract the sequence start/end which is the 26198 and 38809 values in the gbk file and then calculate the difference between these to find the sequence length. However, I am unsure how to extract this data as I did for the record.id, product, and category data.
Now the second file, infile2, is a tsv file where the tab separated columns have the following field names with data:
sequence_id detector detector_version detector_label bgc_candidate_id nucl_start nucl_end product_class nucl_length num_proteins num_domains num_bio_domains deepbgc_score product_activity antibacterial cytotoxic inhibitor antifungal product_class Alkaloid NRP Other Polyketide RiPP Saccharide Terpene protein_ids bio_pfam_ids pfam_ids
Here is a photo of the file with data to make it more clear what exactly it looks like:
I would like to extract the data present in the following columns of the file:
sequence_id
nucl_start
nucl_end
length
proteins
activity
class
score
The code I have for that looks like this. I am using the csv package and am reading in a second infile. I want to open the same outfile and add the data to it (which is what I am mainly unsure of at the moment)
with open(infile2, newline=''), open(outfile, 'w', newline=''):
tsv_reader = csv.DictReader(infile2, delimiter='\t')
csv_writer = csv.DictWriter(outfile, delimter=',', fieldnames= ['sequence_id','nucl_start','nucl_end','nucl_length','num_proteins','product_activity','product_class','deepbgc_score'],extrasaction='ignore')
csv_writer.writeheader()
So the final output file should have the following columns with data in comma separated format. From the the gbk file, it should have the seq_id, category, product, nucl_start, nucl_end, and the calculated nucl_length. From the tsv file, it should also have the seq_id information, nucl_start, nucl_end, nucl_length in addition to these columns only present in the tsv file: num_proteins, product_activity, product_class, deepbgc_score. Example code for implementing this would be greatly appreciated. Thanks!
Could you provide a couple of more lines in infile2.tsv (e.g., first five lines)?
Hi Andrzej Zielezinski here is what the infile2 looks like with data present in the columns
Would a link to the actual file be more clear? Its difficult to get the formatting jus right using code chunks
It would be great if you could share two input files (gbk and tsv). They don't have to be complete. I would then return you the working code. My last question:
sequence_id
maps two files (gbk and tsv). However, there are multiple lines in tsv for a given sequence_id (e.g., scaffold_1 is present three times). How do I know which line to choose (i.e., to get nucl_start, nucl_end, etc.)?So for the tsv file, every line with a scaffold ID should be extracted (even the duplicated ones). So every “version” of scaffold 1, 2 etc. should be in the output file with the corresponding other metadata (ie. nucl_start, nucl_end, etc.). Let me know if that makes sense. Also, what would be the best way to send you those example files? Is there a way to attach them to a post? Or perhaps I should send the link? I'm rather new to this site. Thanks!
Here are the links to the files. Let me know if they work
Link to tsv file: https://drive.google.com/file/d/1K7A1P0Xr_lMdTvjO5ojJ28SfXtXfr5st/view?usp=sharing
Link to gbk file: https://drive.google.com/file/d/1IX5czKq4bCQldIf271cqTc-pCJF_aLUE/view?usp=sharing
Also the sequence record in the gbk is incomplete. Please provide the full sequence record from the beginning to the end (//) or more than one record.
Thanks, I downloaded the files. However, there are no sequence ids common between the files (could you add data to the tsv file for scaffold31?).
Great. So that is actually purposeful that there are no sequence ids common between the files. The idea is to add the sequence id's from both files into the same merged output file. An example of what the output should look like is this:
https://docs.google.com/spreadsheets/d/1vtO4Oq-8yJw91QutKJRjBcrGe7uwcpMJQsW37Cp2cjs/edit?usp=sharing
The columns highlighted blue represent the overlap of information (ie. types of information common among both files), the highlighted green represents information present only in the genebank file and the highlighted orange represents information present in only the tsv file. Let me know if this is clear.
Ps. here is the updated link for the complete gbk file. Hopefully this is helpful.
https://drive.google.com/file/d/1jsKDFny9-9uAsj8T1Q_edUjkCh8owCjX/view?usp=sharing
Thanks - now I understood. Please see my answer.