Question: Eliminate duplicates skipping same key and values
4 days ago, felipelira3 (France/Angers/IRHS) wrote:

Files to test can be downloaded from https://github.com/felipelira/files_to_test.git

I want to retrieve information from several GenBank files in one folder and store it in a dictionary, so that I can build a table from it afterwards.

#!/usr/bin/env python

import os
import sys
from Bio import SeqIO
from Bio import GenBank

dict1 = {}

input_file = open(sys.argv[1], "r")

for seq_record in SeqIO.parse(input_file, "genbank"):
    for seq_feature in seq_record.features:
        if seq_feature.type=="source":
            try:
                source = seq_feature.qualifiers['organism'][0]
            except (KeyError, IndexError):
                source = 'n/a'
            try: 
                strain = seq_feature.qualifiers['strain'][0]
            except (KeyError, IndexError):
                strain = 'n/a'
            try:
                country = seq_feature.qualifiers['country'][0]
            except (KeyError, IndexError):
                country = 'n/a'
            try:
                host = seq_feature.qualifiers['host'][0]
            except (KeyError, IndexError):
                host = 'n/a'
            try:
                plasmid = seq_feature.qualifiers['plasmid'][0]
            except (KeyError, IndexError):
                plasmid = 'n/a'
            try:
                pathovar = seq_feature.qualifiers['pathovar'][0]
            except (KeyError, IndexError):
                pathovar = 'n/a'


            # Here I combine the values that I need for the table into one tuple.
            value = strain, pathovar, host, plasmid

            # Here is where I want to feed the dictionary, but skip the entry
            # if the same key and value are already present.
            if source not in dict1 and value not in dict1.values():
                dict1[source] = value
            elif source in dict1 and value != dict1[source]:
                # if source in dict1.keys() and value not in dict1.values():
                dict1[source] = value
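
The six try/except blocks above could probably be collapsed into one small helper. This is a minimal sketch, assuming `seq_feature.qualifiers` behaves like a dict that maps each qualifier name to a list of strings, which is how the script above indexes it. The names `first_qualifier` and `FIELDS` are hypothetical, and a plain dict stands in for the qualifiers so the snippet runs without Biopython.

```python
# Hypothetical helper: return the first value of a qualifier, or a default.
def first_qualifier(qualifiers, key, default="n/a"):
    values = qualifiers.get(key)
    # Covers both a missing key (KeyError) and an empty list (IndexError).
    return values[0] if values else default


# Columns wanted in the table, in order (assumed from the script above).
FIELDS = ("strain", "pathovar", "host", "plasmid")

# Stand-in for seq_feature.qualifiers of a "source" feature.
qualifiers = {
    "organism": ["Pseudomonas syringae"],
    "strain": ["ICMP 3690"],
}

source = first_qualifier(qualifiers, "organism")
value = tuple(first_qualifier(qualifiers, field) for field in FIELDS)
print(source, value)
```

Adding a column such as `country` would then only require adding its name to `FIELDS`.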

For the file Pseudomonas_syringae_pv._actinidiae_ICMP_9853.gbk, which contains 3 sequences, I get this:

{'Pseudomonas syringae pv. actinidiae ICMP 9853': ('ICMP 9853', 'actinidiae', 'Actinidia', 'n/a')}
{'Pseudomonas syringae pv. actinidiae ICMP 9853': ('ICMP 9853', 'actinidiae', 'Actinidia', 'p9853_A')}
{'Pseudomonas syringae pv. actinidiae ICMP 9853': ('ICMP 9853', 'actinidiae', 'Actinidia', 'p9853_B')}

For the other file (Pseudomonas_syringae_str.ICMP_3690_scaffold1.gbk), which is a scaffold, I get this:

{'Pseudomonas syringae': ('ICMP 3690', 'n/a', 'n/a', 'n/a')}
{'Pseudomonas syringae': ('ICMP 3690', 'n/a', 'n/a', 'n/a')}
{'Pseudomonas syringae': ('ICMP 3690', 'n/a', 'n/a', 'n/a')}
{'Pseudomonas syringae': ('ICMP 3690', 'n/a', 'n/a', 'n/a')}
{'Pseudomonas syringae': ('ICMP 3690', 'n/a', 'n/a', 'n/a')}
{'Pseudomonas syringae': ('ICMP 3690', 'n/a', 'n/a', 'n/a')}
{'Pseudomonas syringae': ('ICMP 3690', 'n/a', 'n/a', 'n/a')}

The expected result is a single key and value for the second file, and three (or more) for genomes that contain additional sequences such as plasmids.
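
A plain dict keeps only one value per key, so each plasmid value replaces the previous one stored under the same organism. One possible way to get the expected result, as a minimal sketch: map each organism to a list of distinct value tuples, and append a tuple only if it is not already there. The name `add_unique` and the hard-coded `records` list are assumptions for illustration; in the script, each `(source, value)` pair would come from the parsing loop.

```python
# Hypothetical helper: store `value` under `source` unless it is already there.
def add_unique(table, source, value):
    values = table.setdefault(source, [])
    if value not in values:
        values.append(value)


# Stand-in for the (source, value) pairs produced by the parsing loop,
# taken from the outputs shown above.
actinidiae = "Pseudomonas syringae pv. actinidiae ICMP 9853"
records = [
    (actinidiae, ("ICMP 9853", "actinidiae", "Actinidia", "n/a")),
    (actinidiae, ("ICMP 9853", "actinidiae", "Actinidia", "p9853_A")),
    (actinidiae, ("ICMP 9853", "actinidiae", "Actinidia", "p9853_B")),
    ("Pseudomonas syringae", ("ICMP 3690", "n/a", "n/a", "n/a")),
    ("Pseudomonas syringae", ("ICMP 3690", "n/a", "n/a", "n/a")),
    ("Pseudomonas syringae", ("ICMP 3690", "n/a", "n/a", "n/a")),
]

table = {}
for source, value in records:
    add_unique(table, source, value)

# One tab-separated row per distinct (organism, values) combination.
for source, values in table.items():
    for value in values:
        print("\t".join((source,) + value))
```

With this layout the scaffold file contributes one row however many identical records it holds, while the genome with plasmids contributes one row per distinct plasmid value.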
