I am a bioinformatics intern and I have a problem.
I have two files:
- f1: one containing the list of viruses infecting several kinds of bacteria
- f2: another containing a list of orthologous groups, where each line represents one orthologous group with its proteins. The bacterial loci present in the first file are found among these proteins. Put simply, I need to work out how conserved the orthologous groups are within each bacterial group (E. coli, ...): first determine the list of orthologous groups for the bacterial group, then measure their conservation within that group for each of the bacterial species I have. The idea is then to make a bar chart of these values (in R).
This is how I wanted to proceed:
I first built a dictionary (from f1) that maps each bacterium to the loci of the viruses infecting it: {'E.coli': ['JDHTG_45', 'ABTD_65', 'JUIDL_345', ...], 'Listeria': ['JHSY_65', 'GTSRF_34', ...]} (these are not the real locus names).
Then, for each line of the f2 file, I would like to look up which of its loci appear in the dictionary I made, and record for that line which species are present.
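To make the idea concrete, here is a minimal Python sketch of what I have in mind. It assumes that each line of f2 holds whitespace-separated locus identifiers, which may not match the real file format. The function names and the example data are made up for illustration. Inverting the dictionary first (locus to species) means each locus on a line needs only a single lookup instead of a search through every species' list.

```python
from collections import Counter


def invert(species_to_loci):
    """Build a locus -> species mapping from the species -> loci dictionary."""
    return {locus: species
            for species, loci in species_to_loci.items()
            for locus in loci}


def species_per_group(lines, locus_to_species):
    """For each line (one orthologous group), return the set of species present."""
    result = []
    for line in lines:
        tokens = line.split()  # assumption: identifiers are whitespace-separated
        if not tokens:
            continue  # skip blank lines
        # Tokens that are not known loci (e.g. a group ID) are simply ignored.
        result.append({locus_to_species[t] for t in tokens if t in locus_to_species})
    return result


def count_groups_per_species(groups):
    """Count, for each species, the number of groups in which it appears."""
    counts = Counter()
    for present in groups:
        counts.update(present)
    return counts


# Made-up example data, using placeholder locus names.
species_to_loci = {
    "E.coli": ["JDHTG_45", "ABTD_65", "JUIDL_345"],
    "Listeria": ["JHSY_65", "GTSRF_34"],
}
f2_lines = [
    "JDHTG_45 JHSY_65 XXXX_1",  # XXXX_1 is not in the dictionary
    "ABTD_65 JUIDL_345",
    "GTSRF_34",
]

locus_to_species = invert(species_to_loci)
groups = species_per_group(f2_lines, locus_to_species)
counts = count_groups_per_species(groups)
```

The per-species counts could then be written to a CSV file and read into R for the bar chart.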
I don't know if this is the right way to proceed, so I'd like to know whether there is a simpler approach.
Thank you in advance for taking the time to read this post.