I am very new to the world of python (or any coding), but have installed biopython which I believe will help me do what I am trying to accomplish. I am working with RNAseq data for my project, and wanting to run my genelists through DAVID to look at enrichment. The output of the RNAseq pipeline thru bowtie/tophat/cuffdiff, gives me log fold increases for each gene, with the chromosome identifier of SMaxxxxx. That would be fine, but DAVID does not seem to be able to handle the fact that my organism has 3 replicons (1 chromosome and 2 megaplasmids), that are SMa, SMb and SMc. When I try to use these identifiers as the gene list, DAVID is only able to find 1/3 (ie. one replicon) and identify them as genes. I was playing around and it seems like it can identify genes from all replicons if I use the gene identifier such as lacI, manB, etc. In the genbank file the gene notation I want is labelled: gene="xxxX"
I want to do two things:
- Go through each replicons genbank file and pull out the gene identifier for each gene in the 4 letter format
- Feed in a list of 150 genes in the SMaxxxx format, and have it output the 4 letter gene identifier for each gene
So far I have been able to read in a genbank file, but not pull out the information I want.
#to read in list of genes I want to pull out with open('synthgenelistfordavid.txt') as f: content = f.readlines() wanted = [line.rstrip('\n') for line in open('synthgenelistfordavid.txt')] #then I've gotten this far, import file and set up a for loop from Bio import SeqIO genbank_file = "sma.gbk" for record in SeqIO.parse("sma.gbk", "genbank"):