Question: Filtering bacterial genes by annotations and finding common genes between genomes
0
gravatar for dch014
7 weeks ago by
dch0140
dch0140 wrote:

Dear Biostars community,

I am working on a codon usage bias analysis on bacterial genomes, and I would like to create a reference set of genes that are predicted to be highly expressed. I am therefore wondering if there is a way to programmatically filter annotated genes retrieved from GenBank? Specifically, I have several large txt files for each of my bacteria with annotated genes in the following format:

lcl|NC_002947.4_cds_WP_003250662.1_2506 [gene=infC] [locus_tag=PP_RS12860] [protein=translation initiation factor IF-3] [protein_id=WP_003250662.1] [location=2811534..2812067] [gbkey=CDS] ATGAGAAACGATAAGCGTGCTGTACCGAAGGCCCCGATCAACGAGAATATCTCGGCCCGCGAGGTTCGGT TAATTGGCGCAGACGGCGAGCAGGTTGGCATCGTCTCGATTGATGAAGCGCTGCGTATCGCTGATGAAGC GAAGCTGGACCTGGTAGAAATCTCTGCAGACGCGGTACCACCCGTCTGCAAGGTCATGGACTACGGTAAG CACCTCTTCGAGAAGAAGAAGCAGGCTAACGAAGCCAAGAAAAACCAGAAACAGATCCAGATCAAAGAAA TCAAGTTTCGTCCAGGGACGGAGGATGGGGATTACCAGGTAAAACTACGCAACCTGGTACGTTTCCTTAC CGATGGGGACAAGGCCAAGATCTCTCTGAGATTCCGTGGTCGTGAGATGGCCCACCAGGAGCTGGGCATG GAGCTGTTGAAGCGGGTCGAAGCCGACCTCGCCGAATACGGCACCGTTGAGCAGCATCCGAAGATGGAAG GACGCCAGCTTATGATGGTCATCGCCCCCAAGAAAAAGAAGTAA lcl|NC_002947.4_cds_WP_003250667.1_2507 [gene=rpmI] [locus_tag=PP_RS12865] [protein=50S ribosomal protein L35] [protein_id=WP_003250667.1] [location=2812128..2812322] [gbkey=CDS] ATGCCAAAAATGAAAACCAAGAGCGGTGCTGCGAAGCGCTTCCTGAAGACAGCTTCGGGCTTCAAGCACA AGCACGCTTTCAAGAGCCACATCCTGACCAAAATGTCGACCAAGCGTAAGCGTCAACTGCGCGGTGCCAG CTTGCTGCACCCGTCCGACGTCGCAAAAGTCGAGCGCATGCTGCGCGTTCGTTAA lcl|NC_002947.4_cds_WP_003250671.1_2508 [gene=rplT] [locus_tag=PP_RS12870] [protein=50S ribosomal protein L20] [protein_id=WP_003250671.1] [location=2812350..2812706] [gbkey=CDS] ATGGCTCGTGTTAAGCGCGGCGTTATCGCTCGTAAGCGTCACAAAAAAATTCTGAAACTGGCTAAAGGTT ACTACGGTGCACGTTCGCGCGTATTCCGTGTAGCCAAGCAAGCGGTCATCAAGGCAGGCCAATACGCCTA CCGCGACCGTCGCCAGAAGAAGCGTCAGTTCCGCGCACTGTGGATCGCTCGTATCAACGCCGGTGCCCGC ACCAACGGTCTGTCCTACAGCCGTCTGATTGCTGGCCTGAAAAAGGCTTCGATCGAAATCGACCGTAAGG TTCTGGCTGATCTGGCAGTGAACGAAAAAGCGGCGTTTGCTGCGATTGTCGAGAAAGCTAAAGCCGTTCT GGCTTAA

Is there any way I can filter/select/pull only the genes that are annotated as "ribosomal"/"chaperone"/etc.? I believe KEGG database has this "filtering by annotation" feature, but not all the genomes of my bacteria can be found in KEGG. The file format of my genes can also be of any other kind of format downloadable from GenBank (the format of the example above is "cds_from_genome.fa", I am using it for now because it is readable using coRdon package in R).

Also, after I have my reference sets of genes from each genome of my bacteria, I would like to compare them and find genes common across all my genomes, any tips for that as well?

Much thanks!

-D

ADD COMMENTlink modified 2 days ago by Biostar ♦♦ 20 • written 7 weeks ago by dch0140

If you're trying to extract these keywords from "cds_from_genome.fa", you can try:

grep -i -E "ribosomal|chaperone" cds_from_genome.fa

You only need to search the headers, so if you have other files that do not have the sequences, or have ">" before the headers so that you can only get the headers (using grep ">" or awk ), and then search for the keywords, that might be more efficient.

ADD REPLYlink modified 2 days ago • written 2 days ago by Fatima820
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2261 users visited in the last hour