Question: How To Study The Characteristics Of Neighboring Genes (Of A Query) Of A Bacterial Species?
0
gravatar for spandankeshav
5.6 years ago by
Finland
spandankeshav60 wrote:

Hi, I am trying to analyze the function of certain genes along with the predefined characteristics (if any), which locates nearby of the query gene (neighboring genes may be 5 genes forward and backwards necessarily from the query) of bacteria species (mainly streptomyces species) in a whole genome. Is there any browser/software to perform this task? I have noticed NCBI genomic browser but didn't find any relevant information regarding what I am looking for as most of the genes allocated are not reviewed (so lack of information)....Any suggestion/views would be great! Thanks! -Spandan

genes • 1.8k views
ADD COMMENTlink modified 5.6 years ago by Alex Reynolds28k • written 5.6 years ago by spandankeshav60

Is your question how to efficiently get a program to find the 5 genes in front/behind each gene in the strep. genome, how to assess the function of those genes, or both?

ADD REPLYlink written 5.6 years ago by Devon Ryan90k

Actually, Both!

ADD REPLYlink written 5.6 years ago by spandankeshav60

The first part is "relatively easy" with GenomicRanges, though I've never tried to use that on a circular genome (I think there's support for that). Of course, you'd need an annotation (GTF or GFF, normally) for the genome, which I presume exists.

For the functional annotation part, I guess it's a question of whether there are already gene ontology mappings for strep. and if that's sufficient for your needs (I obviously don't work on prokaryotes, so I don't know what else might be out there for them). If you have something else in mind, then you might update your question to mention that.

If that turns out to be exactly what you need (well, it gets you the information, what you do with it from there is up to you), then first start playing around with GenomicRanges in R to see if you can get it working properly. You can then update your question (or close it and post a new one if that makes more sense) with more precise questions about the various steps that you're trying to perform.

ADD REPLYlink written 5.6 years ago by Devon Ryan90k

Have you tried: http://microbes.ucsc.edu

ADD REPLYlink written 5.6 years ago by microbeatic80
2
gravatar for Alex Reynolds
5.6 years ago by
Alex Reynolds28k
Seattle, WA USA
Alex Reynolds28k wrote:

I am trying to analyze the function of certain genes along with the predefined characteristics (if any), which locates nearby of the query gene (neighboring genes may be 5 genes forward and backwards necessarily from the query) of bacteria species (mainly streptomyces species) in a whole genome.

If you can convert your list of query genes and all genes into two separate UCSC-formatted BED files, you can use BEDOPS to apply set operations on your query genes, to answer this question. BEDOPS is agnostic about genomes — it doesn't care if you are working with virus, bacteria or eukaryotic genomes.

The BEDOPS suite includes the closest-features tool, which finds the nearest feature (in your case, the nearest "gene" from a set of "genes") from a reference input (individual "query gene"). Overlaps are allowed. Both inputs are UCSC BED-formatted and lexicographically sorted.

As well as file input, the closest-features command accepts UNIX standard input. In other words, you can keep passing results off to the next search. I'll explain below how this makes it fast and relatively easy to find N closest genes.

First, the command:

$ closest-features QueryGenes.bed Genes.bed

will return the closest two genes in Genes.bed for each element in QueryGenes.bed. The format of this output is, generally (see --help for options):

[ query element #1 ]  |  [ upstream element ]  |  [ downstream element ]
[ query element #2 ]  |  [ upstream element ]  |  [ downstream element ]
...
[ query element #n ]  |  [ upstream element ]  |  [ downstream element ]

Note the use of the | delimiter to split the elements. We can make use of this when passing results to standard UNIX utilities.

The following command yields the first closest gene downstream of the region element, using awk to split the result by the | delimiter and print out the third item (downstream element):

$ closest-features QueryGenes.bed Genes.bed \
    | awk -F'|' '{print $3}' -

For sake of demonstration, I'm sticking to downstream items. Note that we can complicate things here if we want an upstream element, by either printing the second item, or using the --dist option with closest-features. This option returns distance in bases, along with the nearest "hits" up- and downstream. By passing those distances to the awk statement, a simple conditional on the distance value could print either the second item (upstream hit) or third item (downstream hit).

Note that the output from this command — the first nearest gene — is in BED format, because it is an element in Genes.bed! This is really useful, because this means we can pass this back into closest-features as standard input, and it will try to find the nearest hit to that result.

Because closest-features takes in reference regions in standard input, these searches can be chained together on the command line. This BEDOPS utility follows the GNU convention of using the hyphen to denote standard input.

In other words, to search for the second closest gene upstream, we pass the results of the first search as a reference region for closest-features to search, using a standard UNIX pipe:

$ closest-features QueryGenes.bed Genes.bed \
    | awk -F'|' '{print $3}' - \
    | closest-features - Genes.bed \
    | awk -F'|' '{print $3}' -

And we find the third closest feature, by passing that result to another search:

$ closest-features QueryGenes.bed Genes.bed \
    | awk -F'|' '{print $3}' - \
    | closest-features - Genes.bed \
    | awk -F'|' '{print $3}' -
    | closest-features - Genes.bed \
    | awk -F'|' '{print $3}' -

And so on and so on, piping N-1 times for N closest features (in this case, features that are downstream of the input region element).

The output of each Nth iteration of this command is a BED-formatted genomic coordinate that represents the Nth-closest element ("gene") to the query-gene-of-interest.

Let's use these chained commands to build files containing the first through fifth closest features (from Genes.bed) to elements in QueryGenes.bed:

$ closest-features QueryGenes.bed Genes.bed | awk -F'|' {print $3} - > 1.bed
$ closest-features QueryGenes.bed Genes.bed | awk -F'|' {print $3} - | closest-features - Genes.bed | awk -F'|' {print $3} - > 2.bed
$ closest-features ... > 3.bed
$ closest-features ... > 4.bed
$ closest-features ... > 5.bed

While this looks like a lot of typing, you would write a script to construct and execute these chained commands, in order to save time and avoid typos.

Once you have the individual results, it's a simple matter of concatenating the region element with each result. We can use a simple UNIX paste command to do this, e.g.:

$ paste -d'|' QueryGenes.bed 1.bed 2.bed 3.bed 4.bed 5.bed > Results.bed

Each line of Results.bed should look something like:

[ query-gene-1 ]  |  [ first hit ]  |  [ second hit ]  | ... |  [ fifth hit ]
[ query-gene-2 ]  |  [ first hit ]  |  [ second hit ]  | ... |  [ fifth hit ]
...
[ query-gene-n ]  |  [ first hit ]  |  [ second hit ]  | ... |  [ fifth hit ]

About sorting

Note that the inputs to closest-features should be lexicographically-sorted. BEDOPS provides a tool called sort-bed which does this. You only need to sort once, if unsorted:

$ sort-bed QueryGenes.bed > QueryGenes.sorted.bed
$ sort-bed Genes.bed > Genes.sorted.bed

Then use BEDOPS tools, as usual:

$ closest-features QueryGenes.sorted.bed Genes.sorted.bed
...
ADD COMMENTlink modified 5.6 years ago • written 5.6 years ago by Alex Reynolds28k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1077 users visited in the last hour