Question

Gene sequences for homologs across bacterial species

0

Entering edit mode

2.2 years ago

Julian ▴ 20

Here's a straightforward question that comes up a lot in my lab:

Is [DNA Sequence Feature X] in [Gene A] widely conserved across bacteria?

To start to answer that, I just need the DNA sequences of [Gene A] across bacteria. How do I do that?

One way: I'm about to just download a few thousand bacterial genomes from NCBI Assembly, build a custom BLAST database from those, search for the most similar sequence to [Gene A] in each genome, and then use those sequences. I think that'll take a little while. Is that the best way to go about this problem?

It seems like NCBI has some annotations for bacterial homologs--even though there's no Homologene. For example, I can search for a gene in NCBI Gene--say metG as an example, and I get this table: enter image description here

I can download the table, but the "Location" is left out in many entries (like the last row in the image above). So, with the table, I can't automatically grab homologous DNA sequences. (Same with doing the same query from the command line with Entrez). Is there a "Download all DNA sequences from this table" option in NCBI Gene that I'm just missing?

Do you have any suggestions? This seems like a pretty straightforward problem so I think I'm probably missing something simple. Thanks!

blast homology alignment conservation bacteria • 1.4k views

ADD COMMENT • link 2.2 years ago by Julian ▴ 20

1

Entering edit mode

2.2 years ago

Mensur Dlakic ★ 27k

There are many ways to solve this problem, and what you are planning to do is one of them.

You may get a faster solution if you protein belongs to a well-characterized protein family that is well represented in protein database. Let's look at PF09334 in Pfam. If you go there and click on "Alignments" link on the left, you will see that a seed group of proteins has 23 members, while a total number of members in UniProt is over 100 thousand. Those sequences can be downloaded in different formats (including FASTA) and from that point on it is just a matter of selecting only bacterial sequences. This approach will definitely be faster than what you are trying to do, and it may even be more comprehensive. Still, you would need to do some post-processing if you wish to use only bacterial members.

Yet another way based on Pfam is to download the corresponding hidden Markov model for this family, and then search your bacterial database with it. Likely to be more sensitive than regular BLASTp.

ADD COMMENT • link 2.2 years ago by Mensur Dlakic ★ 27k

0

Entering edit mode

On second reading I realize that you are talking about DNA sequences, while my answer is about proteins. Still, finding a related protein also means that there is a related DNA sequence, though you may have to download them separately.

ADD REPLY • link 2.2 years ago by Mensur Dlakic ★ 27k

score 3 · Accepted Answer · 2022-03-02

I believe the following EntrezDirect query should work for RefSeq genomes (truncated for space with only headers shown) :

$ esearch -db gene -query "metG [GENE] AND bacteria [orgn]" | efetch -format docsum |
  xtract -pattern GenomicInfoType -element ChrAccVer ChrStart ChrStop |
  xargs -n 3 sh -c 'efetch -db nuccore -format fasta -id "$0" -chr_start "$1" -chr_stop "$2"' > metG.fa


>NC_000913.3:2194300-2196333 Escherichia coli str. K-12 substr. MG1655, complete genome
>NC_015567.1:c1659139-1657112 Serratia plymuthica AS9, complete sequence
>NC_004545.1:113619-115256 Buchnera aphidicola str. Bp (Baizongia pistaciae), complete sequence
>NC_016842.1:c868089-865654 Treponema pallidum subsp. pertenue str. SamoaD, complete sequence
>NC_004556.1:1841691-1843763 Xylella fastidiosa Temecula1, complete sequence
>NC_010729.1:299637-301679 Porphyromonas gingivalis ATCC 33277, complete sequence
>NC_002663.1:354000-356048 Pasteurella multocida subsp. multocida str. Pm70, complete sequence
>NC_014541.1:c2075406-2073349 Ferrimonas balearica DSM 9799, complete sequence
>NC_005966.1:c754642-752585 Acinetobacter baylyi ADP1, complete sequence

score 2 · Accepted Answer · 2022-03-01

2

Entering edit mode

2.2 years ago

GenoMax 142k

NCBI datasets package would likely be useful to download these gene sequences. Take a look at this example.

You can get the gene ID's using EntrezDirect. I am using an example of metG. You will then upload this file to the tool above.

$ esearch -db gene -query "metG [GENE] AND bacteria [orgn]" | efetch -format docsum | xtract -pattern DocumentSummary -element Id > metG_id

70922259
70911831
70916534
70907463
946643
29739679
57666352
56470647
66887781
66330936

ADD COMMENT • link 2.2 years ago by GenoMax 142k

0

Entering edit mode

By the way, the datasets package seems to work great on some IDs, but not others, for example:

> ./datasets download gene gene-id 70922259
The gene you requested, (70922259) is not a recognized NCBI Gene ID.

>./datasets download gene gene-id 946643
Downloading: ncbi_dataset.zip    4.17kB done

Do you think this is an error on NCBI's end that I should report, or am I doing something wrong?

(For reference, 6 of the 10 above work: 29739679 56470647 57666352 66330936 66887781 946643. While testing another gene I had 17/20 work. It's also the same through the web GUI.)

ADD REPLY • link 2.2 years ago by Julian ▴ 20

1

Entering edit mode

You will likely get a more comprehensive answer from GenoMax, but I don't think this will work in all the cases, regardless of download issues. The criterion we ask for in GenoMax example above is for all the proteins annotated as metG [GENE] AND bacteria [orgn], which may work most of the time. Still, for various reasons that have to do with inconsistent or downright incorrect annotations, some metG genes may not be annotated as such. And even if they are for metG, it is almost a guarantee that they will not properly annotated for some other example. If your goal is to be comprehensive, I think you will have to do some kind of a sequence search, no matter how tempting and neat the datasets approach seems to be.

ADD REPLY • link 2.2 years ago by Mensur Dlakic ★ 27k

1

Entering edit mode

datasets is still BETA software. It is likely to generate some glitches. As you can see 70922259 is a valid gene ID if you were to search NCBI Gene database.

I concur with Mensur Dlakic . If you truly need an exhaustive set of sequences then pulling them out by searching may be the best option.

ADD REPLY • link 2.2 years ago by GenoMax 142k

1

Entering edit mode

Follow-up on this: Gene IDs above can be discontinued (and those won't work in datasets--~800 of ~1100 results I got from another gene were discontinued), so adding ALIVE[PROP] to the query helps (e.g. esearch -db gene -query "metG [GENE] AND bacteria [orgn] AND ALIVE[PROP]"). And in other cases, like 70922259, the gene id was just updated recently--I submitted a bug report and got a quick reply saying that they'll run an update & it should work tomorrow.

ADD REPLY • link 2.2 years ago by Julian ▴ 20