What is wrong with my NCBI esearch command for 16S rRNA?
1
0
Entering edit mode
5 months ago
Morgan S. ▴ 80

Hi there,

I am trying to extract the 16S rRNA from my genomes using the genome accession from NCBI. I tried the code below and ended up with the same unrelated sequence for each of my accessions. Instead of the rRNA for each respective genome, in each output file I have the same sequence mapping to Wolbachia. Any suggestions or is this too ambitious?

for i in $(cat genome_list.txt); do esearch -db assembly -query ${i} < /dev/null | elink -target nuccore -name assembly_nuccore_insdc | elink -target protein | esearch -db protein -query '"16S rRNA"[Protein name]' | efetch -format fasta_cds_na > ${i}_16S.fa ; done
16S NCBI Entrez assembly • 944 views
ADD COMMENT
0
Entering edit mode

Probably not exactly what you are looking for but try:

$ esearch -db nuccore -query "PRJNA33317 OR PRJNA33175 " | efetch -format fasta 
ADD REPLY
0
Entering edit mode

Do you have one or two of the genomes in the list? Just want to try out the command; I've been bitten by Entrez being fiddly with quotation marks and it always takes me some debugging runs to find them

ADD REPLY
0
Entering edit mode

While the following seems to work (E coli genome) the command does not work reliably (at lest for me, it spits up a bunch of errors on multiple tries, following is a successful run sans errors).

$ esearch -db assembly -query 'GCA_000005845' | elink -target nuccore -name assembly_nuccore_insdc | elink -target protein | esearch -db protein -query '"16S rRNA"[Protein name]' | efetch -format fasta_cds_na
>lcl|BMAO01001414.1_cds_GFQ73155.1_1 [gene=rsmD] [locus_tag=TNCT_542461] [protein=16S rRNA] [protein_id=GFQ73155.1] [location=join(3724..3973,6076..6272)] [gbkey=CDS]
ATGGTAGATTCAGACTATTACAATTTACAATTACCTAAAAAAACAGCAGAAGATTTTGGAATTATGGACA
ATGTTACGCTAATTTGCTGCAGTGCTAACGGATTACCAAAACCTATTTCAAAATGCGACATAGTTTTTAT
ADD REPLY
0
Entering edit mode

Hi GenoMax

This is what I am seeing as well. You provided an E. coli genome, but the 16S sequence is for a different entry, Trichonephila clavata. https://www.ncbi.nlm.nih.gov/protein/GFQ73155.1/

And when I provide a list, I get this same entry for them all. So it seems to not be the loop, but something with piping the different commands.

ADD REPLY
0
Entering edit mode

You can feed in starts and stops below to recover the actual sequence

(Ignore the column with all 9's below).

$ esearch -db assembly -query 'GCA_000005845' | elink -target nuccore | elink -target gene | efilter -query 'rRNA' | esummary  |xtract -pattern DocumentSummary -element Name,Description,ChrAccVer,ChrStart,ChrStop | grep "ribosomal RNA"
rrsB    16S ribosomal RNA       NC_000913.3     4166658 999999999       4168199
rrsD    16S ribosomal RNA       NC_000913.3     3428761 999999999       3427220
rrsE    16S ribosomal RNA       NC_000913.3     4208146 999999999       4209687
rrsG    16S ribosomal RNA       NC_000913.3     2731156 999999999       2729615
rrsH    16S ribosomal RNA       NC_000913.3     223770  999999999       225311
rrsC    16S ribosomal RNA       NC_000913.3     3941807 999999999       3943348
rrlB    23S ribosomal RNA       NC_000913.3     4168640 999999999       4171543
rrlG    23S ribosomal RNA       NC_000913.3     2729183 999999999       2726280
rrlE    23S ribosomal RNA       NC_000913.3     4210042 999999999       4212945
rrlD    23S ribosomal RNA       NC_000913.3     3426782 999999999       3423879
rrlH    23S ribosomal RNA       NC_000913.3     225758  999999999       228661
rrlC    23S ribosomal RNA       NC_000913.3     3943703 999999999       3946606
rrfB    5S ribosomal RNA        NC_000913.3     4171636 999999999       4171755
rrfE    5S ribosomal RNA        NC_000913.3     4213039 999999999       4213158
rrfC    5S ribosomal RNA        NC_000913.3     3946699 999999999       3946818
rrfG    5S ribosomal RNA        NC_000913.3     2726187 999999999       2726068
rrfH    5S ribosomal RNA        NC_000913.3     228755  999999999       228874
rrfF    5S ribosomal RNA        NC_000913.3     3423541 999999999       3423422
rrfD    5S ribosomal RNA        NC_000913.3     3423786 999999999       3423667
rrsA    16S ribosomal RNA       NC_000913.3     4035530 999999999       4037071
rrlA    23S ribosomal RNA       NC_000913.3     4037518 999999999       4040422
rrfA    5S ribosomal RNA        NC_000913.3     4040516 999999999       4040635

Another option

$ esearch -db assembly -query 'GCA_000005845' | elink -target nuccore | elink -target gene | efilter -query 'ribosomal RNA' | esummary  |xtract -pattern DocumentSummary -element Name,Description -block GenomicInfo -element ChrAccVer,ChrStart,ChrStop 
rrsB    16S ribosomal RNA       NC_000913.3     4166658 4168199
rrsD    16S ribosomal RNA       NC_000913.3     3428761 3427220
rrsE    16S ribosomal RNA       NC_000913.3     4208146 4209687
rrsG    16S ribosomal RNA       NC_000913.3     2731156 2729615
rrsH    16S ribosomal RNA       NC_000913.3     223770  225311
rrsC    16S ribosomal RNA       NC_000913.3     3941807 3943348
rrlB    23S ribosomal RNA       NC_000913.3     4168640 4171543
rrlG    23S ribosomal RNA       NC_000913.3     2729183 2726280
rrlE    23S ribosomal RNA       NC_000913.3     4210042 4212945
rrlD    23S ribosomal RNA       NC_000913.3     3426782 3423879
rrlH    23S ribosomal RNA       NC_000913.3     225758  228661
rrlC    23S ribosomal RNA       NC_000913.3     3943703 3946606
rrfB    5S ribosomal RNA        NC_000913.3     4171636 4171755
rrfE    5S ribosomal RNA        NC_000913.3     4213039 4213158
rrfC    5S ribosomal RNA        NC_000913.3     3946699 3946818
rrfG    5S ribosomal RNA        NC_000913.3     2726187 2726068
rrfH    5S ribosomal RNA        NC_000913.3     228755  228874
rrfF    5S ribosomal RNA        NC_000913.3     3423541 3423422
rrfD    5S ribosomal RNA        NC_000913.3     3423786 3423667
rpsA    30S ribosomal subunit protein S1        NC_000913.3     961994  963667
rrsA    16S ribosomal RNA       NC_000913.3     4035530 4037071
rrlA    23S ribosomal RNA       NC_000913.3     4037518 4040422
rpsD    30S ribosomal subunit protein S4        NC_000913.3     3441674 3441054
rpsT    30S ribosomal subunit protein S20       NC_000913.3     21077   20814
rmf     ribosome modulation factor      NC_000913.3     1015714 1015881
rlmN    23S rRNA m(2)A2503 methyltransferase/tRNA m(2)A37 methyltransferase     NC_000913.3     2644282 2643128
suhB    Nus factor SuhB NC_000913.3     2663441 2664244
rrfA    5S ribosomal RNA        NC_000913.3     4040516 4040635
rhlE    ATP-dependent RNA helicase RhlE NC_000913.3     830871  832235
rsmI    16S rRNA 2'-O-ribose C1402 methyltransferase    NC_000913.3     3293334 3292474
ADD REPLY
0
Entering edit mode
5 months ago

OK I think I got it now.

between here:

esearch -db assembly -query 'GCA_000005845' | \
elink -target nuccore -name assembly_nuccore_insdc | \ 
elink -target protein | \

and here:

esearch -db protein -query '"16S rRNA"[Protein name]' | \
efetch -format fasta_cds_na

There is a break, the previous results before esearch get discarded. esearch begins anew. You can try this out yourself by manually entering that search term in Genbank:

https://www.https://www.ncbi.nlm.nih.gov/protein/?term=%2216S+rRNA%22%5BProtein+name%5D

You should get the same spider protein as above, it's just randomly the first result for your search term: https://www.ncbi.nlm.nih.gov/protein/2071305251

I learned this by adding efetch -format docsum to every step, beginning with the first one.

Here's what you could do instead with replacing esearch by efilter, but I'm sure there's a nicer way of doing this:

esearch -db assembly -query 'GCA_000005845' | \
elink -target nuccore -name assembly_nuccore_insdc | \
elink -target protein | \
efilter -query '16S[Title]' | \
efetch -format fasta > all_my_16S_proteins.fasta

It's not perfect yet - I do get several proteins, but all of them at least have 16S in their title, and all of them come from that group of assemblies (E. coli MG1655).

ADD COMMENT
0
Entering edit mode

I don't think OP wants protein sequence. You are also getting methyl transferase enzymes in this search not actual 16S rRNA.

>AAC75983.2 16S rRNA m(3)U1498 methyltransferase [Escherichia coli str. K-12 substr. MG1655]
>AAC74905.2 16S rRNA m(5)C1407 methyltransferase [Escherichia coli str. K-12 substr. MG1655]
>AAC76314.1 16S rRNA m(5)C967 methyltransferase [Escherichia coli str. K-12 substr. MG1655]
>AAC77324.1 16S rRNA m(2)G1207 methyltransferase [Escherichia coli str. K-12 substr. MG1655]
>AAC76763.1 16S rRNA m(7)G527 methyltransferase [Escherichia coli str. K-12 substr. MG1655]
ADD REPLY

Login before adding your answer.

Traffic: 1519 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6