Question: How to get a summary for accession numbers not starting with 'WP_'?
6schulte wrote:

Hi Biostars community,

I want to use epost and esummary (NCBI's E-utilities) to obtain lineage information.

But I have problems with accession numbers that do not start with WP_.

While

cat "$ListWithAccessionNumbers" | epost -db protein |\
    esummary -db taxonomy -format xml | \
    xtract  -pattern Seq-entry -element Org-ref_taxname, OrgName_lineage, NCBIeaa, Textseq-id_accession \
    > SummaryTable.tsv

gives me a TSV file, some cells are not filled with the requested information.

For accession numbers not starting with WP_, the accession number and sequence are not printed; they are printed only for accession numbers starting with WP_.

So my question is: how can I use epost and esummary to obtain lineage information for accession numbers that do not start with WP_, and still get the accession number and sequence printed? Does anyone have experience with this?

If you have suggestions on how to use epost and esummary differently, or know of another way to use NCBI's E-utilities to solve this problem, I would be grateful for your ideas and help. Thank you!

modified 4 days ago by RamRS • written 4 days ago by 6schulte

Please use the formatting bar (especially the code option) to present your post better. You can use backticks for inline code (`text` becomes text), or select a chunk of text and use the highlighted button to format it as a code block. I've done it for you this time.

written 4 days ago by RamRS

Thank you! I didn't know that, very helpful :)

written 4 days ago by 6schulte

genomax wrote:

So my current question is how can I obtain lineage information for those accession numbers using epost and esummary that do not start with WP_ but still also get the accession number and sequence printed out?

Once an input is passed to Entrez Direct, there is no way to keep track of it, so you should print it before it gets passed to Entrez Direct.

Edit: Taking out code which was not tested. Use @vkkodali's solution which is complete.

If that information is missing for specific accessions, you are not going to get that.

modified 4 days ago • written 4 days ago by genomax

Just noting that you are not parsing taxonomy XML. The output piped from epost is already using the protein database, and esummary ignores -db taxonomy in this instance. If you want to do cross-database searches, you will need to use elink. If all you need is taxonomy info (and not the sequence), you can do something like:

for i in $(cat file_w_accession_one_per_line); do
    # Print the accession first; the lineage is appended on the same row
    printf '%s\t' "${i}"
    epost -db protein -id "${i}" -format acc \
    | elink -target taxonomy \
    | efetch -format xml \
    | xtract -pattern TaxaSet -first ScientificName -element Lineage
done > SummaryTable.tsv

If, on the other hand, you want the sequence as well, you can simplify it to:

for i in $(cat file_w_accession_one_per_line); do
    # Print the accession first; the fetched fields are appended on the same row
    printf '%s\t' "${i}"
    efetch -db protein -id "${i}" -format xml \
    | xtract -pattern Seq-entry -element Org-ref_taxname, OrgName_lineage, NCBIeaa, Textseq-id_accession
done > SummaryTable.tsv
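
One thing to watch with both loops: the accession is printed without a newline, so if a lookup returns nothing, the next accession ends up on the same row. A minimal sketch of one way around that follows. The names `label_row` and `lineage_for` and the `NA` placeholder are made up for illustration and are not part of Entrez Direct; `lineage_for` simply wraps the pipeline from the first loop above.

```shell
# Hypothetical helper: print LABEL, a tab, then the output of COMMAND,
# falling back to NA when COMMAND prints nothing.
label_row() {
    label=$1
    shift
    result=$("$@")
    printf '%s\t%s\n' "$label" "${result:-NA}"
}

# The pipeline from the first loop above, wrapped in a function so it can
# be passed to label_row as a command (defined here, not executed).
lineage_for() {
    epost -db protein -id "$1" -format acc \
    | elink -target taxonomy \
    | efetch -format xml \
    | xtract -pattern TaxaSet -first ScientificName -element Lineage
}

# Stand-in commands to show the behaviour without network access.
label_row WP_112675856 echo "example lineage"
label_row CCH19814 true

# Intended use (needs EDirect and network access, so left commented out):
# for i in $(cat file_w_accession_one_per_line); do
#     label_row "$i" lineage_for "$i"
# done > SummaryTable.tsv
```

With this, every accession gets its own row, and accessions without a result are easy to spot by searching for the placeholder.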
written 4 days ago by vkkodali

Thank you for your reply, genomax. You already helped me with my last question. I am sorry for not giving examples of the accession numbers I am working with: WP_112675856, CCH19814, WP_149273991 and RSN10899 would be examples.

And thank you very much as well, vkkodali! The suggested code works well. :) Unfortunately, the code suggested for also retrieving the sequence does not print the sequence for accession numbers like CCH19814 and RSN10899, only for ones like WP_112675856 and WP_149273991.

Because I can extract the sequence with a simple esl-sfetch command and do not necessarily need it right now, I will just work with the code you suggested as the first option. Thank you!
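
For anyone who does need the sequence in the same table, one possible alternative to esl-sfetch is to request FASTA instead of XML and flatten each record to a single row. This is only a sketch: it assumes `efetch -db protein -format fasta` returns a record for accessions such as CCH19814 and RSN10899, which has not been checked here, and `fasta_to_tsv` is a hypothetical helper name.

```shell
# Hypothetical helper: read FASTA on stdin and print one row per record,
# "<id><TAB><sequence>", with the wrapped sequence lines joined together.
fasta_to_tsv() {
    awk '
        /^>/ {
            # New record: emit the previous one, then start collecting again
            if (id != "") print id "\t" seq
            id = substr($1, 2)   # first word of the header, without ">"
            seq = ""
            next
        }
        { seq = seq $0 }         # append sequence lines
        END { if (id != "") print id "\t" seq }
    '
}

# Intended use (needs EDirect and network access, so left commented out):
# efetch -db protein -id CCH19814 -format fasta | fasta_to_tsv
```

The identifier in the first column is taken from the FASTA header, so it will usually carry a version suffix and may need trimming before it is joined with the lineage table.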

modified 2 days ago • written 2 days ago by 6schulte