Question: Entrez Direct E-utilities on command line: efilter parameters for limiting efetch results
11 months ago by
sovrappensiero10 wrote:

I'm using the command-line E-utilities tools to get nucleotide sequences from BioSample IDs. Right now I'm working it out for just one ID, and then I will address all 50+ IDs I need to fetch. I would like to limit the efetch to only capture the complete genome.

Initial code showing the results that I want to filter:

 esearch -db biosample -query "SAMN04014953" | elink -target nuccore | efilter -source refseq | efetch -format fasta | grep '>'

Output:

>NZ_CP021549.1 Klebsiella pneumoniae strain AR_0112, complete genome
>NZ_CP021548.1 Klebsiella pneumoniae strain AR_0112 plasmid tig00000005, complete sequence
>NZ_CP021547.1 Klebsiella pneumoniae strain AR_0112 plasmid tig00000003, complete sequence
>NZ_CP021546.1 Klebsiella pneumoniae strain AR_0112 plasmid tig00000002, complete sequence
>NZ_CP021545.1 Klebsiella pneumoniae strain AR_0112 plasmid tig00000001, complete sequence
>NZ_CP021544.1 Klebsiella pneumoniae strain AR_0112 plasmid tig00000000, complete sequence

I want only the first result. (To save time, I'd rather set up that filter in the efetch command rather than filtering after fetching all the results.)

How I've tried to alter the efilter command in the above query:

efilter -source refseq -field "complete genome"

This produces no output.

efilter -source refseq -query "genome"

This produces the same output as removing the -query "genome" from the efilter command.

Any suggestions? I've looked through the Entrez Direct guide and did a Google search, but I still can't come up with a solution.

modified 11 months ago by genomax91k • written 11 months ago by sovrappensiero10

I would suggest that you use Kai Blin's NCBI genome download tool (ncbi-genome-download).

ncbi-genome-download --assembly-level complete --genus "Klebsiella pneumoniae" bacteria

You can also parse a file that NCBI makes available, as described in the thread "how to download all the complete genomes for mycobacteria from NCBI?"

written 11 months ago by genomax91k

Thanks. I read through the documentation and it doesn't look like this includes any of the features I need. Namely, I need to link 50+ BioSample IDs to their Nucleotide IDs, and then fetch the associated RefSeq complete genomes. I'm starting with a list of BioSample IDs that I care about, and not all are K. pneumoniae (that was just an example for illustration). Have I misunderstood the tool's functions?

written 11 months ago by sovrappensiero10

If you knew the organisms you were interested in, this would be a simpler solution. I mentioned it to make you aware that it exists.

written 11 months ago by genomax91k
11 months ago by
genomax91k wrote:

If you want all accessions labeled as "chromosomes":

$ esearch -db biosample -query "SAMN04014953" | elink -target nuccore | efetch -format docsum | xtract -pattern DocumentSummary -if Genome -contains chromosome -element Caption | xargs -n 1 sh -c 'efetch -db nuccore -id "$0" -format fasta'

If you only want RefSeq:

$  esearch -db biosample -query "SAMN04014953" | elink -target nuccore | efetch -format docsum | xtract -pattern DocumentSummary -if Genome -contains chromosome -if SourceDb -contains refseq -element Caption | xargs -n 1 sh -c 'efetch -db nuccore -id "$0" -format fasta' | head -5
>NZ_CP021549.1 Klebsiella pneumoniae strain AR_0112, complete genome
GAGATATCAGGCTGGTTGTTAATAAAATCGTCAGGGTAGTAGCCGTAGTTTTTGACGCCGTTCAGCCGCA
GCAAACGCATCCAGTCCGCCAGCTGCTGGTCGGGAATGGCGTTTTGCTGGCGTCGATTCCAGTCTCTGGC
CTGCAGTTCAAAGATGGTTTTCTTCAGCGCGCCCGGGTGGCGGGCGACGGCCTGCACCAGCCGCGTCAGC
CAGGCCTGGCTGGCGTCGATCGGCACGGACTCCATCAGCGGCATCGCCATGGGGACCGTCCAGTCATAGG
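To run the same pipeline over a whole list of BioSample IDs, something along these lines might work. This is only a sketch: the file name `ids.txt` and the function names `for_each_id` and `refseq_chromosome_fasta` are made up for illustration, and the inner `xargs` is swapped for a `while read` loop so that a sample with no matching record fetches nothing. Stdin is redirected from `/dev/null` inside the loops so the commands do not consume the list being read.

```shell
# Run a command once per non-empty line of an ID file.
# $1 = file with one ID per line; remaining arguments = command to run.
for_each_id() {
    ids_file=$1
    shift
    while IFS= read -r id || [ -n "$id" ]; do
        [ -n "$id" ] || continue
        # Keep the command from reading the ID list on stdin.
        "$@" "$id" < /dev/null
    done < "$ids_file"
}

# The pipeline from this answer, wrapped to take one BioSample ID.
refseq_chromosome_fasta() {
    esearch -db biosample -query "$1" |
        elink -target nuccore |
        efetch -format docsum |
        xtract -pattern DocumentSummary -if Genome -contains chromosome -if SourceDb -contains refseq -element Caption |
        while IFS= read -r acc; do
            efetch -db nuccore -id "$acc" -format fasta < /dev/null
        done
}

# Usage (needs EDirect installed and network access):
#   for_each_id ids.txt refseq_chromosome_fasta > chromosomes.fasta
```

Keeping the loop separate from the pipeline makes it easy to try the loop first with a harmless command such as `echo` before sending 50+ requests to NCBI.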
modified 11 months ago • written 11 months ago by genomax91k

Thanks, this is helpful. Do I understand correctly that this will return only the first ID from the efetch step (is that what the $0 does)? If so, this would work as long as the first record in the Nucleotide database is always the most complete genome available. That seems likely to be a good assumption, but I want to check: 1) do I understand the code correctly, and 2) is it a good assumption?

Initially I kind of assumed there should be some efilter parameter that allows you to get exactly the complete genome, based on the item description in the database rather than position in the results, but perhaps there isn't.
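One way to select on the record description rather than on position might be to print each accession together with its title and keep only the rows whose title ends in "complete genome". A minimal sketch, assuming the nuccore docsum exposes a `Title` element next to `Caption` (that element name and the helper name `complete_genome_ids` are assumptions, not something confirmed in this thread):

```shell
# Keep accessions whose title ends with "complete genome".
# Input: tab-separated "accession<TAB>title" lines, which could come from e.g.
#   ... | efetch -format docsum | xtract -pattern DocumentSummary -element Caption Title
# (the Title element is an assumption here).
complete_genome_ids() {
    awk -F '\t' '$2 ~ /complete genome$/ { print $1 }'
}

# Offline demo using two of the titles from the question.
printf '%s\t%s\n' \
    'NZ_CP021549' 'Klebsiella pneumoniae strain AR_0112, complete genome' \
    'NZ_CP021548' 'Klebsiella pneumoniae strain AR_0112 plasmid tig00000005, complete sequence' |
    complete_genome_ids
# prints: NZ_CP021549
```

This still filters on the client side, but only the small docsum records are downloaded before the filter, not the full sequences.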

written 11 months ago by sovrappensiero10

It is not getting only the first record. It fetches one record at a time, however many there may be. You can verify that by doing:

$ esearch -db biosample -query "SAMN04014953" | elink -target nuccore | efetch -format docsum | xtract -pattern DocumentSummary -if Genome -contains chromosome -if SourceDb -contains refseq -element Caption | xargs -n 1 sh -c 'efetch -db nuccore -id "$0" -format fasta' | grep "^>"
>NZ_CP021549.1 Klebsiella pneumoniae strain AR_0112, complete genome

$ esearch -db biosample -query "SAMN04014953" | elink -target nuccore | efetch -format docsum | xtract -pattern DocumentSummary -if Genome -contains chromosome -element Caption | xargs -n 1 sh -c 'efetch -db nuccore -id "$0" -format fasta' | grep "^>"
>NZ_CP021549.1 Klebsiella pneumoniae strain AR_0112, complete genome
>CP021549.1 Klebsiella pneumoniae strain AR_0112, complete genome

There is probably more than one way to do this, but the in-line help for efilter did not suggest an obvious option.
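To see what the `$0` is doing without touching NCBI at all, here is a small offline sketch. `xargs -n 1` starts `sh -c` once per input line, and the first argument after the script string becomes `$0` inside it. The function name `show_each` is made up, and `echo` stands in for `efetch`:

```shell
# Stand-in for the efetch step: only echoes the accession it receives.
show_each() {
    xargs -n 1 sh -c 'echo "would fetch: $0"'
}

printf '%s\n' NZ_CP021549 CP021549 | show_each
# prints:
# would fetch: NZ_CP021549
# would fetch: CP021549
```

So `$0` is the current accession, not the first one, and every accession that passes the xtract filter gets its own efetch call.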

modified 11 months ago • written 11 months ago by genomax91k

Ok, I understand it now. Clever! Thank you so much for the help.

written 11 months ago by sovrappensiero10