Question

Entrez Direct E-utilities on command line: efilter parameters for limiting efetch results

0

5.7 years ago

sovrappensiero ▴ 100

I'm using the Entrez Direct command-line E-utilities to get nucleotide sequences from BioSample IDs. Right now I'm working it out for just one ID; afterwards I will handle all 50+ IDs I need to fetch. I would like to limit the efetch to only the complete genome.

Initial code showing the results that I want to filter:

 esearch -db biosample -query "SAMN04014953" | elink -target nuccore | efilter -source refseq | efetch -format fasta | grep '>'

Output:

>NZ_CP021549.1 Klebsiella pneumoniae strain AR_0112, complete genome
>NZ_CP021548.1 Klebsiella pneumoniae strain AR_0112 plasmid tig00000005, complete sequence
>NZ_CP021547.1 Klebsiella pneumoniae strain AR_0112 plasmid tig00000003, complete sequence
>NZ_CP021546.1 Klebsiella pneumoniae strain AR_0112 plasmid tig00000002, complete sequence
>NZ_CP021545.1 Klebsiella pneumoniae strain AR_0112 plasmid tig00000001, complete sequence
>NZ_CP021544.1 Klebsiella pneumoniae strain AR_0112 plasmid tig00000000, complete sequence

I want only the first result. (To save time, I'd rather set up that filter in the efetch command rather than filtering after fetching all the results.)

How I've tried to alter the efilter step in the above query:

efilter -source refseq -field "complete genome"

This produces no output.

efilter -source refseq -query "genome"

This produces the same output as removing the -query "genome" from the efilter command.

Any suggestions? I've looked through the Entrez Direct guide and did a Google search, but I still can't come up with a solution.

ncbi efilter efetch elink biosample • 2.4k views

ADD COMMENT • link updated 5.7 years ago by GenoMax 152k • written 5.7 years ago by sovrappensiero ▴ 100

0

I would suggest that you use Kai Blin's ncbi-genome-download tool.

ncbi-genome-download --assembly-level complete --genus "Klebsiella pneumoniae" bacteria

You can also parse a file NCBI makes available; see the thread "how to download all the complete genomes for mycobacteria from NCBI?"

ADD REPLY • link 5.7 years ago by GenoMax 152k

0

Thanks. I read through the documentation and it doesn't look like this includes any of the features I need. Namely, I need to link 50+ BioSample IDs to their Nucleotide IDs, and then fetch the associated RefSeq complete genomes. I'm starting with a list of BioSample IDs that I care about, and not all are K. pneumoniae (that was just an example for illustration). Have I misunderstood the tool's functions?

ADD REPLY • link 5.7 years ago by sovrappensiero ▴ 100

0

If you know the organisms you are interested in, this would be a simpler solution. I mentioned it to make you aware that it exists.

ADD REPLY • link 5.7 years ago by GenoMax 152k

score 2 · Accepted Answer · 2019-11-15

2

5.7 years ago

GenoMax 152k

If you want all accessions labeled as "chromosomes":

$ esearch -db biosample -query "SAMN04014953" | elink -target nuccore | efetch -format docsum | xtract -pattern DocumentSummary -if Genome -contains chromosome -element Caption | xargs -n 1 sh -c 'efetch -db nuccore -id "$0" -format fasta'

If you only want RefSeq:

$ esearch -db biosample -query "SAMN04014953" | elink -target nuccore | efetch -format docsum | xtract -pattern DocumentSummary -if Genome -contains chromosome -if SourceDb -contains refseq -element Caption | xargs -n 1 sh -c 'efetch -db nuccore -id "$0" -format fasta' | head -5
>NZ_CP021549.1 Klebsiella pneumoniae strain AR_0112, complete genome
GAGATATCAGGCTGGTTGTTAATAAAATCGTCAGGGTAGTAGCCGTAGTTTTTGACGCCGTTCAGCCGCA
GCAAACGCATCCAGTCCGCCAGCTGCTGGTCGGGAATGGCGTTTTGCTGGCGTCGATTCCAGTCTCTGGC
CTGCAGTTCAAAGATGGTTTTCTTCAGCGCGCCCGGGTGGCGGGCGACGGCCTGCACCAGCCGCGTCAGC
CAGGCCTGGCTGGCGTCGATCGGCACGGACTCCATCAGCGGCATCGCCATGGGGACCGTCCAGTCATAGG
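
To extend this to a list of 50+ BioSample IDs, one option is to wrap the RefSeq pipeline above in a loop. The following is only a sketch: the function names `fetch_one` and `fetch_all` and the file name `biosample_ids.txt` are made up for illustration, and the `< /dev/null` redirect is there on the assumption that esearch would otherwise read the loop's standard input and swallow the remaining IDs.

```shell
#!/bin/sh
# Fetch the RefSeq chromosome FASTA for one BioSample ID, using the same
# pipeline as in the answer above. "fetch_one" is a hypothetical helper name.
fetch_one() {
    # </dev/null: assumption that esearch reads stdin when it is not a terminal,
    # which would otherwise consume the rest of the ID list inside the loop.
    esearch -db biosample -query "$1" < /dev/null |
        elink -target nuccore |
        efetch -format docsum |
        xtract -pattern DocumentSummary -if Genome -contains chromosome \
            -if SourceDb -contains refseq -element Caption |
        xargs -n 1 sh -c 'efetch -db nuccore -id "$0" -format fasta'
}

# Read one BioSample ID per line from the file named in $1, skip empty lines,
# and write all FASTA records to stdout. "fetch_all" is also hypothetical.
fetch_all() {
    while IFS= read -r id || [ -n "$id" ]; do
        [ -n "$id" ] || continue
        fetch_one "$id"
    done < "$1"
}

# Usage (hypothetical file name):
#   fetch_all biosample_ids.txt > chromosomes.fasta
```

Keeping the per-ID pipeline in its own function makes it easy to test one ID first and only then run the whole list.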

ADD COMMENT • link 5.7 years ago by GenoMax 152k

0

Thanks, this is helpful. Do I understand correctly that this will return only the first ID from the efetch procedure (is that the $0)? If so, this would work as long as the first record in the Nucleotide database is always the most complete genome available. That seems like a reasonable assumption, but I want to check: 1) Do I understand the code correctly? 2) Is it a good assumption?

Initially I kind of assumed there should be some efilter parameter that allows you to get exactly the complete genome, based on the item description in the database rather than position in the results, but perhaps there isn't.

ADD REPLY • link 5.7 years ago by sovrappensiero ▴ 100

0

It is not getting only the first record. xargs -n 1 passes each accession to sh -c in turn, where it is available as $0, so it fetches one record at a time, however many there may be. You can verify that by running:

$ esearch -db biosample -query "SAMN04014953" | elink -target nuccore | efetch -format docsum | xtract -pattern DocumentSummary -if Genome -contains chromosome -if SourceDb -contains refseq -element Caption | xargs -n 1 sh -c 'efetch -db nuccore -id "$0" -format fasta' | grep "^>"
>NZ_CP021549.1 Klebsiella pneumoniae strain AR_0112, complete genome

$ esearch -db biosample -query "SAMN04014953" | elink -target nuccore | efetch -format docsum | xtract -pattern DocumentSummary -if Genome -contains chromosome -element Caption | xargs -n 1 sh -c 'efetch -db nuccore -id "$0" -format fasta' | grep "^>"
>NZ_CP021549.1 Klebsiella pneumoniae strain AR_0112, complete genome
>CP021549.1 Klebsiella pneumoniae strain AR_0112, complete genome
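
The role of `$0` can also be checked offline, without contacting NCBI, by replacing efetch with echo. This is just an illustration; the function name `show_ids` is made up, and the accessions are taken from the output in the question.

```shell
# With `sh -c 'script' ARG`, the first argument after the script becomes $0.
# `xargs -n 1` starts one sh per input item, so every accession is handled in turn.
show_ids() {
    printf '%s\n' NZ_CP021549.1 NZ_CP021548.1 NZ_CP021547.1 |
        xargs -n 1 sh -c 'echo "would fetch: $0"'
}

show_ids
# would fetch: NZ_CP021549.1
# would fetch: NZ_CP021548.1
# would fetch: NZ_CP021547.1
```

All three accessions are printed, so the selection of a single record in the real pipeline comes from the xtract conditions, not from `$0`.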

There is probably more than one way to do this, but the in-line help for efilter did not suggest an obvious option.