Question

How do I set a filter for NCBI esearch to get FASTA for genes only?

2

5.5 years ago

MAPK ★ 2.1k

My esearch query is $query = "SS1G_03709+AND+gene[filter]";, but it returns everything (gene, mRNA, and genome sequences). Which filter should I use so that the search returns gene sequences only? I tried a few filters from here, but none of them limited the search to genes.

esearch • 4.0k views

score 7 · Accepted Answer · 2018-10-25

7

5.5 years ago

GenoMax 141k

One way (sequence truncated for brevity):

$ esearch -db nuccore -query "SS1G_01676 [GENE]"  | efetch -format fasta | grep ">" | grep -v "genome" | awk '{print $1}' | epost -db nuccore | efetch -format fasta
>XM_001597432.1 Sclerotinia sclerotiorum 1980 UF-70 hypothetical protein partial mRNA
ATGGCGCCCAAATTTTCGGAAGACGAAATTGATGATTTGATATATTTTGCTCGTATTGGGGACAATGATG
AATTCGAGAAATTGAGGGAGGAGTTATGTAAAAGGGAAGGATGTTCGATTGCTGAACTGCTAGAGACTGC

$ esearch -db nuccore -query "SS1G_03709 [GENE]"  | efetch -format fasta | grep ">" | grep -v "genome" | awk '{print $1}' | epost -db nuccore | efetch -format fasta
>XM_001595570.1 Sclerotinia sclerotiorum 1980 UF-70 hypothetical protein partial mRNA
ATGCATTTCTCAACTGCAAAAACGCTTCTTCCTCTCGCAGTTCTAGTTTCCTATACCACCGCTCAAACAA
CAGCTGCAGCACCACCTGTTGCTAGTGCTCCTACAGGCGGCACTTCTAGTACTTGTCTCGGACAAAATGT
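
A side note on the middle of that pipeline: awk '{print $1}' passes the accession along with its leading ">". I am not sure every EDirect version tolerates that, so here is a minimal sketch that does the same header filtering but strips the ">". The function name non_genome_accessions is made up for illustration; the two headers in the offline check are ones that appear in this thread.

```shell
# Read FASTA on stdin and print the bare accession of every record whose
# header does not contain "genome" (same idea as grep ">" | grep -v "genome").
non_genome_accessions() {
  awk '/^>/ && !/genome/ { print substr($1, 2) }'
}

# Offline check with two headers from this thread (sequence shortened).
printf '%s\n' \
  '>XM_001595570.1 Sclerotinia sclerotiorum 1980 UF-70 hypothetical protein partial mRNA' \
  'ATGCATTTCTCAACTGCAAAAACG' \
  '>NW_001820834.1:c1556099-1555069 Sclerotinia sclerotiorum 1980 UF-70 scaffold_2 genomic scaffold, whole genome shotgun sequence' \
  'ATGCATTTCTCAACTGCAAAAACG' |
  non_genome_accessions
# prints: XM_001595570.1
```

In the real pipeline this would sit between the first efetch and epost -db nuccore, replacing the grep/grep/awk steps.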

@vkkodali has much neater ways of finding this info in that answer. Taking some inspiration from one of the commands there, you could just do

$ esearch -db nuccore -query "Sclerotinia sclerotiorum 1980 [TITLE]"  | efilter -molecule mrna | efetch -format fasta > s_sclerot.fa

to get all of them at one time.

0

@genomax Thank you, but what I want is the FASTA for the actual gene records, such as https://www.ncbi.nlm.nih.gov/gene/?term=SS1G_01676
OR https://www.ncbi.nlm.nih.gov/nuccore/NW_001820834.1?report=fasta&from=1555069&to=1556099&strand=true

Is it possible?

ADD REPLY • link 5.5 years ago by MAPK ★ 2.1k

1

You can use the -format gene_fasta option of efetch to get the FASTA sequence of the genes annotated on a genomic RefSeq as shown below:

esearch -db nuccore -query 'SS1G_01676[gene]' | efilter -source refseq -molecule genomic | efetch -format gene_fasta | grep -A1 'SS1G_01676'
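
One caveat with grep -A1: it keeps only the first line after the header, so if the sequence is wrapped over several lines you would get a truncated record. Here is a minimal sketch that keeps the whole record instead. The function name fasta_record is made up, and the headers in the offline check are invented placeholders, not real efetch output.

```shell
# Print every complete FASTA record whose header contains the given text.
# Unlike grep -A1, this keeps all sequence lines of a wrapped record.
fasta_record() {
  awk -v tag="$1" '
    /^>/ { keep = (index($0, tag) > 0) }  # decide once per header line
    keep                                  # print the header and its sequence lines
  '
}

# Offline check with invented headers and sequences.
printf '%s\n' \
  '>example_gene_1 [locus_tag=TAG_A]' 'AAAA' 'CCCC' \
  '>example_gene_2 [locus_tag=TAG_B]' 'GGGG' |
  fasta_record 'locus_tag=TAG_A]'
# prints the TAG_A header followed by AAAA and CCCC
```

To use it, put fasta_record 'SS1G_01676' at the end of the pipeline in place of grep -A1 'SS1G_01676'. The match is a plain substring test, so include a delimiter in the search text if one locus tag could be a prefix of another.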

One issue you may notice is that this downloads the FASTA sequence of every gene annotated on the record and then throws away all but one. That is not a big deal for a handful of gene queries, but it can become a performance problem when you run many of them. In that situation, you can use -format ft of efetch to first get the feature table, extract the coordinates of the gene you are interested in, and then use bash scripting with efetch -seq_start ### -seq_stop ### -format fasta to fetch the sequence of just that gene.
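
The coordinate-extraction step could look roughly like this. It is a sketch under assumptions: the feature table is the tab-delimited five-column layout (start, stop, feature key, then qualifier lines indented by three tabs), a minus-strand feature is listed with start greater than stop, and the function name gene_coords plus the sample table are invented for illustration.

```shell
# Read a five-column feature table on stdin and print "start stop strand"
# for the gene feature whose locus_tag equals the first argument.
gene_coords() {
  awk -F '\t' -v tag="$1" '
    $3 != "" {                                  # feature line: start, stop, key
      in_gene = ($3 == "gene")
      a = $1; b = $2
      gsub(/[<>]/, "", a); gsub(/[<>]/, "", b)  # drop partial-feature markers
    }
    in_gene && $4 == "locus_tag" && $5 == tag { # qualifier line of a gene
      if (a + 0 <= b + 0) print a, b, "plus"
      else                print b, a, "minus"
      exit
    }
  '
}

# Offline check with an invented two-gene table.
printf '>Feature ref|EXAMPLE.1|\n100\t250\tgene\n\t\t\tlocus_tag\tTAG_A\n<400\t>300\tgene\n\t\t\tlocus_tag\tTAG_B\n' |
  gene_coords TAG_B
# prints: 300 400 minus
```

You would feed it from efetch -format ft and pass the first two numbers to -seq_start and -seq_stop. For a minus-strand gene you also need the reverse complement; as far as I know efetch has a -strand option for that, but check efetch -help for your version.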

ADD REPLY • link 5.5 years ago by vkkodali_ncbi ★ 3.7k

1

I wonder if what @MAPK wants is this kind of header

>NW_001820834.1:c1556099-1555069 Sclerotinia sclerotiorum 1980 UF-70 scaffold_2 genomic scaffold, whole genome shotgun sequence

instead of

>XM_001595570.1 Sclerotinia sclerotiorum 1980 UF-70 hypothetical protein partial mRNA

since the sequence should be identical.

Slightly off-topic: You clearly have deep knowledge about eUtils! Do you work at/for NCBI?

ADD REPLY • link 5.5 years ago by GenoMax 141k

0

@vkkodali Thank you, this is what I wanted.

ADD REPLY • link 5.5 years ago by MAPK ★ 2.1k

1

$ esearch -db nuccore -query "Sclerotinia sclerotiorum 1980 [TITLE]"  | efilter -feature gene | efetch -format gene_fasta

This should return the gene records as multi-FASTA.

ADD REPLY • link 5.5 years ago by GenoMax 141k

score 6 · Accepted Answer · 2018-10-25

If you are interested in just the FASTA sequences of mRNAs, you can use elink as follows:

esearch -db gene -query 'SS1G_01676' | elink -db gene -target nuccore -name gene_nuccore_refseqrna | efetch -format fasta

If you happen to know the unique gene ID (for example, it is 5493342 for SS1G_01676), you can skip the esearch command entirely and go directly to the elink step as follows:

elink -db gene -target nuccore -name gene_nuccore_refseqrna  -id 5493342 | efetch -format fasta

Another tool that you may find useful is efilter, which you can use to filter results by source (refseq, genbank, etc.) and/or molecule type (genomic, mrna, etc.) as follows:

esearch -db nuccore -query "SS1G_01676 [GENE]" | efilter -molecule mrna | efetch -format fasta