Question

How do I loop over a list of gene symbols to extract sequences from NCBI?

2

Entering edit mode

5.7 years ago

MAPK ★ 2.1k

I want to extract a list of sequences from NCBI. I am esearch command mentioned here. For one gene symbol, I could do it like this:

esearch -db nuccore -q 'SS1G_01676[gene]' | efilter -source refseq -molecule genomic | efetch -format gene_fasta |  awk -v RS='(^|\n)>' '/SS1G_01676/{print RT $0}'

I want to use a bash loop to extract a list of sequences and below is what I have tried, but wouldn't yield any results. What am I missing here?

declare -a arr=("SS1G_03709" "SS1G_07286" "SS1G_04907")
for i in "${arr[@]}"
do
myquery="'${i}[gene]'"
echo "myid :" ${i}
echo "my query :" ${myquery}
esearch -db nuccore -q ${myquery} | efilter -source refseq -molecule genomic | efetch -format gene_fasta |  awk -v RS='(^|\n)>' '/${i}/{print RT $0}' >>text.fasta
done

ncbi eutils bash • 2.9k views

ADD COMMENT • link 5.7 years ago by MAPK ★ 2.1k

3

Entering edit mode

This is not a solution to OP issue. However code can be simplified ( I removed awk part of it and that can be simplified if OP describes what he/she wants to do with fasta output:)

For getting fasta:

$ cat test.txt | while read line; do esearch -db nuccore -q $line| efilter -source refseq -molecule genomic | efetch -format gene_fasta ; done

Input:

$ cat test.txt 
SS1G_03709
SS1G_07286
SS1G_04907

ADD REPLY • link 5.7 years ago by cpad0112 21k

0

Entering edit mode

Thank you! With the awk part, I just wanted to grab fasta sequence (reference gene sequence) for the corresponding gene symbol.

ADD REPLY • link 5.7 years ago by MAPK ★ 2.1k

1

Entering edit mode

awk doesn't take bash variable. You need to convert bash variable to awk variable. Try this (input remains the same as above): MAPK

$ while read line; do esearch -db nuccore -q $line| efilter -source refseq -molecule genomic | efetch -format gene_fasta | awk -v r="$line"  'BEGIN {RS="(^|\n)>"} $0 ~ r {print ">" $0}'  ; done < test.txt

>lcl|NW_001820832.1_gene_616 [locus_tag=SS1G_03709] [db_xref=GeneID:5491341] [partial=5',3'] [location=<1605066..>1605750] [gbkey=Gene]
ATGCATTTCTCAACTGCAAAAACGCTTCTTCCTCTCGCAGTTCTAGTTTCCTATACCACCGCTCAAACAA
CAGCTGCAGCACCACCTGTTGCTAGTGCTCCTACAGGCGGCACTTCTAGTACTTGTCTCGGACAAAATGT
TCTAGATGCTTGTTTGAGTTCTACGACGGCTATTGCTCAGGCATGTCAACCTACCGATTATAATTGTTTA
TGCACAAGTTGGAATGCAGTTTTAACGTACGCTCCTTCCTTATTCACTATCCTTCTAATCCCTTCTCTCC
TCCATCAAATCCCCTTCATTATACTAATTCTCCCCTCCCTTTCTCAGATGCTACAACCAATGTCCCAACG
ACGCAGGTTACGCGGGCGCTCTATCTAATAAACAAACTTATTGTAATAATGCATCAGTTTATACATCTAC
TTCCTCATCAGCTATCTCGAGAGATTGGCCTACGTCTGCTACTGCTACTGCGGCGGGTGGTGCTACTGGT
GCAGGTGCCGGTGCAAGAGCAGGTGCGACAAGTACAGCAGCGAGAGCGAGTTCTATGGGCAGAAATGAGA
GTAAGACGGCTAGTGCGACGGGGGCGGAGAGTAGTGATAAGAGTAGTGGGGCTGGGGAGAGAGGGGTTAG
TGGATTGGTGGGTGTTATTGGGGGTGTGGGGGTGGTGGTGGTGGGGTTTTTGTAG

ADD REPLY • link 5.7 years ago by cpad0112 21k

0

Entering edit mode

@cpad0112 Thank you, but this would not give me the fasta for the rest of the genes (you get only for the first gene).

ADD REPLY • link 5.7 years ago by MAPK ★ 2.1k

1

Entering edit mode

Sometimes an alternative approach is better. Have you considered NCBI Batch Entrez?

ADD REPLY • link 5.7 years ago by h.mon 35k

score 4 · Accepted Answer · 2018-10-29

You are right, it stops after the first gene. I am not sure why. (EDIT: see explanation below) This works though:

for line in `cat temp.txt`; do esearch -db nuccore -q $line| efilter -source refseq -molecule genomic | efetch -format gene_fasta | awk -v r="$line"  'BEGIN {RS="(^|\n)>"} $0 ~ r {print ">" $0}'  ; done

Apparently, for while loops involving esearch one should add < /dev/null to prevent esearch from "draining" the remaining lines from stdin. See documentation here: https://www.ncbi.nlm.nih.gov/books/NBK179288/#chapter6.While_Loop