Question: How do I loop over a list of gene symbols to extract sequences from NCBI?
0
gravatar for MAPK
5 months ago by
MAPK1.4k
United States
MAPK1.4k wrote:

I want to extract a list of sequences from NCBI. I am esearch command mentioned here. For one gene symbol, I could do it like this:

esearch -db nuccore -q 'SS1G_01676[gene]' | efilter -source refseq -molecule genomic | efetch -format gene_fasta |  awk -v RS='(^|\n)>' '/SS1G_01676/{print RT $0}'

I want to use a bash loop to extract a list of sequences and below is what I have tried, but wouldn't yield any results. What am I missing here?

declare -a arr=("SS1G_03709" "SS1G_07286" "SS1G_04907")
for i in "${arr[@]}"
do
myquery="'${i}[gene]'"
echo "myid :" ${i}
echo "my query :" ${myquery}
esearch -db nuccore -q ${myquery} | efilter -source refseq -molecule genomic | efetch -format gene_fasta |  awk -v RS='(^|\n)>' '/${i}/{print RT $0}' >>text.fasta
done
bash eutils ncbi • 339 views
ADD COMMENTlink modified 5 months ago • written 5 months ago by MAPK1.4k
1

This is not a solution to OP issue. However code can be simplified ( I removed awk part of it and that can be simplified if OP describes what he/she wants to do with fasta output:)

For getting fasta:

$ cat test.txt | while read line; do esearch -db nuccore -q $line| efilter -source refseq -molecule genomic | efetch -format gene_fasta ; done

Input:

$ cat test.txt 
SS1G_03709
SS1G_07286
SS1G_04907
ADD REPLYlink written 5 months ago by cpad011211k

Thank you! With the awk part, I just wanted to grab fasta sequence (reference gene sequence) for the corresponding gene symbol.

ADD REPLYlink written 5 months ago by MAPK1.4k
1

awk doesn't take bash variable. You need to convert bash variable to awk variable. Try this (input remains the same as above): MAPK

$ while read line; do esearch -db nuccore -q $line| efilter -source refseq -molecule genomic | efetch -format gene_fasta | awk -v r="$line"  'BEGIN {RS="(^|\n)>"} $0 ~ r {print ">" $0}'  ; done < test.txt

>lcl|NW_001820832.1_gene_616 [locus_tag=SS1G_03709] [db_xref=GeneID:5491341] [partial=5',3'] [location=<1605066..>1605750] [gbkey=Gene]
ATGCATTTCTCAACTGCAAAAACGCTTCTTCCTCTCGCAGTTCTAGTTTCCTATACCACCGCTCAAACAA
CAGCTGCAGCACCACCTGTTGCTAGTGCTCCTACAGGCGGCACTTCTAGTACTTGTCTCGGACAAAATGT
TCTAGATGCTTGTTTGAGTTCTACGACGGCTATTGCTCAGGCATGTCAACCTACCGATTATAATTGTTTA
TGCACAAGTTGGAATGCAGTTTTAACGTACGCTCCTTCCTTATTCACTATCCTTCTAATCCCTTCTCTCC
TCCATCAAATCCCCTTCATTATACTAATTCTCCCCTCCCTTTCTCAGATGCTACAACCAATGTCCCAACG
ACGCAGGTTACGCGGGCGCTCTATCTAATAAACAAACTTATTGTAATAATGCATCAGTTTATACATCTAC
TTCCTCATCAGCTATCTCGAGAGATTGGCCTACGTCTGCTACTGCTACTGCGGCGGGTGGTGCTACTGGT
GCAGGTGCCGGTGCAAGAGCAGGTGCGACAAGTACAGCAGCGAGAGCGAGTTCTATGGGCAGAAATGAGA
GTAAGACGGCTAGTGCGACGGGGGCGGAGAGTAGTGATAAGAGTAGTGGGGCTGGGGAGAGAGGGGTTAG
TGGATTGGTGGGTGTTATTGGGGGTGTGGGGGTGGTGGTGGTGGGGTTTTTGTAG
ADD REPLYlink modified 5 months ago • written 5 months ago by cpad011211k

@cpad0112 Thank you, but this would not give me the fasta for the rest of the genes (you get only for the first gene).

ADD REPLYlink modified 5 months ago • written 5 months ago by MAPK1.4k
1

Sometimes an alternative approach is better. Have you considered NCBI Batch Entrez?

ADD REPLYlink written 5 months ago by h.mon24k
3
gravatar for vkkodali
5 months ago by
vkkodali1.1k
United States
vkkodali1.1k wrote:

You are right, it stops after the first gene. I am not sure why. (EDIT: see explanation below) This works though:

for line in `cat temp.txt`; do esearch -db nuccore -q $line| efilter -source refseq -molecule genomic | efetch -format gene_fasta | awk -v r="$line"  'BEGIN {RS="(^|\n)>"} $0 ~ r {print ">" $0}'  ; done

Apparently, for while loops involving esearch one should add < /dev/null to prevent esearch from "draining" the remaining lines from stdin. See documentation here: https://www.ncbi.nlm.nih.gov/books/NBK179288/#chapter6.While_Loop

ADD COMMENTlink modified 5 months ago • written 5 months ago by vkkodali1.1k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1287 users visited in the last hour