esearch|elink|esummary|xtract randomly skips some accessions
12 months ago

Hello Everyone,

I have a total of ~130,000 SRA accessions from which I need to retrieve the isolation source and the location.

$ head -n 10 SRAyk.txt
DRR095581
SRR11035504
SRR9016627
SRR5826819
SRR11032323
SRR6801753
SRR10144785
SRR12961276
SRR5927939
ERR2563030

Here is the bash loop:

for i in $(cat SRAyk.txt)
do
        # Look up the run in SRA, follow the link to its BioSample record,
        # and pull the isolation_source and geo_loc_name attributes.
        location=$(esearch -db sra -query "$i" < /dev/null |
        elink -db sra -target biosample -name sra_biosample |
        esummary |
        xtract -pattern DocumentSummary -group Attribute -if Attribute@harmonized_name -equals "isolation_source" -element Attribute -group Attribute -if Attribute@harmonized_name -equals "geo_loc_name" -element Attribute)
        echo -e "$i\t$location"
done

The problem I am facing is that the esearch|elink|esummary|xtract pipeline skips some SRA accessions, and this behavior seems to be completely random. The same happens if I use epost instead of esearch.

Is there anything I can do to solve this problem?

The second problem is that I have so many accessions that the job will probably take days to complete. The SRA accessions were recovered from MicrobeAtlas, and for each of them I already have the latitude and longitude but not the name of the location. From this huge list, I am only interested in the accessions that come from the USA.

I can probably reduce the number of SRA accessions by keeping only those with latitude between 0 and 90 and longitude between -180 and 0. Does that make sense?
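
A minimal sketch of that coordinate pre-filter, assuming the MicrobeAtlas data has been exported to a tab-separated file with accession, latitude, and longitude in columns 1 to 3 and no header. The file layout and the function name filter_nw_quadrant are assumptions, not something MicrobeAtlas provides:

```shell
# filter_nw_quadrant FILE
# Hypothetical helper: FILE is assumed to be tab-separated with
# accession, latitude, longitude in columns 1-3 (no header line).
# Prints only rows with 0 < latitude < 90 and -180 < longitude < 0.
filter_nw_quadrant() {
    # Adding 0 forces a numeric comparison; empty or non-numeric
    # coordinates become 0 and are therefore dropped.
    awk -F'\t' '($2 + 0) > 0 && ($2 + 0) < 90 && ($3 + 0) > -180 && ($3 + 0) < 0' "$1"
}

# Example (file names are placeholders):
# filter_nw_quadrant coords.tsv | cut -f1 > SRAyk_filtered.txt
```

This box covers the whole north-western quadrant, so it still keeps samples from Canada, Mexico, and other countries. It only shrinks the list; geo_loc_name is still needed to confirm that a sample is from the USA.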

Thank you!

P.S. I have already set up NCBI_API_KEY as an environment variable.

NCBI E-utilities • 785 views

> I have already set up NCBI_API_KEY as an environment variable

Doing 130,000 searches is probably running afoul of some search limits. Add some kind of wait between blocks.
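
One way to add a wait between blocks, as a rough sketch. The function name process_in_blocks, the block size, and the pause length are all assumptions to tune; the command passed in would be a wrapper around the esearch|elink|esummary|xtract pipeline:

```shell
# process_in_blocks FILE BLOCK_SIZE PAUSE_SECONDS COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND once per non-empty line of FILE,
# passing the line as the last argument, and sleeps for PAUSE_SECONDS
# after every BLOCK_SIZE lines.
process_in_blocks() {
    local file="$1" block="$2" pause="$3" n=0 acc
    shift 3
    while IFS= read -r acc; do
        [ -n "$acc" ] || continue
        # stdin is redirected so COMMAND cannot consume the accession list
        "$@" "$acc" < /dev/null || printf 'failed: %s\n' "$acc" >&2
        n=$((n + 1))
        if [ $((n % block)) -eq 0 ]; then
            sleep "$pause"
        fi
    done < "$file"
}

# Example (lookup_location is a placeholder for your own function):
# process_in_blocks SRAyk.txt 100 10 lookup_location
```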

This information may also be in SRA metadata files. It may be better to search those.


Hi GenoMax

I ran several tests with just 100 SRA accessions and with sleep 3s at the end of the loop.

It did not really solve the problem. Using the same accessions, the performance seems to get worse after the first try: 1) 10 missed; 2) 30 missed; 3) 28 missed; 4) 24 missed; 5) 26 missed.
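
Since different accessions are missed on each run, one workaround is to retry any lookup that comes back empty. This is only a sketch: retry_if_empty is a made-up helper, and it cannot tell a failed request apart from a BioSample that genuinely has neither attribute, so those accessions will use up all their retries.

```shell
# retry_if_empty MAX_TRIES PAUSE_SECONDS COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND until it prints something,
# at most MAX_TRIES times, sleeping PAUSE_SECONDS between attempts.
# Prints the output and returns 0 on success, returns 1 otherwise.
retry_if_empty() {
    local tries="$1" pause="$2" attempt=1 out
    shift 2
    while :; do
        # Only the output is checked; the exit status is ignored
        out=$("$@" < /dev/null) || true
        if [ -n "$out" ]; then
            printf '%s\n' "$out"
            return 0
        fi
        if [ "$attempt" -ge "$tries" ]; then
            return 1
        fi
        attempt=$((attempt + 1))
        sleep "$pause"
    done
}

# Example (lookup_location is a placeholder for the pipeline above):
# location=$(retry_if_empty 3 5 lookup_location "$i") || echo "$i" >> missed.txt
```

Writing the still-missing accessions to a file makes it easy to rerun only those later.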

I also checked the SRA metadata file, and the location doesn't seem to be reported in those files.

Because the behavior looks totally random, I should probably contact the E-utilities help desk.


There is no harm in asking the help desk.

Using an API key allows a maximum of 10 requests per second, but since you are doing a complicated search, the results are likely taking longer, so sleep 3 is probably not helping much.
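
For a sense of scale, here is a back-of-the-envelope calculation. It assumes each accession costs at least three requests (one each for esearch, elink, and esummary); the real number may be higher.

```shell
accessions=130000           # from the question
requests_per_accession=3    # assumption: esearch + elink + esummary
max_requests_per_second=10  # limit with an API key
sleep_per_accession=3       # the "sleep 3s" tried above

total_requests=$((accessions * requests_per_accession))
# Lower bound if every request ran exactly at the rate limit
min_hours=$((total_requests / max_requests_per_second / 3600))
# Time spent in sleep alone with one pause per accession
sleep_hours=$((accessions * sleep_per_accession / 3600))

echo "$total_requests requests, at least $min_hours h at the rate limit"
echo "sleep alone adds about $sleep_hours h"
```

The values are whole hours, rounded down. Even at the full rate limit the job takes over 10 hours, and a 3-second pause per accession adds roughly four and a half days, so cutting the list down first is worth it.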
