Question: Is it possible to download the full EpiCov database from GISAID?
2
9 weeks ago by
dgarcia5420
dgarcia5420 wrote:

Hello everybody!

As the title of the post says, I need to download all the SARS-CoV-2 isolates from the GISAID database... I've been searching for a way to download the whole dataset, but I haven't been able to find out how to do it, or whether it's even possible! Currently I am downloading them one by one, but there are more than 700 entries... I hope someone can help...

Thanks in advance!

epicov database gisaid genome • 1.8k views
ADD COMMENTlink modified 7 weeks ago by 5heikki8.7k • written 9 weeks ago by dgarcia5420
3
9 weeks ago by
Istvan Albert ♦♦ 83k
University Park, USA
Istvan Albert ♦♦ 83k wrote:

There used to be an awkwardly placed button in the lower right corner that would get you all the sequences. Then there is another Excel file with some metadata.

To get all the metadata you will have to download each PDF file.

ADD COMMENTlink modified 9 weeks ago • written 9 weeks ago by Istvan Albert ♦♦ 83k

Thanks Istvan! I found the button that downloads an Excel file, but I did not find the other one... Thanks a lot for your answer! I will keep looking for a way to do it!

ADD REPLYlink written 9 weeks ago by dgarcia5420

I also did not find the 'link' for sequences. I sent a help request through the website and am waiting for the reply. One month ago, I chose the most "STUPID" way and downloaded the sequences one by one (~200 records). I will not do that anymore (1k+ records to date!)

ADD REPLYlink written 9 weeks ago by wm440

I looked up my records. I complained profusely about this issue more than a month ago; on February 13th, 2020, their "support" personnel sent me the image below as the method for downloading all data at once. Indeed the button was there, but in an awkward region, all the way at the bottom left of the page, where I had not noticed it before.

After I pointed out how absurd the whole system is, they removed my account, and I have not been able to log in since.

Does this button not exist anymore?

[Image: screenshot sent by GISAID support showing the download button]

ADD REPLYlink modified 9 weeks ago • written 9 weeks ago by Istvan Albert ♦♦ 83k

No Download button there, I checked it again.

Cannot believe they REMOVED your account. I just sent them a message about the download issue; hope they will not be very angry about it!

ADD REPLYlink written 9 weeks ago by wm440

Thanks, GISAID. I heard back from them through the website and received all the FASTA sequences in one file. They made batch download available for me (button at the bottom right of the page). I was told NOT to share the data with anyone else; I am responsible for the data.

What I did was click the "Contact" button and ask for help. Hope this helps. @dgarcia54, you need to do it yourself.

ADD REPLYlink written 9 weeks ago by wm440

After attempting to batch download they removed my account as well with no notice. Not sure how to move forward from this point.

ADD REPLYlink modified 9 weeks ago • written 9 weeks ago by rscruz70

I would contact the GISAID website maintainer through the "Contact" page.

What do you mean by "batch download": through the "Download" button, or did you use a script to parse the website? How long after the batch download did you find that your account had been removed?

ADD REPLYlink modified 9 weeks ago • written 9 weeks ago by wm440

I have contacted them through the Contact page, but they haven't gotten back to me. I also found that their terms of use include a clause stating that they can remove accounts with no reason/notice/explanation, so I should probably expect that they may simply not respond to my message.

The "Download" button didn't exist at the point I was trying to download, so I mean using a script I found on GitHub to parse the website (it had apparently been used successfully for such multiple downloads before). I wasn't able to get the script to work on my machine, though; it was approximately 8 hours into my attempts to get it working that I found my account was no longer accessible.

ADD REPLYlink modified 9 weeks ago • written 9 weeks ago by rscruz70

I see, thanks for your reply.

I don't think it is a good idea to use a "spider" to parse the website; that is the main reason your account was banned.

You may keep trying to contact the maintainer, and do not break their rules.

ADD REPLYlink written 9 weeks ago by wm440

Fair. Since someone had done it successfully before me with no negative feedback from GISAID, I thought it was alright. I had also checked the terms of use, and this is not stated there as forbidden or against the rules.

ADD REPLYlink modified 9 weeks ago • written 9 weeks ago by rscruz70

data/gisaid_cov2020_sequences.fasta

I am unable to get this from the site. There is no download option for it; only the acknowledgement table is there.

ADD REPLYlink written 8 weeks ago by priya12019510
1

In my experience: I sent the website maintainer a message through "Contact" at the top right of the page. They will activate the "download" button for you.

ADD REPLYlink written 8 weeks ago by wm440

Thanks, I have also messaged them through Contact, but nothing has changed so far.

Why is this feature activated for some users and not for all? I am not able to figure this out.

ADD REPLYlink written 8 weeks ago by priya12019510

I'm not sure what is going on with that website. I received access this morning. Using my desktop computer, I could not find the download button anywhere on the page. I logged into GISAID from my laptop and the button was magically there - same browser version, same version of Windows.

ADD REPLYlink written 7 weeks ago by billth270
2
9 weeks ago by
Michael150
Switzerland
Michael150 wrote:

I just got access to GISAID. Their interface is pathetic. Please, researchers, upload your data to INSDC (http://www.insdc.org/), which means uploading to NCBI GenBank, ENA or DDBJ. Or maybe also to the China National GeneBank. Please keep the data open!

For example, NCBI's interfaces are orders of magnitude better; why maintain such a secondary (inferior) infrastructure?

Or is there any option for batch download of all genome assemblies? From the comments in this forum I don't think so (or not any more).

ADD COMMENTlink modified 9 weeks ago • written 9 weeks ago by Michael150
1

The download button exists and works just fine. I don't know why some people are having problems with it. The gisaid server seems to be under a high load quite often. I don't know..

Whenever I download the complete file, this is the very first thing I do:

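# Split the multi-FASTA into one file per record. The file name is the header without ">",
# with "/" and spaces turned into "_" and the last character dropped (presumably the DOS "\r").
# Alignment gaps ("-") are removed from the sequence lines.
awk '{if(/^>/){h=$0;gsub(">","",$0);gsub("/","_",$0);gsub(" ","_",$0);n=substr($0,1,length($0)-1);print h>(n".fna")}else{gsub("-","",$0);print $0>(n".fna")}}' gisaid_cov2020_sequences.fasta
# Move the original download out of the working directory
mv gisai* ../

for f in *.fna; do
    dos2unix "$f"
    # Join each record's sequence lines into a single line: header<TAB>sequence
    awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' "$f" > "$f".tmp
    # Strip leading and trailing runs of N/n, then put header and sequence back on separate lines
    awk 'BEGIN{FS="\t";OFS="\n"}{gsub(/^[Nn]+/,"",$2);gsub(/[Nn]+$/,"",$2);print $1,$2}' "$f".tmp > "$f"
    rm "$f".tmp
done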
ADD REPLYlink written 8 weeks ago by 5heikki8.7k
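To try the linearize-and-trim steps on a small input without writing per-record files, something like the following sketch may help. It is not the script above: the function name `linearize_and_trim` is made up for illustration, it reads FASTA from stdin, it drops carriage returns with `tr` instead of `dos2unix`, and it does not remove "-" gap characters.

```shell
# Hypothetical helper: join each record's sequence lines into one line
# and strip leading/trailing runs of N or n. Reads FASTA on stdin.
linearize_and_trim() {
    tr -d '\r' | awk '
        # Print the record collected so far, if any
        function emit() {
            if (header != "") {
                sub(/^[Nn]+/, "", seq)   # leading N/n
                sub(/[Nn]+$/, "", seq)   # trailing N/n
                print header
                print seq
            }
        }
        /^>/ { emit(); header = $0; seq = ""; next }
        { seq = seq $0 }
        END { emit() }
    '
}
```

For example, `printf '>demo\nNNACGT\nacgtnn\n' | linearize_and_trim` prints the header followed by the single line `ACGTacgt`.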

OK, now it is there... It was not there until now.

Maybe you have to specifically ask for it. I wrote them a mail yesterday that batch download would be very helpful. They might have added a flag for batch download now. Whatever...

EDIT: Thanks for the script by the way!!

ADD REPLYlink modified 8 weeks ago • written 8 weeks ago by Michael150

I've been using the site since Jan and from what I recall, the button appeared when there were like 200+ genomes..

ADD REPLYlink written 8 weeks ago by 5heikki8.7k

Yeah, think about that for a second: you get data from an official repository and the first thing you need to do is fix it up. They even used to label Hong Kong with a space where no spaces are allowed, thereby breaking the FASTA ID and many tools that rely on unique IDs to work properly.

ADD REPLYlink modified 8 weeks ago • written 8 weeks ago by Istvan Albert ♦♦ 83k
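The header problem described above can be spotted before it breaks downstream tools. Below is a minimal sketch, assuming that a tool takes the FASTA ID to be the header text up to the first whitespace (common, but not universal). The function name `check_fasta_headers` and the output labels are made up for illustration.

```shell
# Hypothetical helper: flag headers containing whitespace and IDs that collide
# once the header is cut at the first whitespace.
# Reads FASTA from the files given as arguments, or from stdin if none are given.
check_fasta_headers() {
    awk '
        /^>/ {
            sub(/\r$/, "")                      # tolerate DOS line endings
            header = substr($0, 2)              # drop the leading ">"
            if (header ~ /[ \t]/) print "WHITESPACE\t" header
            split(header, parts, /[ \t]+/)      # parts[1] is the assumed ID
            if (++seen[parts[1]] == 2) print "DUPLICATE\t" parts[1]
        }
    ' "$@"
}
```

With two invented headers such as `>virus/Hong Kong/s1/2020` and `>virus/Hong Kong/s2/2020`, both are reported as containing whitespace, and the truncated ID `virus/Hong` is reported once as a duplicate.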
1

Well, it's not ideal. However, I do understand that labs want to be acknowledged for their contributions. Especially here it's more than likely that some people involved in the process from sample taking to library preparation have died from the disease..

ADD REPLYlink written 8 weeks ago by 5heikki8.7k

yes fully agree,

but let's also realize that acknowledging/crediting/recognizing/citing and valuing work has absolutely nothing to do with limiting access and relicensing data.

ADD REPLYlink modified 8 weeks ago • written 8 weeks ago by Istvan Albert ♦♦ 83k

Istvan, 100% agree. I think the way NCBI, ENA and others handle this provides the same level of acknowledgment.

To me it looks a bit like this is just some way of monopolizing the data in order to still be "important".

ADD REPLYlink written 8 weeks ago by Michael150

When you use sequence data from the NCBI, ENA, etc, you do not have to acknowledge the sequence contributor and nobody does. I think in particular GISAID should improve access to the sequence data, but anyway, what we have now is far better than what we had during any past outbreak..

ADD REPLYlink modified 8 weeks ago • written 8 weeks ago by 5heikki8.7k

Recognition is implicit though. You always refer to NCBI/ENA accession number(s) that anyone can look up easily to see the metadata associated with them.

ADD REPLYlink modified 8 weeks ago • written 8 weeks ago by genomax83k

In what way does a statement like "the sequences have been obtained from GISAID and you can only get them from there" acknowledge the original authors in any way?

In general, everyone does acknowledge the sources or at least it is a standard scientific practice to cite the origins of the data as long as you are finding something of interest.

For example, if you say A is most similar to B you will need to cite B for sure. That's the scientific recognition right there.

There is really no need to cite GISAID as the data source, yet that's what is happening, that is what they are after. They want to get cited and reap all recognition.

ADD REPLYlink modified 8 weeks ago • written 8 weeks ago by Istvan Albert ♦♦ 83k

I'm not here to defend nor judge GISAID, but why do you think everyone is uploading their data there instead of say the NCBI? BTW If I had my way, the submitters would be required to share their raw sequence data. Like one third of the genomes are unusable because people have no clue of what they're doing

ADD REPLYlink modified 8 weeks ago • written 8 weeks ago by 5heikki8.7k

I think this should be investigated and understood. I don't know how it ended up like that or why. Possible reasons include:

  • the majority of scientists do not fully understand what GISAID does with their data
  • the majority of scientists do not understand that they are not even allowed to relicense their data in the first place
  • GISAID has the first mover's advantage.
  • Reviewers probably ask people to deposit where everyone else is already.

In general, I don't mean to imply that GISAID does not do beneficial things for scientists. They should be funded and supported for the value that they add to the process.

The petition and complaint are about allowing data access freely and in an unimpeded manner.

I can't believe that in 2020 during a major pandemic the source to the data is locked away.

ADD REPLYlink modified 8 weeks ago • written 8 weeks ago by Istvan Albert ♦♦ 83k

I guess what would solve the problem is if GISAID would allow INSDC to sync data across their databases. Which is not possible the way GISAID operates at the moment.

Fun thing is: GISAID imports data from open INSDC databases and incorporates it into their EpiFlu database ;)

ADD REPLYlink written 8 weeks ago by Michael150

Istvan, this was my thinking as well! Why is it necessary to execute "dos2unix" first? WTF. :(

One should really encourage every single researcher to submit to NCBI, ENA, et al.

ADD REPLYlink written 8 weeks ago by Michael150
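Whether a given download actually needs `dos2unix` can be checked first. A minimal sketch follows, assuming a Unix-like system where awk splits records on "\n" so that any carriage return stays at the end of the line; the function name `count_crlf_lines` is made up for illustration.

```shell
# Hypothetical helper: count lines that end in a carriage return
# (DOS/Windows line endings). Reads the given files, or stdin if none are given.
count_crlf_lines() {
    awk '/\r$/ { n++ } END { print n + 0 }' "$@"
}
```

Running `count_crlf_lines gisaid_cov2020_sequences.fasta` and getting a non-zero count suggests the file should go through `dos2unix` (or `tr -d '\r'`) before further processing.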

Does this script add the batch download button? Or are other steps needed?

ADD REPLYlink written 6 weeks ago by Xiang Li0

The script has nothing to do with the web UI

ADD REPLYlink written 6 weeks ago by 5heikki8.7k

Thank you! I have already tried this, it didn't work.

ADD REPLYlink written 6 weeks ago by Xiang Li0
1
7 weeks ago by
5heikki8.7k
Finland
5heikki8.7k wrote:

Someone from Canada submitted like 50 genomes with nonsensical collection dates such as 2020-41-04

This is why we can't have nice things :(

edit. I know this is completely offtopic

ADD COMMENTlink modified 7 weeks ago • written 7 weeks ago by 5heikki8.7k

I thought there was curation before genomes were released into the database?

ADD REPLYlink written 7 weeks ago by genomax83k

now think about how incompetent the GISAID people must be that they cannot automatically detect obviously wrong dates on submission.

ADD REPLYlink modified 7 weeks ago • written 7 weeks ago by Istvan Albert ♦♦ 83k
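Dates like 2020-41-04 are easy to catch mechanically. Here is a minimal sketch of such a check, assuming collection dates are expected in full YYYY-MM-DD form, one per line on stdin (partial dates would be reported as bad under this assumption). The function name `validate_dates` and the output format are made up for illustration; nothing here reflects how GISAID actually processes submissions.

```shell
# Hypothetical helper: print "ok" or "bad", a tab, and the date, for each input line.
validate_dates() {
    awk '
        # Number of days in month m of year y (Gregorian leap-year rule)
        function days_in_month(y, m) {
            if (m == 2) return (y % 4 == 0 && (y % 100 != 0 || y % 400 == 0)) ? 29 : 28
            return (m == 4 || m == 6 || m == 9 || m == 11) ? 30 : 31
        }
        {
            ok = 0
            if ($0 ~ /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]$/) {
                split($0, p, "-")
                y = p[1] + 0; m = p[2] + 0; d = p[3] + 0
                ok = (m >= 1 && m <= 12 && d >= 1 && d <= days_in_month(y, m))
            }
            status = ok ? "ok" : "bad"
            print status "\t" $0
        }
    '
}
```

For example, `printf '2020-41-04\n2020-04-14\n' | validate_dates` marks the first date as bad (month 41) and the second as ok.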