22 months ago
dgarcia54 ▴ 20

Hello everybody!

As the title of the post says, I need to download all the isolates of SARS-CoV-2 from the GISAID database... I've been searching for a way to download the whole dataset, but I wasn't able to find out how to do it or whether it's even possible! Currently I am downloading them one by one, but there are more than 700 entries... I hope someone can help...

GISAID genome Database EpiCoV • 7.7k views
22 months ago

There used to be an awkwardly placed button in the lower right corner that would get you all the sequences. Then there is another Excel file with some metadata.

Thanks Istvan! I found the button that downloads an Excel file, but I did not find the other one... Thanks a lot for your answer! I will keep looking for a way to do it!

I also did not find the 'link' for sequences. I sent a help message through the website and am waiting for the reply. One month ago, I chose the most "STUPID" way and downloaded the sequences one by one (~200 records). I will not do that again (1k+ records to date!).

I looked up my records: I complained profusely about this issue more than a month ago. On February 13th, 2020, their "support" personnel sent me the image below as the method for downloading all data at once. Indeed the button was there, but in an awkward region, all the way at the bottom left of the page, where I had not noticed it before.

After I pointed out how absurd the whole system is, they removed my account, and I've not been able to log in since.

Does this button not exist anymore?

Thanks GISAID, I heard back from the GISAID website and received all the FASTA sequences in one file. They made the batch download available to me (button at the bottom right of the page). I was told NOT to share the data with anyone else; I need to be responsible for the data.

What I did was click the "Contact" button and ask for help. Hope this helps. @dgarcia54, you need to do it yourself.

After I attempted a batch download, they removed my account as well, with no notice. Not sure how to move forward from this point.

I would contact the GISAID website maintainer through the "Contact" page.

I have contacted them through the Contact page, but they haven't gotten back to me. I also found a clause in their terms of use stating that they can remove accounts with no reason/notice/explanation, so I should probably expect that they will simply not respond to the message I sent via their Contact page.

The 'Download button' didn't exist at the point I was trying to download, so I mean I was using a script I found on GitHub (which had apparently previously accomplished such multiple downloads from the page successfully) to parse the website. I wasn't able to get the script to work on my machine, though; I had spent approximately 8 hours trying to get it to work before I found my account was no longer accessible.

I don't think it is a good idea to use a "spider" to parse the website; this is the main reason your account was banned.

You may keep trying to contact the maintainer, and do not break their rules.

Fair. Since someone had done it successfully before me with no negative feedback from GISAID, though, I thought that it was alright. I had also checked the terms of use, and this is not stated there as forbidden or against the rules.

data/gisaid_cov2020_sequences.fasta

I am unable to get this from the site. There is no download option for it; only the acknowledgement table is there.

In my experience, if you send the website maintainer a message through "Contact" at the top right of the page, they will activate the "download" button for you.

Thanks, I have also messaged them through Contact, but nothing has changed so far.

Why is this feature activated for some users and not for all? I am not able to figure this out.

I'm not sure what is going on with that website. I received access this morning. Using my desktop computer, I could not find the download button anywhere on the page. I logged into GISAID from my laptop and the button was magically there - same browser version, same version of Windows.

22 months ago
Michael ▴ 240

For example, NCBI's interfaces are orders of magnitude better; why maintain such a secondary (inferior) infrastructure?

Or is there any option for batch download of all genome assemblies? From the comments in this forum I don't think so (or not any more).

The download button exists and works just fine. I don't know why some people are having problems with it. The GISAID server seems to be under a high load quite often. I don't know..

Whenever I download the complete file, this is the very first thing I do:

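# Split the download into one .fna file per record. The file name is the header with ">" removed, "/" and spaces replaced by "_", and the last character dropped (presumably the carriage return of a DOS line ending). Gaps ("-") are removed from sequence lines.
awk '{if(/^>/){h=$0;gsub(">","",$0);gsub("/","_",$0);gsub(" ","_",$0);n=substr($0,1,length($0)-1);print h>(n".fna")}else{gsub("-","",$0);print $0>(n".fna")}}' gisaid_cov2020_sequences.fasta
# Move the original download out of the working directory.
mv gisai* ../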

for f in *fna; do
    dos2unix "$f"
    awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' "$f" > "$f".tmp
    awk 'BEGIN{FS="\t";OFS="\n"}{gsub(/^N*/,"",$2);gsub(/^n*/,"",$2);gsub(/N*$/,"",$2);gsub(/n*$/,"",$2);print$1,$2}' "$f".tmp > "$f"
    rm "$f".tmp
done
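For anyone who would rather skip the temporary files, here is a minimal sketch of the same cleanup idea in a single awk pass: join wrapped sequence lines, drop carriage returns, and strip leading and trailing N/n from each sequence. The function name `linearize_and_trim` is made up for illustration, and it differs slightly from the loop above in that it trims mixed runs of N and n together and removes the carriage returns itself instead of calling dos2unix.

```shell
# linearize_and_trim FILE: print each record as a header line plus one sequence line,
# with leading and trailing N/n removed from the sequence.
# (Hypothetical helper name; not part of the original script.)
linearize_and_trim() {
    awk '
        function emit() {
            if (header == "") return        # nothing collected yet
            sub(/^[Nn]+/, "", seq)          # strip leading Ns
            sub(/[Nn]+$/, "", seq)          # strip trailing Ns
            print header
            print seq
        }
        { sub(/\r$/, "") }                  # tolerate DOS line endings
        /^>/ { emit(); header = $0; seq = ""; next }
        { seq = seq $0 }                    # join wrapped sequence lines
        END { emit() }
    ' "$1"
}
```

Usage would be something like `linearize_and_trim some_record.fna > some_record.clean.fna`, writing to a new file rather than overwriting the input.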

OK, now it is there... It was not until now.

EDIT: Thanks for the script by the way!!

I've been using the site since Jan and from what I recall, the button appeared when there were like 200+ genomes..

yeah, think about that for a second, you get data from an official repository and the first thing you need to do is fix it up. They used to even label Hong Kong with a space where no spaces are allowed, thereby breaking the fasta id and many tools that rely on unique ids to work properly.
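To see how much a space in the header actually hurts, a quick check like the sketch below can be run on the downloaded file. It assumes that the ID is whatever precedes the first whitespace in the header, which is how many tools read it (an assumption about those tools, not a GISAID rule). The function name `check_fasta_ids` is made up for illustration.

```shell
# check_fasta_ids FILE: report headers containing whitespace, and IDs that repeat
# once the header is cut at the first whitespace character.
# (Hypothetical helper name; the ID rule is an assumption, see above.)
check_fasta_ids() {
    awk '
        /^>/ {
            sub(/\r$/, "")                  # tolerate DOS line endings
            header = substr($0, 2)          # drop the leading ">"
            if (header ~ /[ \t]/) print "whitespace\t" header
            id = header
            sub(/[ \t].*$/, "", id)         # keep only the first token
            if (seen[id]++) print "duplicate\t" id
        }
    ' "$1"
}
```

Two records whose names differ only after a space end up with the same truncated ID, so they show up once per header as `whitespace` and again as `duplicate`.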

Well, it's not ideal. However, I do understand that labs want to be acknowledged for their contributions. Especially here it's more than likely that some people involved in the process from sample taking to library preparation have died from the disease..

yes fully agree,

but let's also realize that acknowledging/crediting/recognizing/citing and valuing work has absolutely nothing to do with limiting access and relicensing data.

Istvan, 100% agree. I think the way NCBI, ENA and others handle this, provides the same level of acknowledgment.

To me it looks a bit like this is just some way of monopolizing the data in order to still be "important".

When you use sequence data from the NCBI, ENA, etc, you do not have to acknowledge the sequence contributor and nobody does. I think in particular GISAID should improve access to the sequence data, but anyway, what we have now is far better than what we had during any past outbreak..

Recognition is implicit though. You always refer to NCBI/ENA accession number(s) that anyone can look up easily to see the metadata associated with them.

In what way does a statement like "the sequences have been obtained from GISAID and you can only get them from there" acknowledge the original authors in any way?

In general, everyone does acknowledge the sources or at least it is a standard scientific practice to cite the origins of the data as long as you are finding something of interest.

For example, if you say A is most similar to B you will need to cite B for sure. That's the scientific recognition right there.

There is really no need to cite GISAID as the data source, yet that's what is happening, that is what they are after. They want to get cited and reap all recognition.

I'm not here to defend or judge GISAID, but why do you think everyone is uploading their data there instead of, say, the NCBI? BTW, if I had my way, the submitters would be required to share their raw sequence data. Like one third of the genomes are unusable because people have no clue what they're doing.

I think this should be investigated and understood. I don't know how it ended up like that or why. Possible reasons include:

• the majority of scientists do not fully understand what GISAID does with their data
• the majority of scientists do not understand that they are not even allowed to relicense their data in the first place
• GISAID has the first mover's advantage.
• Reviewers probably ask people to deposit where everyone else is already.

In general, I don't mean to imply that GISAID does not do beneficial things for scientists. They should be funded and supported for the value that they add to the process.

The petition and complaint is about allowing data access freely and in an unimpeded manner.

I can't believe that in 2020 during a major pandemic the source to the data is locked away.

I guess what would solve the problem is if GISAID would allow INSDC to sync data across their databases. Which is not possible the way GISAID operates at the moment.

Fun thing is: GISAID imports data from open INSDC databases and incorporates it into their EpiFlu database ;)

Istvan, this was my thinking as well! Why is it necessary to execute "dos2unix" first? WTF. :(

One should really encourage every single researcher to submit to NCBI, ENA, et al.

I tried to ask why, but the flu people, for example, are just used to submitting data to GISAID. It's mostly because other flu data are there too. It's weird because just a few years ago (at least 2014) they used NCBI for that.

Does this script add the batch download button? Or are other steps needed?

The script has nothing to do with the web UI.

Thank you! I have already tried this, it didn't work.

22 months ago
5heikki 10k

Someone from Canada submitted like 50 genomes with nonsensical collection dates such as 2020-41-04

This is why we can't have nice things :(

edit. I know this is completely offtopic

I thought there was curation before genomes were released into the database?

now think about how incompetent the GISAID people must be that they cannot automatically detect obviously wrong dates on submission.
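For what it's worth, flagging an impossible date like the one above takes only a few lines. The sketch below assumes full YYYY-MM-DD dates (partial dates would need separate handling), and the function name `valid_date` is made up for illustration. It says nothing about how GISAID actually processes submissions.

```shell
# valid_date YYYY-MM-DD: exit status 0 if the string is a real calendar date, 1 otherwise.
# (Hypothetical helper name; assumes full dates only.)
valid_date() {
    printf '%s\n' "$1" | awk -F- '
        # Reject anything that is not exactly four, two and two digits.
        !/^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]$/ { exit 1 }
        {
            y = $1 + 0; m = $2 + 0; d = $3 + 0
            if (m < 1 || m > 12) exit 1
            split("31 28 31 30 31 30 31 31 30 31 30 31", days, " ")
            # Gregorian leap year rule
            if (m == 2 && y % 4 == 0 && (y % 100 != 0 || y % 400 == 0)) days[2] = 29
            if (d < 1 || d > days[m]) exit 1
        }
    '
}
```

With this, `valid_date 2020-41-04 || echo "bad date"` prints `bad date`, because 41 is not a month.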