Question

Where do you download SARS-CoV-2 sequence data?

0

3.5 years ago

2001linana ▴ 40

From the research papers I have read, GISAID and NCBI are the two common open databases / web services that people use to download sequence data. I recently started using GISAID and ran into an issue; see "how to download data from gisaid ?".

Setting that aside, I was trying to figure out what the data mean. In the readme.txt file, the FASTA header format is given as follows:

Gene name|Isolate name|YYYY-MM-DD|Isolate ID|Passage details/history|Type^^location/state|Host|Originating lab|Submitting lab|Submitter|Location(country)

Here is how I understand each field:

Gene name: a specific gene name.
Isolate name: the name of this specific isolate.
YYYY-MM-DD: the isolation date.
Isolate ID: the ID assigned to each isolate.
Passage details / history: what are passage details / history?
Type^^location / state: the sampling city.
Host: whether the host is human or another animal.
Originating lab: the lab from which this isolate originates.
Submitting lab: the lab that submitted this isolate.
Submitter: the person who submitted this isolate.
Location(country): the country of this isolate.

So, did I understand this correctly? What are passage details / history? What about NCBI data sets? To do a basic phylogenetic analysis, I might only need the sequences, the sampling date and the sampling location? What is your opinion?

Then I started working with the file allprot1109.fasta. I'm not sure why it is named like this. The first record looks like this:

>NSP1|hCoV-19/Wuhan/WIV04/2019|2019-12-30|EPI_ISL_402124|Original|hCoV-19^^Hubei|Human|Wuhan Jinyintan Hospital|Wuhan Institute of Virology|Wuhan Institute of Virology|China
MESLVPGFNEKTHVQLSLPVLQVRDVLVRGFGDSVEEVLSEARQHLKDGTCGLVEVEKGVLPQLEQPYVFIKRSDARTAPHGHVMVELVAELEGIQYGRSGETLGVLVPHVGEIPVAYRKVLLRKNGNKGAGGHSYGADLKSFDLGDELGTDPYEDFQENWNTKHSSGVTRELMRELNGG

So, is NSP1 the gene name? It does not look like one, and it does not seem to fit the header format described in README.txt. Can anyone please help me clarify this?

For simple research purposes, maybe I can ignore this and focus only on the sampling/isolation location, the sampling/isolation time, and the sequences. So, from the first record above, I'd like to extract

Wuhan/2019-12-30/MESLVPGFNEKTHVQLSLPVLQVRDVLVRGFGDSVEEVLSEARQHLKDGTCGLVEVEKGVLPQLEQPYVFIKRSDARTAPHGHVMVELVAELEGIQYGRSGETLGVLVPHVGEIPVAYRKVLLRKNGNKGAGGHSYGADLKSFDLGDELGTDPYEDFQENWNTKHSSGVTRELMRELNGG

right? A simple script in Python, C, C++ or MATLAB could extract this. Could anyone share some simple code for this initial preprocessing step?
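A minimal Python sketch of the extraction asked for here, not an official GISAID parser. It assumes every header follows the pipe-separated layout quoted from the readme, and that the place name is the second "/"-separated part of the isolate name (as in hCoV-19/Wuhan/WIV04/2019). The function names `read_fasta` and `summarize` are made up for this illustration.

```python
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def summarize(header: str, sequence: str) -> str:
    """Return 'location/date/sequence' for one record.

    Assumption: field 1 is the isolate name and field 2 the date,
    as in the header layout quoted in the question.
    """
    fields = header.split("|")
    if len(fields) < 3:
        raise ValueError(f"unexpected header layout: {header!r}")
    name_parts = fields[1].split("/")
    # Assumption: isolate names look like hCoV-19/<place>/<id>/<year>.
    location = name_parts[1] if len(name_parts) > 1 else "unknown"
    return f"{location}/{fields[2]}/{sequence}"


# Small in-memory example using a shortened copy of the first record.
example = [
    ">NSP1|hCoV-19/Wuhan/WIV04/2019|2019-12-30|EPI_ISL_402124|Original|"
    "hCoV-19^^Hubei|Human|Wuhan Jinyintan Hospital|"
    "Wuhan Institute of Virology|Wuhan Institute of Virology|China",
    "MESLVPGFNEK",
]
for header, seq in read_fasta(example):
    print(summarize(header, seq))  # Wuhan/2019-12-30/MESLVPGFNEK
```

For a real file, pass an open file handle to `read_fasta` instead of the list; the function reads one line at a time, so it should cope with large files.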

One more question: I guess this sequence is an amino acid sequence, as the letters indicate. Is this correct? If so, I might also need to convert it into its nucleotide counterpart.

RNA-Seq sequencing sequence • 1.9k views

ADD COMMENT • link updated 3.5 years ago by Jean-Karim Heriche 27k • written 3.5 years ago by 2001linana ▴ 40

0

What about sequence data from NCBI? Would that be better or worse?

ADD REPLY • link 3.5 years ago by 2001linana ▴ 40

0

There is no better or worse. NCBI's COVID19 portal hosts 39K genomes. You can also get the proteins by clicking on the Protein tab (425K+ entries as of now).

ADD REPLY • link 3.5 years ago by GenoMax 141k

0

GISAID is not an open database, so there may not be many people here with experience of it. Try using open resources: you'll have a better chance of finding people who use them and can answer your questions, especially as you don't seem to care where the data come from.

ADD REPLY • link 3.5 years ago by Jean-Karim Heriche 27k

0

I guess this sequence is an amino acid sequence, as the letters indicate. Is this correct?

Yes
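The "as the letters indicate" reasoning can be sketched as a quick alphabet check. This is only a heuristic with an assumed set of nucleotide codes (the IUPAC letters plus a gap character): a short peptide made up solely of letters such as A, C, G and T would be misclassified. The name `looks_like_protein` is hypothetical.

```python
# IUPAC nucleotide codes (including ambiguity codes) plus the gap character.
NUCLEOTIDE_CODES = set("ACGTUNRYKMSWBDHV-")


def looks_like_protein(sequence: str) -> bool:
    """Return True if the sequence contains letters that are not nucleotide codes."""
    return any(ch not in NUCLEOTIDE_CODES for ch in sequence.upper())


print(looks_like_protein("MESLVPGFNEK"))  # True: E, L, P and F are not nucleotide codes
print(looks_like_protein("ATGGAGAGC"))    # False
```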

NSP1 the gene name?

NSP1 is the name of the protein, since this is a protein sequence.

What is passage details / history ?

If the virus was passaged in cell culture, this field indicates how many rounds.

To do a basic phylogenetic analysis, I might only need the sequences, the sampling date and the sampling location ?

You would only need a unique identifier per sequence. That other information is metadata for separate analyses.
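One way to act on this, sketched under the assumption that headers follow the layout quoted in the question: build a short unique identifier from the isolate ID and the protein name, and keep the remaining fields as separate metadata. The function name `split_header` and the dictionary keys are hypothetical, and treating the last field as the country follows the quoted layout only.

```python
from typing import Dict, Tuple


def split_header(header: str) -> Tuple[str, Dict[str, str]]:
    """Split a pipe-separated header into a unique ID and a metadata record."""
    fields = header.lstrip(">").split("|")
    if len(fields) < 4:
        raise ValueError(f"unexpected header layout: {header!r}")
    protein, isolate_name, date, isolate_id = fields[:4]
    # In a per-protein file one isolate appears once per protein,
    # so the isolate ID alone would not be unique.
    unique_id = f"{isolate_id}_{protein}"
    metadata = {
        "id": unique_id,
        "isolate": isolate_name,
        "date": date,
        "country": fields[-1],  # assumption: last field is the country
    }
    return unique_id, metadata


uid, meta = split_header(
    ">NSP1|hCoV-19/Wuhan/WIV04/2019|2019-12-30|EPI_ISL_402124|Original|"
    "hCoV-19^^Hubei|Human|Wuhan Jinyintan Hospital|"
    "Wuhan Institute of Virology|Wuhan Institute of Virology|China"
)
print(uid)              # EPI_ISL_402124_NSP1
print(meta["country"])  # China
```

The sequences can then be written out under the short identifier, with the metadata saved to a separate table keyed on the same `id`.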

NCBI and Nextstrain.org make pre-made phylogenetic analyses available. Do you have to do your own?

ADD REPLY • link 3.5 years ago by GenoMax 141k

0

What do you mean by "pre-made phylogenetic analyses"?

ADD REPLY • link 3.4 years ago by 2001linana ▴ 40

0

NCBI has already done the phylogenetic analyses for you. You can download the tree/alignment data. You can also upload your own sequence in that tool I linked.

ADD REPLY • link 3.4 years ago by GenoMax 141k

0

What is the general procedure for sequence analysis of SARS-CoV-2 data?

ADD REPLY • link 3.4 years ago by 2001linana ▴ 40

1

It depends on the goal of the analysis. Have a look at the workflows for COVID-19 analysis on usegalaxy.*. You can also find SARS-CoV-2/COVID-19 related workflows on the WorkflowHub.

ADD REPLY • link 3.4 years ago by Jean-Karim Heriche 27k

0

NextStrain makes a tutorial available for Genomic epidemiology of SARS-CoV-2 data.

ADD REPLY • link 3.4 years ago by GenoMax 141k

0

I was reading the "preparing your data" section of the tutorial. It suggests downloading data from GISAID. However, when I tried to download data from GISAID, the files were not in the same format I saw several days ago, and certain files are missing. Now I can only see four files, named FASTA header format, allprot1118, spikeprot1118, and nextregions. Could anyone let me know why that is? Is it because GISAID data is not publicly accessible to anyone ?

ADD REPLY • link 3.4 years ago by 2001linana ▴ 40

0

GISAID data is not publicly accessible to anyone ?

Yes. You need to apply and they need to accept your application. The problem with GISAID is that you're not allowed to share the data, not even in a publication. I believe this is bad scientific practice, so I encourage you to use public data instead.

ADD REPLY • link 3.4 years ago by Jean-Karim Heriche 27k

score 1 · Answer 1 · 2020-11-12

1

3.5 years ago

Jean-Karim Heriche 27k

Try the COVID19 data portal.

ADD COMMENT • link 3.5 years ago by Jean-Karim Heriche 27k

0

Hi, I just downloaded a huge file (2 GB) from the COVID19 data portal. Do you happen to know of any links or references for the initial processing of sequence data from the COVID19 data portal?
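As a possible first step with a file of that size, here is a minimal Python sketch that streams through the records without loading everything into memory. It assumes the download is a FASTA file, optionally gzip-compressed, which the post does not state. The helper names `open_text` and `fasta_lengths` are made up for this illustration.

```python
import gzip
from typing import Iterable, Iterator, Tuple


def open_text(path: str):
    """Open a plain or gzip-compressed text file for reading."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def fasta_lengths(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Yield (header, sequence length) per record, reading one line at a time."""
    header, length = None, 0
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, length
            header, length = line[1:], 0
        elif header is not None:
            length += len(line)
    if header is not None:
        yield header, length


# In-memory demonstration; for a real file, pass the handle from open_text().
demo = [">seq1 example", "ACGT", "AC", ">seq2", "GG"]
records = list(fasta_lengths(demo))
print(len(records))  # 2
print(records)       # [('seq1 example', 6), ('seq2', 2)]
```

Counting records and checking sequence lengths this way gives a quick sanity check before any alignment or filtering.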

ADD REPLY • link 3.3 years ago by 2001linana ▴ 40