Question about COVID19 data sets
4.0 years ago
rob.costa1234 ▴ 310

I am looking to download or analyze a compilation of the following two data sets: 1. Sequencing of virus isolates from COVID-19 patients (if not all of them, then most, from different geographical areas). 2. COVID-positive patients at the time of infection, and either the same patients or different patients who have recovered from COVID.

Is there any source where all of these data sets (or selected ones) can be downloaded? Alternatively, is there any guidance on how to download them efficiently from scattered sources, selecting by country, disease status, sex, etc.?

Thanks

General questions data • 1.1k views

Have you checked NCBI Virus and GISAID?

It is useful. However, I think it does not have the clinical outcome of patients or host sequencing, as noted in my second point.

I don't expect you will find many sequenced host genomes, if any.

I worked on the first question for a while (genome sequences of virus isolates) and found that there are two major data sources: the NCBI Virus dataset and the GISAID dataset. The latter has very comprehensive meta-information (in my opinion) and has more "entries" (whole genome sequences) for COVID-19; its biggest drawback is that the dataset is not easy to download. GISAID requires users to register with their institute email address (which is a bit odd), and I did this only to find out that one can download data only one accession at a time (there are ~1300 accessions for COVID-19). You can filter the data (if I remember correctly, on the basis of geolocation, host, etc.), but you cannot download the entire dataset (this was my experience 20 days ago). I wrote them an email about this and have not heard back yet. NCBI Virus has ~93 accessions for COVID-19 (as of 20 days ago) and the data is very easy to download (because it is NCBI); however, the meta-information is not very consistent (for example, multiple entries are missing the location).

Edit: I can now see a "Download" button in the top right corner of the dataset. This may be because I sent them a request through the contact form a while ago.
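For the filtering step and the inconsistent meta-information mentioned above, here is a minimal sketch of how a downloaded metadata table could be filtered by country or sex and checked for missing locations. The column names (`accession`, `country`, `host`, `sex`) and the sample rows are made up for illustration; they are not the actual NCBI Virus or GISAID export schema, so adjust them to whatever your download contains.

```python
import csv
import io

# Hypothetical metadata export. Column names and rows are assumptions for
# illustration only, not the real NCBI Virus or GISAID schema.
SAMPLE_CSV = """accession,country,host,sex
ACC0001,China,Homo sapiens,male
ACC0002,,Homo sapiens,female
ACC0003,Italy,Homo sapiens,
ACC0004,China,Homo sapiens,female
"""


def filter_records(rows, **criteria):
    """Return rows whose fields match every criterion (case-insensitive)."""
    kept = []
    for row in rows:
        if all(
            (row.get(field) or "").strip().lower() == wanted.lower()
            for field, wanted in criteria.items()
        ):
            kept.append(row)
    return kept


def missing_field(rows, field):
    """Return accessions of rows where the given field is empty."""
    return [row["accession"] for row in rows if not (row.get(field) or "").strip()]


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
china = filter_records(rows, country="China")
no_location = missing_field(rows, "country")

print([r["accession"] for r in china])  # accessions with country == China
print(no_location)                      # accessions with no country recorded
```

In practice you would replace `io.StringIO(SAMPLE_CSV)` with an open file handle on the metadata file you downloaded, and list the entries with missing fields before deciding whether to drop them.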

Request bulk download using the contact form.

Dear Manaswwm, I am currently pursuing a course in genomics data science in Bangalore, after a career break of a few years from research. I'm mainly doing the course to familiarize myself with the latest developments in research.

As part of my curriculum, I have learnt whole genome analysis on the command line using GATK. I have practiced WGS on human and E. coli data; next I would like to try the same on a virus, specifically the coronavirus.

I looked for the reference genome in NCBI to get the FASTA file. I'm finding quite a few and am not sure which one to choose or how to start. There are about 800 FASTQ files.

Could someone please guide me through this exercise? Please be kind to me, I am a beginner. I am keen on learning and this is the best place to ask.

WGS #COVID19 #GATK #Learning

Hello, SARS-CoV-2 has a single whole genome reference sequence and the rest are samples (not sure if "sample" is the most accurate word to use here). In the NCBI Virus database you can search for SARS-CoV-2 sequences, and on the result page there is a "filter" box on the left side where you can select "RefSeq" and "whole genome" (or something along those lines; I would have given you the exact filters, but I am having problems loading the NCBI Virus database). Alternatively, you can go to the NCBI Genome database and search for SARS-CoV-2: on the result page you will see a "Reference Genome" section where you should be able to access the reference genome (which has RefSeq ID NC_045512.2). Good luck!
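If you prefer to fetch the reference from a script rather than through the web pages, a rough sketch using NCBI's E-utilities `efetch` endpoint could look like the following. The helper names (`refseq_fasta_url`, `download_fasta`) and the output file name are my own choices for illustration; only the accession NC_045512.2 comes from the answer above. The download function needs network access, so it is defined but not called here.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# NCBI E-utilities efetch endpoint.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def refseq_fasta_url(accession):
    """Build an efetch URL that returns the nucleotide record as FASTA text."""
    params = {
        "db": "nuccore",      # nucleotide database
        "id": accession,      # e.g. the RefSeq accession
        "rettype": "fasta",
        "retmode": "text",
    }
    return f"{EFETCH}?{urlencode(params)}"


def download_fasta(accession, out_path):
    """Download the FASTA record to out_path (requires network access)."""
    with urlopen(refseq_fasta_url(accession)) as resp, open(out_path, "wb") as fh:
        fh.write(resp.read())


url = refseq_fasta_url("NC_045512.2")
print(url)

# Example call, left commented out because it hits the network:
# download_fasta("NC_045512.2", "NC_045512.2.fasta")
```

The resulting FASTA file can then serve as the reference when you index and align your FASTQ reads.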
