Finding the raw data used to build reference genomes
2.1 years ago
Aaron • 0

Is there a way to find the raw data (FASTQ or other) that was used to generate a reference genome? And is there a quick way to do this for a large number of genomes?

2.1 years ago
GenoMax 154k

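You could do this using Entrez Direct (EDirect). I am using arbitrary NCBI GenBank assembly accessions below. The pipeline looks up the assembly, follows the link to its BioSample record, and prints the identifiers stored there. Once you have the SRA accession you can get at the sequence data.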

$ esearch -db assembly -query GCA_008245085 | elink -target biosample | efetch -format docsum | xtract -pattern DocumentSummary -element Identifiers
BioSample: SAMN06711904; Sample name: PFDSM3638; SRA: SRS4513276

One more example (this time a RefSeq assembly accession):

$ esearch -db assembly -query GCF_021347895 | elink -target biosample | efetch -format docsum | xtract -pattern DocumentSummary -element Identifiers
BioSample: SAMN16534234; Sample name: KAUSTApolyChrSc; SRA: SRS7576196
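
For a large number of genomes, the same pipeline can be wrapped in a loop. The following is a minimal sketch, not a tested production script: the function names and the input file `accessions.txt` (one assembly accession per line) are hypothetical, and it assumes every BioSample record carries an `SRA:` entry in its Identifiers field, which will not always be the case. Note also that the SRA sample linked to a BioSample may hold more reads than the ones that went into the assembly, so treat the result as a starting point.

```shell
#!/usr/bin/env bash
# Sketch only: function names and accessions.txt are hypothetical.

# Pull the SRA sample accession out of an Identifiers line such as
# "BioSample: SAMN06711904; Sample name: PFDSM3638; SRA: SRS4513276".
# Prints nothing if the line has no "SRA:" entry.
sra_from_identifiers() {
  sed -n 's/.*SRA: \([A-Za-z0-9]*\).*/\1/p'
}

# Assembly accession -> Identifiers line (same pipeline as in the answer).
# </dev/null keeps esearch from swallowing the stdin of a calling loop.
assembly_identifiers() {
  esearch -db assembly -query "$1" </dev/null |
    elink -target biosample |
    efetch -format docsum |
    xtract -pattern DocumentSummary -element Identifiers
}

# Read one assembly accession per line from stdin and print
# "assembly<TAB>SRA sample" for each one.
assemblies_to_sra() {
  while read -r acc; do
    [ -z "$acc" ] && continue   # skip blank lines
    printf '%s\t%s\n' "$acc" "$(assembly_identifiers "$acc" | sra_from_identifiers)"
  done
}
```

After sourcing the script, usage would be something like `assemblies_to_sra < accessions.txt > assembly_to_sra.tsv`. An SRS accession can then be resolved to its runs with `esearch -db sra -query SRS4513276 | efetch -format runinfo`, and the resulting SRR accessions downloaded with sra-tools.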