Question

Finding the raw data used to build reference genomes

2.1 years ago

Aaron • 0

Is there a way to find the raw data (FASTQ or other) that was used to generate a reference genome? And is there a quick way to do this for a large number of genomes?

reference-genome • 780 views

updated 2.1 years ago by Ram 45k • written 2.1 years ago by Aaron • 0

score 3 · Accepted Answer · 2023-09-07

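You could do this using Entrez Direct (EDirect). I am using arbitrary NCBI GenBank assembly accessions below. Each command follows the link from the assembly to its BioSample record and prints that record's identifiers, which include the SRA sample accession (the SRS number). Once you have the SRA accession you can look up the associated runs and get at the sequence data.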

$ esearch -db assembly -query GCA_008245085 | elink -target biosample | efetch -format docsum | xtract -pattern DocumentSummary -element Identifiers
BioSample: SAMN06711904; Sample name: PFDSM3638; SRA: SRS4513276

One more example (from RefSeq):

$ esearch -db assembly -query GCF_021347895 | elink -target biosample | efetch -format docsum | xtract -pattern DocumentSummary -element Identifiers
BioSample: SAMN16534234; Sample name: KAUSTApolyChrSc; SRA: SRS7576196
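
For a large number of genomes, the same pipeline can be wrapped in a small shell function and run in a loop. The following is only a sketch, not a tested production script: the function names (extract_sra, assembly_to_sra, runinfo_to_runs, sra_to_runs) and the input file assemblies.txt are made up for illustration, and it assumes EDirect is on your PATH, that the Identifiers line has the same "SRA: ..." layout as in the output above, and that the runinfo CSV has the run accession in its first column. Assemblies whose BioSample has no linked SRA record will simply come back with an empty second column.

```shell
# Pull the SRA accession out of an "Identifiers" line such as
# "BioSample: SAMN06711904; Sample name: PFDSM3638; SRA: SRS4513276".
# Prints nothing if the line has no "SRA:" field.
extract_sra() {
    printf '%s\n' "$1" | tr ';' '\n' | sed -n 's/^ *SRA: *//p'
}

# Look up one assembly accession and print "<assembly><TAB><SRA accession>".
# Needs EDirect and network access. stdin is redirected from /dev/null as a
# precaution so the lookup cannot consume the input of an enclosing
# "while read" loop.
assembly_to_sra() {
    ids=$(esearch -db assembly -query "$1" < /dev/null \
        | elink -target biosample \
        | efetch -format docsum \
        | xtract -pattern DocumentSummary -element Identifiers)
    printf '%s\t%s\n' "$1" "$(extract_sra "$ids")"
}

# Read a runinfo CSV on stdin and print the run accessions
# (first column, skipping the header line and blank lines).
runinfo_to_runs() {
    awk -F, 'NR > 1 && $1 != "" && $1 != "Run" { print $1 }'
}

# List the run accessions (SRR...) recorded for one SRA sample accession.
sra_to_runs() {
    esearch -db sra -query "$1" < /dev/null \
        | efetch -format runinfo \
        | runinfo_to_runs
}

# Example usage (assemblies.txt is a hypothetical file, one accession per line):
#   while read -r acc; do assembly_to_sra "$acc"; done < assemblies.txt > assembly_sra.tsv
#   sra_to_runs SRS4513276
```

The run accessions printed by sra_to_runs are what download tools such as fasterq-dump from the SRA Toolkit take as input. Keep in mind that a linked SRA sample is not a guarantee that it holds exactly the reads used for the assembly, so it is worth checking a few records by hand before trusting the batch output.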