Question

How to retrieve BAM file from NCBI SRA

8.5 years ago

mufernando ▴ 10

I downloaded the XML for an NCBI SRA BioProject and just realized that the submitters made the aligned BAM files public. However, I don't know how to retrieve them. Until now I had been downloading the FASTQ files from EBI ENA with wget.

Here is one of the entries of the XML:

<EXPERIMENT_PACKAGE><EXPERIMENT accession="SRX1801508" alias="PA_9_2012_lib"><IDENTIFIERS><PRIMARY_ID>SRX1801508</PRIMARY_ID><SUBMITTER_ID namespace="SUB1300605">PA_9_2012_lib</SUBMITTER_ID></IDENTIFIERS><TITLE>Pool-seq of adult male Drosophila melanogaster</TITLE><STUDY_REF accession="SRP075757"><IDENTIFIERS><PRIMARY_ID>SRP075757</PRIMARY_ID><EXTERNAL_ID namespace="BioProject">PRJNA308584</EXTERNAL_ID></IDENTIFIERS></STUDY_REF><DESIGN><DESIGN_DESCRIPTION>see Methods in Bergland et al 2014.</DESIGN_DESCRIPTION><SAMPLE_DESCRIPTOR accession="SRS1468217"><IDENTIFIERS><PRIMARY_ID>SRS1468217</PRIMARY_ID><SUBMITTER_ID namespace="pda|hmachado">PA_9_2012</SUBMITTER_ID></IDENTIFIERS></SAMPLE_DESCRIPTOR><LIBRARY_DESCRIPTOR><LIBRARY_NAME>PA_9_2012_lib</LIBRARY_NAME><LIBRARY_STRATEGY>WGS</LIBRARY_STRATEGY><LIBRARY_SOURCE>GENOMIC</LIBRARY_SOURCE><LIBRARY_SELECTION>RANDOM</LIBRARY_SELECTION><LIBRARY_LAYOUT><PAIRED/></LIBRARY_LAYOUT></LIBRARY_DESCRIPTOR></DESIGN><PLATFORM><ILLUMINA><INSTRUMENT_MODEL>Illumina HiSeq 2000</INSTRUMENT_MODEL></ILLUMINA></PLATFORM></EXPERIMENT><SUBMISSION lab_name="Biology" center_name="Stanford University" accession="SRA429389" alias="SUB1300605"><IDENTIFIERS><PRIMARY_ID>SRA429389</PRIMARY_ID><SUBMITTER_ID namespace="Stanford University">SUB1300605</SUBMITTER_ID></IDENTIFIERS></SUBMISSION><Organization type="institute"><Name>Stanford University</Name><Address postal_code="94305"><Department>Biology</Department><Institution>Stanford University</Institution><Street>371 Serra</Street><City>Stanford</City><Sub>California</Sub><Country>United States of America</Country></Address><Contact email="hmachado@stanford.edu"><Address postal_code="94305"><Department>Biology</Department><Institution>Stanford University</Institution><Street>371 Serra</Street><City>Stanford</City><Sub>California</Sub><Country>United States of America</Country></Address><Name><First>Heather</First><Last>Machado</Last></Name></Contact></Organization><STUDY center_name="BioProject" 
alias="PRJNA308584" accession="SRP075757"><IDENTIFIERS><PRIMARY_ID>SRP075757</PRIMARY_ID><EXTERNAL_ID namespace="BioProject" label="primary">PRJNA308584</EXTERNAL_ID></IDENTIFIERS><DESCRIPTOR><STUDY_TITLE>Real Time Evolution Consortium: Tracking the tempo and mode of evolution over ecologically relevant temporal and spatial scales</STUDY_TITLE><STUDY_TYPE existing_study_type="Other"/><STUDY_ABSTRACT>Understanding the forces that shape hereditary variation within and between species is the central task of evolutionary biologists. Over the last 150 years, since the first publication of On The Origin, we have made tremendous strides in identifying that the dominant factors affecting genetic variation include stochastic processes such as drift and demography as well as the deterministic process of selection. However, the relative contribution of these forces in shaping patterns of genetic variation are largely unknown and attempts to build general models of evolution that favor one force over the other have led to some of the most vigorous debates in modern biology. One of the main reasons that there is no general consensus about the predominance of selection, drift and demography is that, historically, there has been a deficit of spatially and temporally structured population genomic data. Data like these would allow researchers to explicitly test models of demography and selection by watching changes in allele and genotype frequency in real time. Clearly, with the advent of next generation sequencing technology, the potential for generating population genomic data has increased many orders of magnitude. However, what is currently lacking are available biological samples, relevant environmental metadata, a cohesive computational system to organize these data in a spatio-temporal framework, and venues to initiate collaborations amongst scientists to use these data to test specific hypotheses about the evolutionary process. 
As an outgrowth of a 2012 NESCent Catalysis Meeting, we have formed a consortium of scientists to address the fundamental question, “What is the tempo and mode of evolution over ecologically relevant spatial and temporal scales?” We argue that the best way to provide novel answers to this question is through the collaborative actions of a diverse group of scientists and through policies and cyber-infrastructure that promote open data to the broader scientific community. Recent efforts have focused on: 1) obtaining standardized collections of natural D. melanogaster populations from around the world as well as paired seasonal collections (spring through autumn) in a subset of locales; 2) producing high coverage, pooled genomic resequencing of these populations and associated analysis; 3) generating standardized, comprehensive phenotypic data (e.g., body size, stress tolerance, immune response, pigmentation) for isofemale lines derived from these collections.</STUDY_ABSTRACT><CENTER_PROJECT_NAME>Drosophila melanogaster</CENTER_PROJECT_NAME></DESCRIPTOR><STUDY_LINKS><STUDY_LINK><URL_LINK><LABEL>Real Time Evolution Consortium</LABEL><URL>http://sites.sas.upenn.edu/paul-schmidt-lab/pages/opportunities</URL></URL_LINK></STUDY_LINK></STUDY_LINKS></STUDY><SAMPLE alias="PA_9_2012" accession="SRS1468217"><IDENTIFIERS><PRIMARY_ID>SRS1468217</PRIMARY_ID><EXTERNAL_ID namespace="BioSample">SAMN04412693</EXTERNAL_ID><SUBMITTER_ID namespace="pda|hmachado" label="Sample name">PA_9_2012</SUBMITTER_ID></IDENTIFIERS><SAMPLE_NAME><TAXON_ID>7227</TAXON_ID><SCIENTIFIC_NAME>Drosophila melanogaster</SCIENTIFIC_NAME></SAMPLE_NAME><DESCRIPTION>50 pooled individuals</DESCRIPTION><SAMPLE_LINKS><SAMPLE_LINK><XREF_LINK><DB>bioproject</DB><ID>308584</ID><LABEL>PRJNA308584</LABEL></XREF_LINK></SAMPLE_LINK></SAMPLE_LINKS><SAMPLE_ATTRIBUTES><SAMPLE_ATTRIBUTE><TAG>strain</TAG><VALUE>not applicable</VALUE></SAMPLE_ATTRIBUTE><SAMPLE_ATTRIBUTE><TAG>isolate</TAG><VALUE>not 
applicable</VALUE></SAMPLE_ATTRIBUTE><SAMPLE_ATTRIBUTE><TAG>breed</TAG><VALUE>not applicable</VALUE></SAMPLE_ATTRIBUTE><SAMPLE_ATTRIBUTE><TAG>cultivar</TAG><VALUE>not applicable</VALUE></SAMPLE_ATTRIBUTE><SAMPLE_ATTRIBUTE><TAG>ecotype</TAG><VALUE>not applicable</VALUE></SAMPLE_ATTRIBUTE><SAMPLE_ATTRIBUTE><TAG>age</TAG><VALUE>not collected</VALUE></SAMPLE_ATTRIBUTE><SAMPLE_ATTRIBUTE><TAG>dev_stage</TAG><VALUE>adult</VALUE></SAMPLE_ATTRIBUTE><SAMPLE_ATTRIBUTE><TAG>sex</TAG><VALUE>male</VALUE></SAMPLE_ATTRIBUTE><SAMPLE_ATTRIBUTE><TAG>tissue</TAG><VALUE>whole organism</VALUE></SAMPLE_ATTRIBUTE><SAMPLE_ATTRIBUTE><TAG>biomaterial_provider</TAG><VALUE>Schmidt Lab</VALUE></SAMPLE_ATTRIBUTE><SAMPLE_ATTRIBUTE><TAG>collection_date</TAG><VALUE>2012-09-13</VALUE></SAMPLE_ATTRIBUTE><SAMPLE_ATTRIBUTE><TAG>geo_loc_name</TAG><VALUE>USA: Linvilla, Pennsylvania</VALUE></SAMPLE_ATTRIBUTE><SAMPLE_ATTRIBUTE><TAG>lat_lon</TAG><VALUE>39.5302 N 75.2449 W</VALUE></SAMPLE_ATTRIBUTE><SAMPLE_ATTRIBUTE><TAG>sample_type</TAG><VALUE>pooled whole organisms</VALUE></SAMPLE_ATTRIBUTE><SAMPLE_ATTRIBUTE><TAG>BioSampleModel</TAG><VALUE>Model organism or animal</VALUE></SAMPLE_ATTRIBUTE></SAMPLE_ATTRIBUTES></SAMPLE><Pool><Member member_name="" accession="SRS1468217" sample_name="PA_9_2012" sample_title="50 pooled individuals" spots="52872089" bases="9686919752" tax_id="7227" organism="Drosophila melanogaster"><IDENTIFIERS><PRIMARY_ID>SRS1468217</PRIMARY_ID><EXTERNAL_ID namespace="BioSample">SAMN04412693</EXTERNAL_ID><SUBMITTER_ID namespace="pda|hmachado" label="Sample name">PA_9_2012</SUBMITTER_ID></IDENTIFIERS></Member></Pool><RUN_SET><RUN accession="SRR3590563" alias="SU_2012_SPT.sorted.rmdup.realign.bam" total_spots="52872089" total_bases="9686919752" size="4722348576" load_done="true" published="2016-06-06 14:50:46" is_public="true" cluster_name="public" static_data_available="1"><IDENTIFIERS><PRIMARY_ID>SRR3590563</PRIMARY_ID><SUBMITTER_ID namespace="SUB1300605" 
label="9">SU_2012_SPT.sorted.rmdup.realign.bam</SUBMITTER_ID></IDENTIFIERS><EXPERIMENT_REF accession="SRX1801508"><IDENTIFIERS><SUBMITTER_ID namespace="SUB1300605">PA_9_2012_lib</SUBMITTER_ID></IDENTIFIERS></EXPERIMENT_REF><Pool><Member member_name="" accession="SRS1468217" sample_name="PA_9_2012" sample_title="50 pooled individuals" spots="52872089" bases="9686919752" tax_id="7227" organism="Drosophila melanogaster"><IDENTIFIERS><PRIMARY_ID>SRS1468217</PRIMARY_ID><EXTERNAL_ID namespace="BioSample">SAMN04412693</EXTERNAL_ID><SUBMITTER_ID namespace="pda|hmachado" label="Sample name">PA_9_2012</SUBMITTER_ID></IDENTIFIERS></Member></Pool><AlignInfo path="/netmnt/traces04/sra12/SRR/003506/SRR3590563" cnt="14"><Alignment name="2LHet" seqid="2LHet" local="y"/><Alignment name="2RHet" seqid="2RHet" local="y"/><Alignment name="3LHet" seqid="3LHet" local="y"/><Alignment name="3RHet" seqid="3RHet" local="y"/><Alignment name="ArmU" seqid="ArmU" local="y"/><Alignment name="ArmUextra" seqid="ArmUextra" local="y"/><Alignment name="XHet" seqid="XHet" local="y"/><Alignment name="YHet" seqid="YHet" local="y"/><Alignment name="arm_2L" seqid="arm_2L" local="y"/><Alignment name="arm_2R" seqid="arm_2R" local="y"/><Alignment name="arm_3L" seqid="arm_3L" local="y"/><Alignment name="arm_3R" seqid="arm_3R" local="y"/><Alignment name="arm_4" seqid="arm_4" local="y"/><Alignment name="arm_X" seqid="arm_X" local="y"/></AlignInfo><Statistics nreads="2" nspots="52872089"><Read index="0" count="52463998" average="92" stdev="0"/><Read index="1" count="52828608" average="92" stdev="0"/></Statistics><Databases><Database><Table name="PRIMARY_ALIGNMENT"><Statistics source="meta"><Rows count="101130128"/><Elements count="9303971776"/></Statistics></Table><Table name="REFERENCE"><Statistics source="meta"><Rows count="33750"/><Elements count="168717020"/></Statistics></Table><Table name="SECONDARY_ALIGNMENT"><Statistics source="meta"><Rows count="444146"/><Elements 
count="40861432"/></Statistics></Table><Table name="SEQUENCE"><Statistics source="meta"><Rows count="52872089"/><Elements count="9686919752"/></Statistics></Table></Database></Databases><Bases cs_native="false" count="9686919752"><Base value="A" count="2827619959"/><Base value="C" count="1986241597"/><Base value="G" count="1998028281"/><Base value="T" count="2821331558"/><Base value="N" count="53698357"/></Bases><Run size="4722363929" date="2016-11-16 00:20:19" accession="/netmnt/traces04/sra12/SRR/003506/SRR3590563" read_length="variable" spot_count="52872089" base_count="9686919752" base_count_bio="9686919752" spot_count_mates="52420517" base_count_bio_mates="9645375128" spot_count_bad="0" base_count_bio_bad="0" spot_count_filtered="0" base_count_bio_filtered="0" cmp_base_count="382947976"><Member member_name="SU_2012_SPT" spot_count="52872089" base_count="9686919752" base_count_bio="9686919752" spot_count_mates="52420517" base_count_bio_mates="9645375128" spot_count_bad="0" base_count_bio_bad="0" spot_count_filtered="0" base_count_bio_filtered="0" cmp_base_count="382947976" library="SU_2012_SPT" sample="SU_2012_SPT"/><Size value="4722348576" units="bytes"/><Bases cs_native="false" count="9686919752"><Base value="A" count="2827619959"/><Base value="C" count="1986241597"/><Base value="G" count="1998028281"/><Base value="T" count="2821331558"/><Base value="N" count="53698357"/></Bases><Statistics nreads="2" nspots="52872089"><Read index="0" count="52463998" average="92" stdev="0"/><Read index="1" count="52828608" average="92" stdev="0"/></Statistics><QualityCount><Quality value="0" count="1298935"/><Quality value="2" count="434461706"/><Quality value="5" count="2625824"/><Quality value="6" count="4443856"/><Quality value="7" count="20094797"/><Quality value="8" count="18826884"/><Quality value="9" count="11322181"/><Quality value="10" count="12973562"/><Quality value="11" count="8004068"/><Quality value="12" count="4576571"/><Quality value="13" 
count="9086228"/><Quality value="14" count="4712556"/><Quality value="15" count="13915038"/><Quality value="16" count="10075712"/><Quality value="17" count="14774290"/><Quality value="18" count="29452510"/><Quality value="19" count="19598615"/><Quality value="20" count="29127244"/><Quality value="21" count="11586758"/><Quality value="22" count="18531570"/><Quality value="23" count="36166569"/><Quality value="24" count="46797766"/><Quality value="25" count="65185668"/><Quality value="26" count="67973519"/><Quality value="27" count="79466262"/><Quality value="28" count="46215134"/><Quality value="29" count="121162001"/><Quality value="30" count="148177082"/><Quality value="31" count="197467109"/><Quality value="32" count="193602423"/><Quality value="33" count="327305905"/><Quality value="34" count="560952703"/><Quality value="35" count="1356753357"/><Quality value="36" count="508246209"/><Quality value="37" count="502310008"/><Quality value="38" count="496884993"/><Quality value="39" count="922825384"/><Quality value="40" count="1104945587"/><Quality value="41" count="2224993168"/></QualityCount><Databases><Database><Table name="PRIMARY_ALIGNMENT"><Statistics source="meta"><Rows count="101130128"/><Elements count="9303971776"/></Statistics></Table><Table name="REFERENCE"><Statistics source="meta"><Rows count="33750"/><Elements count="168717020"/></Statistics></Table><Table name="SECONDARY_ALIGNMENT"><Statistics source="meta"><Rows count="444146"/><Elements count="40861432"/></Statistics></Table><Table name="SEQUENCE"><Statistics source="meta"><Rows count="52872089"/><Elements count="9686919752"/></Statistics></Table></Database></Databases><Hashes run_hash="5SVar+9/DnfDhGcNQ26k3g==" read_hash="ty3gc8XWcZNkLRv7fbcCdA=="/></Run></RUN></RUN_SET></EXPERIMENT_PACKAGE>

The relevant part is this:

<IDENTIFIERS><PRIMARY_ID>SRR3590563</PRIMARY_ID><SUBMITTER_ID namespace="SUB1300605" label="9">SU_2012_SPT.sorted.rmdup.realign.bam</SUBMITTER_ID></IDENTIFIERS>
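To pull that pair (run accession plus submitted file name) out of every entry, something like the following minimal sketch might help. It assumes the downloaded XML keeps the `RUN` / `IDENTIFIERS` / `PRIMARY_ID` / `SUBMITTER_ID` layout shown above; `SAMPLE_XML` is a trimmed stand-in for one entry and `list_runs` is a hypothetical helper name, not part of any NCBI tool.

```python
import xml.etree.ElementTree as ET

# Trimmed stand-in for one EXPERIMENT_PACKAGE entry (assumption: the real
# file has the same RUN/IDENTIFIERS layout as the entry quoted above).
SAMPLE_XML = """
<EXPERIMENT_PACKAGE>
  <RUN_SET>
    <RUN accession="SRR3590563" alias="SU_2012_SPT.sorted.rmdup.realign.bam">
      <IDENTIFIERS>
        <PRIMARY_ID>SRR3590563</PRIMARY_ID>
        <SUBMITTER_ID namespace="SUB1300605" label="9">SU_2012_SPT.sorted.rmdup.realign.bam</SUBMITTER_ID>
      </IDENTIFIERS>
    </RUN>
  </RUN_SET>
</EXPERIMENT_PACKAGE>
"""


def list_runs(xml_text):
    """Return (run accession, submitted file name) for every RUN element."""
    root = ET.fromstring(xml_text)
    runs = []
    # Tag matching is case-sensitive, so the nested <Run> statistics
    # element in the full record is not picked up here.
    for run in root.iter("RUN"):
        accession = run.findtext("IDENTIFIERS/PRIMARY_ID")
        submitted = run.findtext("IDENTIFIERS/SUBMITTER_ID")
        runs.append((accession, submitted))
    return runs


if __name__ == "__main__":
    for accession, submitted in list_runs(SAMPLE_XML):
        print(accession, submitted)
```

For the full BioProject export, the same function can be given the file contents read from disk; `root.iter("RUN")` walks the whole tree, so it does not matter how the entries are wrapped.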

Could anybody point me in the right direction here?

Thank you!

SRA NCBI BAM • 2.0k views

I don't know what the situation is now. Previously, at least, NCBI only provided aligned cSRA files, not BAMs. ERA should have BAMs if the submitters uploaded them.
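If the run is only available as aligned cSRA, one possible route is to rebuild a BAM locally by piping SRA Toolkit's `sam-dump` into `samtools view -b`. The sketch below only assembles that pipeline; it assumes both tools are installed and on `PATH`, the output file name is a placeholder, and the helper names are hypothetical. The resulting BAM is reconstructed from the archive, so it is not guaranteed to be identical to the file the submitters uploaded.

```python
import shlex
import subprocess


def bam_pipeline(accession, out_bam):
    """Build the two commands of a sam-dump | samtools pipeline (runs nothing)."""
    dump_cmd = ["sam-dump", accession]  # writes SAM to stdout
    to_bam_cmd = ["samtools", "view", "-b", "-o", out_bam, "-"]  # "-" reads stdin
    return dump_cmd, to_bam_cmd


def as_shell(dump_cmd, to_bam_cmd):
    """Render the pipeline as a single shell line for inspection."""
    return f"{shlex.join(dump_cmd)} | {shlex.join(to_bam_cmd)}"


def run(dump_cmd, to_bam_cmd):
    """Execute the pipeline; requires sam-dump and samtools to be installed."""
    dump = subprocess.Popen(dump_cmd, stdout=subprocess.PIPE)
    to_bam = subprocess.Popen(to_bam_cmd, stdin=dump.stdout)
    dump.stdout.close()  # so sam-dump gets SIGPIPE if samtools exits early
    to_bam.wait()
    dump.wait()
    return dump.returncode, to_bam.returncode


if __name__ == "__main__":
    dump_cmd, to_bam_cmd = bam_pipeline("SRR3590563", "SRR3590563.bam")
    # Only print the command here; call run(dump_cmd, to_bam_cmd) to execute it.
    print(as_shell(dump_cmd, to_bam_cmd))
```

The commands are built as argument lists rather than one shell string so that accessions and file names never pass through shell parsing when `run` is used.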

8.5 years ago by lh3 33k

Sorry, but what is ERA?

8.5 years ago by mufernando ▴ 10

He probably meant ENA, the European Nucleotide Archive at EBI.

8.5 years ago by mastal511 ★ 2.1k