Question

E. coli K12 MG1655 datasets from Illumina

8.8 years ago

midox ▴ 290

Hello,

I am looking for E. coli K12 MG1655 reads generated on an Illumina sequencer. Can you give me a link to download the data? I found a dataset on the Illumina website, but I do not know how to download it.

Thank you

E.coli illumina-datasets database • 9.8k views

Ram · Answer 1 · 2017-10-12

If you are looking for an Illumina dataset for E. coli K12 strain MG1655 that can be downloaded from a public server (i.e. without a BaseSpace account) and has stable INSDC accession numbers, I recommend using https://www.ebi.ac.uk/ena/data/view/ERX008638. ERX008638 is a GAIIx 2x100bp run derived from the ATCC MG1655 type strain and was generated by Illumina in late 2010. The NCBI SRA abstract for ERX008638 states:

E. coli K-12 holds a key position as a model organism in studies of molecular biology, biochemistry, genetics and biotechnology. A high-quality reference sequence of the genome of E. coli K-12 strain MG1655 has previously been generated using the Sanger dideoxy sequencing method (Reference sequence accession: NC_000913). We have re-sequenced the genome of the MG1655 strain using the Illumina Genome Analyzer IIx to provide a paired-end, 100 base sequence dataset for development of de novo sequence assembly algorithms or other downstream analysis tools. We generated a sequencing library with a median insert size of 500 bp following random fragmentation and gel fractionation of genomic DNA. Sequence reads were aligned to the reference genome using ELANDv2e. This study supersedes experiment ERX002508 and run ERR008613 under study ERP000092.

The MG1655 dataset referred to by Adrian Pelin, which is available from the SPAdes developers' website, corresponds to https://www.ebi.ac.uk/ena/data/view/ERX002508. This dataset was originally created by Illumina and submitted to EBI in early 2010. The tar.gz file provided by the submitter is corrupted, so the ENA submission is incomplete (i.e. fastq files are not available). This is probably why the SPAdes developers provide their own version of the files. Downloads from the SPAdes site are slow, and the lack of stable IDs and checksums makes the files from the SPAdes website less than ideal. The NCBI SRA submission for ERX002508 appears to be valid (I have not thoroughly checked this), but downloads from SRA require conversion from SRA format to fastq. Furthermore, as noted in the NCBI SRA abstract for ERX008638 above, Illumina say that ERX008638 supersedes ERX002508.

It also appears that ERX008638 is preferred by most people in the community: 11 Open Access papers have been published citing ERX008638 (https://www.ncbi.nlm.nih.gov/pmc/?term=ERX008638+or+ERR022075), in contrast to 3 for ERX002508 (https://www.ncbi.nlm.nih.gov/pmc/?term=ERX002508+or+ERR008613).
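
For anyone who wants to script the download and check the files against ENA's checksums, here is a minimal sketch in Python. It assumes that the ENA Portal API `filereport` endpoint returns the `fastq_ftp` and `fastq_md5` fields as semicolon-separated lists, that the paths in `fastq_ftp` are served over FTP, and that ERR022075 (taken from the PMC search link above) is the run belonging to ERX008638. The function names are my own and are only for illustration.

```python
import csv
import hashlib
import io
import urllib.parse
import urllib.request

# Assumed endpoint: the ENA Portal API file report service.
ENA_FILEREPORT = "https://www.ebi.ac.uk/ena/portal/api/filereport"


def filereport_url(accession):
    """Build the file report query URL for a run accession."""
    query = urllib.parse.urlencode({
        "accession": accession,
        "result": "read_run",
        "fields": "run_accession,fastq_ftp,fastq_md5",
        "format": "tsv",
    })
    return f"{ENA_FILEREPORT}?{query}"


def parse_filereport(tsv_text, scheme="ftp"):
    """Return a list of (url, md5) pairs, one per FASTQ file in the report."""
    files = []
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        urls = row["fastq_ftp"].split(";")
        md5s = row["fastq_md5"].split(";")
        for url, md5 in zip(urls, md5s):
            if url:  # skip runs that have no FASTQ files listed
                files.append((f"{scheme}://{url}", md5))
    return files


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 of a file without loading it all into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def download_and_verify(accession):
    """Download every FASTQ file of a run and check it against ENA's MD5."""
    with urllib.request.urlopen(filereport_url(accession)) as response:
        report = response.read().decode()
    for url, expected_md5 in parse_filereport(report):
        local_name = url.rsplit("/", 1)[-1]
        urllib.request.urlretrieve(url, local_name)
        if md5_of(local_name) != expected_md5:
            raise ValueError(f"checksum mismatch for {local_name}")
```

Calling `download_and_verify("ERR022075")` should then fetch the paired fastq.gz files into the current directory and raise an error if either one does not match its published checksum. I have not benchmarked this, and the files are large, so a dedicated download tool may be more practical.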

Ram · Answer 2 · 2015-10-04

8.8 years ago

Adrian Pelin ★ 2.6k

I know the SPAdes guys offer it here: http://spades.bioinf.spbau.ru/spades_test_datasets/ecoli_mc/

6.2Gb, 28M reads, 2x100bp, insert size ~ 215bp (Illumina Genome Analyzer IIx). E. coli K-12 MG1655 reference length is 4639675 bp with 4324 annotated genes.
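
As a rough sanity check on those numbers, the expected sequencing depth can be estimated as reads times read length divided by genome length. The sketch below assumes that "28M reads" counts individual reads rather than read pairs; if it counts pairs, the depth doubles. The function name is just for illustration.

```python
def expected_coverage(n_reads, read_length_bp, genome_length_bp):
    """Average depth = total sequenced bases / genome length."""
    return n_reads * read_length_bp / genome_length_bp


reads = 28_000_000          # assumption: individual reads, not pairs
read_length = 100           # bp, from the 2x100bp description
genome_length = 4_639_675   # bp, MG1655 reference length quoted above

depth = expected_coverage(reads, read_length, genome_length)
print(f"approximate depth: {depth:.0f}x")
```

That works out to roughly 600x, which is far more than most assemblers need, so subsampling the reads is a reasonable option.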
