Question: How To Convert Sra-Lite Paired-End Submission To Fastq?
7.4 years ago by
Casey Bergman17k
Athens, GA, USA
Casey Bergman17k wrote:

I'm having some trouble converting an Illumina paired end accession from NCBI's SRA to the paired _1 and _2 fastq files using fastq-dump from the SRA toolkit. I'm running fastq-dump version 2.1.0 (June 22, 2011) and following instructions from the NCBI website here.

When I download this accession, SRR189044 (or other accessions from the same project), and convert it to fastq, one or the other of the _1 and _2 fastq files has twice as many sequences, and all of the sequences from the smaller file are also present in the larger file, e.g.

$ wget -rq ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/litesra/SRX/SRX058/SRX058150/
$ fastq-dump -A SRR189044 ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/litesra/SRX/SRX058/SRX058150/SRR189044/SRR189044.lite.sra
$ head SRR189044*fastq
==> SRR189044_1.fastq <==
@SRR189044.1 HWI-EAS66_0007_FC705J6:5:1:1044:2452 length=76
NGTTTCACGTCTGGCGATTTTGACTCATTTTTGAACGAATGCAATGTAACNNNNNNNNAAAATGCAACAGGACCGN
+SRR189044.1 HWI-EAS66_0007_FC705J6:5:1:1044:2452 length=76
############################################################################
@SRR189044.1 HWI-EAS66_0007_FC705J6:5:1:1044:2452 length=76
ATTTTTTCTTTTAACTCATCCATAATTTGTCCTTTTTGTTGTCACCCACAGAACATAATGCTAGGATACTGTTTAA
+SRR189044.1 HWI-EAS66_0007_FC705J6:5:1:1044:2452 length=76
-62077328'//5A5:6?:BA5?--:=67C6?>>=5=;8,'--B?:A69?-5?>-<--;64-?:-CD:BA.@;>4>
@SRR189044.2 HWI-EAS66_0007_FC705J6:5:1:1045:19202 length=76
NTCGATCGCACTGGCGAAGATGAGGAAGCTGTTCTTTCTGGTGATGCTGACNNNNGTCGCCTCGGCCACCGCCTGG

==> SRR189044_2.fastq <==
@SRR189044.1 HWI-EAS66_0007_FC705J6:5:1:1044:2452 length=76
ATTTTTTCTTTTAACTCATCCATAATTTGTCCTTTTTGTTGTCACCCACAGAACATAATGCTAGGATACTGTTTAA
+SRR189044.1 HWI-EAS66_0007_FC705J6:5:1:1044:2452 length=76
-62077328'//5A5:6?:BA5?--:=67C6?>>=5=;8,'--B?:A69?-5?>-<--;64-?:-CD:BA.@;>4>
@SRR189044.2 HWI-EAS66_0007_FC705J6:5:1:1045:19202 length=76
AGCTGCTCGCAGGCGAATATCAGCCAAGAGCAGAACATCACGTCGCATAGATTGGAGCGGTTCATCGAGACGAGCA
+SRR189044.2 HWI-EAS66_0007_FC705J6:5:1:1045:19202 length=76
DBDBC:5CC-A-A:CC-C,C55-::CCB,-DD--5?@=D?=:5D=:=?############################
@SRR189044.3 HWI-EAS66_0007_FC705J6:5:1:1047:3502 length=76
TATGATGTTTAATGCGTTCCCCATTTACTCTTATAAAGTCTTCTAGATTTGTTATTGCAATGCAGGAATTAATAGG

Notice how the _1 file has two reads named SRR189044.1, one of which is the corresponding read in the _2 file.
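
One way to confirm this symptom across a whole file, rather than by eye from head, is to count records and list repeated read names. This is a minimal sketch, assuming the four-lines-per-record FASTQ layout shown above; the function names count_records and duplicated_names are made up for illustration and are not part of the SRA toolkit.

```shell
# Hypothetical helpers, not part of the SRA toolkit.
# Both assume exactly four lines per FASTQ record.

# Print the number of records in a FASTQ file.
count_records() {
    awk 'END { print NR / 4 }' "$1"
}

# Print each read name (first field of the header line) that
# occurs more than once in a FASTQ file, once per name.
duplicated_names() {
    awk 'NR % 4 == 1 { if (++seen[$1] == 2) print $1 }' "$1"
}
```

For example, `count_records SRR189044_1.fastq` and `count_records SRR189044_2.fastq` should print the same number for a correctly split run, and `duplicated_names SRR189044_1.fastq | head` should print nothing.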

I've checked with the data submitters and NCBI, and it looks like there is no duplication of data in the original submission. There is a related post on SEQanswers that unfortunately does not address or help solve this issue. Any ideas on what might be going on here would be appreciated.

Many thanks, Casey


Adding this self-Q&A to help others with the same problem.

written 7.4 years ago by Casey Bergman
7.4 years ago by
Casey Bergman17k
Athens, GA, USA
Casey Bergman17k wrote:

The problem is that your version of the SRA Toolkit is out of date: newer releases of fastq-dump have an un(der)documented option for dumping paired-end data from an SRA-lite submission. The guidance notes on the NCBI website you refer to were written for version 2.0.1, and warn that instructions may differ between versions:

This guide is current to SRA Toolkit version 2.0.1 release candidate 1. Instructions for previous versions of the SRA Toolkit may be different from those provided in this guide. We recommend that users stay current with SRA Toolkit updates to benefit from feature additions and bug fixes.

In the latest version of the SRA Toolkit, 2.1.2 (July 26, 2011), there are now options to split paired-end reads into separate files:

 --split-files                    Dump each read into a separate file.Files will received suffix corresponding to read number
 --split-3                        Legacy 3-file splitting for mate-pairs:
                                  First 2 biological reads satisfying dumping conditions
                                  are placed in files *_1.fastq and *_2.fastq
                                  If only 1 biological read is dumpable - it is placed in *.fastq

The explanation of the "--split-files" option says that each read will be dumped into a separate file, which is ambiguous and could be taken to mean that every individual read gets its own file. It actually means that the first read of each mate pair is written to the _1 fastq file and the second to the _2 fastq file, which is the desired outcome.

For your example, you should upgrade to SRA Toolkit v2.1.2 and run the following commands:

$ wget -rq ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/litesra/SRX/SRX058/SRX058150/
$ fastq-dump --split-files ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/litesra/SRX/SRX058/SRX058150/SRR189044/SRR189044.lite.sra
$ head SRR189044*fastq
==> SRR189044_1.fastq <==
@SRR189044.1 HWI-EAS66_0007_FC705J6:5:1:1044:2452 length=76
NGTTTCACGTCTGGCGATTTTGACTCATTTTTGAACGAATGCAATGTAACNNNNNNNNAAAATGCAACAGGACCGN
+SRR189044.1 HWI-EAS66_0007_FC705J6:5:1:1044:2452 length=76
############################################################################
@SRR189044.2 HWI-EAS66_0007_FC705J6:5:1:1045:19202 length=76
NTCGATCGCACTGGCGAAGATGAGGAAGCTGTTCTTTCTGGTGATGCTGACNNNNGTCGCCTCGGCCACCGCCTGG
+SRR189044.2 HWI-EAS66_0007_FC705J6:5:1:1045:19202 length=76
#4767;;7<:>?@@##############################################################
@SRR189044.3 HWI-EAS66_0007_FC705J6:5:1:1047:3502 length=76
AACAGATTGTATATGTGTTTTTTTTACATGGCTCATTGGCAAATGTTTTTGNNNNATCGAAATCTTTCTCGTATAC

==> SRR189044_2.fastq <==
@SRR189044.1 HWI-EAS66_0007_FC705J6:5:1:1044:2452 length=76
ATTTTTTCTTTTAACTCATCCATAATTTGTCCTTTTTGTTGTCACCCACAGAACATAATGCTAGGATACTGTTTAA
+SRR189044.1 HWI-EAS66_0007_FC705J6:5:1:1044:2452 length=76
-62077328'//5A5:6?:BA5?--:=67C6?>>=5=;8,'--B?:A69?-5?>-<--;64-?:-CD:BA.@;>4>
@SRR189044.2 HWI-EAS66_0007_FC705J6:5:1:1045:19202 length=76
AGCTGCTCGCAGGCGAATATCAGCCAAGAGCAGAACATCACGTCGCATAGATTGGAGCGGTTCATCGAGACGAGCA
+SRR189044.2 HWI-EAS66_0007_FC705J6:5:1:1045:19202 length=76
DBDBC:5CC-A-A:CC-C,C55-::CCB,-DD--5?@=D?=:5D=:=?############################
@SRR189044.3 HWI-EAS66_0007_FC705J6:5:1:1047:3502 length=76
TATGATGTTTAATGCGTTCCCCATTTACTCTTATAAAGTCTTCTAGATTTGTTATTGCAATGCAGGAATTAATAGG
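
To check the whole run rather than just the first few records, the read names in the two files can be compared directly. The following is a rough sketch, assuming four lines per FASTQ record and that mates share the same name in the first field of the header line (as in the output above); check_pairing is a made-up helper name, not an SRA toolkit command.

```shell
# Hypothetical helper, not part of the SRA toolkit.
# Returns 0 only if both FASTQ files list the same read names
# (first field of each header line) in the same order.
check_pairing() {
    names1=$(mktemp) || return 2
    names2=$(mktemp) || { rm -f "$names1"; return 2; }
    awk 'NR % 4 == 1 { print $1 }' "$1" > "$names1"
    awk 'NR % 4 == 1 { print $1 }' "$2" > "$names2"
    if cmp -s "$names1" "$names2"; then
        status=0
    else
        status=1
    fi
    rm -f "$names1" "$names2"
    return "$status"
}
```

Usage: `check_pairing SRR189044_1.fastq SRR189044_2.fastq && echo "files are paired"`.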

Hope this helps!


This explains EVERYTHING. But seriously, this option should be stressed. I can't believe I missed it, and without it my assemblies were making almost no contigs.

written 6.7 years ago by Lee Katz

Followup question: is there a way to output an interwoven or shuffled file for input to Velvet?

written 6.7 years ago by Lee Katz
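
The thread does not answer this follow-up, but once the _1 and _2 files are correctly split, an interleaved file (each _1 record immediately followed by its _2 mate) can be produced with standard tools. The sketch below assumes four lines per FASTQ record and that both files hold the same reads in the same order; interleave_fastq is a made-up name, and whether the result matches exactly what a given assembler expects should be checked against that assembler's documentation.

```shell
# Hypothetical helper, not part of the SRA toolkit or Velvet.
# Writes an interleaved FASTQ stream to stdout: one record from
# the first file, then the corresponding record from the second.
interleave_fastq() {
    awk -v mate2="$2" '
        { print }                      # copy the current mate-1 line
        NR % 4 == 0 {                  # a mate-1 record just ended
            for (i = 0; i < 4; i++) {  # emit the next mate-2 record
                if ((getline line < mate2) > 0) print line
            }
        }
    ' "$1"
}
```

Usage: `interleave_fastq SRR189044_1.fastq SRR189044_2.fastq > SRR189044_interleaved.fastq`.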

I would also recommend the --helicos option, which makes the generated fastq files smaller :)

written 5.9 years ago by dli220
3.4 years ago by
UCSC
Maximilian Haeussler1.3k wrote:

I had the same problem. This help message is prone to misunderstanding: "Dump each read into a separate file".


emailed SRA and pointed it out

written 3.4 years ago by Maximilian Haeussler