Question: How to split paired end SRA file into 2 correct fastq files
5
2.4 years ago by
thustar100
thustar100 wrote:

Hello Biostars!

In my project, I have to convert several paired-end SRA files to fastq files. I read a previous post about how to do this with fastq-dump, but I am still confused about the split step.

For example, after I ran fastq-dump ERR011087.sra, I got ERR011087.fastq which contains paired end reads with the length of 88. The first read looks like

@ERR011087.1 I330_1_FC30JM6AAXX:4:1:0:199 length=88
TTCANATATGGAAAAACAGGGAGCGGAAATCACGTTACTTGCGTATCATCGGAAAAGGCAGGCTGTCCATGCTCCAACCGGTTAATGA
+ERR011087.1 I330_1_FC30JM6AAXX:4:1:0:199 length=88
IIII"9I;III<*+<-45CI13;-=93+046/0<1:-06>4.2+4:I86III0.863;GA@7I:5./2$62110='0(2(0$+++&+(

After I ran fastq-dump --split-files ERR011087.sra, I did get 2 fastq files, ERR011087_1.fastq and ERR011087_2.fastq. The first read of ERR011087_1.fastq is

@ERR011087.1 I330_1_FC30JM6AAXX:4:1:0:199 length=44  
TTCANATATGGAAAAACAGGGAGCGGAAATCACGTTACTTGCGT   
+ERR011087.1 I330_1_FC30JM6AAXX:4:1:0:199 length=44  
IIII"9I;III<*+<-45CI13;-=93+046/0<1:-06>4.2+

The first read of ERR011087_2.fastq is

@ERR011087.1 I330_1_FC30JM6AAXX:4:1:0:199 length=44  
ATCATCGGAAAAGGCAGGCTGTCCATGCTCCAACCGGTTAATGA  
+ERR011087.1 I330_1_FC30JM6AAXX:4:1:0:199 length=44  
4:I86III0.863;GA@7I:5./2$62110='0(2(0$+++&+(

It seems that fastq-dump --split-files just splits each 88-base read in ERR011087.sra into two 44-base reads. Is splitting a read into its first half and its last half really the same as splitting a paired-end read into its two mates?

If so, it is very strange that ERR011087_1.fastq and ERR011087_2.fastq contain different numbers of reads. I ran grep "@ERR" ERR011087.fastq | wc -l and got 11640976; grep "@ERR" ERR011087_1.fastq | wc -l gave 11640674, and grep "@ERR" ERR011087_2.fastq | wc -l gave 11640358. I take these to be the number of reads in each file, yet the three numbers are NOT the same. This confuses me: if fastq-dump --split-files just splits each 88-base read in ERR011087.sra into two 44-base reads, then ERR011087_1.fastq and ERR011087_2.fastq should contain the same number of reads. Something must be wrong.
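As a cross-check on counts like these, here is a minimal Python sketch that counts records instead of grepping for "@ERR". It assumes unwrapped FASTQ with exactly four lines per record; the function name and the ERR000000 demo records are made up for illustration. Counting records avoids one pitfall of the grep approach: a quality line can itself contain "@ERR" and would be counted as a read.

```python
import io

def count_fastq_records(handle):
    """Count records in a FASTQ stream, assuming exactly 4 lines per record."""
    n_lines = 0
    for i, line in enumerate(handle):
        # Every 4th line (0, 4, 8, ...) must be a header starting with "@".
        if i % 4 == 0 and not line.startswith("@"):
            raise ValueError(f"line {i + 1} is not a FASTQ header: {line!r}")
        n_lines += 1
    if n_lines % 4 != 0:
        raise ValueError("truncated FASTQ: line count is not a multiple of 4")
    return n_lines // 4

# Hypothetical demo data: the quality line of the second record is "@ERR".
demo_text = (
    "@ERR000000.1 demo\nACGT\n+\nIIII\n"
    "@ERR000000.2 demo\nACGT\n+\n@ERR\n"
)

n_records = count_fastq_records(io.StringIO(demo_text))

# What a plain substring search for "@ERR" would report on the same text.
n_grep_like = sum("@ERR" in line for line in demo_text.splitlines())

print(n_records, n_grep_like)
```

On real files you would pass an open file handle, for example `count_fastq_records(open("ERR011087_1.fastq"))`. If the counts for the two mate files still differ, the discrepancy is in the data rather than in the counting.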

Could anyone explain that?

Thanks.

next-gen fastq sra • 16k views
ADD COMMENTlink modified 2.4 years ago by Antonio R. Franco4.0k • written 2.4 years ago by thustar100
5
2.4 years ago by
Devon Ryan89k
Freiburg, Germany
Devon Ryan89k wrote:

--split-files is splitting things according to how the actual reads should be split. If the original dataset happened to be 2x44, then yes it'll just split things in half. The problem with SRA is that a fair number of uploaded datasets are simply crap, i.e., people uploaded poorly formatted or incorrect data. For all ERR* datasets, do not use SRA. Download the original fastq files from ENA. If those have different numbers of reads then that's what was uploaded.
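To make the "2x44" point concrete, here is a conceptual Python sketch of cutting one spot into its reads according to a list of read lengths. It is only an illustration of the idea, not fastq-dump's actual implementation; `split_spot` is a hypothetical helper, and the sequence is the first spot quoted in the question.

```python
def split_spot(bases, read_lengths):
    """Cut a concatenated spot into consecutive reads of the given lengths."""
    if sum(read_lengths) != len(bases):
        raise ValueError(
            f"read lengths {read_lengths} do not add up to spot length {len(bases)}"
        )
    reads, start = [], 0
    for n in read_lengths:
        reads.append(bases[start:start + n])
        start += n
    return reads

# First spot of ERR011087 as dumped without --split-files (88 bases),
# written as two adjacent string literals for readability.
spot = (
    "TTCANATATGGAAAAACAGGGAGCGGAAATCACGTTACTTGCGT"
    "ATCATCGGAAAAGGCAGGCTGTCCATGCTCCAACCGGTTAATGA"
)

# The run is described as 2 reads per spot of 44 bp each.
read1, read2 = split_spot(spot, [44, 44])
print(read1)
print(read2)
```

For a 2x44 layout the cut happens to fall in the middle of the spot. For an asymmetric layout the same helper would cut wherever the recorded lengths say, which is why "split in half" is a coincidence of this dataset rather than a rule.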

ADD COMMENTlink written 2.4 years ago by Devon Ryan89k

Thanks for your quick reply.

One more question. In your recommended website, there are many options like Fastq files (ftp) and Submitted files (ftp). Is there any difference? Which file should I download?

ADD REPLYlink written 2.4 years ago by thustar100

Go for the submitted files, noting that they'll have different file names from the accession ID.

ADD REPLYlink written 2.4 years ago by Devon Ryan89k
8
2.4 years ago by
Spain. Universidad de Córdoba
Antonio R. Franco4.0k wrote:

In addition... Once you get the correct sra files, try using fastq-dump with the legacy --split-3 option, because in some cases the paired-end files are not synchronized, and many programs expect them to be.

With that option, I sometimes got a third fastq file containing the sequences that are unpaired or lack their mate for one reason or another.
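If you already have two mate files and want to see which reads lack a partner, the idea can be sketched in Python as below. This only mimics the three-way outcome (read 1, read 2, orphans) on already-dumped FASTQ; it is not how fastq-dump works internally. The helper names and the ERR000000 records are hypothetical, four-line records are assumed, and all read IDs are assumed to fit in memory.

```python
def read_ids(fastq_lines):
    """Return read IDs (header text up to the first space, without '@'), in order."""
    return [
        line[1:].split()[0]
        for i, line in enumerate(fastq_lines)
        if i % 4 == 0  # header line of each 4-line record
    ]

def partition_mates(ids_1, ids_2):
    """Split IDs into those present in both files and those present in only one."""
    set_1, set_2 = set(ids_1), set(ids_2)
    paired = [rid for rid in ids_1 if rid in set_2]
    orphans = [rid for rid in ids_1 if rid not in set_2]
    orphans += [rid for rid in ids_2 if rid not in set_1]
    return paired, orphans

def make_records(numbers):
    """Build hypothetical 4-line FASTQ records for the given read numbers."""
    lines = []
    for n in numbers:
        lines += [f"@ERR000000.{n} demo length=4", "ACGT", "+", "IIII"]
    return lines

mate1_lines = make_records([1, 2, 3])  # read 2 has no partner in the second file
mate2_lines = make_records([1, 3, 4])  # read 4 has no partner in the first file

paired, orphans = partition_mates(read_ids(mate1_lines), read_ids(mate2_lines))
print(paired)
print(orphans)
```

Writing the paired IDs to two synchronized files and the orphans to a third would reproduce the layout described above.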

ADD COMMENTlink written 2.4 years ago by Antonio R. Franco4.0k
1

What are the exact contents of 3 files?

ADD REPLYlink written 2.4 years ago by Santosh Anand4.7k
3

Read 1, read 2, and orphaned reads.

ADD REPLYlink written 2.4 years ago by Devon Ryan89k

For what it is worth, +1 bounties to both of you :)

ADD REPLYlink written 2.4 years ago by Santosh Anand4.7k

Agree! This is very helpful!

ADD REPLYlink written 20 months ago by mabelwongting0
4
2.4 years ago by
Santosh Anand4.7k
Santosh Anand4.7k wrote:

Apart from Devon's suggestion, when in doubt, always check the Trace DB: https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=ERR011087

There you can see "This run has 2 reads per spot", of 44bp each.

Click on the "Reads" tab there. You can see the individual reads and estimate the total number of read pairs (by checking the progressive numbers 1,2,3,.. in the headers). Click on "1164098" or put that in the search bar. It will show you some of the very last reads. Do they look unusual? Does that answer your Q?

ADD COMMENTlink written 2.4 years ago by Santosh Anand4.7k

With crappy SRA files you cannot rely on what is displayed on that web page. The page just displays data taken from the SRA file, and some SRA files have been generated from inconsistent datasets.

ADD REPLYlink modified 2.4 years ago • written 2.4 years ago by piet1.6k

A concrete example would have been much more useful than an opinion. Nor do I believe that the SRA website can get better information than what is present in the SRA file. But the OP's Q was how to get the best out of the crappy SRA file, if you would like to put it that way.

ADD REPLYlink written 2.4 years ago by Santosh Anand4.7k

Any complex technical system has some errors. You may encounter them if you use the system intensively. I am pretty sure you could also find some faulty submissions if you did some serious analysis of SRA data sets, especially older Illumina data from around 2012. OP asked about a data set from 2011. ERR033684 and its companions are an example where even the FASTQ files offered by ENA are faulty.

ADD REPLYlink written 2.4 years ago by piet1.6k