Question: How to split paired end SRA file into 2 correct fastq files
gravatar for thustar
4.2 years ago by
thustar100 wrote:

Hello Biostars!

In my project, I have to convert several SRA files to fastq files. These SRA files are paired end. I read a previous post about how to use fastq-dump to do so. However, I am still confused about the split step.

For example, after I ran fastq-dump ERR011087.sra, I got ERR011087.fastq which contains paired end reads with the length of 88. The first read looks like

@ERR011087.1 I330_1_FC30JM6AAXX:4:1:0:199 length=88
+ERR011087.1 I330_1_FC30JM6AAXX:4:1:0:199 length=88

After I ran fastq-dump --split-files ERR011087.sra, I did get 2 fastq files, ERR011087_1.fastq and ERR011087_2.fastq. The first read of ERR011087_1.fastq is

@ERR011087.1 I330_1_FC30JM6AAXX:4:1:0:199 length=44  
+ERR011087.1 I330_1_FC30JM6AAXX:4:1:0:199 length=44  

The first read of ERR011087_2.fastq is

@ERR011087.1 I330_1_FC30JM6AAXX:4:1:0:199 length=44  
+ERR011087.1 I330_1_FC30JM6AAXX:4:1:0:199 length=44  

It seems that fastq-dump --split-files just splits each read whose length is 88 in ERR011087.sra into 2 reads whose length is 44. Is just spliting the first half and the last half of a read equal to spliting a paired end read into two fragments?

If so, it is very strange to find that the amount of reads in ERR011087_1.fastq and ERR011087_2.fastq is different. I ran grep "@ERR" ERR011087.fastq |wc -l and got 11640976, ran grep "@ERR" ERR011087_1.fastq |wc -l and got 11640674, ran grep "@ERR" ERR011087_2.fastq |wc -l and got 11640358. I think these numbers represent the amount of reads in each file. However, three numbers are NOT the same. I felt very confused because if fastq-dump --split-files just splits each read whose length is 88 in ERR011087.sra into 2 reads whose length is 44, then the amount of reads in ERR011087_1.fastq and ERR011087_2.fastq should be equal. There must be something wrong with it.

Could anyone explain that?


next-gen fastq sra • 28k views
ADD COMMENTlink modified 5 months ago by Carlo.Ga0 • written 4.2 years ago by thustar100

Hello everyone, I'm replying to this old topic because I'm unfortunately still having a similar issue. I already tried everything I found on this and other similar topics but nothing works for me. I'm trying to download, using fasterq-dump, some SRR* .fastq that are all supposed to be frw and rev (...fastq_1 and ....fastq_2) as they come from illumina sequencing technology. Anyway, some of the files are good, already splitted. Others figure as single fastq files with all reads together (I can see from the lenght that both frw and rev reads are collapsed in one fastq file); I think there was somethig wrong with the NCBI submission.. of course this causes problems for the alignment pipelines. I tried to run a script to manually split and rename the files, that works but takes forever because the files are too many. Any suggestions about some tools to use? I tried from bbmap and all the fastq-dump options already mentioned (even if, for what I understood, the last fasterq-dump now should split the file automatically). Many thanks for the help

ADD REPLYlink written 5 months ago by Carlo.Ga0

Your best bet is to get the data directly as fastq from EBI-ENA. You can use sra-explorer to get the necessary command lines (C: sra-explorer : find SRA and FastQ download URLs in a couple of clicks ).

ADD REPLYlink modified 5 months ago • written 5 months ago by GenoMax95k

Splitting the fastq three ways using --split-e might work here. There may be some erroneous pairs in the set. See fastq-dump help:

 --split-files                 Write reads into separate files. Read
                                 number will be suffixed to the file
                                 name. NOTE! The `--split-3` option is
                                 recommended. In cases where not all
                                 spots have the same number of reads,
                                 this option will produce files that WILL
                                 CAUSE ERRORS in most programs which
                                 process split pair fastq files.
 --split-e                     3-way splitting for mate-pairs. For each
                                 spot, if there are two biological reads
                                 satisfying filter conditions, the first
                                 is placed in the `*_1.fastq` file, and
                                 the second is placed in the `*_2.fastq`
                                 file. If there is only one biological
                                 read satisfying the filter conditions,
                                 it is placed in the `*.fastq` file.All
                                 other reads in the spot are ignored.
ADD REPLYlink written 9 weeks ago by vsevim10
gravatar for Devon Ryan
4.2 years ago by
Devon Ryan98k
Freiburg, Germany
Devon Ryan98k wrote:

--split-files is splitting things according to how the actual reads should be split. If the original dataset happened to be 2x44, then yes it'll just split things in half. The problem with SRA is that a fair number of uploaded datasets are simply crap, i.e., people uploaded poorly formatted or incorrect data. For all ERR* datasets, do not use SRA. Download the original fastq files from ENA. If those have different numbers of reads then that's what was uploaded.

ADD COMMENTlink written 4.2 years ago by Devon Ryan98k

Thanks for your quick reply.

One more question. In your recommended website, there are many options like Fastq files (ftp) and Submitted files (ftp). Is there any difference? Which file should I download?

ADD REPLYlink written 4.2 years ago by thustar100

Go for the submitted files, noting that they'll have different file names from the accession ID.

ADD REPLYlink written 4.2 years ago by Devon Ryan98k
gravatar for Antonio R. Franco
4.2 years ago by
Spain. Universidad de Córdoba
Antonio R. Franco4.5k wrote:

In addition... Once you get the correct sra files, try to use fastq-dump with the legacy --split-3 command, as it happens in some cases that the paired-end files are not synchronized which is what many programs are expecting

With that command, I got sometimes a third fastq file corresponding to those sequences that are not paired or lack its mate for one reason or another

ADD COMMENTlink written 4.2 years ago by Antonio R. Franco4.5k

What are the exact contents of 3 files?

ADD REPLYlink written 4.2 years ago by Santosh Anand5.2k

Read 1, read 2, and orphaned reads.

ADD REPLYlink written 4.2 years ago by Devon Ryan98k

For what it is worth, +1 bounties to both of you :)

ADD REPLYlink written 4.2 years ago by Santosh Anand5.2k

Agree! This is very helpful!

ADD REPLYlink written 3.5 years ago by mabelwongting0
gravatar for Santosh Anand
4.2 years ago by
Santosh Anand5.2k
Santosh Anand5.2k wrote:

Apart from Devon suggestion, when in doubt, always check the Trace DB

There you can see "This run has 2 reads per spot", of 44bp each.

Click in the "Reads" tab there. You can see the individual reads and estimate total number of read-pairs (by checking progressive numbers 1,2,3,.. in header). Click on "1164098" or put that in search bar. It will show you some of the very last reads. Do they look unusual? Does that answer your Q?

ADD COMMENTlink written 4.2 years ago by Santosh Anand5.2k

With crappy SRA files you cannot rely in what is displayed on that web page. This web page just displays data taken from the SRA file. There do exist some SRA files which have been generated from inconsistent datasets.

ADD REPLYlink modified 4.2 years ago • written 4.2 years ago by piet1.8k

A concrete example would have been much more useful than an opinion. Neither do I believe that a the sra website can get better information than what is present in sra file. But the OP's Q was to get the best out of the crappy sra, if you would like to say so.

ADD REPLYlink written 4.2 years ago by Santosh Anand5.2k

Any complex technical system has some errors. You may encounter them if you use the system intensively. I am pretty sure you could also find some faulty submissions if you would do some serious analysis of SRA data sets, especially older Illumina data from around 2012. OP asked for a data set from 2011. ERR033684 and companions is an example, where even the FASTQ files offered by ENA are faulty.

ADD REPLYlink written 4.2 years ago by piet1.8k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1741 users visited in the last hour