How to split paired end SRA file into 2 correct fastq files
4
7
Entering edit mode
5.9 years ago
thustar ▴ 120

Hello Biostars!

In my project, I have to convert several SRA files to fastq files. These SRA files are paired end. I read a previous post about how to use fastq-dump to do so. However, I am still confused about the split step.

For example, after I ran fastq-dump ERR011087.sra, I got ERR011087.fastq which contains paired end reads with the length of 88. The first read looks like

@ERR011087.1 I330_1_FC30JM6AAXX:4:1:0:199 length=88
TTCANATATGGAAAAACAGGGAGCGGAAATCACGTTACTTGCGTATCATCGGAAAAGGCAGGCTGTCCATGCTCCAACCGGTTAATGA
+ERR011087.1 I330_1_FC30JM6AAXX:4:1:0:199 length=88
IIII"9I;III<*+<-45CI13;-=93+046/0<1:-06>4.2+4:I86III0.863;GA@7I:5./2$62110='0(2(0$+++&+(


After I ran fastq-dump --split-files ERR011087.sra, I did get 2 fastq files, ERR011087_1.fastq and ERR011087_2.fastq. The first read of ERR011087_1.fastq is

@ERR011087.1 I330_1_FC30JM6AAXX:4:1:0:199 length=44
TTCANATATGGAAAAACAGGGAGCGGAAATCACGTTACTTGCGT
+ERR011087.1 I330_1_FC30JM6AAXX:4:1:0:199 length=44
IIII"9I;III<*+<-45CI13;-=93+046/0<1:-06>4.2+


The first read of ERR011087_2.fastq is

@ERR011087.1 I330_1_FC30JM6AAXX:4:1:0:199 length=44
ATCATCGGAAAAGGCAGGCTGTCCATGCTCCAACCGGTTAATGA
+ERR011087.1 I330_1_FC30JM6AAXX:4:1:0:199 length=44
4:I86III0.863;GA@7I:5./2$62110='0(2(0$+++&+(


It seems that fastq-dump --split-files just splits each read whose length is 88 in ERR011087.sra into 2 reads whose length is 44. Is just spliting the first half and the last half of a read equal to spliting a paired end read into two fragments?

If so, it is very strange to find that the amount of reads in ERR011087_1.fastq and ERR011087_2.fastq is different. I ran grep "@ERR" ERR011087.fastq |wc -l and got 11640976, ran grep "@ERR" ERR011087_1.fastq |wc -l and got 11640674, ran grep "@ERR" ERR011087_2.fastq |wc -l and got 11640358. I think these numbers represent the amount of reads in each file. However, three numbers are NOT the same. I felt very confused because if fastq-dump --split-files just splits each read whose length is 88 in ERR011087.sra into 2 reads whose length is 44, then the amount of reads in ERR011087_1.fastq and ERR011087_2.fastq should be equal. There must be something wrong with it.

Could anyone explain that?

Thanks.

next-gen sra fastq • 39k views
0
Entering edit mode

Hello everyone, I'm replying to this old topic because I'm unfortunately still having a similar issue. I already tried everything I found on this and other similar topics but nothing works for me. I'm trying to download, using fasterq-dump, some SRR* .fastq that are all supposed to be frw and rev (...fastq_1 and ....fastq_2) as they come from illumina sequencing technology. Anyway, some of the files are good, already splitted. Others figure as single fastq files with all reads together (I can see from the lenght that both frw and rev reads are collapsed in one fastq file); I think there was somethig wrong with the NCBI submission.. of course this causes problems for the alignment pipelines. I tried to run a script to manually split and rename the files, that works but takes forever because the files are too many. Any suggestions about some tools to use? I tried reformat.sh from bbmap and all the fastq-dump options already mentioned (even if, for what I understood, the last fasterq-dump now should split the file automatically). Many thanks for the help

1
Entering edit mode

Your best bet is to get the data directly as fastq from EBI-ENA. You can use sra-explorer to get the necessary command lines (C: sra-explorer : find SRA and FastQ download URLs in a couple of clicks ).

0
Entering edit mode

Splitting the fastq three ways using --split-e might work here. There may be some erroneous pairs in the set. See fastq-dump help:

 --split-files                 Write reads into separate files. Read
number will be suffixed to the file
name. NOTE! The --split-3 option is
recommended. In cases where not all
spots have the same number of reads,
this option will produce files that WILL
CAUSE ERRORS in most programs which
process split pair fastq files.
--split-e                     3-way splitting for mate-pairs. For each
spot, if there are two biological reads
satisfying filter conditions, the first
is placed in the *_1.fastq file, and
the second is placed in the *_2.fastq
file. If there is only one biological
it is placed in the *.fastq file.All
other reads in the spot are ignored.

10
Entering edit mode
5.9 years ago

--split-files is splitting things according to how the actual reads should be split. If the original dataset happened to be 2x44, then yes it'll just split things in half. The problem with SRA is that a fair number of uploaded datasets are simply crap, i.e., people uploaded poorly formatted or incorrect data. For all ERR* datasets, do not use SRA. Download the original fastq files from ENA. If those have different numbers of reads then that's what was uploaded.

0
Entering edit mode

One more question. In your recommended website, there are many options like Fastq files (ftp) and Submitted files (ftp). Is there any difference? Which file should I download?

0
Entering edit mode

Go for the submitted files, noting that they'll have different file names from the accession ID.

10
Entering edit mode
5.9 years ago

In addition... Once you get the correct sra files, try to use fastq-dump with the legacy --split-3 command, as it happens in some cases that the paired-end files are not synchronized which is what many programs are expecting

With that command, I got sometimes a third fastq file corresponding to those sequences that are not paired or lack its mate for one reason or another

1
Entering edit mode

What are the exact contents of 3 files?

5
Entering edit mode

0
Entering edit mode

For what it is worth, +1 bounties to both of you :)

0
Entering edit mode

4
Entering edit mode
5.9 years ago

Apart from Devon suggestion, when in doubt, always check the Trace DB https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=ERR011087

There you can see "This run has 2 reads per spot", of 44bp each.

Click in the "Reads" tab there. You can see the individual reads and estimate total number of read-pairs (by checking progressive numbers 1,2,3,.. in header). Click on "1164098" or put that in search bar. It will show you some of the very last reads. Do they look unusual? Does that answer your Q?

0
Entering edit mode

With crappy SRA files you cannot rely in what is displayed on that web page. This web page just displays data taken from the SRA file. There do exist some SRA files which have been generated from inconsistent datasets.

0
Entering edit mode

A concrete example would have been much more useful than an opinion. Neither do I believe that a the sra website can get better information than what is present in sra file. But the OP's Q was to get the best out of the crappy sra, if you would like to say so.

0
Entering edit mode

Any complex technical system has some errors. You may encounter them if you use the system intensively. I am pretty sure you could also find some faulty submissions if you would do some serious analysis of SRA data sets, especially older Illumina data from around 2012. OP asked for a data set from 2011. ERR033684 and companions is an example, where even the FASTQ files offered by ENA are faulty.

1
Entering edit mode
7 months ago

I have seen that some of this files have the following format:

@sequence_id_1.1 \ sequence \ + \ quality

@sequence_id_1.2 \ sequence \ + \ quality

@sequence_id_2.1 \ sequence \ + \ quality

@sequence_id_2.2 \ sequence \ + \ quality

Assuming that sequence pairs are defined by sequence_id_number.1(forward) and sequence_id_number.2(reverse) we can use awk:

zcat file.fastq.gz | awk 'NR%8==1{c=4} c&&c--' > file_1.fastq


and

zcat file.fastq.gz | awk 'NR%8==5{c=4} c&&c--' > file_2.fastq