Why Are There 3 FASTQ Files In This Paired-End Data?
12.6 years ago
Hanfei Sun ▴ 60

Raw data: http://www.ebi.ac.uk/ena/data/view/SRR346373&display=html

Also on NCBI: http://www.ncbi.nlm.nih.gov/sra?term=%09SRR346373

I downloaded them, and the first 4 lines of each file look like this:

SRR346373$ head -4 S*fastq
==> SRR346373_1.fastq <==
@SRR346373.13045 0176_20090623_2_H3K4me3_28_21_20/1
T23133223302220122222232212322320332
+
!%(#$%#$%%####*%#%##&#$##$##&#&#$$,+

==> SRR346373_2.fastq <==
@SRR346373.13045 0176_20090623_2_H3K4me3_28_21_20/2
G0012130112
+
!*)&#$&'###

==> SRR346373.fastq <==
@SRR346373.1 0176_20090623_2_H3K4me3_3_25_119/1
T30200011130100000000000000000000000
+
!%/%%5)&4(%#(7&?2&'6&.,684;.6>',7A?1

It seems obvious that the _1 and _2 FASTQ files belong to the same paired-end data. But what does SRR346373.fastq stand for? It is much smaller than the other two FASTQ files (about 1/20 of their line count). Does anyone know what it means?

paired-end solid barcode • 9.0k views

It looks like SRR346373 is the first read, SRR346373_1 is the second read and SRR346373_2 is the barcode. The NCBI page you link to has details associating each barcode sequence with the sample and replicate.


I don't think so. SRR346373_1.fastq and SRR346373_2.fastq both have 87354416 lines, while SRR346373.fastq has only 4213292 lines. It's possible that SRR346373_1.fastq is paired with SRR346373_2.fastq, but if SRR346373.fastq were the barcode file, how could it have so few lines?
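The comparison here rests on line counts. Since each FASTQ record spans four lines, the number of reads is the line count divided by 4. A minimal shell sketch, assuming unwrapped four-line records (as in the excerpts in this thread); the function name count_fastq_records is made up for illustration:

```shell
# Hypothetical helper: print "file<TAB>records" for each FASTQ file given.
# Assumes every record is exactly 4 lines (no wrapped sequence lines).
count_fastq_records() {
    for f in "$@"; do
        lines=$(wc -l < "$f")
        printf '%s\t%d\n' "$f" "$((lines / 4))"
    done
}
```

Called as `count_fastq_records SRR346373_1.fastq SRR346373_2.fastq SRR346373.fastq`. For the counts quoted above, 87354416 lines correspond to 21838604 reads and 4213292 lines to 1053323 reads.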


I read the NCBI page about the barcodes and tried to split by the barcode file, but if the barcode file can't be mapped to the paired-end files line by line, I don't think it makes sense.


Hi all,

Sorry to bring back this old thread, but I noticed something new that is relevant to it. In the past, when I used wget and a local fastq-dump, I usually got only _1.fastq.gz and _2.fastq.gz, and sometimes also a third file for the single reads. However, in my recent direct use of fastq-dump (v2.6.3) against the NCBI server, run as /fastq-dump with --split-files --gzip sraID (no choice, as the FTP URL is no longer available), I got _1.fastq.gz and _3.fastq.gz (instead of _2), which seem to represent the paired-end sequences. In agreement with this, the SRA record indicates that the barcode is between the two reads. So I guess that in this case _1 and _3 are the paired-end sequences when --split-files is used. I haven't tried --split-3; perhaps it would produce _1, _2 and the third file. Below is the first read from both _1 and _3.

$ zcat SRR395614_1.fastq.gz |head -n 4
@SRR395614.1 D050VACXX110915:1:1101:6706:2140 length=101
AAAGAATGGAATCATCAAATGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNTGNNNNNNNCNTNGNNNNNNNTCCNNNNNAATNATNGNATAAAATCGAA
+SRR395614.1 D050VACXX110915:1:1101:6706:2140 length=101
<<<???@???@@?@?@@@??#################################################################################
$ zcat SRR395614_3.fastq.gz |head -n 4
@SRR395614.1 D050VACXX110915:1:1101:6706:2140 length=101
TCGAGTCAATTCGACGATTCTATTCCATTCCCTTCGATGATGATTCCATTTCACTCCATTAGATGATTCCATTCGACTCAATTTGGTGATGATTCAATTCG
+SRR395614.1 D050VACXX110915:1:1101:6706:2140 length=101
@@@FFBDEHHHHHGBGIJJGGGHGIGIIHJJJJJJJJGGCGIEHI@FIHIFHEGGDHBFICGEHIJJJEHGHIEHHIJHHCEHHFEBDFEEEFEECEEECD

I also noticed a much slower speed compared to wget, and will try the option of converting fastq to fastq.gz locally. Any comments/corrections are appreciated.
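For anyone wanting to compare the two splitting modes mentioned in this post, they can be run side by side. This is only a sketch: the wrapper names are hypothetical, it assumes fastq-dump from the SRA Toolkit is on PATH, and which suffix ends up holding which read depends on how the run is laid out in SRA, so inspect the output rather than assuming.

```shell
# Hypothetical wrappers around fastq-dump (SRA Toolkit); assumes it is on PATH.

# --split-files: one output file per read in the spot, suffixed with the
# read number (_1, _2, _3, ...).
dump_split_files() {
    fastq-dump --split-files --gzip "$1"
}

# --split-3: mates go to the _1 and _2 files; reads without a mate go to a
# third file that has no numeric suffix.
dump_split3() {
    fastq-dump --split-3 --gzip "$1"
}
```

For example, `dump_split3 SRR395614` would be the --split-3 counterpart of the command used above.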

Thanks a lot.

Ping


If in doubt grab the fastq files from ENA directly.

12.6 years ago

I'd guess it is a file of the remaining unpaired reads.

The _1 and _2 files should have the same sequence IDs in the same order. The third file contains reads for which a paired sequence was not generated, and it may contain reads labeled either /1 or /2.

Structuring the data this way saves having to do an uneven traversal of the two files: you can always assume that the 200th read in the _1 file corresponds to the 200th read in the _2 file.
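That positional assumption can be checked directly by comparing read names from the two files. A rough sketch, assuming four-line records and that the read name is the first whitespace-delimited field of the header (as in the SRR346373 excerpts); the function names are hypothetical:

```shell
# Print the read name of every record: first header field, without the
# leading "@" and without a trailing /1 or /2 if one is attached to it.
fastq_read_names() {
    awk 'NR % 4 == 1 {
        name = $1
        sub(/^@/, "", name)
        sub(/\/[12]$/, "", name)
        print name
    }' "$1"
}

# Return 0 only if both files list the same read names in the same order.
check_pairs_in_sync() {
    names1=$(mktemp) || return 2
    names2=$(mktemp) || return 2
    fastq_read_names "$1" > "$names1"
    fastq_read_names "$2" > "$names2"
    if cmp -s "$names1" "$names2"; then status=0; else status=1; fi
    rm -f "$names1" "$names2"
    return "$status"
}
```

Usage would be something like `check_pairs_in_sync SRR346373_1.fastq SRR346373_2.fastq && echo "in sync"`.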

Being AB_SOLiD data, the _1 file is the Forward [F3] read (T prefix), the _2 file is the Reverse [R3] read (G prefix).
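The prefix claim is easy to check on the downloaded files. A small sketch, assuming four-line records; first_base_counts is a hypothetical helper name:

```shell
# Tally the first character of every sequence line (line 2 of each record)
# and print one "character count" pair per line.
first_base_counts() {
    awk 'NR % 4 == 2 { n[substr($0, 1, 1)]++ }
         END { for (b in n) print b, n[b] }' "$1"
}
```

If the description above holds, `first_base_counts SRR346373_1.fastq` should report only T and `first_base_counts SRR346373_2.fastq` only G.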


I think that makes sense, thanks!

12.6 years ago
Ahdf-Lell-Kocks ★ 1.6k

The third file is the barcode, the other two are the paired end reads.
