Question: Why Are There 3 Fastq Files In This Paired-End Data?
Hanfei Sun60 wrote:

Raw data: http://www.ebi.ac.uk/ena/data/view/SRR346373&display=html Also on NCBI: http://www.ncbi.nlm.nih.gov/sra?term=%09SRR346373

I downloaded the files, and the first 4 lines of each look like this:

SRR346373$ head -4 S*fastq
==> SRR346373_1.fastq <==
@SRR346373.13045 0176_20090623_2_H3K4me3_28_21_20/1
T23133223302220122222232212322320332
+
!%(#$%#$%%####*%#%##&#$##$##&#&#$$,+

==> SRR346373_2.fastq <==
@SRR346373.13045 0176_20090623_2_H3K4me3_28_21_20/2
G0012130112
+
!*)&#$&'###

==> SRR346373.fastq <==
@SRR346373.1 0176_20090623_2_H3K4me3_3_25_119/1
T30200011130100000000000000000000000
+
!%/%%5)&4(%#(7&?2&'6&.,684;.6>',7A?1

It seems obvious that the *_1 and *_2 fastq files form the paired-end data. But what does SRR346373.fastq stand for? It is much smaller than the other two fastq files (about 1/20 of their line count). Does anyone know what it contains?

barcode paired solid • 3.6k views
modified 2.6 years ago by liangp640 • written 7.1 years ago by Hanfei Sun60

It looks like SRR346373 is the first read, SRR346373_1 is the second read and SRR346373_2 is the barcode. The NCBI page you link to has details associating each barcode sequence with the sample and replicate.

written 7.1 years ago by Brad Chapman9.3k

I don't think so. SRR346373_1.fastq and SRR346373_2.fastq both have 87354416 lines, while SRR346373.fastq has only 4213292 lines. It's possible that SRR346373_1.fastq is paired with SRR346373_2.fastq, but if SRR346373.fastq is the barcode file, how could it have so few lines?
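
For reference, a FASTQ record normally takes four lines, so the line counts quoted here can be turned into read counts. Below is a minimal Python sketch of that arithmetic. The function name is made up for illustration, and it assumes strictly four-line records (no wrapped sequence lines).

```python
def reads_from_lines(line_count: int) -> int:
    """Convert a FASTQ line count to a read count, assuming 4 lines per record."""
    if line_count % 4 != 0:
        raise ValueError("line count is not a multiple of 4; file may be truncated or wrapped")
    return line_count // 4


paired = reads_from_lines(87354416)  # SRR346373_1.fastq and SRR346373_2.fastq
third = reads_from_lines(4213292)    # SRR346373.fastq

# Read counts, followed by the size of the third file relative to the paired files.
print(paired, third, round(third / paired, 3))
```

This prints `21838604 1053323 0.048`, so the third file holds roughly 1/20 as many reads as each paired file, which matches the estimate in the question.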

written 7.1 years ago by Hanfei Sun60

I read the NCBI page about the barcodes and tried to split the barcode file, but if the barcode file can't be mapped to the paired-end files line by line, I don't think that makes sense.

written 7.1 years ago by Hanfei Sun60
Jonathan Manning620 wrote:

I'd guess it is a file of the remaining unpaired reads.

The _1 and _2 files should have the same sequence IDs in the same order. The third file contains reads for which paired sequence was not generated and may contain reads labeled either /1 or /2.
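
To illustrate the three-file layout described above, here is a rough Python sketch. It is not how any particular tool is implemented. `split_three_ways` and the `(name, reads)` tuples are hypothetical names chosen for the example, and each spot is assumed to carry either one or two reads. The sequences are the ones shown in the question.

```python
def split_three_ways(spots):
    """Partition spots into paired mates and leftover single reads.

    `spots` is a hypothetical iterable of (name, reads) tuples, where `reads`
    is a list holding one or two sequences for that spot.
    """
    mates_1, mates_2, singles = [], [], []
    for name, reads in spots:
        if len(reads) == 2:
            # Appending to both lists together keeps them the same length and order.
            mates_1.append((name, reads[0]))
            mates_2.append((name, reads[1]))
        else:
            # A spot with only one read goes to the third, unpaired collection.
            singles.append((name, reads[0]))
    return mates_1, mates_2, singles


spots = [
    ("SRR346373.1", ["T30200011130100000000000000000000000"]),
    ("SRR346373.13045", ["T23133223302220122222232212322320332", "G0012130112"]),
]
mates_1, mates_2, singles = split_three_ways(spots)
print(len(mates_1), len(mates_2), len(singles))
```

Under this layout the first two collections always have equal length, while the third can be any size, which would explain a much smaller SRR346373.fastq.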

Structuring the data this way saves an uneven traversal of the two files: you can always assume that the 200th read in the _1 file corresponds to the 200th read in the _2 file.

Being AB_SOLiD data, the _1 file is the Forward [F3] read (T prefix), the _2 file is the Reverse [R3] read (G prefix).
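
If you want to check the same-IDs-in-the-same-order assumption rather than trust it, something like the following Python sketch could be used. The helper names `read_names` and `first_mismatch` are made up for this example, and it assumes plain four-line FASTQ records whose headers differ only by a trailing /1 or /2. The two records from the question serve as in-memory stand-ins for the real files.

```python
import io
from itertools import zip_longest


def read_names(handle):
    """Yield the header of each 4-line FASTQ record without a trailing /1 or /2."""
    for index, line in enumerate(handle):
        if index % 4 == 0:  # header line of each record
            name = line.rstrip("\n")
            if name.endswith(("/1", "/2")):
                name = name[:-2]
            yield name


def first_mismatch(handle_1, handle_2):
    """Return the 1-based index of the first record whose names differ, or None."""
    pairs = zip_longest(read_names(handle_1), read_names(handle_2))
    for number, (name_1, name_2) in enumerate(pairs, start=1):
        if name_1 != name_2:  # also triggers when one file runs out first
            return number
    return None


# The two records shown in the question, used here as in-memory stand-ins.
fastq_1 = io.StringIO(
    "@SRR346373.13045 0176_20090623_2_H3K4me3_28_21_20/1\n"
    "T23133223302220122222232212322320332\n"
    "+\n"
    "!%(#$%#$%%####*%#%##&#$##$##&#&#$$,+\n"
)
fastq_2 = io.StringIO(
    "@SRR346373.13045 0176_20090623_2_H3K4me3_28_21_20/2\n"
    "G0012130112\n"
    "+\n"
    "!*)&#$&'###\n"
)
print(first_mismatch(fastq_1, fastq_2))
```

For the real data, the two `io.StringIO` objects would be replaced by `open("SRR346373_1.fastq")` and `open("SRR346373_2.fastq")`. A result of `None` means every record lined up.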

written 7.1 years ago by Jonathan Manning620

I think that makes sense, thanks!

written 7.1 years ago by Hanfei Sun60
Ahdf-Lell-Kocks1.6k wrote:

The third file is the barcode, the other two are the paired end reads.

written 7.1 years ago by Ahdf-Lell-Kocks1.6k
liangp640 wrote:

Hi all,

Sorry to bring back this old thread, but I noticed something new that is relevant to it. In the past, when I used wget and a local fastq-dump, I usually got only _1.fastq.gz and _2.fastq.gz, and sometimes also a third file for the single reads. However, I recently ran fastq-dump (v2.6.3) directly against the NCBI server with "--split-files --gzip sraID" (I had no choice, as the ftp url is no longer available) and got _1.fastq.gz and _3.fastq.gz (instead of _2), which seem to represent the paired-end sequences. In agreement with this, the SRA record indicates that the barcode is between the two reads. So I guess that in this case _1 and _3 are the paired-end sequences when "--split-files" is used. I haven't tried "--split-3"; perhaps it will produce _1, _2 and the third file. Below is the output of the first read from both _1 and _3.

$ zcat SRR395614_1.fastq.gz | head -n 4
@SRR395614.1 D050VACXX110915:1:1101:6706:2140 length=101
AAAGAATGGAATCATCAAATGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNTGNNNNNNNCNTNGNNNNNNNTCCNNNNNAATNATNGNATAAAATCGAA
+SRR395614.1 D050VACXX110915:1:1101:6706:2140 length=101
<<

I also noticed that this is much slower than wget, so I will try converting fastq to fastq.gz locally instead. Any comments or corrections are appreciated.

Thanks a lot. Ping

written 2.6 years ago by liangp640

If in doubt grab the fastq files from ENA directly.

written 2.6 years ago by genomax63k