Synchronization of fastq files
8.5 years ago

I downloaded paired-end Illumina reads from the NCBI SRA and ran fastq-dump --split-3 (the legacy 3-file split) to extract the corresponding FASTQ files.

I ended up with three files: file_1.fastq.gz, file_2.fastq.gz and a third, file.fastq.gz. The third one corresponds to 492919 reads whose readlen < 1.

Sizes of these fastq.gz files are huge. A simple counting of lines takes too long to be accomplished. A test that extracts the read names and compares their order between the two files would take even longer.

So I would rather ask here about previous experiences.

  1. Should I understand that name_1.fastq and name_2.fastq are synchronized files, that is, are the left and right reads in the same order? I ask this because the size difference between the two files (the _1 and the _2) is notable.
  2. Is there any script that will allow me to synchronize these two files in case I need to?
Assembly velvet • 3.6k views

Answering my own question:

Both files, file_1.fastq.gz and file_2.fastq.gz, at least have the same number of lines.

8.5 years ago

I've never seen an SRA file where the reads were out of sync, though I suppose it could in theory happen. There's a convenient tool in BBTools (repair.sh, I think) to resync things should you ever need to do so. Note that I wouldn't bother checking the results of fastq-dump unless you get obviously weird results from mapping/assembly.
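If you do want to check the order yourself, a minimal sketch in Python follows. It assumes plain four-line FASTQ records and that the read name is the first whitespace-separated token of each header line (optionally ending in /1 or /2). The function names read_names, first_mismatch and check_gzipped_pair are made up for this illustration and do not come from any existing tool.

```python
import gzip
from itertools import zip_longest


def read_names(lines):
    """Yield one read name per 4-line FASTQ record from an iterable of text lines."""
    for line_number, line in enumerate(lines):
        fields = line[1:].split()  # drop the leading '@', split on whitespace
        if line_number % 4 != 0 or not fields:
            continue  # only the first line of each record carries the name
        name = fields[0]
        if name.endswith(("/1", "/2")):  # assumed mate suffixes, stripped if present
            name = name[:-2]
        yield name


def first_mismatch(lines_1, lines_2):
    """Return the 1-based record index where the two inputs disagree, else None."""
    pairs = zip_longest(read_names(lines_1), read_names(lines_2))
    for index, (name_1, name_2) in enumerate(pairs, start=1):
        if name_1 != name_2:  # also triggers when one input ends early
            return index
    return None


def check_gzipped_pair(path_1, path_2):
    """Open two gzipped FASTQ files and report the first out-of-sync record."""
    with gzip.open(path_1, "rt") as handle_1, gzip.open(path_2, "rt") as handle_2:
        return first_mismatch(handle_1, handle_2)
```

Calling check_gzipped_pair("file_1.fastq.gz", "file_2.fastq.gz") streams both files once and stops at the first disagreement, so it never holds more than one line of each file in memory. It only detects a problem; it does not re-pair the reads.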

8.5 years ago
piet ★ 1.8k

I ask this because the size difference between the two files (the _1 and the _2) is notable

The difference in size usually results from gzip compression. If all residues in a read have exactly the same quality, gzip compresses more efficiently than if the quality values are spread over a large range. You had better compare the sizes of the uncompressed files.
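The effect is easy to reproduce on synthetic data. The sketch below is only an illustration: the quality strings are invented (one uniform, one drawn at random from a handful of Phred+33 characters) and gzipped_size is a hypothetical helper name. It compresses two strings of equal length and prints the resulting sizes.

```python
import gzip
import random


def gzipped_size(text):
    """Return the number of bytes after gzip-compressing an ASCII string."""
    return len(gzip.compress(text.encode("ascii")))


random.seed(0)  # fixed seed so the illustration is repeatable
n_bases = 100_000

# Every base has the same quality character.
uniform_quality = "I" * n_bases
# Invented spread of quality characters, chosen only for this demonstration.
varied_quality = "".join(random.choice("#+5?DI") for _ in range(n_bases))

print("uniform:", gzipped_size(uniform_quality), "bytes")
print("varied: ", gzipped_size(varied_quality), "bytes")
```

Both inputs are the same size before compression, so any difference in the printed numbers comes from gzip alone.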

Sizes of these fastq.gz files are huge. A simple counting of lines takes too long to be accomplished.

You can use the wc command to count the number of lines of your fastq files:

zcat file_1.fastq.gz | wc
zcat file_2.fastq.gz | wc

The fastq-dump program emits a variant of the FASTQ format in which four lines make up each read, so the number of reads is the line count divided by four.
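The same counts can be obtained without zcat. This is a rough Python sketch, assuming four lines per read and ASCII content; fastq_counts, read_count and count_gzipped_fastq are names invented here for illustration.

```python
import gzip


def fastq_counts(lines):
    """Return (lines, words, characters) for an iterable of text lines, similar to wc."""
    n_lines = n_words = n_chars = 0
    for line in lines:
        n_lines += 1
        n_words += len(line.split())
        n_chars += len(line)  # includes the trailing newline
    return n_lines, n_words, n_chars


def read_count(n_lines):
    """Convert a line count to a read count, assuming four lines per read."""
    if n_lines % 4 != 0:
        raise ValueError("line count is not a multiple of 4; file may be truncated")
    return n_lines // 4


def count_gzipped_fastq(path):
    """Count lines, words and characters in a gzipped FASTQ file."""
    with gzip.open(path, "rt") as handle:
        return fastq_counts(handle)
```

For example, read_count(count_gzipped_fastq("file_1.fastq.gz")[0]) gives the number of reads in the first file; running it on both files and comparing the results is a quick first sanity check before looking at read names.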


I think you'll want wc -l to count lines.


No, I recommend looking at the number of lines, the number of words, and the number of characters: all three numbers at once.
