Question: Synchronization of fastq files
2.9 years ago, Antonio R. Franco (Universidad de Córdoba, Spain) wrote:

I downloaded paired-end Illumina reads from the NCBI SRA and ran fastq-dump --split-3 to get a legacy extraction of the corresponding fastq files.

I ended up with three files: file_1.fastq.gz, file_2.fastq.gz and a third one, file.fastq.gz. The third one corresponds to 492919 reads whose readlen < 1.

Sizes of these fastq.gz files are huge. A simple counting of lines takes too long to be accomplished. A test to extract and compare the order of the read names and coordinates would take even longer.

So I would rather ask here about previous experiences.

1. Should I understand that name_1.fastq and name_2.fastq are synchronized files, that is, are the left and right reads in the same order? I ask this because the size difference between the two files (the _1 and the _2) is notable.

2. Is there any script that will allow me to synchronize these two files in case I need to?

 

velvet assembly • 1.8k views
modified 2.9 years ago by Devon Ryan • written 2.9 years ago by Antonio R. Franco

To answer my own question:

Both files, file_1.fastq.gz and file_2.fastq.gz, have at least the same number of lines.

written 2.9 years ago by Antonio R. Franco
2.9 years ago, Devon Ryan (Freiburg, Germany) wrote:

I've never seen an SRA file where the reads were out of sync, though I suppose it could in theory happen. There's a convenient tool from BBTools (reformat.sh, I think) to resync things should you ever need to do so. Note that I wouldn't bother checking the results of fastq-dump unless you get obviously weird results from mapping/assembly.
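
Should a check ever be needed, the following is a minimal sketch of one way to do it with standard tools. It is not part of BBTools, and the function names read_names and in_sync are made up for illustration. It assumes four lines per read and that the read name is the first whitespace-separated field of each header line, optionally followed by /1 or /2; other naming schemes would need a different pattern.

```shell
# read_names FILE.fastq.gz
# Prints one read name per record: the first field of each header line
# (lines 1, 5, 9, ...), without the leading "@" and without a trailing /1 or /2.
# Assumption: exactly four lines per read.
read_names() {
    gzip -dc "$1" | awk 'NR % 4 == 1 { name = substr($1, 2); sub(/\/[12]$/, "", name); print name }'
}

# in_sync FILE_1 FILE_2
# Exit status 0 only if both files list the same read names in the same order.
# The names of the first file are written to a temporary file, which is
# then compared byte by byte against the names streamed from the second file.
in_sync() {
    tmp=$(mktemp) || return 2
    read_names "$1" > "$tmp"
    read_names "$2" | cmp -s - "$tmp"
    status=$?
    rm -f "$tmp"
    return "$status"
}

# Example (file names taken from the question):
# in_sync file_1.fastq.gz file_2.fastq.gz && echo "in sync" || echo "NOT in sync"
```

Both files are still decompressed once, so on very large files this takes about as long as counting lines.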

modified 2.9 years ago • written 2.9 years ago by Devon Ryan

2.9 years ago, piet (planet earth) wrote:
> I ask this because the size difference between the two files (the _1 and the _2) is notable

The difference in size usually results from gzip. If all residues in a read have exactly the same quality, compression by gzip is more efficient than if the quality values are spread over a large range. It is better to compare the sizes of the unzipped files.
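
As a small sketch of that comparison, the uncompressed size can be measured without writing the unzipped file to disk by streaming it through wc -c. The function name uncompressed_bytes is made up for illustration. Streaming is used instead of reading the size stored in the gzip trailer because that field is only 32 bits wide and wraps around for data of 4 GiB or more.

```shell
# uncompressed_bytes FILE.gz
# Prints the number of bytes in the decompressed data.
# gzip -dc writes the decompressed stream to stdout (same as zcat);
# tr removes the padding some wc implementations put before the number.
uncompressed_bytes() {
    gzip -dc "$1" | wc -c | tr -d ' '
}

# Example (file names taken from the question):
# uncompressed_bytes file_1.fastq.gz
# uncompressed_bytes file_2.fastq.gz
```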

> Sizes of these fastq.gz files are huge. A simple counting of lines takes too long to be accomplished.

You can use the 'wc' command to count the number of lines of your fastq files:

zcat file_1.fastq.gz | wc
zcat file_2.fastq.gz | wc

The 'fastq-dump' program emits a variant of FASTQ formatted files, where four lines make up a read.
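
Given four lines per read, the number of reads is the line count divided by four. A minimal sketch, with the made-up function name count_reads, assuming the file really has exactly four lines per read:

```shell
# count_reads FILE.fastq.gz
# Prints the number of reads, assuming exactly four lines per read.
# gzip -dc is equivalent to zcat.
count_reads() {
    echo $(( $(gzip -dc "$1" | wc -l) / 4 ))
}

# Example (file names taken from the question):
# count_reads file_1.fastq.gz
# count_reads file_2.fastq.gz
```

If the two counts differ, the files cannot be in sync; equal counts are necessary but do not prove that the reads are in the same order.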

modified 2.9 years ago • written 2.9 years ago by piet

I think you'll want wc -l to count lines.

written 2.9 years ago by h.mon

No, I recommend looking at the number of lines, the number of words, and the number of characters: all three numbers at once.

written 2.9 years ago by piet