Question: How to keep the raw .fastq.gz files for RNA-Seq data
6.0 years ago, shirley081890 (United States) wrote:

Hello, I have 75 bp paired-end RNA-Seq data generated on an Illumina HiSeq 2000, using a protocol in which a mixture of 7 samples was run on each of lanes 1-7 of each flowcell. Each sample has a 6 bp index associated with it. With this protocol, each sample ends up with ~50 small .fastq.gz files for the left read and ~50 small .fastq.gz files for the right read; these small files are generated automatically by the sequencer. This raises my question about how to combine and keep the raw .fastq.gz files.

I used the command "cat" to combine these ~50 small .fastq.gz files into one large .fastq.gz per read direction, like the following for sample "2894" (is this the right way?):

    cat 2894_CCTTCA_L00*_R1*.fastq.gz > 2894_R1.fastq.gz
    cat 2894_CCTTCA_L00*_R2*.fastq.gz > 2894_R2.fastq.gz
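For what it's worth, plain `cat` on the compressed files is safe here: the gzip format allows multiple members in one stream, so concatenated .fastq.gz files decompress as one valid FASTQ. A minimal sketch with toy data (all file names below are invented for the demo):

```shell
# Make two tiny gzipped FASTQ "chunks" (one read each, 4 lines per read),
# then concatenate the compressed files directly with cat.
printf '@r1\nACGT\n+\nIIII\n' | gzip > part1.fastq.gz
printf '@r2\nTTTT\n+\nFFFF\n' | gzip > part2.fastq.gz
cat part1.fastq.gz part2.fastq.gz > combined.fastq.gz
# The multi-member stream decompresses cleanly:
gzip -dc combined.fastq.gz | wc -l    # 8 lines = 2 reads x 4 lines each
```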

After this, I have two .fastq.gz files for each sample. I think these are the files I want for analysis (TopHat), and also for uploading to a public repository (SRA) when I publish my results.

However, the support staff in our sequencing core suggested that it is better to keep the original small .fastq.gz files, for two reasons: 1. they are truly raw, that is to say, generated automatically by the machine; 2. Bowtie2/TopHat2 can take these small files as input directly.
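On point 2: TopHat and Bowtie2 accept comma-separated lists of FASTQ files, so the small per-lane files could indeed be passed without merging. A dry-run sketch (echo only; the index name and file names are placeholders, not the poster's actual paths):

```shell
# Build the command as a string and just print it (dry run) -- per-lane R1
# files are comma-joined in the first mate list, R2 files in the second.
CMD="tophat genome_index 2894_L001_R1.fastq.gz,2894_L002_R1.fastq.gz 2894_L001_R2.fastq.gz,2894_L002_R2.fastq.gz"
echo "$CMD"
```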

Keep in mind that our RNA-Seq project is big, and we cannot afford to keep both all the small .fastq.gz files and the combined .fastq.gz files for each sample. So I would like to ask for your suggestions. If you could only keep one copy of the raw .fastq.gz files, which would you routinely keep for each sample:

the combined big .fastq.gz file, or the original ~50 small .fastq.gz files generated by the machine?

Many thanks, Shirley

rnaseq data

We usually combine into one big file, but I think it would depend on your infrastructure, etc.

written 6.0 years ago by Madelaine Gogol

Having the original smaller files will be helpful in troubleshooting QC problems. If a lane in the sequencer is acting strangely, all the samples run on that lane will give erratic results, and a problem like this is easier to find if you maintain the granularity of the data. Also, tools like GATK can perform base quality recalibration, but only if you supply enough information, such as which reads originated from the same lane. I prefer to keep the files in their original form. I don't think it makes much, if any, difference in storage whether you keep the files individually or merge them.
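One way to preserve that lane-level provenance is to align each lane separately and tag each alignment with its own read group (Bowtie2's `--rg-id`/`--rg` options), so downstream tools like GATK can still see which reads came from which lane. A hypothetical dry-run sketch (echo only; the sample ID, lane names, and index name are assumptions):

```shell
# Print one alignment command per lane (dry run), each with a lane-specific
# read-group ID, and collect them in a file for inspection.
SAMPLE=2894
: > rg_cmds.txt
for LANE in L001 L002; do
  echo "bowtie2 --rg-id ${SAMPLE}.${LANE} --rg SM:${SAMPLE}" \
       "-x genome_index" \
       "-1 ${SAMPLE}_${LANE}_R1.fastq.gz -2 ${SAMPLE}_${LANE}_R2.fastq.gz" \
       "-S ${SAMPLE}_${LANE}.sam" >> rg_cmds.txt
done
cat rg_cmds.txt
```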

written 6.0 years ago by Ashutosh Pandey

I definitely wouldn't combine different samples or different lanes' data into one file. I just meant the initial multiple files that the Illumina primary analysis pipeline produces...

written 6.0 years ago by Madelaine Gogol
6.0 years ago, Istvan Albert (University Park, USA) wrote:

I think you have two different things going on here.

One is that for a single sample the instrument may create multiple files if the sample was distributed over different lanes. This is very annoying to handle. In that case you should concatenate all the files that belong to the same sample into a single file.

But you should not concatenate different samples into one for convenience. That is just asking for trouble later on.

Thus, in your case of 7 samples you should end up with 14 files (one per paired-end read direction), where each file corresponds to a sample and is named after the sample.
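That per-sample scheme can be sketched with toy data as follows (only R1 is shown, and the sample IDs, index sequence, and lane names are invented for the demo; adjust the glob to your actual naming):

```shell
# Create 2 fake samples x 2 lanes of one-read FASTQ chunks, then merge all
# lanes of each sample into a single file named by the sample.
for S in 2894 2895; do
  for L in L001 L002; do
    printf '@%s_%s\nACGT\n+\nIIII\n' "$S" "$L" | gzip > "${S}_CCTTCA_${L}_R1.fastq.gz"
  done
  # glob matches only this sample's per-lane files
  cat "${S}"_CCTTCA_L00?_R1.fastq.gz > "${S}_R1.fastq.gz"
done
gzip -dc 2894_R1.fastq.gz | wc -l   # 8 = 2 lanes x 1 read x 4 lines
```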



Powered by Biostar version 2.3.0