TruSeq sequencing output
3.2 years ago

Good morning good people of Biostars

Recently I started working with an in-house, unpublished RNA-seq dataset, with the objective of assembling a new transcriptome for my species. One of the experiments I have was sequenced on an Illumina HiSeq 1500 instrument, with the library prepared using the TruSeq protocol. All the samples sequenced using this method have 3 output sequencing files (with the suffix R1, R2, R3). All the files have the same number of reads, with R2 containing very short reads of 8 bp, while the R1 and R3 reads are 125 bp.

Previous work in my lab has used the R1 and R3 files after processing, but I'm curious what the R2 file is about. My hypothesis is that it may be related to the demultiplexing process. I have already searched the internet and asked my supervisors, but since this data is a bit old no one remembers.

Does anyone here have experience with this type of data and any idea what this is about?

RNA-Seq True-seq2 • 1.0k views
3.2 years ago
GenoMax 141k

All the samples sequenced using this method have 3 output sequencing files (with the suffix R1, R2, R3).

Your sequencer was set up to produce a separate file for the index sequences (this is not the standard protocol). You are correct that R2 holds the Illumina index sequence for each sample. If you have access to the original data folder, you should be able to reprocess the data to generate just 2 files per sample (in the normal Illumina format, with the index sequences in the FASTQ read headers).
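If you want to confirm this on your own files first, a quick look at the read counts, read lengths and most frequent sequences is usually enough. The following is only a minimal sketch: the function names `fastq_summary` and `top_index_seqs` and the file names in the comments are made up for illustration, and it assumes plain four-line FASTQ records in gzipped files.

```shell
# Hypothetical helper: print the number of reads in a gzipped FASTQ file,
# followed by one "length <bp> <count>" line per observed read length.
# Assumes four lines per record (no wrapped sequences).
fastq_summary() {
    gzip -dc "$1" | awk 'NR % 4 == 2 { n++; len[length($0)]++ }
        END { print "reads", n + 0; for (l in len) print "length", l, len[l] }'
}

# Hypothetical helper: list the most frequent sequences in a FASTQ file.
# The second argument is how many to show (default 5).
top_index_seqs() {
    gzip -dc "$1" | awk 'NR % 4 == 2' | sort | uniq -c | sort -k1,1nr | head -n "${2:-5}"
}

# Example calls (file names are placeholders):
#   fastq_summary  sample_R2.fq.gz
#   top_index_seqs sample_R2.fq.gz 3
```

For an already demultiplexed sample, I would expect one 8 bp sequence (plus a few near-identical variants from sequencing errors) to dominate the `top_index_seqs` output for R2.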

If you don't have the original data folder to reprocess, you can use the solution posted in post #5 in this thread over at SeqAnswers. You will need to run it once for the R1 file and once for the R3 file, each time paired with the index file (R2), and name the output from R3 as the new R2:

paste -d '~' <(zcat R1.fq.gz) <(zcat R2.fq.gz) | perl -F'~' -lane 'push(@buffer, $F[0]); if($line == 1){$buffer[0] .= $F[1]}; if(($line == 3) && @buffer){print join("\n",@buffer); @buffer = ()}; $line = ($line+1) % 4;' | gzip - > WithBarcode_R1.fq.gz

and

paste -d '~' <(zcat R3.fq.gz) <(zcat R2.fq.gz) | perl -F'~' -lane 'push(@buffer, $F[0]); if($line == 1){$buffer[0] .= $F[1]}; if(($line == 3) && @buffer){print join("\n",@buffer); @buffer = ()}; $line = ($line+1) % 4;' | gzip - > WithBarcode_R2.fq.gz
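
After running the commands above it may be worth checking that every header in the new files really ends with the matching index sequence. This is a rough sketch, not part of the SeqAnswers solution: `check_barcodes` is a hypothetical helper, and it assumes the index was appended directly to the end of each header line (as the one-liners do) and that headers contain no tab characters.

```shell
# Hypothetical check: compare each header of the merged FASTQ with the
# sequence at the same position in the index FASTQ.
# Usage: check_barcodes WithBarcode_R1.fq.gz R2.fq.gz
check_barcodes() {
    merged=$1
    index=$2
    tmp=$(mktemp) || return 2
    # Index sequences, one per line, in file order.
    gzip -dc "$index" | awk 'NR % 4 == 2' > "$tmp"
    # Put each header next to its index sequence and compare the header tail.
    gzip -dc "$merged" | awk 'NR % 4 == 1' | paste - "$tmp" |
        awk -F'\t' '{ n++
                      if ($2 == "" || substr($1, length($1) - length($2) + 1) != $2) bad++ }
            END { print n + 0, "headers checked,", bad + 0, "without matching index"
                  exit (bad > 0) }'
    status=$?
    rm -f "$tmp"
    return $status
}
```

The function returns a non-zero status if any header does not end with its index sequence, so it can be used in an `if` or followed by `|| echo "problem"`.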

Thanks for your answer! Since my reads were already divided into several samples I think I can simply ignore the index file for now.

cheers


If any downstream software objects to the R3 nomenclature, or if the Illumina index needs to be in the FASTQ headers, you can use the code above to fix your files.
