Question: Tool to append Illumina Read 1 to Read 2 for downstream demultiplexing
0
gravatar for achamess
2.5 years ago by
achamess50
United States
achamess50 wrote:

Hi. I have paired end reads (75 bp) from a Nextseq run. Read 1 is 18 bp and Read 2 is 52 bp. R2 is where all the cDNA information is, and R1 is UMI (8) and Barcode (10bp). I've already run umi_tools to detect the UMIs and append those to read file info (https://github.com/CGATOxford/UMI-tools). So R1 now just has the 10 bp barcode left. I think the most efficient way is to take that 10 bp barcode and join it directly to the R2 read on the 5' end. Then I can take the joined R2+R1 fastqs through a program likereaper to demultiplex based on barcodes.

So my question is, what tool can I use for a simple join like that? I've seen tools like ea-utils that seem to do a join and merge based on overlapping bps between R1 and R2, but that's not what I want. I just want to add R1 to R2, no merge or overlap. Any guidance would be great.

sequencing rna-seq • 1.5k views
ADD COMMENTlink modified 2.5 years ago • written 2.5 years ago by achamess50
1

Which of the two read headers would you want to keep?

ADD REPLYlink written 2.5 years ago by genomax71k

Hi. Thanks for the reply. R2 header.

ADD REPLYlink written 2.5 years ago by achamess50
1

I am sure there is a paste solution that will work. @Pierre, the master of these one liners should be by soon.

ADD REPLYlink written 2.5 years ago by genomax71k
1

Here is a bad solution until someone improves on it (you will have to uncompress the files, replace YOUR_SEQUENCER_ID)

join <(nl cat R1.fastq) <(nl cat R2.fastq) | awk -F ' ' '{if ($2 ~ /^@YOUR_SEQUENCER_ID/ ) {print $4" "$5;} else {print $2$3;}}' > new.fastq
ADD REPLYlink modified 2.5 years ago • written 2.5 years ago by genomax71k

Thanks. I'll give it a try. But can I use .fastq.gz? Sorry, I'm not a Unix pro, so this is my attempt:

join <(nl gzip -cd |cat AC1-10_S34_R1_001.fastq.gz) <(nl gzip -cd |cat AC1-10_S34_R2_001.fastq.gz) | awk -F ' ' '{if (@machineID ) {print $4" "$5;} else {print $2$3;}}' > new.fastq

ADD REPLYlink modified 2.5 years ago • written 2.5 years ago by achamess50

Sorry I meant to say sequencer ID (not your machine ID in the sense of computer). It should be the @STRING at beginning of all read headers. Changed.

ADD REPLYlink modified 2.5 years ago • written 2.5 years ago by genomax71k

It worked! Very nice. I need to spend more time learning the power of awk. Here is my update for for .gz input and output. I will probably now make some kind of loop to process all my files. Thank you!

ADD REPLYlink modified 2.5 years ago • written 2.5 years ago by achamess50
3
gravatar for achamess
2.5 years ago by
achamess50
United States
achamess50 wrote:

Moving your answer here (with my modifications).

join <(zcat AC1-10_S34_R1_001.fastq.gz | nl ) <(zcat AC1-10_S34_R2_001.fastq.gz|nl) | awk -F ' ' '{if ($2 ~ /^@NB501800/ ) {print $4" "$5;} else {print $2$3;}}' | gzip > new.fastq.gz

Here is the output:

@NB501800:5:HKYVYAFXX:1:11111:19375:5952 2:N:0:TCCGGCTTAT+GGACTCATTG
GTGCGGATGATCTTACGCTTGTAGGCCAGCCTGGGTGGATATATATTGTGTTCCAAGCCAACTTGGTCTA
++
AAAAAEEEEEEEEEEEEEAAAAAE/EEE/<EEE/<E/EEEAEEEEEA/EE<EEEEE/EEEEEEA/AEEEE
ADD COMMENTlink modified 2.5 years ago • written 2.5 years ago by achamess50

Perfect. Go ahead and accept this answer (green check mark) to provide closure to the thread.

ADD REPLYlink written 2.5 years ago by genomax71k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1484 users visited in the last hour