Tool to append Illumina Read 1 to Read 2 for downstream demultiplexing
7.0 years ago
achamess ▴ 90

Hi. I have paired-end reads (75 bp) from a NextSeq run. Read 1 is 18 bp and Read 2 is 52 bp. R2 holds all the cDNA information, and R1 holds the UMI (8 bp) and the barcode (10 bp). I've already run umi_tools (https://github.com/CGATOxford/UMI-tools) to detect the UMIs and append them to the read info, so R1 now has only the 10 bp barcode left. I think the most efficient approach is to take that 10 bp barcode and join it directly to the 5' end of the R2 read. Then I can take the joined R2+R1 FASTQs through a program like reaper to demultiplex based on barcodes.

So my question is, what tool can I use for a simple join like that? I've seen tools like ea-utils that seem to do a join and merge based on overlapping bps between R1 and R2, but that's not what I want. I just want to add R1 to R2, no merge or overlap. Any guidance would be great.

sequencing RNA-Seq • 3.8k views

Which of the two read headers would you want to keep?


Hi. Thanks for the reply. R2 header.


I am sure there is a paste solution that will work. @Pierre, the master of these one-liners, should be by soon.


Here is a bad solution until someone improves on it (you will have to uncompress the files and replace YOUR_SEQUENCER_ID):

join <(nl R1.fastq) <(nl R2.fastq) | awk -F ' ' '{if ($2 ~ /^@YOUR_SEQUENCER_ID/ ) {print $4" "$5;} else {print $2$3;}}' > new.fastq

Thanks. I'll give it a try. But can I use .fastq.gz? Sorry, I'm not a Unix pro, so this is my attempt:

join <(nl gzip -cd |cat AC1-10_S34_R1_001.fastq.gz) <(nl gzip -cd |cat AC1-10_S34_R2_001.fastq.gz) | awk -F ' ' '{if (@machineID ) {print $4" "$5;} else {print $2$3;}}' > new.fastq


Sorry, I meant the sequencer ID (not a machine ID in the sense of your computer). It is the @STRING at the beginning of every read header. Changed.
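If it helps, here is a minimal sketch for looking up that @STRING without opening the file by hand. It assumes the usual colon-separated Illumina header layout, where the instrument name is everything before the first colon; `sequencer_id` is just an illustrative name, not an existing tool.

```shell
# Hypothetical helper (not part of any tool mentioned in this thread):
# print the leading "@..." token of the first header line in a gzipped FASTQ,
# i.e. everything before the first colon.
sequencer_id() {
    gzip -cd "$1" | head -n 1 | cut -d: -f1
}
```

For a file whose first header starts with @NB501800:5:HKYVYAFXX, calling `sequencer_id AC1-10_S34_R1_001.fastq.gz` would print @NB501800, which is the string to put in the awk pattern.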


It worked! Very nice. I need to spend more time learning the power of awk. Here is my update for .gz input and output. I will probably now make some kind of loop to process all my files. Thank you!
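A loop along those lines could look like the sketch below. It wraps the join/nl/awk pipeline from this thread and assumes bash (for the <( ) process substitution) plus the *_R1_001.fastq.gz / *_R2_001.fastq.gz naming seen here. `SEQ_ID`, `join_all`, and the `_joined.fastq.gz` output suffix are illustrative choices, not anything prescribed by the tools.

```shell
# Requires bash. SEQ_ID and join_all are illustrative names (assumptions).
SEQ_ID='@NB501800'   # replace with your own sequencer ID

# Join every R1/R2 pair found in the given directory.
join_all() {
    local dir=$1 r1 r2 out
    for r1 in "$dir"/*_R1_001.fastq.gz; do
        [ -e "$r1" ] || continue                      # glob matched nothing
        r2=${r1%_R1_001.fastq.gz}_R2_001.fastq.gz     # partner file name
        out=${r1%_R1_001.fastq.gz}_joined.fastq.gz    # assumed output name
        [ -e "$r2" ] || { echo "missing $r2, skipping" >&2; continue; }
        # Number the lines of both files, pair them by line number, then
        # keep the R2 header and concatenate R1+R2 on every other line.
        join <(gzip -cd "$r1" | nl) <(gzip -cd "$r2" | nl) |
            awk -v id="^$SEQ_ID" '{ if ($2 ~ id) print $4 " " $5; else print $2 $3 }' |
            gzip > "$out"
    done
}
```

Running `join_all .` in the directory with the FASTQ files would then write one joined file per sample next to the inputs.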

7.0 years ago
achamess ▴ 90

Moving your answer here (with my modifications).

join <(zcat AC1-10_S34_R1_001.fastq.gz | nl) <(zcat AC1-10_S34_R2_001.fastq.gz | nl) | awk -F ' ' '{if ($2 ~ /^@NB501800/ ) {print $4" "$5;} else {print $2$3;}}' | gzip > new.fastq.gz

Here is the output:

@NB501800:5:HKYVYAFXX:1:11111:19375:5952 2:N:0:TCCGGCTTAT+GGACTCATTG
GTGCGGATGATCTTACGCTTGTAGGCCAGCCTGGGTGGATATATATTGTGTTCCAAGCCAACTTGGTCTA
++
AAAAAEEEEEEEEEEEEEAAAAAE/EEE/<EEE/<E/EEEAEEEEEA/EE<EEEEE/EEEEEEA/AEEEE
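Note that the third line of the record comes out as ++ because the + separators from R1 and R2 are concatenated like the sequence lines, which some FASTQ parsers may not expect. As a possible alternative, here is a sketch of the paste approach mentioned earlier in the thread. It decides what to do from the line position inside each 4-line record, so no sequencer ID is needed and the separator stays a single +. `join_pasted_fastq` is a made-up name, and the two tiny records are invented purely for illustration.

```shell
# Hypothetical helper (not from the thread): reads tab-pasted R1/R2 lines on
# stdin and writes one FASTQ record per read pair.
join_pasted_fastq() {
    awk -F '\t' '
        NR % 4 == 1 { print $2 }       # header line: keep the R2 header
        NR % 4 == 2 { print $1 $2 }    # sequence: R1 barcode first, then R2
        NR % 4 == 3 { print "+" }      # separator: a single plus sign
        NR % 4 == 0 { print $1 $2 }    # quality: same order as the sequence
    '
}

# Tiny made-up records to show the behaviour.
demo_dir=$(mktemp -d)
printf '@read1 1:N:0:X\nACGTACGTAC\n+\nIIIIIIIIII\n' > "$demo_dir/R1.fastq"
printf '@read1 2:N:0:X\nTTTTGGGG\n+\nEEEEEEEE\n' > "$demo_dir/R2.fastq"
paste "$demo_dir/R1.fastq" "$demo_dir/R2.fastq" | join_pasted_fastq > "$demo_dir/joined.fastq"
cat "$demo_dir/joined.fastq"
```

With gzipped input under bash, the same function could be fed with paste <(zcat R1.fastq.gz) <(zcat R2.fastq.gz) | join_pasted_fastq | gzip > joined.fastq.gz. This relies on R1 and R2 listing the reads in the same order, which is the same assumption the join/nl version makes.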

Perfect. Go ahead and accept this answer (green check mark) to provide closure to the thread.
