Question: Tool to append Illumina Read 1 to Read 2 for downstream demultiplexing
0
gravatar for achamess
23 months ago by
achamess40
United States
achamess40 wrote:

Hi. I have paired end reads (75 bp) from a Nextseq run. Read 1 is 18 bp and Read 2 is 52 bp. R2 is where all the cDNA information is, and R1 is UMI (8) and Barcode (10bp). I've already run umi_tools to detect the UMIs and append those to read file info (https://github.com/CGATOxford/UMI-tools). So R1 now just has the 10 bp barcode left. I think the most efficient way is to take that 10 bp barcode and join it directly to the R2 read on the 5' end. Then I can take the joined R2+R1 fastqs through a program likereaper to demultiplex based on barcodes.

So my question is, what tool can I use for a simple join like that? I've seen tools like ea-utils that seem to do a join and merge based on overlapping bps between R1 and R2, but that's not what I want. I just want to add R1 to R2, no merge or overlap. Any guidance would be great.

sequencing rna-seq • 1.2k views
ADD COMMENTlink modified 23 months ago • written 23 months ago by achamess40
1

Which of the two read headers would you want to keep?

ADD REPLYlink written 23 months ago by genomax63k

Hi. Thanks for the reply. R2 header.

ADD REPLYlink written 23 months ago by achamess40
1

I am sure there is a paste solution that will work. @Pierre, the master of these one liners should be by soon.

ADD REPLYlink written 23 months ago by genomax63k
1

Here is a bad solution until someone improves on it (you will have to uncompress the files, replace YOUR_SEQUENCER_ID)

join <(nl cat R1.fastq) <(nl cat R2.fastq) | awk -F ' ' '{if ($2 ~ /^@YOUR_SEQUENCER_ID/ ) {print $4" "$5;} else {print $2$3;}}' > new.fastq
ADD REPLYlink modified 23 months ago • written 23 months ago by genomax63k

Thanks. I'll give it a try. But can I use .fastq.gz? Sorry, I'm not a Unix pro, so this is my attempt:

join <(nl gzip -cd |cat AC1-10_S34_R1_001.fastq.gz) <(nl gzip -cd |cat AC1-10_S34_R2_001.fastq.gz) | awk -F ' ' '{if (@machineID ) {print $4" "$5;} else {print $2$3;}}' > new.fastq

ADD REPLYlink modified 23 months ago • written 23 months ago by achamess40

Sorry I meant to say sequencer ID (not your machine ID in the sense of computer). It should be the @STRING at beginning of all read headers. Changed.

ADD REPLYlink modified 23 months ago • written 23 months ago by genomax63k

It worked! Very nice. I need to spend more time learning the power of awk. Here is my update for for .gz input and output. I will probably now make some kind of loop to process all my files. Thank you!

ADD REPLYlink modified 23 months ago • written 23 months ago by achamess40
3
gravatar for achamess
23 months ago by
achamess40
United States
achamess40 wrote:

Moving your answer here (with my modifications).

join <(zcat AC1-10_S34_R1_001.fastq.gz | nl ) <(zcat AC1-10_S34_R2_001.fastq.gz|nl) | awk -F ' ' '{if ($2 ~ /^@NB501800/ ) {print $4" "$5;} else {print $2$3;}}' | gzip > new.fastq.gz

Here is the output:

@NB501800:5:HKYVYAFXX:1:11111:19375:5952 2:N:0:TCCGGCTTAT+GGACTCATTG
GTGCGGATGATCTTACGCTTGTAGGCCAGCCTGGGTGGATATATATTGTGTTCCAAGCCAACTTGGTCTA
++
AAAAAEEEEEEEEEEEEEAAAAAE/EEE/<EEE/<E/EEEAEEEEEA/EE<EEEEE/EEEEEEA/AEEEE
ADD COMMENTlink modified 23 months ago • written 23 months ago by achamess40

Perfect. Go ahead and accept this answer (green check mark) to provide closure to the thread.

ADD REPLYlink written 23 months ago by genomax63k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1842 users visited in the last hour