Question

Concatenate/merge fastq files

4.2 years ago

jmirobla • 0

I would like to merge (concatenate) the reads from two fastq files to get the following output:

file 1

@J00148:193:HFG7MBBXY:2:1101:2372:1279 2:N:0:GAAGGAAG+GACGTCAT
AANCTTCCC
+
-<#FFFFFA

file 2

@J00148:193:HFG7MBBXY:2:1101:2372:1279 1:N:0:GAAGGAAG+GACGTCAT
GTTGGGAAAGAATAGGTCTAGAATTTCTAGTTTACTACAGNTTGTTGCTATTTCGNTTNTTTTNTNANTTCGAGAC
+
AAAA7FJJJJ--FFFAJJJJJ<JJJJJJJJFJAFFFJJFJ#<-F---F-<-<JJ<#-<#AAJJ#J#A#JFFA-7A-

output

@J00148:193:HFG7MBBXY:2:1101:2372:1279 :N:0:GAAGGAAG+GACGTCAT
AANCTTCCCGTTGGGAAAGAATAGGTCTAGAATTTCTAGTTTACTACAGNTTGTTGCTATTTCGNTTNTTTTNTNANTTCGAGAC
+
-<#FFFFFAAAAA7FJJJJ--FFFAJJJJJ<JJJJJJJJFJAFFFJJFJ#<-F---F-<-<JJ<#-<#AAJJ#J#A#JFFA-7A-

Can anyone help with that?

Thanks

rna-seq next-gen alignment • 646 views

updated 4.2 years ago by ATpoint 82k • written 4.2 years ago by jmirobla • 0

Technically possible, yes. May I ask, though, why you want to do this? The reason might influence how best to do it.

4.2 years ago by ATpoint 82k

file1 is the fastq file containing the UMIs, which were separated from file2, the actual reads. I need to put them back together.

4.2 years ago by jmirobla • 0

I see, please see my answer below.

4.2 years ago by ATpoint 82k

score 0 · Answer 1 · 2020-02-26

Using only Unix tools (the process substitution requires bash):

# Linearize each file (4 lines -> 1 tab-separated line), paste them side by side, merge with awk
paste -d "\t" \
  <(paste - - - - < file1.fq) \
  <(paste - - - - < file2.fq) \
  | awk 'BEGIN {FS = "\t"; OFS = "\n"} {sub(/[1-9]:N:0/, ":N:0", $1); print $1, $2 $6, $3, $4 $8}' > merged.fq

First we linearize both files with paste - - - -, so that each read (four lines) becomes one line with four tab-separated columns. Pasting the two linearized files side by side gives an 8-column file, one read pair per line, which is easy to query with awk: columns 1-4 hold the header, sequence, + and quality from file1, and columns 5-8 the same from file2. awk prints the header from file1 with the read number removed, then the merged sequence ($2 $6), then the +, then the merged quality ($4 $8). Because the output field separator is a newline, each record is written back as four-line fastq. Setting the input field separator to a tab keeps the whitespace inside the header from splitting it into separate fields, and restricting the substitution to the first column ensures that a quality string can never be altered by accident.

$ cat merged.fq
@J00148:193:HFG7MBBXY:2:1101:2372:1279 :N:0:GAAGGAAG+GACGTCAT
AANCTTCCCGTTGGGAAAGAATAGGTCTAGAATTTCTAGTTTACTACAGNTTGTTGCTATTTCGNTTNTTTTNTNANTTCGAGAC
+
-<#FFFFFAAAAA7FJJJJ--FFFAJJJJJ<JJJJJJJJFJAFFFJJFJ#<-F---F-<-<JJ<#-<#AAJJ#J#A#JFFA-7A-
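
Note that a paste-based merge pairs reads purely by position, so if the two files ever get out of sync the UMIs end up on the wrong reads without any error. As a rough sketch of a check one could run beforehand: the function below compares the read IDs of both files line by line. It assumes the read ID is everything in the header before the first space, and check_read_order is just a hypothetical helper name, not part of any existing tool.

```shell
# Sketch: verify that two fastq files list the same read IDs in the same order.
# check_read_order is a hypothetical helper name (assumption, not an existing tool).
# Assumes 4-line fastq records and that the read ID ends at the first space.
check_read_order() {
  # paste joins line i of file 1 with line i of file 2, separated by a tab
  paste "$1" "$2" | awk -F '\t' '
    NR % 4 == 1 {                 # header lines are lines 1, 5, 9, ...
      split($1, a, " ")           # a[1] = read ID from file 1
      split($2, b, " ")           # b[1] = read ID from file 2
      if (a[1] != b[1]) {
        printf "mismatch at read %d: %s vs %s\n", (NR + 3) / 4, a[1], b[1]
        bad = 1
      }
    }
    END { exit bad }              # exit status 1 if any pair disagreed
  '
}
```

For example, check_read_order file1.fq file2.fq && echo "read order matches" prints the message only when every read pair has the same ID. A file that ends early also shows up as a mismatch, because its header column is empty.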