How to sort an interleaved fastq file by the barcode (BX:Z:) from the header
6.1 years ago
guillepalou4 ▴ 20

Hello guys!

I interleaved R1.fastq and R2.fastq into a single file, interleaved.fastq, so that for each read pair the R1 read comes immediately before the R2 read, followed by the R1 read of the next pair, and so on. Each header also carries a barcode tag (for example BX:Z:ACTGTCAATGTCAACT-1). The file looks like this:

@HX6_24184:8:2115:12337:28031 BX:Z:CGAGCACCATCGGTTA-1
TTCATTTTTATCGTTTTCCGTTCCTGTTGTTCAAAGCATCTTTATCTTCCGCACAGCCTCTTTTTAAGCCTATGATATAAGGGTGCGGTAAATTTACTCTCTGCAAGCCTTTCCCTTAGCGGCTGAAGACTGACAAGTCTGTACAGATCAT
+
AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFFFJ-JJFJJJJJJJJJJAFJJJJJAFJJJJJJJJJJF77

@HX6_24184:8:2115:12337:28031 BX:Z:CGAGCACCATCGGTTA-1
AGGTTTTTTGGGCGTGAACAGGTAATAGTCGTTGTCCTTTTCTTGTTTAAAAATTTCTTTAAGAAAAGTTCTGCTATAATTTCCCAAACCTGTCTTGTTAAAGAAGGTACGTTTGGCTTCATATCCA
+
AFFFFJFJJFJFFJFJFJF-FF<--<JJFJJJJJJ-7FFFJ<JJFJJF-FJJFFFJAFJJJJJJF-AJJJFAJ-7<A-F<FF7FAFJ-A-77<7FFFJJ<<-777<F7--A7<FF7<A<-<-AAF-A

@HX6_24184:8:2109:23196:7462 BX:Z:ACACTGAAGAGACGAA-1
AGTTTTTTTATCGGTAGATAAAAAAACTTCACTCAACGATGCGTTGCGCACACATAATGTGGCGGTTTAGAACTTATTGCGCTTTTTATGAGTCAACTTTCCGGTTATAAAATTGGATATGAAGCCAAACGTACCTTCTTTAACAAGAC
+
AAFFFJJJJJJFJJFJJAJFJ<JFFJFFJJAFFJJFJJJFJJJJFJ<FJJJFAFJFJJFFFJJJJ-AAFFJJFFJFFAFFJ<FJJAFJJJJJJ7AFJFFJJ<FAAJ7JFJJFAFAJ<FJJJF<FAAFJJJ<JJJF-AAJJJJA<7-<F<

@HX6_24184:8:2109:23196:7462 BX:Z:ACACTGAAGAGACGAA-1
TTGTTTTTTTGTCGGAGTTACTACTATTGCAAAAATAGCAGATAGTGCCTATAGATATACAAATAATAGTAATTCAAGATATGGGTTTGTTGACATAATATTACAACTTGTATCACAGACAAAAGATT
+
FJJJJJJJJJJJJJJJJFFJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJFJJAJFJJJJJJJJJJJJJJJJJJJJJJJJJJJ

I would like to sort interleaved.fastq by the barcode (BX:Z:...) while keeping the interleaved format (forward read, then reverse read). In other words, I want an interleaved_barcode_sorted.fastq file!

I tried this:

cat interleaved.fastq | paste - - - - | sort -k2,2 -t " " | tr "\t" "\n"  > interleaved_barcode_sorted.fastq

It works partially: the records are sorted correctly by barcode, but the reads no longer alternate forward-reverse, so the output is no longer an interleaved FASTQ file.

Any ideas?

Thanks!

6.1 years ago

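Use paste to linearize each read pair onto one row (8 lines, not only 4), then use awk to create a new key column holding the barcode. Sort on that key, drop it, and turn the tabs back into newlines: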

cat input.fastq | paste - - - - - - - - | awk -F ' ' '{printf("%s\t%s\n",$2,$0);}' | sort -t $'\t' -k1,1  | cut -f 2- | tr "\t" "\n"
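
For reuse, the same pipeline could be wrapped in a small shell function. This is only a sketch: the function name sort_interleaved_by_barcode is made up for illustration, and it assumes exactly 4 lines per read with no blank lines between records, R1 immediately followed by its R2, and the BX:Z: tag as the second whitespace-separated field of every R1 header. It also assumes a sort that supports -s (stable), as GNU and BSD sort do, so that pairs sharing a barcode keep their input order.

```shell
# Hypothetical helper: sort an interleaved FASTQ (stdin) by barcode, write to stdout.
# Assumptions: 4 lines per read, no blank lines, R1 directly followed by R2,
# barcode (BX:Z:...) is field 2 of the R1 header.
sort_interleaved_by_barcode() {
    paste - - - - - - - - |                          # one read pair (8 lines) per row
        awk '{ printf("%s\t%s\n", $2, $0) }' |       # prepend the barcode as a key column
        LC_ALL=C sort -s -t "$(printf '\t')" -k1,1 | # stable sort on the key column only
        cut -f 2- |                                  # drop the key column
        tr '\t' '\n'                                 # back to 8 lines per pair
}
```

It would be called as: sort_interleaved_by_barcode < interleaved.fastq > interleaved_barcode_sorted.fastq. Because each row holds a whole read pair, the sort can reorder pairs but can never separate R1 from R2.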

Amazing, now it works! Thank you so much! :)
