How To Sort Two Mate Pair (Fastq) Files So That The Order Of The Identifiers Is The Same?
10.0 years ago
Steffi ▴ 570

If a read is present in only one of the two files, I would like to remove it from that file and store its ID somewhere.

fastq paired sort
10.0 years ago

Linearize your two FASTQ files with awk, add a first column holding a common "key" (here the read name before the "/"), and sort on that key:

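# 1st awk: one tab-separated line per record; 2nd awk: prepend the key (name before "/")
# $'\t' is bash syntax for a literal tab
gunzip -c file1.fastq.gz |\
awk '{ printf("%s",$0); n++; if(n%4==0) { printf("\n");} else { printf("\t");} }' |\
awk '{i=index($1,"/"); printf("%s\t%s\n",substr($1,1,i-1),$0);}' |\
sort -t $'\t' -k1,1 > sorted1.txt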

#same for  file2.fastq.gz
(..) > sorted2.txt


Join both files on the key with the Unix join command (the separator is a tab):

join -t $'\t' -1 1 -2 1  sorted1.txt  sorted2.txt


and recreate the two fastq files with cut and awk.
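
Note that join only prints keys found in both inputs, so a read present in just one file drops out of the joined result. As a minimal sketch (the miniature inputs and the output names only_in_1.txt, only_in_2.txt and joined.txt are invented for illustration, and the files are assumed to be tab-separated with the key in column 1), join's -v option could be used to store the IDs of such unpaired reads:

```shell
# Work in a scratch directory with two tiny, already sorted example files
# (hypothetical data: key, header, sequence, "+", quality, separated by tabs).
cd "$(mktemp -d)" || exit 1
printf '@r1\t@r1/1\tACGT\t+\tIIII\n@r2\t@r2/1\tGGCC\t+\tIIII\n' > sorted1.txt
printf '@r2\t@r2/2\tTTAA\t+\tIIII\n@r3\t@r3/2\tCCGG\t+\tIIII\n' > sorted2.txt

tab=$(printf '\t')   # a literal tab that works in any POSIX shell

# -v N prints the lines of file N whose key has no partner in the other file;
# cut keeps only the key column, i.e. the read ID.
join -t "$tab" -v 1 sorted1.txt sorted2.txt | cut -f 1 > only_in_1.txt
join -t "$tab" -v 2 sorted1.txt sorted2.txt | cut -f 1 > only_in_2.txt

# The plain join keeps the properly paired reads only.
join -t "$tab" sorted1.txt sorted2.txt > joined.txt
```

sort and join should run under the same locale (for example with LC_ALL=C set for both), otherwise join may complain that its input is not sorted.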


Thanks a lot for your help! The code worked perfectly once I dropped the "-t" options: I just used "sort -k1,1" and "join -1 1 -2 1 sorted1.txt sorted2.txt". Could you also give me a hint on how to recreate the two FASTQ files with cut and awk? Unfortunately I am not yet very fluent with Linux shell commands. One additional question: what happens if an ID is present in one file but not in the other?

10.0 years ago
Steffi ▴ 10

Following up on Pierre's post: once I had the joined, sorted file, I recreated my two mate-pair files in the following way:

# columns 2-5 hold the four FASTQ lines of mate 1, columns 6-9 those of mate 2
awk '{ print $2 "\n" $3 "\n" $4 "\n" $5 }' joined_sorted_file > mate1_sort

awk '{ print $6 "\n" $7 "\n" $8 "\n" $9 }' joined_sorted_file > mate2_sort


I am not sure whether this is the most efficient way to do it, but at least it works :)


Or you can replace awk with tr "\t" "\n"


Or, more precisely, you can replace awk with cut -f 2,3,4,5 | tr "\t" "\n" (and cut -f 6,7,8,9 for the second mate).
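
A minimal sketch of this cut/tr variant, assuming the joined file is tab-separated with the key in column 1, mate 1 in columns 2-5 and mate 2 in columns 6-9 (the one-line joined.txt and the output names are invented for illustration). If join was run without -t, its output is space-separated, and cut -d ' ' with tr ' ' '\n' would be needed instead:

```shell
cd "$(mktemp -d)" || exit 1
# Hypothetical joined line: key, then the 4 fields of mate 1, then the 4 fields of mate 2.
printf '@r2\t@r2/1\tGGCC\t+\tIIII\t@r2/2\tTTAA\t+\tHHHH\n' > joined.txt

# cut splits on tabs by default; tr then turns each remaining tab into a
# newline, which restores the four-line FASTQ record.
cut -f 2-5 joined.txt | tr '\t' '\n' > mate1.fastq
cut -f 6-9 joined.txt | tr '\t' '\n' > mate2.fastq
```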

10.0 years ago
Zhidkov ▴ 580

Hi,

Have a look at this thread: "Selecting Random Pairs From Fastq?"

Ilia


Hi, the idea was to use the sorting approach from the thread above: turn each FASTQ record into one tabular line, then remove the '#/1' in the seq1 library and the '#/2' in the seq2 library. After that, sort each file on the first column (the read ID). Once sorted, add the '#/1' and '#/2' back to the end of the first column and turn the tabular FASTQ back into the original four-lines-per-read format.
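
The steps described above could be sketched as follows. This is only an illustration under assumptions: the helper name sort_by_id, the file names and the two-read libraries are hypothetical, read IDs are assumed to end in /1 or /2, and paste stands in for the tabular conversion:

```shell
cd "$(mktemp -d)" || exit 1
# Hypothetical libraries whose reads are stored in different orders.
printf '@b#0/1\nGGCC\n+\nIIII\n@a#0/1\nACGT\n+\nIIII\n' > seq1.fastq
printf '@a#0/2\nTTAA\n+\nHHHH\n@b#0/2\nCCGG\n+\nHHHH\n' > seq2.fastq

tab=$(printf '\t')   # literal tab for sort -t

# sort_by_id MATE INPUT OUTPUT  (hypothetical helper)
#   1. paste: four FASTQ lines -> one tab-separated row
#   2. awk:   strip the trailing /MATE from the ID column
#   3. sort:  order the rows by ID
#   4. awk:   add /MATE back and restore the four-line layout
sort_by_id() {
  paste - - - - < "$2" |
    awk -F'\t' -v OFS='\t' -v m="$1" '{ sub("/" m "$", "", $1); print }' |
    LC_ALL=C sort -t "$tab" -k1,1 |
    awk -F'\t' -v m="$1" '{ printf "%s/%s\n%s\n%s\n%s\n", $1, m, $2, $3, $4 }' > "$3"
}

sort_by_id 1 seq1.fastq seq1.sorted.fastq
sort_by_id 2 seq2.fastq seq2.sorted.fastq
```

Sorting both libraries under the same locale (LC_ALL=C here) keeps the two outputs in the same ID order. Reads that exist in only one library are not removed by this sketch.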

Ilia


Actually, I do not want to select a random subset. I want to check whether the sequences are in the same order and, if not, reorder them so that the first read in the first file is the mate of the first read in the second file. Of course, I could do this within BioC, but if the FASTQ files are large it takes quite a while to process them.
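
Checking the order first needs nothing more than the ID lines. A minimal sketch, assuming uncompressed files and IDs ending in /1 and /2 (the helper names ids and same_order and the example files are hypothetical):

```shell
cd "$(mktemp -d)" || exit 1
# Hypothetical example files: mate2.fastq matches mate1.fastq, the shuffled copy does not.
printf '@a/1\nACGT\n+\nIIII\n@b/1\nGGCC\n+\nIIII\n' > mate1.fastq
printf '@a/2\nTTAA\n+\nHHHH\n@b/2\nCCGG\n+\nHHHH\n' > mate2.fastq
printf '@b/2\nCCGG\n+\nHHHH\n@a/2\nTTAA\n+\nHHHH\n' > mate2_shuffled.fastq

# Print the ID of every record (line 1 of each group of 4) without /1 or /2.
ids() { awk 'NR % 4 == 1 { sub(/\/[12]$/, "", $1); print $1 }' "$1"; }

# Succeed only if both files list the same IDs in the same order.
same_order() {
  ids "$1" > ids1.txt
  ids "$2" > ids2.txt
  cmp -s ids1.txt ids2.txt
}

if same_order mate1.fastq mate2.fastq; then echo "same order"; else echo "needs sorting"; fi
if same_order mate1.fastq mate2_shuffled.fastq; then echo "same order"; else echo "needs sorting"; fi
```

With these example files the first check reports "same order" and the second "needs sorting". Only when the check fails is the more expensive sort and join needed.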