Question: How To Sort Two Mate Pair (Fastq) Files So That The Order Of The Identifiers Is The Same?
Steffi (Germany) wrote, 9.3 years ago:

If there are reads that occur in only one of the two files, I would like to remove them from that file and store their IDs somewhere.

Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087) wrote, 9.3 years ago:

Linearize your two FASTQ files with awk, add a new column holding a common "key" (here the read name before the "/"), and sort on that key:

# one tab-separated line per read, prefixed with the key (read name before "/")
gunzip -c file1.fastq.gz |\
awk '{ printf("%s",$0); n++; if(n%4==0) { printf("\n");} else { printf("\t");} }' |\
awk '{i=index($1,"/"); printf("%s\t%s\n",substr($1,1,i-1),$0);}' |\
sort -t $'\t' -k1,1 > sorted1.txt

#same for  file2.fastq.gz
(..) > sorted2.txt

Join both files with Unix join:

join -t $'\t' -1 1 -2 1  sorted1.txt  sorted2.txt

Then recreate the two FASTQ files with cut and awk.


Thanks a lot for your help! The code worked perfectly once I changed the following options: I just used "sort -k1,1" and "join -1 1 -2 1 sorted1.txt sorted2.txt" (so always without the "-t" option). Could you maybe also give me a hint on how to recreate the two FASTQ files with cut and awk? Unfortunately I am not yet very fluent with Linux shell commands. One additional question: what happens if one file contains an ID that is not present in the other file?
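
On the last question: by default, join prints only the keys found in both inputs, so a read that is present in just one file is silently left out of the joined output. A minimal sketch of how the unpaired IDs could be saved, assuming sorted1.txt and sorted2.txt are tab-separated with the key in column 1 and were sorted in the same locale that join runs in. The function name list_orphans and the output names orphans1.txt and orphans2.txt are made up for illustration:

```shell
# Hypothetical helper (not from the thread): write the keys that occur in
# only one of two sorted, tab-separated tables.
# Output file names orphans1.txt / orphans2.txt are illustrative.
list_orphans() {
    sorted1=$1
    sorted2=$2
    tab=$(printf '\t')
    # -v 1 prints the lines of file 1 that have no partner in file 2,
    # -v 2 does the same for file 2; cut keeps only the key column.
    join -t "$tab" -v 1 "$sorted1" "$sorted2" | cut -f 1 > orphans1.txt
    join -t "$tab" -v 2 "$sorted1" "$sorted2" | cut -f 1 > orphans2.txt
}
```

Called as list_orphans sorted1.txt sorted2.txt, this leaves the normal join output untouched and only adds the two ID lists.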

(Reply written 9.3 years ago by Steffi)
Steffi wrote, 9.3 years ago:

Following up on Pierre's post: once I had the joined, sorted file, I recreated my two mate-pair files in the following way:

awk '{print $2"\n"$3"\n"$4"\n"$5}' joined_sorted_file > mate1_sort

awk '{print $6"\n"$7"\n"$8"\n"$9}' joined_sorted_file > mate2_sort

I am not sure if this is the most efficient way to do so but at least it works :)


Or you can replace awk with tr "\t" "\n"

(Reply written 9.3 years ago by Pierre Lindenbaum)

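Or, to keep only the four fields of one mate, you can replace awk with cut -f 2,3,4,5 | tr "\t" "\n" (use fields 6,7,8,9 for the second mate; cut splits on tabs, so the joined file must be tab-separated).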
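
Putting the steps of this thread together, the whole procedure might look like the sketch below. It assumes uncompressed FASTQ with exactly four lines per read and read names ending in "/1" and "/2". The function names linearize_sorted and pair_fastq are hypothetical; the file names follow the ones used above. LC_ALL=C is set on both sort and join so that they agree on the ordering of the keys.

```shell
# Hypothetical helper: read FASTQ on stdin and print one tab-separated line
# per read, prefixed with the read name before the first "/", sorted on it.
linearize_sorted() {
    tab=$(printf '\t')
    awk '{ printf("%s", $0); n++; if (n % 4 == 0) printf("\n"); else printf("\t") }' |
    awk -F '\t' '{ split($1, h, "/"); printf("%s\t%s\n", h[1], $0) }' |
    LC_ALL=C sort -t "$tab" -k1,1
}

# Hypothetical helper: keep only reads present in both files and write them,
# in the same order, to mate1_sort and mate2_sort.
pair_fastq() {
    tab=$(printf '\t')
    linearize_sorted < "$1" > sorted1.txt
    linearize_sorted < "$2" > sorted2.txt
    # Joined line: key, 4 fields of mate 1, 4 fields of mate 2.
    LC_ALL=C join -t "$tab" sorted1.txt sorted2.txt > joined_sorted_file
    cut -f 2-5 joined_sorted_file | tr '\t' '\n' > mate1_sort
    cut -f 6-9 joined_sorted_file | tr '\t' '\n' > mate2_sort
}
```

For gzipped input, the files would first have to be decompressed (for example with gunzip -c as above) before being passed to pair_fastq.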

(Reply written 9.3 years ago by Pierre Lindenbaum)
Zhidkov (Israel) wrote, 9.3 years ago:

Hi,

have a look at this thread: Selecting Random Pairs From Fastq?

Ilia


Hi, the idea was to use the sorting approach from the thread above: convert the FASTQ to one line per read (tabular), then remove the '#/1' in the seq1 library and the '#/2' in the seq2 library. After that, sort each file on the first column (the read ID), add the '#/1' and '#/2' back to the end of the first column, and convert the tabular FASTQ back to the original four-lines-per-read format.

Ilia

(Reply written 9.3 years ago by Zhidkov)

Actually, I do not want to select a random subset; I want to check whether the sequences are in the same order and, if they are not, order them so that the first read in the first file is the mate of the first read in the second file. Of course, I could do this within BioC, but if the FASTQ files are large, it takes quite a while to process them.
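
The "are they already in the same order" check can be done without sorting anything, by comparing only the header lines of the two files. A rough sketch, assuming uncompressed FASTQ with four lines per read and names ending in "/1" and "/2". The function names read_names and same_order and the temporary file names names1.txt and names2.txt are invented for this example:

```shell
# Hypothetical helper: print the read names of a FASTQ file (every 4th line,
# starting at line 1) with a trailing /1 or /2 removed.
read_names() {
    awk 'NR % 4 == 1 { sub(/\/[12]$/, ""); print }' "$1"
}

# Hypothetical helper: exit status 0 if both files contain the same read
# names in the same order, non-zero otherwise.
same_order() {
    read_names "$1" > names1.txt
    read_names "$2" > names2.txt
    cmp -s names1.txt names2.txt
}
```

Usage would be something like: same_order mate1.fastq mate2.fastq && echo "already paired". Only when this check fails is the slower sort and join procedure needed.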

(Reply written 9.3 years ago by Steffi)