A company performed paired-end sequencing of exomic libraries for us. When I requested the read data in FASTQ format, they mailed me a hard drive with _export.txt files. The alignments were done against UCSC hg18, and I want to realign the samples myself. The files for one sample are s_6_1_export.txt and s_6_2_export.txt.
Interestingly, these files do not have the same number of lines: s_6_1 has 67,003,423, while s_6_2 has 144 fewer. They are also not ordered such that the paired reads fall on the same lines in the two files.
My first move was to sort both by the cluster coordinates (columns 5 and 6) so that the paired reads would be on the same line in each file.
sort -t $'\t' -k 5n,5 -k 6n,6 -S 40G s_6_1_export.txt > s_6_1_sorted.txt &
sort -t $'\t' -k 5n,5 -k 6n,6 -S 40G s_6_2_export.txt > s_6_2_sorted.txt &
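A note on the sort key: x and y coordinates are normally only unique within a tile, so two clusters on different tiles can share the same (x, y) pair. The following is a minimal sketch of a sort on the fuller key, assuming the usual export layout in which column 3 is the lane and column 4 is the tile (an assumption; check it against your files). The function name sort_by_cluster is only illustrative.

```shell
# Sketch: sort an export file by lane, tile, x, y instead of x, y alone.
# Assumption: columns 3 and 4 hold lane and tile; columns 5 and 6 hold x and y.
sort_by_cluster() {
    # $1 = input export file, $2 = sorted output file
    tab=$(printf '\t')
    LC_ALL=C sort -t "$tab" -k 3,3n -k 4,4n -k 5,5n -k 6,6n "$1" > "$2"
}

# Example call (file names taken from the commands above):
# sort_by_cluster s_6_1_export.txt s_6_1_sorted.txt
```

A buffer size option such as -S 40G can be added to the sort call in the same way as in the original commands.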
However, I do not think this will work unless all of the extra reads in the s_6_1 file sort to the end of the resulting file. I was then planning to use CASAVA to convert each of the sorted export.txt files to FASTQ with the commands:
CASAVA -a Export2Fastq -e s_6_1_sorted.txt -o s_6_1.fq --purityFilter=YES
CASAVA -a Export2Fastq -e s_6_2_sorted.txt -o s_6_2.fq --purityFilter=YES
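For comparison, the same conversion can be sketched directly in awk. This is an illustration under assumptions, not CASAVA's own behaviour: it assumes a 22-column export layout with the sequence in column 9, the quality string in column 10, the read number in column 8, and the purity filter flag (Y or N) in column 22. The function name export_to_fastq and the read-name format are made up for this example.

```shell
# Sketch: convert an export file to FASTQ, keeping only filter-passing reads.
# Assumed columns: 1 machine, 2 run, 3 lane, 4 tile, 5 x, 6 y, 8 read number,
#                  9 sequence, 10 quality, 22 purity filter flag (Y/N).
export_to_fastq() {
    # $1 = export file, $2 = FASTQ output
    awk -F '\t' '$22 == "Y" {
        # Read name built from the cluster identity plus /1 or /2
        printf "@%s_%s:%s:%s:%s:%s/%s\n%s\n+\n%s\n", $1, $2, $3, $4, $5, $6, $8, $9, $10
    }' "$1" > "$2"
}

# Example call:
# export_to_fastq s_6_1_sorted.txt s_6_1.fq
```

The quality string is copied unchanged, so whatever quality encoding the export file uses carries over to the FASTQ file and should be checked before alignment.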
From the resulting .fq files I was going to use BWA to do paired-end alignment to UCSC hg19, followed by downstream analysis of the resulting BAM file.
My issue is that I'm baffled as to how to get from the _export.txt files given to me, back to fastq such that the paired nature of the read data is preserved. Could anyone offer guidance or suggestions here?
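One way to keep the pairing intact despite the 144 extra records is to first drop every record whose cluster does not appear in both files, and only then sort the two files identically. The sketch below assumes that lane, tile, x and y sit in columns 3 to 6 and that each cluster occurs at most once per file; both are assumptions to verify. The function name keep_shared_clusters is hypothetical.

```shell
# Sketch: keep only records whose cluster key (lane:tile:x:y) is in both files.
# Assumption: columns 3-6 are lane, tile, x, y, and keys are unique per file.
keep_shared_clusters() {
    # $1, $2 = read 1 and read 2 export files; $3, $4 = filtered outputs
    # First file given to awk fills the key table; second file is filtered.
    prog='NR == FNR { seen[$3 ":" $4 ":" $5 ":" $6] = 1; next }
          ($3 ":" $4 ":" $5 ":" $6) in seen'
    awk -F '\t' "$prog" "$2" "$1" > "$3"
    awk -F '\t' "$prog" "$1" "$2" > "$4"
}

# Example call:
# keep_shared_clusters s_6_1_export.txt s_6_2_export.txt s_6_1_shared.txt s_6_2_shared.txt
```

With roughly 67 million records the key table held by awk will need a lot of memory. After filtering, the two outputs should have the same line count, and sorting both on the same key should put each pair on matching lines.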