Converting Solexa Reads To Fastq And Identifying Mismatched Bases
2
0
Entering edit mode
9.7 years ago
soosus ▴ 90

I would like to write a simple script or single line perl to convert the following two Solexa sequence reads (assume these reads in a text file "Solexa.export.txt") into fastq format. (Note: the quality scores are coded as ASCII offset of 64):

ILLUMINA-A93428 66      1       1       7848    1267    CGATGT  1       ATGTTGGCTGGCGGTGAAATAAATCTCAAACGTACCTGTTACAAAATTGTGTTAATCCTTTCAGATTCGCAG   ggggggggggggggcfdffdgggfefffffcfcffcccacBeddddded_cffefgggggBBBBBBBBBBBB chrX.fa         122287709       R       40C21GAC7       269                Y

ILLUMINA-A93428 66      1       1       8782    1281    CGATGT  1       GACTCCGGGAAAGCAAGCTGAAGTGTTTTGTGGGACAGGCACATCATTTGGTCAACTCCTTTTCCCTGGTTG   hhhhhhhhhhhhhhhhhhhhghfehhhhhhghfhf_ffffghhhhhhhhdaaghhghhhhgfeghfghhe] chrX.fa         144709339       F       72      335                        Y


Also need help identifying mismatched bases in "Solexa.export.txt"

r fastq • 2.7k views
0
Entering edit mode
9.7 years ago

You might try this script to convert to SAM/BAM. Then, you can use tools like picard Sam2Fastq and samtools calmd to get the information you need.

https://github.com/samtools/samtools/blob/master/misc/export2sam.pl

0
Entering edit mode
9.7 years ago
Mitch Bekritsky ★ 1.3k

Try this:

awk '/./ {print "@"$1"\n"$9"\n+\n"$10}' Solexa.export.txt  The '/./' in the awk command skips any blank lines. When I run this one-liner on your sample script, it produces the following output: @ILLUMINA-A93428 ATGTTGGCTGGCGGTGAAATAAATCTCAAACGTACCTGTTACAAAATTGTGTTAATCCTTTCAGATTCGCAG + ggggggggggggggcfdffdgggfefffffcfcffcccacBeddddded_cffefgggggBBBBBBBBBBBB @ILLUMINA-A93428 GACTCCGGGAAAGCAAGCTGAAGTGTTTTGTGGGACAGGCACATCATTTGGTCAACTCCTTTTCCCTGGTTG + hhhhhhhhhhhhhhhhhhhhghfehhhhhhghfhf_ffffghhhhhhhhdaaghhghhhhgfeghfghhe]  When you say that the encoding is offset by 64, does that mean you want to change it, or is that just an FYI? Also, what do you mean that you want to identify mismatched bases? Do you just want to find reads that don't match perfectly? In that case, try this: awk '$14 !~ /^[0-9]+\$/ && /./' Solexa.export.txt


I'm assuming that the 14th field in Solexa.export.txt reports the alignment of the read to the reference genome. In that case, any entry where field 14 isn't only digits would imply that there are mismatched bases. In your second sequence, field 14 reports 72, the length of the read. In the first sequence, it reports 40C21GAC7, implying 40 matched bases, then a mismatched C, etc.

0
Entering edit mode

I mean the chromosomal position of locations where the reference base does not match the alternative base. Also, thanks for your reply!

0
Entering edit mode

Not sure about the exact math, but you can parse the 14th field and use the start position of the read to get the exact position of the mismatch in the reference genome. Not sure which field represents position in your output since it doesn't look like it's in SAM format, but you could take the start site, add 41, and that should be the position of the C mismatch. Should be a little perl script.

Happy to help!