Question: Converting Solexa Reads To Fastq And Identifying Mismatched Bases
0
gravatar for soosus
7.2 years ago by
soosus80
soosus80 wrote:

I would like to write a simple script or single line perl to convert the following two Solexa sequence reads (assume these reads in a text file "Solexa.export.txt") into fastq format. (Note: the quality scores are coded as ASCII offset of 64):

ILLUMINA-A93428 66      1       1       7848    1267    CGATGT  1       ATGTTGGCTGGCGGTGAAATAAATCTCAAACGTACCTGTTACAAAATTGTGTTAATCCTTTCAGATTCGCAG   ggggggggggggggcfdffdgggfefffffcfcffcccacBeddddded_cffefgggggBBBBBBBBBBBB chrX.fa         122287709       R       40C21GAC7       269                Y

ILLUMINA-A93428 66      1       1       8782    1281    CGATGT  1       GACTCCGGGAAAGCAAGCTGAAGTGTTTTGTGGGACAGGCACATCATTTGGTCAACTCCTTTTCCCTGGTTG   hhhhhhhhhhhhhhhhhhhhghfehhhhhhghfhf_ffffghhhhhhhhdaaghhghhhhgfeghfghhe`] chrX.fa         144709339       F       72      335                        Y

Also need help identifying mismatched bases in "Solexa.export.txt"

R fastq • 2.0k views
ADD COMMENTlink modified 7.2 years ago by Mitch Bekritsky1.2k • written 7.2 years ago by soosus80
0
gravatar for Sean Davis
7.2 years ago by
Sean Davis26k
National Institutes of Health, Bethesda, MD
Sean Davis26k wrote:

You might try this script to convert to SAM/BAM. Then, you can use tools like picard Sam2Fastq and samtools calmd to get the information you need.

https://github.com/samtools/samtools/blob/master/misc/export2sam.pl

ADD COMMENTlink written 7.2 years ago by Sean Davis26k
0
gravatar for Mitch Bekritsky
7.2 years ago by
Mitch Bekritsky1.2k
London, England
Mitch Bekritsky1.2k wrote:

Try this:

awk '/./ {print "@"$1"\n"$9"\n+\n"$10}' Solexa.export.txt

The '/./' in the awk command skips any blank lines.

When I run this one-liner on your sample script, it produces the following output:

@ILLUMINA-A93428
ATGTTGGCTGGCGGTGAAATAAATCTCAAACGTACCTGTTACAAAATTGTGTTAATCCTTTCAGATTCGCAG
+
ggggggggggggggcfdffdgggfefffffcfcffcccacBeddddded_cffefgggggBBBBBBBBBBBB
@ILLUMINA-A93428
GACTCCGGGAAAGCAAGCTGAAGTGTTTTGTGGGACAGGCACATCATTTGGTCAACTCCTTTTCCCTGGTTG
+
hhhhhhhhhhhhhhhhhhhhghfehhhhhhghfhf_ffffghhhhhhhhdaaghhghhhhgfeghfghhe`]

When you say that the encoding is offset by 64, does that mean you want to change it, or is that just an FYI?

Also, what do you mean that you want to identify mismatched bases? Do you just want to find reads that don't match perfectly? In that case, try this:

awk '$14 !~ /^[0-9]+$/ && /./' Solexa.export.txt

I'm assuming that the 14th field in Solexa.export.txt reports the alignment of the read to the reference genome. In that case, any entry where field 14 isn't only digits would imply that there are mismatched bases. In your second sequence, field 14 reports 72, the length of the read. In the first sequence, it reports 40C21GAC7, implying 40 matched bases, then a mismatched C, etc.

ADD COMMENTlink written 7.2 years ago by Mitch Bekritsky1.2k

I mean the chromosomal position of locations where the reference base does not match the alternative base. Also, thanks for your reply!

ADD REPLYlink written 7.2 years ago by soosus80

Not sure about the exact math, but you can parse the 14th field and use the start position of the read to get the exact position of the mismatch in the reference genome. Not sure which field represents position in your output since it doesn't look like it's in SAM format, but you could take the start site, add 41, and that should be the position of the C mismatch. Should be a little perl script.

Happy to help!

ADD REPLYlink written 7.2 years ago by Mitch Bekritsky1.2k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 822 users visited in the last hour