Write Script For Selection Of Fastq File With Sanger Format
1
3
Entering edit mode
12.8 years ago
Bioscientist ★ 1.7k

Now I have many FASTQ files, and I need to select those with sanger format. Advisor told me: zcat *.fastq.gz |grep "J" (seems there's no "J" in the information of sanger format; while you can always find "J" in other format, though I'm not sure about this) If the output is NOTHING, then it's sanger format that we want; if there's anything in the output, then it should be discarded.

Question is how I can write the script to finish this task? I'm beginner and never write code before.thanks!

fastq • 3.2k views
ADD COMMENT
0
Entering edit mode

I'm unclear on the question. Your advisor gave you a series of bash commands that will tell you whether the file is Phred+33 or Phred+64. That should be all you need. Are you trying to figure out how to parse the output?

BTW, for info on different formats, I like the [?]Wikipedia page on fastq format[?]. That's where your advisor got his "J" trick from.

ADD REPLY
0
Entering edit mode

I'm unclear on the question. Your advisor gave you a series of bash commands that will tell you whether the file is Phred+33 or Phred+64. That should be all you need. Are you trying to figure out how to parse the output? BTW, for info on different formats, I like the Wikipedia page on fastq format, http://en.wikipedia.org/wiki/FASTQ_format. That's where your advisor got his "J" trick from.

ADD REPLY
0
Entering edit mode

thx mitch, yes you are right. I'm trying to figure out how to parse output based on the trick advisor told me; ie., I hope sb can give me clues for the SCRIPT for parsing the output. I understand the logic but don't know how to write script.

ADD REPLY
7
Entering edit mode
12.8 years ago
brentp 24k

For kicks, I implemented this here.

It uses the information from the helpful wikipedia page about the FASTQ format. It assumes the input is from STDIN and that it's only getting quality lines, so, for FASTQ, you'll call it like:

awk 'NR % 4 == 0' your.fastq | python guess-encoding.py

Where the awk command takes every 4th line (the quality line) from your file. By default, it will run until it reads the entire file, or narrows down the file-type to a single possible encoding; you can make it check 1000 qual lines like this:

awk 'NR % 4 == 0' your.fastq | python guess-encoding.py -n 1000

Usually that's enough to determine if it's sanger/non-sanger. Or you could just edit RANGES to include only 'Sanger' and 'Illumina'.

ADD COMMENT
0
Entering edit mode

Neat script! One suggestion if you used islice you would not need to pipe it through awk.

stream = islice(sys.stdin, 3, None, 4)

ADD REPLY
0
Entering edit mode

Nice script indeed. But you don't even need the islice trick, either. Why not just add:

if i % 4 == 0: continue

At the start of your for loop in the python script (and, obviously, divide i/4 in your "if done" test)?

ADD REPLY
0
Entering edit mode

Yeah, I originally had % 4 == 0 and that's probably the most user friendly. But then what if you just have a couple chars? As it is now, you can just do echo 'XZA' | python guess-encoding.py. Plus, I think it's probably a lot faster to have awk cut down the number of lines to 1/4th

ADD REPLY
0
Entering edit mode

Nice script, indeed.

ADD REPLY

Login before adding your answer.

Traffic: 2503 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6