For an interview I was required to write a python script with the following specifications:
"-send me a python script that conforms to the following specifications:
It takes a single command line argument, which is the full path to a fasta file
It writes one line on the standard out for each fasta record
2a. The line consists of two values, separated by a space: the sequence id, and the percent GC
Below is the script that I came up with. The interviewer got back to me a few weeks later and said that my program had a bug and did not follow specifications. He was not able to specify what my errors were which left me confused because my program was running fine when I sent it to him.
Can anyone point out a bug or how I failed to follow specifications?
Thank you for your help!
def gc_content (filename): gc = at = unknown = 0 with open(filename, 'r') as f: for line in f: if line[:1] == '>': if (gc + at) > 0: # used to skip first ID line total = gc + at percentage = int(round(gc / total * 100)) print "%s %d" %(seq_id, percentage) seq_id = line.strip() # saves the ID line for printing later gc = at = unknown = 0 else: nuc_str = list(line.strip()) for n in nuc_str: if n == 'G' or n == 'g' or n == 'C' or n == 'c': gc += 1.0 elif n == 'A' or n == 'a' or n == 'T' or n == 't': at += 1.0 else: unknown += 1.0 # usually represented by an 'N' in the fasta file. print "%s %d" % (seq_id, percentage)
For some reason indenting is off at the top, but it is correct in my original script
If I see correctly, the GC will not be calculated for the last record, as that calculation happens only when a new sequence header is encountered. In your case, the very last record will be reported with the GC of the previous record.
Other thoughts: Maybe they wanted you to calculate GC over all bases and not just unambiguous ones...
What happens with a sequence that contains only N or other ambiguity codes? That sequence will be omitted from your output, if I see correctly - maybe that violates spec 2.
Other than that, you have some (for my eye) weird constructs in your code, which are not wrong per se, but maybe put them off.
line[:1], why not
nuc_str = line.strip().upper()saves you checking for upper/lowercase and you could e.g. use a
That said, those things are not wrong, but look a bit like beginner's code, so might be off-putting for potential employers.
Agreed, although the last line does take care of the final sequence in the file. Still, lots of stuff in the code that reads like it was kind of dashed together from Stackoverflow posts.
gc = at = unknown = 0is more commonly
gc, at, unknown = (0, 0, 0), although both are technically correct. However the multiple assignment is an anti-pattern in Python:
EDIT: also the question was to accept one argument from the command line. You've written a function that accepts one argument, but you can't run it as a standalone python script in the way the interviewer requests.
The last line takes care of (i.e. prints) the last record but the GC for the last record is never calculated (or am I blind?)
Edit: clarified that take care means print
Yes. The last line prints out the GC content for the last record.
No, it does not because it never runs the calculation for the last sequence. It will instead take the GC% for the second-to-last sequence.
When I run your code on this:
then I'm getting this output
seq3 clearly does not have 50% GC, also it prints the '>', which I think could also be 'off-spec'.
Thank you for pointing this out. I think the last two sequences of the fasta file I tested my script on actually had the same GC content which is why I might have overlooked this, just proving I should have done more tests. Do you think there is a better approach I could have used? It seems like skipping the first header line and retroactively printing is a road to confusions such as these, but I did not see anyway around it.
Biopython SeqIO can spare you many headaches when parsing common filetypes, including fasta. See the excellent tutorial and cookbook. My code below also uses SeqIO
My code boils down to following core functions:
This processes each sequence separately and doesn't complicate stuff. Note that there is no need to add A and T and N, just count G and C and divide by length. My code also rounds the percentage, up to you what you want to do with that. As a bonus it also prints out the length of each sequence...
If something is unclear, please ask and I would be happy to explain.
No, apparently I'm blind :)
Thank you so much for pointing this out! I interpreted my "gc_content("/home/pat/Documents/FTO_Exons.fasta")" as the single command line argument, but I now see that it should have been %python GC_content.py /path_to_fasta_file/.
As an aside, I think this is one of the problems with testing-as-an-interview-screen since the OP could have great biology or reasoning skills but just doesn't nail this specific problem. It's obvious that there's lots of logic in this python function, also understanding of the problem, and an informal chat about how to solve the problem might have been more meaningful than example code. I'd certainly fall into some of the same traps if someone asked me to solve this problem in Perl.
Thank you for the encouragement! And I agree that a lot of this could have been sorted out if I had an opportunity to chat with him in person. Speaking of Perl, I generally do most of my programming in Perl, and I am just starting to pick up python. So if my python looks a little unconventional it is because I am still getting the hang of it.
Sharing code is more convenient (definitely for indentation) using a github gist. When you add the link to the gist in your post it will get recognized and rendered by biostars.
I'll throw in a python script I wrote to get GC content of either a fasta file or intervals in a bed-like file (with header!). Note that I haven't updated it in a while and was less good at programming at that time :-D Also, the
gidictshould be replaced since these are phased out.
Thank you for pointing this out. I am brand new to BioStars and I will keep this in mind for future posts :)