GC Content of Fasta file --- Python Help
0
1
Entering edit mode
6.1 years ago

Hello all!

For an interview I was required to write a python script with the following specifications:

"-send me a python script that conforms to the following specifications:

1. It takes a single command line argument, which is the full path to a fasta file

2. It writes one line on the standard out for each fasta record

2a. The line consists of two values, separated by a space: the sequence id, and the percent GC

Example output:

seq1 100

seq2 50

seq3 0

Below is the script that I came up with. The interviewer got back to me a few weeks later and said that my program had a bug and did not follow specifications. He was not able to specify what my errors were which left me confused because my program was running fine when I sent it to him.

Can anyone point out a bug or how I failed to follow specifications?

Thank you for your help!

def gc_content (filename):
gc = at = unknown = 0
with open(filename, 'r') as f:
for line in f:
if line[:1] == '>':
if (gc + at) > 0: # used to skip first ID line
total = gc + at
percentage = int(round(gc / total * 100))
print "%s %d" %(seq_id, percentage)
seq_id = line.strip()  # saves the ID line for printing later
gc = at = unknown = 0
else:
nuc_str = list(line.strip())
for n in nuc_str:
if n == 'G' or n == 'g' or n == 'C' or n == 'c':
gc += 1.0
elif n == 'A' or n == 'a' or n == 'T' or n == 't':
at += 1.0
else:
unknown += 1.0  # usually represented by an 'N' in the fasta file.
print "%s %d" % (seq_id, percentage)


gc_content("/home/pat/Documents/FTO_Exons.fasta")

For some reason indenting is off at the top, but it is correct in my original script

sequence • 13k views
2
Entering edit mode

If I see correctly, the GC will not be calculated for the last record, as that calculation happens only when a new sequence header is encountered. In your case, the very last record will be reported with the GC of the previous record.

Other thoughts: Maybe they wanted you to calculate GC over all bases and not just unambiguous ones...

What happens with a sequence that contains only N or other ambiguity codes? That sequence will be omitted from your output, if I see correctly - maybe that violates spec 2.

Other than that, you have some (for my eye) weird constructs in your code, which are not wrong per se, but maybe put them off.

Examples:

1. line[:1], why not line[0] or line.startswith()
2. There is no reason to convert the sequence line into a list. You may as well directly iterate over the string.
3. As for the counting: nuc_str = line.strip().upper() saves you checking for upper/lowercase and you could e.g. use a Counter instead.

That said, those things are not wrong, but look a bit like beginner's code, so might be off-putting for potential employers.

2
Entering edit mode

Agreed, although the last line does take care of the final sequence in the file. Still, lots of stuff in the code that reads like it was kind of dashed together from Stackoverflow posts. gc = at = unknown = 0 is more commonly gc, at, unknown = (0, 0, 0), although both are technically correct. However the multiple assignment is an anti-pattern in Python:

>>> x = y = z = []
>>> x.append(1)
>>> y
[1]


EDIT: also the question was to accept one argument from the command line. You've written a function that accepts one argument, but you can't run it as a standalone python script in the way the interviewer requests.

1
Entering edit mode

The last line takes care of (i.e. prints) the last record but the GC for the last record is never calculated (or am I blind?)

Edit: clarified that take care means print

0
Entering edit mode

Yes. The last line prints out the GC content for the last record.

1
Entering edit mode

No, it does not because it never runs the calculation for the last sequence. It will instead take the GC% for the second-to-last sequence.

When I run your code on this:

>seq1
GCGCGC
>seq2
GGCCGAAAAA
>seq3
ATATATTA


then I'm getting this output

>seq1 100
>seq2 50
>seq3 50


seq3 clearly does not have 50% GC, also it prints the '>', which I think could also be 'off-spec'.

0
Entering edit mode

Thank you for pointing this out. I think the last two sequences of the fasta file I tested my script on actually had the same GC content which is why I might have overlooked this, just proving I should have done more tests. Do you think there is a better approach I could have used? It seems like skipping the first header line and retroactively printing is a road to confusions such as these, but I did not see anyway around it.

0
Entering edit mode

Biopython SeqIO can spare you many headaches when parsing common filetypes, including fasta. See the excellent tutorial and cookbook. My code below also uses SeqIO

My code boils down to following core functions:

def getGC(seq):
return(str(int(round((sum([1.0 for nucl in seq if nucl in ['G', 'C']]) / len(seq)) * 100))))

def processFasta(fas):
print("identifier\tGCcontent\tlength")
for record in SeqIO.parse(fas, "fasta"):
print("{}\t{}\t{}".formatrecord.id, getGC(str(record.seq)), str(len(record.seq))))


This processes each sequence separately and doesn't complicate stuff. Note that there is no need to add A and T and N, just count G and C and divide by length. My code also rounds the percentage, up to you what you want to do with that. As a bonus it also prints out the length of each sequence...

If something is unclear, please ask and I would be happy to explain.

0
Entering edit mode

No, apparently I'm blind :)

1
Entering edit mode

Thank you so much for pointing this out! I interpreted my "gc_content("/home/pat/Documents/FTO_Exons.fasta")" as the single command line argument, but I now see that it should have been %python GC_content.py /path_to_fasta_file/.

1
Entering edit mode

As an aside, I think this is one of the problems with testing-as-an-interview-screen since the OP could have great biology or reasoning skills but just doesn't nail this specific problem. It's obvious that there's lots of logic in this python function, also understanding of the problem, and an informal chat about how to solve the problem might have been more meaningful than example code. I'd certainly fall into some of the same traps if someone asked me to solve this problem in Perl.

0
Entering edit mode

Thank you for the encouragement! And I agree that a lot of this could have been sorted out if I had an opportunity to chat with him in person. Speaking of Perl, I generally do most of my programming in Perl, and I am just starting to pick up python. So if my python looks a little unconventional it is because I am still getting the hang of it.

0
Entering edit mode

Sharing code is more convenient (definitely for indentation) using a github gist. When you add the link to the gist in your post it will get recognized and rendered by biostars.

I'll throw in a python script I wrote to get GC content of either a fasta file or intervals in a bed-like file (with header!). Note that I haven't updated it in a while and was less good at programming at that time :-D Also, the gidict should be replaced since these are phased out.

0
Entering edit mode

Thank you for pointing this out. I am brand new to BioStars and I will keep this in mind for future posts :)