Hi all, I finally got that illusive Bioinformatician job and am now undergoing training. My first project is to write a FASTA parsing script that will take a FASTA file and split it into multiple files each containing a gene and its sequence. The task is to be done in pure Python 3.7 so no Biopython for me.
...
TestFile = "/Users/me/Desktop/PythonScrip/testanimals.fasta"
f1 = open(TestFile, "r")
titlelist = []
seqlist=[]
line = f1.readline()
if not line.startswith('>'):
raise TypeError("Not a Fasta, Check file and start again.")
for line in f1:
if line.startswith(">"):
titlelist.append(line)
if not line.startswith(">"):
seqlist.append(line)
if line.startswith('\n'):
seqlist.remove('\n')
else:
print("Carrying on...")
print(titlelist)
print(seqlist)
for name in titlelist:
names=name.strip(">").strip()
for seq in seqlist:
seqs=seq.strip(">").strip()
file = open("/Users/me/Desktop/PythonScrip/gene{}.fasta".format(names),'w')
file.write(StringToWrite)
StringToWrite = "{}\n{}".format(names, seqs)
print(StringToWrite)
print(names)
print(seqs)
My input file is very basic:
>duck
quackquackquack
>dinosaur
roarroarroarroar
>dog
woofwoofwoof
>fish
glubglubglub
>frog
ribbetribbetribbet
... I am currently having a bit of an issue at this point. The script works great, multiple files are created with different names but they all contain the same sequence, eg all of the animals go ribbetribbetribbet. although funny its also annoying. I can't quite get my head around what I've done though I think it has something to do with the 1st for being a priority and essentially ignoring the second.
I know it's ugly code but I am just getting started :) Thanks for the help!
That's a shitty requirement.
Counterpoint: Biopython is slow, and learning to parse files with Python is an invaluable skill. It's a decent exercise for a beginning bioinformatician or someone learning Python.
Writing your own parser for everything is interesting but often unnecessary. Depends on what you want to get out of it I guess. Also fastq has no clear definition, lots of corner cases.
This will fail with fasta files that have linebreaks in seqs, no?
Yeah looks that way to me. The
seqlist
collection only works when each sequence is a single line, and there is a 1 to 1 correspondence between thetitlelist
and theseqlist
, else it will start concatenating sequences from mutliple entries together.@damonlbp, You need to consider a more complicated test case, where each sequence is on multiple lines (which is by far the most common fasta format). In this case, you need to 'chunk' the file between the
>
characters.Here's one option to 'chunk' the file (though it does require having the whole thing in memory which isn't ideal):
Which for this file:
Results in:
it will be much easier to 1) read all content as a single string; 2) split by '>'; 3) split by '\n'. And most of the similar work is actually written quite nicely in the book: http://shop.oreilly.com/product/9780596154516.do
No, you can’t split by ‘\n’, because fasta sequences are line wrapped. You cannot rely on line breaks to delineate sequence from headers.
Plus, 'easier' is subjective. I would challenge you to write something that iterates a whole file and successfully 'chunks' the genes in fewer lines than the regex loop above.
this will print the same result as your code when apply on your sample file without using re.