Sort out fasta file by its description
2
0
Entering edit mode
2.1 years ago
vellryba • 0

Hi,

I am trying to extract all class:2 seqeuences from a fasta file but I am getting this error:

Traceback (most recent call last): File "class-sorter.py", line 23, in <module> main(sys.argv) File "class-sorter.py", line 16, in main if "class:2" in record.description(): TypeError: 'str' object is not callable

My code:

import sys
import argparse
import operator i
import re
improt itertools from Bio import SeqIO

def main (argv):
parser = argparse.ArgumentParser(description='find a location')
args = parser.parse_args()

with open(args.outfile,"w") as of:
for record in SeqIO.parse(args.infile,"fasta"):
if "class:2" in record.description():
of.write(">"+record.id)
of.write(record.seq)

if __name__ == "__main__":
main(sys.argv)


There is an example of the fasta file:

>NEIS2020_129 class:2
ATGAAAAAATCTCTGATTGCTCTGACTTTGGCAGCTTTGCCTGTTGCAGCTATGGCTGAT
GTGACTCTGTACGGTCAAGTTAAAGCCGGTGTTGAAATTTCTCGCATCAAAGCAGGCTCT
GGTGTGACTGATAATGGTCGTTCTTCTACTACTAAAACTGCAACTGAAATCGCTGACCTC
GGTTCTCGTATCGGTTTCAAAGGTCATGAACACCTGAGCAACAACCTGAACGCTATTTGG
CAAGTAGAACAAAATACTTCTGTTGCTGGTACTGACAGCGGCTGGGGTACTCGTGAATCT
TTCATCGGTTTGGAAGGTGGCTTCGGTAAAATTCGTGCTGGTAAACTCAACACTACTCTG
AAAGACAGCACCGACAGCATCGATCCATGGGAATCCAGCGATGCTAATGAACATGTATTG
TCATTGGGTACTTTGGAACGTGTAGATGAGCGTAAAGTGTCTGTTCGCTACGACTCCCCT
GTGTTCTCAGGCTTTAGCGCAAGCGTTCAATACCAACCTCGCGATAACGCCAACTCTAGC
GACAAATATACTCATGCTGCGAAAAGCCGTGAAGCATACTACGCTGGTTTGAACTATGAA
AATTCCGGTTTCTTTGGTCGCTATGCTGGTAAATTTGCAAAACATGATGTAATTACAGCT
AACGAATATGATGTTGCTGTTGATAAAGTTGCTGATGCTGCTTCTACTTTGAAAGTTGGC
GATACTTTGGCTACTGTTAAAGATCATCAAGTTCATCGTTTAGTAGCTGGTTACGACGCG
AACAATGTTTTATTTGCTGTTGCTGGTCAATATGATGCATCAAAAAATGGTGATGTAAGT
GATGCTAACTACGGTAAGAAAAACGAGCAAACCCAAGTTGCTGTAACTGGTGGCTACCGT
ATGGGCAACGTAATGCCTCGTGTTTCTTACGCTCACGGCTTCAAAGCTAAAGAAGATGGC
GAGAAACAAGCTAACAGTCAATACAACCAAGTTATCGTTGGTGCTGACTACGACTTCTCT
AAACGTACTTCTGCTTTGATTTCTGCTGGTTGGTTGAAACAAGGTAAAGGCGTTAACAAA
GTTGAATCTACTGCTGGTTTGGTTGGTCTGCGCCACAAATTCTAA
>NEIS2020_130 class:3
ATGAAAAAATCCCTGATTGCCCTGACTTTGGCAGCCCTTCCTGTTGCAGCAATGGCCGAT
GTTACCCTGTACGGCACCATCAAAGCCGGCGTAGAAACCTACCGTACTGTAAAACACACA
GACGGCAAAGTAACTGAAGTGAAAACCGGCAGCGAAATCGCCGACTTCGGTTCAAAAATC
GGCTTCAAAGGTCAAGAAGATCTCGGCAACGGCCTGAAAGCCATTTGGCAGTTGGAACAA
AGCGCCTCCATCGCCGGCGCTGACAGCGGCTGGGGCAACAAACAATCTTTCATCGGCTTG
AAAGGCGGCTTCGGTACCGTCCGCGCCGGTAACCTGAACAGCATCCTGAAAAGCACCGGC
GACAACGTCAACGCTTGGGAATCCGGCAAGGCTACCGAAGACGTGCTGCAAGTCAGCAAA
ATTTCCGCTCCGGAACACCGCTACGCATCCGTACGCTACGACTCTCCCGAATTTGCCGGC
TTCAGCGGCAGCGTACAATACGCGCCTAAAGACAATTCAGGTGCAAACGGCGAATCTTAC
CACGTTGGTTTGAACTACCAAAACAGCGGTTTCTTCGCACAATACGCCGGCTTGTTCCAA
AGACACGGCGAAGGCACTAAAGCCACAGTCGGCGAGCCTGTTGAAAAACTGCAAGTCCAC
CGTTTGGTCGGCGGTTACGACAATGATGCCCTGTACGCTTCCGTAGCCGTACAACAACAA
GATGCGAAACTGACTGATGCTTCCAATTCGCACAACTCTCAAACCGAAGTTGCCGCTACC
GTGGCATACCGTTTCGGCAACGTAACGCCCCGCGTTTCTTACGCCCACGGCTTCAAAGGC
ACTGTTGCTAAAGCAGACGGCGACAACCGTTACGACCAAGTGGTTGTCGGTGCGGAATAC
GACTTCTCCAAACGCACTTCTGCCTTGGTTTCTGCCGGCTGGTTGCAAGAAGGCAAAGGT
GCAGGCAAAACCGTATCGACTGCCAGCACCGTCGGTCTGCGCCACAAATTCTAA


Can anyone see what am I doing wrong? Cannot figure it out after googling. Thank you

python fasta bipython • 498 views
0
Entering edit mode

If fa file is flattened, this should work:

sed -n '/\sclass:2$/ {p;n;p}' test.fa  If fa file is not flattened, try with seqkit: seqkit grep -nrip '(\sclass:2$)' test.fa | seqkit seq -w0

1
Entering edit mode
2.1 years ago
awk '/^>/ {X=index(\$0,"class:2")>0;} {if(X) print;}' in.fa

1
Entering edit mode
2.1 years ago
Joe 20k

            if "class:2" in record.description():


record.description is not a function, it is an object (a string), so you're getting the error that you cannot 'call' a string (which is correct).

You don't need the brackets.

There are a few other issues:

import sys
import argparse
import operator
import re
import itertools
from Bio import SeqIO

def main (argv):
parser = argparse.ArgumentParser(description='find a location')
args = parser.parse_args()

with open(args.outfile,"w") as of:
for record in SeqIO.parse(args.infile,"fasta"):
if "class:2" in record.description():
of.write(">"+record.id)
of.write(record.seq)

if __name__ == "__main__":
main(sys.argv)


You don't need to use sys.argv if you're using argparse - use one or the other, unless you have a very specific need to use both. Consequently, you don't need to pass sys.argv through main(), so the code would look better in general as:

import argparse
from Bio import SeqIO

def parse_args():
parser = argparse.ArgumentParser(description='find a location')

return parser.parse_args()

def main():
args = parse_args()

with open(args.outfile,"w") as of:
for record in SeqIO.parse(args.infile,"fasta"):
if "class:2" in record.description:
of.write(">"+record.id)
of.write(record.seq)

if __name__ == "__main__":
main()

0
Entering edit mode

I still get error:

Traceback (most recent call last): File "class-sorter.py", line 22, in <module> main() File "class-sorter.py", line 19, in main of.write(record.seq) TypeError: expected a string or other character buffer object

0
Entering edit mode

Ok, well what do you suppose the error TypeError: expected a string or other character buffer object means?

How might you make something in to a string in python? Learning to interpret python errors is an important skill if you're going to write more complicated scripts.

0
Entering edit mode

Ha!

fixed it: of.write("\n"+str(record.seq+"\n"))

So the record.seq is a dictionary? Its not a string after the SeqIO parse? I suppose what I dont get is why the write function doesnt have any problems with the record.id

0
Entering edit mode

record.seq is a Seq object in biopython, not just a pure string. Consequently, write() doesn't know what to do with it, and can't write it to a file.

Basically, a SeqRecord is created which corresponds to each entry of your input file. SeqRecords are a class which holds the sequence, IDs and all the other useful metadata. It's common to trip up on this as a SeqRecord object contains a Seq object which holds the sequence itself. Inside the Seq object, the sequence is actually a string, and the __str__ method for these objects is defined such that calling str() on a Seq object will return the sequence as a string.

Essentially:

 - SeqRecord object
∟ Seq object
∟ Alphabet object
∟ Sequence (as a string)


You could also retrieve the data by using the private variable seq._data, but the _ denotes a private method, and you aren't supposed to use it that way (hence they define str for you as its more 'pythonic')

write has no issue with the id or the description as these are pure strings from the moment they're parsed in to the SeqRecord. The use of a Seq object is to allow a sequence to also have the alphabet that it is encoded by etc attached to it. All of the source code can be found here: https://biopython.org/DIST/docs/api/Bio.Seq.Seq-class.html

0
Entering edit mode