BioPython SeqRecord attributes
2
1
Entering edit mode
8.6 years ago
susan.klein ▴ 70

Hi,

I keep coming up against these problem when writing scripts using BioPython SeqIO..

The first, is I can't find in the documentation of BioPython (or the cookbook) how to change a SeqcRecord attribute.. e.g. add a number to a record.id (or record.name) to give it a unique identifier (a common problem in preparing data)..

For example:

c=0

for x in SeqIO.parse("file.fasta","fasta"):
c=c+1
x.name = x.name+str(c)
SeqIO.write(x,"out.fasta","fasta")


...does nothing, the name for each record stays the same as the input.

If I try to build the output manually, simply by writing a text file, I always come across the problem of having no way to find out what all the SeqcRecord attributes are.. fore example, in a fastq record, what is the attribute for the phred33 data?? record.letter_annotations[phred_quality] will give phred values, but not the phred33 coded values.. Does anyone know how to get a full list of attributes (and their 'keys') for a SeqRecord??

Thanks for any help..

Theo

BioPython • 6.5k views
0
Entering edit mode

Just, to clarify.. I can use dir(x.letter_annotations) but that does nto show me where the e.g. phred33 data is stored. Thanks.

0
Entering edit mode

Great, that solved my problem. Thanks.

0
Entering edit mode

BioStars is a Q&A site where the idea is that you mark good answers by accepting them (ticking them), this feeds into the rating system.

5
Entering edit mode
8.6 years ago

I have a feeling there's a minor confusion with ID and name in your example - you can use the print function to see all attributes of a SeqRecord, here for a small test record:

for s in SeqIO.parse('./test.fasta', 'fasta'):
print(s)


prints

ID: Unchanged
Name: Unchanged
Description: Unchanged
Number of features: 0
Seq('ATGCTAGCTAGCTAGCTA', SingleLetterAlphabet())


Now if I change that:

for s in SeqIO.parse('./test.fasta', 'fasta'):
s.id = 'CHANGED'
print s


it prints

ID: CHANGED
Name: Unchanged
Description: Unchanged
Number of features: 0
Seq('ATGCTAGCTAGCTAGCTA', SingleLetterAlphabet())


So as you can see, name and description stay the same, which is probably what happens in your example. If I write this SeqRecord via SeqIO.write(s, 'out.fasta', 'fasta') I get

>CHANGED Unchanged
ATGCTAGCTAGCTAGCTA


This should also answer your second question, using the normal print() function you can see all attributes. You can also use Python's in-built dir() method:

print dir(s)


prints

['__add__', '__class__', '__contains__', '__delattr__', '__dict__', '__doc__', '__format__', '__getattribute__', '__getitem__', '__hash__', '__init__', '__iter__', '__len__', '__module__', '__new__', '__nonzero__', '__radd__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_per_letter_annotations', '_seq', '_set_per_letter_annotations', '_set_seq', 'annotations', 'dbxrefs', 'description', 'features', 'format', 'id', 'letter_annotations', 'lower', 'name', 'reverse_complement', 'seq', 'upper']


Lots of Python standard functions, and a few BioPython specific methods.

The numbers in letter_annotations are ASCII numbers to accommodate for the various offsets, to get your phred+33 numbers back here's one example where the SeqRecord's quality is all '#':

>>> a.letter_annotations['phred_quality']
[2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]


Get the first number, and add 33 for the offset:

>>> print chr(a.letter_annotations['phred_quality'][0] + 33)
'#'


So you can for example do this:

>>> print [chr(x + 33) for x in a.letter_annotations['phred_quality']]
['#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#']


or even

>>> print ''.join([chr(x + 33) for x in a.letter_annotations['phred_quality']])
'################################################################################'


Edit:

In case you're wondering, quality score is stored as ASCII numbers so that the user can directly specify the offset without much hassle (as above with the +33 offset). This is also possible:

print a.format('fastq-illumina') - prints the # quality as B
print a.format('fastq-sanger') - prints the # quality as #
print a.format('fastq-solexa') - prints the # quality as >

1
Entering edit mode
8.6 years ago
Peter 6.0k

As per Philipp's answer - you did change the .name attribute, but you needed to change the .id attribute to have an effect on the FASTA output.

For finding out what is in .annotations and .letter_annotations, these are both Python dictionaries - a fundamental datatype in Python. Use their .keys() method here (or a similar approach).

Regarding the FASTQ string, using Bio.SeqIO.parse(...) like this parsed the raw FASTQ string into a list of integers. If you don't want it parsed into a list of integers you can use the low-level FASTQ parser which gives tuples of strings, try the example here: http://news.open-bio.org/news/2009/09/biopython-fast-fastq/