Question: BioPython SeqRecord attributes
1
gravatar for susan.klein
4.8 years ago by
susan.klein50
Oceania
susan.klein50 wrote:

Hi,

I keep coming up against these problem when writing scripts using BioPython SeqIO..

 

The first, is I can't find in the documentation of BioPython (or the cookbook) how to change a SeqcRecord attribute.. e.g. add a number to a record.id (or record.name) to give it a unique identifier (a common problem in preparing data)..

 

For example:

c=0

for x in SeqIO.parse("file.fasta","fasta"):

    c=c+1

    x.name = x.name+str(c)

    SeqIO.write(x,"out.fasta","fasta")

...does nothing, the name for each record stays the same as the input.

If I try to build the output manuall, simply by writin g a text file, I always come accross the problem of having no way to find out what all the SeqcRecord attributes are.. fore example, in a fastq record, what is the attribute for the phred33 data?? record.letter_annotations[phred_quality] will give phred values, but not the phred33 coded values.. Does anyone know how to get a full list of attributes (and their 'keys') for a SeqRecord??

Thanks for any help..

 

Theo

 

biopython • 3.3k views
ADD COMMENTlink modified 4.8 years ago • written 4.8 years ago by susan.klein50

Just, to clarify.. I can use dir(x.letter_annotations) but that does nto show me where the e.g. phred33 data is stored. Thanks.

ADD REPLYlink written 4.8 years ago by susan.klein50
5
gravatar for Philipp Bayer
4.8 years ago by
Philipp Bayer6.0k
Australia/Perth/UWA
Philipp Bayer6.0k wrote:

I have a feeling there's a minor confusion with ID and name in your example - you can use the print function to see all attributes of a SeqRecord, here for a small test record:

for s in SeqIO.parse('./test.fasta', 'fasta'):
     print(s)

prints

ID: Unchanged
Name: Unchanged
Description: Unchanged
Number of features: 0
Seq('ATGCTAGCTAGCTAGCTA', SingleLetterAlphabet())

Now if I change that:

for s in SeqIO.parse('./test.fasta', 'fasta'):
    s.id = 'CHANGED'
    print s

it prints

ID: CHANGED
Name: Unchanged
Description: Unchanged
Number of features: 0
Seq('ATGCTAGCTAGCTAGCTA', SingleLetterAlphabet())

So as you can see, name and description stay the same, which is probably what happens in your example. If I write this SeqRecord via SeqIO.write(s, 'out.fasta', 'fasta') I get

>CHANGED Unchanged
ATGCTAGCTAGCTAGCTA

This should also answer your second question, using the normal print() function you can see all attributes. You can also use Python's in-built dir() method:

print dir(s)

prints

['__add__', '__class__', '__contains__', '__delattr__', '__dict__', '__doc__', '__format__', '__getattribute__', '__getitem__', '__hash__', '__init__', '__iter__', '__len__', '__module__', '__new__', '__nonzero__', '__radd__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_per_letter_annotations', '_seq', '_set_per_letter_annotations', '_set_seq', 'annotations', 'dbxrefs', 'description', 'features', 'format', 'id', 'letter_annotations', 'lower', 'name', 'reverse_complement', 'seq', 'upper']

Lots of Python standard functions, and a few BioPython specific methods.

The numbers in letter_annotations are ASCII numbers to accommodate for the various offsets, to get your phred+33 numbers back here's one example where the SeqRecord's quality is all '#':

>>> a.letter_annotations['phred_quality']
[2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]

Get the first number, and add 33 for the offset:

>>> print chr(a.letter_annotations['phred_quality'][0] + 33)
'#'

 

So you can for example do this:

>>> print [chr(x + 33) for x in a.letter_annotations['phred_quality']]
['#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#', '#']

or even

>>> print ''.join([chr(x + 33) for x in a.letter_annotations['phred_quality']])
'################################################################################'

 

Edit:
In case you're wondering, quality score is stored as ASCII numbers so that the user can directly specify the offset without much hassle (as above with the +33 offset). This is also possible:

print a.format('fastq-illumina') - prints the # quality as B

print a.format('fastq-sanger') - prints the # quality as #

print a.format('fastq-solexa') - prints the # quality as >

 

ADD COMMENTlink modified 4.8 years ago • written 4.8 years ago by Philipp Bayer6.0k
1
gravatar for Peter
4.8 years ago by
Peter5.8k
Scotland, UK
Peter5.8k wrote:

As per Philipp's answer - you did change the .name attribute, but you needed to change the .id attribute to have an effect on the FASTA output.

For finding out what is in .annotations and .letter_annotations, these are both Python dictionaries - a fundamental datatype in Python. Use their .keys() method here (or a similar approach).

Regarding the FASTQ string, using Bio.SeqIO.parse(...) like this parsed the raw FASTQ string into a list of integers. If you don't want it parsed into a list of integers you can use the low-level FASTQ parser which gives tuples of strings, try the example here: http://news.open-bio.org/news/2009/09/biopython-fast-fastq/

ADD COMMENTlink modified 4.8 years ago • written 4.8 years ago by Peter5.8k
0
gravatar for susan.klein
4.8 years ago by
susan.klein50
Oceania
susan.klein50 wrote:

Great, that solved my problem. Thanks.

 

 

ADD COMMENTlink written 4.8 years ago by susan.klein50

BioStars is a Q&A site where the idea is that you mark good answers by accepting them (ticking them), this feeds into the rating system.

You can also comment on answers individually rather than adding another "answer" of your own.

ADD REPLYlink written 4.8 years ago by Peter5.8k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1496 users visited in the last hour