Unextractible EMBL Features For SeqIO.parse
8.6 years ago
blaise.li ▴ 10

I would like to extract information from some of the "SO_feature" features of the following file: https://raw.github.com/cbergman/transposons/master/current/transposon_sequence_set.embl.txt

Is it normal that when I parse the file using Biopython, no features get associated with the records? I obtain the records as follows:

from Bio import SeqIO
record = next(SeqIO.parse("transposon_sequence_set.embl.txt", "embl"))


More generally, what makes a feature extractible or not by BioPython?


Have you tried splitting the file into individual records and parsing each one separately?


The features list is still empty when I apply SeqIO.read() to a file containing only the first 188 lines of the original file (which should be the first record). Other attributes, such as annotations and dbxrefs, seem normal. It's my first time trying to read an EMBL-formatted file, so I thought I had just made some basic Biopython usage error. But maybe the records are not well formatted, or maybe there are limitations in Biopython's EMBL parser.

8.5 years ago
Peter 6.0k

Where did this file come from? It does not look like a real EMBL file - for a start it is missing the feature table header (which is indirectly why the parser seems to have ignored your features):

FH   Key             Location/Qualifiers
FH
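
For anyone wanting to try the same experiment without editing the file by hand, here is a rough sketch of inserting that header programmatically. The function name add_feature_header is made up for illustration, and it assumes each record starts with an "ID" line and that lines are passed in without trailing newlines. It only gets the parser to look at the feature table; it does nothing about malformed feature lines.

```python
# The two header lines quoted above, exactly as they appear in a real EMBL file.
FH_HEADER = ["FH   Key             Location/Qualifiers", "FH"]


def add_feature_header(lines):
    """Yield EMBL lines, adding the FH header before the first FT line
    of every record that does not already have one.

    Hypothetical helper: assumes records begin with an 'ID ' line and
    that `lines` carry no trailing newline characters.
    """
    seen_fh = False  # has this record already shown an FH line?
    seen_ft = False  # has this record already shown an FT line?
    for line in lines:
        if line.startswith("ID "):
            # A new record begins, so reset the per-record state.
            seen_fh = False
            seen_ft = False
        if line.startswith("FH"):
            seen_fh = True
        elif line.startswith("FT") and not seen_ft:
            seen_ft = True
            if not seen_fh:
                # Header is missing: emit it just before the first feature.
                yield from FH_HEADER
        yield line
```

The result can be joined with "\n" and wrapped in io.StringIO, since SeqIO.parse accepts a handle as well as a filename.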


If I add that manually, then Biopython complains:

BiopythonParserWarning: Overindented SO_feature feature?
BiopythonParserWarning: Couldn't parse feature location: 'five_prime_LTR;SO:0000425:1..600'
BiopythonParserWarning: Couldn't parse feature location: 'three_prime_LTR;SO:0000426:6841..7411'
BiopythonParserWarning: Couldn't parse feature location: 'CDS;SO:0000316:<988..2031'
BiopythonParserWarning: Couldn't parse feature location: 'CDS;SO:0000316:<1950..5402'
BiopythonParserWarning: Couldn't parse feature location: 'CDS;SO:0000316:5248..6780'


All the feature locations are very wrong - the SO_feature bit seems to have been inserted and the real feature type (e.g. CDS) pushed to the right.
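
To illustrate what went wrong, the strings Biopython rejected above can be pulled apart into the real feature type, the Sequence Ontology ID and the actual location. This is only a sketch based on the five warning messages shown; split_so_location is a hypothetical helper, not part of Biopython, and the 'TYPE;SO:ID:LOCATION' layout is inferred from those warnings rather than from any documented format.

```python
def split_so_location(raw):
    """Split a string like 'CDS;SO:0000316:<988..2031' into
    (feature_type, so_id, location).

    Hypothetical helper: the layout is assumed from the parser
    warnings, so anything that does not match raises ValueError.
    """
    # Everything before the first ';' is the real feature type.
    feature_type, semicolon, rest = raw.partition(";")
    # The location follows the last ':'; what precedes it is the SO ID.
    so_id, colon, location = rest.rpartition(":")
    if not semicolon or not colon or not so_id.startswith("SO:"):
        raise ValueError(f"Unexpected SO_feature string: {raw!r}")
    return feature_type, so_id, location


# The strings reported in the warnings above.
for raw in [
    "five_prime_LTR;SO:0000425:1..600",
    "CDS;SO:0000316:<988..2031",
]:
    print(split_so_location(raw))
```

The location part (for example '<988..2031') is what a standard EMBL feature line would carry in that column, with the feature type in the key column instead of SO_feature.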