A somewhat embarrassing question, given that I've been using Bioperl for 12+ years, but here we go.
I've written a small script to extract features of type MOD_RES from files in SwissProt format. Very simple and based entirely on examples in the Feature Annotation how-to. It's running on a server which runs Ubuntu 10.04.4 LTS, Bioperl version 1.6.1.
#!/usr/bin/perl -w
use strict;
use Bio::SeqIO;
my $file = shift || die("Usage: modres.pl <uniprot file>\n");
my $seqio = Bio::SeqIO->new(-file => $file, -format => "swiss");
while (my $seq = $seqio->next_seq) {
my $id = $seq->display_id;
my $ac = $seq->accession;
for my $feat ($seq->get_SeqFeatures) {
if($feat->primary_tag eq "MOD_RES") {
for my $tag ($feat->get_all_tags) {
for my $value ($feat->get_tag_values($tag)) {
print "$id\t$ac\t$value\n";
}
}
}
}
}
It works fine using an example file, Q8IYH5:
./modres.pl Q8IYH5.txt
ZZZ3_HUMAN Q8IYH5 Phosphoserine
ZZZ3_HUMAN Q8IYH5 Phosphoserine
ZZZ3_HUMAN Q8IYH5 Phosphoserine
ZZZ3_HUMAN Q8IYH5 Phosphoserine
ZZZ3_HUMAN Q8IYH5 N6-acetyllysine
It works fine using a file containing the first 10 records from the human reference proteome set, saved as test10.dat:
./modres.pl test10.dat
1433B_HUMAN P31946 N-acetylmethionine; in 14-3-3 protein beta/alpha; alternate
1433B_HUMAN P31946 N-acetylthreonine; in 14-3-3 protein beta/alpha, N-terminally processed
1433B_HUMAN P31946 Phosphoserine (By similarity)
1433B_HUMAN P31946 N6-acetyllysine
1433B_HUMAN P31946 N6-acetyllysine
1433B_HUMAN P31946 Phosphoserine (By similarity)
1433E_HUMAN P62258 N-acetylmethionine
1433E_HUMAN P62258 N6-acetyllysine
1433E_HUMAN P62258 N6-acetyllysine
1433E_HUMAN P62258 N6-acetyllysine
1433E_HUMAN P62258 N6-acetyllysine
1433E_HUMAN P62258 Phosphoserine
1433F_HUMAN Q04917 N-acetylglycine
1433F_HUMAN Q04917 Phosphoserine
1433F_HUMAN Q04917 Phosphoserine (By similarity)
1433F_HUMAN Q04917 Phosphotyrosine (By similarity)
1433G_HUMAN P61981 N-acetylmethionine; in 14-3-3 protein gamma; alternate; partial
1433G_HUMAN P61981 N-acetylvaline; in 14-3-3 protein gamma, N-terminally processed; partial
1433G_HUMAN P61981 N-acetylvaline; partial
1433G_HUMAN P61981 Phosphotyrosine (By similarity)
1433G_HUMAN P61981 Phosphothreonine
1433S_HUMAN P31947 Phosphoserine
1433S_HUMAN P31947 Phosphoserine (By similarity)
1433S_HUMAN P31947 Phosphoserine (By similarity)
1433S_HUMAN P31947 Phosphoserine
1433T_HUMAN P27348 N-acetylmethionine
1433T_HUMAN P27348 N6-acetyllysine
1433T_HUMAN P27348 N6-acetyllysine
1433T_HUMAN P27348 N6-acetyllysine
1433T_HUMAN P27348 N6-acetyllysine
1433T_HUMAN P27348 Phosphoserine
1433Z_HUMAN P63104 N-acetylmethionine
1433Z_HUMAN P63104 N6-acetyllysine
1433Z_HUMAN P63104 Phosphoserine; by PKA and PKB/AKT1
1433Z_HUMAN P63104 N6-acetyllysine
1433Z_HUMAN P63104 Phosphoserine; by MAPK8
1433Z_HUMAN P63104 Phosphoserine
1433Z_HUMAN P63104 Phosphothreonine; by CK1
1A01_HUMAN P30443 Sulfotyrosine (Potential)
1A02_HUMAN P01892 Phosphoserine
However, if I download the complete human reference proteome set, which contains 70 555 entries and is 465 568 340 bytes, the script hangs near the end.
If I redirect output to file modres.txt and look at the end, I can see where the script stopped:
tail -5 modres.txt
ZZEF1_HUMAN O43149 Phosphoserine (By similarity)
ZZEF1_HUMAN O43149 N6-acetyllysine
ZZZ3_HUMAN Q8IYH5 Phosphoserine
ZZZ3_HUMAN Q8IYH5 Phosphoserine
ZZZ
I don't see anything strange about the MOD_RES lines for entry Q8IYH5:
grep "MOD_RES" Q8IYH5.txt
FT MOD_RES 113 113 Phosphoserine.
FT MOD_RES 130 130 Phosphoserine.
FT MOD_RES 131 131 Phosphoserine.
FT MOD_RES 135 135 Phosphoserine.
FT MOD_RES 701 701 N6-acetyllysine.
Nor is there anything strange about the entry Q8IYH5 in the larger file; it ends as expected with // and the next entry starts.
Google search has not helped up to now. Anyone seen anything similar recently? Any ideas?
FYI: I just ran this script on a different machine (Linux Mint 14, Bioperl 1.6.901-3) and it hangs at a completely different point in the file.