I downloaded UniProt files for a group of proteins (n>1000, so manually checking these proteins is not an option). The complete data files come as either a flat text file or an XML file. There is a lot of information in these files (for an example, see here: http://www.uniprot.org/uniprot/?query=organism%3A%22homo+sapiens%22 then go to download, where you can look at the complete data for the first 10 entries as either a txt or an xml file).
Since there is a lot of information in there that I do not need, I have to find a way to select the information I'm interested in (preferably as a data matrix). For every entry this is:
Wanted information | Text file entry | XML file entry
UniProt ID | ID | <entry><name>
Gene name | GN | <gene><name type="primary">
Full protein name | RecName: | <protein><recommendedName><fullName>
Transmembrane domains (may be more than one) | TRANSMEM | <feature type="transmembrane region"><location>, which consists of <begin position="xxx"/> and <end position="yyy"/>
Full protein sequence | SQ | <sequence>
Some entries will not contain all of this information (transmembrane domains, for example), in which case NA could be filled in. Some entries will contain the same kind of information more than once (again, transmembrane domains), and for these all occurrences should be listed (if possible in the same cell, separated by "," or ";" or "|").
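To make the wanted output concrete, here is a minimal sketch in Python (standard library only) of the extraction described above. It assumes the XML element names are exactly those listed in the table; the function names (`parse_entries`, `rows_to_csv`) and the tiny `SAMPLE_XML` are made up for illustration and are not real UniProt data. The sketch strips XML namespaces first so the paths work whatever namespace the download declares.

```python
import csv
import io
import xml.etree.ElementTree as ET

FIELDS = ["uniprot_id", "gene_name", "protein_name", "transmembrane", "sequence"]

# Invented two-entry sample that only mimics the element names in the table.
SAMPLE_XML = """
<uniprot xmlns="http://uniprot.org/uniprot">
  <entry>
    <name>EXAMPLE1_HUMAN</name>
    <protein><recommendedName><fullName>Example protein 1</fullName></recommendedName></protein>
    <gene><name type="primary">GENE1</name></gene>
    <feature type="transmembrane region">
      <location><begin position="5"/><end position="25"/></location>
    </feature>
    <feature type="transmembrane region">
      <location><begin position="40"/><end position="60"/></location>
    </feature>
    <sequence>MKTA
    YIAK</sequence>
  </entry>
  <entry>
    <name>EXAMPLE2_HUMAN</name>
    <protein><recommendedName><fullName>Example protein 2</fullName></recommendedName></protein>
    <sequence>MSSQ</sequence>
  </entry>
</uniprot>
""".strip()


def _strip_namespaces(root):
    """Drop the '{namespace}' prefix from every tag so plain paths match."""
    for el in root.iter():
        if isinstance(el.tag, str) and "}" in el.tag:
            el.tag = el.tag.split("}", 1)[1]


def _text(element, default="NA"):
    """Return the stripped text of an element, or NA if it is missing or empty."""
    if element is None or not (element.text or "").strip():
        return default
    return element.text.strip()


def parse_entries(xml_text):
    """Return one dict per <entry>, with NA for anything that is absent."""
    root = ET.fromstring(xml_text)
    _strip_namespaces(root)
    rows = []
    for entry in root.iter("entry"):
        spans = []
        for feat in entry.findall("feature[@type='transmembrane region']"):
            begin = feat.find("location/begin")
            end = feat.find("location/end")
            if begin is not None and end is not None:
                spans.append(f"{begin.get('position', '?')}-{end.get('position', '?')}")
        seq = _text(entry.find("sequence"))
        rows.append({
            "uniprot_id": _text(entry.find("name")),
            "gene_name": _text(entry.find("gene/name[@type='primary']")),
            "protein_name": _text(entry.find("protein/recommendedName/fullName")),
            # All transmembrane regions go into one cell, separated by ';'.
            "transmembrane": ";".join(spans) if spans else "NA",
            # Remove line breaks or spaces inside the sequence text.
            "sequence": seq if seq == "NA" else "".join(seq.split()),
        })
    return rows


def rows_to_csv(rows, handle):
    """Write the rows as a comma-separated table with a header line."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


out = io.StringIO()
rows_to_csv(parse_entries(SAMPLE_XML), out)
print(out.getvalue())
```

For a real download, the text could be read from the file first (for example `open("proteins.xml", encoding="utf-8").read()`, where the file name is a placeholder) and the table written to a `.csv` file that R or a spreadsheet can open. With many thousands of entries a streaming parser such as `ET.iterparse` would probably be kinder on memory, but that is not shown here.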
I am a bit familiar with R, but I wasn't able to get to this point with it (this might be down to a lack of programming skills). I looked into XML editors (since this seemed to be the easiest solution), but I wasn't able to get any to work; I simply couldn't find anything that helped me on my way and explained the different steps. I also know that there should be a way to process XML files in R, but the help files didn't get me where I need to be either. In XMLQuire, the only editor I could download so far, I'm able to see the file, but it keeps crashing when I try to do anything (even when I'm just trying to figure out where I can edit the file), so my file might be too long or there might be another problem.
Help with this matter would be highly appreciated. I'm hoping to find someone who has done a similar thing, but all solutions are welcome, no matter how small and no matter which program I need to use, as long as it's freeware.
Also, let me know if anything is unclear; I am really trying to be as clear as possible. And sorry for being such a blondie on the subject.
There is a problem with the XSLT formatting on #biostar.
There is another issue: when I post a comment or any new post, it is duplicated and I need to remove one of them.
Thank you for the response. But where do I put this code in?