Load MSA text from FluDB into Excel
7.8 years ago

Dear All

I'm not sure if this is the right place to ask this (is there somewhere better?)

I have generated an MSA in FluDB.org comprising about 9,800 protein sequences.

I have downloaded it to my Windows laptop in several formats, including FASTA, XML, pretty and raw/plain.

Now I want to load it into Excel to sort and plot the data against various geographical parameters. I want to preserve the alignment, with a label in the first column, and one letter in each column.

I don't want to install a language on my laptop and write a script unless I really have to; I would prefer to use an online converter.

This seems to be amazingly difficult : (

Any suggestions?

Thx to all, Patrick

alignment • 1.8k views

10K sequences is not a trivial dataset, so hopefully you have enough local hardware resources (e.g. RAM). Try MEGA and its alignment editor instead.

As has been discussed here many times, using Excel for bioinformatics analysis (except for casual review of results etc.) is not a good idea.

7.8 years ago

Thank you both for your helpful comments.

I can see that it is felt that students should learn how to do bioinformatics properly, write scripts etc.

I am not a student, more of a general biologist.

I have written both Perl and Python scripts in the past, but it would take me several days just to get going if I were to look at that again.

The problems are often with very simple things, such as loading files into memory to work on.

And in my opinion there is a great need for a really intuitive language (please don't say Python) for non-programmers.

Excel has worked very well - it can handle the large database - just.

Best wishes to all, Patrick


"The problems are often with very simple things such as loading files into memory to work on. (...) And in my opinion there is a great need for a really intuitive language (please don't say python) for non-programmers."

You should try the KNIME workbench (knime.org).


I recently had to work with data from FluDB. The site re-annotates influenza virus sequences that were originally submitted to GenBank. The problem is that the (meta)data you get out of flubase is very badly structured (in software-engineering terms, it is not properly normalized). So the issue is not the lack of an "intuitive language"; it is the lack of data structure! You will run into some kind of trouble with any tool you use.

7.8 years ago
DG 7.3k

So from the sounds of it, you are looking to do some sort of phylogeographic analysis, unless I'm totally mistaken? If that is the case, you should use a tool like GenGIS, which was built specifically for that sort of thing. If you're looking to do something else, I'm not sure what specific tool to recommend; it sounds like you want to plot some sort of geographic data against sequences, and Excel isn't the tool for that job.

In this case it isn't about needing to learn scripting (although there is a lot to be said for that) but about finding the right tool for the job. As was said, almost 10K sequences is quite a lot for Excel, plus whatever other data you are looking to include, and you would need some sort of parser in the first place to get the MSA you downloaded into a format that Excel can even read.

MEGA was also recommended, and it may do what you want. There are also a bunch of other programs aimed at non-bioinformaticians, such as Knime, Geneious, BCBio, etc. None of these solutions requires scripting on your part. You can also try to find a friendly bioinformatician or student to work with.
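
For what it's worth, if you do end up scripting it, the parser step mentioned above does not have to be large. Here is a minimal Python sketch, assuming the download is a plain aligned FASTA file (a ">" header line followed by one or more sequence lines per record). The function names read_fasta and fasta_to_csv and the two demo sequences are made up for illustration; nothing here is taken from FluDB itself. It writes a CSV with the label in the first column and one alignment character in each following column, which is the layout asked for in the question.

```python
import csv
import io


def read_fasta(handle):
    """Yield (label, sequence) pairs from an aligned FASTA file handle."""
    label, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header: emit the previous record, if there was one
            if label is not None:
                yield label, "".join(chunks)
            label, chunks = line[1:], []
        elif label is not None:
            # Sequence lines may be wrapped, so collect them all
            chunks.append(line)
    if label is not None:
        yield label, "".join(chunks)


def fasta_to_csv(fasta_handle, csv_handle):
    """Write one row per sequence: label first, then one character per column.

    Returns the number of rows written.
    """
    writer = csv.writer(csv_handle)
    n_rows = 0
    for label, seq in read_fasta(fasta_handle):
        writer.writerow([label, *seq])  # gaps ("-") are kept as their own cells
        n_rows += 1
    return n_rows


if __name__ == "__main__":
    # Tiny invented alignment, for illustration only
    demo = io.StringIO(">seq1|USA\nMK-T\nII\n>seq2|China\nMKAT\nIV\n")
    out = io.StringIO()
    fasta_to_csv(demo, out)
    print(out.getvalue())
```

For a real file you would pass open file objects instead of the StringIO demo, for example open("alignment.fasta") and open("alignment.csv", "w", newline="") (both file names are placeholders). Current Excel worksheets allow 1,048,576 rows and 16,384 columns, so about 9,800 sequences fit as rows provided the alignment is shorter than the column limit.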
