Question: How would you create a table for multiple organisms vs presence of multiple genes using command line blast? [image in description]
Tom (United States) wrote 3.8 years ago:

Here's the situation.

I have proteome files for a bunch of strains. Each strain has its own fasta proteome (strain1.faa, strain2.faa, strain3.faa).

I also have a FASTA file of amino acid sequences, and I want to know whether they are present in these strains. That "query" file looks like this:

>gene 1
MKGMF...*
>gene 2
MQWAEA...*
etc...

What I want in the end is a matrix with the strains in the first column and the genes in the first row. I don't want to run a manual BLAST for every cell because that's impractical. I just want the information. Each value in the matrix is the % identity of that gene in that strain. It will look like this: [image]

What is the most parsimonious way to go about this project? I have a lot of strains and hundreds of genes to test, but I'm okay with outputting a CSV for now. It's such a large task that I'm unsure how to start.

5heikki (Finland) wrote 3.8 years ago:

Once you have the CSV, load it into R (maybe RStudio) and plot it with ggplot2, like here. One very fast way to get a distance matrix is to use the new Mash algorithm. I think it should work with proteomes too.

p.s. I don't really understand your picture. How is strain X Y percent some gene?


It's not the heatmap I want. It's just the raw information. I don't want to have to individually do a blast search manually for each cell.

I can't find another picture on Google Images that depicts this type of project.

(reply written 3.8 years ago by Tom)
Michael Dondrup (Bergen, Norway) wrote 3.8 years ago:

You don't need more than one BLAST run to do this. Put all the reference sequences or genomes (the rows of your matrix) into one BLAST database. Put all the query sequences (the columns) into the query FASTA. Run the appropriate BLAST program (e.g. blastp against proteomes, or tblastn against nucleotide genomes), and you are done.
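One way the single-run approach above might be turned into the requested table is sketched below in Python. This is not taken from the thread, only a minimal sketch under stated assumptions: each proteome header gets the strain name prefixed with a `__` separator (a made-up convention, so strain names must not contain `__`), the search is run with tabular output (`-outfmt 6`), and the file names `all_strains.faa`, `genes.faa`, and `hits.tsv` are placeholders. Query headers such as `>gene 1` probably need the space replaced first, since BLAST reports sequence IDs only up to the first whitespace, which would make every query appear as `gene`.

```python
import csv
from collections import defaultdict

# Assumed workflow around this script (file names are placeholders):
#   1. write every proteome through tag_headers() into all_strains.faa
#   2. makeblastdb -in all_strains.faa -dbtype prot -out all_strains
#   3. blastp -query genes.faa -db all_strains -outfmt 6 -out hits.tsv
# With very many strains, the hit limit (-max_target_seqs) may need raising.

SEP = "__"  # hypothetical separator between strain name and protein ID


def tag_headers(fasta_lines, strain):
    """Yield FASTA lines with '<strain>__' prefixed to every sequence ID."""
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            yield f">{strain}{SEP}{line[1:].lstrip()}"
        elif line:  # drop empty lines
            yield line


def clean_query_headers(fasta_lines):
    """Yield FASTA lines with whitespace in headers replaced: '>gene 1' -> '>gene_1'."""
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            yield ">" + "_".join(line[1:].split())
        elif line:
            yield line


def best_identity(blast_rows):
    """Map (strain, gene) to the highest %identity among outfmt 6 rows.

    Columns used: 0 = query ID, 1 = subject ID, 2 = percent identity.
    """
    best = defaultdict(float)
    for row in blast_rows:
        gene, subject, pident = row[0], row[1], float(row[2])
        strain = subject.split(SEP, 1)[0]
        best[(strain, gene)] = max(best[(strain, gene)], pident)
    return best


def to_matrix(best, strains, genes):
    """Rows are strains, columns are genes; 0.0 where no hit was reported."""
    header = ["strain"] + list(genes)
    rows = [[s] + [best.get((s, g), 0.0) for g in genes] for s in strains]
    return [header] + rows


def write_matrix_csv(hits_path, strains, genes, out_path):
    """Read a tab-separated BLAST result file and write the matrix as CSV."""
    with open(hits_path, newline="") as handle:
        best = best_identity(csv.reader(handle, delimiter="\t"))
    with open(out_path, "w", newline="") as handle:
        csv.writer(handle).writerows(to_matrix(best, strains, genes))
```

A call such as `write_matrix_csv("hits.tsv", ["strain1", "strain2"], ["gene_1", "gene_2"], "matrix.csv")` would then give one row per strain. Note that the best % identity alone says nothing about how much of the gene was aligned, so filtering on alignment length or E-value before pivoting may be worthwhile.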
