How can I make the bar graph of genes from BLAST result?
1
0
Entering edit mode
8.7 years ago
tcf.hcdg ▴ 70

Hello

I have the blast result in table format. Below are the first three columns.

First column is the query id (in this example we have 2 queries 60317531,60317532)

Second column is the hits against the query sequence and have 3 parts

a) swiss prot id (sp|Q10CQ1|)
b) gene name (MAD14)
c) organism (ORYSJ)

I would like to make the bar chart of genes which are present and how many times they appear against each query. For example for the first query (60317531)

MAD14    2 times
MAD15    1 time
AGL8    2 time
AP1    3 time
# Fields: query id            subject id                  % identity
gi|60317531|gb|AAX18712.1|    sp|Q10CQ1|MAD14_ORYSJ       84.21
gi|60317531|gb|AAX18712.1|    sp|P0C5B1|MAD14_ORYSI       83.40
gi|60317531|gb|AAX18712.1|    sp|Q6Q9I2|MAD15_ORYSJ       68.91
gi|60317531|gb|AAX18712.1|    sp|Q42429|AGL8_SOLTU        57.20
gi|60317531|gb|AAX18712.1|    sp|O22328|AGL8_SOLCO        58.00
gi|60317531|gb|AAX18712.1|    sp|Q41276|AP1_SINAL         65.79
gi|60317531|gb|AAX18712.1|    sp|D7KWY6|AP1_ARALL         65.79
gi|60317531|gb|AAX18712.1|    sp|Q8GTF4|AP1C_BRAOB        64.21
gi|60317532|gb|AAX18713.1|    sp|B4YPV4|AP1C_BRAOA        64.21
gi|60317532|gb|AAX18713.1|    sp|Q96355|1AP1_BRAOT        64.21
gi|60317532|gb|AAX18713.1|    sp|P0DI14|AP1_BRARP

Genes should be on the x axis and frequency of these genes on the y axis while the query id as the title of the graph.

Is there any automatic way, I can do it because I have ~40,000 queries and around ~100 hits against each query in a single file?

Thanks in advance

blast • 2.8k views
ADD COMMENT
1
Entering edit mode

in R you could use strsplit on the subject_id column to extract the gene name and do a table() on it to have the number of occurences for each gene. Then a simple barplot() should work.

ADD REPLY
0
Entering edit mode
8.7 years ago
5heikki 11k
cut -f2,7 -d "|" blastOutputFile | \
  tr "|" "\t" | \
  cut -f1 -d "_" | \
  sort | \
  uniq -c | \
  sed 's/^   //' | \
  tr " " "\t" | \
  awk -F '\t' '{print $2"\t"$3"\t"$1}' > counts.txt`

Then plot it with whatever program you like. ggplot2 makes pretty plots.

ADD COMMENT
0
Entering edit mode

I have ~ 44000 sequences and need plot for each sequence. As for as I could understand from the code it can split the subject id giving genes in a separate column and then sort the genes then count the genes.

How can it differentiate among 44000 queries ( e.g. 60317531, 60317532)

ADD REPLY
0
Entering edit mode
You plan to generate 44k plots? Please be more specific.
ADD REPLY
0
Entering edit mode

Yes I am planning to generate 44k plots.

Actually I make the blast of 44k sequences and ~100 result of each of them.

Now I want the plots of each query against the genes.

ADD REPLY
1
Entering edit mode

Well, it's easy enough to split the output of my command based on column 1 to 44k files that you can loop over with some rscript or something. How many years do you plan to use studying the 44k plots?

ADD REPLY

Login before adding your answer.

Traffic: 2759 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6