Question: Plot Sequence Similarity Data
gravatar for thorerges
4.1 years ago by
United States
thorerges60 wrote:


I am trying to do two things, I will try to make this as clear as possible. 

1. I have aligned and downloaded about 500 sequences in BLAST. However, in my FASTA file I just want to show the accession number, not the GI number. 

So convert this:


into this:


Is there a way to do this? I could write a script in Python, but I don't want to re-invent the wheel. 

2. From my alignment, I generated a sequence similarity matrix in a software called MacVector. This assigns a similarity score to all the sequences on the basis of how similar they are. I then plotted this in excel, in the form of a histogram.


It looks like this:

Each bar in the histogram is supposed to be a single sequence, and the x axis the accession number or identifier for that sequence. As you can probably tell, the x-axis is missing a lot of labels (it needs to be 500 labels).

I had this problem before, displaying relatively large data sets cleanly in R, usually I just edited the picture, but for 500 sequences, it is jus too much. I am sure someone has run into this before. Is there a way to clean this up?

Any advice or pointing me in the right direction would be thoroughly appreciated. 

plot python sequence R sequencing • 1.8k views
ADD COMMENTlink modified 4.1 years ago by 5heikki8.3k • written 4.1 years ago by thorerges60
gravatar for 5heikki
4.1 years ago by
5heikki8.3k wrote:
awk 'BEGIN {FS="|"} {if(/^>/) print ">"$4; else print $0}' input.fasta > output.fasta

Field separator is "|". If line begins with ">", print ">" and the 4th field, else print the entire line. As to your plotting problem, I second zlskidmore. Use ggplot. See

ADD COMMENTlink modified 4.1 years ago • written 4.1 years ago by 5heikki8.3k

Wow, thank you so much. One last question, where can I begin to learn these kind of text file manipulations? I was going to spend a lot of time writing something in Python, but this seems so usable and useful. Thank you again. 

ADD REPLYlink modified 4.1 years ago • written 4.1 years ago by thorerges60

I learned by simply googling "how to do X in unix/bash" and by reading man pages of awk, grep, sed, comm, join, cat, rev, tr, head, tail, cut, sort, etc. Every Bioinformatics 101 should really start with an introduction to GNU Coreutils and Bash scripting..

ADD REPLYlink modified 4.1 years ago • written 4.1 years ago by 5heikki8.3k
gravatar for zlskidmore
4.1 years ago by
United States, St. Louis, Washington University
zlskidmore280 wrote:

You should be able to convert your fasta header lines with a regexp, something like this would edit your file in place and should work depending on the consistency of your fasta headers:

sed -i -e 's/gi|[0-9]*|[a-zA-z]*|//g' Untitled.fa

As for displaying a large number of sample names on an axis in R I usually try a combination of shrinking the axis labels and increasing the size of the graphics device (I can't recall what R base does but I know GGPLOT will overlay labels regardless of whether they fit), when saving as a pdf you could alter the height and width arguments within the pdf() function. This would work if you wanted a quick look at your data, but may not be best for a presentation.


ADD COMMENTlink modified 4.1 years ago • written 4.1 years ago by zlskidmore280
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1249 users visited in the last hour