Plot Sequence Similarity Data
2
0
Entering edit mode
9.2 years ago
thorerges ▴ 70

Hello,

I am trying to do two things, I will try to make this as clear as possible.

1. I have aligned and downloaded about 500 sequences in BLAST. However, in my FASTA file I just want to show the accession number, not the GI number.

So convert this:

>gi|2182117|gb|U95551.1|

into this:

>U95551.1|

Is there a way to do this? I could write a script in Python, but I don't want to re-invent the wheel.

2. From my alignment, I generated a sequence similarity matrix in a software called MacVector. This assigns a similarity score to all the sequences on the basis of how similar they are. I then plotted this in excel, in the form of a histogram.

It looks like this:

< image not found >

Each bar in the histogram is supposed to be a single sequence, and the x axis the accession number or identifier for that sequence. As you can probably tell, the x-axis is missing a lot of labels (it needs to be 500 labels).

I had this problem before, displaying relatively large data sets cleanly in R, usually I just edited the picture, but for 500 sequences, it is just too much. I am sure someone has run into this before. Is there a way to clean this up?

Any advice or pointing me in the right direction would be thoroughly appreciated.

sequencing Python sequence plot R • 3.5k views
ADD COMMENT
2
Entering edit mode
9.2 years ago
5heikki 11k
awk 'BEGIN {FS="|"} {if(/^>/) print ">"$4; else print $0}' input.fasta > output.fasta

Field separator is |. If line begins with >, print > and the 4th field, else print the entire line. As to your plotting problem, I second zlskidmore. Use ggplot. See http://www.cookbook-r.com/Graphs/

ADD COMMENT
0
Entering edit mode

Wow, thank you so much. One last question, where can I begin to learn these kind of text file manipulations? I was going to spend a lot of time writing something in Python, but this seems so usable and useful. Thank you again.

ADD REPLY
0
Entering edit mode

I learned by simply googling "how to do X in unix/bash" and by reading man pages of awk, grep, sed, comm, join, cat, rev, tr, head, tail, cut, sort, etc. Every Bioinformatics 101 should really start with an introduction to GNU Coreutils and Bash scripting..

ADD REPLY
0
Entering edit mode
9.2 years ago
zlskidmore ▴ 290

You should be able to convert your fasta header lines with a regexp, something like this would edit your file in place and should work depending on the consistency of your fasta headers:

sed -i -e 's/gi|[0-9]*|[a-zA-z]*|//g' Untitled.fa

As for displaying a large number of sample names on an axis in R I usually try a combination of shrinking the axis labels and increasing the size of the graphics device (I can't recall what R base does but I know GGPLOT will overlay labels regardless of whether they fit), when saving as a pdf you could alter the height and width arguments within the pdf() function. This would work if you wanted a quick look at your data, but may not be best for a presentation.

ADD COMMENT

Login before adding your answer.

Traffic: 2252 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6