Text Histogram From Sequence Lengths Of Fasta File?
3
2
Entering edit mode
11.9 years ago

I swear I have seen this somewhere but I can't find it again. Is there a script somewhere to produce text Stem-and-Leaf Histograms from sequence lengths of fasta file?

fasta • 6.1k views
ADD COMMENT
5
Entering edit mode
11.9 years ago

You can do this in R with a combination of the seqinR and the aplpack packages:

library(seqinr)
data=read.fasta("file.fasta")
l=getLength(x)

library(aplpack)
stem.leaf(l)

There are a number of options to suit your needs:

stem.leaf(l, unit=100)

Just have a look here for more output options and details:

?stem.leaf

This functions extends far more than the simple stem() function in the graphics package.

ADD COMMENT
2
Entering edit mode
11.9 years ago

My initial idea was to use Biopython for the parsing of the fasta file and matplotlib for the drawing of the stem-leaf plot. I couldn't find a function that draws a text histogram in matplotlibto my amazement, so here is my take:

from Bio import SeqIO
from collections import defaultdict

input_file = r'path\to\file.fasta'
lengths = [len(record.seq) for record in SeqIO.parse(open(input_file), 'fasta')]
d = defaultdict(list)
for length in lengths:
    d[length/10].append(length%10)

for nr in range(min(d.keys()), max(d.keys())+1):
    print "{0:>3} | {1}".format(nr, ' '.join(map(str, sorted(d[nr]))) if nr in d else '')

For this fictitious input file it produces this histogram:

  5 | 3 4 8
  6 | 0 0 6 7 9
  7 | 0
  8 | 6
  9 | 0
 10 | 9
 11 | 2 5 6
 12 | 0 0 0 0 5 8
ADD COMMENT
1
Entering edit mode
11.9 years ago

There is something in the kent source tree called textHistogram. It takes a file with one number per line, which you can get from fasta using awk...

textHistogram <(awk < test.fa '$0 !~ /^>/ {print length($0)}')
ADD COMMENT

Login before adding your answer.

Traffic: 2543 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6