Question

Text Histogram From Sequence Lengths Of Fasta File?

2

Entering edit mode

11.9 years ago

14134125465346445 ★ 3.6k

I swear I have seen this somewhere but I can't find it again. Is there a script somewhere to produce text Stem-and-Leaf Histograms from sequence lengths of fasta file?

fasta • 6.1k views

ADD COMMENT • link updated 7.6 years ago by Biostar 20 • written 11.9 years ago by 14134125465346445 ★ 3.6k

score 5 · Answer 1 · 2012-05-30

You can do this in R with a combination of the seqinR and the aplpack packages:

library(seqinr)
data=read.fasta("file.fasta")
l=getLength(x)

library(aplpack)
stem.leaf(l)

There are a number of options to suit your needs:

stem.leaf(l, unit=100)

Just have a look here for more output options and details:

?stem.leaf

This functions extends far more than the simple stem() function in the graphics package.

score 2 · Answer 2 · 2012-05-30

My initial idea was to use Biopython for the parsing of the fasta file and matplotlib for the drawing of the stem-leaf plot. I couldn't find a function that draws a text histogram in matplotlibto my amazement, so here is my take:

from Bio import SeqIO
from collections import defaultdict

input_file = r'path\to\file.fasta'
lengths = [len(record.seq) for record in SeqIO.parse(open(input_file), 'fasta')]
d = defaultdict(list)
for length in lengths:
    d[length/10].append(length%10)

for nr in range(min(d.keys()), max(d.keys())+1):
    print "{0:>3} | {1}".format(nr, ' '.join(map(str, sorted(d[nr]))) if nr in d else '')

For this fictitious input file it produces this histogram:

  5 | 3 4 8
  6 | 0 0 6 7 9
  7 | 0
  8 | 6
  9 | 0
 10 | 9
 11 | 2 5 6
 12 | 0 0 0 0 5 8

score 1 · Answer 3 · 2012-05-31

1

Entering edit mode

11.9 years ago

Madelaine Gogol 5.3k

There is something in the kent source tree called textHistogram. It takes a file with one number per line, which you can get from fasta using awk...

textHistogram <(awk < test.fa '$0 !~ /^>/ {print length($0)}')

ADD COMMENT • link 11.9 years ago by Madelaine Gogol 5.3k