Statistics for bins
11 weeks ago
shevch2009 ▴ 20

Hello everyone,

I’m working with several dozen metagenomic bins and need to analyze contig/scaffold length distributions within each bin. Specifically, I’d like to calculate:

Individual lengths of all scaffolds/contigs per bin,

Summary statistics (average, min, max lengths) of all scaffolds/contigs per bin,

Data in table format for plotting (e.g., histograms of the contig length distribution in each bin).

I’ve tried:

statswrapper.sh (BBMap): Provides metrics like scaf_max/ctg_max (longest scaffold/contig), but lacks per-sequence lengths or distributions.

QUAST: Excellent for assemblies, but seems cumbersome for many bins and doesn’t output per-bin length tables easily.

I would appreciate any help! Best, Alla

data shotgun • 610 views
11 weeks ago
GenoMax 153k

Try seqkit (specifically the stats subcommand): https://bioinf.shenwei.me/seqkit/usage/#stats

For per-sequence lengths or distributions:

seqkit fx2tab --length --name --header-line  foo.fasta

readlength.sh from the BBMap suite can also create a length distribution histogram.
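
For plotting, per-sequence lengths can also be grouped into length classes with plain awk. The following is only a minimal sketch: it assumes a two-column tab-separated table (name, length) such as the fx2tab command above produces, and the function name `length_histogram` and the default class width of 5000 bp are arbitrary choices for illustration, not part of seqkit or BBMap.

```shell
#!/usr/bin/env bash
# Group lengths from a two-column table (name<TAB>length) into fixed-width classes.
# Usage: length_histogram [WIDTH] < lengths.tsv
# WIDTH defaults to 5000 bp (an arbitrary, hypothetical choice).
length_histogram() {
    printf 'start\tend\tcount\n'
    awk -F '\t' -v w="${1:-5000}" '
        # Only count rows whose second column is a plain integer,
        # so a header line or malformed rows are skipped.
        $2 ~ /^[0-9]+$/ { count[int($2 / w)]++ }
        END {
            for (k in count)
                printf "%d\t%d\t%d\n", k * w, (k + 1) * w - 1, count[k]
        }
    ' | sort -n
}
```

For example, `seqkit fx2tab --length --name foo.fasta | length_histogram 10000 > foo_hist.tsv` would give a table that can be plotted directly as a bar chart.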

Thanks GenoMax, I will try those tools.

The following command didn't give me what I need:

seqkit fx2tab --length --name --header-line  foo.fasta

I got a strange table:

k127_5845587    2388
k127_1599072    5710
k127_4786624    15259
k127_2662662    8275
k127_537890     25137
k127_538940     24838
k127_4260329    30297
k127_6916585    23944

Which is of no use to me at all :)

And readlength.sh from BBMap works only with raw reads and one file at a time, if I am not mistaken...

k127_5845587    2388
k127_1599072    5710
k127_4786624    15259

That is the name from each FASTA header followed by the length of that sequence in the multi-FASTA file used as input. How is that strange?

readlength.sh from BBMap works only with raw reads and one file at a time, if I am not mistaken...

Run it on all contigs via a for loop, one file at a time.
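
Such a loop could look roughly like the sketch below. To keep it runnable without extra tools, it computes the lengths with plain awk rather than calling readlength.sh or seqkit; either could be substituted inside the loop. The function name `bin_contig_lengths` and the assumption that each bin is one `.fa` file named after the bin are illustrative only.

```shell
#!/usr/bin/env bash
# Print "bin<TAB>contig<TAB>length" for every sequence in every FASTA file given.
# Assumption: each file holds one bin, and the bin name is the file name minus ".fa".
bin_contig_lengths() {
    printf 'bin\tcontig\tlength\n'
    local f bin
    for f in "$@"; do
        bin=$(basename "$f" .fa)
        awk -v bin="$bin" '
            # Header line: emit the previous record, then start a new one.
            /^>/ {
                if (name != "") print bin "\t" name "\t" len
                name = substr($1, 2)   # sequence ID up to the first whitespace
                len = 0
                next
            }
            # Sequence line: add its length (sequences may span several lines).
            { len += length($0) }
            END { if (name != "") print bin "\t" name "\t" len }
        ' "$f"
    done
}
```

Calling `bin_contig_lengths path/to/bins/*.fa > all_bins_lengths.tsv` (the path is a placeholder) gives one long table for all bins, which is the shape most plotting tools expect for per-bin histograms.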

It seems I have found a solution, but without the individual lengths of all scaffolds/contigs per bin...

seqkit stats -a -T /reassembled_bins/*.fa > bin_stats.tsv

This gave me min_len, avg_len, and max_len of the contigs in each bin. I can work with that :)
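
If the individual lengths are wanted as well, the same kind of summary can be derived from a per-contig table, so that both come from one source. This is a minimal sketch, assuming a tab-separated table with a header line and the columns bin, contig, length; the function name `summarise_bins` and the file name in the usage note are hypothetical, and the output columns only mimic the min_len/avg_len/max_len naming used by seqkit stats.

```shell
#!/usr/bin/env bash
# Summarise a long table (bin<TAB>contig<TAB>length, first line is a header)
# into one row per bin: number of contigs, min, mean and max length.
summarise_bins() {
    printf 'bin\tn_contigs\tmin_len\tavg_len\tmax_len\n'
    awk -F '\t' '
        NR == 1 { next }                 # skip the header line
        {
            len = $3 + 0                 # force numeric comparison
            n[$1]++
            sum[$1] += len
            if (!($1 in min) || len < min[$1]) min[$1] = len
            if (!($1 in max) || len > max[$1]) max[$1] = len
        }
        END {
            for (b in n)
                printf "%s\t%d\t%d\t%.1f\t%d\n", b, n[b], min[b], sum[b] / n[b], max[b]
        }
    ' | sort
}
```

Usage would be along the lines of `summarise_bins < all_bins_lengths.tsv > bin_summary.tsv`, leaving the per-contig table available for histograms.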
