Hello everyone,
I’m working with several dozen metagenomic bins and need to analyze contig/scaffold length distributions within each bin. Specifically, I’d like to calculate:
Individual lengths of all scaffolds/contigs per bin,
Summary statistics (average, min, max lengths) of all scaffolds/contigs per bin,
Data in table format for plotting (e.g., histograms of the distribution of contig lengths in bins).
I’ve tried:
statswrapper.sh (BBMap): Provides metrics like scaf_max/ctg_max (longest scaffold/contig), but lacks per-sequence lengths or distributions.
QUAST: Excellent for assemblies, but seems cumbersome for many bins and doesn’t output per-bin length tables easily.
I would appreciate any help! Best, Alla
Thanks GenoMax, I will try those tools.
The following code didn't give me what I need
I got a strange table
Which is of no use to me at all)
And readlength.sh from BBMap works only with raw reads and one file at a time, if I am not mistaken...
That is the name of the fasta header followed by length of the sequence in the multi fasta used as input. How is that strange?
Run it on all contigs via a for loop, one file at a time.
It seems I have found a solution, but without the individual lengths of all scaffolds/contigs per bin...
seqkit stats -a -T /reassembled_bins/*.fa > bin_stats.tsv
This gave me: min_len, avg_len, max_len of contigs, I can work with that :)
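For the individual contig lengths per bin, which the summary table above does not list, here is a minimal sketch using only a shell for loop and awk, in the spirit of the "one file at a time" suggestion. The function name contig_lengths and the output file name contig_lengths.tsv are hypothetical, and it assumes each bin is a plain (uncompressed) FASTA file whose file name is the bin name.

```shell
# Print one row per contig: bin name, contig ID, length (tab-separated).
# Assumption: each input file is an uncompressed FASTA file for one bin.
contig_lengths() {
    printf 'bin\tcontig\tlength\n'
    for fasta in "$@"; do
        bin=$(basename "$fasta")
        bin=${bin%.*}                      # strip the file extension
        awk -v bin="$bin" '
            # Header line: emit the previous contig, then start a new one
            /^>/ {
                if (seen) print bin "\t" id "\t" len
                id = substr($1, 2)         # first word of the header, without ">"
                len = 0
                seen = 1
                next
            }
            # Sequence line: add its length (works for wrapped FASTA too)
            { sub(/\r$/, ""); len += length($0) }
            # Emit the last contig of the file
            END { if (seen) print bin "\t" id "\t" len }
        ' "$fasta"
    done
}
```

Something like `contig_lengths /reassembled_bins/*.fa > contig_lengths.tsv` should then give one long table with a bin column, which can be loaded into R or Python and plotted as one histogram per bin.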