Hello! I have run metabat to separate my metagenomic contigs into bins and obtained a number of files that contain a list of all contig names that belong to a particular bin, like this:
Example bin_1:
k105_10322
k105_20691
k105_133304
k105_31104
...
Now, I would like to assess the quality of my binning using CheckM. From all examples I have seen previously, CheckM wants to have fasta-files as input.
Is it possible to provide a list of the contigs in a bin and a fasta-file with the sequences from all bins as input to CheckM? The file with all the sequences looks like this:
>k105_92090 flag=1 multi=2.0000 len=532
TAACTT...
>k105_102322 flag=1 multi=2.0000 len=528
GGAAGA...
>k105_92091 flag=1 multi=2.0000 len=409
AAAAAA...
>k105_92092 flag=1 multi=2.0000 len=332
TGAATC...
>k105_102323 flag=1 multi=1.0000 len=455
GAATAC...
...
The other option I see is to use grep/awk and extract the contigs from the file that contains all of them, but that would be a bit of a hassle...
Thank you for your help!
Just to be clear, you already ran metabat but you don't have a fasta file for each bin, e.g. bin1.fa, bin2.fa, bin3.fa?
Yes! I haven't figured out how to submit a list of contigs to CheckM, but the grep approach wasn't as hard as I anticipated. Here is some example code that should work for others facing the same issue:
I don't know if it's the most efficient way of doing it, but creating the fasta bin files for ca. 120 Mbp of data and 160,000 contigs took just a few seconds on 4 cores.
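For anyone who would rather not chain grep/awk, here is a minimal Python sketch of the same extraction idea. It is not the code from the post above, just one possible way to do it. It assumes each bin list is a plain text file with one contig name per line, and that the contig name is the first whitespace-delimited word on each FASTA header line (as in the `>k105_92090 flag=1 ...` headers shown in the question). The function names, the output directory, and the `.fa` extension are all made up for illustration.

```python
from pathlib import Path


def read_bin_lists(list_paths):
    """Map each contig name to the name of the list file (bin) containing it."""
    contig_to_bin = {}
    for path in list_paths:
        path = Path(path)
        with path.open() as handle:
            for line in handle:
                name = line.strip()
                if name:  # skip blank lines
                    contig_to_bin[name] = path.name
    return contig_to_bin


def split_fasta_by_bin(fasta_path, list_paths, out_dir, ext=".fa"):
    """Write one FASTA file per bin into out_dir; return the bin names written.

    Assumption: output files are named <list file name> + ext, e.g. a list
    file called "bin_1" becomes "bin_1.fa". Contigs that appear in no list
    are skipped.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    contig_to_bin = read_bin_lists(list_paths)

    handles = {}    # bin name -> open output file
    current = None  # output file for the record being read, or None
    try:
        with open(fasta_path) as fasta:
            for line in fasta:
                if line.startswith(">"):
                    # Contig name = first word after ">" on the header line.
                    fields = line[1:].split()
                    contig = fields[0] if fields else ""
                    bin_name = contig_to_bin.get(contig)
                    if bin_name is None:
                        current = None
                    else:
                        if bin_name not in handles:
                            handles[bin_name] = open(out_dir / (bin_name + ext), "w")
                        current = handles[bin_name]
                if current is not None:
                    current.write(line)
    finally:
        for handle in handles.values():
            handle.close()
    return sorted(handles)
```

Calling something like `split_fasta_by_bin("all_contigs.fa", ["bin_1", "bin_2"], "bins")` (hypothetical file names) reads the big FASTA file once and writes `bins/bin_1.fa`, `bins/bin_2.fa`, and so on. The resulting directory should then be usable as CheckM input, for example with `checkm lineage_wf -x fa bins/ checkm_out/`, but check the options against your installed CheckM version.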
The most efficient way to do this is to run metabat again without the -l option. By omitting -l you should get a fasta file for each bin.