Problem running CheckM
3.8 years ago
pablo

Hello, I have several files that contain protein FASTA sequences. Each file corresponds to a cluster of genes.

UniRef90_100.fasta
UniRef90_101.fasta
UniRef90_102.fasta
UniRef90_103.fasta
UniRef90_104.fasta
UniRef90_105.fasta
UniRef90_10.fasta
UniRef90_11.fasta
UniRef90_12.fasta
UniRef90_13.fasta


I want to determine the contamination of each cluster. For that, I want to run CheckM. I ran checkm lineage_wf bins checkm, but it does not work. I get this error message: checkm: error: unrecognized arguments: followed by all my bin files.

My question is: are these files bins? Each file has the following structure:

>UniRef90_A0A1B2YXP8 - Cluster: Uncharacterized protein
EGVVTFNVDYGDGSDISGSINPGNSTEHFYSEGTYEATIIGTALDGSTAQATVTVVVSFI
YNVTITALSGGAATTQYSEEITITDPPVFDGFSTFEDFEGEVPGNFSFGGVGNVQVVANP
DNSGINTSTSVMQCTKDQGAEVWGGMGFAVNGHINFNGNNVLRLKSYAPEVGKVVKVKLE
TSAGNVAGLTYEFDMVTTVANQWEILTYDFSGAPDLDYITAIVFYDFGNQNAGVYHFDDV
EVGIGEYIQGIENFEGDVPESFTFGGVGGVEVIPNPDPSGENITGNVLQFVKDEGAEVWG
GMGFAVDVIDFNGASQIHLKSYAPEAGKVVKVKLETSAGNVAGLTHEVDVTTTVANEWET
LIYDFTGAPDLEYVSFIVFYDFGNTVGATYRVDEIQLID
>UniRef90_A0A1B2YXU0 - Cluster: Uncharacterized protein
MKYKILFLSILILFSCNHDNEKLDAIIKEYQNHEGYNYEDYPLGNFSEEYFKAEKEFAES
LLLKLDDIDITKLDENDNISYELLSFVLNDIIAYYDFERFLNPLLSDSGFHSSLVYNVRP
MYNYEQVKNYLNKLNAIPQYVDQYLPLLRKGLEKGVSQPLVIFKGYESTYNDHITKDFES
NYFYSPFNKLPNDISEIQRDSIFVAAKNAIEKSVVPQFIRIKDFFEKEYYKKTRTTIGVS
YVGTSKNSTDPGYYWVNTYDLKSRTLYTIPALTVHEAVPGHHLQSALNNELGDSIPRFRR
AIDYMSKNTALSLHEVNTEIDRYISWPGQALSYKIGELKIRELRNKAKDQLNDKFDIREF
HEKILEYGTVTLPTLERRINNYIEKKNE

It would help to provide a link to the package this program belongs to. Have you checked the in-line help to see if that offers any assistance on what the minimal usage needs to look like?

I checked the documentation for this package and I think I did it correctly. My files are in the right format and the command line I used looks right.

3.8 years ago
Asaf

If the input is protein, you should use -g.

Wait, I suspect you wrote bins/*, is that right? Can you add the full command line?
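To illustrate why a glob would produce that error: the shell expands bins/* into one argument per file before checkm ever sees it, and lineage_wf expects just a bin directory and an output directory. A minimal sketch, using a throwaway directory and two of the file names from the question (count_args is a made-up helper, not part of CheckM):

```shell
# Create a scratch directory with a 'bins' folder holding two empty files.
workdir=$(mktemp -d)
mkdir "$workdir/bins"
touch "$workdir/bins/UniRef90_10.fasta" "$workdir/bins/UniRef90_11.fasta"

# Hypothetical helper: prints how many arguments it received.
count_args() { echo "$#"; }

glob_count=$(count_args "$workdir"/bins/*)   # glob is expanded: one argument per file
dir_count=$(count_args "$workdir"/bins)      # the directory itself: a single argument

echo "bins/* -> $glob_count argument(s); bins -> $dir_count argument(s)"
```

So checkm lineage_wf bins/* checkm would pass every file as a separate positional argument, whereas checkm lineage_wf bins checkm passes only the directory.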

The -g option worked. Thanks.

Bin Id                  Marker lineage           # genomes   # markers   # marker sets    0     1    2    3    4    5+   Completeness   Contamination   Strain heterogeneity
UniRef90_1           k__Bacteria (UID203)           5449        104            58         4     17   19   13   7    44      95.34           283.45              6.85
UniRef90_19          k__Bacteria (UID203)           5449        104            58         7     20   20   12   12   33      91.22           255.84             11.38
UniRef90_14         k__Bacteria (UID2570)           433         267           178         61   128   56   13   8    1       79.42           42.78              17.65
UniRef90_12          k__Bacteria (UID203)           5449         99            53         24    24   16   7    4    24      69.27           141.02             21.72
UniRef90_24          k__Bacteria (UID203)           5449        104            58         23    20   29   10   8    14      67.63           103.59             21.25
UniRef90_22          k__Bacteria (UID203)           5449         99            53         35    24   9    6    2    23      62.49


I got this kind of output. It looks bad, doesn't it?

Pretty bad, yeah. Each bin looks like a mix of several bacteria.

I only showed you the head of the output. I also got some other lineages:

UniRef90_65      p__Proteobacteria (UID3880)        1495        261           164        188    60   13   0    0    0       27.85            5.12              38.46
UniRef90_71        p__Euryarchaeota (UID3)          148         188           125        132    46   10   0    0    0       26.89            4.25              20.00
UniRef90_73     f__Rhodobacteraceae (UID3356)        67         615           329        451   164   0    0    0    0       26.65            0.00               0.00
UniRef90_7         p__Euryarchaeota (UID3)          148         188           125        133    37   15   2    1    0       24.44            9.74              14.81
UniRef90_16          k__Bacteria (UID203)           5449        104            58         69    12   6    5    0    12      24.39           29.06              35.29
UniRef90_67      p__Proteobacteria (UID3880)        1495        261           164        195    61   5    0    0    0       24.39

These are quarters of genomes. People usually keep bins with completeness > 70-80% and contamination < 20%. You might get good bins with 0% completeness, so watch for those too.
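Those thresholds can be applied with awk. This is only a sketch: it assumes a tab-separated table in which completeness and contamination are columns 12 and 13, matching the column order of the output posted above. The rows are copied from this thread, and filter_bins is a made-up helper name.

```shell
table=$(mktemp)
# Header plus three rows from the output posted in this thread, tab-separated.
printf 'Bin Id\tMarker lineage\t# genomes\t# markers\t# marker sets\t0\t1\t2\t3\t4\t5+\tCompleteness\tContamination\tStrain heterogeneity\n' > "$table"
printf 'UniRef90_1\tk__Bacteria (UID203)\t5449\t104\t58\t4\t17\t19\t13\t7\t44\t95.34\t283.45\t6.85\n' >> "$table"
printf 'UniRef90_14\tk__Bacteria (UID2570)\t433\t267\t178\t61\t128\t56\t13\t8\t1\t79.42\t42.78\t17.65\n' >> "$table"
printf 'UniRef90_65\tp__Proteobacteria (UID3880)\t1495\t261\t164\t188\t60\t13\t0\t0\t0\t27.85\t5.12\t38.46\n' >> "$table"

# Hypothetical helper: print the IDs of bins with completeness > $2 and contamination < $3.
filter_bins() {
    awk -F '\t' -v min_comp="$2" -v max_cont="$3" \
        'NR > 1 && $12 > min_comp && $13 < max_cont { print $1 }' "$1"
}

good=$(filter_bins "$table" 70 20)
echo "Bins passing: ${good:-none}"
```

With these three rows nothing passes at 70% completeness and 20% contamination, which matches the "pretty bad" verdict above.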

You meant "0% contamination", no?

No, 0% completeness. CheckM can't find the proteins it's looking for, but other than that the assembly looks good in terms of size and N50.

OK, I got it. Is there a file where the output is stored? I can't find it.

Hello, do you know how to interpret the # genomes, # markers and # marker sets columns?

I think you should open a new question with some more background if you're still struggling.

Hi! Can you post the command that you ran in the bash terminal? :)

3.8 years ago
vin.darb

By default, CheckM assumes genomes consist of contigs/scaffolds in nucleotide space and that the files to process end with the extension fna.

Example Usage

Assume you have putative genomes in the directory /home/donovan/bins with fa as the file extension and want to store the CheckM results in /home/donovan/checkm. To process these genomes with 8 threads, simply run:

checkm lineage_wf -t 8 -x fa /home/donovan/bins /home/donovan/checkm

Did you try specifying '-x fasta'?

Yes, I tried, but I always get the same result. I don't understand why.

I only needed to add the -g option to work with protein sequences.
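For reference, a sketch of what the working command line probably looked like, combining the flags discussed in this thread. The directory names bins and checkm_out and the table file name are assumptions, not taken from the original post. The -f and --tab_table options are added here to write the summary table to a file, which may also help with the earlier question about where the output is stored.

```shell
bin_dir=bins                 # assumed: directory holding the UniRef90_*.fasta files
out_dir=checkm_out           # assumed: directory for CheckM's working files
table=checkm_results.tsv     # assumed: file name for the summary table

# -g             : bins contain protein (amino acid) sequences
# -x fasta       : file extension of the bins (CheckM looks for fna by default)
# -f/--tab_table : write the summary table to a file, tab-separated
set -- checkm lineage_wf -g -x fasta -f "$table" --tab_table "$bin_dir" "$out_dir"
cmd_line="$*"

# Print the command; run it only if checkm is installed and the bin directory exists.
echo "$cmd_line"
if command -v checkm >/dev/null 2>&1 && [ -d "$bin_dir" ]; then
    "$@"
fi
```

Note that the bin directory is passed as a plain directory name, not as bins/*.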