Question: Problem running CheckM
vincentpailler wrote (5 months ago):

Hello, I have several files containing protein FASTA sequences. Each file corresponds to a cluster of genes.

UniRef90_100.fasta
UniRef90_101.fasta
UniRef90_102.fasta
UniRef90_103.fasta
UniRef90_104.fasta
UniRef90_105.fasta
UniRef90_10.fasta
UniRef90_11.fasta
UniRef90_12.fasta
UniRef90_13.fasta

I want to determine the contamination of each cluster, and for that I want to run CheckM. I ran checkm lineage_wf bins checkm, but it does not work. I get this error message: checkm: error: unrecognized arguments: followed by all my bin files.

My question is: are these files bins? Each file has the following structure:

>UniRef90_A0A1B2YXP8 - Cluster: Uncharacterized protein
MRILRNFLGLFLLTAFIFSCVDENESNADFVDTISEPTNISALVSISQDNTGLVTIIPTG
EGVVTFNVDYGDGSDISGSINPGNSTEHFYSEGTYEATIIGTALDGSTAQATVTVVVSFI
APENLVVDILTSSGSYNILVSASADYATSFEVLFGDEAGGDATPMQIGEQLSHSYELAGT
YNVTITALSGGAATTQYSEEITITDPPVFDGFSTFEDFEGEVPGNFSFGGVGNVQVVANP
DNSGINTSTSVMQCTKDQGAEVWGGMGFAVNGHINFNGNNVLRLKSYAPEVGKVVKVKLE
TSAGNVAGLTYEFDMVTTVANQWEILTYDFSGAPDLDYITAIVFYDFGNQNAGVYHFDDV
EVGIGEYIQGIENFEGDVPESFTFGGVGGVEVIPNPDPSGENITGNVLQFVKDEGAEVWG
GMGFAVDVIDFNGASQIHLKSYAPEAGKVVKVKLETSAGNVAGLTHEVDVTTTVANEWET
LIYDFTGAPDLEYVSFIVFYDFGNTVGATYRVDEIQLID
>UniRef90_A0A1B2YXU0 - Cluster: Uncharacterized protein
MKYKILFLSILILFSCNHDNEKLDAIIKEYQNHEGYNYEDYPLGNFSEEYFKAEKEFAES
LLLKLDDIDITKLDENDNISYELLSFVLNDIIAYYDFERFLNPLLSDSGFHSSLVYNVRP
MYNYEQVKNYLNKLNAIPQYVDQYLPLLRKGLEKGVSQPLVIFKGYESTYNDHITKDFES
NYFYSPFNKLPNDISEIQRDSIFVAAKNAIEKSVVPQFIRIKDFFEKEYYKKTRTTIGVS
QTPNGSEFYQNRINYYTTSESYTADEIHQIGLKEVARIKKEMIKIIDELKFKGSFEEFFK
FLRTDEQFYAKTPKELLMYARDISKRADEQLPRFFKTLPRKPYGVAPVPDAIAPKYTGGR
YVGTSKNSTDPGYYWVNTYDLKSRTLYTIPALTVHEAVPGHHLQSALNNELGDSIPRFRR
NLYLSAYGEGWGLYTEFLADEMGIYTTPYEKFGKFTYEMWRACRLVVDTGLHTKGWSKEK
AIDYMSKNTALSLHEVNTEIDRYISWPGQALSYKIGELKIRELRNKAKDQLNDKFDIREF
HEKILEYGTVTLPTLERRINNYIEKKNE
Tags: checkm, contamination, fasta
written 5 months ago by vincentpailler

It would help to provide a link to the package this program belongs to. Have you checked the in-line help to see if that offers any assistance on what the minimal usage needs to look like?

written 5 months ago by genomax

I checked the documentation for this package and I think I did it right... My files are in the right format and the command line I used looks correct.

written 5 months ago by vincentpailler
Asaf wrote (5 months ago):

If the input is protein you should use -g

Wait, I suspect you wrote bins/* is that right? Can you add the full command line?
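
A minimal sketch of the invocation discussed here, assuming CheckM's -g (--genes) flag for amino-acid input and -x for the file extension, as shown by checkm lineage_wf -h. The helper name build_checkm_command and the output directory name checkm_out are hypothetical. Note that the bin directory itself is passed: if the shell expands something like bins/* into many file names, CheckM would see them as extra positional arguments, which would be consistent with the "unrecognized arguments" error.

```python
import shlex


def build_checkm_command(bin_dir, out_dir, extension="fasta", proteins=True, threads=1):
    """Build a 'checkm lineage_wf' command line as a list of arguments.

    bin_dir is the directory holding the bins (not a glob such as bins/*).
    out_dir is a hypothetical name for the directory CheckM writes results to.
    """
    cmd = ["checkm", "lineage_wf", "-t", str(threads), "-x", extension]
    if proteins:
        # -g / --genes: bins contain genes as amino acids, not nucleotide contigs
        cmd.append("-g")
    cmd += [bin_dir, out_dir]
    return cmd


cmd = build_checkm_command("bins", "checkm_out")
# Print a copy-pasteable shell command; pass cmd to subprocess.run to execute it.
print(shlex.join(cmd))
```

The printed line can be pasted into a shell as-is once CheckM is installed.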

written 5 months ago by Asaf

The -g option worked. Thanks.

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  Bin Id                  Marker lineage           # genomes   # markers   # marker sets    0     1    2    3    4    5+   Completeness   Contamination   Strain heterogeneity  
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  UniRef90_1           k__Bacteria (UID203)           5449        104            58         4     17   19   13   7    44      95.34           283.45              6.85          
  UniRef90_19          k__Bacteria (UID203)           5449        104            58         7     20   20   12   12   33      91.22           255.84             11.38          
  UniRef90_14         k__Bacteria (UID2570)           433         267           178         61   128   56   13   8    1       79.42           42.78              17.65          
  UniRef90_12          k__Bacteria (UID203)           5449         99            53         24    24   16   7    4    24      69.27           141.02             21.72          
  UniRef90_24          k__Bacteria (UID203)           5449        104            58         23    20   29   10   8    14      67.63           103.59             21.25          
  UniRef90_22          k__Bacteria (UID203)           5449         99            53         35    24   9    6    2    23      62.49

I got this kind of output. It looks bad, doesn't it?

written 5 months ago by vincentpailler

Pretty bad, yeah. Each bin looks like a mix of several bacteria.
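
A rough sketch of why contamination can exceed 100%, assuming the definitions given in the CheckM publication: completeness is the average fraction of markers found per collocated marker set, and contamination is the average number of extra marker copies per set. The function name, marker names and hit counts below are hypothetical, and this is not CheckM's actual code.

```python
def completeness_contamination(marker_sets, hit_counts):
    """Return (completeness %, contamination %) for one bin.

    marker_sets: list of lists of marker names (collocated marker sets).
    hit_counts: dict mapping marker name -> number of times it was found.
    """
    completeness = 0.0
    contamination = 0.0
    for markers in marker_sets:
        # Fraction of this set's markers found at least once
        present = sum(1 for m in markers if hit_counts.get(m, 0) >= 1)
        # Every copy beyond the first counts as contamination
        extra = sum(hit_counts.get(m, 0) - 1 for m in markers if hit_counts.get(m, 0) >= 2)
        completeness += present / len(markers)
        contamination += extra / len(markers)
    n_sets = len(marker_sets)
    return 100 * completeness / n_sets, 100 * contamination / n_sets


# Hypothetical bin: every marker is present, but two markers occur several times
marker_sets = [["m1", "m2"], ["m3"]]
hit_counts = {"m1": 1, "m2": 3, "m3": 4}
print(completeness_contamination(marker_sets, hit_counts))  # (100.0, 200.0)
```

Under this reading, a bin with many markers in the 2, 3, 4 and 5+ columns accumulates extra copies quickly, which fits values such as 283.45.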

written 5 months ago by Asaf

I only showed you the head of the output. I also got some other lineages:

UniRef90_65      p__Proteobacteria (UID3880)        1495        261           164        188    60   13   0    0    0       27.85            5.12              38.46          
  UniRef90_71        p__Euryarchaeota (UID3)          148         188           125        132    46   10   0    0    0       26.89            4.25              20.00          
  UniRef90_73     f__Rhodobacteraceae (UID3356)        67         615           329        451   164   0    0    0    0       26.65            0.00               0.00          
  UniRef90_7         p__Euryarchaeota (UID3)          148         188           125        133    37   15   2    1    0       24.44            9.74              14.81          
  UniRef90_16          k__Bacteria (UID203)           5449        104            58         69    12   6    5    0    12      24.39           29.06              35.29          
  UniRef90_67      p__Proteobacteria (UID3880)        1495        261           164        195    61   5    0    0    0       24.39

written 5 months ago by vincentpailler

Quarters of genomes. Usually people use completeness > 70-80% and contamination < 20%. You might get good bins with 0% completeness so watch for those too.
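
The thresholds above can be applied as a simple filter. This is a sketch only: the function name passes_quality is hypothetical, the defaults take the lower end of the 70-80% completeness range, and the three example bins reuse values from the tables in this thread.

```python
def passes_quality(completeness, contamination,
                   min_completeness=70.0, max_contamination=20.0):
    """True if a bin meets the commonly used completeness/contamination cutoffs."""
    return completeness > min_completeness and contamination < max_contamination


# (bin id, completeness %, contamination %) taken from the output shown above
bins = [
    ("UniRef90_1", 95.34, 283.45),
    ("UniRef90_14", 79.42, 42.78),
    ("UniRef90_73", 26.65, 0.00),
]

kept = [name for name, comp, cont in bins if passes_quality(comp, cont)]
print(kept)  # [] : none of these three bins passes both cutoffs
```

A filter like this would also discard the 0% completeness bins mentioned above, so those need to be inspected separately.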

written 5 months ago by Asaf

You meant "0% contamination", no?

written 5 months ago by vincentpailler

No, 0% completeness: CheckM can't find the proteins it's looking for, but other than that the assembly looks good in terms of size and N50.

written 5 months ago by Asaf

OK, I got it. Is there a file where the output is stored? I can't find it.
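
One possible approach, assuming the -f/--file and --tab_table options listed by checkm lineage_wf -h: adding something like -f results.tsv --tab_table to the command should write the table to a tab-separated file instead of the screen (results.tsv is a hypothetical name). The sketch below reads such a file, assuming its header uses the same column names as the printed table; the function name read_checkm_table is hypothetical and the example row is copied from the output above.

```python
import csv
import io


def read_checkm_table(handle):
    """Parse a tab-separated CheckM summary into a list of dicts."""
    rows = []
    for rec in csv.DictReader(handle, delimiter="\t"):
        rows.append({
            "bin_id": rec["Bin Id"],
            "lineage": rec["Marker lineage"],
            "completeness": float(rec["Completeness"]),
            "contamination": float(rec["Contamination"]),
        })
    return rows


# In-memory stand-in for a file written with -f results.tsv --tab_table
example = (
    "Bin Id\tMarker lineage\tCompleteness\tContamination\n"
    "UniRef90_14\tk__Bacteria (UID2570)\t79.42\t42.78\n"
)
rows = read_checkm_table(io.StringIO(example))
print(rows[0]["bin_id"], rows[0]["completeness"], rows[0]["contamination"])
```

With a real file, open it with open("results.tsv", newline="") and pass the handle to the same function.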

written 5 months ago by vincentpailler

Hello, do you know how to interpret the # genomes, # markers and # marker sets columns?

written 29 days ago by vm.higareda

I think you should open a new question, with some more background, if you're still struggling.

written 29 days ago by Asaf

darbinator wrote (5 months ago):

According to the README:

By default, CheckM assumes genomes consist of contigs/scaffolds in nucleotide space and that the files to process end with the extension fna.

Example Usage

Assume you have putative genomes in the directory /home/donovan/bins with fa as the file extension and want to store the CheckM results in /home/donovan/checkm. To process these genomes with 8 threads, simply run:

checkm lineage_wf -t 8 -x fa /home/donovan/bins /home/donovan/checkm

Did you try specifying '-x fasta'?

written 5 months ago by darbinator

Yes, I tried, but I always get the same result. I don't understand why.

written 5 months ago by vincentpailler

I only needed to add the -g option to work on protein sequences.

written 5 months ago by vincentpailler