Question: Problem running CheckM
0
15 months ago by
pablo140
pablo140 wrote:

Hello, I have several files that contain protein FASTA sequences. Each file corresponds to a cluster of genes.

UniRef90_100.fasta
UniRef90_101.fasta
UniRef90_102.fasta
UniRef90_103.fasta
UniRef90_104.fasta
UniRef90_105.fasta
UniRef90_10.fasta
UniRef90_11.fasta
UniRef90_12.fasta
UniRef90_13.fasta

I want to determine the contamination of each cluster. For that, I want to run CheckM. I used checkm lineage_wf bins checkm, but it does not work. I get this error message: "checkm: error: unrecognized arguments:" followed by all my bin files.

My question is: are these files bins? Each file has the following structure:

>UniRef90_A0A1B2YXP8 - Cluster: Uncharacterized protein
MRILRNFLGLFLLTAFIFSCVDENESNADFVDTISEPTNISALVSISQDNTGLVTIIPTG
EGVVTFNVDYGDGSDISGSINPGNSTEHFYSEGTYEATIIGTALDGSTAQATVTVVVSFI
APENLVVDILTSSGSYNILVSASADYATSFEVLFGDEAGGDATPMQIGEQLSHSYELAGT
YNVTITALSGGAATTQYSEEITITDPPVFDGFSTFEDFEGEVPGNFSFGGVGNVQVVANP
DNSGINTSTSVMQCTKDQGAEVWGGMGFAVNGHINFNGNNVLRLKSYAPEVGKVVKVKLE
TSAGNVAGLTYEFDMVTTVANQWEILTYDFSGAPDLDYITAIVFYDFGNQNAGVYHFDDV
EVGIGEYIQGIENFEGDVPESFTFGGVGGVEVIPNPDPSGENITGNVLQFVKDEGAEVWG
GMGFAVDVIDFNGASQIHLKSYAPEAGKVVKVKLETSAGNVAGLTHEVDVTTTVANEWET
LIYDFTGAPDLEYVSFIVFYDFGNTVGATYRVDEIQLID
>UniRef90_A0A1B2YXU0 - Cluster: Uncharacterized protein
MKYKILFLSILILFSCNHDNEKLDAIIKEYQNHEGYNYEDYPLGNFSEEYFKAEKEFAES
LLLKLDDIDITKLDENDNISYELLSFVLNDIIAYYDFERFLNPLLSDSGFHSSLVYNVRP
MYNYEQVKNYLNKLNAIPQYVDQYLPLLRKGLEKGVSQPLVIFKGYESTYNDHITKDFES
NYFYSPFNKLPNDISEIQRDSIFVAAKNAIEKSVVPQFIRIKDFFEKEYYKKTRTTIGVS
QTPNGSEFYQNRINYYTTSESYTADEIHQIGLKEVARIKKEMIKIIDELKFKGSFEEFFK
FLRTDEQFYAKTPKELLMYARDISKRADEQLPRFFKTLPRKPYGVAPVPDAIAPKYTGGR
YVGTSKNSTDPGYYWVNTYDLKSRTLYTIPALTVHEAVPGHHLQSALNNELGDSIPRFRR
NLYLSAYGEGWGLYTEFLADEMGIYTTPYEKFGKFTYEMWRACRLVVDTGLHTKGWSKEK
AIDYMSKNTALSLHEVNTEIDRYISWPGQALSYKIGELKIRELRNKAKDQLNDKFDIREF
HEKILEYGTVTLPTLERRINNYIEKKNE
checkm contamination fasta • 759 views
modified 15 months ago by Asaf8.4k • written 15 months ago by pablo140

It would help to provide a link to the package this program belongs to. Have you checked the in-line help to see if that offers any assistance on what the minimal usage needs to look like?

modified 15 months ago • written 15 months ago by genomax89k

I checked the documentation for this package and I think I did it correctly. My files are in the right format and the command line I used looks right.

written 15 months ago by pablo140
2
15 months ago by
Asaf8.4k
Israel
Asaf8.4k wrote:

If the input is protein, you should use the -g (--genes) option.

Wait, I suspect you wrote bins/*, is that right? Can you add the full command line?

modified 15 months ago • written 15 months ago by Asaf8.4k
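
If the command was indeed typed with bins/* rather than bins, the shell expands the glob before CheckM starts, so every FASTA file arrives as its own argument. As far as I know, lineage_wf takes just two positional arguments (the bin directory and the output directory), which would explain the "unrecognized arguments" error. A minimal sketch of the effect, using a hypothetical count_args helper and a throwaway directory instead of CheckM itself:

```shell
# Hypothetical helper: prints how many arguments it received.
count_args() { echo "$#"; }

# Throwaway directory with three empty files named like the ones in the question.
tmp=$(mktemp -d)
mkdir "$tmp/bins"
touch "$tmp/bins/UniRef90_10.fasta" "$tmp/bins/UniRef90_11.fasta" "$tmp/bins/UniRef90_12.fasta"

# The glob is expanded by the shell into one argument per file.
with_glob=$(count_args lineage_wf "$tmp"/bins/* checkm)
# The directory name stays a single argument.
with_dir=$(count_args lineage_wf "$tmp/bins" checkm)

echo "bins/* -> $with_glob arguments; bins -> $with_dir arguments"
rm -rf "$tmp"
```

With the directory passed as a single argument, a protein run would look something like checkm lineage_wf -g -x fasta bins checkm, where -x fasta matches the .fasta extension of the files listed in the question.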

The -g option worked. Thanks.

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  Bin Id                  Marker lineage           # genomes   # markers   # marker sets    0     1    2    3    4    5+   Completeness   Contamination   Strain heterogeneity  
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  UniRef90_1           k__Bacteria (UID203)           5449        104            58         4     17   19   13   7    44      95.34           283.45              6.85          
  UniRef90_19          k__Bacteria (UID203)           5449        104            58         7     20   20   12   12   33      91.22           255.84             11.38          
  UniRef90_14         k__Bacteria (UID2570)           433         267           178         61   128   56   13   8    1       79.42           42.78              17.65          
  UniRef90_12          k__Bacteria (UID203)           5449         99            53         24    24   16   7    4    24      69.27           141.02             21.72          
  UniRef90_24          k__Bacteria (UID203)           5449        104            58         23    20   29   10   8    14      67.63           103.59             21.25          
  UniRef90_22          k__Bacteria (UID203)           5449         99            53         35    24   9    6    2    23      62.49

I got this kind of output. It looks bad, doesn't it?

written 15 months ago by pablo140

Pretty bad, yeah. Each bin looks like a mix of several bacteria.

written 15 months ago by Asaf8.4k
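
One way to see why contamination can exceed 100%: CheckM counts how many times each single-copy marker gene appears in a bin, and every extra copy adds to contamination. The sketch below recomputes rough figures for UniRef90_1 from its 0 to 5+ columns. It is only an approximation, assuming all markers are weighted equally and that "5+" means exactly five copies; CheckM itself averages over collocated marker sets, so the reported 95.34 and 283.45 differ somewhat. The approx_stats helper is a made-up name.

```shell
approx_stats() {
    # Arguments: number of markers found 0, 1, 2, 3, 4 and 5+ times.
    awk -v c0="$1" -v c1="$2" -v c2="$3" -v c3="$4" -v c4="$5" -v c5="$6" 'BEGIN {
        total = c0 + c1 + c2 + c3 + c4 + c5
        # Extra copies beyond the first; "5+" is treated as exactly 5 copies.
        extra = c2 * 1 + c3 * 2 + c4 * 3 + c5 * 4
        # Prints: approximate completeness, approximate contamination (both in %).
        printf "%.2f %.2f\n", 100 * (total - c0) / total, 100 * extra / total
    }'
}

# Counts taken from the UniRef90_1 row of the table above.
approx_stats 4 17 19 13 7 44
```

The 44 markers present five or more times dominate the result, which fits the reading that this cluster pools proteins from many genomes.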

I only showed you the head of the output. I also got some other lineages:

  UniRef90_65      p__Proteobacteria (UID3880)        1495        261           164        188    60   13   0    0    0       27.85            5.12              38.46          
  UniRef90_71        p__Euryarchaeota (UID3)          148         188           125        132    46   10   0    0    0       26.89            4.25              20.00          
  UniRef90_73     f__Rhodobacteraceae (UID3356)        67         615           329        451   164   0    0    0    0       26.65            0.00               0.00          
  UniRef90_7         p__Euryarchaeota (UID3)          148         188           125        133    37   15   2    1    0       24.44            9.74              14.81          
  UniRef90_16          k__Bacteria (UID203)           5449        104            58         69    12   6    5    0    12      24.39           29.06              35.29          
  UniRef90_67      p__Proteobacteria (UID3880)        1495        261           164        195    61   5    0    0    0       24.39
written 15 months ago by pablo140

Those are quarters of genomes. People usually require completeness > 70-80% and contamination < 20%. You might also get good bins with 0% completeness, so watch for those too.

written 15 months ago by Asaf8.4k
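
The thresholds mentioned above can be applied to the table with a short awk filter. This is a sketch, assuming each data row splits into exactly 15 whitespace-separated fields (the lineage taking two of them, as in the rows shown) with completeness and contamination in the third-to-last and second-to-last columns; filter_bins is a hypothetical helper name.

```shell
filter_bins() {
    # $1 = minimum completeness (%), $2 = maximum contamination (%)
    # Reads table rows on stdin; prints bin id, completeness, contamination.
    awk -v min_comp="$1" -v max_cont="$2" '
        NF == 15 && $(NF-2) + 0 > min_comp && $(NF-1) + 0 < max_cont {
            print $1, $(NF-2), $(NF-1)
        }
    '
}

# Three rows copied from the output posted in this thread.
rows='UniRef90_1 k__Bacteria (UID203) 5449 104 58 4 17 19 13 7 44 95.34 283.45 6.85
UniRef90_14 k__Bacteria (UID2570) 433 267 178 61 128 56 13 8 1 79.42 42.78 17.65
UniRef90_73 f__Rhodobacteraceae (UID3356) 67 615 329 451 164 0 0 0 0 26.65 0.00 0.00'

# Completeness > 70% and contamination < 20%: none of these rows qualify.
printf '%s\n' "$rows" | filter_bins 70 20
```

Header, separator, and truncated rows have a different field count and are skipped, so the full CheckM output can be piped in unchanged.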

You meant "0% contamination", no?

written 15 months ago by pablo140

No, 0% completeness: CheckM can't find the marker proteins it's looking for, but other than that the assembly looks good in terms of size and N50.

written 15 months ago by Asaf8.4k

OK, I got it. Is there a file where the output is stored? I can't find it.

written 15 months ago by pablo140
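
By default CheckM prints the summary table to the terminal. To my knowledge, lineage_wf accepts -f to write that table to a file and --tab_table to make it tab-separated; it is worth confirming with checkm lineage_wf -h on the installed version. A sketch that only assembles and prints such a command (the checkm_cmd helper and the file names are hypothetical):

```shell
checkm_cmd() {
    # $1 = bin directory, $2 = output directory, $3 = file for the results table
    # -g: protein input, -x fasta: file extension of the bins
    echo "checkm lineage_wf -g -x fasta --tab_table -f $3 $1 $2"
}

# Print the command for review; paste it into the terminal to run it.
checkm_cmd bins checkm checkm_results.tsv
```

The table written this way has the same columns as the one printed on screen, which makes it easier to sort or filter afterwards.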

Hello, do you know how to interpret the # genomes, # markers, and # marker sets columns?

modified 11 months ago • written 11 months ago by vm.higareda20

I think you should open a new question, with some more background, if you're still struggling.

written 11 months ago by Asaf8.4k

Hi! Can you post the command that you ran in the bash terminal? :)

written 8 months ago by biohacker_tobe40
1
15 months ago by
darbinator220
darbinator220 wrote:

According to the README:

By default, CheckM assumes genomes consist of contigs/scaffolds in nucleotide space and that the files to process end with the extension fna.

Example Usage

Assume you have putative genomes in the directory /home/donovan/bins with fa as the file extension and want to store the CheckM results in /home/donovan/checkm. To process these genomes with 8 threads, simply run:

checkm lineage_wf -t 8 -x fa /home/donovan/bins /home/donovan/checkm

Did you try specifying '-x fasta'?

written 15 months ago by darbinator220
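
Because the README's default assumes nucleotide contigs, it can help to check what a file actually contains before choosing options. A rough heuristic sketch, assuming plain FASTA with no trailing whitespace: any sequence character outside the nucleotide codes suggests protein. The looks_like_protein and kind_of helpers are made-up names, and the nucleotide record is invented for comparison.

```shell
looks_like_protein() {
    # Drop header lines, then look for any character that is not a
    # nucleotide code (A, C, G, T, U, N), a gap or a stop symbol.
    grep -v '^>' "$1" | grep -q '[^ACGTUNacgtun*-]'
}

kind_of() {
    if looks_like_protein "$1"; then echo protein; else echo nucleotide; fi
}

tmp=$(mktemp -d)
# Header and first sequence line of the first record from the question.
printf '%s\n' '>UniRef90_A0A1B2YXP8 - Cluster: Uncharacterized protein' \
    'MRILRNFLGLFLLTAFIFSCVDENESNADFVDTISEPTNISALVSISQDNTGLVTIIPTG' > "$tmp/protein.fasta"
# Made-up nucleotide record for comparison.
printf '%s\n' '>contig_1' 'ACGTACGTTTGACCANNNNGT' > "$tmp/contig.fna"

prot_kind=$(kind_of "$tmp/protein.fasta")
nuc_kind=$(kind_of "$tmp/contig.fna")
echo "protein.fasta: $prot_kind; contig.fna: $nuc_kind"
rm -rf "$tmp"
```

Files that come out as protein are the ones that need -g in addition to the matching -x extension.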

Yes, I tried, but I always get the same result. I don't understand why.

modified 15 months ago • written 15 months ago by pablo140

I only needed to add the -g option to work on protein sequences.

written 15 months ago by pablo140