Question: Problems with installation of Prodege decontamination software
2.6 years ago, tans0307 wrote:

Hello people,

I have tried downloading the stand-alone version of Prodege but I am having some issues.

Untarring database files
nt_euks.00.nhr
nt_euks.00.nin
nt_euks.00.nsq
nt_euks.01.nhr
nt_euks.01.nin
nt_euks.01.nsq
nt_euks.02.nhr
nt_euks.02.nin
nt_euks.02.nsq
nt_euks.03.nhr
nt_euks.03.nin
nt_euks.03.nsq
nt_euks.04.nhr
nt_euks.04.nin
nt_euks.04.nsq
nt_euks.05.nhr
nt_euks.05.nin
nt_euks.05.nsq
nt_euks.06.nhr
nt_euks.06.nin
nt_euks.06.nsq
nt_euks.07.nhr
nt_euks.07.nin
nt_euks.07.nsq
nt_euks.08.nhr
nt_euks.08.nin
nt_euks.08.nsq
nt_euks.09.nhr
nt_euks.09.nin
nt_euks.09.nsq
nt_euks.10.nhr
nt_euks.10.nin
nt_euks.10.nsq
nt_euks.11.nhr
nt_euks.11.nin
nt_euks.11.nsq
nt_euks.12.nhr
nt_euks.12.nin
nt_euks.12.nsq
nt_euks.13.nhr
nt_euks.13.nin
nt_euks.13.nsq
nt_euks.nal
imgdb.00.nhr
imgdb.00.nin
imgdb.00.nsq
imgdb.01.nhr
imgdb.01.nin
imgdb.01.nsq
imgdb.02.nhr
imgdb.02.nin
imgdb.02.nsq
imgdb.03.nhr
imgdb.03.nin
imgdb.03.nsq
imgdb.04.nhr
imgdb.04.nin
imgdb.04.nsq
imgdb.05.nhr
imgdb.05.nin
imgdb.05.nsq
imgdb.06.nhr
imgdb.06.nin
imgdb.06.nsq
imgdb.07.nhr
imgdb.07.nin
imgdb.07.nsq
imgdb.08.nhr
imgdb.08.nin
imgdb.08.nsq
imgdb.09.nhr
imgdb.09.nin
imgdb.09.nsq
imgdb.10.nhr
imgdb.10.nin
imgdb.10.nsq
imgdb.11.nhr
imgdb.11.nin
imgdb.11.nsq
imgdb.12.nhr
imgdb.12.nin
imgdb.12.nsq
imgdb.13.nhr
imgdb.13.nin
imgdb.13.nsq
imgdb.14.nhr
imgdb.14.nin
imgdb.14.nsq
imgdb.15.nhr
imgdb.15.nin
imgdb.15.nsq
imgdb.16.nhr
imgdb.16.nin
imgdb.16.nsq
imgdb.17.nhr
imgdb.17.nin
imgdb.17.nsq
imgdb.18.nhr
imgdb.18.nin
imgdb.18.nsq
imgdb.19.nhr
imgdb.19.nin
imgdb.19.nsq
imgdb.20.nhr
imgdb.20.nin
imgdb.20.nsq
imgdb.21.nhr
imgdb.21.nin
imgdb.21.nsq
imgdb.22.nhr
imgdb.22.nin
imgdb.22.nsq
imgdb.23.nhr
imgdb.23.nin
imgdb.23.nsq
imgdb.24.nhr
imgdb.24.nin
imgdb.24.nsq
imgdb.25.nhr
imgdb.25.nin
imgdb.25.nsq
imgdb.26.nhr
imgdb.26.nin
imgdb.26.nsq
imgdb.27.nhr
imgdb.27.nin
imgdb.27.nsq
imgdb.28.nhr
imgdb.28.nin
imgdb.28.nsq
imgdb.29.nhr
imgdb.29.nin
imgdb.29.nsq
imgdb.30.nhr
imgdb.30.nin
imgdb.30.nsq
imgdb.31.nhr
imgdb.31.nin
imgdb.31.nsq
imgdb.nal
Formatting blast database


Building a new DB, current time: 01/13/2017 01:40:54
New DB name:   nt_euks
New DB title:  nt_euks.fna
Sequence type: Nucleotide
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1000000000B
BLAST options error: File nt_euks.fna does not exist
rm: cannot remove 'nt_euks.fna': No such file or directory


Building a new DB, current time: 01/13/2017 01:40:54
New DB name:   imgdb
New DB title:  imgdb.fna
Sequence type: Nucleotide
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1000000000B
BLAST options error: File imgdb.fna does not exist
rm: cannot remove 'imgdb.fna': No such file or directory
prodege_install.sh: 100: prodege_install.sh: [[: not found
prodege_install.sh: 100: prodege_install.sh: -e: not found
R packages not installed.  ProDeGe installation unsuccessful.

I have checked that Blast+ and R are both installed and added to my ~/.bashrc.

I would appreciate any advice on this.

Thank you!

modified 2.5 years ago by Biostar • written 2.6 years ago by tans0307

Looking at the files you appear to have downloaded pre-created blast index files. You do not need to create the blast indexes again (in case you are trying to re-run that step).
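As a rough way to confirm that the pre-built indexes are complete, the sketch below counts how many volumes of a database have all three nucleotide index files (.nhr, .nin, .nsq). It assumes the files were untarred into the current directory; the function name count_index_volumes is made up for this illustration and is not part of ProDeGe or BLAST+.

```shell
# count_index_volumes DB_NAME [DIR]
# Hypothetical helper: prints how many volumes of DB_NAME in DIR have all
# three nucleotide index files (.nhr, .nin, .nsq). DIR defaults to ".".
# The body runs in a subshell, so its variables do not leak to the caller.
count_index_volumes() (
    db=$1
    dir=${2:-.}
    n=0
    for f in "$dir/$db".*.nsq; do
        [ -e "$f" ] || continue          # the glob matched nothing
        base=${f%.nsq}
        if [ -e "$base.nhr" ] && [ -e "$base.nin" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
)

# Database names taken from the install log above.
for db in nt_euks imgdb; do
    echo "$db: $(count_index_volumes "$db") complete volume(s)"
done
```

If blastdbcmd from BLAST+ is on your PATH, running "blastdbcmd -db nt_euks -info" from the same directory should also print a summary of the database when the indexes are usable.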

> added to my ~/.bashrc

Does which blastn or which R return the correct locations for these programs?

modified 2.6 years ago • written 2.6 years ago by genomax

@genomax2, thanks for your reply.

Just to clarify things:

1.) Do I no longer need to run sh prodege_install.sh?

2.) How do I define the correct locations?

Which blastn: /home/tanshiming/tools/ncbi-blast-2.2.28+/bin/blastn

Which R: /usr/bin/R

Many thanks for your patience in this. :)

written 2.6 years ago by tans0307

blast and R indeed appear to be available in your $PATH. So that part is fine.

The error is about an R package not being installed. Do you know which R package, or is ProDeGe itself an R package? (Sorry, I am not familiar with this program.)

So you are only running the install script (and not downloading these blast indexes manually), and that is what generates the error?
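One possible reading of the log, offered as an assumption since the install script itself is not shown here: "[[: not found" is the message a plain POSIX sh (for example dash, the default sh on Debian and Ubuntu) prints when a script uses the bash-only [[ ... ]] test. If line 100 of prodege_install.sh is the R package check, that check could fail for this reason alone. The sketch below tests whether a given interpreter understands [[ ]]; the function name supports_double_bracket is made up for this example.

```shell
# supports_double_bracket INTERPRETER
# Hypothetical helper: succeeds if INTERPRETER accepts the [[ ... ]] test.
# A shell without [[ prints "[[: not found" on stderr and returns non-zero.
supports_double_bracket() {
    "$1" -c '[[ -e / ]]' 2>/dev/null
}

if supports_double_bracket sh; then
    echo "sh understands [[ ]]; the install error likely has another cause"
else
    echo "sh does not understand [[ ]]; try: bash prodege_install.sh"
fi
```

If the second message is printed, running the installer as "bash prodege_install.sh" instead of "sh prodege_install.sh" would be the first thing to try.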

modified 2.6 years ago • written 2.6 years ago by genomax

@genomax2, these are the requirements for the installation of Prodege:

  • Blast+ 2.2.28
  • Perl 5.16.0 (with modules Bio::SeqIO and Bio::Perl)
  • Prodigal 2.50
  • R 3.0.1

The BLASTN_EXE environment variable is used to run blastn, R_EXE to run R, and PRODIGAL_EXE to run prodigal. Else, please use 'module load' to set up your environment.
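A minimal sketch of setting the three variables named in these requirements, assuming each tool is already on your PATH. The variable names come from the requirements text; the helper set_tool_var is made up for this example, and whether ProDeGe expects a full path or just a command name in these variables is an assumption.

```shell
# set_tool_var VAR TOOL
# Hypothetical helper: exports VAR as the location of TOOL if it is on PATH,
# otherwise prints a warning and leaves VAR unset.
set_tool_var() {
    _path=$(command -v "$2") || {
        echo "warning: $2 not found on PATH; $1 left unset" >&2
        return 1
    }
    export "$1=$_path"
    echo "$1=$_path"
}

# Variable names as given in the ProDeGe requirements quoted above.
set_tool_var BLASTN_EXE blastn || true
set_tool_var R_EXE R || true
set_tool_var PRODIGAL_EXE prodigal || true
```

Any warning printed here points at a tool that still needs to be installed or added to PATH before the install script is run again in the same shell session.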

I was running the install script, which downloaded the databases and then failed with an error at the end!

written 2.6 years ago by tans0307

Rather than telling you how to run a standalone version of ProDeGe, which I can't do, can I ask you to explain exactly what it is you're trying to do? We do use ProDeGe in one of our pipelines, but it is no longer supported. However, in some cases, there may be alternatives.

written 2.6 years ago by Brian Bushnell

@Brian, Thanks for your reply.

I have generated contigs from an MDA-ed sample that was enriched via cell sorting. However, in my negative controls (no template), I have noticed the presence of artefact sequences. As such, I would like to remove these artefact sequences through binning. The Prodege tool seems to fit the description of what I am trying to achieve.

I am open to other suggestions that you might have. :)

Thank you.

written 2.6 years ago by tans0307

1) What sequencing platform are you using?
2) What kind of organism is it?
3) Was this library multiplexed with others?
4) Was it prepared on the same plate with others?
5) What kind of artifacts have you found?  E.g., genomic or synthetic, and if genomic, have you BLASTed them?  Note that for some purposes of classification ProDeGe has a 10kbp length cutoff.
6) I guess, generally, any other details you can provide of the experiment overall are helpful, and - 
7) The conditions of gathering, sample-prep, barcoding, sequencing, and demultiplexing that might lead to any kind of cross-contamination are also helpful.
8) Lastly - how much time can be spent on this?  By which I mean, if you have one sample you want to decontaminate, that's very different from designing an automated pipeline to handle dozens of samples per day automatically, which was the goal of ProDeGe.

Assume that any two things that are ever in the same room (not necessarily at the same time) will contaminate each other, and anything within 1 m of a sample will contaminate it; it's really just a matter of degree (I would guess the degree is a quadratic function of distance and linear in time). Also, assume all of your reagents are contaminated (they are). ProDeGe is specifically for removing large assembled contigs that appear to have a different taxonomy than the organism of interest, which is just a small subset of contamination outcomes. But your artifacts have a huge number of possible sources, and the best approach to removing them depends on the source and degree. So the better idea you have about the possible sources of contamination, the easier decontamination is. If you can BLAST your artifact reads and find out exactly what they are, decontamination becomes trivial and you should do it manually rather than using ProDeGe, unless you need to automate the process.

modified 2.6 years ago • written 2.6 years ago by Brian Bushnell

Dear @Brian, sorry for the tardy response.

1.) The sequencing was performed using an Illumina HiSeq 2500.

2.) The organism is a bacterium that is unclassified at the genus level.

3.) The library was multiplexed with others

4.) I am not sure about this because the sequencing was performed by someone else, but I can find out.

5.) I did a blast and all the contigs are synthetic sequences (I guess I need a software that could do this decontamination).

Other information:

I did a FISH-FACS sorting of 1000 cells from an environmental sample. 16S rRNA of the Hi-Seq reads show a purity of >99%. However, due to MDA, artefact sequences appear when I perform a de novo assembly, and I would like to remove them. Because the target cells are novel, I am not sure whether taxonomy- or homology-based tools are the best way to go. I would appreciate any advice I could get here.

written 2.6 years ago by tans0307

> I did a blast and all the contigs are synthetic sequences (I guess I need a software that could do this decontamination).

Unfortunately, neither ProDeGe nor any other decontamination tool will help you in this case. It sounds like a library failure. ProDeGe will only separate contigs, so if you have no contigs of your organism, it won't give you any output. What are the synthetic things matching the contigs?

Since you have the synthetic sequences, though, you could use BBDuk to remove all the corresponding reads and try to assemble what's left, if anything. For example:

bbduk.sh in=reads.fq out=clean.fq ref=synthetic.fa k=31

> 16S rRNA of the Hi-Seq reads show a purity of >99%.

Not sure what you mean by this. Normally you evaluate rRNA in single-cell MDA libraries using Sanger. Can you elaborate?

Generally, if you are multiplexing MDA-amplified single-cells, you will get crosstalk due to barcode miscalls, barcode contamination/impurity, chimerism, and so forth, that will assemble into contaminant contigs if the crosstalk level is sufficient (which we find that it is, using standard Illumina library-prep approaches). You can remove this low-level cross-contamination with BBMap's crossblock tool (run crossblock.sh for usage information); it is designed exactly for this situation. It is not, however, a universal decontamination utility and only deals with cross-contamination from another pooled library (all pooled libraries must be processed together).

written 2.6 years ago by Brian Bushnell

Dear @Brian Bushnell,

Do you know of any literature showing that cross-library contamination is common with Illumina library-prep approaches?

written 2.4 years ago by tans0307

No, I am not aware of any published studies of the issue, though JGI might publish our data at some point. Note that it is not exactly a library-prep issue, though - cross-contamination occurs at many points, including during and after library-prep. But, for example, we have in the past had cross-contamination occurring on the robots used for preparing plates due to improper fluid levels, and that was some of the worst contamination.

written 2.4 years ago by Brian Bushnell

Dear @Brian,

After analyzing my contigs using the ACDC software, I found that contigs from a sample that mine was multiplexed with were present in my sample of interest. Do you have any suggestions on how I could go about tracing the source of this contamination?

Thanks!

written 2.4 years ago by tans0307

Tracing the origin is really difficult. Some of the things you can investigate are:

1) What's the coverage of the contaminant versus target organism?
2) How similar are the barcodes?  E.g., for dual barcodes, do both samples share one of the barcodes?
3) Are these samples adjacent to each other on the plate?
4) Were other samples also contaminated with the same library?
5) How was demultiplexing done - e.g., were mismatches allowed in barcodes?

Sometimes, those can give you an idea of where the contamination may have occurred.
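For the barcode questions in the list above, a simple starting point is to count the mismatching positions (the Hamming distance) between the two samples' barcodes. If the demultiplexer allowed m mismatches, two barcodes that differ at 2m positions or fewer can both be within m mismatches of the same read barcode. The sketch below is plain POSIX sh; the function name hamming and the barcodes are made up for illustration.

```shell
# hamming A B
# Hypothetical helper: prints the number of positions at which two
# equal-length barcodes differ. The body runs in a subshell, so "exit"
# only leaves the function and variables do not leak to the caller.
hamming() (
    a=$1
    b=$2
    d=0
    if [ "${#a}" -ne "${#b}" ]; then
        echo "barcodes differ in length" >&2
        exit 1
    fi
    while [ -n "$a" ]; do
        ca=${a%"${a#?}"}                 # first character of a
        cb=${b%"${b#?}"}                 # first character of b
        [ "$ca" = "$cb" ] || d=$((d + 1))
        a=${a#?}                         # drop the first character
        b=${b#?}
    done
    echo "$d"
)

# Made-up example barcodes that differ only at the fifth position.
hamming ACGTAC ACGTTC    # prints 1
```

For dual-indexed libraries, running this once per index shows whether the two samples share one barcode outright (distance 0) or only have similar ones.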

written 2.4 years ago by Brian Bushnell

Hello @Brian,

I would like to clarify a few things. The synthetic artefacts were produced from an MDA amplification performed on sterile PBS (no genomic templates). Using the RiboTagger software (https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-1378-x), no RiboTags were observed in the sample. This strongly suggests that the PBS was indeed sterile.

When I did a de novo assembly of the reads that were generated from this negative control, contigs up to 6 kbp were produced. A blast search showed that they could not be annotated. I suspect the artefact sequences are a by-product of MDA amplification (http://www.nature.com/nprot/journal/v1/n4/full/nprot.2006.326.html)

Therefore, I predict that these artefact sequences would also be produced in an actual sample that contains cells. So the goal here is really to remove these artefacts. The tricky part is that the cells belong to a novel genus.

> Generally, if you are multiplexing MDA-amplified single-cells, you will get crosstalk due to barcode miscalls, barcode contamination/impurity, chimerism, and so forth, that will assemble into contaminant contigs

How often does this happen?

written 2.6 years ago by tans0307

Please use ADD REPLY/ADD COMMENT to respond to existing posts to keep threads logically organized.

written 2.6 years ago by genomax

Will take note of that, @genomax2

written 2.6 years ago by tans0307