Question: MaSuRCA mate pair libraries crashing
2.4 years ago, eischzj12 wrote:

Hello! I'm having a difficult time figuring out why my MaSuRCA run keeps crashing. I've run it twice now, and each run lasted at least 5 days. The first time I killed the job myself with qdel because of a mismatch between the 'NUM_THREADS' setting in my config and the 'ppn' value in my qsub script. The second time it crashed on its own, and I got no output file telling me what went wrong.

The files that were generated include: combined_0, cutoff.txt, environment.sh, error_correct.log, meanAndStdevByPrefix.pe.txt, pa.renamed.fastq, pe.cor.fa, and pe_data.tmp.

I cross-referenced these with this source (http://www.genome.umd.edu/docs/MaSuRCA_QuickStartGuide.pdf) to make sure that I wasn't missing anything, but I didn't find anything useful. Can I learn anything about my run from these files? If not, what should my next step be?

Here are the contents of my config.txt file:

DATA
PE = pa 500 75 /myPath/GSF1092-P1-ampc_S14_R1_001.fastq.gz /myPath/GSF1092-P1-ampc_S14_R2_001.fastq.gz
END

PARAMETERS
GRAPH_KMER_SIZE=auto
USE_LINKING_MATES=1
NUM_THREADS=32
JF_SIZE=22500000000
DO_HOMOPOLYMER_TRIM=0
END

And my qsub:

#!/bin/bash --login
#PBS -N masurca_qsub
#PBS -j oe
#PBS -m abe
#PBS -M email
#PBS -q default
#PBS -l nodes=1:ppn=32

workdir=myPath2
cd $workdir

./assemble.sh

Thanks in advance!
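
A thread-count mismatch like the one behind the first qdel can be caught before submitting. The following is only a minimal sketch: `check_threads` is an illustrative helper (not part of MaSuRCA or PBS), and it assumes the config contains a `NUM_THREADS=<n>` entry and the submission script a `ppn=<n>` request, as in the files above.

```shell
# Hypothetical helper: compare NUM_THREADS in a MaSuRCA config
# with the ppn value requested in a PBS submission script.
check_threads() {
  local config="$1" qsub="$2"
  local threads ppn
  # Take the first NUM_THREADS=<n> found in the config.
  threads=$(sed -n 's/.*NUM_THREADS *= *\([0-9][0-9]*\).*/\1/p' "$config" | head -n 1)
  # Take the first ppn=<n> found in the submission script.
  ppn=$(sed -n 's/.*ppn=\([0-9][0-9]*\).*/\1/p' "$qsub" | head -n 1)
  if [ -z "$threads" ] || [ -z "$ppn" ]; then
    echo "could not find NUM_THREADS or ppn" >&2
    return 2
  fi
  if [ "$threads" -eq "$ppn" ]; then
    echo "ok: NUM_THREADS=$threads matches ppn=$ppn"
  else
    echo "mismatch: NUM_THREADS=$threads but ppn=$ppn"
    return 1
  fi
}

# Usage (file names are placeholders):
#   check_threads config.txt masurca.qsub
```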

masurca mate pair assembly • 1.0k views
written 2.4 years ago by eischzj120

The error message or log is needed to know the reason. Most crashes in de novo assemblies are caused by insufficient RAM. Could you manually set the k-mer size (GRAPH_KMER_SIZE) to a lower value and give it a try? Is your genome size ~2.25 Gb? If so, Jellyfish itself might crash at the very start because of insufficient RAM. Without the error log, nothing can be said for certain.
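
Because the diagnosis depends on having a log, one option is to capture everything the assembly script prints, plus its exit status, in a file of your own instead of relying on the scheduler's output file. This is a minimal sketch; `run_logged` and the log file name are illustrative choices, not MaSuRCA features.

```shell
# Hypothetical wrapper: run a command, send stdout and stderr to a log file,
# and append the exit status so a silent crash still leaves a trace.
run_logged() {
  local log="$1"
  shift
  local status=0
  "$@" > "$log" 2>&1 || status=$?
  echo "exit status: $status" >> "$log"
  return "$status"
}

# Usage inside the qsub script (log name is a placeholder):
#   run_logged assemble.log ./assemble.sh
```

An exit status above 128 usually means the process was killed by a signal, which is what one would typically see when a job is terminated for exceeding its memory.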

written 2.4 years ago by Rohit1.3k

It looks like a RAM problem: the run has not even produced the Jellyfish output yet. You should check "error_correct.log".

written 2.4 years ago by caizexi12350

Thank you both, that is very helpful! I've checked the error_correct.log and noticed that most of the content is that it had "skipped pa(some number): no high quality mer". Occasionally it will say "skipped pa(some number): contaminated read". Are these things that you would expect for a RAM issue? Again I tried to research this problem myself, but there isn't much information out there that says what these mean.
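
To see how the skipped reads break down by reason, the log can be tallied. This is a rough sketch that assumes each relevant line has the form `skipped <read name>: <reason>`, as quoted above; the actual log layout may differ between MaSuRCA versions, and `summarize_skips` is an illustrative name.

```shell
# Hypothetical helper: count "skipped <read>: <reason>" lines per reason.
summarize_skips() {
  # Split each line at the first ": " and tally the text after it.
  awk -F': *' 'tolower($1) ~ /^skipped / { n[$2]++ }
       END { for (r in n) printf "%d\t%s\n", n[r], r }' "$1" | sort -rn
}

# Usage:
#   summarize_skips error_correct.log
```

Comparing these counts with the total number of reads shows whether only a small fraction of reads is being dropped or most of the library.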

written 2.4 years ago by eischzj120

It is not a RAM issue; your data quality does not seem to be good. Did you check the data quality beforehand, and was any pre-processing involved? Checking data quality comes first, then pre-processing, then assembly.

written 2.4 years ago by Rohit1.3k

I did pre-processing of my mate pairs via Trimmomatic, but someone recommended to me that I not trim my data as masurca has a built-in error correction. Should I rerun masurca with the trimmed data?

written 2.4 years ago by eischzj120

I have to ask: is that mate-pair or paired-end data? Are you trying to run the assembly directly on mate-pair data? You should first check the duplication rates of your reads, and also other quality metrics such as overrepresented sequences. As the MaSuRCA log already suggests, there seems to be contamination too.
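
For a quick first look at duplication without a dedicated QC tool, exact duplicate sequences can be counted in a sample of reads. This is a rough sketch only: it treats a read as a duplicate when its sequence is identical to an earlier one in the same file, looks at one mate at a time, and checks only the first `max_reads` reads (default one million, an arbitrary choice). `duplication_rate` is an illustrative name.

```shell
# Hypothetical helper: print "<duplicates> <reads examined> <percent>"
# for the first max_reads reads of a gzipped FASTQ file.
duplication_rate() {
  local fastq_gz="$1" max_reads="${2:-1000000}"
  gzip -dc "$fastq_gz" | awk -v max="$max_reads" '
    NR % 4 == 2 {                 # line 2 of each 4-line record is the sequence
      total++
      if (seen[$0]++) dup++       # sequence already seen earlier
      if (total >= max) exit      # stop after the sample; END still runs
    }
    END { if (total) printf "%d %d %.2f\n", dup, total, 100 * dup / total }'
}

# Usage:
#   duplication_rate /myPath/GSF1092-P1-ampc_S14_R1_001.fastq.gz
```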

written 2.4 years ago by Rohit1.3k

I'm almost certain that my data is mate-pair, but I'm not entirely certain due to the lack of information provided when I got this research project. How can I distinguish between mate-pair and paired-end?

written 2.4 years ago by eischzj120

A quick way to find out whether your reads are mate-pair is to look for the junction adapters of a mate-pair library:

Circularized Duplicate Junction Adapter

CTGTCTCTTATACACATCTAGATGTGTATAAGAGACAG

Circularized Single Junction Adapter

CTGTCTCTTATACACATCT

Circularized Single Junction Adapter Reverse Complement

AGATGTGTATAAGAGACAG

Run grep "<one of the adapter sequences above>" reads_file to see whether your reads contain a mate-pair library adapter (for gzipped files such as .fastq.gz, use zgrep instead). If your data are mate-pair, you will find the adapter sequence.
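
The adapter search can also be turned into a count, which is easier to interpret than raw grep output. A minimal sketch, assuming a gzipped FASTQ with four lines per record; `count_adapter_reads` is an illustrative name, and the default adapter is the single junction adapter listed above.

```shell
# Hypothetical helper: print "<reads with adapter> <total reads>"
# for a gzipped FASTQ file.
count_adapter_reads() {
  local fastq_gz="$1"
  local adapter="${2:-CTGTCTCTTATACACATCT}"
  gzip -dc "$fastq_gz" | awk -v a="$adapter" '
    NR % 4 == 2 {                    # sequence line of each record
      total++
      if (index($0, a)) hits++       # exact substring match
    }
    END { printf "%d %d\n", hits, total }'
}

# Usage (second argument is optional):
#   count_adapter_reads /myPath/GSF1092-P1-ampc_S14_R1_001.fastq.gz
#   count_adapter_reads /myPath/GSF1092-P1-ampc_S14_R1_001.fastq.gz AGATGTGTATAAGAGACAG
```

Only exact matches are counted, so reads with a sequencing error inside the adapter are missed and the result is a lower bound.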

modified 2.4 years ago • written 2.4 years ago by caizexi12350

The line "PE = pa 500 ..." in your DATA section declares the library as paired-end. That is wrong if the data really is mate-pair. Insert sizes for mate pairs are usually much larger, typically in the kilobase range, while paired-end inserts go up to about 700 bp. Try looking at the insert-size distribution and the duplication rates; both are high for mate pairs.

written 2.4 years ago by Rohit1.3k

Library construction differs between mate-pair and paired-end. Based on the insert size of "500", it looks like paired-end. You can either ask the people who sequenced the data or map your reads to a closely related species to estimate the insert size. PS: you can run MaSuRCA on pre-processed data even though the manual advises against it.
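
Once the reads are mapped to a related reference, the insert size can be estimated from the template length column (TLEN, column 9) of the SAM output. This is a minimal sketch: it assumes SAM text on standard input from whichever aligner you use, counts each pair once by taking records whose mate is on the same reference (RNEXT is "=") and whose TLEN is positive, and does not remove outliers, so chimeric pairs can inflate both numbers. `insert_size_stats` is an illustrative name.

```shell
# Hypothetical helper: read SAM on stdin and print the number of pairs
# plus the mean and standard deviation of their template lengths.
insert_size_stats() {
  awk '
    /^@/ { next }                       # skip SAM header lines
    $7 == "=" && $9 > 0 {               # mate on same reference, leftmost read
      n++; sum += $9; sumsq += $9 * $9
    }
    END {
      if (n == 0) { print "no usable pairs found"; exit 1 }
      mean = sum / n
      var = sumsq / n - mean * mean
      if (var < 0) var = 0              # guard against rounding error
      printf "pairs=%d mean=%.1f sd=%.1f\n", n, mean, sqrt(var)
    }'
}

# Usage, assuming bwa is installed and related_ref.fa (a placeholder name)
# has been indexed with "bwa index":
#   bwa mem related_ref.fa R1.fastq.gz R2.fastq.gz | insert_size_stats
```

A mean of a few hundred bp points to paired-end data, while a mean of several kb points to mate pairs; the two printed values are also what the mean and standard deviation fields of the DATA line expect.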

written 2.4 years ago by caizexi12350