Question: Need help for de novo assembly
Kiluah0 wrote:

Hi to everyone,

Up front: I'm quite new to Linux and to assembly in general.

I received my next-generation sequencing files from Eurofins. The method was Illumina paired-end 2×150. Each of the two FASTQ files contains almost 10 million sequences.

Somehow I managed to run SPAdes with my two files.

Dataset parameters:
Multi-cell mode (you should set '--sc' flag if input data was obtained with MDA (single-cell) technology or --meta flag if processing metagenomic dataset)
Reads:
    Library number: 1, library type: paired-end
      orientation: fr
      left reads: ['xx1.fastq']
      right reads: ['xx2.fastq']
      interlaced reads: not specified
      single reads: not specified
      merged reads: not specified
Read error correction parameters:
  Iterations: 1
  PHRED offset will be auto-detected
  Corrected reads will be compressed
Assembly parameters:
  k: automatic selection based on read length
  Repeat resolution is enabled
  Mismatch careful mode is turned ON
  MismatchCorrector will be used
  Coverage cutoff is turned OFF
Other parameters:
  Dir for temp files: xx/tmp
  Threads: 16
  Memory limit (in Gb): 15

(1) Are those parameters right? I don't understand the difference between paired-end, mate-pair and interlaced mode. We ordered paired-end sequencing and I received two files called xx_1.fastaq.gz and xx_2.fastaq.gz. Since I got two files, I think they are not interlaced, am I right? What about the other modes? And another point: are my files fr, rf or ff? I don't even know where to get this information from.

(2) If my parameters are right and SPAdes ran through my files, I want to map the result with Bowtie2. I indexed my reference genome, but which files from SPAdes should I use for that? I assume contigs.fasta, but what exactly is scaffolds.fasta, and what are all the other files? I ran bowtie2 with the following command and got this output:

$ bowtie2 -x yy_REFERENCE -f -U  xx/results/contigs.fasta -S yy/SAM/alignment_contigs.sam
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
Aborted (core dumped)
(ERR): bowtie2-align exited with value 134

What does that mean? What is error value 134? I'm running Ubuntu 18.04 with 16 GB of RAM. I'm afraid it means that I have too little RAM; is that possible? I also have access to several clusters with more CPUs and RAM, but I have absolutely no clue how to run anything on them.

I had no problems running bowtie2 with the Lambda phage example files. It is also hard for me to find information about all this, so I would really appreciate it if any of you could recommend good books, papers or tutorials.

I know these are a lot of questions, but I hope you can help me.

I'm looking forward to your answers. Kiluah

ADD COMMENTlink modified 6 weeks ago by swbarnes25.2k • written 6 weeks ago by Kiluah0

(1) Mate-pair refers to how you prepare your libraries: mate reads can be something like 2 kb away from each other. Paired-end sequencing produces reads that are at most a few hundred bases apart. With both techniques you end up with 2 files (one for the forward reads, another for the reverse reads). Interlaced is where you have these 2 files in a single one.


So I guess you have paired-end reads that are not interlaced, as you describe.

As far as I remember, in classic Illumina paired-end sequencing the first read of the fragment is sequenced on the sense (forward) strand and the second on the antisense (reverse) strand.

So here you have FR. You can confirm this by checking the Illumina library preparation kit.
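One way to check the file layout yourself, as a minimal sketch: the helper names `read_ids`, `check_pairs` and `is_interleaved` are made up for illustration, and the sketch assumes standard four-line FASTQ records whose IDs either end in /1 and /2 or carry the read number after a space. If both files list the same read IDs in the same order, they are separate R1/R2 files; if consecutive records inside one file share an ID, that file is interlaced.

```shell
# Print the read ID (header up to the first space, without a trailing /1 or /2)
# for the first N records of a FASTQ file (plain or gzip-compressed).
read_ids() {
    file=$1
    n=${2:-1000}
    case "$file" in
        *.gz) gzip -cd "$file" ;;
        *)    cat "$file" ;;
    esac | awk -v n="$n" 'NR % 4 == 1 {
        id = $1
        sub(/\/[12]$/, "", id)
        print id
        if (++c >= n) exit
    }'
}

# Compare the first read IDs of two files record by record.
check_pairs() {
    if [ "$(read_ids "$1")" = "$(read_ids "$2")" ]; then
        echo "paired: read IDs match record by record"
    else
        echo "mismatch: files are not in the same order or not mates"
    fi
}

# Succeeds if the first two records of a single file share one read ID.
is_interleaved() {
    read_ids "$1" 2 | awk 'NR == 1 { a = $0 } NR == 2 { b = $0 }
                           END { exit !(NR == 2 && a == b) }'
}
```

Usage would look like `check_pairs reads_1.fastq.gz reads_2.fastq.gz` and `is_interleaved reads_1.fastq.gz && echo interlaced` (placeholder file names).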

(2) Scaffolds are contigs joined together by runs of N bases, so for mapping I would go for the contigs file, as Bowtie2 performs an end-to-end comparison.

As for your Bowtie2 issue: what was the command line used to generate the index, and what is the size of your contigs.fasta file? It looks like a classic memory issue, but 16 GB of RAM should be enough even for genomes like mouse or human.

What is the exact definition for scaffold?

Is it better to annotate contigs or scaffolds

Trinity strand specific: RF or FR

https://galaxyproject.org/tutorials/ngs/

ADD REPLYlink modified 6 weeks ago • written 6 weeks ago by Bastien Hervé3.9k

Hello Bastien,

the command for building the index was the same as in the Lambda example:

yy/bowtie2-build xx/reference/organism.fa organism_REFERENCE

After that I received the same six files as in the example. My contigs.fasta is 6.78 MB and has around 3000 nodes. Somewhere I read that bowtie2 expects a FASTA/FASTQ with a single line per sequence, but the SPAdes output is wrapped at 60 characters per line. Might this be the problem? I will try to change it and run bowtie2 again.

My NGS data is from a bacterium of ~5.5 Mb. So according to you it shouldn't be a RAM problem, right? But what is the problem then?

ADD REPLYlink modified 6 weeks ago • written 6 weeks ago by Kiluah0
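If you want to test the line-wrapping idea, a FASTA file can be unwrapped with plain awk. This is a minimal sketch; `unwrap_fasta` is a hypothetical helper name, and it assumes every record starts with a `>` header line. Note that later replies in this thread point to contig length rather than line wrapping as the cause of the crash.

```shell
# Join the wrapped sequence lines of each FASTA record into a single line.
unwrap_fasta() {
    awk '/^>/ { if (seq != "") print seq; print; seq = ""; next }
         { seq = seq $0 }
         END { if (seq != "") print seq }' "$1"
}
```

Usage would look like `unwrap_fasta xx/results/contigs.fasta > contigs_unwrapped.fasta`, where the output name is a placeholder.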

Please use the grey "add reply" button to reply to a comment. This keeps the thread readable and well organized. As you can see, I moved it, but it's not perfect.

Try to run bowtie2 without the -U option

bowtie2 -x yy_REFERENCE -f xx/results/contigs.fasta -S yy/SAM/alignment_contigs.sam
ADD REPLYlink written 6 weeks ago by Bastien Hervé3.9k

Still the same issue. Same error code:

terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
Aborted (core dumped)
(ERR): bowtie2-align exited with value 134
ADD REPLYlink written 6 weeks ago by Kiluah0

Generally that error signifies a memory problem. Have you tested bowtie2 with a small dataset? Take a couple of contigs (from contigs.fasta) and try running the program to see if it works. If it does not, then you will need to find alternative hardware.

ADD REPLYlink written 6 weeks ago by genomax65k
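On the question of what value 134 means: by shell convention, a process killed by a signal is reported as 128 plus the signal number. 134 would then correspond to signal 6, which is SIGABRT on Linux, and that fits the "Aborted (core dumped)" line printed after the uncaught std::bad_alloc. The sketch below reproduces the number without bowtie2 by letting a child shell abort itself; it assumes a Linux system.

```shell
# Let a child shell send SIGABRT to itself, then look at the status the parent sees.
# "ulimit -c 0" only prevents the child from leaving a core file behind.
status=0
sh -c 'ulimit -c 0; kill -s ABRT "$$"' || status=$?

echo "exit status: $status"               # 128 + signal number
echo "signal number: $((status - 128))"   # 6 is SIGABRT on Linux
```

So the number itself only says that the program aborted; the std::bad_alloc message is what points to a failed memory allocation.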

The contigs.fasta file is 6.78 MB; that should be OK for Bowtie2.

ADD REPLYlink written 6 weeks ago by Bastien Hervé3.9k

You are right. It runs if I cut out all contigs >20 kb. So it seems the contigs themselves are too long. My longest contig is almost 500 kb.

If I remove some of the longest contigs I get this:

Error: Out of memory allocating 1073741824 __m128i's for DP matrix: 'std::bad_alloc'
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
Aborted (core dumped)
(ERR): bowtie2-align exited with value 134
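A back-of-the-envelope check of that message, assuming each __m128i is a 128-bit (16-byte) SSE value: the requested DP matrix alone would be as large as the whole 16 GB of RAM in the machine.

```shell
elements=1073741824       # number of __m128i values named in the error message
bytes_per_element=16      # assumption: one __m128i occupies 128 bits = 16 bytes
total_bytes=$((elements * bytes_per_element))

echo "$total_bytes bytes"
echo "$((total_bytes / 1024 / 1024 / 1024)) GiB"   # prints "16 GiB"
```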

So I tried removing even more, and once all remaining contigs are <20 kb I get this:

3052 reads; of these:
  3052 (100.00%) were unpaired; of these:
    2953 (96.76%) aligned 0 times
    35 (1.15%) aligned exactly 1 time
    64 (2.10%) aligned >1 times
3.24% overall alignment rate

But the alignment rate is pretty low, isn't it? I only removed the first 32 contigs. Is there a way to ask bowtie2 to show me just the aligned sequences?

ADD REPLYlink modified 6 weeks ago • written 6 weeks ago by Kiluah0
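On showing only the aligned sequences: Bowtie 2 has a `--no-unal` option that suppresses SAM records for reads that failed to align, and `samtools view -F 4` filters an existing SAM/BAM file the same way. If neither is at hand, the unmapped flag (bit 0x4 of the FLAG column) can be tested with awk. A minimal sketch; `aligned_only` is a hypothetical helper name:

```shell
# Keep SAM header lines and every record whose FLAG (column 2)
# does not have the 0x4 "segment unmapped" bit set.
aligned_only() {
    awk -F '\t' '/^@/ { print; next }
                 int($2 / 4) % 2 == 0' "$1"
}
```

Usage would look like `aligned_only yy/SAM/alignment_contigs.sam > aligned.sam`, where the output name is a placeholder.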

As swbarnes2 asked in an answer, why do you want to align your FASTA rather than your FASTQ files?

Give BWA-MEM or minimap2 a try if you still want to align your contigs.

http://lh3.github.io/2018/04/02/minimap2-and-the-future-of-bwa

ADD REPLYlink written 6 weeks ago by Bastien Hervé3.9k
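For the contig-versus-reference route, a minimap2 call could look like the sketch below. The input paths are the ones mentioned earlier in this thread, the output name is a placeholder, and the `asm5` preset is an assumption: it is meant for assemblies very close to the reference, so `asm10` or `asm20` may suit a more distant relative better. The command is printed first and only executed if minimap2 and both input files are actually present.

```shell
ref=xx/reference/organism.fa        # reference FASTA that was used for bowtie2-build
contigs=xx/results/contigs.fasta    # SPAdes contigs
out=contigs_vs_reference.sam        # placeholder output name

# -a: write SAM, -x: alignment preset, -o: output file
cmd="minimap2 -a -x asm5 -o $out $ref $contigs"
echo "$cmd"

if command -v minimap2 >/dev/null 2>&1 && [ -f "$ref" ] && [ -f "$contigs" ]; then
    $cmd
fi
```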
swbarnes25.2k wrote:

Are you aligning your contigs fasta to your reference? And not your fastqs?

ADD COMMENTlink written 6 weeks ago by swbarnes25.2k

Yes, I'm using the contigs.fasta which I got from SPAdes. Isn't that the right approach? As I said before, I'm absolutely new to the whole topic and there is no one around to help me out. SPAdes and bowtie2 were two programs that a friend of my supervisor suggested to us, so I decided to try them. I thought the workflow was to build contigs from the NGS files so that I have de novo sequences, map them to a reference which is more or less related, and check for SNPs afterwards. Did I misinterpret something?

I will have a look at BWA-MEM and minimap2 and also try to use my FASTQ files with bowtie2.

ADD REPLYlink modified 6 weeks ago • written 6 weeks ago by Kiluah0

You can't map contigs to a reference with a short read mapper. And since the contigs have no depth or quality information, you wouldn't want to try and call SNPs from them.

ADD REPLYlink written 6 weeks ago by swbarnes25.2k
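Mapping the reads themselves would look roughly like this. It is a sketch, not a tested pipeline: the FASTQ names are placeholders for the two files from the sequencing provider, the index name is the one used earlier in the thread, and the thread count simply mirrors the SPAdes run. SNP calling would then start from the resulting read alignment, which carries the depth and base qualities that contigs lack.

```shell
index=yy_REFERENCE            # bowtie2 index prefix used earlier in the thread
r1=xx_1.fastq.gz              # placeholder: first reads of each pair
r2=xx_2.fastq.gz              # placeholder: second reads of each pair
out=reads_vs_reference.sam    # placeholder output name

# -1/-2: paired-end input files, -p: threads, -S: SAM output
cmd="bowtie2 -p 16 -x $index -1 $r1 -2 $r2 -S $out"
echo "$cmd"

# Execute only if bowtie2 is installed and both read files exist.
if command -v bowtie2 >/dev/null 2>&1 && [ -f "$r1" ] && [ -f "$r2" ]; then
    $cmd
fi
```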