Question

What parameters should I take into consideration to choose the best assembly?


5.2 years ago

Bioinfo ▴ 20

Hello Biostars! I have a question.

I sequenced five strains with two platforms (HiSeq and MiSeq), so for each strain I have a HiSeq contigs file, a MiSeq contigs file and a mixed HiSeq + MiSeq contigs file. For example, for strain A the reads are Hiseq_A_R1.fastq, Hiseq_A_R2.fastq, Miseq_A_R1.fastq, Miseq_A_R2.fastq, Mixed_A_R1.fastq and Mixed_A_R2.fastq.

After assembly I have three results: Hiseq_A_Contigs.fasta, Miseq_A_Contigs.fasta and Mixed_A_contigs.fasta.

To choose which one to use for downstream analyses, I merged the assembly files of each strain into one file (cat Hiseq_A_Contigs.fasta Miseq_A_Contigs.fasta Mixed_A_contigs.fasta > All_A.fasta) and removed the repeated sequences with cdhit and seqkit. This did not give the result I expected: both the number of contigs and the number of nucleotides were too high. So what I want to know is: given the three assembly files for each strain, which parameters should I take into consideration, and in which order (which is the main parameter, which is the second, and so on)? I also ran BUSCO on the assemblies. So far I only have the total sequence length, the N50 and the BUSCO results.

I would also like to know whether merging the contig files from the different sequencing platforms and removing the repeated sequences was a good idea, and how I could remove duplicated sequences with other tools.

Please ask if anything needs clarification.

assembly Assembly sequencing sequence alignment • 1.2k views

ADD COMMENT • link updated 5.2 years ago by Michael 56k • written 5.2 years ago by Bioinfo ▴ 20

How good are the assemblies? I guess using both the MiSeq and HiSeq data would give the best results.

ADD REPLY • link 5.2 years ago by Asaf 10k

Hello, thank you for your reply. What do you mean by taking both MiSeq and HiSeq: using the mixed contigs file, or using the merged file after removing duplicates?

ADD REPLY • link 5.2 years ago by Bioinfo ▴ 20

The mixed one; basically what Michael wrote below.

ADD REPLY • link 5.2 years ago by Asaf 10k

score 2 · Answer 1 · 2020-05-06

Which assembly is best is debatable, and there is normally no single best solution. Of course you want long contigs, but what use are long contigs if they are incorrect? Whatever assembler you use (which one was it?), coverage is key. If you run the assembly separately on each half of the data, there may still be gaps and ambiguities in the assembly graph that the full coverage could have resolved. Therefore, I would use only the assembly based on all the available sequencing data, rather than running the assembly twice on half the data each. Some assemblers may also be sensitive to the insert size of the libraries during the final contig stitching step.
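To make the coverage point concrete, here is a minimal sketch of how the expected depth adds up when the two read sets are combined. The read-pair counts, read lengths and the 5 Mb genome size are made-up placeholders, not values from this thread, and `expected_coverage` is a hypothetical helper name.

```python
def expected_coverage(n_read_pairs: int, read_length: int, genome_size: int) -> float:
    """Average sequencing depth for a paired-end library.

    depth = total sequenced bases / genome size
    """
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    total_bases = 2 * n_read_pairs * read_length  # two reads per pair
    return total_bases / genome_size


# Placeholder numbers for illustration only (assumed, not from the question).
GENOME_SIZE = 5_000_000
hiseq_depth = expected_coverage(2_000_000, 150, GENOME_SIZE)
miseq_depth = expected_coverage(500_000, 250, GENOME_SIZE)
mixed_depth = hiseq_depth + miseq_depth  # depths add when the reads are pooled

print(f"HiSeq: {hiseq_depth:.0f}x  MiSeq: {miseq_depth:.0f}x  mixed: {mixed_depth:.0f}x")
```

With real data you would plug in the read counts from your FASTQ files and your best estimate of the genome size; the pooled depth is what the assembly on the mixed data gets to work with.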

With respect to metrics, if N50 is very low, few contigs will contain even a single complete gene. BUSCO is also useful for assessing completeness. Other relevant questions: Do you have other reads, e.g. RNA-seq, and what is their mapping rate against each assembly? Can you get information on the linkage of markers, e.g. a linkage map, and if so, how many contigs span different linkage groups? Could you get long reads, e.g. from Nanopore, and compare or re-assemble?
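If you want to put the three assemblies side by side on the basic contiguity numbers, something like the following would do. This is only a sketch: `contig_lengths`, `n50` and `summarize` are names I made up, and it assumes plain (uncompressed) FASTA input.

```python
def contig_lengths(lines):
    """Return the length of every record in a FASTA file.

    `lines` is any iterable of text lines, e.g. an open file handle.
    """
    lengths = []
    current = None  # None until the first header is seen
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Length L such that contigs of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty assembly


def summarize(lengths):
    """Collect the basic contiguity metrics in one dictionary."""
    return {
        "contigs": len(lengths),
        "total_length": sum(lengths),
        "longest": max(lengths, default=0),
        "n50": n50(lengths),
    }


# Tiny in-memory example instead of a real file.
example = [">c1", "ACGTACGT", "ACGT", ">c2", "ACGT", ">c3", "AC"]
print(summarize(contig_lengths(example)))
```

For a real comparison you would call `summarize(contig_lengths(open(path)))` on each of the three contig files of a strain. Keep in mind that these numbers say nothing about correctness, which is why BUSCO and mapping rates matter as well.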

Is the assembly size close to the expected genome size? To check this, you could also estimate the genome size with a k-mer analysis of the reads.
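A rough sketch of the k-mer idea, assuming you already have a k-mer depth histogram as two-column text ("depth count" per line, which is the kind of output a k-mer counter such as `jellyfish histo` writes). The function names, the `min_depth` cutoff of 5 and the toy histogram are my own assumptions. It uses the simplest estimate, total k-mers divided by the depth of the main peak, and ignores heterozygosity and repeats, so treat the result as a ballpark figure.

```python
def parse_histogram(lines):
    """Parse 'depth count' lines into a {depth: count} dictionary."""
    histogram = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        histogram[int(fields[0])] = int(fields[1])
    return histogram


def estimate_genome_size(histogram, min_depth=5):
    """Genome size ~ (total k-mers) / (depth at the main peak).

    k-mers seen fewer than `min_depth` times are treated as sequencing
    errors and ignored; the cutoff is an assumption and should be chosen
    by looking at the valley in your own histogram.
    """
    solid = {d: c for d, c in histogram.items() if d >= min_depth}
    if not solid:
        raise ValueError("no k-mers at or above min_depth")
    peak_depth = max(solid, key=lambda d: solid[d])
    total_kmers = sum(d * c for d, c in solid.items())
    return total_kmers / peak_depth


# Toy histogram for illustration only (not real data).
toy = ["1 1000000", "2 1000", "20 100", "25 400", "30 100"]
print(estimate_genome_size(parse_histogram(toy)))
```

If the estimate is far from the total length of an assembly, that assembly is probably either collapsed (too short) or contains redundant or contaminating contigs (too long).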

Hope this helps.