What parameters should I take into consideration to choose the best assembly?
4.0 years ago
Bioinfo ▴ 20

Hello Biostars! I have a question.

I sequenced five strains with two datasets (HiSeq data and MiSeq data), so for each strain I have a HiSeq contigs file and a MiSeq contigs file (and a mixed HiSeq + MiSeq contigs file). For example, for strain A I have Hiseq_A_R1.fastq, Hiseq_A_R2.fastq, Miseq_A_R1.fastq, Miseq_A_R2.fastq, Mixed_A_R1.fastq, and Mixed_A_R2.fastq.

After assembly I have three results: Hiseq_A_Contigs.fasta, Miseq_A_Contigs.fasta, and Mixed_A_contigs.fasta.

To choose which one to use for the next analyses, I tried the following: I merged the assembly files of each strain into one file (cat Hiseq_A_Contigs.fasta Miseq_A_Contigs.fasta Mixed_A_contigs.fasta > All_A.fasta) and removed the repeated sequences using CD-HIT and SeqKit, but this did not give the results I expected (the number of contigs and the number of nucleotides were both too high).

So what I want to know is: of the three assembly files for each strain, which parameters should I take into consideration, and in which order (that is, which is the principal parameter, which is the second, and so on)? I should mention that I also ran BUSCO to evaluate the assemblies. The only metrics I know are total sequence length, N50, and the BUSCO results.

I would also like to know whether merging the contig files from the different sequencing techniques and then removing the repeated sequences was a good idea, and how I could remove duplicated sequences using other techniques.

Please ask if you need any further clarification.

assembly Assembly sequencing sequence alignment • 758 views

How good are the assemblies? I guess using both MiSeq and HiSeq data would give the best results.


Hello, thank you for your reply. What do you mean by taking both MiSeq and HiSeq: taking the mixed contigs file, or taking the file after removing duplicates?


The mixed one; basically what Michael wrote below.

4.0 years ago
Michael 54k

Which assembly is best is debatable, and there is normally no single best solution. Of course you want long contigs, but what use are long contigs if they are incorrect? Whatever assembler you use (which one was it?), coverage is key. If you run the assembly twice, each time on half the coverage, there may still be gaps and ambiguities in the assembly graph that could otherwise have been resolved. Therefore, I would use only the assembly based on all the available sequencing data, and not run the assembly twice on half the data. Some assemblers might also be sensitive to the insert size of the libraries during the final contig stitching process.
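To make the coverage argument concrete, here is a minimal sketch of a back-of-the-envelope coverage estimate: total sequenced bases divided by the expected genome size. The function names and the tiny in-memory FASTQ records are hypothetical, and the code assumes strict 4-line FASTQ records; it is an illustration, not how any particular assembler computes coverage.

```python
from typing import Iterable


def total_bases(fastq_lines: Iterable[str]) -> int:
    """Sum the sequence lengths, assuming strict 4-line FASTQ records."""
    total = 0
    for index, line in enumerate(fastq_lines):
        if index % 4 == 1:  # the second line of each record is the sequence
            total += len(line.strip())
    return total


def mean_coverage(bases: int, genome_size: int) -> float:
    """Approximate mean depth: sequenced bases per genome position."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return bases / genome_size


# Hypothetical toy data standing in for a HiSeq and a MiSeq read file.
hiseq = ["@r1", "ACGTACGTAC", "+", "IIIIIIIIII"]
miseq = ["@r2", "ACGTACGTACGTACGTACGT", "+", "IIIIIIIIIIIIIIIIIIII"]

genome_size = 10  # assumed expected genome size, in bases
hiseq_cov = mean_coverage(total_bases(hiseq), genome_size)
miseq_cov = mean_coverage(total_bases(miseq), genome_size)
mixed_cov = mean_coverage(total_bases(hiseq) + total_bases(miseq), genome_size)
```

Because the bases simply add up, the mixed data set always has the summed depth of the two separate sets, which is why assembling everything together gives the assembler the most evidence to work with.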

With respect to metrics: if N50 is very low, few contigs may contain even a single complete gene. BUSCO is also useful for assessing completeness. Other relevant aspects: Do you have other reads, e.g. RNA-seq? What is their mapping rate? Can you get information on the linkage of markers, e.g. a linkage map, and what is the number of contigs that fall into different linkage groups? Could you get long reads, e.g. from Nanopore, and compare or re-assemble?
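For reference, N50 is the contig length such that contigs of at least that length contain at least half of the assembled bases. A minimal sketch of how contig count, total length, and N50 could be computed from a FASTA file follows; the function names are hypothetical, and dedicated tools report the same numbers with more detail.

```python
from typing import Dict, Iterable, List


def contig_lengths(fasta_lines: Iterable[str]) -> List[int]:
    """Return the sequence length of every record in FASTA-formatted lines."""
    lengths: List[int] = []
    current = None  # stays None until the first header line is seen
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths: List[int]) -> int:
    """Length L such that contigs of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty assembly


def summarize(fasta_lines: Iterable[str]) -> Dict[str, int]:
    """Collect the basic contiguity metrics for one assembly."""
    lengths = contig_lengths(fasta_lines)
    return {
        "contigs": len(lengths),
        "total_length": sum(lengths),
        "longest": max(lengths, default=0),
        "n50": n50(lengths),
    }
```

It could be run per assembly, for example `summarize(open("Mixed_A_contigs.fasta"))`, and the resulting numbers compared side by side together with the BUSCO completeness.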

Is the assembly size somewhere near the expected genome size? To check this, you could also do a k-mer analysis to estimate the genome size.
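The idea behind a k-mer based genome size estimate can be sketched as follows: count all k-mers in the reads, find the most common k-mer depth (the coverage peak), and divide the total number of k-mers by that depth. This is a simplified illustration with hypothetical function names; it assumes a haploid genome with few repeats, ignores reverse complements, and treats k-mers seen fewer than `min_depth` times as sequencing errors. In practice, dedicated k-mer counters such as Jellyfish combined with a histogram-fitting tool such as GenomeScope are the usual route.

```python
from collections import Counter
from typing import Iterable


def kmer_counts(reads: Iterable[str], k: int) -> Counter:
    """Count every k-mer in the reads (forward strand only, k-mers with N skipped)."""
    counts: Counter = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if "N" not in kmer:
                counts[kmer] += 1
    return counts


def estimate_genome_size(counts: Counter, min_depth: int = 2) -> float:
    """Estimate genome size as (total solid k-mers) / (peak k-mer depth)."""
    # Histogram: depth -> number of distinct k-mers observed at that depth.
    histogram = Counter(counts.values())
    # Drop low-depth k-mers, assumed here to be mostly sequencing errors.
    solid = {depth: n for depth, n in histogram.items() if depth >= min_depth}
    if not solid:
        raise ValueError("no k-mers at or above min_depth")
    # The peak is the depth shared by the largest number of distinct k-mers.
    peak_depth = max(solid, key=lambda depth: (solid[depth], depth))
    total_kmers = sum(depth * n for depth, n in solid.items())
    return total_kmers / peak_depth
```

The estimate can then be compared with the total length of each assembly; an assembly that is much larger or smaller than the estimate deserves a closer look.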

Hope this helps.


Thank you very much for your answer, and sorry for the late reply. I used Shovill for the assembly (it runs SPAdes internally). I should also mention that I ran SPAdes with a specific k-mer size (the one given by Unicycler), so for each sequencing technique I have two assembly files, Default and kmerX. For example: A_Hiseq_Contigs_File_Default, A_Hiseq_Contigs_File_kmerX, A_Miseq_Contigs_File_Default, A_Miseq_Contigs_File_kmerY, A_Mixed_Hiseq_Miseq_Contigs_File_Default.

What I understand from "If you ran the assembly twice on half the coverage, there might be still gaps and ambiguities in the assembly graph that could otherwise have been resolved. Therefore, I would use only the assembly based on all the sequencing data available, and not run the assembly twice on half the data" is that I should focus on the mixed results.

I don't have other reads or Nanopore sequences. I only have one reference sequence with a total size of 1.67 Mb, and I compared the total size of each assembly file with the reference, but I guess this is not the primary parameter to take into consideration.

Could you please tell me how (or with which tools) I can do these two things: "linkage-map, what is the amount of contigs that fall into different linkage groups?" and "you could also do a kmer analysis of the genome size"?

Thank you very much for your kind help.
