Hello Biostars! I have a question.
I sequenced five strains with two datasets (HiSeq data and MiSeq data), so for each strain I have HiSeq reads, MiSeq reads, and a mixed HiSeq + MiSeq set. For example, for strain A I have Hiseq_A_R1.fastq, Hiseq_A_R2.fastq, Miseq_A_R1.fastq, Miseq_A_R2.fastq, Mixed_A_R1.fastq and Mixed_A_R2.fastq.
After assembly I have three results: Hiseq_A_Contigs.fasta, Miseq_A_Contigs.fasta and Mixed_A_contigs.fasta.
To choose which one to use for the next analyses, I tried the following: I merged the assembly files of each strain into one file (cat Hiseq_A_Contigs.fasta Miseq_A_Contigs.fasta Mixed_A_contigs.fasta > All_A.fasta) and removed the repeated sequences using CD-HIT and SeqKit, but this didn't give me the results I expected (the number of contigs and the number of nucleotides were both too high). So what I want to know is: out of the three assembly files of each strain, which parameters should I take into consideration, and in which order (that is, which is the main parameter, which is the second one, and so on)? I should mention that I also ran BUSCO to assess the assemblies. So far I only know the total sequence length, the N50, and the BUSCO results.
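For anyone who wants to compare the three files on the same basic metrics mentioned above (number of contigs, total sequence length, N50), here is a minimal sketch in plain Python. The function names are made up for illustration, and this is not how any particular assembler or QC tool reports its numbers; it simply uses the usual definition of N50 as the length of the contig at which the running sum of lengths, sorted from longest to shortest, reaches half of the total.

```python
def read_fasta_lengths(lines):
    """Return the length of every record in an iterable of FASTA lines."""
    lengths = []
    current = None  # length of the record being read, None before the first header
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Length of the contig at which the cumulative sum reaches half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty assembly


def assembly_stats(lines):
    """Summarise one assembly: contig count, total length, N50, longest contig."""
    lengths = read_fasta_lengths(lines)
    return {
        "contigs": len(lengths),
        "total_length": sum(lengths),
        "n50": n50(lengths),
        "longest": max(lengths, default=0),
    }


if __name__ == "__main__":
    # Hypothetical usage with the file names from this post.
    for name in ("Hiseq_A_Contigs.fasta", "Miseq_A_Contigs.fasta", "Mixed_A_contigs.fasta"):
        try:
            with open(name) as handle:
                print(name, assembly_stats(handle))
        except FileNotFoundError:
            print(name, "not found")
```

Running this on each of the three files gives a side-by-side view of the same numbers, which can then be read together with the BUSCO completeness and duplication figures.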
I would also like to know whether merging the contig files from the different sequencing techniques and then removing the repeated sequences was a good idea, and how I could remove duplicated sequences using other techniques.
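To make clear what "removing repeated sequences" means in the simplest case, here is a minimal Python sketch of exact-duplicate removal that also treats a contig and its reverse complement as the same sequence. The function names are hypothetical and this is not what CD-HIT or SeqKit do internally. Note the limitation: only contigs that are identical over their full length are dropped, so contigs from different assemblies that overlap partially or differ by a few bases would all be kept, which may be one reason a merged file stays large.

```python
# Translation table for complementing DNA, including N and lowercase bases.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def parse_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def drop_exact_duplicates(records):
    """Keep the first copy of each sequence, ignoring case and strand."""
    seen = set()
    kept = []
    for header, seq in records:
        upper = seq.upper()
        # Canonical key: the smaller of the sequence and its reverse complement,
        # so both orientations of the same contig map to one key.
        key = min(upper, reverse_complement(upper))
        if key in seen:
            continue
        seen.add(key)
        kept.append((header, seq))
    return kept


if __name__ == "__main__":
    # Tiny made-up example: m1 is the reverse complement of h1, x1 is h1 in lowercase.
    demo = [">h1", "ACGTT", ">m1", "AACGT", ">x1", "acgtt", ">m2", "GGGA"]
    for header, seq in drop_exact_duplicates(parse_fasta(demo)):
        print(f">{header}\n{seq}")
```

Removing contigs that are merely similar or contained in longer ones requires clustering or alignment with an identity threshold, which this sketch does not attempt.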
Please let me know if anything needs clarification.
How good are the assemblies? I would guess that using both the MiSeq and HiSeq data would give the best results.
Hello, thank you for your reply. What do you mean by using both MiSeq and HiSeq: taking the mixed contigs file, or taking the file obtained after removing duplicates?
The mixed one, basically what Michael wrote below.