Parasite genome assembly
13 months ago

Hi All,

(1) I have been working on a parasite genome assembly using the BWA tool. I used the following command to run the assembly (paired-end Illumina short reads).

module load bwa/0.7.15

bwa mem -t 1 -M -R "@RG\tID:reads\tSM:AA_genome" reference_genome.fasta AA_genome1.fasta.gz AA_genome2.fasta.gz > AA_genome_aln-pe.sam


(2) I got AA_genome_aln-pe.sam as output, which is around 50 GB. I also tried to convert the sorted BAM file to FASTA format using

samtools bam2fq AA_genome.srt.bam | seqtk seq -A > AA_genome_assembly.fa


However, the final output that I got is 20 GB. My expected assembly size was approximately 50 MB. How can I get the final assembly at the desired size? Is there still something I am missing in the analysis?

Thank you

parasite assembly BWA genome
13 months ago
GenoMax 123k

Is there still something I am missing in the analysis?

bwa is an NGS read aligner, not a genome assembler. If you want to assemble the data, you are using the wrong program. To assemble your genome from sequence data, you should use something like SOAPdenovo or SPAdes. (Do you only have FASTA-format data, or did you convert the FASTQ files?)

If you are aligning to a reference genome (which seems to be the case above), then the size of the aligned data file has nothing to do with the size of the genome/assembly. That size simply reflects the alignments found for your reads against the reference.
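
To put rough numbers on that: the 20 GB FASTA in the question came from bam2fq, so it holds the reads themselves, not an assembly. A back-of-the-envelope sketch of the implied read depth follows. The function name approx_depth is made up for illustration, and treating every byte as a sequence base is an assumption that overestimates depth, since FASTA headers and newlines are counted too.

```shell
# Hypothetical helper: rough fold-coverage from total read bases and genome size.
# Both arguments are plain integers (bases); the result is rounded down.
approx_depth() {
  local total_bases="$1" genome_size="$2"
  echo $(( total_bases / genome_size ))
}

# About 20 GB of read sequence over a 50 Mb genome
approx_depth 20000000000 50000000
```

A read dump at several hundred times the genome size is what you would expect from deep sequencing, which is why its size says nothing about the assembly size.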

You can generate a consensus sequence from the bwa-aligned data file (the resulting consensus should be close in size to your reference). This thread will help with that: Generating consensus sequence from bam file
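
One common way to do this (not necessarily the one described in the linked thread) is to call variants with bcftools and apply them to the reference. The sketch below assumes samtools and bcftools are installed and that the BAM is already coordinate-sorted; the function name bam_to_consensus and the intermediate file name are made up for illustration.

```shell
# Hypothetical helper: consensus FASTA from a sorted BAM and its reference.
# Usage: bam_to_consensus reference_genome.fasta AA_genome.srt.bam AA_consensus.fa
bam_to_consensus() {
  local ref="$1" bam="$2" out="$3"
  local vcf="${out%.fa}.calls.vcf.gz"   # intermediate variant calls (assumed name)

  samtools index "$bam" &&
  # Pile up reads against the reference and call variant sites only.
  # Default ploidy is used here; a haploid parasite may need a different setting.
  bcftools mpileup -Ou -f "$ref" "$bam" | bcftools call -mv -Oz -o "$vcf" &&
  bcftools index "$vcf" &&
  # Apply the called variants to the reference to produce the consensus.
  bcftools consensus -f "$ref" "$vcf" > "$out"
}
```

Keep in mind that positions without a called variant, including regions with no coverage at all, keep the reference base, so this gives a reference-guided sequence rather than a de novo assembly.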

Thank you so much. I was totally on a different track. Is there any eukaryotic parasite-specific assembler available for Illumina short reads?

Have a look at the assembler SPAdes to get started. There are plenty of others, however; see e.g. Wikipedia: https://en.wikipedia.org/wiki/De_novo_sequence_assemblers

+1 for SPAdes suggestion. With a 50 Mb genome this would be a good place to start.
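
A minimal sketch of what a SPAdes run on the paired files from the question could look like. The function name assemble_spades, the output directory and the thread count are placeholders. Since the reads appear to be in FASTA format, read error correction is skipped with --only-assembler (it needs FASTQ quality values); with the original FASTQ files you would drop that flag.

```shell
# Hypothetical helper: de novo assembly of one paired-end library with SPAdes.
# Usage: assemble_spades AA_genome1.fasta.gz AA_genome2.fasta.gz spades_out 8
assemble_spades() {
  local r1="$1" r2="$2" outdir="$3" threads="${4:-4}"
  # -1/-2: forward and reverse reads; -t: threads; -o: output directory
  spades.py --only-assembler -1 "$r1" -2 "$r2" -t "$threads" -o "$outdir"
}
```

The assembled sequences end up in contigs.fasta and scaffolds.fasta inside the output directory, and those are the files whose size should be in the range of the expected 50 Mb.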