Question

Parasite genome assembly

0

2.5 years ago

kamathshreya70 • 0

Hi All,

(1) I have been working on a parasite genome assembly using BWA. I used the following commands to run the assembly (paired-end Illumina short reads).

module load bwa/0.7.15

bwa mem -t 1 -M -R "@RG\tID:reads\tSM: AA_genome" reference_genome.fasta  AA_genome1.fasta.gz  AA_genome2.fasta.gz > AA_genome_aln-pe.sam

(2) I got the output AA_genome_aln-pe.sam, which is around 50 GB. I also tried to convert the sorted alignment file to FASTA format using

samtools bam2fq AA_genome.srt.bam | seqtk seq -A > AA_genome_assembly.fa

However, the final output I got is 20 GB. My expected assembly size was approximately 50 MB. How can I get the final assembly at the expected size? Is there still something I am missing in the analysis?

Thank you

parasite assembly BWA genome • 1.2k views

ADD COMMENT • link updated 8 months ago by Buffo ★ 2.4k • written 2.5 years ago by kamathshreya70 • 0

0

Hi all,

Thank you for your suggestion. I have the same question as @kamathshreya70 and managed to get the assembled genomes using SPAdes. My second question is: after getting the assemblies, how should I check their identity and species? Since my assembly is a multi-FASTA file (containing more than 30k scaffolds), will BLASTN work? Or are there any bioinformatics tools you would recommend?

I look forward to your reply. Thank you.

ADD REPLY • link 8 months ago by dante • 0

0

Identity against what? There are many options for comparing the similarity between assemblies at genome scale; I would recommend MUMmer. To assess the completeness of your assembly, you can try BUSCO.
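As a rough sketch of how those two tools might be run: the function below aligns an assembly to a reference with MUMmer (`nucmer`, then `dnadiff` for a summary report) and then runs BUSCO in genome mode. The function name `compare_assembly`, the argument order, and the output prefix are hypothetical; the BUSCO lineage dataset is left as an argument because the right one depends on the organism. It assumes `nucmer`, `dnadiff`, and `busco` are on the PATH.

```shell
# Sketch only: function name, argument order and prefix are placeholders.
compare_assembly() {
    ref=$1       # reference genome FASTA to compare against
    asm=$2       # your assembly (multi-FASTA of scaffolds)
    lineage=$3   # BUSCO lineage dataset suited to your organism
    prefix=$4    # simple name (no slashes) used for the output files

    # Whole-genome alignment of the assembly against the reference
    nucmer --prefix="$prefix" "$ref" "$asm" &&
    # Summarise the alignment (identity, aligned bases, differences)
    dnadiff -d "$prefix.delta" -p "$prefix" &&
    # Estimate gene-space completeness of the assembly
    busco -i "$asm" -m genome -l "$lineage" -o "${prefix}_busco"
}
```

The `dnadiff` summary is written to `<prefix>.report`. If no closely related reference exists, the MUMmer steps do not apply and only the BUSCO step is useful.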

ADD REPLY • link 8 months ago by Buffo ★ 2.4k

score 4 · Answer 1 · 2021-11-02

4

2.5 years ago

GenoMax 141k

Is there still something I am missing in the analysis?

bwa is an NGS read aligner, not a genome assembler. If you are looking to assemble the data, then you are using the wrong program. To assemble your genome from sequence data, you should be using something like SOAPdenovo or SPAdes. (Do you only have FASTA-format data, or did you convert the FASTQ files?)

If you are aligning to a reference genome (which seems to be the case above), then the size of the aligned data file has nothing to do with the size of the genome/assembly. That size simply reflects the alignments found for your reads against the reference.
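One way to see this for yourself is to count the records in the FASTA you produced. `samtools bam2fq` writes the reads back out, so the 20 GB file should hold millions of short sequences rather than a few long contigs. The helper below is a minimal sketch (the name `fasta_stats` is made up for illustration) that uses only `gzip` and `awk`:

```shell
# Count sequences and total bases in a FASTA file (plain or gzipped).
# A read dump has millions of short records; an assembly has far fewer, longer ones.
fasta_stats() {
    # gzip -cdf decompresses .gz input and passes plain text through unchanged
    gzip -cdf "$1" | awk '
        /^>/ { n++; next }            # header line: one more sequence
             { bases += length($0) }  # sequence line: add its length
        END  { printf "sequences=%d total_bases=%.0f\n", n, bases }'
}

# Example (file name taken from the question):
# fasta_stats AA_genome_assembly.fa
```

Dividing `total_bases` by the expected genome size (about 50 Mb here) gives a rough idea of the sequencing depth contained in that file.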

You can generate a consensus sequence from the bwa-aligned data file (the generated consensus should be close in size to your reference). This thread will help with that: Generating consensus sequence from bam file
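For orientation, one common route (not necessarily the one described in the linked thread) is to call variants with bcftools and apply them to the reference. The sketch below assumes `samtools` and `bcftools` are installed; the function name `make_consensus` and the output naming are hypothetical.

```shell
# Sketch: call variants against the reference, then apply them to it.
# Assumes samtools and bcftools are on the PATH; file names are placeholders.
make_consensus() {
    ref=$1      # reference FASTA used for the bwa alignment
    sam=$2      # SAM file written by bwa mem
    prefix=$3   # prefix for the output files (hypothetical naming)

    # Coordinate-sort and index the alignments
    samtools sort -o "$prefix.srt.bam" "$sam" &&
    samtools index "$prefix.srt.bam" &&
    # Pile up reads on the reference and call variants
    bcftools mpileup -f "$ref" "$prefix.srt.bam" |
        bcftools call -mv -Oz -o "$prefix.vcf.gz" &&
    bcftools index "$prefix.vcf.gz" &&
    # Substitute the called variants into the reference sequence
    bcftools consensus -f "$ref" "$prefix.vcf.gz" > "$prefix.consensus.fa"
}

# Example (file names taken from the question):
# make_consensus reference_genome.fasta AA_genome_aln-pe.sam AA_genome
```

Note that with this approach, positions without read coverage keep the reference base, so the result is the reference adjusted by your variants rather than an independent assembly.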

ADD COMMENT • link 2.5 years ago by GenoMax 141k

0

Thank you so much. I was totally on a different track. Is there any eukaryotic parasite-specific assembler available for Illumina short reads?

ADD REPLY • link 2.5 years ago by kamathshreya70 • 0

1

Have a look at the SPAdes assembler to get started. There are plenty of others, however; see e.g. Wikipedia: https://en.wikipedia.org/wiki/De_novo_sequence_assemblers

ADD REPLY • link 2.5 years ago by colindaven 6.4k

1

+1 for the SPAdes suggestion. With a 50 Mb genome this would be a good place to start.
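A minimal sketch of a paired-end SPAdes run, assuming `spades.py` is on the PATH. The wrapper name `run_spades`, the thread count, and the file names are placeholders, and the best options will depend on your data.

```shell
# Sketch: de novo assembly of one paired-end Illumina library with SPAdes.
run_spades() {
    r1=$1       # forward reads (FASTQ, may be gzipped)
    r2=$2       # reverse reads
    outdir=$3   # directory SPAdes will create for its results

    # -1/-2: paired-end read files, -t: threads, -o: output directory.
    # If only FASTA reads are available, SPAdes cannot run its read
    # error-correction step; adding --only-assembler skips it.
    spades.py -1 "$r1" -2 "$r2" -t 8 -o "$outdir"
}

# Example with hypothetical file names:
# run_spades AA_genome_R1.fastq.gz AA_genome_R2.fastq.gz AA_genome_spades
```

The assembled sequences end up in `contigs.fasta` and `scaffolds.fasta` inside the output directory; for a genome of about 50 Mb those files should be in the tens of megabytes.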

ADD REPLY • link 2.5 years ago by GenoMax 141k