(2) I got an AA_genome_aln-pe.sam output of around 50 GB. I also tried to convert the sorted alignment file to FASTA format using:
samtools bam2fq AA_genome.srt.bam | seqtk seq -A > AA_genome_assembly.fa
However, the final output I got is 20 GB. My expected assembly size was approximately 50 MB. How can I get the final assembly at the desired size? Is there still something I am missing in the analysis?
bwa is an NGS read aligner, not a genome assembler. If you are looking to assemble the data, then you are using the wrong program. To assemble your genome from sequence data, you should be using something like SOAPdenovo or SPAdes. (Do you only have FASTA-format data, or did you convert the FASTQ files?)
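If you are aligning to a reference genome (which seems to be the case above), then the size of the alignment file has nothing to do with the size of the genome/assembly. That size simply reflects the alignments found for your reads against the reference. Likewise, samtools bam2fq piped into seqtk seq -A only writes the reads back out of the BAM file as FASTA, so the 20 GB file is your read set, not an assembly.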
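As a rough sanity check, dividing the size of that read dump by the expected genome size gives an approximate depth of coverage. The sketch below uses the 20 GB and 50 MB figures from the question, read as decimal units. The function name `estimate_depth` is made up for illustration, and it assumes about one byte per base while ignoring FASTA header lines and newlines, so the result is an overestimate.

```shell
#!/bin/sh
# Rough depth-of-coverage estimate from file sizes.
# Assumption: ~1 byte per base; FASTA headers and newlines are not
# subtracted, so the true depth is somewhat lower than reported.
estimate_depth() {
    reads_bytes=$1    # size of the read FASTA, in bytes
    genome_bases=$2   # expected genome size, in bases
    echo $(( reads_bytes / genome_bases ))
}

# 20 GB of reads against a ~50 Mb genome (figures from the question)
estimate_depth 20000000000 50000000
```

A value in the hundreds just means the genome was sequenced very deeply. It is consistent with the read file being far larger than the genome, and an assembler would collapse those reads into contigs close to the expected size.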