I'm trying to produce a genome assembly of a microbat species which is fairly closely related to the previously published Myotis lucifugus. The DNA samples I used to generate my libraries came from a preserved museum specimen. The resulting fragments were pretty small and lead to a library with very small and somewhat inconsistent insert sizes (+30% of read pairs actually overlap by a few bases). As such it hasn't proven very useful for scaffolding de novo contigs together. Instead, I'm now trying to do a reference-based assembly by mapping to the Myotis lucifugus genome and producing a consensus sequence. I have read a lot of old forum posts which suggest the commands below to produce a consensus fasta sequence using samtools/bcftools/vcfutils. I apologize if there is some overlap, however I haven't found my specific question answered.
samtools mpileup -uf ref.fa aln.bam | bcftools view -cg - | vcfutils.pl vcf2fq > cns.fq
Basically, I want to create a consensus sequence which does not contain any reference bases. In other words, I want to borrow the contig/scaffold structure of the reference genome via alignment, but only include bases from my own species in the consensus sequence, replacing every reference base which isn't covered in the alignment with N's.
Samtools/bcftools/vcfutils will run and produce a consensus fasta file if the ref.fa file is not included, and at a glance it appears that all uncovered bases are indeed replaced with N's*. What I want to know is whether I can trust this sequence. Does samtools somehow still know what the reference bases are for aligned segments? Is it still possible to wind up with bases in my consensus sequence which are different between my species and Myotis lucifugus but where the Myotis lucifugus base is chosen over my own species'?
Really sorry for the long-winded question and I hope this hasn't been answered in detail elsewhere...I really appreciate your help!
*Side note: This isn't precisely true, there is actually missing sequence on every scaffold from the end of the last aligned segment to the end of the scaffold. I wrote a script to fill in N's to correct the total length and also added the -I option to skip indel calling but this is only important if I want to preserve the coordinate system of the Myotis lucifugus genome to use gffread to extract genes from its .gff annotation files.