Method for generating a reference-based genome assembly in fasta with no reference genome bases
0
2
Entering edit mode
6.3 years ago
memory_donk ▴ 330

Hi Biostars,

I'm trying to produce a genome assembly of a microbat species which is fairly closely related to the previously published Myotis lucifugus. The DNA samples I used to generate my libraries came from a preserved museum specimen. The resulting fragments were pretty small and lead to a library with very small and somewhat inconsistent insert sizes (+30% of read pairs actually overlap by a few bases). As such it hasn't proven very useful for scaffolding de novo contigs together. Instead, I'm now trying to do a reference-based assembly by mapping to the Myotis lucifugus genome and producing a consensus sequence. I have read a lot of old forum posts which suggest the commands below to produce a consensus fasta sequence using samtools/bcftools/vcfutils. I apologize if there is some overlap, however I haven't found my specific question answered.

Command:
samtools mpileup -uf ref.fa aln.bam | bcftools view -cg - | vcfutils.pl vcf2fq > cns.fq

Basically, I want to create a consensus sequence which does not contain any reference bases. In other words, I want to borrow the contig/scaffold structure of the reference genome via alignment, but only include bases from my own species in the consensus sequence, replacing every reference base which isn't covered in the alignment with N's.

Samtools/bcftools/vcfutils will run and produce a consensus fasta file if the ref.fa file is not included, and at a glance it appears that all uncovered bases are indeed replaced with N's*. What I want to know is whether I can trust this sequence. Does samtools somehow still know what the reference bases are for aligned segments? Is it still possible to wind up with bases in my consensus sequence which are different between my species and Myotis lucifugus but where the Myotis lucifugus base is chosen over my own species'?

Really sorry for the long-winded question and I hope this hasn't been answered in detail elsewhere...I really appreciate your help!

*Side note: This isn't precisely true, there is actually missing sequence on every scaffold from the end of the last aligned segment to the end of the scaffold. I wrote a script to fill in N's to correct the total length and also added the -I option to skip indel calling but this is only important if I want to preserve the coordinate system of the Myotis lucifugus genome to use gffread to extract genes from its .gff annotation files.

myposts Assembly alignment SNP • 4.1k views
0
Entering edit mode

I think MIRA may do what you want, see "Mapping assemblies" on its manual.

0
Entering edit mode

Hi memory_donk, did you find a solution for this question?

I'm facing very similar task now - I would like to reconstruct a particular set of transcripts using genomic reads of a species for which I do not have reference while I have a the transcript sequences from a related species. I mapped the reads to the reference transcripts via bowtie 2 and when I visually explore the bam file in IGV, I see reasonably good mapping with some gaps (reference regions with no mapped reads ) and an expected amount of unequivocally supported nucleotide differences between the sequenced species' genomic reads and the reference sequence.

I got a consensus from the bam file via:

samtools mpileup -uf reference.nt mapping.bam | bcftools call -c | vcfutils.pl vcf2fq > consensus.fasta


This consensus looks quite good, however, I can see that some nucleotide positions are in the consensus adopted from the reference while in the bam file there is different nucleotide unequivocally supported by the reads (although with low coverage - it seems that generally when I have coverage lower than 4 the nucleotide from reference instead of the nucleotide from reads is used in the consensus) and also N's used for the regions which are covered by a single read (while I'd like to have these regions included in the consensus).

I can extract the consensus exactly as I want it (as described above) from the IGV viewer (using function Copy Consensus Sequence as described here: http://software.broadinstitute.org/software/igv/PopupMenus#AlignmentTrack) but I have hundreds of sequences from which I would have to manually copy the consensus.

Thanks for any suggestions!

0
Entering edit mode

I think what I wound up doing was running samtools mpileup but excluding the -f flag and reference genome. At least as far as I could tell, I didn't find bases that came from the reference instead of the data. That could have also been because I had ~50X coverage, had a very high mapping percentage and mapped against the repeat-masked reference to avoid low-quality mappings, together ensuring that the overwhelming majority of mappable reference bases had deep coverage with my data. Using this approach I had to pad the ends of each resulting scaffolds which wind up clipped with N's to preserve the reference coordinate system.

If you don't have as much coverage depth, you could do a messy solution like mapping your full data set >= 4 times, merging your various bam files and then producing a consensus the way described above.

Alternatively, you could map you data with bwa, call variants with samtools or GATK, then write a script to change each identified variant in the reference genome, also replacing any reference bases not covered by your data with "N" characters.

Edit: I'm pretty sure its possible for vcfutils or something to output all bases, but if not, you can use bedtools genomecov to identify what positions in your reference genome with a coverage depth of 0. Those positions would then be replaced with Ns. Be careful though to make sure your vcf/bed and various tools are treating the data the same way (i.e. is the first position 0 or 1. Your scripts would have to account for that).

Edit: sorry for the slow response. I don't use this account much anymore.