Question

creating a consensus fasta that is same length as the reference

1

Entering edit mode

6.2 years ago

vulpecula ▴ 30

Hi all,

I have been trying to create a same-length consensus fasta by mapping reads to a reference genome for input into angsd as the ancestral sequence. However, the scaffolds of the resulting consensus always seem to be a little short, likely due to indels etc.. I have iterated through several versions of code (bcftools call -c or bcftools call -m, not using any data filters, etc.). Is there a specific path or set of filters I should use to be doing this? I found some former issues documented of this in previous versions of samtools/bcftools that had to do with an older version of the program but that doesn't seem to be the issue here. I also tried to use LAST for mapping, but that didn't seem to sort it out.

Apologies for the juvenile question and thanks for any help.

mapping alignment angsd • 2.7k views

ADD COMMENT • link updated 2.4 years ago by sagitaninta ▴ 20 • written 6.2 years ago by vulpecula ▴ 30

0

Entering edit mode

It is not clear what the desired course should be if there are indels in the genome. For example, if the genome is missing 10 bases from the middle how would you "make" that full size again? Would you want to insert 10 bases?

ADD REPLY • link 6.2 years ago by Istvan Albert 100k

0

Entering edit mode

Thanks for your reply! Yes--I agree this is problematic. I think that for indels it would be best to have N's where they are in the consensus sequence. This is discussed here (http://seqanswers.com/forums/showthread.php?p=129352#post129352 the thread is a bit long, sorry, comment #14-16 is relevant to this issue). There is some reference to making a "dummy sam", but I am not exactly sure what they mean by that.

ADD REPLY • link 6.2 years ago by vulpecula ▴ 30

1

Entering edit mode

This is such an unusual problem that there might not be a tool do exactly what you need.

One perhaps the simplest option would be to align your best consensus to the old with bwa mem then run sam2pairwise

https://github.com/mlafave/sam2pairwise

Now you have the two sequences lined up. At that point the "hard" part is done. But you'll still need to write a simple but custom script in say Python to navigate the two in parallel and produce your final consensus.

ADD REPLY • link 6.2 years ago by Istvan Albert 100k

score 0 · Answer 1 · 2021-11-04

I came to this post as I have the same need (making fasta of same length with reference chromosome) and found my own solution for them who might need it in the future.

I found that angsd -doFasta can do the magic (http://www.popgen.dk/angsd/index.php/Fasta). It call consensus fasta from your BAM:

angsd -i $BAM -doFasta 1 -doCounts 1 -out ${BAM%.bam}

Depending on the BAM file size it can only take some minutes. The resulting fasta will have chromosome length the same as the reference chromosome because it adds Ns at the end of the chromosome until the length of the reference chromosome available on the BAM header.

Other that angsd -doFasta, my other attempts are htsbox pileup which is more straightforward and not calling indels (I don't know how angsd -doFasta treat indels) but always calls several bases short from the reference, this needs a custom script to fill the end of chromosome so it is the same length with the reference chromosome.