Hi:
My project is assembly NGS reads to some long contings for downstream analysis.
I use samtools and bcftools to generate consensus sequence with reads bam file, reference genome and VCF file:
samtools mpileup -vf reference.fasta reads.bam | bcftools call -m -O z - > reads.vcf.gz
bcftools index reads.vcf.gz
bcftools consensus -f reference.fasta reads.vcf.gz -o consensus.fast
However, I found out the gap regions were filled with reference sequence and I want to replace these gap sequences with "N",
For example:
Reference: AGTCTGG TTAA GGCTGC
Reads bam: ...T..A ...A..
Consensus: AGTTTGA TTAA GGCAGC
I want to replace "TTAA" as "NNNN" and let consensus sequence output as "AGTTTGA NNNN GGCAGC".
How can I do this?
Thanks
David Wang,
Nation Taiwan Ocean University
awk, sed, tr
and other tools can help, initiate some code and see how far can you reach.Eg. replacing the whitespace characters with N, with spaces in Reads.bam as a reference.