Question

How to present "N" in gap region of consensus in sequence assembly?

2

Entering edit mode

9.0 years ago

kuosheng0629 ▴ 20

Hi:

My project is assembly NGS reads to some long contings for downstream analysis.

I use samtools and bcftools to generate consensus sequence with reads bam file, reference genome and VCF file:

samtools mpileup -vf reference.fasta reads.bam | bcftools call -m -O z - > reads.vcf.gz
bcftools index reads.vcf.gz
bcftools consensus -f reference.fasta reads.vcf.gz -o consensus.fast

However, I found out the gap regions were filled with reference sequence and I want to replace these gap sequences with "N",

For example:

Reference:    AGTCTGG   TTAA   GGCTGC
Reads bam:    ...T..A          ...A..
Consensus:    AGTTTGA   TTAA   GGCAGC

I want to replace "TTAA" as "NNNN" and let consensus sequence output as "AGTTTGA NNNN GGCAGC".

How can I do this?

Thanks

David Wang,
Nation Taiwan Ocean University

alignment next-gen Assembly • 4.7k views

ADD COMMENT • link updated 2.1 years ago by Ram 43k • written 9.0 years ago by kuosheng0629 ▴ 20

0

Entering edit mode

awk, sed, tr and other tools can help, initiate some code and see how far can you reach.
Eg. replacing the whitespace characters with N, with spaces in Reads.bam as a reference.

ADD REPLY • link 8.9 years ago by Sukhi Singh 11k

score 3 · Answer 1 · 2015-06-15

3

Entering edit mode

8.9 years ago

Brice Sarver ★ 3.8k

1. Call variants.

2. Identify positions with nocalls (say, by emitting all sites) and convert to a BED:

grep "\./\." [YOUR_VCF] | awk '{OFS="\t"; if ($0 !~ /\#/); print $1, $2-1, $2}' > [YOUR_FILTERED_BED]

3. Use BEDtools' maskfasta.

ADD COMMENT • link 8.9 years ago by Brice Sarver ★ 3.8k

0

Entering edit mode

Hello, I have the same problem, so thank you for your answer. My VCF file does not contain any gap sequences. I'm using GATK 4. 0.1.2 and I can't find how to emit all the positions. Sorry if this is a basic question, I'm new. Thanks

ADD REPLY • link 6.1 years ago by sere.r • 0

score 1 · Answer 2 · 2016-07-25

1

Entering edit mode

7.8 years ago

magdalena1236 ▴ 10

get your consensus using mpileup + bcftools
Use bedtools genomecov -bga for identifying the sites with no mapped reads ie. your N sites
recover them via | grep -w 0$
use bedtools maskfasta to change your target fasta

ADD COMMENT • link 7.8 years ago by magdalena1236 ▴ 10