Question: How to present "N" in gap region of consensus in sequence assembly?
gravatar for kuosheng0629
5.3 years ago by
kuosheng062920 wrote:

Hi  :

My project is assembly NGS reads to some long contings for downstream analysis.

I  use  samtools and bcftools to generate consensus sequence  with reads bam file, reference genome and VCF file:

samtools mpileup -vf reference.fasta reads.bam | bcftools call -m -O z - > reads.vcf.gz

bcftools index reads.vcf.gz

bcftools consensus -f reference.fasta reads.vcf.gz -o

However, I  found out the  gap regions  were filled with reference sequence  and I want to  replace these gap sequences with "N",


For example:


Reference:     AGTCTGG   TTAA   GGCTGC

Reads bam:    .  . T . . . A               .  . . A . .

Consensus:    AGTTTGA   TTAA   GGCAGC


I  want  to  replace  "TTAA"  as  "NNNN" and let consensus sequence output as "AGTTTGA NNNN GGCAGC".


How can I do this?




David Wang, Nation Taiwan Ocean University   

alignment next-gen forum assembly • 2.9k views
ADD COMMENTlink modified 4.0 years ago by magdalena123610 • written 5.3 years ago by kuosheng062920

awk, sed, tr and other tools can help, initiate some code and see how far can you reach.
Eg. replacing the whitespace characters with N, with spaces in Reads.bam as a reference.

ADD REPLYlink modified 5.2 years ago • written 5.2 years ago by Sukhdeep Singh10k
gravatar for Brice Sarver
5.2 years ago by
Brice Sarver3.5k
United States
Brice Sarver3.5k wrote:

1. Call variants.

2. Identify positions with nocalls (say, by emitting all sites) and convert to a BED:

grep "\./\." [YOUR_VCF] | awk '{OFS="\t"; if ($0 !~ /\#/); print $1, $2-1, $2}' > [YOUR_FILTERED_BED]

3. Use BEDtools' maskfasta.

ADD COMMENTlink written 5.2 years ago by Brice Sarver3.5k

Hello, I have the same problem, so thank you for your answer. My VCF file does not contain any gap sequences. I'm using GATK 4. 0.1.2 and I can't find how to emit all the positions. Sorry if this is a basic question, I'm new. Thanks

ADD REPLYlink written 2.4 years ago by sere.r0
gravatar for magdalena1236
4.0 years ago by
magdalena123610 wrote:
  1. get your consensus using mpileup + bcftools
  2. Use bedtools genomecov -bga for identifying the sites with no mapped reads ie. your N sites
  3. recover them via | grep -w 0$
  4. use bedtools maskfasta to change your target fasta
ADD COMMENTlink modified 4.0 years ago • written 4.0 years ago by magdalena123610
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 740 users visited in the last hour