Question: Can I Convert Fasta Lowercase Bases To 'N'?
gravatar for User 8087
9.1 years ago by
User 808720
User 808720 wrote:

I have mutliple genome consensus sequences, extracted from a BWA BAM file using the following commandline:

samtools mpileup -E -q30 -Q30 -uf REF.fa isolate1.merged.bam | bcftools view -cg - |\ vcf2fq -d3 > isolate1.fq (which I convert to FASTA using a python script).
(Samtools-0.1.18 used)

Whether or not I invoke the -d option in (default is read-depth 3), sites with a read depth of 1 or 2 are still called, just lowercase instead! I'd rather the base at these sites was not called ie 'N'.

This means that for each bacterial genome, I have approximately 48000 sites where a base has been called with a low read depth that I'd consider to be low quality. This includes approximately 650 variant sites that could be false positive SNPs (if they are sequencing errors covered by 1 read only!) and therefore potentially skew our phylogenetic analysis.

Is it possible to convert these lowercase letters to 'N' ie an uncalled base? I can only find tools that convert lowercase to uppercase and viceversa, I can't find any way to convert these low quality letters to 'N'? Of course my FASTA file is a mixture of uppercase bases and lowercase!

Failing that, perhaps does anyone know of a way I could call the consensus sequence from either the BAM file or the VCF file I have (for all sites), where it does filter out for minimum read depth?

Any help much appreciated!

Best Wishes,


read samtools mpileup conversion • 4.8k views
ADD COMMENTlink modified 8.6 years ago by Frédéric Mahé3.1k • written 9.1 years ago by User 808720
gravatar for Frédéric Mahé
8.6 years ago by
France, Montpellier, CIRAD
Frédéric Mahé3.1k wrote:

With sed, you can do it like that:

sed -e '/^>/! s/[[:lower:]]/N/g' in.fasta > out.fasta

/^>/! selects lines not starting with >, [[:lower:]] selects any lowercase letter, with g sed replaces globally (all occurrences).

ADD COMMENTlink written 8.6 years ago by Frédéric Mahé3.1k
gravatar for Scopak
9.1 years ago by
Scopak80 wrote:

Try a Perl one-liner to convert lowercase to N:

perl -e 'while(<>) { if ($_ =~ /^>.*/) { print $_; } else { $_ =~ tr/acgt/N/; print $_;}}' < in.fasta  >out.fasta

Edit: Changed to tr///.



ADD COMMENTlink modified 9.1 years ago • written 9.1 years ago by Scopak80

tr/// is better than s/// here

ADD REPLYlink written 9.1 years ago by Martin A Hansen3.0k

Masha, I agree, tr/// is faster for simple replacements than s///.

use tr/acgt/N/ instead of s/[acgt]/N/ in the code

ADD REPLYlink written 9.1 years ago by Scopak80
gravatar for Martin A Hansen
9.1 years ago by
Martin A Hansen3.0k
Martin A Hansen3.0k wrote:

With Biopieces you do:

read_fasta -i in.fna | transliterate_seq -s 'atcg' -r 'N' | write_fasta -o out.fna -x

ADD COMMENTlink written 9.1 years ago by Martin A Hansen3.0k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1230 users visited in the last hour