How To Generate A Consensus Fasta Sequence From Sam Tools Pileup?
8
9
Entering edit mode
11.0 years ago

SAM tools pileup produces a consensus sequence in vertical format, which is not terribly useful for traditional sequence analysis. Does anyone know of a command line utility to convert SAM tools pileup format to a simple fasta sequence?

I am aware that SAM tools provides tool to convert pileup to fastq samtools.pl pileup2fq), but fastq to fasta conversion presents its own challenges:

http://stackoverflow.com/questions/1542306/converting-fastq-to-fasta-with-sed-awk

http://ukpmc.ac.uk/articlerender.cgi?accid=PMC2847217

Of course I could write a simple script for this task, but I'm hoping someone has already solved this problem for the community. Apologies in advance if this functionality is available in SAM tools itself.

Many thanks, Casey

samtools next-gen sequencing fasta • 40k views
6
Entering edit mode
10.3 years ago
lh3 32k

For fasta->fastq conversion, you may consider seqtk.

Get:

git clone https://github.com/lh3/misc.git
cd misc/seq/seqtk/


Compile:

gcc -O2 seqtk.c -o seqtk -lz


And

seqtk fq2fa in.fastq 20 > out.fasta


converts in.fastq and changes bases with quality lower than 20 to lowercases.

0
Entering edit mode

you should check for the memory leaks, if the stream was open, etc...

0
Entering edit mode

There is no memory leak. As to file opening, if the file is not readable, an immediate segfault; no false results.

0
Entering edit mode

Heng, thanks for this. Confirmed that it works on multi-line fastq as advertised. Would it be possible to embed this as an output option in samtools?

0
Entering edit mode

What if I didn't want the bases with quality lower than to 20 to be converted to lower cases?

I used the command:

seqtk fq2fa  Mini_consensus_genome.fq > Mini_consensus_out.fa


However it still outputted many bases in lower case!

Can you kindly tell me how to avoid this!

Thank you

3
Entering edit mode
11.0 years ago

From here:

Does samtools generate the consensus sequence like Maq?

Yes. Try this:

samtools pileup -cf ref.fa aln.bam | samtools.pl pileup2fq -D100 > cns.fastq


Again, remember to set -D according to your read depth. Note that pileup2fq applies fewer filters in comparison to varFilter, and you may see tiny inconsistency between the two outputs.

1
Entering edit mode

Unfortunately, the flavor of fastq output by samtools.pl does not follow the fastq 4-line convention (both the sequence and qualities are split over many lines), so many of the Stack Overflow solutions are not useful here ;^(

0
Entering edit mode

Hi Pierre -

Does this command generate fasta or fastq?

Best, Casey

0
Entering edit mode

I guess it is fastq, but you found a fastq2fasta on SO , don't you ? :-) ?

0
Entering edit mode

The seqret application in the emboss package seems to be working for this type of fastq to fasta conversion. But hacking into the samtools.pl script looks like a better solution.

0
Entering edit mode

Thanks. This command was very helpful. I'd been looking for a way to do this. @Casey: It generates FASTQ. There's an awk command I found to extract the FASTA from it:

awk '/^@chr7$/,/^+$/' consensus.fq | perl -pe "s/@/>/ ; s/\+//" > chr7.fasta


Joe

3
Entering edit mode
10.7 years ago
Rm 8.0k

You can use custom scripts to convert smatool pleup2fq generated fastq to fasta format.

Multi line Consensus.fastq file looks like this

@chr1
nnnnnnnagatagaaataCACGATGCGAGCAATCAAATTTCATAACATCACCATGAGTTT
GGTCCGAAGCATGAGTGTTTACAATGTTTGAAtaCCTTATACAGTTCTTATACATACTTT
ATAAATTATTTCCCaagctgttttgatacactcactaacagaTATTCTATAGAAGGAAAA
+
!!!!!!!????????BBBEEEHKKKKKKKKKKKKNNNNNNQQNNNNNNQWWWW7WZWWWZ
ZZZ]]]>]]]]]]]ZTQQQTQNNKKKKKKHKKHHHEEEEEEEHHHHHHHHH
HKKHHHHHHEEEEEBBBBBBBBBBBB?=B@BBBBBB??BBBBEEEEEEEEEEHKNNNNNQ


For each chromosome (here chr1) use Awk and perl for Converting multi line fastq to fasta.

awk '/^@chr1$/,/^+$/' consensus.fastq | perl -pe "s/@/>/ ; s/\\+//" > chr1.fasta

0
Entering edit mode

RM, this solution works for GNU AWK 3.1.5 on linux, but not the BSD version of AWK on OSX?!?

0
Entering edit mode

@casey: can you post the error/message from the BSD version.

1
Entering edit mode
10.2 years ago
Swbarnes2 ★ 1.5k

The samtools.pl solution won't even try to deal with indels. Here's what it does with them:

sub p2q_filter_gaps {
my ($seq,$gaps, $l) = @_; for my$g (@$gaps) { my$x = $g >$l? $g -$l : 0;
substr($$seq, x, l + l) = lc(substr($$seq, $x,$l + $l)); } }  That last line means "make that substring of the sequence lowercase". It does that around each puitative gap. If you don't want the output in fastq format, just go to the samtools.pl file in the misc folder, and change the script. sub p2q_post_process { my ($chr, $seq,$qual, $gaps,$l) = @_;
&p2q_filter_gaps($seq,$gaps, $l); print "\@$chr\n"; &p2q_print_str($seq); print "+\n"; &p2q_print_str($qual);
}


That last line is where the quality is printed out, just comment that line out by putting a # in front of it, and it won't print those last two lines. You'll also want to change the @ in the preceeding line to a >.

1
Entering edit mode
10.0 years ago

If a web-based approach would be helpful, please see the Galaxy public server at http://usegalaxy.org

One solution, if you want fasta sequence based on the reference genome (could be a native Galaxy genome, a custom genome in your history, or really any fasta file in your history as long as the mapped "chromosomes" names are identical), is to use the tool "NGS: SAM Tools -> Pileup-to-Interval". Then, to extract fasta sequence based on these coordinates, use the tool "Fetch Sequences -> Extract Genomic DNA".

If you are interested in examining the variation in your data vs the reference, please see the tools under "NGS: Indel Analysis". Combined with the tool "Genome Diversity -> Extract DNA flanking chosen SNPs" this can incorporate your SNPs into the background reference to produce novel fasta sequences.

If still needed, moving from FASTQ to FASTA in Galaxy is very simple using the tool "NGS: QC and manipulation -> FASTQ to FASTA converter ".

If your datasets are very large or strict security is necessary (as when working with certain human data), creating a local install or utilizing Galaxy's Amazon Cloudman version is recommended. Basic installation is easy and adding in 3rd party tools/datasets is designed to be as painless as possible :) http://getgalaxy.org (see the wiki for step-by-step instructions for both)

jen

0
Entering edit mode
10.4 years ago
Parit • 0

for converting fastq to fasta check out fastx toolkit

1
Entering edit mode

Fastx does not work with multiline fastq. It does not accept lowercase bases, either. No, it would not work. Actually they should use my fasta/q parser.

0
Entering edit mode

This is true in my experience as well.

0
Entering edit mode
8.6 years ago

If using data from 1000 genomes, you basically have a BAM file and an FA reference fasta file.

If you want to get a consensus chromosome FASTA out of this data, my understanding was to use the following command:

samtools mpileup -uf ref.fa aln.bam | bcftools view -cg - | vcfutils.pl vcf2fq > cns.fq


And then figure out how to convert fq to fasta.

But, what if you just want to use chromosome 2 (instead of all of them) for example? Is there a way to process only that?

0
Entering edit mode

Hello, you can use the seqret tools from emboss: seqret -sformat fastq-sanger -sequence yoursequence.fq -osformat fasta -outseq yoursequence.fasta

0
Entering edit mode

Using seqtk is simpler and faster:

seqtk subseq in.fq list.txt > out.fq

0
Entering edit mode
7.4 years ago
mobiusklein ▴ 160

To parse samtools's consesus fastq, which breaks the four line convention, you can use the symmetry between the nucleotide sequence lines and the quality lines. There must be the same number of lines in each section, and we know when the quality lines start appearing (after the + line). So by using a bit of statefulness in the parsing routine, we can accumulate a number of lines in the sequence region, and then use that information to know how many of the following lines to take and treat as quality scores. Here's the rough outline in Python:

def parse_file(self):
defline = ''
sequence = ''
qual = ''
count = 0
state = 'start'
for line in open(self.file_path, 'r'):
match = re.search(r'^@(?P<defline>[^@]+)\n', line)
if state == 'qual' and count > 0:
qual += line
count-=1
elif match:
if defline != '':
self.process_record(defline, sequence, qual)
defline = match.groupdict()['defline']
sequence = ''
qual = ''
state = 'sequence'
elif re.search(r'\+\n', line):
state = 'qual'
else:
if state == 'sequence':
sequence += line
count += 1

self.process_record(defline, sequence, qual)
self.parsed = True
`