How To Generate A Consensus Fasta Sequence From Sam Tools Pileup?
8
9
Entering edit mode
13.9 years ago

SAM tools pileup produces a consensus sequence in vertical format, which is not terribly useful for traditional sequence analysis. Does anyone know of a command line utility to convert SAM tools pileup format to a simple fasta sequence?

I am aware that SAM tools provides tool to convert pileup to fastq samtools.pl pileup2fq), but fastq to fasta conversion presents its own challenges:

http://stackoverflow.com/questions/1542306/converting-fastq-to-fasta-with-sed-awk

http://ukpmc.ac.uk/articlerender.cgi?accid=PMC2847217

Of course I could write a simple script for this task, but I'm hoping someone has already solved this problem for the community. Apologies in advance if this functionality is available in SAM tools itself.

Many thanks, Casey

samtools next-gen sequencing fasta • 44k views
ADD COMMENT
6
Entering edit mode
13.2 years ago
lh3 33k

For fasta->fastq conversion, you may consider seqtk.

Get:

git clone https://github.com/lh3/misc.git
cd misc/seq/seqtk/

Compile:

gcc -O2 seqtk.c -o seqtk -lz

And

seqtk fq2fa in.fastq 20 > out.fasta

converts in.fastq and changes bases with quality lower than 20 to lowercases.

ADD COMMENT
0
Entering edit mode

you should check for the memory leaks, if the stream was open, etc...

ADD REPLY
0
Entering edit mode

There is no memory leak. As to file opening, if the file is not readable, an immediate segfault; no false results.

ADD REPLY
0
Entering edit mode

Heng, thanks for this. Confirmed that it works on multi-line fastq as advertised. Would it be possible to embed this as an output option in samtools?

ADD REPLY
0
Entering edit mode

What if I didn't want the bases with quality lower than to 20 to be converted to lower cases?

I used the command:

seqtk fq2fa  Mini_consensus_genome.fq > Mini_consensus_out.fa

However it still outputted many bases in lower case!

Can you kindly tell me how to avoid this!

Thank you

ADD REPLY
3
Entering edit mode
13.9 years ago

From here:

Does samtools generate the consensus sequence like Maq?

Yes. Try this:

samtools pileup -cf ref.fa aln.bam | samtools.pl pileup2fq -D100 > cns.fastq
  

Again, remember to set -D according to your read depth. Note that pileup2fq applies fewer filters in comparison to varFilter, and you may see tiny inconsistency between the two outputs.

ADD COMMENT
1
Entering edit mode

Unfortunately, the flavor of fastq output by samtools.pl does not follow the fastq 4-line convention (both the sequence and qualities are split over many lines), so many of the Stack Overflow solutions are not useful here ;^(

ADD REPLY
0
Entering edit mode

Hi Pierre -

Does this command generate fasta or fastq?

Best, Casey

ADD REPLY
0
Entering edit mode

I guess it is fastq, but you found a fastq2fasta on SO , don't you ? :-) ?

ADD REPLY
0
Entering edit mode

The seqret application in the emboss package seems to be working for this type of fastq to fasta conversion. But hacking into the samtools.pl script looks like a better solution.

ADD REPLY
0
Entering edit mode

Thanks. This command was very helpful. I'd been looking for a way to do this. @Casey: It generates FASTQ. There's an awk command I found to extract the FASTA from it:

awk '/^@chr7$/,/^+$/' consensus.fq | perl -pe "s/@/>/ ; s/\+//" > chr7.fasta

Joe

ADD REPLY
3
Entering edit mode
13.6 years ago
Rm 8.3k

You can use custom scripts to convert smatool pleup2fq generated fastq to fasta format.

Multi line Consensus.fastq file looks like this

@chr1
nnnnnnnagatagaaataCACGATGCGAGCAATCAAATTTCATAACATCACCATGAGTTT
GGTCCGAAGCATGAGTGTTTACAATGTTTGAAtaCCTTATACAGTTCTTATACATACTTT
ATAAATTATTTCCCaagctgttttgatacactcactaacagaTATTCTATAGAAGGAAAA  
+
!!!!!!!????????BBBEEEHKKKKKKKKKKKKNNNNNNQQNNNNNNQWWWW7WZWWWZ
ZZZ]]]`````````>]]]]]]]ZTQQQTQNNKKKKKKHKKHHHEEEEEEEHHHHHHHHH
HKKHHHHHHEEEEEBBBBBBBBBBBB?=B@BBBBBB??BBBBEEEEEEEEEEHKNNNNNQ

For each chromosome (here chr1) use Awk and perl for Converting multi line fastq to fasta.

awk '/^@chr1$/,/^+$/' consensus.fastq | perl -pe "s/@/>/ ; s/\\+//" > chr1.fasta
ADD COMMENT
0
Entering edit mode

RM, this solution works for GNU AWK 3.1.5 on linux, but not the BSD version of AWK on OSX?!?

ADD REPLY
0
Entering edit mode

@casey: can you post the error/message from the BSD version.

ADD REPLY
1
Entering edit mode
13.0 years ago
Swbarnes2 ★ 1.6k

The samtools.pl solution won't even try to deal with indels. Here's what it does with them:

sub p2q_filter_gaps {
  my ($seq, $gaps, $l) = @_;
  for my $g (@$gaps) {
    my $x = $g > $l? $g - $l : 0;
    substr($$seq, $x, $l + $l) = lc(substr($$seq, $x, $l + $l));
  }
}

That last line means "make that substring of the sequence lowercase". It does that around each puitative gap.

If you don't want the output in fastq format, just go to the samtools.pl file in the misc folder, and change the script.

sub p2q_post_process {
  my ($chr, $seq, $qual, $gaps, $l) = @_;
  &p2q_filter_gaps($seq, $gaps, $l);
  print "\@$chr\n"; &p2q_print_str($seq);
  print "+\n"; &p2q_print_str($qual);
}

That last line is where the quality is printed out, just comment that line out by putting a # in front of it, and it won't print those last two lines. You'll also want to change the @ in the preceeding line to a >.

ADD COMMENT
1
Entering edit mode
12.8 years ago

If a web-based approach would be helpful, please see the Galaxy public server at http://usegalaxy.org

One solution, if you want fasta sequence based on the reference genome (could be a native Galaxy genome, a custom genome in your history, or really any fasta file in your history as long as the mapped "chromosomes" names are identical), is to use the tool "NGS: SAM Tools -> Pileup-to-Interval". Then, to extract fasta sequence based on these coordinates, use the tool "Fetch Sequences -> Extract Genomic DNA".

If you are interested in examining the variation in your data vs the reference, please see the tools under "NGS: Indel Analysis". Combined with the tool "Genome Diversity -> Extract DNA flanking chosen SNPs" this can incorporate your SNPs into the background reference to produce novel fasta sequences.

If still needed, moving from FASTQ to FASTA in Galaxy is very simple using the tool "NGS: QC and manipulation -> FASTQ to FASTA converter ".

If your datasets are very large or strict security is necessary (as when working with certain human data), creating a local install or utilizing Galaxy's Amazon Cloudman version is recommended. Basic installation is easy and adding in 3rd party tools/datasets is designed to be as painless as possible :) http://getgalaxy.org (see the wiki for step-by-step instructions for both)

jen

ADD COMMENT
0
Entering edit mode
13.2 years ago
Parit • 0

for converting fastq to fasta check out fastx toolkit

ADD COMMENT
1
Entering edit mode

Fastx does not work with multiline fastq. It does not accept lowercase bases, either. No, it would not work. Actually they should use my fasta/q parser.

ADD REPLY
0
Entering edit mode

This is true in my experience as well.

ADD REPLY
0
Entering edit mode
11.4 years ago

If using data from 1000 genomes, you basically have a BAM file and an FA reference fasta file.

If you want to get a consensus chromosome FASTA out of this data, my understanding was to use the following command:

samtools mpileup -uf ref.fa aln.bam | bcftools view -cg - | vcfutils.pl vcf2fq > cns.fq

And then figure out how to convert fq to fasta.

But, what if you just want to use chromosome 2 (instead of all of them) for example? Is there a way to process only that?

ADD COMMENT
0
Entering edit mode

Hello, you can use the seqret tools from emboss: seqret -sformat fastq-sanger -sequence yoursequence.fq -osformat fasta -outseq yoursequence.fasta

ADD REPLY
0
Entering edit mode

Using seqtk is simpler and faster:

seqtk subseq in.fq list.txt > out.fq
ADD REPLY
0
Entering edit mode
10.2 years ago
mobiusklein ▴ 180

To parse samtools's consesus fastq, which breaks the four line convention, you can use the symmetry between the nucleotide sequence lines and the quality lines. There must be the same number of lines in each section, and we know when the quality lines start appearing (after the + line). So by using a bit of statefulness in the parsing routine, we can accumulate a number of lines in the sequence region, and then use that information to know how many of the following lines to take and treat as quality scores. Here's the rough outline in Python:

def parse_file(self):
    defline = ''
    sequence = ''
    qual = ''
    count = 0
    state = 'start'
    for line in open(self.file_path, 'r'):
        match = re.search(r'^@(?P<defline>[^@]+)\n', line)
        if state == 'qual' and count > 0:
            qual += line
            count-=1
        elif match:
            if defline != '':
                self.process_record(defline, sequence, qual)
            defline = match.groupdict()['defline']
            sequence = ''
            qual = ''
            state = 'sequence'
        elif re.search(r'\+\n', line):
            state = 'qual'
        else:
            if state == 'sequence':
                sequence += line
                count += 1

    self.process_record(defline, sequence, qual)
    self.parsed = True
ADD COMMENT

Login before adding your answer.

Traffic: 1606 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6