Question: How To Generate A Consensus Fasta Sequence From Sam Tools Pileup?
9
gravatar for Casey Bergman
8.1 years ago by
Casey Bergman17k
Athens, GA, USA
Casey Bergman17k wrote:

SAM tools pileup produces a consensus sequence in vertical format, which is not terribly useful for traditional sequence analysis. Does anyone know of a command line utility to convert SAM tools pileup format to a simple fasta sequence?

I am aware that SAM tools provides tool to convert pileup to fastq samtools.pl pileup2fq), but fastq to fasta conversion presents its own challenges:

http://stackoverflow.com/questions/1542306/converting-fastq-to-fasta-with-sed-awk

http://ukpmc.ac.uk/articlerender.cgi?accid=PMC2847217

Of course I could write a simple script for this task, but I'm hoping someone has already solved this problem for the community. Apologies in advance if this functionality is available in SAM tools itself.

Many thanks, Casey

ADD COMMENTlink modified 4.5 years ago by mobiusklein150 • written 8.1 years ago by Casey Bergman17k
6
gravatar for lh3
7.4 years ago by
lh331k
United States
lh331k wrote:

For fasta->fastq conversion, you may consider seqtk.

Get:

git clone <https://github.com/lh3/misc.git>
cd misc/seq/seqtk/

Compile:

gcc -O2 seqtk.c -o seqtk -lz

And

seqtk fq2fa in.fastq 20 > out.fasta

converts in.fastq and changes bases with quality lower than 20 to lowercases.

ADD COMMENTlink modified 7.3 years ago by Casey Bergman17k • written 7.4 years ago by lh331k

you should check for the memory leaks, if the stream was open, etc...

ADD REPLYlink written 7.4 years ago by Pierre Lindenbaum110k

There is no memory leak. As to file opening, if the file is not readable, an immediate segfault; no false results.

ADD REPLYlink written 7.4 years ago by lh331k

Heng, thanks for this. Confirmed that it works on multi-line fastq as advertised. Would it be possible to embed this as an output option in samtools?

ADD REPLYlink written 7.3 years ago by Casey Bergman17k

What if I didn't want the bases with quality lower than to 20 to be converted to lower cases?

I used the command seqtk fq2fa Mini_consensus_genome.fq > Mini_consensus_out.fa

However it still outputted many bases in lower case!

Can you kindly tell me how to avoid this!

Thank you

ADD REPLYlink written 4.8 years ago by modi202020
3
gravatar for Pierre Lindenbaum
8.1 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum110k wrote:

from : http://sourceforge.net/apps/mediawiki/samtools/index.php?title=SAM_FAQ#Does_samtools_generate_the_consensus_sequence_like_Maq.3F

Does samtools generate the consensus sequence like Maq?

Yes. Try this:

samtools pileup -cf ref.fa aln.bam | samtools.pl pileup2fq -D100 > cns.fastq

Again, remember to set -D according to your read depth. Note that pileup2fq applies fewer filters in comparison to varFilter, and you may see tiny inconsistency between the two outputs.

ADD COMMENTlink written 8.1 years ago by Pierre Lindenbaum110k
1

Unfortunately, the flavor of fastq output by samtools.pl does not follow the fastq 4-line convention (both the sequence and qualities are split over many lines), so many of the Stack Overflow solutions are not useful here ;^(

ADD REPLYlink written 8.1 years ago by Casey Bergman17k

Hi Pierre -

Does this command generate fasta or fastq?

Best, Casey

ADD REPLYlink written 8.1 years ago by Casey Bergman17k

I guess it is fastq, but you found a fastq2fasta on SO , don't you ? :-) ?

ADD REPLYlink written 8.1 years ago by Pierre Lindenbaum110k

The seqret application in the emboss package seems to be working for this type of fastq to fasta conversion. But hacking into the samtools.pl script looks like a better solution.

ADD REPLYlink written 7.1 years ago by Vitis1.6k

Thanks. This command was very helpful. I'd been looking for a way to do this. @Casey: It generates FASTQ. There's an awk command I found to extract the FASTA from it:

awk '/^@chr7$/,/^+$/' consensus.fq | perl -pe "s/@/>/ ; s/\+//" > chr7.fasta

Joe

ADD REPLYlink written 5.6 years ago by joseph.a.white10
3
gravatar for Rm
7.8 years ago by
Rm7.7k
Danville, PA
Rm7.7k wrote:

you can use custom scripts to convert smatool pleup2fq generated fastq to fasta format.

Multi line Consensus.fastq file looks like this

@chr1
nnnnnnnagatagaaataCACGATGCGAGCAATCAAATTTCATAACATCACCATGAGTTT
GGTCCGAAGCATGAGTGTTTACAATGTTTGAAtaCCTTATACAGTTCTTATACATACTTT
ATAAATTATTTCCCaagctgttttgatacactcactaacagaTATTCTATAGAAGGAAAA  
+
!!!!!!!????????BBBEEEHKKKKKKKKKKKKNNNNNNQQNNNNNNQWWWW7WZWWWZ
ZZZ]]]`````````>]]]]]]]ZTQQQTQNNKKKKKKHKKHHHEEEEEEEHHHHHHHHH
HKKHHHHHHEEEEEBBBBBBBBBBBB?=B@BBBBBB??BBBBEEEEEEEEEEHKNNNNNQ

for each chromosome (here chr1) use Awk and perl for Converting multi line fastq to fasta.

awk '/^@chr1$/,/^+$/' consensus.fastq | perl -pe "s/@/>/ ; s/\\+//" > chr1.fasta
ADD COMMENTlink modified 7.3 years ago by Casey Bergman17k • written 7.8 years ago by Rm7.7k

RM, this solution works for GNU AWK 3.1.5 on linux, but not the BSD version of AWK on OSX?!?

ADD REPLYlink written 7.3 years ago by Casey Bergman17k

@casey: can you post the error/message from the BSD version.

ADD REPLYlink written 7.3 years ago by Rm7.7k
1
gravatar for Swbarnes2
7.3 years ago by
Swbarnes21.4k
Swbarnes21.4k wrote:

The samtools.pl solution won't even try to deal with indels. Here's what it does with them:

sub p2q_filter_gaps {
  my ($seq, $gaps, $l) = @_;
  for my $g (@$gaps) {
    my $x = $g > $l? $g - $l : 0;
    substr($$seq, $x, $l + $l) = lc(substr($$seq, $x, $l + $l));
  }
}

That last line means "make that substring of the sequence lowercase". It does that around each puitative gap.

If you don't want the output in fastq format, just go to the samtools.pl file in the misc folder, and change the script.

sub p2q_post_process {
  my ($chr, $seq, $qual, $gaps, $l) = @_;
  &p2q_filter_gaps($seq, $gaps, $l);
  print "\@$chr\n"; &p2q_print_str($seq);
  print "+\n"; &p2q_print_str($qual);
}

That last line is where the quality is printed out, just comment that line out by putting a # in front of it, and it won't print those last two lines. You'll also want to change the @ in the preceeding line to a >.

ADD COMMENTlink written 7.3 years ago by Swbarnes21.4k
1
gravatar for Jennifer Hillman Jackson
7.1 years ago by
Bay Area, CA
Jennifer Hillman Jackson320 wrote:

If a web-based approach would be helpful, please see the Galaxy public server at http://usegalaxy.org

One solution, if you want fasta sequence based on the reference genome (could be a native Galaxy genome, a custom genome in your history, or really any fasta file in your history as long as the mapped "chromosomes" names are identical), is to use the tool "NGS: SAM Tools -> Pileup-to-Interval". Then, to extract fasta sequence based on these coordinates, use the tool "Fetch Sequences -> Extract Genomic DNA".

If you are interested in examining the variation in your data vs the reference, please see the tools under "NGS: Indel Analysis". Combined with the tool "Genome Diversity -> Extract DNA flanking chosen SNPs" this can incorporate your SNPs into the background reference to produce novel fasta sequences.

If still needed, moving from FASTQ to FASTA in Galaxy is very simple using the tool "NGS: QC and manipulation -> FASTQ to FASTA converter ".

If your datasets are very large or strict security is necessary (as when working with certain human data), creating a local install or utilizing Galaxy's Amazon Cloudman version is recommended. Basic installation is easy and adding in 3rd party tools/datasets is designed to be as painless as possible :) http://getgalaxy.org (see the wiki for step-by-step instructions for both)

jen

ADD COMMENTlink written 7.1 years ago by Jennifer Hillman Jackson320
0
gravatar for Parit
7.5 years ago by
Parit0
Parit0 wrote:

for converting fastq to fasta check out fastx toolkit

ADD COMMENTlink written 7.5 years ago by Parit0
1

Fastx does not work with multiline fastq. It does not accept lowercase bases, either. No, it would not work. Actually they should use my fasta/q parser.

ADD REPLYlink written 7.4 years ago by lh331k

This is true in my experience as well.

ADD REPLYlink written 7.3 years ago by Casey Bergman17k
0
gravatar for Gabeloooooo
5.7 years ago by
Gabeloooooo0 wrote:

If using data from 1000 genomes, you basically have a BAM file and an FA reference fasta file.

If you want to get a consensus chromosome FASTA out of this data, my understanding was to use the following command: samtools mpileup -uf ref.fa aln.bam | bcftools view -cg - | vcfutils.pl vcf2fq > cns.fq

And then figure out how to convert fq to fasta.

But, what if you just want to use chromosome 2 (instead of all of them) for example? Is there a way to process only that?

ADD COMMENTlink written 5.7 years ago by Gabeloooooo0

Hello, you can use the seqret tools from emboss: seqret -sformat fastq-sanger -sequence yoursequence.fq -osformat fasta -outseq yoursequence.fasta

ADD REPLYlink written 5.3 years ago by overlaine0

Using seqtk is simpler and faster: seqtk subseq in.fq list.txt > out.fq

ADD REPLYlink written 5.3 years ago by lh331k
0
gravatar for mobiusklein
4.5 years ago by
mobiusklein150
United States
mobiusklein150 wrote:

To parse samtools's consesus fastq, which breaks the four line convention, you can use the symmetry between the nucleotide sequence lines and the quality lines. There must be the same number of lines in each section, and we know when the quality lines start appearing (after the + line). So by using a bit of statefulness in the parsing routine, we can accumulate a number of lines in the sequence region, and then use that information to know how many of the following lines to take and treat as quality scores. Here's the rough outline in Python:

    def parse_file(self):
    defline = ''
    sequence = ''
    qual = ''
    count = 0
    state = 'start'
    for line in open(self.file_path, 'r'):
        match = re.search(r'^@(?P<defline>[^@]+)\n', line)
        if state == 'qual' and count > 0:
            qual += line
            count-=1
        elif match:
            if defline != '':
                self.process_record(defline, sequence, qual)
            defline = match.groupdict()['defline']
            sequence = ''
            qual = ''
            state = 'sequence'
        elif re.search(r'\+\n', line):
            state = 'qual'
        else:
            if state == 'sequence':
                sequence += line
                count += 1

    self.process_record(defline, sequence, qual)
    self.parsed = True
ADD COMMENTlink written 4.5 years ago by mobiusklein150
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 470 users visited in the last hour