Name of the consensus fasta sequence from a Minimap + Samtools alignment
1
1
Entering edit mode
3.4 years ago
lordbleys ▴ 10

Hello,

I am assembling a virus genome from IonTorrent reads based on a reference fasta sequence, using a combination of Minimap2, Samtools, bcftools and such.

The problem is: my final consensus fasta sequence is always named after the reference sequence.

I want to have the name of the fasta sequence to be the one of the original sequence file, not the reference.

Here is the script I use:

# align reads to reference sequence with minimap2
minimap2 -ax sr reference.fasta SAMPLE1_trimmed.fastq > SAMPLE1_aln.sam

# Convert sam to bam
samtools view -S -b SAMPLE1_aln.sam > SAMPLE1_aln.bam

# Sort the alignment
samtools sort SAMPLE1_aln.bam -o SAMPLE1_aln_sorted.bam

# Get consensus fastq file
samtools mpileup -uf reference.fasta SAMPLE1_aln_sorted.bam | bcftools call -c | vcfutils.pl vcf2fq > SAMPLE1_consensus.fastq

# Convert .fastq to .fasta
seqtk seq -aQ64 -q13 -n N SAMPLE1_consensus.fastq > SAMPLE1.fasta

In this example, the fasta sequence in the SAMPLE1.fasta file should be named ">SAMPLE1", but it is always named ">reference".

I checked quite a few forums and the manuals of each software, but I can't figure out what to do. Indeed most of the questions relating to this problem deal with multiple sequences inside an alignment (chromosomes, multiple samples, etc). In my case I just have 1 sample, 1 sequence (the final consensus from the alignment) so I don't know where the name is fetched. And since I am running this job in parallel for multiple samples (each producing a unique final sequence), I want to find a solution more elegant than simply renaming the sequences 1 by 1...

Help me Obi-Wan Kenobi, you are my only hope.

Seb

alignment Assembly sequencing samtools fasta • 2.0k views
ADD COMMENT
0
Entering edit mode
3.4 years ago

Renaming fasta ids is probably straightforward to automate:

cat foo.fa
>foo
AAAAA

then:

cat foo.fa | seqret --filter -sid bar
>bar
AAAAA

or:

cat foo.fa | seqkit replace -p foo -r bar
>bar
AAAAA

and probably many other solutions exist

ADD COMMENT
0
Entering edit mode

Thanks! I came to the same conclusion and made a little awk script to use the name of the final output file as the name of the fasta sequence. However I was hoping there would be a direct option in samtools or minimap to adress the issue.

ADD REPLY

Login before adding your answer.

Traffic: 2553 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6