Question

Name of the consensus fasta sequence from a Minimap + Samtools alignment

1

Entering edit mode

3.4 years ago

lordbleys ▴ 10

Hello,

I am assembling a virus genome from IonTorrent reads based on a reference fasta sequence, using a combination of Minimap2, Samtools, bcftools and such.

The problem is: my final consensus fasta sequence is always named after the reference sequence.

I want to have the name of the fasta sequence to be the one of the original sequence file, not the reference.

Here is the script I use:

# align reads to reference sequence with minimap2
minimap2 -ax sr reference.fasta SAMPLE1_trimmed.fastq > SAMPLE1_aln.sam

# Convert sam to bam
samtools view -S -b SAMPLE1_aln.sam > SAMPLE1_aln.bam

# Sort the alignment
samtools sort SAMPLE1_aln.bam -o SAMPLE1_aln_sorted.bam

# Get consensus fastq file
samtools mpileup -uf reference.fasta SAMPLE1_aln_sorted.bam | bcftools call -c | vcfutils.pl vcf2fq > SAMPLE1_consensus.fastq

# Convert .fastq to .fasta
seqtk seq -aQ64 -q13 -n N SAMPLE1_consensus.fastq > SAMPLE1.fasta

In this example, the fasta sequence in the SAMPLE1.fasta file should be named ">SAMPLE1", but it is always named ">reference".

I checked quite a few forums and the manuals of each software, but I can't figure out what to do. Indeed most of the questions relating to this problem deal with multiple sequences inside an alignment (chromosomes, multiple samples, etc). In my case I just have 1 sample, 1 sequence (the final consensus from the alignment) so I don't know where the name is fetched. And since I am running this job in parallel for multiple samples (each producing a unique final sequence), I want to find a solution more elegant than simply renaming the sequences 1 by 1...

Help me Obi-Wan Kenobi, you are my only hope.

Seb

alignment Assembly sequencing samtools fasta • 2.0k views

ADD COMMENT • link updated 3.4 years ago by Istvan Albert 100k • written 3.4 years ago by lordbleys ▴ 10

score 0 · Answer 1 · 2020-11-12

0

Entering edit mode

3.4 years ago

Istvan Albert 100k

Renaming fasta ids is probably straightforward to automate:

cat foo.fa
>foo
AAAAA

then:

cat foo.fa | seqret --filter -sid bar
>bar
AAAAA

or:

cat foo.fa | seqkit replace -p foo -r bar
>bar
AAAAA

and probably many other solutions exist

ADD COMMENT • link 3.4 years ago by Istvan Albert 100k

0

Entering edit mode

Thanks! I came to the same conclusion and made a little awk script to use the name of the final output file as the name of the fasta sequence. However I was hoping there would be a direct option in samtools or minimap to adress the issue.

ADD REPLY • link 3.4 years ago by lordbleys ▴ 10