Question: Name of the consensus fasta sequence from a Minimap + Samtools alignment
0
gravatar for lordbleys
10 weeks ago by
lordbleys0
lordbleys0 wrote:

Hello,

I am assembling a virus genome from IonTorrent reads based on a reference fasta sequence, using a combination of Minimap2, Samtools, bcftools and such.

The problem is: my final consensus fasta sequence is always named after the reference sequence.

I want to have the name of the fasta sequence to be the one of the original sequence file, not the reference.

Here is the script I use:

# align reads to reference sequence with minimap2
minimap2 -ax sr reference.fasta SAMPLE1_trimmed.fastq > SAMPLE1_aln.sam

# Convert sam to bam
samtools view -S -b SAMPLE1_aln.sam > SAMPLE1_aln.bam

# Sort the alignment
samtools sort SAMPLE1_aln.bam -o SAMPLE1_aln_sorted.bam

# Get consensus fastq file
samtools mpileup -uf reference.fasta SAMPLE1_aln_sorted.bam | bcftools call -c | vcfutils.pl vcf2fq > SAMPLE1_consensus.fastq

# Convert .fastq to .fasta
seqtk seq -aQ64 -q13 -n N SAMPLE1_consensus.fastq > SAMPLE1.fasta

In this example, the fasta sequence in the SAMPLE1.fasta file should be named ">SAMPLE1", but it is always named ">reference".

I checked quite a few forums and the manuals of each software, but I can't figure out what to do. Indeed most of the questions relating to this problem deal with multiple sequences inside an alignment (chromosomes, multiple samples, etc). In my case I just have 1 sample, 1 sequence (the final consensus from the alignment) so I don't know where the name is fetched. And since I am running this job in parallel for multiple samples (each producing a unique final sequence), I want to find a solution more elegant than simply renaming the sequences 1 by 1...

Help me Obi-Wan Kenobi, you are my only hope.

Seb

ADD COMMENTlink modified 10 weeks ago by Istvan Albert ♦♦ 86k • written 10 weeks ago by lordbleys0
0
gravatar for Istvan Albert
10 weeks ago by
Istvan Albert ♦♦ 86k
University Park, USA
Istvan Albert ♦♦ 86k wrote:

Renaming fasta ids is probably straightforward to automate:

cat foo.fa
>foo
AAAAA

then:

cat foo.fa | seqret --filter -sid bar
>bar
AAAAA

or:

cat foo.fa | seqkit replace -p foo -r bar
>bar
AAAAA

and probably many other solutions exist

ADD COMMENTlink written 10 weeks ago by Istvan Albert ♦♦ 86k

Thanks! I came to the same conclusion and made a little awk script to use the name of the final output file as the name of the fasta sequence. However I was hoping there would be a direct option in samtools or minimap to adress the issue.

ADD REPLYlink written 10 weeks ago by lordbleys0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1700 users visited in the last hour
_