Hello,
I am assembling a virus genome from IonTorrent reads based on a reference fasta sequence, using a combination of Minimap2, Samtools, bcftools and such.
The problem is: my final consensus fasta sequence is always named after the reference sequence.
I want to have the name of the fasta sequence to be the one of the original sequence file, not the reference.
Here is the script I use:
# align reads to reference sequence with minimap2
minimap2 -ax sr reference.fasta SAMPLE1_trimmed.fastq > SAMPLE1_aln.sam
# Convert sam to bam
samtools view -S -b SAMPLE1_aln.sam > SAMPLE1_aln.bam
# Sort the alignment
samtools sort SAMPLE1_aln.bam -o SAMPLE1_aln_sorted.bam
# Get consensus fastq file
samtools mpileup -uf reference.fasta SAMPLE1_aln_sorted.bam | bcftools call -c | vcfutils.pl vcf2fq > SAMPLE1_consensus.fastq
# Convert .fastq to .fasta
seqtk seq -aQ64 -q13 -n N SAMPLE1_consensus.fastq > SAMPLE1.fasta
In this example, the fasta sequence in the SAMPLE1.fasta file should be named ">SAMPLE1", but it is always named ">reference".
I checked quite a few forums and the manuals of each software, but I can't figure out what to do. Indeed most of the questions relating to this problem deal with multiple sequences inside an alignment (chromosomes, multiple samples, etc). In my case I just have 1 sample, 1 sequence (the final consensus from the alignment) so I don't know where the name is fetched. And since I am running this job in parallel for multiple samples (each producing a unique final sequence), I want to find a solution more elegant than simply renaming the sequences 1 by 1...
Help me Obi-Wan Kenobi, you are my only hope.
Seb
Thanks! I came to the same conclusion and made a little awk script to use the name of the final output file as the name of the fasta sequence. However I was hoping there would be a direct option in samtools or minimap to adress the issue.