I am in the process of finishing a small script which I will use to determine the genotype of Hepatitis C FASTQ samples.
Hepatitis C has 7 known genotypes. I would like to be able to feed FASTQ files to the script and they are aligned to all the 7 genotypes, and then obtain the most likely genotype depending on the one with the highest score.
In order to align to multiple references at once, I decide to use one reference genome FASTA file containing all the 7 HepC reference genotypes together. [Does this make sense?]
The intermediate files seem to be working fine. I get different VCF files for different FASTQ files I input and I seem to be getting slightly different consensus sequences.
In order to obtain the alignment scores, I take the consensus sequence obtained and compare it with the original composite reference FASTA using BLAST [make blastdb]
I have written the script and everything seems to run OK. However, when I tested it with multiple known samples (of varying genotypes) I keep getting results as Genotype 1 (clearly wrong as samples verified with other systems show it to be Genotype 3).
Is there a better way to compare with multiple references at once in BASH? Or is there something I can fix/check in my script to make it work?
Attached is the code I used in BASH:
echo "Welcome to the test Hepatitis C genotyping pipeline"
#Notes down start time
start=$SECONDS
#Enter samples numbers here
#SAMPLE_LIST="HCV176 HCV242"
#Allows user to enter sample numbers as arguments after enter script name
for SAMPLE_ID in $@
do
#Index the aligner with the reference genomes [all of them in one composite file]
bwa index -a is HCV.fasta
# Align reads 1 of samples to reference genome
bwa aln HCV.fasta $SAMPLE_ID.1.fastq >$SAMPLE_ID.1.sai
# Align reads 2 of samples to reference genome
bwa aln HCV.fasta $SAMPLE_ID.2.fastq >$SAMPLE_ID.2.sai
# Merge and change into SAM format for both samples
bwa sampe HCV.fasta $SAMPLE_ID.1.sai $SAMPLE_ID.2.sai $SAMPLE_ID.1.fastq $SAMPLE_ID.2.fastq > $SAMPLE_ID.sam
# Change the SAM to BAM format for both samples
samtools view -bt HCV.fasta $SAMPLE_ID.sam>$SAMPLE_ID.bam
# Sort the BAM files
samtools sort $SAMPLE_ID.bam -o $SAMPLE_ID.sorted.bam
# Create the index file for the BAM files
samtools index -b $SAMPLE_ID.sorted.bam
#Create mpileup file and VCF file containing SNPs
samtools mpileup -IEuDSf HCV.fasta $SAMPLE_ID.sorted.bam | bcftools view -v snps - > $SAMPLE_ID.vcf
#Compress the VCF with tabix and its index
bgzip -c $SAMPLE_ID.vcf > $SAMPLE_ID.vcf.gz
tabix -p vcf $SAMPLE_ID.vcf.gz
# Create a consensus sequence in FASTA format from newly-created VCF and the reference sequence
cat HCV.fasta | bcftools consensus $SAMPLE_ID.vcf.gz > $SAMPLE_ID-consensus.fa
done
# Run BLAST and compare consensus sequence to the different HCV genotypes
makeblastdb -in HCV.fasta -dbtype nucl
blastn -query $SAMPLE_ID-consensus.fa -db HCV.fasta > $SAMPLE_ID-output.txt
#Produce running time of script
end=$SECONDS
duration=$(( end - start ))
echo "The process took $end seconds to complete"
Can you clarify if you expect to have multiple genotypes in the same sample or will each sample have a unique genotype? If latter, you should align the samples to the references independently.
If it is the former then your task would be significantly difficult unless you have UMI's incorporated into your reads. Once you offer a clarification I may have more to add.
Most samples tend to have one major genotype but it is not uncommon to have 2 genotypes.
In the case of a unique genotype, how do I align independently? I was trying to implement running multiple scripts (one for each reference) placed in a master script but I was getting a bunch of errors. I tried something like
In the case of multiple genotypes, what would the ideal procedure be, if UMI's are present
You will need to be mindful of how many subprocesses can be spawned by the scripts, but the way to do this is best via
GNU parallel
I heard about GNU parallel but haven't really tested it out. Thanks for the suggestion
Sometimes doing something simple is the way to go, If you expect one or two genotypes to be prevalent then you should still align the data independently. Aligning to multiple almost identical genomes is going to cause issues with multi-mapping (
bwa
is going to have a default number of alignments it is going to report and then stop).If you have UMI's then you could use an aligner that allows an unlimited number of mappings (to not lose any hits). Extract the reads that map over the gene you are interested in. Sort them via UMI's and then realign to individual genes (from 7 types) to confirm genotypes. This is just off top of my head will need to think about this some more.
There is no targeted capture/selection involved, is that correct?
I see. Sounds complicated for me to implement.
No there is no target capture involved.
Thanks for your help
What determines the genotype (forgive me, I'm not a virologist)?
If its just a particular gene, I wouldn't align against the whole genome unless you need to. Seems a bit computationally intense unecessarily.
If it's not dictated by particular SNPs or alleles (anything that can change based on a single base pair), maybe you're looking for something like
mash
distances to see which ones your reads are most like (though I think this requires that you assemble the genomes so maybe isn't totally the same as what you're doing?)No worries. I am fairly green myself. The NS5B gene is most often used for genotyping HepC.
Different SNPs in the gene will affect the particular genotype [as far as I know]