I am in the process of finishing a small script which I will use to determine the genotype of Hepatitis C FASTQ samples.
Hepatitis C has 7 known genotypes. I would like to be able to feed FASTQ files to the script and they are aligned to all the 7 genotypes, and then obtain the most likely genotype depending on the one with the highest score.
In order to align to multiple references at once, I decide to use one reference genome FASTA file containing all the 7 HepC reference genotypes together. [Does this make sense?]
The intermediate files seem to be working fine. I get different VCF files for different FASTQ files I input and I seem to be getting slightly different consensus sequences.
In order to obtain the alignment scores, I take the consensus sequence obtained and compare it with the original composite reference FASTA using BLAST [make blastdb]
I have written the script and everything seems to run OK. However, when I tested it with multiple known samples (of varying genotypes) I keep getting results as Genotype 1 (clearly wrong as samples verified with other systems show it to be Genotype 3).
Is there a better way to compare with multiple references at once in BASH? Or is there something I can fix/check in my script to make it work?
Attached is the code I used in BASH:
echo "Welcome to the test Hepatitis C genotyping pipeline" #Notes down start time start=$SECONDS #Enter samples numbers here #SAMPLE_LIST="HCV176 HCV242" #Allows user to enter sample numbers as arguments after enter script name for SAMPLE_ID in $@ do #Index the aligner with the reference genomes [all of them in one composite file] bwa index -a is HCV.fasta # Align reads 1 of samples to reference genome bwa aln HCV.fasta $SAMPLE_ID.1.fastq >$SAMPLE_ID.1.sai # Align reads 2 of samples to reference genome bwa aln HCV.fasta $SAMPLE_ID.2.fastq >$SAMPLE_ID.2.sai # Merge and change into SAM format for both samples bwa sampe HCV.fasta $SAMPLE_ID.1.sai $SAMPLE_ID.2.sai $SAMPLE_ID.1.fastq $SAMPLE_ID.2.fastq > $SAMPLE_ID.sam # Change the SAM to BAM format for both samples samtools view -bt HCV.fasta $SAMPLE_ID.sam>$SAMPLE_ID.bam # Sort the BAM files samtools sort $SAMPLE_ID.bam -o $SAMPLE_ID.sorted.bam # Create the index file for the BAM files samtools index -b $SAMPLE_ID.sorted.bam #Create mpileup file and VCF file containing SNPs samtools mpileup -IEuDSf HCV.fasta $SAMPLE_ID.sorted.bam | bcftools view -v snps - > $SAMPLE_ID.vcf #Compress the VCF with tabix and its index bgzip -c $SAMPLE_ID.vcf > $SAMPLE_ID.vcf.gz tabix -p vcf $SAMPLE_ID.vcf.gz # Create a consensus sequence in FASTA format from newly-created VCF and the reference sequence cat HCV.fasta | bcftools consensus $SAMPLE_ID.vcf.gz > $SAMPLE_ID-consensus.fa done # Run BLAST and compare consensus sequence to the different HCV genotypes makeblastdb -in HCV.fasta -dbtype nucl blastn -query $SAMPLE_ID-consensus.fa -db HCV.fasta > $SAMPLE_ID-output.txt #Produce running time of script end=$SECONDS duration=$(( end - start )) echo "The process took $end seconds to complete"