Question

How do you align to multiple reference genomes?

1

Entering edit mode

5.9 years ago

owenz ▴ 20

I am in the process of finishing a small script which I will use to determine the genotype of Hepatitis C FASTQ samples.

Hepatitis C has 7 known genotypes. I would like to be able to feed FASTQ files to the script and they are aligned to all the 7 genotypes, and then obtain the most likely genotype depending on the one with the highest score.

In order to align to multiple references at once, I decide to use one reference genome FASTA file containing all the 7 HepC reference genotypes together. [Does this make sense?]

The intermediate files seem to be working fine. I get different VCF files for different FASTQ files I input and I seem to be getting slightly different consensus sequences.

In order to obtain the alignment scores, I take the consensus sequence obtained and compare it with the original composite reference FASTA using BLAST [make blastdb]

I have written the script and everything seems to run OK. However, when I tested it with multiple known samples (of varying genotypes) I keep getting results as Genotype 1 (clearly wrong as samples verified with other systems show it to be Genotype 3).

Is there a better way to compare with multiple references at once in BASH? Or is there something I can fix/check in my script to make it work?

Attached is the code I used in BASH:

echo "Welcome to the test Hepatitis C genotyping pipeline"

#Notes down start time
start=$SECONDS

#Enter samples numbers here
#SAMPLE_LIST="HCV176 HCV242"

#Allows user to enter sample numbers as arguments after enter script name
for SAMPLE_ID in $@
do
    #Index the aligner with the reference genomes [all of them in one composite file]
    bwa index -a is HCV.fasta
    # Align reads 1 of samples to reference genome
    bwa aln HCV.fasta $SAMPLE_ID.1.fastq >$SAMPLE_ID.1.sai
    # Align reads 2 of samples to reference genome
    bwa aln HCV.fasta $SAMPLE_ID.2.fastq >$SAMPLE_ID.2.sai
    # Merge and change into SAM format for both samples
    bwa sampe HCV.fasta $SAMPLE_ID.1.sai $SAMPLE_ID.2.sai $SAMPLE_ID.1.fastq $SAMPLE_ID.2.fastq > $SAMPLE_ID.sam
    # Change the SAM to BAM format for both samples
    samtools view -bt HCV.fasta $SAMPLE_ID.sam>$SAMPLE_ID.bam
    # Sort the BAM files
    samtools sort $SAMPLE_ID.bam -o $SAMPLE_ID.sorted.bam
    # Create the index file for the BAM files
    samtools index -b  $SAMPLE_ID.sorted.bam

#Create mpileup file and VCF file containing SNPs
samtools mpileup -IEuDSf HCV.fasta $SAMPLE_ID.sorted.bam | bcftools view -v snps - > $SAMPLE_ID.vcf

#Compress the VCF with tabix and its index
bgzip -c $SAMPLE_ID.vcf > $SAMPLE_ID.vcf.gz
tabix -p vcf $SAMPLE_ID.vcf.gz

# Create a consensus sequence in FASTA format from newly-created VCF and the reference sequence
cat HCV.fasta | bcftools consensus $SAMPLE_ID.vcf.gz > $SAMPLE_ID-consensus.fa
done

# Run BLAST and compare consensus sequence to the different HCV genotypes
makeblastdb -in HCV.fasta -dbtype nucl
blastn -query $SAMPLE_ID-consensus.fa -db HCV.fasta > $SAMPLE_ID-output.txt

#Produce running time of script
end=$SECONDS
duration=$(( end - start ))
echo "The process took $end seconds to complete"

next-gen sequencing hepatitis alignment • 4.1k views

ADD COMMENT • link 5.9 years ago by owenz ▴ 20

0

Entering edit mode

Can you clarify if you expect to have multiple genotypes in the same sample or will each sample have a unique genotype? If latter, you should align the samples to the references independently.

If it is the former then your task would be significantly difficult unless you have UMI's incorporated into your reads. Once you offer a clarification I may have more to add.

ADD REPLY • link 5.9 years ago by GenoMax 142k

0

Entering edit mode

Most samples tend to have one major genotype but it is not uncommon to have 2 genotypes.

In the case of a unique genotype, how do I align independently? I was trying to implement running multiple scripts (one for each reference) placed in a master script but I was getting a bunch of errors. I tried something like

do
bash hcvscript1.sh &
bash hcvscript2.sh &
bash hcvscript3.sh
done

In the case of multiple genotypes, what would the ideal procedure be, if UMI's are present

ADD REPLY • link 5.9 years ago by owenz ▴ 20

0

Entering edit mode

You will need to be mindful of how many subprocesses can be spawned by the scripts, but the way to do this is best via GNU parallel

ADD REPLY • link 5.9 years ago by Joe 21k

0

Entering edit mode

I heard about GNU parallel but haven't really tested it out. Thanks for the suggestion

ADD REPLY • link 5.9 years ago by owenz ▴ 20

0

Entering edit mode

Sometimes doing something simple is the way to go, If you expect one or two genotypes to be prevalent then you should still align the data independently. Aligning to multiple almost identical genomes is going to cause issues with multi-mapping (bwa is going to have a default number of alignments it is going to report and then stop).

If you have UMI's then you could use an aligner that allows an unlimited number of mappings (to not lose any hits). Extract the reads that map over the gene you are interested in. Sort them via UMI's and then realign to individual genes (from 7 types) to confirm genotypes. This is just off top of my head will need to think about this some more.

There is no targeted capture/selection involved, is that correct?

ADD REPLY • link 5.9 years ago by GenoMax 142k

0

Entering edit mode

I see. Sounds complicated for me to implement.

No there is no target capture involved.

Thanks for your help

ADD REPLY • link 5.9 years ago by owenz ▴ 20

0

Entering edit mode

What determines the genotype (forgive me, I'm not a virologist)?

If its just a particular gene, I wouldn't align against the whole genome unless you need to. Seems a bit computationally intense unecessarily.

If it's not dictated by particular SNPs or alleles (anything that can change based on a single base pair), maybe you're looking for something like mash distances to see which ones your reads are most like (though I think this requires that you assemble the genomes so maybe isn't totally the same as what you're doing?)

ADD REPLY • link 5.9 years ago by Joe 21k

0

Entering edit mode

No worries. I am fairly green myself. The NS5B gene is most often used for genotyping HepC.

Different SNPs in the gene will affect the particular genotype [as far as I know]

ADD REPLY • link 5.9 years ago by owenz ▴ 20