Question: HLA typing WGS data
gravatar for emblake
3.6 years ago by
United States
emblake60 wrote:

I have HLA typed whole genome sequencing data using HLA-VBSeq. Reads were first aligned to hg19 with BWA-MEM, then reads aligning to all HLA loci and unmapped reads were extracted from the bam file using samtools. The extracted reads were then combined and re-aligned to the collection of all the genomic HLA allele sequences in the IMGT/HLA database using BWA-MEM, allowing for multiple alignments to the reference sequences.

Now, I would like to subset the reads of a specific HLA-E gene allele, however, using samtools view input.bam HLA:HLA00936, I see that all of the sequences have a MAPQ score of 0. I know that this is specific to BWA and indicates that the reads are mapping to multiple locations. Does this mean that the alignment results for this allele are insignificant and should be disregarded? The researcher interested in this allele would like to know the HLA-E sequences for gene editing purposes, but I feel as though the coverage is far too low. Any input or advice is much appreciated.

bwa samtools hla wgs • 2.0k views
ADD COMMENTlink modified 3.6 years ago by dario.garvan480 • written 3.6 years ago by emblake60
gravatar for dario.garvan
3.6 years ago by
dario.garvan480 wrote:

If you have multiple inferred HLA-E alleles, then it is expected to have most reads mapping to multiple sequences, because HLA-E is one of the least polymorphic HLA genes. Many sequence segments will be identical between the alleles. You can get the nucleotide sequences of the HLA-E alleles as a FASTA file and see how similar they are with the multiple sequence alignment tool MUSCLE (choose HTML output format). You must have some reads mapping uniquely to the allele you are investigating because you inferred that the allele exists from the data.

ADD COMMENTlink written 3.6 years ago by dario.garvan480

Thanks for your input. The estimated HLA type from the software shows my sample is heterozygous E01:03:01:01/E01:01:01:01. I subset the reads matching E*01:03:01:01 by samtools view -h input.bam HLA:HLA00936 > HLA00936.bam and then used MUSCLE to align those sequences against the nuclear coding sequences of HLA-E alleles, as you suggested:

I also aligned the sequences from the bam against E*01:03:01:01 fasta: Do you have any input as to how to retrieve the unique sequences? I'm not exactly sure how the algorithm is making the allele calls, but the developer has stated that the read alignment between similar HLA alleles needs to be optimized.

ADD REPLYlink modified 3.6 years ago • written 3.6 years ago by emblake60

That's not what I suggested. I suggested that you take the two HLA alleles and see how similar they are to each other with MUSCLE. Using E01:03:01:01 and E01:01:01:01 shows that they are both 1077 nucleotides long and that the only sequence difference between them is one SNP. Therefore, I'd expect almost all genome sequencing reads (assuming you have short reads such as 100 bp) mapping to HLA-E would be classified as multi-mapping because 1076 nucleotides are the same for both varieties. That explains the reads with MAPQ of 0. You can simply count the number of reads overlapping the A SNP (in HLA00934) and the G SNP (in HLA00936) to get your HLA-E allele frequencies, since that single nucleotide difference is specifically what defines them.

ADD REPLYlink written 3.6 years ago by dario.garvan480

I misunderstood - my mistake. That makes sense. Thank you.

ADD REPLYlink written 3.6 years ago by emblake60
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1401 users visited in the last hour