Question

Reconstruct SNP haplotypes from Illumina short reads

0

Entering edit mode

2.5 years ago

rio.simon56 ▴ 10

I have whole genome sequencing data (Illumina paired-end 150bp) for a diverse set of highly polyploid accessions with various levels of ploidy (2n=6-16x). I would like to know if there is a method to get sequencing depths for combination of alleles that are shared on the same read.

For example, for three SNPs with two alleles A and G within 100bp, a given individual could have have the following SNP genotyping :

SNP1: depth_A=40, depth_G=60
SNP2: depth_A=10, depth_G=90
SNP3: depth_A=50, depth_G=50

I would like to obtain the following haplotype genotyping:

Haplotype: depth_AAA=5, depth_AAG=2,...,depth_GGG=40

I hope this was clear, thanks for your help.

SNP haplotype • 1.5k views

ADD COMMENT • link 2.5 years ago by rio.simon56 ▴ 10

score 1 · Answer 1 · 2021-10-15

Not easy ! I think you'll have to write your own script, for example in pysam, after first filtering BAMs for a region of interest.

You can have a look at these tools for read-backed phasing, which go somewhat in your direction - but AFAIK are not appropriate for your non 2n ploidy samples.

https://github.com/secastel/phaser

https://github.com/whatshap/whatshap

score 1 · Answer 2 · 2021-10-15

I quickly wrote http://lindenb.github.io/jvarkit/BamToHaplotypes.html

the input is an indexed VCF (only diallelic SNP are considered) and a BAM file.

example:

$ java -jar dist/bam2haplotypes.jar -V src/test/resources/rotavirus_rf.vcf.gz src/test/resources/S5.bam

#CHROM  START   END COUNT   N-VARIANTS  (POS\tALT)+
RF03    1221    1242    11  2   1221    C   1242    C
RF03    1688    1708    5   2   1688    G   1708    T
RF04    1900    1920    4   2   1900    C   1920    A
RF06    517 543 9   2   517 C   543 G
RF06    668 695 4   2   668 G   695 T
RF08    926 992 2   2   926 C   992 G
RF09    294 317 6   2   294 T   317 A
RF10    139 175 1   2   139 T   175 G
RF10    139 175 3   2   139 T   175 C

in paired mode

samtools collate -O -u src/test/resources/S5.bam TMP | java -jar dist/bam2haplotypes.jar --paired -V src/test/resources/rotavirus_rf.vcf.gz

#CHROM  START   END COUNT   N-VARIANTS  (POS\tALT)+
RF02    251 578 1   2   251 A   578 G
RF03    1221    1688    1   2   1221    C   1688    G
RF03    1221    1242    7   2   1221    C   1242    C
RF03    1221    1688    1   2   1221    C   1688    G
RF03    1688    1708    1   2   1688    G   1708    T
RF03    1708    2150    1   2   1708    T   2150    T
RF03    1221    1708    1   3   1221    C   1688    G   1708    T
RF03    1221    1688    2   3   1221    C   1242    C   1688    G
RF03    1688    2150    1   3   1688    G   1708    T   2150    T
RF03    1221    1708    2   4   1221    C   1242    C   1688    G   1708    T
RF04    887 1241    1   2   887 A   1241    T
RF04    1900    1920    4   2   1900    C   1920    A
RF05    41  499 2   2   41  T   499 A
RF05    499 879 1   2   499 A   879 C
RF05    795 1297    2   2   795 A   1297    T
RF05    879 1297    2   2   879 C   1297    T
RF06    517 543 9   2   517 C   543 G
RF06    668 695 4   2   668 G   695 T
RF07    225 684 1   2   225 C   684 G
RF07    225 684 1   2   225 C   684 T
RF08    926 992 2   2   926 C   992 G
RF09    294 317 6   2   294 T   317 A
RF10    139 175 1   2   139 T   175 G
RF10    139 175 3   2   139 T   175 C

I didn't write any test, so tell me if you think something is wrong.