Reconstruct SNP haplotypes from Illumina short reads
2
0
Entering edit mode
6 weeks ago
rio.simon56 ▴ 10

I have whole genome sequencing data (Illumina paired-end 150bp) for a diverse set of highly polyploid accessions with various levels of ploidy (2n=6-16x). I would like to know if there is a method to get sequencing depths for combination of alleles that are shared on the same read.

For example, for three SNPs with two alleles A and G within 100bp, a given individual could have have the following SNP genotyping :

• SNP1: depth_A=40, depth_G=60
• SNP2: depth_A=10, depth_G=90
• SNP3: depth_A=50, depth_G=50

I would like to obtain the following haplotype genotyping:

• Haplotype: depth_AAA=5, depth_AAG=2,...,depth_GGG=40

I hope this was clear, thanks for your help.

SNP haplotype • 487 views
1
Entering edit mode
6 weeks ago
colindaven ★ 3.3k

Not easy ! I think you'll have to write your own script, for example in pysam, after first filtering BAMs for a region of interest.

You can have a look at these tools for read-backed phasing, which go somewhat in your direction - but AFAIK are not appropriate for your non 2n ploidy samples.

https://github.com/secastel/phaser

https://github.com/whatshap/whatshap

1
Entering edit mode
6 weeks ago

I quickly wrote http://lindenb.github.io/jvarkit/BamToHaplotypes.html

the input is an indexed VCF (only diallelic SNP are considered) and a BAM file.

example:

\$ java -jar dist/bam2haplotypes.jar -V src/test/resources/rotavirus_rf.vcf.gz src/test/resources/S5.bam

#CHROM  START   END COUNT   N-VARIANTS  (POS\tALT)+
RF03    1221    1242    11  2   1221    C   1242    C
RF03    1688    1708    5   2   1688    G   1708    T
RF04    1900    1920    4   2   1900    C   1920    A
RF06    517 543 9   2   517 C   543 G
RF06    668 695 4   2   668 G   695 T
RF08    926 992 2   2   926 C   992 G
RF09    294 317 6   2   294 T   317 A
RF10    139 175 1   2   139 T   175 G
RF10    139 175 3   2   139 T   175 C


in paired mode

samtools collate -O -u src/test/resources/S5.bam TMP | java -jar dist/bam2haplotypes.jar --paired -V src/test/resources/rotavirus_rf.vcf.gz

#CHROM  START   END COUNT   N-VARIANTS  (POS\tALT)+
RF02    251 578 1   2   251 A   578 G
RF03    1221    1688    1   2   1221    C   1688    G
RF03    1221    1242    7   2   1221    C   1242    C
RF03    1221    1688    1   2   1221    C   1688    G
RF03    1688    1708    1   2   1688    G   1708    T
RF03    1708    2150    1   2   1708    T   2150    T
RF03    1221    1708    1   3   1221    C   1688    G   1708    T
RF03    1221    1688    2   3   1221    C   1242    C   1688    G
RF03    1688    2150    1   3   1688    G   1708    T   2150    T
RF03    1221    1708    2   4   1221    C   1242    C   1688    G   1708    T
RF04    887 1241    1   2   887 A   1241    T
RF04    1900    1920    4   2   1900    C   1920    A
RF05    41  499 2   2   41  T   499 A
RF05    499 879 1   2   499 A   879 C
RF05    795 1297    2   2   795 A   1297    T
RF05    879 1297    2   2   879 C   1297    T
RF06    517 543 9   2   517 C   543 G
RF06    668 695 4   2   668 G   695 T
RF07    225 684 1   2   225 C   684 G
RF07    225 684 1   2   225 C   684 T
RF08    926 992 2   2   926 C   992 G
RF09    294 317 6   2   294 T   317 A
RF10    139 175 1   2   139 T   175 G
RF10    139 175 3   2   139 T   175 C


I didn't write any test, so tell me if you think something is wrong.

0
Entering edit mode

Thanks a million for this! I think I would have needed several months to code something comparable.

I have tried to run it on a test VCF dataset of my own with one individual and two SNPs:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  Test_Ind
Sh05    1738002 .   T   C   .   .   AC=5;AN=12  GT:AD:DP:GC 0/0/0/0/0/0/0/1/1/1/1/1:122,89:211:168.534
Sh05    1738019 .   G   A   .   .   AC=3;AN=12  GT:AD:DP:GC 0/0/0/0/0/0/0/0/0/1/1/1:159,61:220:8.92207


With the proposed method, I obtain:

#CHROM  START   END     COUNT   N-VARIANTS      (POS\tALT)+
Sh05    1738002 1738019 78      2       1738002 C       1738019 G
Sh05    1738002 1738019 1       2       1738002 T       1738019 G
Sh05    1738002 1738019 1       2       1738002 T       1738019 A
Sh05    1738002 1738019 6       2       1738002 T       1738019 G
...
[several dozen lines]
...
Sh05    1738002 1738019 2       2       1738002 T       1738019 G
Sh05    1738002 1738019 2       2       1738002 T       1738019 A


There seems to be an issue in that it aggregates results only for the first category "CG". I don't know to which extent this is complicated to solve?

Otherwise the results are very comprehensive, thanks again.

1
Entering edit mode

After some exchanges with Pierre, the program has been updated and the problem is now solved:

#CHROM  START   END     COUNT   N-VARIANTS      (POS\tALT)+
Sh05    1738002 1738019 78      2       1738002 C       1738019 G
Sh05    1738002 1738019 53      2       1738002 T       1738019 A
Sh05    1738002 1738019 59      2       1738002 T       1738019 G


Thanks a lot to him!