Question

Best strategy starting from allelic chromosomes instead of reference plus VCF

0

Entering edit mode

3.6 years ago

alfonsosaera ▴ 50

Hi all,

I am trying to do genotyping (DNAseq samples) and allele specific expression (RNAseq samples) of a recently assembled plant (alfalfa (Medicago sativa L.)), nature paper here , this is a tetraploid species so everything gets a little more complicated.

Instead of having a reference genome and a VCF, the authors generated an allele-aware chromosome-level genome assembly for the cultivated alfalfa consisting of 32 allelic chromosomes (8 chr x 4 copies), chrom lengths:

>chr1.1
82459472
>chr1.2
86910131
>chr1.3
79881340
>chr1.4
88815615
>chr2.1
76462061
>chr2.2
74215936
>chr2.3
76375162
>chr2.4
76750018
>chr3.1
93149498
>chr3.2
93375939
>chr3.3
96157931
>chr3.4
100414524
>chr4.1
90245664
>chr4.2
93947428
>chr4.3
90228617
>chr4.4
90896203
>chr5.1
81211777
>chr5.2
84165483
>chr5.3
80712490
>chr5.4
78626892
>chr6.1
80303593
>chr6.2
89579199
>chr6.3
84649260
>chr6.4
64534737
>chr7.1
88407277
>chr7.2
93528358
>chr7.3
91580142
>chr7.4
94657719
>chr8.1
87242343
>chr8.2
84274390
>chr8.3
82440740
>chr8.4
81801543

I recently found vg tools and I thing it solves most of the problems I have but I am not sure how to make the analysis. Here is what I have now:

1.- Generate a graph for each chromosome set and then combine them.

Based on the github page I though about something like:

./vg msga -f chr2.fasta -t 50 -k 16 -D -A | ./vg mod -U 10 - -t 50 | ./vg mod -c - -t 50 > chr2.vg

But reading the wiki, a modification of this seems better:

vg msga -f MHC.fa \ # use MHC.fa as our input
    -b ref \        # the >ref sequence is our basis
    -B 128 \        # align 128-mer bands of the sequences
    -K 11 \         # use 16-mers as the basis for our indexing
    -X 2 \          # double twice in GCSA2, generating a 44-mer deBruijn graph
    -E 3 \          # remove edges from the graph that would allow us to cross 3 bifurcations in 11bp
    -G -S 0.95 \    # greedily accept alignments when they match at 95% of our maximum alignment score
    -H 5 \          # ignore MEMs which have more than 5 hits
    -D \            # basic debugging output, describing progression through the process
    >MHC.vg         # write the output to MHC.vg

vg mod -n MHC.vg | vg mod -X 32 - | vg ids -c - >MHC.norm.vg

How do I set parameters like -B, -K, -E ...?

Now I would have to fuse/concatenate the 8 chromosome_graph.vg to get a genome_graph.vg using vg concat

2.- Augment the graph with info from DNA-seq samples

However reading the github page I am not sure how to do it.

3.- Map RNA-seq samples to augmented graph

I am thinking in a modification of this:

# map reads against the graph to get a GAM
vg map -T x.sim.txt -x x.xg -g x.gcsa > aln.gam

# surject the alignments back into the reference space of sequence "x", yielding a BAM file
vg surject -x x.xg -b aln.gam > aln.bam

# or alternatively, surject them to BAM in the call to map
vg map -T x.sim.txt -x x.xg -g x.gcsa --surject-to bam > aln.bam

4.- Perform allele specific expression

My idea was to get a VCF from the augmented graph (aug.graph.vcf), but reading the Calling variants using read support section I don't really know how to do it. Next I would extract a linear reference (aug.graph.linear.fasta) from the graph to use GATK ASEReadCounter using the bam from step 3 (aln.bam) as below:

gatk ASEReadCounter -R aug.graph.linear.fasta -I aln.bam -V aug.graph.vcf -O output.table

Thanks for getting to here, I have 3 general questions:

1.- Do you find this approach suitable? would it be better to focus only in coding regions?

2.- I have a modest server, not a cluster or supercomputer, with ~ 200GB RAM and ~40 processors, will I be able to run vg tools for my samples in this machine?

3.- I found toil-vg which seems to make life easier reducing command calls and parallelize vg tools. Is this something I could use in my machine?

4.- At the end of the analysis I need to check gene expression so I would need to add gene annotations to the genome graph from a GFF of the assembled genome, is there a command for this? alternatively, is there a way to do something like a liftover between the published genome and the resulting graph?

Thank you very much!!!!

Alfonso

vg • 829 views

ADD COMMENT • link 3.6 years ago by alfonsosaera ▴ 50