I am trying to do genotyping (DNAseq samples) and allele specific expression (RNAseq samples) of a recently assembled plant (alfalfa (Medicago sativa L.)), nature paper here , this is a tetraploid species so everything gets a little more complicated.
Instead of having a reference genome and a VCF, the authors generated an allele-aware chromosome-level genome assembly for the cultivated alfalfa consisting of 32 allelic chromosomes (8 chr x 4 copies), chrom lengths:
>chr1.1 82459472 >chr1.2 86910131 >chr1.3 79881340 >chr1.4 88815615 >chr2.1 76462061 >chr2.2 74215936 >chr2.3 76375162 >chr2.4 76750018 >chr3.1 93149498 >chr3.2 93375939 >chr3.3 96157931 >chr3.4 100414524 >chr4.1 90245664 >chr4.2 93947428 >chr4.3 90228617 >chr4.4 90896203 >chr5.1 81211777 >chr5.2 84165483 >chr5.3 80712490 >chr5.4 78626892 >chr6.1 80303593 >chr6.2 89579199 >chr6.3 84649260 >chr6.4 64534737 >chr7.1 88407277 >chr7.2 93528358 >chr7.3 91580142 >chr7.4 94657719 >chr8.1 87242343 >chr8.2 84274390 >chr8.3 82440740 >chr8.4 81801543
I recently found vg tools and I thing it solves most of the problems I have but I am not sure how to make the analysis. Here is what I have now:
1.- Generate a graph for each chromosome set and then combine them.
Based on the github page I though about something like:
./vg msga -f chr2.fasta -t 50 -k 16 -D -A | ./vg mod -U 10 - -t 50 | ./vg mod -c - -t 50 > chr2.vg
But reading the wiki, a modification of this seems better:
vg msga -f MHC.fa \ # use MHC.fa as our input -b ref \ # the >ref sequence is our basis -B 128 \ # align 128-mer bands of the sequences -K 11 \ # use 16-mers as the basis for our indexing -X 2 \ # double twice in GCSA2, generating a 44-mer deBruijn graph -E 3 \ # remove edges from the graph that would allow us to cross 3 bifurcations in 11bp -G -S 0.95 \ # greedily accept alignments when they match at 95% of our maximum alignment score -H 5 \ # ignore MEMs which have more than 5 hits -D \ # basic debugging output, describing progression through the process >MHC.vg # write the output to MHC.vg vg mod -n MHC.vg | vg mod -X 32 - | vg ids -c - >MHC.norm.vg
How do I set parameters like
Now I would have to fuse/concatenate the 8 chromosome_graph.vg to get a genome_graph.vg using
2.- Augment the graph with info from DNA-seq samples
However reading the github page I am not sure how to do it.
3.- Map RNA-seq samples to augmented graph
I am thinking in a modification of this:
# map reads against the graph to get a GAM vg map -T x.sim.txt -x x.xg -g x.gcsa > aln.gam # surject the alignments back into the reference space of sequence "x", yielding a BAM file vg surject -x x.xg -b aln.gam > aln.bam # or alternatively, surject them to BAM in the call to map vg map -T x.sim.txt -x x.xg -g x.gcsa --surject-to bam > aln.bam
4.- Perform allele specific expression
My idea was to get a VCF from the augmented graph (aug.graph.vcf), but reading the Calling variants using read support section I don't really know how to do it. Next I would extract a linear reference (aug.graph.linear.fasta) from the graph to use GATK ASEReadCounter using the bam from step 3 (aln.bam) as below:
gatk ASEReadCounter -R aug.graph.linear.fasta -I aln.bam -V aug.graph.vcf -O output.table
Thanks for getting to here, I have 3 general questions:
1.- Do you find this approach suitable? would it be better to focus only in coding regions?
2.- I have a modest server, not a cluster or supercomputer, with ~ 200GB RAM and ~40 processors, will I be able to run vg tools for my samples in this machine?
3.- I found toil-vg which seems to make life easier reducing command calls and parallelize vg tools. Is this something I could use in my machine?
4.- At the end of the analysis I need to check gene expression so I would need to add gene annotations to the genome graph from a GFF of the assembled genome, is there a command for this? alternatively, is there a way to do something like a liftover between the published genome and the resulting graph?
Thank you very much!!!!