Question: Genotyping by sequencing in Arundo donax
fvalli84 (Italy) wrote, 3.4 years ago:

Hi all,

I’m working with Arundo donax, a polyploid species that does not produce viable seeds and has very low genetic variability. To increase the genetic variability, I produced almost 1,000 independent mutants using gamma rays and fast neutrons as irradiation sources.

What we would like to do is a kind of molecular fingerprinting of each mutant using Illumina technology. I ran genotyping by sequencing (GBS) and built consensus tags de novo with STACKS (this species has no reference sequence and is polyploid).

So my question is whether anyone has any hints on how to use the GBS data to characterize the mutants.

I was thinking, for example, of counting the reads that align to a given locus in the reference genome of a related species, and checking whether the differences in read counts among mutants could be attributed to deletions caused by the mutagenic treatment.
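A minimal sketch of that read-count idea, in Python. Everything here is an assumption for illustration: the input layout (a dict of per-sample, per-locus read counts), the function name, and both thresholds. It normalizes each sample to counts per million, compares each locus to the median across all mutants, and flags loci that fall well below it.

```python
from math import log2
from statistics import median


def flag_low_coverage_loci(counts, min_median_reads=5, log2_cutoff=-1.0):
    """Flag loci whose normalized depth in one sample is far below the panel median.

    counts: {sample_name: {locus_id: read_count}} (hypothetical input layout).
    Returns {sample_name: [locus_id, ...]} of candidate deletions.
    """
    samples = list(counts)
    # Counts per million, so that differences in library size cancel out.
    cpm = {}
    for sample in samples:
        total = sum(counts[sample].values())
        if total == 0:
            raise ValueError(f"sample {sample} has no reads")
        cpm[sample] = {locus: 1e6 * n / total for locus, n in counts[sample].items()}

    loci = sorted({locus for sample in samples for locus in counts[sample]})
    flagged = {sample: [] for sample in samples}
    for locus in loci:
        # Skip loci that are too shallow across the panel to say anything.
        if median(counts[s].get(locus, 0) for s in samples) < min_median_reads:
            continue
        panel_median = median(cpm[s].get(locus, 0.0) for s in samples)
        for s in samples:
            # Pseudocount of 0.5 avoids log2(0) for loci absent from a sample.
            ratio = log2((cpm[s].get(locus, 0.0) + 0.5) / (panel_median + 0.5))
            if ratio <= log2_cutoff:
                flagged[s].append(locus)
    return flagged
```

The cutoff of -1.0 (half the panel median) is only a placeholder. In a polyploid, losing one of several copies shifts the ratio much less than that, and at low coverage the counts are noisy, so any flagged locus would need checking by eye.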

I would really appreciate any help!

Tags: blast, alignment, next-gen

Interesting problem, but IMHO some vital data is missing:

  1. how large is the genome and what is the level of polyploidy?
  2. do you sequence with enough coverage to discover any of the changes?
  3. what mutations & mutation rates do you expect from the radiation you have used? Similar to the ones in rice: http://www.ncbi.nlm.nih.gov/pubmed/20154423
  4. how similar at the nucleotide level is the related species' genome? Obviously, for that you need some known genes / sequenced BACs from Arundo, or any not highly repetitive contigs (if you can assemble any) from your NGS data.
  5. is there any repeat library for Arundo or other species?

Without this it will be quite hard to guess what to expect.

Edit: spell

written 3.4 years ago by Darked89 (modified 3.3 years ago)

fvalli84 (Italy) wrote, 3.3 years ago:

The genome size is 1C = 2.744 pg, and it should be a pseudo-triploid (the ploidy level is not defined yet; it could also be a hexaploid).

The coverage is around 5x.
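For scale, a small arithmetic sketch converting the C-value above to base pairs, using the commonly cited conversion of 1 pg of DNA to about 978 Mbp. The `fraction_sampled` helper and the catalog numbers passed to it (100,000 loci, 90 bp tags) are hypothetical and only show how little of the genome a GBS catalog may represent.

```python
PG_TO_MBP = 978.0  # widely used conversion: 1 pg of DNA is about 978 Mbp


def c_value_to_mbp(c_value_pg):
    """Convert a C-value in picograms to megabase pairs."""
    return c_value_pg * PG_TO_MBP


def fraction_sampled(n_loci, tag_length_bp, genome_mbp):
    """Rough share of the 1C genome covered by GBS tags (ignores overlap)."""
    return n_loci * tag_length_bp / (genome_mbp * 1e6)


genome_mbp = c_value_to_mbp(2.744)  # 1C value quoted in the post
print(f"1C genome size: about {genome_mbp:.0f} Mbp")
# Hypothetical catalog: 100,000 loci with 90 bp tags.
print(f"fraction sampled: {fraction_sampled(100_000, 90, genome_mbp):.4%}")
```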

I don't know the expected mutation rates for gamma rays in Arundo; in the literature they vary considerably depending on the characteristics of the mutagenized species.

We tried to align the reads against the Setaria italica genome sequence, but only 1.8% of the reads aligned. Since the transcriptome sequence of Arundo donax is available, I'm thinking of using it as the reference.

Unfortunately, no other genomic information is available for A. donax or other species of the same genus.

 

Darked89 (Barcelona, Spain) wrote, 3.3 years ago:

Re mapping genomic DNA to the transcriptome: I would try LAST (http://last.cbrc.jp/), because for any read spanning an intron-exon border, more mainstream mappers will, I believe, reject the mapping because of the mismatch (intronic sequence in the read vs. the next exon in your transcript). LAST, given a reasonably long exon-exon match, should accept it and truncate your read. I have not done this myself in this exact scenario, but I have mapped RNA-Seq reads with a trans-splicing leader to a genome with LAST. Close enough, I hope.

Re mapping to a close genome: in this scenario the mapper choice is crucial. You need something able to accept and report mappings with higher mismatch rates, but without going overboard and placing almost every read anywhere. Again, check out LAST and GEM: http://algorithms.cnag.cat/wiki/The_GEM_library

Also, because taxonomies are still not based on sequence similarity, I would get all available (just 5) genomes from the same PACMAD clade:

http://www.ncbi.nlm.nih.gov/genome/?term=txid147370[Organism:exp]

Only the maize genome is of comparable size to Arundo, I think. Pick the one FASTQ with the best quality values from your data set and map it to all 5 genomes with at least the 2 mappers listed above. Assuming you can get soft-masked sequences for these 5 genomes, repeat with those. Hopefully you will map more than 2% of your reads, but obviously I cannot guarantee it.
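To compare the genome/mapper combinations, one option is to compute the fraction of mapped reads from each run. The sketch below assumes each mapper's output has been converted to SAM; the function name is made up for illustration. It relies only on the standard SAM flag bits (0x4 unmapped, 0x100 secondary, 0x800 supplementary), so each read is counted once.

```python
def mapped_fraction(sam_lines):
    """Return the fraction of primary SAM records that are mapped.

    sam_lines: any iterable of SAM text lines, e.g. an open file handle.
    """
    total = 0
    mapped = 0
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a valid alignment record
        flag = int(fields[1])
        if flag & 0x900:
            continue  # skip secondary (0x100) and supplementary (0x800) records
        total += 1
        if not flag & 0x4:  # 0x4 set means the read is unmapped
            mapped += 1
    return mapped / total if total else 0.0
```

Usage would be something like `with open("sample_vs_maize.sam") as fh: print(mapped_fraction(fh))`, where the file name is hypothetical. With paired-end data each mate is counted as its own record.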

Very long shot (a very drafty genome assembly): if you got 5x for each of your individual mutants, and mutations are rare, you could pool all this data together, preferably after getting some $$$ for PacBio sequencing of the unmutated strain, and see what comes out. Even if you get just shattered mitochondrial and plastid sequences plus a big swarm of pathetically sized contigs, you can map your individual samples back to this and maybe get some idea about differences in coverage. Then cluster your mutants based on this (like: say 0.5M contigs, RPKMs per sample) and check if there are any patterns (assuming deletions).
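A rough sketch of the RPKM-per-contig step, assuming per-sample read counts per contig and contig lengths are already in hand (the dict layouts and function names are hypothetical). It computes RPKM as reads × 10^9 / (contig length in bp × total mapped reads) and then a simple pairwise distance (1 minus the Pearson correlation of log2 RPKM profiles) that could be fed to any hierarchical clustering routine.

```python
from math import log2, sqrt


def rpkm(read_counts, contig_lengths):
    """Reads per kilobase of contig per million mapped reads."""
    total = sum(read_counts.values())
    if total == 0:
        raise ValueError("sample has no mapped reads")
    return {c: 1e9 * n / (contig_lengths[c] * total) for c, n in read_counts.items()}


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def sample_distances(counts_by_sample, contig_lengths):
    """Pairwise distance (1 - Pearson r) between log2(RPKM + 1) profiles."""
    contigs = sorted(contig_lengths)
    profiles = {}
    for sample, counts in counts_by_sample.items():
        # Contigs with no reads in a sample count as zero coverage.
        full = {c: counts.get(c, 0) for c in contigs}
        values = rpkm(full, contig_lengths)
        profiles[sample] = [log2(values[c] + 1.0) for c in contigs]
    names = sorted(profiles)
    return {
        (a, b): 1.0 - pearson(profiles[a], profiles[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }
```

Mutants sharing a deletion should show lower distances to each other than to the rest, but with short contigs and low coverage the profiles will be noisy, so filtering out very short or very shallow contigs first is probably wise.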

Hope it helps. 


Thanks for the answer. From the sequencing I got several files, including a .vcf file with information about SNPs and indels and the quality of the reads, and files in Structure and Genepop formats. Do you have any suggestions on how I can get useful information from them for the type of analysis I have to do? Thank you again for any help,

Fabio

written 3.3 years ago by fvalli84

If these are based on the 1.8% of reads mapped to the Setaria genome, I would try to get more reads mapped first. At this mapping rate you will be looking at differences between your mutants in some special regions only.

If you really must, check where these SNPs/indels are located by intersecting the VCF with the Setaria genome annotation (genes, repeats). Load the VCFs + BAMs in IGV and check what at least some of these mutation calls are, because one can get mutation calls even when sequencing a haploid cell line, assembling the genome, and then mapping the reads back to the assembly (see: doi:10.1101/gr.180893.114).
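The intersection step could be sketched like this in plain Python, assuming the annotation has been exported as BED (0-based, half-open intervals) with a feature name in the fourth column. The function names are made up for illustration; for large files a dedicated tool such as bedtools would be the more practical choice.

```python
from collections import defaultdict


def load_intervals(bed_lines):
    """Parse BED lines into {chrom: [(start, end, name), ...]}."""
    by_chrom = defaultdict(list)
    for line in bed_lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end, *rest = line.rstrip("\n").split("\t")
        name = rest[0] if rest else "."
        by_chrom[chrom].append((int(start), int(end), name))
    return by_chrom


def annotate_variants(vcf_lines, intervals):
    """Return [(chrom, pos, [feature names overlapping pos]), ...]."""
    annotated = []
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        chrom, pos = line.split("\t")[:2]
        pos0 = int(pos) - 1  # VCF positions are 1-based, BED starts are 0-based
        hits = [
            name
            for start, end, name in intervals.get(chrom, [])
            if start <= pos0 < end
        ]
        annotated.append((chrom, int(pos), hits))
    return annotated
```

Variants with an empty hit list fall outside every annotated feature. Counting how many calls land in repeats versus genes gives a first impression of how trustworthy the call set is.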

 

written 3.3 years ago by Darked89

All the files I got come from a de novo assembly made with STACKS. We tried to use Setaria italica as the reference, but since only 1.8% of the reads aligned against it, we decided to proceed with the de novo approach.

Basically, the software creates a catalog of loci from all the reads and then matches the samples back to the catalog to define the alleles at each locus in each individual.

I will try to map all the sequences I have against the Arundo transcriptome, choosing genes that are known to be single-copy in the genome.

written 3.3 years ago by fvalli84