Dealing with RADseq data: some tips from our instructors
Posted on 16 October, 2017 by Carlo Pecoraro
The rapid advent of next-generation sequencing (NGS)-based genotyping methods has significantly improved our ability to analyse thousands of molecular markers across the entire genome, at an unprecedented cost and speed.
Restriction-site-associated DNA sequencing (RAD-seq) and related approaches are the most popular among those methods (the original RADseq papers, published in 2007 and 2008, have almost 2000 citations in Google Scholar). These methods combine restriction enzyme digestion of the genome with high-throughput sequencing, making it possible to discover and genotype thousands of single nucleotide polymorphisms (SNPs) in hundreds of individuals rapidly and at low cost, regardless of genome size or prior genomic information. However, analysing large-scale data from tens of thousands of loci requires bioinformatics knowledge, and this remains a daunting task for many researchers.
We will run a course on “RAD-seq data analysis” (https://www.physalia-courses.org/courses-workshops/course16/) from the 4th to the 8th of December with our two instructors, Dr. Naiara Rodriguez-Ezpeleta (Azti tecnalia, Spain) and Dr. Josie Paris (University of Sussex, UK).
Here we take the opportunity to talk about these methods with our instructors.
What are the main considerations we should take into account when planning a RADseq study?
NRE: i) Do you need it for your research question? For example, in some cases there might be microsatellite markers already available that are perfectly adequate to resolve your question; ii) will you be able to get enough good-quality DNA? For example, if you are planning to do RAD-seq on a single unicellular organism, you may not be able to get reliable results; iii) do you have a good candidate restriction enzyme? If you are planning a study on a species for which no close relatives have been RAD-sequenced before, it is a good idea to run a pilot on a few individuals to determine the best restriction enzyme.
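Before a wet-lab pilot, a rough idea of how often a candidate enzyme cuts can be obtained on the computer. The following is only a minimal sketch under stated assumptions: the function names are hypothetical, the random-genome estimate assumes independent bases with a fixed GC content, and real genomes will deviate from it. It does not replace a pilot. The recognition sites used are those of SbfI (CCTGCAGG) and EcoRI (GAATTC).

```python
import re


def count_cut_sites(sequence: str, site: str) -> int:
    """Count occurrences of a recognition site in a sequence.

    A lookahead is used so that overlapping occurrences are counted too.
    """
    pattern = f"(?={re.escape(site.upper())})"
    return len(re.findall(pattern, sequence.upper()))


def expected_cut_sites(genome_size: int, site: str, gc_content: float = 0.5) -> float:
    """Expected number of sites in a random genome (assumption: independent
    bases, G and C equally frequent, A and T equally frequent)."""
    base_prob = {
        "G": gc_content / 2,
        "C": gc_content / 2,
        "A": (1 - gc_content) / 2,
        "T": (1 - gc_content) / 2,
    }
    site_prob = 1.0
    for base in site.upper():
        site_prob *= base_prob[base]
    return genome_size * site_prob


if __name__ == "__main__":
    # Hypothetical 1 Gb genome with 40% GC content.
    for name, site in [("SbfI", "CCTGCAGG"), ("EcoRI", "GAATTC")]:
        n = expected_cut_sites(1_000_000_000, site, gc_content=0.40)
        print(f"{name}: about {n:,.0f} expected cut sites")
```

If a reference sequence from a related species is available, `count_cut_sites` can be run on it directly instead of relying on the random-genome estimate.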
JP: Study design is extremely important. Think about how many individuals you need to sequence (per population, for example): is 10 enough, or do you need 20? How related are your populations or species? The biological hypothesis is vital here. There is also a trade-off between how many samples you can fit in a lane and how much coverage you require. Coverage is very important for SNP calling, so you should be aiming for coverage of at least 10x. The choice of restriction enzyme, the genome size and the GC content are also very relevant.
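The trade-off between samples per lane and coverage comes down to simple arithmetic. The sketch below assumes reads are spread evenly across samples and loci, which real libraries never quite achieve. The function names, the `usable_fraction` parameter and all the numbers in the example are hypothetical placeholders to be replaced with values for your own sequencer and species.

```python
def mean_coverage(reads_per_lane: float, n_samples: int, n_loci: int,
                  usable_fraction: float = 1.0) -> float:
    """Average reads per locus per sample, assuming an even spread of reads.

    usable_fraction is the share of reads kept after quality filtering
    and demultiplexing (an assumption you must estimate yourself).
    """
    return reads_per_lane * usable_fraction / (n_samples * n_loci)


def max_samples(reads_per_lane: float, n_loci: int, target_coverage: float,
                usable_fraction: float = 1.0) -> int:
    """Largest number of samples per lane that still reaches the target coverage."""
    return int(reads_per_lane * usable_fraction // (n_loci * target_coverage))


if __name__ == "__main__":
    # Hypothetical lane: 200 million reads, 50,000 RAD loci expected.
    print(mean_coverage(200e6, n_samples=100, n_loci=50_000))
    print(max_samples(200e6, n_loci=50_000, target_coverage=10))
```

Because coverage is uneven in practice, it is sensible to plan for a mean well above the minimum you need, so that poorly sequenced samples and loci still reach it.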
Currently there are many RADseq approaches available. What are the main reasons to prefer one approach over the others?
NRE: My opinion is that people give too much importance to this. I would say that if RAD-seq is good for your question, then ddRAD, GBS or others probably will be as well. I think the protocol is less important than the choice of restriction enzyme. Sometimes you might prefer one over another. For example, if you want to build minicontigs with your reverse reads or you want to detect PCR clones, you will select RAD-seq. Or, if you do not have an ultrasonicator for random shearing in your lab, you might decide to go for ddRAD. Or it may simply be that the people in your lab use one protocol that is optimized and are able to tell you how to do it; that's also a valid reason.
JP: Again, this depends on your biological hypothesis and the biological material you have available (and your budget!). Various RADseq methods now exist, and there is some bias related to each method (see Andrews et al. 2016, Nature Reviews Genetics 17, 81–92 and Arnold et al. 2013, Mol. Ecol. 22, 3179-90).
Several pipelines have been developed to identify and genotype SNPs from RAD-seq-derived sequences. Here too, what are the main guidelines, if any, for choosing one pipeline over the others?
NRE: I would reply the same as above. It's not about the pipeline itself but about how you are going to analyse the data. The most important thing is to understand what the pipeline does so that you can optimize the parameters you use.
JP: It is not really helpful to compare the different software packages, as they all work differently. Stacks is the most widely used and can handle all types of RAD data (single-digest, ddRAD, 2bRAD, GBS, etc.).
When analysing RAD-seq data, we can run into several kinds of genotyping errors. What are your suggestions for detecting and reducing those errors during the analysis?
NRE: I think it's impossible to get rid of genotyping errors, simply because when we merge reads there is always some degree of overmerging or undermerging, that is, joining non-homologous reads together or splitting homologous reads into separate loci. We can use parameter sets that minimize these artefacts, but we can never completely remove them. One needs to understand how genotyping errors will affect the results. For example, in a population structure study, a few incorrectly called SNPs probably do not affect the results; however, if you are looking for candidate SNPs in a GWAS, you may want to validate them (as with any other approach) using an alternative method.
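Overmerging and undermerging can be illustrated with a toy example. The sketch below is not the algorithm of any particular pipeline: it is a deliberately simple greedy clustering by number of mismatches, with hypothetical function names and made-up reads, meant only to show how a single mismatch threshold pushes the result one way or the other.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length reads."""
    if len(a) != len(b):
        raise ValueError("reads must have the same length")
    return sum(x != y for x, y in zip(a, b))


def cluster_reads(reads, max_mismatches):
    """Greedy toy clustering: a read joins the first cluster whose first
    read is within max_mismatches; otherwise it starts a new cluster."""
    clusters = []
    for read in reads:
        for cluster in clusters:
            if hamming(read, cluster[0]) <= max_mismatches:
                cluster.append(read)
                break
        else:
            clusters.append([read])
    return clusters


if __name__ == "__main__":
    reads = [
        "ACGTACGTAC",  # locus A, allele 1 (made-up sequence)
        "ACGTACGTAT",  # locus A, allele 2: one SNP away from allele 1
        "ACGTTCGAAC",  # a similar but different locus, two mismatches away
    ]
    for threshold in (0, 1, 2):
        n_loci = len(cluster_reads(reads, threshold))
        print(f"max_mismatches={threshold}: {n_loci} putative loci")
```

With these reads, a threshold that is too strict splits the two alleles of locus A into separate loci (undermerging), while one that is too permissive collapses the second locus into the first (overmerging). Only the intermediate value recovers the two true loci, which is why parameter optimisation on your own data matters.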
JP: Know your data. Play with it, visualise it and see if what you get out at the end makes biological sense. It is impossible to completely remove genotyping errors, but do not lose sleep over it! Just get a good idea of your data, and properly optimise your parameters!
What kind of computational background should RAD-seq newbies have to analyse these data properly?
NRE: UNIX. You really need to be comfortable with UNIX to be able to analyse your data properly.
JP: Unix for sure. It is incredibly versatile and will allow you to play with your data.
To RAD-seq or not to RAD-seq? There are two schools of thought about this method. What is your opinion?
NRE: You need to understand what you are aiming to achieve with your data. RAD-seq has numerous problems, as do many other approaches; it all depends on your scientific question. In some cases RAD-seq is ideal; in others you need to go for another approach. So the easy answer to this question would be “it depends”.
JP: This depends on your biological hypothesis and your biological data. Not everyone has access to a reference genome (or can build one in a cost-limited project) or, along the same lines, can afford to do WGS. It is important to think about the costs and benefits of all sequencing data, as well as the biases related to each.