Workflow Or Tutorial For Snp Calling?
8
65
Entering edit mode
13.5 years ago
Matthieu ▴ 480

I am looking for a good workflows, readings or tutorial for SNP calling. I read some other posts on this topic, but I would like a more detailed explanation. Population genomics and sequence data are new to me (I have a general CS and biology background). It might just be me, but these tools are not as straightforward or as documented as I'd like. Any links or explanations would be good!

So far, my situation is as follows:

  • I have Illumina sequence reads for a highly polymorphic species
  • I aligned these reads using BWA against a reference genome with default parameters, but I am not sure if I should change parameters (if so, which ones?) due to the highly polymorphic data
  • I am unsure of the next step, I will probably be using SAMtools or GATK... I tried making an mpile up but got really confused after that.
  • I should also be accessing SNP quality..what tools are used for that? I already see some sequencing errors when browsing the data.

As you can tell, I am totally new with this. It is pretty exciting so I want to learn and be able to do some of these things! Thanks in advance.

edit: I also get so confused with some of the output, more detailed documentation on that would be nice as well!

snp samtools alignment next-gen sequencing • 48k views
ADD COMMENT
1
Entering edit mode

You referenced several other posts but not this one ; you may find it helpful.

ADD REPLY
0
Entering edit mode

ah, I missed that. It is very helpful. Thank you, David!

ADD REPLY
20
Entering edit mode
9.1 years ago
moorem ▴ 250

GATK is certainly the best variant caller and it's the difference is important because with indels in particular, they can be hugely different.

I've spoken to a few people who have had issues with GATK so I've actually written a blog post with a workflow that works for GATK. The major issues seemed to be using GATK's best practices, particularly their add or replace read groups module which sometimes causes it to fall over later. By adding the read group (you can make it whatever you like) when you run bwa, then this workflow actually works.

This was written for microbes so ploidy is set to 1 for HaplotypeCaller, but this can be set to whatever you like (do check the documentation). Originally posted here(https://approachedinthelimit.wordpress.com/2015/10/09/variant-calling-with-gatk/)


  • Dependencies:

You'll need to install picardtools, GATK, bwa and optionally, vcffilter for this workflow. Picardtools and GATK are simply .jar files so that's no problem while you probably already have bwa installed, otherwise installation is well documented!

  • The workflow

This workflow begins with short read (fastq) files and a fasta reference. First a sequence dictionary is created, short reads are aligned to the reference and read group information provided, resulting sequence alignment map (sam) file sorted and converted to binary alignment map (bam) format, duplicates marked, bam file sorted, indel targets identified, indels realigned and variants called. Simple!

For simplicity an example set of commands are provided here, where the reference is reference.fasta and the short reads are R1.fastq.gz and R2.fastq.gz. You will need to enter the paths and versions of the software being used at each step and your actual file names.

Create sequence dictionary

java -jar~/bin/picard-tools-1.8.5/CreateSequenceDictionary.jar REFERENCE=reference.fasta OUTPUT=reference.dict

Align reads and assign read group

bwa mem -R "@RG\tID:FLOWCELL1.LANE1\tPL:ILLUMINA\tLB:test\tSM:PA01" reference.fasta R1.fastq.gz R2.fastq.gz > aln.sam

Sort sam file

java -jar ~/bin/picard-tools-1.8.5/SortSam.jar I=aln.sam O=sorted.bam SORT_ORDER=coordinate

Mark duplicates

java -jar ~/bin/picard-tools-version/MarkDuplicates.jar I=sorted.bam O=dedup.bam METRICS_FILE=metrics.txt

Sort bam file

java -jar ~/bin/picard-tools-version/BuildBamIndex.jar INPUT=dedup.bam

Create realignment targets

java -jar ~/bin/GATK3.3/GenomeAnalysisTK.jar -T RealignerTargetCreator -R reference.fasta -I dedup.bam -o targetintervals.list

Indel realignment

java -jar ~/bin/GATK3.3/GenomeAnalysisTK.jar -T IndelRealigner -R PA01.fasta -I dedup.bam -targetIntervals targetintervals.list -o realigned.bam

Call variants (HaplotypeCaller)

java -jar ~/bin/GATK3.3/GenomeAnalysisTK.jar -T HaplotypeCaller -R reference.fasta -I realigned.bam -ploidy 1 -stand_call_conf 30 -stand_emit_conf 10 -o raw.vcf

The resulting vcf file will contain your variant calls!

Then you can optionally filter the variants:

Filter variants

~/bin/vcflib/bin/vcffilter -f 'DP > 9' -f 'QUAL > 10' raw.vcf > filtered.vcf

Or first split the raw.vcf file into SNPs and indels:

Extract SNPs

java -jar ~/bin/GATK3.3/GenomeAnalysisTK.jar -T SelectVariants -R reference.fasta -V raw.vcf -selectType SNP -o snps.vcf

Extract Indels

java -jar ~/bin/GATK/GenomeAnalysisTK.jar -T SelectVariants -R reference.fasta -V raw.vcf -selectType INDEL -o indels.vcf

I also have a neat perl wrapper to automate this workflow over many short read files and would be happy to make this available if people are interested! Please do comment with any questions or issues and I'll do my best to resolve them!

ADD COMMENT
0
Entering edit mode

wow thanks for the workflow! I guess this will be my starting point in this domain. have you already automated this in a perl script as you have described ? and one more question, are there certain parameters that need to be considered when working with reads from human genome?

ADD REPLY
0
Entering edit mode

Just to give you an idea, this can be automated in any language be it Perl/Python/bash as per your comfort, just to be sure about the Base Recalibration step, I do not see that here . Take a look at this link . The Base Recalibration is a standard best practice measure suggested. Another thing on the filtration is that one can discard low quality variants or even for that fact make a plot of various quality scores and then discard the first quantile or even lesser. You can also use the mentioned one in the GATK website but it depends upon the quality of your data and to what extent you had a coverage that can best suit those parameter. I would always suggest filtering of variants should be dependent to each one to its own data based on the distributions you have.

ADD REPLY
0
Entering edit mode

Hi!

Can I also use this on RNA-seq data or does this only work on genome studies? is it okay to use GATK indel realignment on a RNA-seq data? Thanks!

ADD REPLY
1
Entering edit mode

Depends on what context you want to use this work flow on RNA-Seq, if you want to find variants from RNA-Seq data then GATK workflow can be used but it is better to use the recommended one which is here. If you can make your own workflow it is fine. Else you can refer to the automated script here I made. But you can always do the filtration based on your distributions as well. Feel free to ask.

ADD REPLY
17
Entering edit mode
13.5 years ago
Pablo ★ 1.9k
  • BWA using defaults it's probably OK.

  • If you have a SAI file from the previous step, you need to convert it to SAM or BAM. Do something like (assuming your reference genome is hg37.fasta)

    bwa samse hg37.fasta s.sai s.fastq > s.sam

  • Then, Create BAM file (we assume you installed SamTools)

    samtools view -S -b s.sam > s.bam

  • Sort BAM file (will create s_sort.bam)

    samtools sort s.bam s_sort

  • Call variants: I.e. Create VCF file (BcfTools is part of samtools distribution)

    samtools mpileup -uf hg37.fasta s_sort.bam | bcftools view -vcg - > s.vcf

There is a lot more (like local realignment, etc.). But if this is your first time doing it, you should start with the basics.

ADD COMMENT
13
Entering edit mode
13.5 years ago

We are working on a SNP pipeline now. You might find my work in progress pipeline useful.

The pipeline currently starts with an alignment from BWA. It uses GATK for SNP calling.

Briefly, the flow involves:

  • Realigning the BAM file using GATK's RealignerTargetCreator and IndelRealigner
  • Optional recalibration step (we currently don't use)
  • SNP calling using GATK's UnifiedGenotyper
  • Indel calling using the UnifiedGenotyper
  • basic filtering of the resulting VCF files
    • Right now we use some basic metrics to attempt to filer out low quality SNPs. I'm sure this step could be improved
  • Annotate called and filtered SNPs
    • Currently we use a custom script to add gene / transcript / exon / intron and other information from Ensembl.
    • Recently (yesterday) I found snpEff from another BioStar discussion. We will be using it in the future for this kind of annotation.

We are still fleshing out the details on filtering and such, but it might be a good starting point to for executing GATK in a working order

ADD COMMENT
0
Entering edit mode

I've been asked to do something similar and would use the same programs as you, so I will have a look and see if I can get yours working. snpEff is very good for annotation as I expect you have found. Although I would look to expand and include a structural variation software such as pindel or Delly (embl uses this) because GATK does not detect large inserts well. Thanks very much for posting this, hopefully looking through your code will accelerate the process of putting my pipeline or modifying yours to what we want.

ADD REPLY
6
Entering edit mode
13.1 years ago

I strongly recommend this recent article from authors of GATK.

It covers various aspect associated with SNP calling in detail. At the same time do refer the software manual/wiki for up-to-date options incorporated in the toolkit.

alt text

ADD COMMENT
3
Entering edit mode
11.3 years ago
rob234king ▴ 610

I have put together a tutorial website with four core tutorials on it, RNA-Seq, ChIP-Seq, Genome assembly, and SNP calling that may be of use to you.

This website was created to share bioinformatics tutorials.

ADD COMMENT
2
Entering edit mode

Login - Register is a major put-off, maybe make it optional?

ADD REPLY
0
Entering edit mode

Thanks for the input, I know registering is a bit of a pain but the website is based upon groups that contain tutorials and I use the login to only show those tutorials from groups you are a member of to make it manageable in the future if it got bigger. You can use false details if your concerned, email etc it's just used for this purpose. If you register it will automatically log you in and if go to "groups" on side bar and join the "Cranfield University" group, then click on "group tutorials" in left side bar you can access the tutorials I have put up. I think they need a little re-formatting but have some useful demonstrations of tools like RSATs etc. Thanks again for the reply appreciate it.

ADD REPLY
0
Entering edit mode

You got me curious, but the login system is not working…

ADD REPLY
0
Entering edit mode

Thanks I'll check it out it seems the server needs restarting, there is a memory leak on one or more of the deployments on this server. I ask for it to be reset and investigate.

ADD REPLY
0
Entering edit mode

A number of student are submitting to this server over the next month which is resulting in permgen errors, some kind of memory leak somewhere or due to multiple submissions. Issue corrected for the moment.

CUBELP2 is a static website and CUBELP a sharing platform, typing their names and "bioinformatics" which should locate them in google.

ADD REPLY
1
Entering edit mode

Please fix the link. It doesn't work anymore!

ADD REPLY
0
Entering edit mode
ADD REPLY
1
Entering edit mode
13.1 years ago
Anil ▴ 10

Use GALAXY software. It is free and user friendly and u will get most of the step in oe programme only.

ADD COMMENT
0
Entering edit mode
11.3 years ago
Ming Tommy Tang ★ 4.4k

have a look at this course material from UT-Austin https://wikis.utexas.edu/display/bioiteam/SSC+Intro+to+NGS+Bioinformatics+Course

ADD COMMENT
0
Entering edit mode
8.6 years ago

You might want to filter and explore your VCF file for different kinds of analyses (diversity, haplotypes, population structure). You can then make use of the online SNP pipeline: http://sniplay.southgreen.fr/cgi-bin/analysis_v3.cgi

ADD COMMENT
0
Entering edit mode

I tried to use a vcf 4.2 file but does not give any kind of output, the images are all blank. The file size was also around 21 MB so that kinda find since I ran the example of the polymorphism vcf file which seemed to be ok. Does it work with new vcf4.2 format? Would like to try on my variant file which I generated for one sample from RNA-Seq and should be having a lot of false positives as there was no matched normal , still would like to see the various statistics, so was trying the tool but it does not produce anything.

ADD REPLY
0
Entering edit mode

It should work with vcf4.2 format. Maybe there was a special caracter in the sample names that blocked the VCFtools program...

ADD REPLY

Login before adding your answer.

Traffic: 1717 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6