Question: GATK on RNASeq/ChIP data - distinction between no SNP and no data.
gravatar for John
4.9 years ago by
John12k wrote:

Hello all :)

When using sequencing technologies which do not cover the entire genome when calling SNPs (such as calling SNPs from RNA-Seq data or ChIP-Seq data), I am unsure how to do a comparison on the VCF files alone.

For example, I ran the GATK pipeline on RNA-Seq data ( to give me VCF files.

All experiments were done on genetically identical mice - however one of my samples I had 10x as many reads as the others. At the end of the GATK pipeline, this sample also has about 3x as many variants called.

So chances are that the SNPs in the deep sample are also present in the shallow samples, it's just that there wasn't as much coverage and/or confidence in these shallow samples for GATK to call them.

Problem now is that when I use VCFTools's vcf-compare, the resultant venn diagram shows considerable number of SNPs unique to the deep sample, and I think this is unfair. Really I need 3 bits of information - SNPs ALT in a sample, SNPs REF in a sample, and 'UNKNOWN' in a sample or samples because of lack of data.
In other words, having a row in the VCF tells me there is a variant, but no row can either be no variant (like REF) or no data.

In an ideal world, much like STAR's 2-pass aligner used in the GATK protocol, there would be a 2-pass SNP calling, where SNPs are found de novo, then using this information from all samples, the reads are reanalysed to see if they match, dont match, or it is unknown if they match.

Is this possible?

Thanks so much for your time! :)

- John

rna-seq snp chip-seq gatk • 2.3k views
ADD COMMENTlink modified 4.9 years ago by Devon Ryan96k • written 4.9 years ago by John12k

Hi John,

I am also interested in identifying SNP in RNAseq data. I am also using GATK pipeline. do you think Indel Realignment and Base Recalibration needed while calling SNP from RNAseq data?

ADD REPLYlink modified 3.3 years ago • written 3.3 years ago by Bioinfonext240

It's been over a year and a half since i did this, so im afraid my advice will be a little outdated. Bottomline, the GATK RNASeq best practises leaved much to be desired 1.5yrs ago. Perhaps it's better now, but truthfully, you probably know more than i do on this matter if you've read the GATK docs in the last year ;)

ADD REPLYlink written 3.3 years ago by John12k
gravatar for Devon Ryan
4.9 years ago by
Devon Ryan96k
Freiburg, Germany
Devon Ryan96k wrote:

This is the point of joint variant calling, of which GATK is capable. You may or may not be doing this in a way in which that's effective.

Assuming that you're already calling the samples jointly, you might additionally filter out sites lacking a minimum coverage in at least one sample. I suspect there's vcftools option for that (I have to admit to not being overly familiar with it).

BTW, feel free to just pop by the office with questions.


ADD COMMENTlink written 4.9 years ago by Devon Ryan96k

+1 for the last line :-)

ADD REPLYlink written 4.9 years ago by Ashutosh Pandey12k

Thanks Devon I absolutely will!
I did the SNP calling on each sample individually so i'm probably not doing it in the most effective way - although I did try combining the fasta files for all reads and call SNPs on that. Again that probably isnt the most graceful way of doing it, since it loses replicate information that calling samples jointly in a single command might keep. Filtering out sites lacking a minimum coverage seems to be the way to go. I'll check vcf tools and come and have a word tomorrow (at the dentist right now).

Will also update the thread when I find a pipeline that works for me :)

ADD REPLYlink written 4.9 years ago by John12k

Hi John,

I would like to do a similar - fast and simple - analysis in order to group ChIPseq samples by origin. Do you have a pipeline that works by now and can you share some information/pitfalls with us?

Thanks, Ruben

ADD REPLYlink written 4.1 years ago by rubendries10
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1763 users visited in the last hour