GATK on RNASeq/ChIP data - distinction between no SNP and no data.
1
1
Entering edit mode
6.2 years ago
John 13k

Hello all :)

When using sequencing technologies which do not cover the entire genome when calling SNPs (such as calling SNPs from RNA-Seq data or ChIP-Seq data), I am unsure how to do a comparison on the VCF files alone.

For example, I ran the GATK pipeline on RNA-Seq data (https://www.broadinstitute.org/gatk/guide/article?id=3891) to give me VCF files.

All experiments were done on genetically identical mice - however one of my samples I had 10x as many reads as the others. At the end of the GATK pipeline, this sample also has about 3x as many variants called.

So chances are that the SNPs in the deep sample are also present in the shallow samples, it's just that there wasn't as much coverage and/or confidence in these shallow samples for GATK to call them.

Problem now is that when I use VCFTools's vcf-compare, the resultant venn diagram shows considerable number of SNPs unique to the deep sample, and I think this is unfair. Really I need 3 bits of information - SNPs ALT in a sample, SNPs REF in a sample, and 'UNKNOWN' in a sample or samples because of lack of data.
In other words, having a row in the VCF tells me there is a variant, but no row can either be no variant (like REF) or no data.


In an ideal world, much like STAR's 2-pass aligner used in the GATK protocol, there would be a 2-pass SNP calling, where SNPs are found de novo, then using this information from all samples, the reads are reanalysed to see if they match, dont match, or it is unknown if they match.

Is this possible?

Thanks so much for your time! :)

- John

GATK SNP RNA-Seq ChIP-Seq • 2.7k views
ADD COMMENT
1
Entering edit mode

Hi John,

I am also interested in identifying SNP in RNAseq data. I am also using GATK pipeline. do you think Indel Realignment and Base Recalibration needed while calling SNP from RNAseq data?

ADD REPLY
0
Entering edit mode

It's been over a year and a half since i did this, so im afraid my advice will be a little outdated. Bottomline, the GATK RNASeq best practises leaved much to be desired 1.5yrs ago. Perhaps it's better now, but truthfully, you probably know more than i do on this matter if you've read the GATK docs in the last year ;)

ADD REPLY
2
Entering edit mode
6.2 years ago

This is the point of joint variant calling, of which GATK is capable. You may or may not be doing this in a way in which that's effective.

Assuming that you're already calling the samples jointly, you might additionally filter out sites lacking a minimum coverage in at least one sample. I suspect there's vcftools option for that (I have to admit to not being overly familiar with it).

BTW, feel free to just pop by the office with questions.

-Devon

ADD COMMENT
1
Entering edit mode

+1 for the last line :-)

ADD REPLY
0
Entering edit mode

Thanks Devon I absolutely will!
I did the SNP calling on each sample individually so i'm probably not doing it in the most effective way - although I did try combining the fasta files for all reads and call SNPs on that. Again that probably isnt the most graceful way of doing it, since it loses replicate information that calling samples jointly in a single command might keep. Filtering out sites lacking a minimum coverage seems to be the way to go. I'll check vcf tools and come and have a word tomorrow (at the dentist right now).

Will also update the thread when I find a pipeline that works for me :)

ADD REPLY
0
Entering edit mode

Hi John,

I would like to do a similar - fast and simple - analysis in order to group ChIPseq samples by origin. Do you have a pipeline that works by now and can you share some information/pitfalls with us?

Thanks, Ruben

ADD REPLY

Login before adding your answer.

Traffic: 2350 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6