Whole exomes: comparative analysis
7.2 years ago
bioinfo8 ▴ 230

Hi,

I have searched extensively for guidance on analysing whole exomes. The literature points to Samtools, Bedtools and GATK, but I cannot find a clear, detailed tutorial on how to proceed from exome BAM files.

I want to analyse paired-end BAM files of whole exomes that have already been aligned to the reference with BWA and had duplicates marked (the @PG header line shows ID:bammarkduplicates2). There are two groups of 3 individuals each, so 6 BAM files in total.
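One way to confirm the duplicate-marking step is to read the @PG lines of the BAM header (for example the text printed by `samtools view -H sample.bam`). Below is a minimal Python sketch, assuming the header text is already available as a string. The function names and the example header are hypothetical, and the substring test is only a heuristic, not a guarantee that duplicates were flagged.

```python
def pg_program_ids(header_text):
    """Collect the ID field of every @PG line in SAM/BAM header text."""
    ids = []
    for line in header_text.splitlines():
        if not line.startswith("@PG"):
            continue
        # Header fields are tab-separated TAG:VALUE pairs after the record type.
        for field in line.split("\t")[1:]:
            if field.startswith("ID:"):
                ids.append(field[3:])
    return ids


def looks_duplicate_marked(header_text):
    """Heuristic (assumption): a program ID containing 'markdup' marked duplicates."""
    return any("markdup" in pid.lower() for pid in pg_program_ids(header_text))


# Hypothetical header, shaped like the one described in the question.
example_header = "\n".join([
    "@HD\tVN:1.5\tSO:coordinate",
    "@SQ\tSN:chr1\tLN:1000",
    "@PG\tID:bwa\tPN:bwa",
    "@PG\tID:bammarkduplicates2\tPN:bammarkduplicates2\tPP:bwa",
])

program_ids = pg_program_ids(example_header)
marked = looks_duplicate_marked(example_header)
print(program_ids, marked)
```

Running the same check on all 6 headers would show whether every file went through the same processing chain before any joint analysis.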

I have done some initial analysis using Qualimap, and the clustering in its PCA suggests there is variation (polymorphism) among the individuals.

However, I am interested to find out further:

1) the total number of genes in each and then average number of genes from all 6 files?

2) conserved / non-conserved regions in exomes with respect to reference

3) location for genes of interest on exomes with respect to reference (I have gene list)

4) any other way to obtain PCA and polymorphism information

I would appreciate any guidance for the above.

P.S.: I am an R admirer, so R solutions would work best!

Thanks!

exome R whole exome BAM • 2.1k views

It's not obvious what the main objective of this analysis is. Surely the number of variants isn't the final outcome?

> 3) location for genes of interest on exomes with respect to reference (I have gene list)

For this you wouldn't need exome sequencing; the reference genome and a genome browser will do.


Thanks @WouterDeCoster!

The reference and the sequenced exomes belong to the same genus but not the same species, so my interest is a comparative analysis between them [which would cover 1) and 2)].

I have some genes (~150) linked to a specific feature in the reference and want to compare them with the corresponding genes in the exomes [point 3)].

All the exomes (BAM files) are from the same species but from different individuals, so I don't know whether I should analyse them separately or together.

I hope it is clearer now. :)


So your main interest is to compare two species, of which one has a reference genome? So what is the biological question or hypothesis?


Yes. I have whole exomes (paired-end BAM files aligned to the reference) from many individuals of one species (same genus as the reference), and I would like to find out:

1) How similar to, or different from, the reference are these exomes?

2) How many genes do they have in total, and on average?

3) I have a gene list (~150 genes) from the reference responsible for a specific feature, e.g. localization. I want to compare these genes with the corresponding genes in the exomes.

4) As the exomes are from various individuals, the variation among them would be worth studying.


> 1) How similar to, or different from, the reference are these exomes?

So you would perform variant calling on those?

> 2) How many genes do they have in total, and on average?

You (or someone else) designed an assay for exome sequencing. You can therefore only find the genes that were targeted, so you will not learn anything new about the total number of genes.

> 3) I have a gene list (~150 genes) from the reference responsible for a specific feature, e.g. localization. I want to compare these genes with the corresponding genes in the exomes.

So, variant calling?

> 4) As the exomes are from various individuals, the variation among them would be worth studying.

And variant calling.
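To make the reply to point 2) concrete: the set of genes an exome experiment can report is fixed by the capture design, so it can be read off the target file rather than from the BAMs. Below is a minimal Python sketch, assuming the capture targets are available as BED text whose 4th column holds a gene name. That column layout, the gene names and the function name are all assumptions for illustration; real target files vary.

```python
def targeted_genes(bed_text):
    """Return the set of gene names found in column 4 of BED-formatted text."""
    genes = set()
    for line in bed_text.splitlines():
        # Skip blank lines and the optional BED header/comment lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        if len(fields) >= 4:
            genes.add(fields[3])
    return genes


# Hypothetical capture design: three target intervals covering two genes.
example_bed = "\n".join([
    "track name=targets",
    "chr1\t100\t250\tGENE_A",
    "chr1\t400\t520\tGENE_A",
    "chr2\t900\t1100\tGENE_B",
])

# Hypothetical genes of interest (the real list has ~150 entries).
genes_of_interest = {"GENE_A", "GENE_C"}

design = targeted_genes(example_bed)
covered = genes_of_interest & design   # genes the assay can say something about
missing = genes_of_interest - design   # genes that were never targeted
print(len(design), sorted(covered), sorted(missing))
```

Genes that end up in `missing` cannot be compared across individuals with these data, whatever the downstream variant caller reports.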


2) Someone else did the experiment, but these are whole exomes.

I don't know whether I should analyse all the exomes separately or together. Any suggestions, please?

Thanks!


By the sounds of things, you want to do variant calling, so again, have you looked at the GATK Best Practices?


> whole exomes.

Exome sequencing or genome sequencing?
Exome sequencing requires a priori knowledge of what is coding (so that it can be targeted with probes), so you only sequence what you target.


Exome sequencing, using all the protein-coding information from the reference.

I have to analyse whole-exome BAM files that are already aligned to the reference with duplicates marked. I did some initial analysis using Qualimap, but I am not satisfied with it.


@WouterDeCoster, it would be nice if you could share some thoughts!

Thanks

7.2 years ago

Have you looked at the GATK Best Practices? With exome data you'd typically call SNVs and indels, work with the resulting VCF(s), and then interrogate the calls in the context of your experiment (singletons, families, causal variant search, etc.). You can produce a PCA plot from the VCF using the SNPRelate package in R.
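To illustrate what a PCA on a multi-sample VCF does, here is a small plain-Python sketch. It is not SNPRelate and not the GATK workflow. It assumes a tiny biallelic VCF held in a string with GT as the first FORMAT field, converts genotypes to alt-allele counts, and finds the first principal component of the samples by power iteration. The sample names, sites and function names are hypothetical.

```python
import math


def parse_dosages(vcf_text):
    """Return (sample_names, rows); each row holds alt-allele counts per sample."""
    samples, rows = [], []
    for line in vcf_text.splitlines():
        if line.startswith("##") or not line.strip():
            continue
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]
            continue
        if "," in fields[4]:
            continue  # keep biallelic sites only
        dosages = []
        for sample_field in fields[9:]:
            alleles = sample_field.split(":")[0].replace("|", "/").split("/")
            if "." in alleles:
                dosages = None  # drop sites with any missing genotype
                break
            dosages.append(sum(1 for a in alleles if a != "0"))
        if dosages is not None:
            rows.append(dosages)
    return samples, rows


def first_pc(rows, n_iter=200):
    """Unit-length first principal component of the samples (power iteration)."""
    n = len(rows[0])
    # Centre each variant across samples, then build the sample-by-sample matrix.
    centred = [[g - sum(r) / n for g in r] for r in rows]
    gram = [[sum(r[i] * r[j] for r in centred) for j in range(n)] for i in range(n)]
    vec = [float(i + 1) for i in range(n)]  # fixed, non-constant starting vector
    for _ in range(n_iter):
        nxt = [sum(gram[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0:
            return [0.0] * n  # no variation between samples
        vec = [x / norm for x in nxt]
    return vec


def site(pos, genotypes):
    """Build one hypothetical VCF record with only a GT field per sample."""
    return "\t".join(["chr1", str(pos), ".", "A", "G", ".", "PASS", ".", "GT"] + genotypes)


# Hypothetical data: two groups of three individuals, as in the question.
example_vcf = "\n".join([
    "##fileformat=VCFv4.2",
    "\t".join(["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO",
               "FORMAT", "A1", "A2", "A3", "B1", "B2", "B3"]),
    site(100, ["0/0", "0/0", "0/0", "1/1", "1/1", "1/1"]),
    site(200, ["0/0", "0/0", "0/1", "1/1", "1/1", "1/1"]),
    site(300, ["0/0", "0/1", "0/0", "1/1", "0/1", "1/1"]),
    site(400, ["0/1", "0/0", "0/0", "0/0", "0/1", "0/0"]),
])

samples, rows = parse_dosages(example_vcf)
pc1 = first_pc(rows)
for name, score in zip(samples, pc1):
    print(name, round(score, 3))
```

In this toy data the A and B samples fall on opposite sides of the first component, which is the kind of group separation a PCA plot makes visible. For real exomes a dedicated package such as SNPRelate also handles filtering, missing data and scaling.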


Thanks @andrew.j.skelton73 for answering query 4)!
