Question: Feature extraction from DNA sequence
0
15 months ago by
Indian Institute of Technology, Madras, India
Uday Rangaswamy120 wrote:

I am developing a classification model to determine whether a given variant affects gene expression in a particular disease, and I have obtained 1 kb of sequence upstream and downstream of each variant position. What features could I extract from these sequences for this task? Also, is it more relevant to compute biological features than statistical ones for this purpose? Any help would be much appreciated.

ADD COMMENTlink modified 11 months ago by pltbiotech_tkarthi180 • written 15 months ago by Uday Rangaswamy120

You can try creating a VCF file from your data set and predicting the variant effects to see whether the mutations are deleterious or tolerated. See A: Allele frequency visualization

ADD REPLYlink written 11 months ago by pltbiotech_tkarthi180

I'm sorry, but I don't think you read my description correctly. I'm trying to be disease-specific here. However, I did use Ensembl VEP to obtain the positions of the rs IDs of interest in the hg38 assembly. Thanks for your thoughts.

ADD REPLYlink written 11 months ago by Uday Rangaswamy120
1
15 months ago by
Arup Ghosh2.1k
India
Arup Ghosh2.1k wrote:

Do you want to check the association between variants and the expression profile of the adjacent gene, or are you aiming for expression prediction based on genotype?

https://www.um.edu.mt/__data/assets/pdf_file/0005/289427/eQTL_intro.pdf

ADD COMMENTlink written 15 months ago by Arup Ghosh2.1k

Expression prediction based on genotype. Kindly share your thoughts. Thanks.

ADD REPLYlink written 15 months ago by Uday Rangaswamy120
3

The list of things at which to look is endless:

  • Transcription start sites (TSS)
  • Transcription factor binding sites (TFBS)
  • Promoter regions (e.g. via H3K4me3)
  • Enhancer regions (via H3K27ac)
  • Other histone marks (many types to look at, e.g., H3K9ac, H3K27me3, etc.)
  • Conservation
  • DNase hypersensitivity
ADD REPLYlink written 15 months ago by Kevin Blighe54k

Thank you. What platform would you suggest I use to compute such features?

ADD REPLYlink written 15 months ago by Uday Rangaswamy120
1

Mostly shell scripting (Bash), so Linux or macOS is preferable. Take a look here, where some of this data is available: http://genome.ucsc.edu/encode/downloads.html

One can also annotate some of these with ANNOVAR: http://annovar.openbioinformatics.org/en/latest/

ADD REPLYlink written 15 months ago by Kevin Blighe54k

Alright, but my context is as follows: given an rs ID, the model should analyse certain features for both the wild-type and the mutant allele and accordingly predict whether or not the variant is going to affect gene expression in a particular disease. I have the case and control samples from the GWAS and also 1 kb of sequence upstream and downstream of the rs IDs obtained from the GWAS. Would you still suggest that I compute the same features? Your advice is much appreciated. Thanks.
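
To make the wild-type versus mutant comparison concrete, here is a minimal sketch of what I have in mind. It assumes a single-nucleotide variant and that the flanking sequences are already in hand; the function names and the toy sequences are hypothetical, and counting CpG dinucleotides is only a stand-in for whatever feature is actually used.

```python
def make_allele_windows(upstream, ref, alt, downstream):
    """Return (reference window, alternate window) around a single-nucleotide variant."""
    up = upstream.upper()
    down = downstream.upper()
    return up + ref.upper() + down, up + alt.upper() + down


def count_motif(seq, motif):
    """Count occurrences of motif in seq, allowing overlaps."""
    motif = motif.upper()
    width = len(motif)
    return sum(1 for i in range(len(seq) - width + 1) if seq[i:i + width] == motif)


def feature_delta(feature, ref_seq, alt_seq):
    """Difference in a sequence-derived feature between the alternate and reference windows."""
    return feature(alt_seq) - feature(ref_seq)


# Toy flanks (hypothetical); in practice these would be the 1 kb flanks.
ref_seq, alt_seq = make_allele_windows("ACGTTA", "C", "T", "GGATCC")
cpg_delta = feature_delta(lambda s: count_motif(s, "CG"), ref_seq, alt_seq)
print(ref_seq, alt_seq, cpg_delta)
```

Here the C>T change destroys one CpG, so the delta is negative. Features that come from annotation tracks (histone marks, DNase) do not change with the allele, so only sequence-derived features can be compared this way.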

ADD REPLYlink written 15 months ago by Uday Rangaswamy120
1

With just the SNP genotype and gene expression, you can do an eQTL study or build your own regression models whereby genotype is predicting expression of nearby genes, with the covariates in these models being some of the features that I mentioned above. This may sound easy, but it is not, particularly the set-up of such a study.
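
As a rough illustration only, the simplest form of such a model regresses expression on allele dosage (0, 1, or 2 copies of the alternate allele). The sketch below assumes no covariates at all and uses made-up numbers; a real eQTL analysis would add covariates and proper significance testing, and the function name is hypothetical.

```python
def fit_simple_regression(dosages, expression):
    """Ordinary least squares for expression ~ intercept + slope * dosage."""
    n = len(dosages)
    if n != len(expression) or n < 2:
        raise ValueError("need two equally long inputs with at least 2 samples")
    mean_x = sum(dosages) / n
    mean_y = sum(expression) / n
    sxx = sum((x - mean_x) ** 2 for x in dosages)
    if sxx == 0:
        raise ValueError("genotype does not vary across samples")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dosages, expression))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up example: alternate-allele dosage per sample and normalised expression.
dosages = [0, 0, 1, 1, 2, 2]
expression = [5.0, 5.2, 6.1, 5.9, 7.0, 7.2]
slope, intercept = fit_simple_regression(dosages, expression)
print(f"effect per alternate allele: {slope:.3f}, baseline: {intercept:.3f}")
```

The slope is the estimated change in expression per copy of the alternate allele.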

If you have the DNA sequence and are interested in that, then take a look at the manuscript mentioned by Arup, i.e., "Deep learning sequence-based ab initio prediction of variant effects on expression and disease risk".

ADD REPLYlink written 15 months ago by Kevin Blighe54k

Alright, thanks. I do not have the gene expression, just the variants taken from the GWAS.

ADD REPLYlink modified 15 months ago • written 15 months ago by Uday Rangaswamy120
1

If you download the library annotation information for the microarray that you are using, then there may already be much metadata in that library file. Just search on the manufacturer's home-page. If it is Affymetrix, then remember that ThermoFisher purchased Affymetrix.

ADD REPLYlink written 15 months ago by Kevin Blighe54k

Which specific information would you suggest I look for from the library annotation?

ADD REPLYlink written 15 months ago by Uday Rangaswamy120
1

Please see my list, above. Also, I would discourage anybody from becoming too dependent on this website.

ADD REPLYlink written 15 months ago by Kevin Blighe54k

In addition, I would encourage everyone doing machine learning on biological data to get familiar with what data they're working on.

ADD REPLYlink written 15 months ago by WouterDeCoster43k

Thank you so much for this. I used the above-mentioned features (histone marks, TFBS, DNase) and also tried various dinucleotide features, but none of the dinucleotide features seems to contribute to the classification. How can I compute conservation, and are there any other features that could contribute in this context? Please help.
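
For reference, the dinucleotide features were along these lines: the relative frequency of each of the 16 dinucleotides in the window around the variant. This is a simplified sketch with a hypothetical function name and a toy sequence; pairs containing N or other ambiguity codes are skipped.

```python
from itertools import product


def dinucleotide_frequencies(seq):
    """Relative frequency of each of the 16 dinucleotides, using overlapping pairs."""
    seq = seq.upper()
    counts = {a + b: 0 for a, b in product("ACGT", repeat=2)}
    total = 0
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:  # skips pairs containing N or other ambiguity codes
            counts[pair] += 1
            total += 1
    if total == 0:
        return {pair: 0.0 for pair in counts}
    return {pair: n / total for pair, n in counts.items()}


freqs = dinucleotide_frequencies("ACGTACGT")
print({pair: round(f, 3) for pair, f in freqs.items() if f > 0})
```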

ADD REPLYlink modified 12 months ago • written 12 months ago by Uday Rangaswamy120

Is this to me?

ADD REPLYlink written 12 months ago by Kevin Blighe54k

Yes sir. I'm typing the rest just to fulfill the minimum character criteria.

ADD REPLYlink written 12 months ago by Uday Rangaswamy120
1

Conservation scores should be there. Look for phyloP scores, as an example.

ADD REPLYlink written 12 months ago by Kevin Blighe54k

Got it. Any other feature that would contribute in this context?

ADD REPLYlink written 11 months ago by Uday Rangaswamy120
1

Take a look at the manuscripts for CADD (in silico predictor) - this will give you an idea. Conservation score is the single best predictor of functionality / pathogenicity, though.

ADD REPLYlink written 11 months ago by Kevin Blighe54k

Where can I find TSS information? Ensembl?

ADD REPLYlink written 11 months ago by Uday Rangaswamy120

Search in your search engine of choice?

ADD REPLYlink written 11 months ago by Kevin Blighe54k

Both phyloP and phastCons scores contribute effectively to this classification. GC content, which reflects the stability of the sequence around the variant, is also a good predictor, along with the CpG score. I looked for a TSS database for hg38 and found DBTSS, but it is not in the format I need; I would prefer BED format. Is there an alternative? Also, do you have any other feature suggestions? Thank you so much for your help so far.
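
In case it is useful to anyone, GC content and a CpG score can be computed roughly as below. This is a minimal sketch: I am assuming here that "CpG score" means the usual observed/expected CpG ratio, (number of CpG × length) / (number of C × number of G), and the function names and example sequence are hypothetical.

```python
def gc_content(seq):
    """Fraction of G or C among the unambiguous (A, C, G, T) bases."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def cpg_obs_exp(seq):
    """Observed/expected CpG ratio: (#CpG * length) / (#C * #G)."""
    seq = seq.upper()
    n_c = seq.count("C")
    n_g = seq.count("G")
    if n_c == 0 or n_g == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (n_c * n_g)


window = "ACGCGTTA"  # toy window; in practice the flank around the variant
print(gc_content(window), cpg_obs_exp(window))
```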

ADD REPLYlink written 11 months ago by Uday Rangaswamy120
1

Hi Uday, yes, that makes sense (PhyloP, PhastCons, GC content). There are many features at which you could look:

  • TFBS - transcription factor binding sites
  • structural variants / CNV
  • H3K27 acetylation (H3K27ac)
  • H3K27me3
  • et cetera

In the manuscripts of CADD and DANN, you will see many more ideas.

You will not find any single best predictor outside of conservation score, though.

ADD REPLYlink modified 11 months ago • written 11 months ago by Kevin Blighe54k

I did go through them. I have a question, though. Is it alright to compute the conservation score over the region (e.g. ±50 bp) surrounding the point of mutation? If positive means highly conserved and negative means otherwise, I would obviously get negative scores at the variant position for both of my classes. So I was thinking of computing the conservation score of the sequence surrounding the point of mutation. Do you concur?

ADD REPLYlink written 11 months ago by Uday Rangaswamy120

Hi Uday. What do you mean by 'classes'? The phyloP scores are measured on the log scale, with positive meaning more highly conserved (slower evolution than expected) and negative meaning less conserved (faster evolution than expected), as you appear to have noticed. I believe they already consider the surrounding region when calculating these scores, but cannot confirm.

ADD REPLYlink written 11 months ago by Kevin Blighe54k

I have defined two classes. One contains SNPs taken from an eQTL study involving diseased tissues; these mutations are believed to affect gene expression in the context of a particular disease, and I chose them based on their level of association with the disease in that eQTL study. The neutral class is taken from the GTEx portal, from normal tissues of interest. The pre-processing is taken care of. As I understand it, the conservation score is computed mostly on the basis of a multiple sequence alignment (MSA). I've considered phyloP100way and phastCons100way. I've used the mean conservation score of ±75 bp around the point of mutation, and it seems to work pretty well.
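
The windowed mean was computed roughly as follows. This is a simplified sketch: it assumes the per-base scores for the region have already been extracted into a list (the extraction from the phyloP100way / phastCons100way tracks is not shown), the function name is hypothetical, and positions without a score are skipped.

```python
import math


def window_mean(scores, center, flank=75):
    """Mean of per-base scores within +/- flank positions of center (0-based index).

    Positions with no score (None or NaN) are ignored; the window is clipped
    at the ends of the list.
    """
    start = max(0, center - flank)
    end = min(len(scores), center + flank + 1)
    values = [s for s in scores[start:end] if s is not None and not math.isnan(s)]
    if not values:
        return float("nan")
    return sum(values) / len(values)


# Toy per-base scores with one missing value; the variant sits at index 3.
scores = [0.5, 1.0, None, 2.0, -1.0, 3.0, 0.0]
mean_score = window_mean(scores, center=3, flank=2)
print(mean_score)
```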

ADD REPLYlink written 11 months ago by Uday Rangaswamy120
1

Okay, sounds very interesting. Great work!

ADD REPLYlink written 11 months ago by Kevin Blighe54k
1

The following article will give you an idea.

https://www.nature.com/articles/s41588-018-0160-6

ADD REPLYlink modified 15 months ago • written 15 months ago by Arup Ghosh2.1k

Thank you. Will look into it.

ADD REPLYlink written 15 months ago by Uday Rangaswamy120