I am a senior-year computer science student and I am considering bioinformatics as the field for my ML thesis.
Because I lack academic knowledge of genetics, I would be thankful for any help clarifying my current draft, or for links to organizations that could help me finish my introduction (or discard it).
Since I am still in the research stage, I apologize for any fundamental errors in the reasoning behind my questions.
I plan for the topic of my BSc thesis to be "Machine learning application in the diagnosis of genetic diseases".
The idea is to create a model based on neural networks (compared against basic ML tools such as linear regression) that, given extracted genetic features, decides whether a case is at risk of becoming ill. I would like to focus on rheumatic diseases (spondyloarthropathies such as RA and AS) and work on publicly available datasets.
I found the datasets hosted on "ncbi.nlm.nih.gov" genuinely compelling and selected two of them as relevant to my topic:
Gene expression collected from 120 samples. RNA was extracted from white blood cells. Available as raw read data in the SRA browser. The description contains a useful summary of the data processing method. The output is feature counts for every "Ensembl ID" gene annotation of the human genome.
"Screening genes associated with rheumatoid arthritis and ankylosing spondylitis" with 480 samples. Each sample is assessed on a specific group of "dbSNP" entries within a custom human SNP list.
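To show what I mean by "extracted genetic features" for the SNP dataset, here is a minimal sketch of how I imagine turning genotype calls into numbers a model can use. Everything in it is an assumption for illustration: the SNP identifiers are placeholders, the genotype strings and the counted alleles are invented, and the function names `allele_count` and `encode_samples` are my own, not part of any tool or of the dataset.

```python
from typing import Dict, List, Optional

def allele_count(genotype: str, allele: str) -> Optional[int]:
    """Count copies (0, 1 or 2) of `allele` in a two-letter genotype; None if the call is missing or invalid."""
    if len(genotype) != 2 or any(base not in "ACGT" for base in genotype):
        return None
    return genotype.count(allele)

def encode_samples(samples: List[Dict[str, str]],
                   counted_allele: Dict[str, str]) -> List[List[Optional[int]]]:
    """Build one feature row per sample, with SNP columns in sorted ID order."""
    snp_ids = sorted(counted_allele)
    return [
        [allele_count(sample.get(snp, ""), counted_allele[snp]) for snp in snp_ids]
        for sample in samples
    ]

# Placeholder SNP IDs and invented genotypes (not taken from the dataset).
COUNTED_ALLELE = {"rs_placeholder_1": "G", "rs_placeholder_2": "T"}
SAMPLES = [
    {"rs_placeholder_1": "AG", "rs_placeholder_2": "TT"},
    {"rs_placeholder_1": "GG", "rs_placeholder_2": "CC"},
    {"rs_placeholder_1": "AA"},  # genotype for the second SNP is missing
]

features = encode_samples(SAMPLES, COUNTED_ALLELE)
for row in features:
    print(row)
```

Missing calls are kept as `None` here on purpose, so that the decision on how to handle them (drop the sample, impute, etc.) stays explicit.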
After creating a model with acceptable accuracy on the test set, I plan to use regularization to discard useless features. The goal is to decide whether or not there is a fixed group of features that determines illness (e.g. the B*27 SNP variant in the HLA-B gene on chromosome 6: https://www.snpedia.com/index.php/HLA-B)
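To make the regularization step concrete, here is a minimal sketch of what I have in mind: an L1-penalized logistic regression trained with proximal gradient descent, where weights pushed to exactly zero mark features to discard. This is a toy under stated assumptions, not my final method: the data is a tiny synthetic, balanced design (not real patient data), the feature names `snp_a`, `gene_b`, `gene_c` are hypothetical, and the penalty strength and learning rate are arbitrary choices.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def soft_threshold(value, threshold):
    """Proximal operator of the L1 penalty: shrink toward zero, clip small values to exactly 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def fit_l1_logistic(X, y, l1=0.1, lr=1.0, epochs=200):
    """Proximal gradient descent on mean log-loss + l1 * sum(|w|); the bias is not penalized."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j] / n
            grad_b += err / n
        w = [soft_threshold(wj - lr * gj, lr * l1) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def make_toy_data():
    """Synthetic data: only the first feature relates to the label, with 20% of labels flipped."""
    X, y = [], []
    for x0 in (-1.0, 1.0):
        for x1 in (-1.0, 1.0):
            for x2 in (-1.0, 1.0):
                for copy in range(5):
                    label = 1 if x0 > 0 else 0
                    if copy == 0:
                        label = 1 - label  # deterministic label noise
                    X.append([x0, x1, x2])
                    y.append(label)
    return X, y

FEATURE_NAMES = ["snp_a", "gene_b", "gene_c"]  # hypothetical names
X, y = make_toy_data()
weights, bias = fit_l1_logistic(X, y)
selected = [name for name, wj in zip(FEATURE_NAMES, weights) if wj != 0.0]
print("weights:", weights)
print("kept features:", selected)
```

On real data I would expect to tune the penalty strength by cross-validation, since a value that is too large discards informative features as well.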
My questions are:
Is it scientifically valid to create such a model?
By this I mean: is a genetic illness based on the overall relationship between many genes across chromosomes (or on a single gene variant that triggers the illness)? Or is the issue so much more complex that machine learning on gene features misses the point?
Is it scientifically correct to combine data from different experiments? (e.g. use both the first and the second dataset mentioned above)
Could you briefly explain the relation between the first and second datasets? (My reasoning is below.)
3.1. I understand the first one as the counted occurrences of every gene in a sample (the Ensembl ID indicates the gene, e.g. ENSG00000223972).
So the samples are indifferent to any variations in genes? For example, I can easily find the HLA-B gene with its count, but the sample does not say which allele it is (to know whether it is the HLA-B27* variant). So the dataset provides the full gene set for every sample, but how should I interpret the number of counts? Does a lower count for a specific ENSG entry mean a higher probability that the sample does not have the base encoding but some SNP variation? (As I understand it, a sample might not have all genes; the file contains over 60,000 ENSGs.)
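To make my reading of the first dataset concrete, this is how I picture its count table: one row per Ensembl gene ID, one column per sample, and a single integer in each cell with no allele information. The sketch below only illustrates that layout. The table contents, the sample names, the `gene_id` column header and the placeholder ID for HLA-B are all invented; ENSG00000223972 is the only identifier taken from my description above, and `load_counts` is my own helper name.

```python
import csv
import io

# Invented stand-in for a feature-count table (tab-separated).
TOY_TABLE = (
    "gene_id\tsample_1\tsample_2\n"
    "ENSG00000223972\t0\t3\n"
    "ENSG_PLACEHOLDER_HLA_B\t1520\t987\n"
)

def load_counts(text):
    """Parse the table into {gene_id: {sample_name: count}}."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    counts = {}
    for row in reader:
        gene_id = row.pop("gene_id")
        counts[gene_id] = {sample: int(value) for sample, value in row.items()}
    return counts

counts = load_counts(TOY_TABLE)

# Each lookup gives one number per gene per sample; nothing here says which allele was read.
for gene_id, per_sample in counts.items():
    print(gene_id, per_sample)
```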
3.2. The second dataset covers only a given group of genes and focuses on SNPs, i.e. the different versions of a given gene.
Why are the HLA genes omitted from this dataset, given that they are considered crucial to these diseases? As I understand it, if the HLA-B gene were included, the SNP set would also include the HLA-B27 variation https://www.ncbi.nlm.nih.gov/snp/rs13202464
The first dataset operates on RNA, the second on DNA. Am I allowed to convert the first dataset to DNA (with the TopHat software: http://ccb.jhu.edu/software/tophat) and treat it as DNA?
Are 120 and 420 samples enough for such research?
Should I focus on genome-level samples or rather on samples that capture gene variations (SNP level)?
Can I take any sample's raw reads from SRA, map them to the human genome with the "Bowtie" software and treat the result as a usable genome sample?
Could you recommend any additional dataset platforms for my further search? Or people/organizations that would be keen to give advice?