Which file types are needed to represent the genotype in a phenotype - genotype correlation analysis?
8 months ago

I am studying a specific disease. In our laboratory we performed targeted sequencing of 15 genes known to be associated with the disease, which gave us two FASTQ files (paired-end reads) per patient. After some analysis we obtained VCF files from variant calling, listing the variants in those 15 genes. I also have the clinically observed phenotypes for those patients, collected in an Excel sheet.

Now I want to train ML algorithms to correlate the patients' genotypes with their phenotypes, with the aim of predicting phenotype in the future.

How can I gather the whole sequence of a single patient to correlate it with the different phenotypic features? I tried converting the FASTQ files to FASTA, then converting the FASTA files to tabular format and loading them into a pandas DataFrame.

But the question is how to combine all the sequences in the two paired-end files for each patient so that I end up with a table like the one in the photo below, where each row represents the data for a single patient, every single nucleotide is a feature, and the last column(s) of the DataFrame hold the required phenotype(s).

phenotype genotype correlation DataFrame ML • 527 views

What do you mean by

whole sequence of a single patient

And

every single nucleotide is a feature (column, nd)


Every patient has 2 FASTA files (paired-end reads) from the sequencer. After conversion to tabular format, each FASTA file looks like the following image: there are 337616 rows, each row is a read, and every read has an ID in the first column and a nucleotide sequence in the second column.

My question is how to combine all these reads (from the two paired-end FASTA files) so that each row represents a single patient, as in the following image, where the first column (Class), for example, represents the phenotype, the second column holds the patient ID (not the read ID), and the third column is a single sequence representing the 15 genes I targeted in my sequencing process.

After that I will split every nucleotide into its own column, as shown below, so that at every position all patients can be compared for the presence of a SNP, which can then be correlated with the phenotype.


Sequencing reads are redundant: each position is covered by more than one read (that is what coverage means). You can't use the reads as they are for an application like this, and you're overcomplicating things. Why not call variants from each patient's FASTQ files and report only the SNPs located in your genes of interest? There are tons of posts here, and published pipelines, for doing that. Then, with some manipulation of the VCF output, you'd have a DataFrame of patients (rows) by SNPs (columns), and you could encode the presence of each SNP in binary terms (0: absent, 1: present), for example.
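To make the patient-by-SNP idea concrete, here is a minimal sketch in plain Python. It rests on several assumptions that are not from this thread: one single-sample VCF per patient that lists only variant sites, tab-separated standard VCF columns, and made-up patient IDs and coordinates as toy data. The helper names `read_variants` and `build_matrix` are hypothetical. It does not inspect genotype fields, and it cannot tell "no variant" apart from "position not covered", so treat it as a starting point rather than a finished pipeline.

```python
import io


def read_variants(vcf_handle):
    """Return the set of variant keys (CHROM:POS:REF>ALT) found in one VCF stream."""
    variants = set()
    for line in vcf_handle:
        if line.startswith("#") or not line.strip():
            continue  # skip meta-information, header and blank lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref, alts = fields[0], fields[1], fields[3], fields[4]
        for alt in alts.split(","):  # multi-allelic sites list several ALT alleles
            if alt != ".":
                variants.add(f"{chrom}:{pos}:{ref}>{alt}")
    return variants


def build_matrix(variants_by_patient):
    """Build a binary presence/absence table: one row per patient, one column per variant."""
    columns = sorted(set().union(*variants_by_patient.values()))
    matrix = {
        patient: {variant: int(variant in found) for variant in columns}
        for patient, found in variants_by_patient.items()
    }
    return columns, matrix


# Toy data (hypothetical patients and coordinates), standing in for real VCF files.
HEADER = "##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
toy_vcfs = {
    "P1": HEADER + "chr1\t100\t.\tA\tG\t50\tPASS\t.\nchr2\t200\t.\tC\tT\t60\tPASS\t.\n",
    "P2": HEADER + "chr1\t100\t.\tA\tG\t55\tPASS\t.\nchr3\t300\t.\tG\tA,C\t70\tPASS\t.\n",
}

variants_by_patient = {
    patient: read_variants(io.StringIO(text)) for patient, text in toy_vcfs.items()
}
columns, matrix = build_matrix(variants_by_patient)

for patient, row in matrix.items():
    print(patient, [row[variant] for variant in columns])
```

With real data you would replace `io.StringIO(text)` with `open(path)` for each patient's VCF. The nested dict can be handed to `pandas.DataFrame.from_dict(matrix, orient="index")`, and the phenotype column(s) from the Excel sheet can then be joined on the patient ID.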