Drop values in expression dataset python
3.9 years ago

I have this microarray dataset. I want to bypass an issue I had in the early version of this pipeline (https://geoparse.readthedocs.io/en/latest/Analyse_hsa-miR-124a-3p_transfection_time-course.html). I have created an experiment file and read it in as a dataframe. I want to eliminate every column in my expression table whose name no longer exists as a string value in the Accession column of that dataframe.

# Import tools
import GEOparse
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# download datasets
gse1 = GEOparse.get_GEO(geo="GSE99039", destdir="C:/Users/Highf_000/PycharmProjects/TFTest")
gse2 = GEOparse.get_GEO(geo="GSE6613", destdir="C:/Users/Highf_000/PycharmProjects/TFTest")
gse3 = GEOparse.get_GEO(geo="GSE72267", destdir="C:/Users/Highf_000/PycharmProjects/TFTest")

# import all GSM data for each GSE file
with open("GSE99039_GPL570.csv") as f:
    GSE99039_GPL570 = f.read().splitlines()
with open("GSE6613_GPL96.csv") as f:
    GSE6613_GPL96 = f.read().splitlines()
with open("GSE72267_GPL571.csv") as f:
    GSE72267_GPL571 = f.read().splitlines()

# gse1
gse1.gsm = gse1.phenotype_data
print(gse1.gsm.head())

# gse1
gse1.details = pd.read_csv('GSE99039_MicroarrayDetails.csv', delimiter = ',')
print(gse1.details.head())
# Keep only rows whose "Disease label" is CONTROL, IPD or GPD
gse1.detailsv1 = gse1.details[gse1.details["Disease label"].isin(["CONTROL", "IPD", "GPD"])]
print(gse1.detailsv1.head())

# gse1
pivoted_control_samples = gse1.pivot_samples('VALUE')[GSE99039_GPL570]
print(pivoted_control_samples)


# gse1
# Median expression of each probe across the selected samples
pivoted_control_samples_average = pivoted_control_samples.median(axis=1)
# Print number of probes before filtering
print("Number of probes before filtering: ", len(pivoted_control_samples_average))
# Keep probes whose median is at or above the 25th percentile of all probe medians
expression_threshold = pivoted_control_samples_average.quantile(0.25)
expressed_probes = pivoted_control_samples_average[pivoted_control_samples_average >= expression_threshold].index.tolist()
# Print number of probes at or above the cut-off
print("Number of probes above threshold: ", len(expressed_probes))
# Confirm filtering worked (note: this still contains every sample column)
samples = gse1.pivot_samples("VALUE").loc[expressed_probes]
print(samples.head())

# print phenotype data
print(gse1.phenotype_data[["title", "source_name_ch1", "Disease_Label", "Sex" ]])
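To make the probe-filtering step above concrete, here is a minimal sketch on toy data. The probe names, sample names and values are made up for illustration; only the median/quantile logic mirrors the script.

```python
import pandas as pd

# Toy expression table: rows are probes, columns are samples (made-up values).
toy = pd.DataFrame(
    {"GSM_A": [1.0, 5.0, 8.0, 3.0], "GSM_B": [2.0, 6.0, 9.0, 3.0]},
    index=["p1", "p2", "p3", "p4"],
)

# Median of each probe across samples.
probe_median = toy.median(axis=1)

# 25th percentile of those medians is used as the expression cut-off.
threshold = probe_median.quantile(0.25)

# Probes at or above the cut-off are kept.
kept = probe_median[probe_median >= threshold].index.tolist()
print(threshold, kept)
```

So the 0.25 is a quantile of the per-probe medians, not an absolute expression value.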

This is what the dataframe I created looks like (named gse1.detailsv1 in the script):

   Accession       Title  Source name  ... Subject_id Disease label     Sex
0  GSM2630758  E7R_039a01  Whole blood  ...      L3012       CONTROL  Female
1  GSM2630759  E7R_039a02  Whole blood  ...      L2838           IPD    Male
2  GSM2630760  E7R_039a03  Whole blood  ...      L2540           IPD  Female
3  GSM2630761  E7R_039a04  Whole blood  ...      L3015       CONTROL  Female
4  GSM2630762  E7R_039a05  Whole blood  ...      L2884           IPD  Female

[5 rows x 7 columns]

This is what my expression table looks like (named samples in the script):

name       GSM2630758  GSM2630759  ...  GSM2631314  GSM2631315
ID_REF                             ...                        
1007_s_at       5.397       4.952  ...       5.567       5.529
1053_at         5.199       5.198  ...       5.706       5.078
117_at          8.327       8.589  ...       8.511       8.458
121_at          7.042       6.935  ...       7.526       7.673
1294_at         7.753       8.210  ...       7.537       7.418

[5 rows x 558 columns]

For example, if GSM2630758 doesn't exist in the Accession column of the first dataframe, I want to drop the GSM2630758 column from the expression table. I need to loop through all the columns and eliminate every one that no longer exists.
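A minimal sketch of the behaviour I am after, on toy data. The two small frames stand in for gse1.detailsv1 and samples; the values for GSM2630760 are made up, and the approach (a boolean column mask from `Index.isin`) is just one possible way to do it without an explicit loop.

```python
import pandas as pd

# Toy stand-in for gse1.detailsv1: GSM2630758 has been filtered out.
details = pd.DataFrame(
    {
        "Accession": ["GSM2630759", "GSM2630760"],
        "Disease label": ["IPD", "IPD"],
    }
)

# Toy stand-in for the expression table (probes x samples).
samples = pd.DataFrame(
    {
        "GSM2630758": [5.397, 5.199],
        "GSM2630759": [4.952, 5.198],
        "GSM2630760": [5.100, 5.300],  # made-up values
    },
    index=pd.Index(["1007_s_at", "1053_at"], name="ID_REF"),
)

# True for every sample column that is still listed under Accession.
keep = samples.columns.isin(details["Accession"])

# Keep only those columns; the rest are dropped.
filtered = samples.loc[:, keep]
dropped = samples.columns[~keep].tolist()

print(filtered)
print("Dropped:", dropped)
```

On the real data this would presumably become `samples.loc[:, samples.columns.isin(gse1.detailsv1["Accession"])]`, assuming the Accession values match the column names exactly (no stray whitespace or case differences).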

RNA-Seq python GEO GEOparse