Mapping Affy probe IDs

I am working with gene expression data that was collected on GeneChip Human Gene 1.0 ST arrays according to Affymetrix standard protocols; the paper was published in 2013. There is an old annotation file that contains probe IDs mapped to gene names. However, a newer annotation shows differences in the probe data, meaning some of the probes map to different genes.

As an example: probes that no longer map to anything in the new annotation were annotated to specific genes in the old data. Some of these are shown below:

        id  gene.name.orig  gene.name.new
1  7892501        SNORD58B
2  7892501           RPL17
3  7892506          TARDBP
4  7892508           RNPS1
5  7892509        HSP90AB1
6  7892512             JTB
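
A comparison table like this can be produced with pandas; the following is only a minimal sketch, assuming the two annotation releases have been exported to CSVs (the file names annot_2013.csv / annot_current.csv and the columns probe_id / gene_name are placeholders, not the actual Affymetrix file layout):

# Minimal sketch: find probes whose gene assignment changed between two
# annotation releases. File and column names are placeholders.
import pandas as pd

old = pd.read_csv('annot_2013.csv')      # columns: probe_id, gene_name
new = pd.read_csv('annot_current.csv')   # columns: probe_id, gene_name

merged = old.merge(new, on='probe_id', how='outer',
                   suffixes=('.orig', '.new'))

# Rows where the two releases disagree; NaN under gene_name.new marks
# probes that no longer map to anything.
changed = merged[merged['gene_name.orig'] != merged['gene_name.new']]
print(changed.head())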

Should we use the more recent mappings or the older ones?

Thanks,

Jalal

affymetrix • probe • map

You should clarify the purpose of this reanalysis. Are you trying to reproduce the published results, or is the reanalysis meant to enable additional discoveries?

"Meaning some of the probes map to different genes."

That sounds worrisome. Did you get/look at the latest annotation from Affymetrix, or are you doing the remapping yourself?


I just emailed my adviser about what I am trying to do. Is this realistic? Would you recommend proceeding this way? I am trying to accomplish it in R for now, since I know it better than Python.

We got very nice clustering results last week, before I went to the conference, because 8 of the 10 genes most correlated with MRPL44 also belonged to the same GO term, i.e. Mitochondrial Large Ribosomal Subunit, and the remaining 2 belonged to the Mitochondrial Small Ribosomal Subunit.

Since about Christmas 2016/17 I have been trying to concatenate all of the approximately 10,000 conditions/observations from the same microarray chip into one gigantic data frame, i.e. the special Python/R variable type that looks like an Excel table, where each row is a gene and each column is a time point/observation/condition. I think of each time-series dataset as a string that shows the expression trajectory across all of that dataset's time points. Unfortunately, 81 time points/observations is the longest time series I could find on NCBI and ArrayExpress, and clustering results based on trajectory similarity from only a single dataset are still very fuzzy and ambiguous.

But if we think of each time-series dataset as the life of a particular yeast, which, when it dies, rises from the dead in the form of another dataset with different environmental conditions (e.g. no oxygen, CR, salt stress, heat stress, galactose media, copper poisoning, knockouts, over-expression, etc.) appended to the previous yeast's expression trajectory of life, then instead of only 81 opportunities to differ we provide over 10,000 opportunities for distinct differences between the trajectories of even the most correlated genes under at least one condition, which allows us to rank them by correlation much less ambiguously. This method is very promising because it is the best way to infer the functions of those genes that have no common name yet.

I am trying to get it to work programmatically in either R or Python, because then I could present this new method in Memphis. It holds good potential for getting published because it can be applied to microarray data from any chip for any species. It does not even have to be time-series data, and there is no need to read the publications behind any of the datasets, since all we are looking for is correlation between gene-expression trajectories, which are formed by connecting all observations/measurements into a single long string of 10,000 mRNA levels joined by 9,999 slopes.

It might take some trial and error to compare the clustering outcomes of different similarity measures, such as Pearson correlation, cosine similarity, Euclidean distance, Manhattan distance, Kendall correlation, Spearman correlation, and any other measure I can find in the literature, so that we can choose, from all the clustering methods we have tried, the one that results in the highest enrichment score.

None of the friends whom I consulted and asked for a quick literature search is aware of anybody ever having tried to infer gene function from a single super-long concatenated trajectory, connected by as many slopes as there are reported expression measurements from the same chip, minus one. If this is true, then we have something novel coming out of my dissertation, which promises to outperform every other clustering methodology.

Somebody on Biostars.org wrote to me that the 50 genes belonging to the GO term "Mitochondrial Large Ribosomal Subunit" are more correlated in their expression with one another than the genes belonging to any other GO term. That is why I tried this GO term first.
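
The ranking step behind this result can be written in a few lines of pandas; this is only an illustrative sketch, where the DataFrame expr and the helper most_correlated are hypothetical names, assuming expr is a genes-by-samples table indexed by gene name:

# Sketch: rank genes by correlation with one query gene across all
# concatenated samples. Assumes `expr` is a genes x samples DataFrame.
import pandas as pd

def most_correlated(expr, gene, top=10, method='pearson'):
    query = expr.loc[gene]
    others = expr.drop(index=gene)
    # Correlate every other gene's trajectory with the query gene's.
    corr = others.apply(lambda row: row.corr(query, method=method), axis=1)
    return corr.sort_values(ascending=False).head(top)

# Example: the 10 genes whose trajectories track MRPL44 most closely.
# print(most_correlated(expr, 'MRPL44'))

# For the full gene x gene matrix in one call (much faster than looping
# over pairs): expr.T.corr(method='pearson')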

Is there any particular GO term you would like to see tested next? Do you have any particular genes of interest for which you'd like to know which other genes are expressed in the most correlated manner? I can test those as soon as somebody online has figured out how best to deal with the annoying error message "No numerical data to plot." That is where progress has stalled since we started facing it at around 3 p.m. I am tempted to forget about saving computing time and let the similarity calculations for each distance matrix run for a day, because the current Python script works well as long as 63 digits after the decimal point are retained.
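
For what it's worth, pandas raises this kind of error when the DataFrame handed to .plot() contains no numeric columns, which happens easily when annotation text columns get mixed in with the expression values. A minimal sketch of the usual fix ('series.csv' is an illustrative file name):

# Sketch: pandas' .plot() raises "no numeric data to plot" when every
# selected column has dtype object. Coerce expression columns first.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('series.csv', index_col=0)
print(df.dtypes)                             # 'object' columns are the culprits

numeric = df.apply(pd.to_numeric, errors='coerce')  # non-numbers become NaN
numeric = numeric.dropna(axis=1, how='all')         # drop pure annotation columns
numeric.T.plot(legend=False)                        # samples along the x-axis
plt.show()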


Part 2 of my email to my adviser, since it was too long to fit in one reply text box:

We wanted to make a Graphical User Interface (GUI) for you, where you could type in any genes of your interest, pick any correlation measure you like, and enter how many of the most correlated genes the script should use for gene enrichment; after some time, ranging from a few minutes up to a day, our script would draw the plots and list the names of the most correlated genes. We could apply this method to any gene that has no common name yet, publish the GO term to which it clustered most strongly, and hypothesize that it is involved in the same function. Then we could validate this with promoter analysis, i.e. TFBS distribution similarity using JASPAR distance matrices, to see whether we get the same result. As soon as we have a reliably functioning GUI, I can start to rewrite my dissertation, because then I can try anything I want on my own without having to worry about messing up the code; I cannot find the cursor unless I start writing something or press Enter.

It might be that we can improve our clustering results even more by separating time series from other datasets, because for time series we could use a time-warping clustering algorithm, which must not be used on non-time-series data. We could even use the MetaCycle R package, which clusters according to period length, phase shift, amplitude, and oscillation pattern; but only for datasets spanning at least 2 cell cycles, because otherwise errors are inevitable, since MetaCycle enforces a period length on every gene even if it is not periodically expressed.
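
For reference, the time-warping idea refers to dynamic time warping (DTW); below is a bare-bones, textbook O(n*m) sketch of the distance in plain Python, not any particular package's API:

# Textbook dynamic time warping (DTW) distance between two trajectories.
# Dynamic programming over all alignments; for time-series data only.
def dtw_distance(a, b):
    n, m = len(a), len(b)
    inf = float('inf')
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],       # stretch a
                                 cost[i][j - 1],       # stretch b
                                 cost[i - 1][j - 1])   # step both
    return cost[n][m]

# Two trajectories with a small phase shift still come out as similar:
# print(dtw_distance([0, 1, 2, 1, 0], [0, 0, 1, 2, 1]))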

Once we have our programming methodology working properly for the yeast2 microarray chip, we could modify it for every other chip to predict gene functions in any organism, despite not knowing anything about it. I believe this would make us look smart, but we need to be fast, because my new method is so easy and simple that anyone who can program in R or Python can get results in less than a day. That is why I am so surprised that nobody has tried it yet.

Should I offer coauthorship to anyone who can help us significantly with the coding, since this is holding us back in the most unpredictable manner? The more people are on the paper, the more important it looks and the faster the work gets done.

Actually, I don't see much of a problem in writing in my methods section that one clustering distance matrix takes about a day to compute, because it gives people the impression that our calculations are very precise and sophisticated, hence needing more time.

Today I also applied for a $500 travel grant to a machine learning and AI conference, where I hope to get some help in getting my hidden object theory published, because that is the bottleneck right now. Since it is viewed as very speculative and hypothetical by almost all publishers, nobody seems willing to publish it until at least one machine learning expert understands it and can describe it in a much better way than I can, so that others understand it as well. If we can get it published, then I have 2 novel things in my dissertation.


Part 3 of 3 of the email I already submitted:

Below is the Python code we have been struggling with. If we cannot fix it, we must either wait a day for the results or get access to a Linux computer that is faster than my laptop.

from __future__ import print_function
import os
import timeit
import argparse

import pandas as pd
import matplotlib.pyplot as plt

import gpl
import conf.settings
from util import count_samples
from correlation import CorrelationMatrix


class ExpressionMatrix(object):
    def __init__(self, platform=None, series=None, invert=False, limit=0,
                 top=10, **kwargs):
        data_path = conf.settings.DATA_PATH
        self.sample_number = 0
        self.invert = invert
        self.top = top

        if series:
            file_path = os.path.join(data_path, series + '.csv')
            self.df = pd.read_csv(file_path, index_col=0)
            sample_number = count_samples(self.df)

            #print(self.df)
            print(self.df.dtypes)
            print(self.df.shape)

            # Downcast the expression columns to float32 to save memory.
            self.df.iloc[:, :sample_number] = \
                self.df.iloc[:, :sample_number].astype('float32')
            self.df.info()

        elif platform:
            count = 0
            # Use the constructor argument here, not the global `args`.
            platform = gpl.Platform(platform, parse=False, meta_only=True)
            series = platform.get_series(download=False)

            for index, dataset in enumerate(series):
                file_path = os.path.join(data_path, dataset + '.csv')
                if not os.path.exists(file_path):
                    file_path = os.path.join(data_path, dataset + '.tar.csv')
                    if not os.path.exists(file_path):
                        continue
                df = pd.read_csv(file_path, index_col=0)
                count += 1

                sample_number = count_samples(df)
                expression_matrix = df.iloc[:, :sample_number]

                # Concatenate every dataset column-wise into one big matrix.
                if count == 1:
                    matrix = expression_matrix
                else:
                    matrix = pd.concat([matrix, expression_matrix], axis=1)
                    print('Concatenated matrix: %s' % dataset, matrix.shape)
                if limit and count > limit:
                    break

            # Re-attach the annotation columns of the last dataset read.
            annotations = df.iloc[:, sample_number:]
            self.df = pd.concat([matrix, annotations], axis=1)

        self.sample_number = count_samples(self.df)

        for key, value in kwargs.items():
            setattr(self, key, value)

        # Undo the log2 transformation if --unlog was given.
        if getattr(self, 'unlog', False):
            self.df.iloc[:, :self.sample_number] = \
                2 ** self.df.iloc[:, :self.sample_number]

    def correlations(self):
        return CorrelationMatrix(self)


def main(args):
    expressions = ExpressionMatrix(**vars(args))
    if args.load:
        correlations = CorrelationMatrix(expressions, calc=False)
        correlations.load()
    else:
        correlations = expressions.correlations()
        if args.save:
            correlations.save()

    print(correlations.df.shape)
    print(args.similarity)
    times = []
    if args.choices:
        for i in range(args.trials):
            start_time = timeit.default_timer()
            correlations.correlate(args.choices)
            stop_time = timeit.default_timer()
            times.append(stop_time - start_time)
        # Only report timings when at least one trial actually ran.
        print('Average duration: ', sum(times) / len(times))


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--series', '-s', type=str)
    parser.add_argument('--platform', '-p', type=str)
    parser.add_argument('--invert', '-i', action='store_true')
    parser.add_argument('--choices', '-c', type=str, nargs='+', default='')
    parser.add_argument('--limit', '-l', type=int, default=0)
    parser.add_argument('--top', '-t', type=int, default=10)
    parser.add_argument('--similarity', '-sim', type=str, default='pearson',
                        help='Similarity measure: pearson, kendall or '
                             'spearman (default: pearson).')
    parser.add_argument('--trials', '-tr', type=int, default=1)
    parser.add_argument('--plot', '-plt', action='store_true')
    parser.add_argument('--unlog', '-ul', action='store_true')
    parser.add_argument('--save', '-sa', action='store_true')
    parser.add_argument('--load', '-lo', action='store_true')
    args = parser.parse_args()
    main(args)
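
With the syntax errors fixed, the script would be invoked roughly like this (the script file name and series accession below are illustrative, not taken from the original post):

# Build and save the correlation matrix for one series:
python expression_matrix.py --series GSE3431 --similarity spearman --save

# Reload it later and time the lookup for two query genes:
python expression_matrix.py --series GSE3431 --load --choices MRPL44 MRPL10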


How did you find out that some of the probe-set IDs map to different genes? Did you test and run the code? How should I deal with this problem? Should I delete all the probe-set IDs that map to more than one gene? How can this happen? How could you notice it so quickly? Should we instead keep the probe-set ID as a unique tuple identifier and list it together with the systematic and common yeast gene names, rather than simply deleting those probes?
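
One way to avoid deleting such probes outright is to flag them first; a minimal pandas sketch, assuming an annotation table with probe_id and gene_name columns (file and column names are placeholders):

# Sketch: flag probe sets mapping to more than one gene instead of
# deleting them outright. File and column names are placeholders.
import pandas as pd

annot = pd.read_csv('yeast2_annotation.csv')

genes_per_probe = annot.groupby('probe_id')['gene_name'].nunique()
multi = genes_per_probe[genes_per_probe > 1].index

# Keep the rows, but mark them so downstream steps can decide what to do.
annot['ambiguous'] = annot['probe_id'].isin(multi)
print(annot['ambiguous'].value_counts())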


Hi genomax. Thanks for your help, and I am sorry for the long wait for my answers and replies. It seems that www.biostars.org only allows me to post five items within a six-hour period. That is why I had to create 2 Biostars.org accounts. I am continuing the information exchange where it stopped from my TFHahn@UALR.edu account. I must somehow get this Python script to run before I can go to sleep, because it runs for several hours. My adviser is expecting results when he meets me tomorrow at 11 a.m. in his office. Tomorrow we must agree on what needs to be part of my dissertation and what I can realistically accomplish before the April 1st deadline.

Could you therefore please take remote control of my Linux computer, on which the troublesome Python script is behaving in a way that I cannot understand?

My TeamViewer ID is 970 883 755 and my password is h62q9g. I am using TeamViewer 12.

If you prefer AnyDesk, my AnyDesk ID is thomashahn@ad. Could you please fix my Python script remotely and make it do what I described in the email to my adviser, so that I can start generating the 10,000 by 10,000 distance matrix before I go to sleep?

I have been trying to fix it for hours, because my Python tutor is sleeping, but the more effort I put into changing the code so that it runs properly, the more dysfunctional my script becomes. My limited eyesight is not good enough to see the location of the narrow cursor, which can cause accidental, unintended deletions or insertions, because writing something is the only way for me to figure out where the cursor is if I have forgotten where I left it.

What is the reason for the posting limitations? They cannot prevent anyone determined to post from doing so anyway, because it is so simple to open multiple accounts with different email addresses. But they cause lots of unneeded confusion and frustration, because when looking for answers to my questions more than a week after I initially posted them, I can no longer remember from which account I posted them. Since the limit only makes finding replies harder, without ever having a chance to accomplish its initial objective of preventing anyone who needs answers fast from posting above the limit, could you please remove, or at least raise, my posting limit so that I can stay in the same account when looking for replies, help, and answers?

Unfortunately, I must operate under extreme time pressure, because my department chair told me that if I cannot graduate this spring, I will never again have a chance to graduate, because my university will start charging me $36,000 per year in out-of-state tuition, which I cannot pay. They already took away my campus employment last summer, since they wanted me to have graduated a year ago.

My Skype ID is tfh002 and my email is Hahn5Thomas@gmail.com. Those two communication options work better with my screen reader than these Biostars.org text boxes, because my remaining visual field is too small to figure out whether I am correctly replying, posting, or commenting.


Am I doing what myself? Annotating? I downloaded the yeast2 chip annotation file from Affymetrix about a year ago. It is somewhere on my hard drive, where I cannot find it, and it functions like a library. I cannot remember anymore how exactly I got it. I had to call Affymetrix technical support to make it work on my computer; they took remote control of my computer and installed it for me, but they moved too fast for me to follow. It seems that my Python script interacts with other Python files and imports data from them in ways that I cannot follow yet.

In general, is the plan of action I emailed to my adviser realistic? He asked me to do this a year ago, but I am still not able to code it. Nobody on my academic advisory committee knows more about bioinformatics and genomic data analysis than me; that is why nobody can help me, and why my only option is to figure it out by interacting with people on the internet.

I must get my work published, because that tells my committee that my writings are true; otherwise, nobody knows for sure. About a year into my PhD program I discovered that my master's thesis was wrong, but since I believed it to be right, people trusted me, because they knew less than me. The problem with dissertations is that, unlike master's theses, they must be published electronically online. That is why people are starting to worry about approving fundamental mistakes in my dissertation and getting in trouble for it later. A publication would give us the confidence that my writings must be correct; that is why it is so important in my case. This is the reason why I have circumvented the imposed posting limitations.

If anybody in the USA can help me to somehow finish this work, you are welcome to call me on my cell phone at +1 (318) 243 3940. I appreciate any help that I can get.

