Where To Find Test Datasets For Data Classification Problems
11.3 years ago

Hi there.

I am new to this blog and to bioinformatics too.

I am finishing my master's degree in bioinformatics and find the data mining field very interesting. I have started to practice by running some analyses on acute lymphoblastic leukemia (ALL) and acute myeloid leukemia (AML) data, based on the famous Golub dataset, available here: http://datam.i2r.a-star.edu.sg/datasets/krbd/Leukemia/ALLAML.html I understand it is one of the most famous and most studied datasets; almost every paper talks about it.

My question is: once I have finished my analysis of this dataset, I would like to validate my conclusions by challenging the methods with other datasets for the same classification problem (ALL vs. AML). Any suggestions on where I can get a different dataset?

Thanks

dataset • 6.4k views

It sounds like you are pretty new to bioinformatics. I'd suggest finding a local collaborator to help you navigate the database and format details until you are a bit more comfortable with the data. It will certainly save you some time and energy in the short term and you will learn more in the long term.


Yes, I am, as I said before. Anyway, thanks for the tips; I am starting to figure all this out.

Thanks


If you have further questions, please can you post them as comments under each answer, not as new answers.

11.3 years ago

The largest collection of publicly available microarray data is at NCBI GEO. A large collection of cancer samples is available in GSE2109, but there are literally hundreds of potentially interesting datasets there.

11.3 years ago

That was very helpful. What is the difference between GPL and GSE accessions?

Thanks


OK, thanks for your tip. Regards

11.3 years ago

Sorry to bother you again, but something is not clear to me.

I have always worked with ARFF files that were already formatted, with several instances, thousands of attributes, and divided into classes. Since I am using Weka as my data mining software, if I want to try another dataset (with many patients, i.e. instances), I need an ARFF file containing several instances (1 instance == 1 patient). My question: is this http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM353943 (let's pretend this is the case I want) the expression array for just one patient? What if I want many of these arrays (for many patients) and then want to combine them into a single ARFF file divided into classes in order to work with Weka? I probably seem very naive to you, and I admit I am, but I am at the very beginning of this.

Thanks


Yes, GSMs represent single samples. GSE and GDS records represent collections of related samples. I'm not sure what you mean by "arff" format. The formats used by NCBI GEO are SOFT and MINiML, both of which are specific to NCBI GEO.
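
To make the accession types concrete, here is a minimal Python sketch that maps a GEO accession prefix to the kind of record it names. The function name `geo_record_type` and the wording of the descriptions are my own, not part of any GEO tool; the prefix meanings (GPL = platform, GSM = sample, GSE = series, GDS = curated dataset) follow GEO's naming.

```python
import re

# Accession prefix -> kind of record, following NCBI GEO's naming scheme.
GEO_PREFIXES = {
    "GPL": "platform (the array design used for the measurements)",
    "GSM": "sample (the data from a single hybridization)",
    "GSE": "series (a set of related samples submitted together)",
    "GDS": "dataset (a curated collection of comparable samples)",
}


def geo_record_type(accession):
    """Return a short description of the record type for a GEO accession.

    Hypothetical helper for illustration; it only inspects the prefix
    and does not contact NCBI.
    """
    match = re.fullmatch(r"(GPL|GSM|GSE|GDS)(\d+)", accession.strip().upper())
    if match is None:
        raise ValueError("not a recognised GEO accession: %r" % accession)
    return GEO_PREFIXES[match.group(1)]


if __name__ == "__main__":
    for acc in ("GSM353943", "GSE2109"):
        print(acc, "->", geo_record_type(acc))
```

So the GSM353943 link above is one sample, while GSE2109 is a whole series of samples.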


ARFF (Attribute-Relation File Format) is the input file format for the Weka data mining software.


Please give me your email address; I need to talk to you about neuropsychiatric disease protein or SNP datasets in ARFF format. My email: bgupta.rs.cse@itbhu.ac.in, +919307025085


Ahh, I see. You can simply convert the microarray data into arff format yourself. Alternatively, if you want to pursue machine learning, you could use R directly since the GEOquery package loads data from NCBI GEO directly into R.
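
For the conversion itself, something along these lines might work as a starting point. It is only a sketch in Python: it assumes you have already extracted one list of expression values per patient, in the same gene order, plus a class label per patient. The function name `to_arff`, the gene IDs, and the numbers in the example are made up for illustration; it also assumes gene IDs contain no quote characters.

```python
def to_arff(relation, gene_ids, samples, class_labels):
    """Build ARFF text with one numeric attribute per gene and a nominal class.

    samples is a list of (expression_values, label) pairs, one per patient.
    Hypothetical helper; assumes gene IDs contain no single quotes.
    """
    lines = ["@RELATION " + relation, ""]
    for gene in gene_ids:
        lines.append("@ATTRIBUTE '%s' NUMERIC" % gene)
    # The class attribute goes last, which is where Weka looks by default.
    lines.append("@ATTRIBUTE class {%s}" % ",".join(class_labels))
    lines += ["", "@DATA"]
    for values, label in samples:
        if len(values) != len(gene_ids):
            raise ValueError("every sample needs one value per gene")
        if label not in class_labels:
            raise ValueError("unknown class label: %r" % label)
        row = [repr(float(v)) for v in values] + [label]
        lines.append(",".join(row))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Toy values only, not real expression data.
    text = to_arff(
        "leukemia",
        ["g1", "g2"],
        [([1.5, 2], "ALL"), ([0.25, 3.5], "AML")],
        ["ALL", "AML"],
    )
    print(text)
```

Each GSM sample becomes one row under `@DATA`, so a whole series ends up as a single ARFF file with one instance per patient.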

11.3 years ago

You can also look at The Cancer Genome Atlas (TCGA) datasets at tcga-data.nci.nih.gov/tcga; they have an acute myeloid leukemia dataset, among others. I don't know about ALL samples, though.

11.3 years ago

Thanks a lot to all. I am still trying to figure this out. Anyway, it's fascinating. Regards

11.3 years ago
Neilfws 49k

There was a recent blog post on this topic: "High-Dimensional Microarray Data Sets in R for Machine Learning." The author has created an R package containing 20 or so datasets.


Thank you, I'll take a look ASAP.
