Question: Merging Datasets Together Using Excel
0
gravatar for Vanceed
5.1 years ago by
Vanceed30
United States
Vanceed30 wrote:

Hello all, I am currently a PhD student and I am looking at datasets from GEO and trying to merge them together and starting to analyze overlapping genes. Does anyone have input onto how to do this?

bioinformatics microarray • 3.6k views
ADD COMMENTlink modified 5.0 years ago by umer.zeeshan.ijaz1.7k • written 5.1 years ago by Vanceed30
12

enter image description here

ADD REPLYlink written 5.1 years ago by Pierre Lindenbaum119k
1

You will have to show us some lines from the files and tell us what exactly you are trying to accomplish.

ADD REPLYlink modified 5.1 years ago • written 5.1 years ago by Ashutosh Pandey11k

You need to provide far more detail about what, precisely, you want to do and the type of data involved.

ADD REPLYlink written 5.1 years ago by Neilfws48k
1
gravatar for Irsan
5.1 years ago by
Irsan6.8k
Amsterdam
Irsan6.8k wrote:

If you want to use excel for this, google for VLOOKUP function. If you are working in a unix-like environment, use join command. Have a look at this paper why you shouldn't use excel. Personally I think you can use excel for certain bio-informatics tasks but you have to be very careful with data types. Unfortunately the people that want to use excel are usually not so careful/aware of all the things that can go wrong

ADD COMMENTlink modified 5.1 years ago • written 5.1 years ago by Irsan6.8k

I think pasting gene names as "text" in excel should take care of most of the problems. Sometimes, excel is handy if you do it carefully.

ADD REPLYlink written 5.1 years ago by Ashutosh Pandey11k
2

But as Irsan says, vast majority of users do not do it carefully. The point-and-click mentality does not encourage thoughtful reflection.

ADD REPLYlink written 5.1 years ago by Neilfws48k
0
gravatar for Biojl
5.1 years ago by
Biojl1.6k
Barcelona
Biojl1.6k wrote:

I would export them as tsv or csv and run the command join from a linux/mac command line.

ADD COMMENTlink written 5.1 years ago by Biojl1.6k
0
gravatar for fatstrat389
5.1 years ago by
fatstrat38930
United States
fatstrat38930 wrote:

I think it would be prudent for you to pick up MySQL. SQL is (kinda) like a spreadsheet but it enforces data types and may help you avoid the aforementioned pitfalls of excel.

ADD COMMENTlink written 5.1 years ago by fatstrat38930
0
gravatar for alaincoletta
5.0 years ago by
alaincoletta110
Belgium
alaincoletta110 wrote:

InSilico DB has a "merging" R-Bioconductor package to combine public datasets from GEO and their clinical annotations. If you are not using R you can also combine data from the online platform (See this short step-by-step tutorial



Example:
# Retrieve 2 datasets
eset1 = getDataset(gse="GSE10072", gpl="GPL96", norm="ORIGINAL", genes=TRUE);
eset2 = getDataset(gse="GSE7670", gpl="GPL96", norm="ORIGINAL", genes=TRUE);

#combine them
esets = list(eset1, eset2);
eset = merge(esets, method="NONE");

#plot them
plotMDS(eset, targetAnnot="Disease", batchAnnot="Study");

InSilico DB packaged various batch removal effects methods so line 4 could be replaced with:

eset = merge(esets, method="XPN");
or
eset = merge(esets, method="COMBAT");

Hope this helps.

For more info Bioinformatics paper reference; InSilico DB and InSIlico Merging packages links, and blog link.

- Unlocking the potential of publicly available microarray data using inSilicoDb and inSilicoMerging R/Bioconductor packages -BMC Bioinfomatics [http://www.biomedcentral.com/1471-2105/13/335/abstract]

- inSilicoDb: an R/Bioconductor package for accessing human Affymetrix expert-curated datasets from GEO - Bioinformatics [http://bioinformatics.oxfordjournals.org/content/27/22/3204]

-Tutorial example : https://insilicodb.org/the-impact-of-batch-effects-when-merging-different-data-sets/

R-Bioconductor packages:
http://www.bioconductor.org/packages/2.12/bioc/html/inSilicoDb.html
and
http://www.bioconductor.org/packages/2.12/bioc/html/inSilicoMerging.html

ADD COMMENTlink written 5.0 years ago by alaincoletta110
0
gravatar for umer.zeeshan.ijaz
5.0 years ago by
Glasgow, UK
umer.zeeshan.ijaz1.7k wrote:

Save your Excel sheet from each sample as a comma-separated file in a given folder within a "Main_Folder" folder. The contents of the file should be [Feature],[value] and folder name be the sample name. All csv file should share a common string as a name (e.g. *_data.csv) for us to grep them.

Then use my http://userweb.eng.gla.ac.uk/umer.ijaz/bioinformatics/collateResults.pl in the "Main_Folder" as

perl collateResults.pl -f . -p _data.csv

and it should merge the data together

So say you have

BEFORE:
Main_Folder/Folder_1/*_data.csv:
species_A,2
species_B,4
species_C,5

Main_Folder/Folder_2/*_data.csv:
species_A,3
species_D,5

AFTER:
Samples,Folder_1,Folder_2
species_A,2,3
species_B,4,0
species_C,5,0
species_D,0,5

Alternatively If you can produce records of this form: [Sample_Name]\t[Feature]\t[Value] then use my GENERATEtable.sh script

$ cat test.tsv
contig1 F1 12.2
contig1 F2 34.2
contig1 F3 45.2
contig2 F2 56.3
contig2 F3 56.2
contig3 F1 45.4
contig3 F2 56.3
contig4 F1 23.5
contig5 F1 24.5
$ cat GENERATEtable.sh
#!/bin/bash
less <&0| \
perl -ane '$r{$F[0].":".$F[1]}=$F[2];
  unless($F[0]~~@s){
   push @s,$F[0];}
  unless($F[1]~~@m){
   push @m,$F[1];}
END{
print "Contigs\t".join("\t",@s)."\n";
for($i=0;$i<@m;$i++){
  print $m[$i];
  for($j=0;$j<@s;$j++){
   (not defined $r{$s[$j].":".$m[$i]})?print "\t".0:print"\t".$r{$s[$j].":".$m[$i]};}
  print "\n";}}' 
$ cat test.tsv | ./GENERATEtable.sh
Contigs contig1 contig2 contig3 contig4 contig5
F1 12.2 0 45.4 23.5 24.5
F2 34.2 56.3 56.3 0 0
F3 45.2 56.2 0 0 0

 

Best Wishes,

Umer

ADD COMMENTlink written 5.0 years ago by umer.zeeshan.ijaz1.7k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1114 users visited in the last hour