Generate heatmap or table showing genes found in multiple tests
6.9 years ago
Rubal ▴ 350

Hello All,

This is a duplicate of a question I posted on Stack Overflow, but I think this community may be better informed about tools that can solve it.

Question on Stack Overflow:

http://stackoverflow.com/questions/32312790/converting-lists-of-data-into-informative-table-or-heatmap

So the issue is:

I am looking for a good way to visualise overlaps of genes found in multiple tests. I would like to check for overlap between multiple files containing lists of genes and output a table that shows which files contain which genes in a specific way (example output further below).

I have multiple text files with lists of genes. One gene per line. Files range from approximately 30-100 rows. To be as clear as possible I will show 4 example files that I have shortened for space.

File1:

NRG3
FOXP3
SHH2
ROBO1
PPP3CA

File2:

NRG3
SHH2

File3:

NRG3
ROBO1

File4:

ROBO1

I would like a way to take these files and create an output table that takes all the genes that are in the files and prints them as the rows of the first column (sorted alphabetically with each gene appearing only once). Then the following columns each represent an input file. If a gene appears in a file there will be an 'x' (or some arbitrary marker) to represent this in the relevant column. This would provide an easy way to visualise which genes appear in multiple files. Like this:

Gene    File1   File2   File3   File4
FOXP3   X
NRG3    X       X       X
PPP3CA  X
ROBO1   X               X       X
SHH2    X       X

It would be even more useful if, instead of an 'X' marking that a gene appears in a file, this were shown as a heatmap-style colour gradient: when a gene is found in only one file the relevant cell is shaded light yellow, whereas if it appears in all files the cells are shaded dark red. Are there R packages that do this? It is not essential, though; I just think it would be cool and would improve the clarity of the visualisation.

I would appreciate any advice on how to go about this, especially if there are existing R packages I am unaware of that already do it. Let me know if I can explain the problem more clearly.

Thank you for your help

heatmap overlap venn comparative genomics • 1.9k views
6.9 years ago
Alternative ▴ 270

This is how I would do it in R:

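FILES <- c("FILE1.txt", "FILE2.txt", "FILE3.txt", "FILE4.txt")

# Read each gene list once (one gene per line)
gene.lists <- lapply(FILES, readLines)

# Sorted union of all genes, each gene appearing only once
genes.all <- sort(unique(unlist(gene.lists)))

# One logical column per file: TRUE if the gene is in that file
res.df <- as.data.frame(lapply(gene.lists, function(x) genes.all %in% x),
                        row.names = genes.all)
colnames(res.df) <- FILES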
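
To get from a presence table like this to the 'X' table and the heatmap asked for in the question, here is a minimal base R sketch. It assumes the gene lists are already in memory as a named list (gene.lists below is a hypothetical stand-in for the result of reading the files). Shading each cell by the number of files that contain the gene is only one possible reading of the heatmap request.

```r
# Hypothetical stand-in for reading the files: the four example gene lists
gene.lists <- list(
  File1 = c("NRG3", "FOXP3", "SHH2", "ROBO1", "PPP3CA"),
  File2 = c("NRG3", "SHH2"),
  File3 = c("NRG3", "ROBO1"),
  File4 = c("ROBO1")
)

# Sorted union of genes, then a genes x files presence matrix (TRUE/FALSE)
genes.all <- sort(unique(unlist(gene.lists)))
presence <- sapply(gene.lists, function(x) genes.all %in% x)
rownames(presence) <- genes.all

# Text table: "X" where the gene is present, blank otherwise
x.table <- ifelse(presence, "X", "")
print(noquote(x.table))

# Heatmap values: each present cell carries the number of files containing
# that gene, so widely shared genes get darker cells; absent cells stay 0
shade <- presence * rowSums(presence)

# Base graphics heatmap: white for absent, then light yellow to dark red
n.files <- length(gene.lists)
cols <- c("white", colorRampPalette(c("lightyellow", "darkred"))(n.files))
image(x = seq_len(ncol(shade)), y = seq_len(nrow(shade)), z = t(shade),
      col = cols, breaks = seq(-0.5, n.files + 0.5, by = 1),
      axes = FALSE, xlab = "", ylab = "")
axis(1, at = seq_len(ncol(shade)), labels = colnames(shade))
axis(2, at = seq_len(nrow(shade)), labels = rownames(shade), las = 1)
```

Note that image() draws the first row at the bottom, so the genes run alphabetically from bottom to top. Packages such as pheatmap or ggplot2 could draw a more polished version from the same shade matrix.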


Thanks this worked!

6.9 years ago

In R, you could use the merge_recurse function from the reshape package. Given three tab-separated input files:

A.txt:
id    file1
abc    1
def    1
ghi    1

B.txt:
id    file2
abc    1
def    1
jkl    1
mno    1

C.txt:
id    file3
abc    1
def    1
ghi    1
lll    1

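library(reshape)
file1 <- read.table("A.txt", sep="\t", header=TRUE)
file2 <- read.table("B.txt", sep="\t", header=TRUE)
file3 <- read.table("C.txt", sep="\t", header=TRUE)
my.list <- list(file1, file2, file3)
merge_recurse(my.list)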

   id file1 file2 file3
1 abc     1     1     1
2 def     1     1     1
3 ghi     1    NA     1
4 jkl    NA     1    NA
5 mno    NA     1    NA
6 lll    NA    NA     1


Just add a dummy column of 1s to each of your files and name that column after the file. Keep the same header (here, id) in every file for the column you want to merge on.
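
If installing reshape is not an option, much the same full outer join can be sketched in base R with Reduce and merge. This is a swapped-in alternative, not how merge_recurse is implemented. The data frames below recreate A.txt, B.txt and C.txt in memory, and the last step, which turns NA into 0, is an added assumption about the output you want.

```r
# In-memory copies of the three example tables (A.txt, B.txt, C.txt)
file1 <- data.frame(id = c("abc", "def", "ghi"), file1 = 1)
file2 <- data.frame(id = c("abc", "def", "jkl", "mno"), file2 = 1)
file3 <- data.frame(id = c("abc", "def", "ghi", "lll"), file3 = 1)

# Full outer join on the shared "id" column, applied pairwise left to right
merged <- Reduce(function(a, b) merge(a, b, by = "id", all = TRUE),
                 list(file1, file2, file3))

# Assumption: show absence as 0 rather than NA
merged[is.na(merged)] <- 0
print(merged)
```

Unlike the merge_recurse output above, merge() sorts the rows by id by default.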


Thanks I will try this, I'm assuming it works with only one column in each input file.


If you would like to keep 'x' or '1' as the marker, it is better to reformat your data as shown above. It should not be very difficult. For a single file:

awk 'BEGIN { print "id\tFile1" } { print $1 "\t1" }' File1

Or loop over all the files:

for file in File1 File2 File3 File4; do awk -v name="$file" 'BEGIN { print "id\t" name } { print $1 "\t1" }' "$file" > "new_${file}"; done
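
The same reformatting step can also be sketched in R rather than awk, which avoids writing intermediate files. The helper name add_marker_column is hypothetical, and dropping blank lines and duplicate genes is an added assumption that the awk version does not make.

```r
# Hypothetical helper: read a one-gene-per-line file and return a two-column
# data frame (id, <name>) with 1 marking every gene found in the file
add_marker_column <- function(path, name = basename(path)) {
  genes <- readLines(path)
  # Assumption: blank lines and repeated genes are not wanted
  genes <- unique(genes[nzchar(genes)])
  out <- data.frame(id = genes, marker = rep(1, length(genes)),
                    stringsAsFactors = FALSE)
  names(out)[2] <- name
  out
}

# Usage on a throwaway file standing in for File2
tmp <- tempfile()
writeLines(c("NRG3", "SHH2"), tmp)
file2 <- add_marker_column(tmp, name = "File2")
print(file2)
```

A list of such data frames, one per file, can then be passed straight to the merge step.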

Thanks very much this works too but option below was slightly easier for me to implement.