Question: Differential gene expression analysis of two different samples
1
3.9 years ago by
Mehmet600
Japan
Mehmet600 wrote:

Dear All:

I have two different samples. I did the assembly and alignment, and ran Cufflinks to produce the transcripts.gtf files.

How can I find differentially expressed genes between the two samples?

I mean;

Sample A Sample B

They are two different samples, which means I have two different transcriptome assemblies and two aligned .bam files.

Thank you.

rna-seq next-gen R • 1.4k views
ADD COMMENTlink modified 3.9 years ago by EVR570 • written 3.9 years ago by Mehmet600

Based on your question I assume you have no replicates? Just sample A and B?

ADD REPLYlink written 3.9 years ago by WouterDeCoster44k

Yes, two samples, two .bam files.

ADD REPLYlink written 3.9 years ago by Mehmet600
2

You might already know this, but there is no statistically sound way to reliably detect differentially expressed genes based on just two samples without any replicates. Essentially, an algorithm for this analysis cannot distinguish technical or biological noise from true differential expression. If you google 'differential expression analysis without replicates' you'll probably find some hits.

ADD REPLYlink written 3.9 years ago by WouterDeCoster44k
0
3.9 years ago by
EVR570
Earth
EVR570 wrote:

Hi, try GFOLD. It can give pretty decent results. DESeq also has an option for finding differentially expressed genes without replicates. Start with GFOLD first.

ADD COMMENTlink written 3.9 years ago by EVR570

Hi:

How can I download and use it? I could not access it.

ADD REPLYlink written 3.9 years ago by Mehmet600
1

http://compbio.tongji.edu.cn/~fengjx/GFOLD/

ADD REPLYlink written 3.9 years ago by EVR570

Hi: Thank you. I installed it and used it. For gfold diff, I used this command:

./gfold diff -s1 Sample1 -s2 Sample2 -suf .read_cnt -o Sample1VSSample2.diff

Sample1.1298.read_cnt

CUFF.1      NA  30  258 3.30355
CUFF.10     NA  42  436 2.73679
CUFF.100    NA  255 825 8.78143
CUFF.1000   NA  28  427 1.86298
CUFF.10000  NA  64  619 2.93744
CUFF.10001  NA  234 973 6.83254
CUFF.10002  NA  930 487 54.2542
CUFF.10003  NA  90  994 2.57238
CUFF.10004  NA  95  1036    2.60521
CUFF.10005  NA  81  1080    2.13079

Sample2.read_cnt

CUFF.1      NA  952 3234    6.63002
CUFF.10     NA  70  289 5.45529
CUFF.100    NA  22  458 1.08187
CUFF.1000   NA  24  344 1.57134
CUFF.10000  NA  21  328 1.44199
CUFF.10001  NA  34  327 2.34179
CUFF.10002  NA  26  240 2.43994
CUFF.10003  NA  22  257 1.928
CUFF.10004  NA  35  208 3.78985
CUFF.10005  NA  137 432 7.14257

But I received this error:

ERROR: The read count file Sample2.read_cnt is not in the right format. Please refer to the documentation.

I checked the documentation, but could not find any solution.

Do you have any idea about this error?

Thank you.

ADD REPLYlink modified 3.9 years ago • written 3.9 years ago by Mehmet600

Each file contains two columns corresponding to gene names and read counts separated by a TAB. All files are sorted by gene names and have the same number of lines.

http://compbio.tongji.edu.cn/~fengjx/GFOLD/gfold.html

ADD REPLYlink written 3.9 years ago by WouterDeCoster44k

Both read count files were produced by GFOLD, and they have the same format. Still, gfold diff says Sample2 is not in the right format.

ADD REPLYlink written 3.9 years ago by Mehmet600

Could you post some lines of sample2, so we can check whether it is in the right format?

ADD REPLYlink written 3.9 years ago by EVR570

Hi: this is sample2:

CUFF.1      NA  952 3234    6.63002
CUFF.10     NA  70  289 5.45529
CUFF.100    NA  22  458 1.08187
CUFF.1000   NA  24  344 1.57134
CUFF.10000  NA  21  328 1.44199
CUFF.10001  NA  34  327 2.34179
CUFF.10002  NA  26  240 2.43994
CUFF.10003  NA  22  257 1.928
CUFF.10004  NA  35  208 3.78985
CUFF.10005  NA  137 432 7.14257
CUFF.10006  NA  71  1034    1.54652
CUFF.10007  NA  28  556 1.13423
CUFF.10008  NA  28  418 1.50869
CUFF.10009  NA  30  456 1.48175
CUFF.1001   NA  361 1062    7.65597
CUFF.10010  NA  30  443 1.52523
CUFF.10011  NA  195 627 7.00462
CUFF.10012  NA  41  306 3.01773
CUFF.10013  NA  72  425 3.81559
CUFF.10014  NA  72  412 3.93598
CUFF.10015  NA  50  733 1.53633
CUFF.10016  NA  35  487 1.61866
CUFF.10017  NA  1747    775 50.7702
CUFF.10018  NA  30  378 1.7875
CUFF.10019  NA  26  696 0.84136
CUFF.1002   NA  357 827 9.72255
CUFF.10020  NA  42  638 1.48268
CUFF.10021  NA  52  539 2.17286
CUFF.10022  NA  60  676 1.99904
CUFF.10023  NA  116 1084    2.41016

This is sample1:

CUFF.1      NA  30  258 3.30355
CUFF.10     NA  42  436 2.73679
CUFF.100    NA  255 825 8.78143
CUFF.1000   NA  28  427 1.86298
CUFF.10000  NA  64  619 2.93744
CUFF.10001  NA  234 973 6.83254
CUFF.10002  NA  930 487 54.2542
CUFF.10003  NA  90  994 2.57238
CUFF.10004  NA  95  1036    2.60521
CUFF.10005  NA  81  1080    2.13079
CUFF.10006  NA  82  371 6.27941
CUFF.10007  NA  76  662 3.26163
CUFF.10008  NA  70  743 2.67663
CUFF.10009  NA  328 746 12.4915
CUFF.1001   NA  22  339 1.84375
CUFF.10010  NA  38  505 2.13782
CUFF.10011  NA  73  759 2.7325
CUFF.10012  NA  297 758 11.1318
CUFF.10013  NA  107 563 5.39951
CUFF.10014  NA  1070    4022    7.55824
CUFF.10015  NA  1137    4101    7.8768
CUFF.10016  NA  45  179 7.14231
CUFF.10017  NA  38  480 2.24917
CUFF.10018  NA  1703    2644    18.2992
CUFF.10019  NA  208 1168    5.05941
CUFF.1002   NA  248 440 16.0132
CUFF.10020  NA  27  274 2.79958
CUFF.10021  NA  66  1215    1.54329
CUFF.10022  NA  166 2834    1.66413
CUFF.10023  NA  88  2115    1.18209
CUFF.10024  NA  22  452 1.38281
ADD REPLYlink modified 3.9 years ago by WouterDeCoster44k • written 3.9 years ago by Mehmet600

I don't know how to edit this page. I copied and pasted. The original file has 5 columns, but when I pasted it here, it came out looking like this.

ADD REPLYlink written 3.9 years ago by Mehmet600
1

I edited your post and formatted the file parts using the 101010 button.

ADD REPLYlink written 3.9 years ago by WouterDeCoster44k

Each file contains two columns corresponding to gene names and read counts separated by a TAB. All files are sorted by gene names and have the same number of lines.

ADD REPLYlink written 3.9 years ago by WouterDeCoster44k

Sorry, what does this mean? Each file has 5 columns, as required by gfold diff. Each file has gene names and read counts. These are from 2 different samples, not replicates. I guess the two files do not have the same number of lines. Is this the reason?

ADD REPLYlink written 3.9 years ago by Mehmet600

I haven't installed the software myself, I'm just looking at the manual. What makes you think the file needs to have 5 columns?

Checking the number of lines can be done using wc -l yourfile.txt
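Beyond the line counts, it can help to see how many identifiers the two files actually share. A minimal Python sketch, assuming tab-separated files with the gene identifier in the first column; the function names `read_gene_ids` and `compare_gene_ids` are hypothetical helpers, not part of GFOLD:

```python
import csv


def read_gene_ids(path):
    """Return the first-column identifiers of a tab-separated file, in file order."""
    with open(path, newline="") as handle:
        # Skip empty lines; everything else contributes its first field.
        return [row[0] for row in csv.reader(handle, delimiter="\t") if row]


def compare_gene_ids(path_a, path_b):
    """Summarise line counts and identifier overlap between two count files."""
    ids_a = read_gene_ids(path_a)
    ids_b = read_gene_ids(path_b)
    set_a, set_b = set(ids_a), set(ids_b)
    return {
        "lines_a": len(ids_a),            # same idea as `wc -l` (ignoring blank lines)
        "lines_b": len(ids_b),
        "shared": len(set_a & set_b),     # identifiers present in both files
        "only_a": len(set_a - set_b),     # identifiers missing from file B
        "only_b": len(set_b - set_a),     # identifiers missing from file A
    }
```

Calling `compare_gene_ids("Sample1.read_cnt", "Sample2.read_cnt")` would then show whether the files differ only in length or also in which genes they list.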

ADD REPLYlink modified 3.9 years ago • written 3.9 years ago by WouterDeCoster44k

Hi: Number of lines of the two samples:

22331 Sample1.read_cnt
18948 Sample2.read_cnt

When I tried to use only two columns containing gene names and read counts, the program gave an error saying files don't have 5 columns.

ADD REPLYlink modified 3.9 years ago • written 3.9 years ago by Mehmet600

So now you know your files don't have the same number of lines...

You could remove all genes which are not present in both files, e.g. using something like the following:

cat <(cut -f1 Sample1.read_cnt) <(cut -f1 Sample2.read_cnt) | sort | uniq -d > identifiers_in_both.txt
grep -w -F -f identifiers_in_both.txt Sample1.read_cnt > Sample1.matching_identifiers.read_cnt
grep -w -F -f identifiers_in_both.txt Sample2.read_cnt > Sample2.matching_identifiers.read_cnt
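The same filtering can be sketched in Python, which avoids accidental matches in other columns because only the first column is compared. This is a rough sketch under assumptions: the files are tab-separated, each identifier occurs once per file, and a plain string sort is an acceptable gene-name order. `load_counts` and `write_shared` are hypothetical helper names.

```python
def load_counts(path):
    """Map each gene identifier (first tab-separated column) to its full line."""
    records = {}
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line:
                records[line.split("\t", 1)[0]] = line
    return records


def write_shared(path_a, path_b, out_a, out_b):
    """Write only the genes present in both inputs, sorted by identifier.

    Returns the number of shared genes, so both outputs have that many lines.
    """
    rec_a, rec_b = load_counts(path_a), load_counts(path_b)
    shared = sorted(rec_a.keys() & rec_b.keys())
    for records, out_path in ((rec_a, out_a), (rec_b, out_b)):
        with open(out_path, "w") as out:
            for gene_id in shared:
                out.write(records[gene_id] + "\n")
    return len(shared)
```

Both output files then contain the same identifiers in the same order and have the same number of lines.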
ADD REPLYlink modified 3.9 years ago • written 3.9 years ago by WouterDeCoster44k

When I ran this command: grep -f -w identifiers_in_both.txt Sample1.read_cnt > Sample1.read.matching identifiers_cnt

I got this error: grep: -w:There is no such file or directory

ADD REPLYlink written 3.9 years ago by Mehmet600
1

Wow, my bad, I made a mistake in the command and I've corrected it in the previous post.

ADD REPLYlink modified 3.9 years ago • written 3.9 years ago by WouterDeCoster44k

Thank you very much. You saved my life. Finally, I could run the program and finish my analysis. Thank you very much again for your help.

ADD REPLYlink written 3.9 years ago by Mehmet600

Hi:

I ran GFOLD. Now I need to visualize the results. How can I do that?

ADD REPLYlink written 3.9 years ago by Mehmet600
1

You shouldn't expect us to help you with every step of your project, and definitely not with an open question like this. Try to solve some issues on your own, you'll learn more!

ADD REPLYlink written 3.9 years ago by WouterDeCoster44k
1

I second WouterDeCoster's point. You need to learn some basics. It is too silly to ask how to open or visualize a file.

ADD REPLYlink written 3.9 years ago by EVR570

I know that I need to search and learn by myself. I did, and in a forum I read posts from GFOLD users asking which tool is suitable for visualizing GFOLD results. So I thought it was normal to ask people here which tool or program can be used to visualize the results. Otherwise, why would I ask? I am not trying to make things easy for myself by asking here. Please do not judge so quickly.

ADD REPLYlink written 3.9 years ago by Mehmet600
1

Just googling "visualise gene expression data" or similar will already give you an idea of what you can do.

ADD REPLYlink written 3.9 years ago by WouterDeCoster44k

I tried many times to visualize my results, but most R visualization packages are designed for the output of R packages (DESeq, edgeR, etc.).

ADD REPLYlink written 3.9 years ago by Mehmet600

Which visualization do you want to make?

ADD REPLYlink written 3.9 years ago by WouterDeCoster44k

Just open the result file with Excel. The output is in tab-delimited format, so Excel would be the simplest way to look at it.

ADD REPLYlink written 3.9 years ago by EVR570

Visualization could also mean a graphical representation such as a volcano plot, violin plot, MA plot, heatmap or PCA. Still, we shouldn't forget that this entire analysis is based on two samples, so it doesn't really make sense and isn't reliable at all.
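As a rough illustration of what an MA plot shows: for each gene, M is the log2 ratio of the two samples and A is the mean of their log2 values. The sketch below computes both from two raw counts. The pseudocount of 1 (to avoid log of zero) and the use of raw counts without library-size normalization are simplifying assumptions, and `ma_values` is a hypothetical helper, not something GFOLD provides.

```python
import math


def ma_values(count_a, count_b, pseudocount=1.0):
    """Return (M, A) for one gene from its counts in sample A and sample B.

    M = log2(B) - log2(A): log2 fold change of B relative to A.
    A = mean of the two log2 values: average log2 expression.
    """
    log_a = math.log2(count_a + pseudocount)
    log_b = math.log2(count_b + pseudocount)
    m = log_b - log_a
    a = 0.5 * (log_a + log_b)
    return m, a
```

Plotting A on the x-axis against M on the y-axis for every gene gives the MA plot; it needs no p-value, which is convenient here since none can be estimated realistically without replicates.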

ADD REPLYlink written 3.9 years ago by WouterDeCoster44k

I made a volcano plot, but it looked different from typical ones. I couldn't upload it here. I am wondering which values should be used to make a volcano or MA plot. People usually use the log fold change and p-value for these plots. GFOLD reports a GFOLD value, an FDR and a log2 fold change. The author of GFOLD suggests using the GFOLD value in place of the p-value. I would like to hear your suggestions.

ADD REPLYlink written 3.9 years ago by Mehmet600

You cannot calculate a "realistic" p-value or similar when just comparing two samples. That's utter nonsense.

ADD REPLYlink modified 3.9 years ago • written 3.9 years ago by WouterDeCoster44k