Question

How to use data from the SRA database?

5.4 years ago

dz2353 ▴ 120

Dear friends, the post title may not describe my problem accurately, but I do not know how else to phrase it. I want to download some data from the SRA database at NCBI to use as a control group for comparison with my treatment group, so that I can do downstream analyses such as differential gene expression analysis. However, I do not know how to choose suitable raw data. My first question is whether I can compare single-end data with paired-end data. If the raw data sets differ in size, can I compare them directly? Another question: after getting the gene expression matrix, do I need to use the TMM method to eliminate the batch effect?

RNA-Seq next-gen • 1.2k views

ADD COMMENT • link updated 5.4 years ago by WouterDeCoster 47k • written 5.4 years ago by dz2353 ▴ 120

You can only remove a batch effect between different experiments if at least one group is present in both experiments. If I read your design correctly, you want to download controls to compare with your treatment group. It sounds like you don't have any group that overlaps between the two experiments.
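
The overlap requirement can be checked before any analysis is run. Below is a minimal sketch in Python, assuming each sample is annotated with a (batch, condition) pair. The function name `shared_conditions` and the labels `own_lab` and `sra_project` are illustrative and not taken from any library.

```python
from collections import defaultdict

def shared_conditions(samples):
    """Return the set of conditions that occur in every batch.

    samples: iterable of (batch, condition) pairs, one per sample.
    An empty result means no group bridges the batches.
    """
    conditions_by_batch = defaultdict(set)
    for batch, condition in samples:
        conditions_by_batch[batch].add(condition)
    if not conditions_by_batch:
        return set()
    return set.intersection(*conditions_by_batch.values())

# Hypothetical design: own treated samples, downloaded controls.
confounded = [("own_lab", "treatment")] * 3 + [("sra_project", "control")] * 3

# Same design plus one control sequenced alongside the treated samples.
bridged = confounded + [("own_lab", "control")]

no_overlap = shared_conditions(confounded)   # empty set: nothing bridges the batches
overlap = shared_conditions(bridged)         # contains "control"
```

If the result is empty, as in the first design, any difference between the batches could equally be a difference between the conditions.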

ADD REPLY • link 5.4 years ago by Benn 8.3k

Maybe I need to make this clearer. I have three amniotic epithelial cell (AEC) samples, and I want to find the differentially expressed genes between AECs and hESCs. However, I do not have hESC data of my own, so I have to download some from the SRA database. I did find hESC samples from several projects, but the PCA result is not good: hESC samples from different projects do not cluster together. I would like to know how to resolve this issue. Thanks a lot!

ADD REPLY • link 5.4 years ago by dz2353 ▴ 120

score 3 · Answer 1 · 2018-12-05

5.4 years ago

WouterDeCoster 47k

I want to download some data from the SRA database at NCBI to use as a control group for comparison with my treatment group, so that I can do downstream analyses such as differential gene expression analysis.

You won't be able to distinguish genuine 'treatment vs control' differential expression from technical differences between the datasets. Differential expression analysis is only valid if you don't have technical confounders. Libraries should be made with the same kit, in the same lab, and for the same sequencer, ideally by the same person.

My first question is whether I can compare single-end data with paired-end data.

That's not ideal, but as described above it is not your only problem.

If the raw data sets differ in size, can I compare them directly?

For differential expression analysis you should use a method such as edgeR or DESeq2, which will take care of normalizing for library size.
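
To give a rough idea of what such normalization does, here is a simplified Python sketch of the median-of-ratios idea that DESeq2's size factors are based on. It is not DESeq2's implementation (and edgeR uses TMM by default); the function name `size_factors` and the toy counts are made up for illustration.

```python
import math
from statistics import median

def size_factors(counts):
    """Estimate one scaling factor per sample by median of ratios.

    counts: list of gene rows, each row a list of raw counts per sample.
    Genes with a zero count in any sample are skipped, because their
    geometric mean across samples is zero.
    """
    n_samples = len(counts[0])
    log_ratios = [[] for _ in range(n_samples)]
    for row in counts:
        if min(row) <= 0:
            continue
        # log of the gene's geometric mean across samples
        log_geo_mean = sum(math.log(c) for c in row) / n_samples
        for j, c in enumerate(row):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    # median ratio per sample, transformed back from log scale
    return [math.exp(median(r)) for r in log_ratios]

# Toy data: sample B was sequenced at twice the depth of sample A.
toy_counts = [
    [10, 20],
    [100, 200],
    [50, 100],
    [0, 5],      # skipped: zero count in sample A
]
factors = size_factors(toy_counts)

# Dividing each count by its sample's factor puts samples on a common scale.
normalized = [[c / f for c, f in zip(row, factors)] for row in toy_counts]
```

This kind of scaling corrects only for sequencing depth. It does nothing about differences in kit, lab, or read layout between datasets.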

Another question: after getting the gene expression matrix, do I need to use the TMM method to eliminate the batch effect?

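TMM is a normalization method for library size and composition, not a batch-correction method. In any case, you cannot eliminate the batch effect, since it is confounded with your treatment effect.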
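
One way to see the confounding is through the design matrix of a linear model with an intercept, a condition term, and a batch term. The Python sketch below uses a hypothetical design of three treated samples from one lab and three downloaded controls; the helper `matrix_rank` is written out here for illustration and is not a library function.

```python
from fractions import Fraction

def matrix_rank(rows):
    """Rank of a small matrix by Gauss-Jordan elimination (exact arithmetic)."""
    m = [[Fraction(x) for x in row] for row in rows]
    rank = 0
    for col in range(len(m[0])):
        # find a row at or below the current rank with a nonzero entry
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(len(m)):
            if r != rank and m[r][col] != 0:
                factor = m[r][col] / m[rank][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank

# Columns: intercept, condition (1 = treatment), batch (1 = downloaded data).
confounded_design = [[1, 1, 0]] * 3 + [[1, 0, 1]] * 3

# Same design plus one control processed together with the treated samples.
bridged_design = confounded_design + [[1, 0, 0]]

rank_confounded = matrix_rank(confounded_design)  # fewer than 3 columns' worth
rank_bridged = matrix_rank(bridged_design)        # full column rank
```

In the confounded design the condition and batch columns always sum to the intercept column, so the model cannot estimate separate condition and batch effects. Adding even one sample that bridges the batches removes that exact dependence, although a single sample would give very little power in practice.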

ADD COMMENT • link 5.4 years ago by WouterDeCoster 47k

Thanks for your detailed reply. I think I described my setup incorrectly. I have three amniotic epithelial cell (AEC) samples, and I want to find the differentially expressed genes between AECs and hESCs. So the relationship between AECs and hESCs is not the treatment/control one I described before. I downloaded some hESC data from the SRA database. Before doing the DEG analysis, I ran a PCA, but the result is not good: hESC samples from different projects do not cluster together. I suspect I have overlooked something, and that is what I want to figure out. Thanks again.

ADD REPLY • link 5.4 years ago by dz2353 ▴ 120

You are comparing condition A with condition B, but your comparison between those conditions is confounded by technical differences. AEC vs hESC is the same problem as treatment vs control.

ADD REPLY • link 5.4 years ago by WouterDeCoster 47k