Question

Understanding the results of a meta-analysis of differential expression across datasets

8 months ago

JACKY ▴ 140

I performed differential expression analysis with limma on about 20 datasets, each one separately. I then used a meta-analysis approach and visualized the results in a volcano plot. Some of the genes in the volcano plot make a lot of sense, but there are some entries that I can't identify.

EDIT - finally after some digging, I found that these are genes from the mitochondrial DNA.

What is "J01415"? There are several versions of it in my results. What are these entries, and should I remove them before running the analysis? Here is the plot:

[Image: volcano plot of the meta-analysis results]

r differential-expression genes • 937 views

updated 8 months ago by i.sudbery 19k • written 8 months ago by JACKY ▴ 140

Just out of interest, how did you perform the meta-analysis?

8 months ago by i.sudbery 19k

Also, where did these data come from?

8 months ago by i.sudbery 19k

I collected processed data from several cancer cohorts treated with ICI (immune checkpoint inhibitors). For each dataset I performed differential expression analysis and kept only the genes with significant p-values.

Then I wrote a function that combines p-values using the chi-squared statistic. For each distinct gene across the datasets, the function gathers the individual p-values and then computes a combined p-value for that gene.

In parallel, the average log fold change across all datasets is computed for every unique gene.
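
For readers following along, here is a minimal sketch of what such a function might look like. It assumes the chi-squared combination is Fisher's method (X = -2 Σ ln p, compared against a chi-squared distribution with 2k degrees of freedom for k p-values). The thread does not confirm this, and the function names and toy values are hypothetical. It is written in plain Python rather than R to stay dependency-free.

```python
import math
from collections import defaultdict


def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-squared variable; exact for even df."""
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


def fisher_combine(pvalues):
    """Fisher's method (assumed): -2 * sum(ln p) ~ chi-squared with 2k df."""
    # Clamp to avoid log(0); a p-value of exactly 0 is usually an artifact.
    stat = -2.0 * sum(math.log(max(p, 1e-300)) for p in pvalues)
    return chi2_sf_even_df(stat, 2 * len(pvalues))


def meta_analyse(datasets):
    """datasets: list of {gene: (p_value, log_fc)} dicts, one per cohort."""
    per_gene = defaultdict(list)
    for ds in datasets:
        for gene, (p, lfc) in ds.items():
            per_gene[gene].append((p, lfc))
    results = {}
    for gene, rows in per_gene.items():
        results[gene] = {
            "n_datasets": len(rows),
            "combined_p": fisher_combine([p for p, _ in rows]),
            "mean_logfc": sum(lfc for _, lfc in rows) / len(rows),
        }
    return results


# Hypothetical toy input: two cohorts, one gene present in only one of them.
cohort_a = {"GAPDH": (0.01, 1.2), "J01415.16": (0.001, 2.0)}
cohort_b = {"GAPDH": (0.04, 0.8)}
summary = meta_analyse([cohort_a, cohort_b])
```

Under this assumption, a gene that appears in only one dataset gets a combined p-value equal to its single input p-value, so it helps to keep `n_datasets` alongside each result.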

8 months ago by JACKY ▴ 140

score 0 · Answer 1 · 2023-08-24

8 months ago

i.sudbery 19k

Things like J01415, AC079781.4 and AL121578.7 are GenBank IDs. Sometimes they are referred to as "clone IDs". GenBank is the database set up at the start of the sequencing era (i.e. the 1980s) to store the results of sequencing. J01415 is the accession code for the mitochondrial clone that was used to construct the hg38 human genome sequence. My guess would be that J01415.16 is either the sixteenth gene on that sequence or the 16th version of that sequence.

The content of those clones could be anything. AC079781.4 turns out to be a BAC clone containing part of the sequence of chromosome 7, while AL121578 is sequence from the X chromosome assembled from cosmids.

In all these cases the accessions turn out to be antiquated terms for chromosomal sequence in the human genome.

It looks to me like at least one of the sets of samples has been quantified against an odd reference annotation. If that's the case then, depending on how you did the meta-analysis, you might have rows that have just a single p-value for one set of samples and a bunch of missing data or 0s. That is a very non-random p-value distribution, and such rows would therefore come out as significant in a meta-analysis based on something like Pearson's method.
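
One way to check for this, sketched under assumptions: count how many datasets actually contribute a p-value for each gene, and flag genes supported by very few. The input layout (one dict of gene to p-value per dataset), the function names, and the `min_datasets` threshold are all hypothetical, and the toy values are made up.

```python
from collections import Counter


def dataset_coverage(datasets):
    """Count, for each gene, how many datasets report a p-value for it."""
    counts = Counter()
    for ds in datasets:
        # Treat None as missing data; each gene counts once per dataset.
        counts.update(gene for gene, p in ds.items() if p is not None)
    return counts


def low_coverage_genes(datasets, min_datasets=2):
    """Genes supported by fewer than `min_datasets` datasets."""
    counts = dataset_coverage(datasets)
    return sorted(g for g, n in counts.items() if n < min_datasets)


# Hypothetical toy input: three datasets of gene -> p-value.
datasets = [
    {"GAPDH": 0.01, "J01415.16": 0.001},
    {"GAPDH": 0.04, "AC079781.4": None},
    {"GAPDH": 0.20},
]
flagged = low_coverage_genes(datasets, min_datasets=2)
```

Genes in `flagged` are candidates for rows driven by a single dataset rather than by agreement across cohorts.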

8 months ago by i.sudbery 19k

So the best course of action might be to just get rid of those "genes" before doing anything with the data? I ask because I've never dealt with this kind of data before.

8 months ago by JACKY ▴ 140

Sometimes genes with clone ID names are annotations derived from cDNA clones. Generally, if things have made it this far without a proper gene name, they are things you might not be interested in following up in the first instance.

However, I'm rather confused by how you ended up with clones representing whole chromosomes in an expression dataset.

8 months ago by i.sudbery 19k

I don't know either; I never expected to have this stuff in my data. The only data I gathered is RNA-seq data, either from the GEO database or from the article itself. This might just mean that the extraction itself was poorly done in those cohorts: since the data contain mitochondrial DNA, I suspect there are dead cells in there.

8 months ago by JACKY ▴ 140

Are you sure all the datasets are RNA-seq datasets?

RNA-seq datasets will always have some mitochondrial DNA in them, but I've never heard of counting the reads mapping to the mitochondrial chromosome. Treating mitochondrial RNA as a sign of bad quality is a metric normally associated with single-cell RNA-seq, not bulk.

Plus, I'm not entirely sure those things are mitochondrial genomic DNA, as there appear to be both J01415.16 and J01415.20 in the data. J01415 is the mitochondrial chromosome, but you can't have both version 16 and version 20 in the same analysis. That suggests to me that these refer to either something like "the 16th fosmid in the mitochondrial assembly" or the 16th gene on the mitochondrial chromosome. If this were restricted to the mitochondrial genome, I'd probably say just ignore it, but you also have examples of this sort of thing for at least chromosome X and chromosome 7 as well.

This suggests a couple of things: 1) Probably not all your datasets use the same names for genes. You need to check that, because meta-analysis will only work if the same names are used for genes in each dataset. If one dataset calls it GAPDH, one GAPD and one ENSG00000111640, then the meta-analysis isn't going to work. Or 2) some of these datasets aren't RNA-seq. With names like that, they could be very old tiling array data. Or perhaps they are SAGE data (a precursor to RNA-seq).
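
A quick way to check the first point is to tally which naming scheme each dataset uses. The sketch below is only a heuristic: the two regular expressions are assumptions about what Ensembl gene IDs and GenBank-style accessions typically look like, the function names are hypothetical, and anything that matches neither pattern is lumped together as a symbol or other.

```python
import re
from collections import Counter

# Assumed patterns, for illustration only.
ENSEMBL_GENE = re.compile(r"^ENSG\d{11}(\.\d+)?$")
ACCESSION_LIKE = re.compile(r"^[A-Z]{1,2}\d{5,6}(\.\d+)?$")


def classify_gene_id(name):
    """Roughly classify a gene identifier by its naming scheme."""
    if ENSEMBL_GENE.match(name):
        return "ensembl"
    if ACCESSION_LIKE.match(name):
        return "accession_like"
    return "symbol_or_other"


def naming_summary(datasets):
    """datasets: {label: iterable of gene names} -> {label: Counter of schemes}."""
    return {
        label: Counter(classify_gene_id(g) for g in genes)
        for label, genes in datasets.items()
    }


# Hypothetical toy input using the identifiers mentioned in this thread.
example = {
    "cohort_1": ["GAPDH", "GAPD"],
    "cohort_2": ["ENSG00000111640"],
    "cohort_3": ["J01415.16", "AC079781.4", "AL121578.7", "GAPDH"],
}
schemes = naming_summary(example)
```

If the datasets do not share a dominant scheme, the identifiers need to be mapped to a common one before any p-values are combined.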

8 months ago by i.sudbery 19k