Question: comparative gene quantification, problem with COGs and hypothetical proteins.
0
gravatar for Hansen_869
7 weeks ago by
Hansen_86920
Hansen_86920 wrote:

I have 8 metagenomic samples (bacterial DNA) that I have generated count-matrixes for (I used Prokka for annotation). This means, that I now have the abundance of all genes, in all of my 8 samples. I have then normalized my samples with TPM (Transcript per Million reads), as my sample sizes are of different size, so i can better compare the samples. I want to group my genes into COGs (Cluster of Orthologous Groups). My goal is to look at the relative gene abundance of the different samples, and be able to compare specific COGs to other COGs across different samples. But now I approached a problem.

When I group my genes into COGs, the samples that have the most functionally annotated proteins (as opposed to "Hypothetical proteins"), will naturally have a higher abundance of total reads mapped to the COGs. Say a sample "A" have 60% Hypothetical proteins and sample "B" have 40% Hypothetical proteins, then sample "B" will have more functionally annotated proteins (60%), and thus more proteins (and consequently more reads) will group into each COG. Thus, when i compare 2 similar COGs across samples, in most cases COGs from sample "B" will have more reads mapped to them.

How do I solve this problem? If i re-calculate my TPM only for the COGs (disregarding Hypothetical proteins), would that give me a false picture of the relative gene abundance?

Thanks!!

abundance cog tpm gene • 102 views
ADD COMMENTlink modified 6 weeks ago • written 7 weeks ago by Hansen_86920
0
gravatar for Asaf
7 weeks ago by
Asaf7.3k
Israel
Asaf7.3k wrote:

Someone else faced this problem a while ago: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-015-0610-8 (MUSiCC)

Basically their solution is to normalize each sample using a list of single-copy proteins, this way you normalize to the estimated number of genomes in your sample.

ADD COMMENTlink written 7 weeks ago by Asaf7.3k

Cool thanks! After examing the reference and if I understand it correctly: I provide an abundance file of the genes and their mapped reads to Musicc, and then I can confidently assign them to COGs? Also, do I have to TPM normalize afterwards?

ADD REPLYlink written 7 weeks ago by Hansen_86920

I actually never used MUSiCC, I just borrowed their list of single-copy proteins and used it for normalization in DESeq2.

# A universal list of KOs from musicc https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4391136/ 
universals <- c("K00133","K00789","K00927","K00939","K01689","K01803","K01866","K01867","K01868","K01869","K01870","K01872","K01873","K01874","K01875","K01876","K01881","K01883","K01887","K01889","K01890","K01892","K01937","K02357","K02519","K02528","K02600","K02601","K02835","K02838","K02863","K02864","K02867","K02871","K02874","K02876","K02878","K02879","K02881","K02884","K02886","K02887","K02890","K02892","K02895","K02904","K02906","K02926","K02931","K02933","K02946","K02948","K02950","K02952","K02956","K02961","K02965","K02967","K02982","K02986","K02988","K02992","K02994","K02996","K03040","K03076","K03106","K03110","K03438","K03470","K03664","K03687","K03702","K06942","K09903","K10773")
dds <- estimateSizeFactors(dds, locfunc=shorth, controlGenes=rownames(counts) %in% universals)

No other TPM normalization needed. I know it's KO rather than COG, I'm not sure they can be mapped 1:1.

ADD REPLYlink modified 7 weeks ago • written 7 weeks ago by Asaf7.3k

But do I have to associate my genes with KO? I'm not familar with KO. You wrote a list of KO-IDs, are those the same as the single-copy proteins?

ADD REPLYlink modified 6 weeks ago • written 6 weeks ago by Hansen_86920

The list of KOs are of single-copy proteins. I believe it's the same one used in other tools like checkm, introduced by Peer Bork's group. This paper: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0022099 lists their COGs

ADD REPLYlink written 6 weeks ago by Asaf7.3k

I see. I just don't get, how a single copy marker gene is identified, if Musicc is just given a KO/COG ID and an abundance number.

ADD REPLYlink written 6 weeks ago by Hansen_86920

The input is an abundance table of COGs, it uses the SCG for normalization.

ADD REPLYlink written 6 weeks ago by Asaf7.3k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1958 users visited in the last hour