Question: Mapping to contigs or predicted genes, for quantitative gene analysis (DNAseq)
gravatar for allerdrengen55
23 days ago by
allerdrengen5520 wrote:


A little background: I have a metagenomic sample (DNA-seq). I want to figure out the gene depth/abundance of all the genes in the sample. Since there a lot of different (and same) bacteria in the sample, I expect some of the genes to have a high abundance. Keep in mind that it is a DNA-seq, so it's not an expression analysis, but rather what genes (and their depth) the sample inhabit. Lastly, I want to group the genes in COGs while still preserving the quantitative information.

I started with an assembled contigs-file and 2 FASTQ-files (Paired-end). I then predicted genes with Prokka (bacteria only) from the contigs-file.

Now, I'm not sure whether I should map my reads to the predicted genes, and then just count how many reads mapped to each gene, or if I should map my reads to my contigs, and then use FeatureCounts, to figure out how the predicted genes relate to the alignment. I guess the ladder would give me more freedom (for instance, counting fragments instead of reads, as I'm dealing with paired-end). However, I'm not too confident on the topic.

What are your suggestions?

map contigs alignment fastq dnaseq • 129 views
ADD COMMENTlink modified 23 days ago by Asaf6.5k • written 23 days ago by allerdrengen5520

I re-opened this one and deleted the previous thread. Please be sure to keep things now focused in this thread to avoid information being spread across multiple identical / similar threads.

ADD REPLYlink written 23 days ago by ATpoint26k

That is noted, thanks!

ADD REPLYlink written 23 days ago by allerdrengen5520
gravatar for Asaf
23 days ago by
Asaf6.5k wrote:

The simple solution - use both methods and compare them.

I think that the straightforward way to go is map to the contigs and use featureCounts. The mapping should be used in other downstream analyses as well such as binning or to test average contig coverage etc. Once you have the count matrix you can look for differentially abundant genes. You might want to run eggNOG mapper on your genes, prokka might not get the best homology group mapping. Once you have that you can combine it with the count matrix and collapse it to the homology group level, pretty simple with R or python.

ADD COMMENTlink written 23 days ago by Asaf6.5k

Hi Asaf, thanks for your response! A couple of questions:

Is count matrix referring to the contig alignment?

How would you suggest I normalize my genes? (Longer genes, will have more mapped reads than shorter genes)

ADD REPLYlink modified 23 days ago • written 23 days ago by allerdrengen5520

Count matrix is the matrix generated by featureCount i.e. the number of reads mapped to each gene. You shouldn't normalize the genes and use software like DESeq2 to compare samples. If you only have one sample and want to see which genes have less coverage then you should take the median coverage for each gene

ADD REPLYlink written 23 days ago by Asaf6.5k

Ah yes, that's what it is!

I have 8 big samples to compare. The goal is to look at the difference in gene/cog abundance between the samples. How do you suggest i approach that, in regards to normalization (and comparison, any tools)?

(Writing this comment from a friends user, due to comment limit. I will not write further comments before the limit is up. Just wanted to ask one more question before I'm off)

ADD REPLYlink modified 23 days ago • written 23 days ago by Hansen_86910

You won't find a strict answer. There are several tools, the most straightforward would be DESeq2 and edgeR but you have to make sure that their assumptions are met, meaning that most of the genes (COGs, KOs etc.) have the same level in all the samples.

ADD REPLYlink written 23 days ago by Asaf6.5k

So there is no universal way to normalize my genes? My samples could potentially differ pretty much in terms of abundance of genes (levels?), so those tools won't be able to help?

ADD REPLYlink written 23 days ago by allerdrengen5520

The main issue is normalization. Usually you could find a set of universal genes that will allow you to do proper normalization. Take a look at Musicc for instance:

ADD REPLYlink written 23 days ago by Asaf6.5k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1883 users visited in the last hour