Question: Mapping to contigs or predicted genes, for quantitative gene analysis (DNAseq)
gravatar for allerdrengen55
4 months ago by
allerdrengen5520 wrote:


A little background: I have a metagenomic sample (DNA-seq). I want to figure out the gene depth/abundance of all the genes in the sample. Since there a lot of different (and same) bacteria in the sample, I expect some of the genes to have a high abundance. Keep in mind that it is a DNA-seq, so it's not an expression analysis, but rather what genes (and their depth) the sample inhabit. Lastly, I want to group the genes in COGs while still preserving the quantitative information.

I started with an assembled contigs-file and 2 FASTQ-files (Paired-end). I then predicted genes with Prokka (bacteria only) from the contigs-file.

Now, I'm not sure whether I should map my reads to the predicted genes, and then just count how many reads mapped to each gene, or if I should map my reads to my contigs, and then use FeatureCounts, to figure out how the predicted genes relate to the alignment. I guess the ladder would give me more freedom (for instance, counting fragments instead of reads, as I'm dealing with paired-end). However, I'm not too confident on the topic.

What are your suggestions?

map contigs alignment fastq dnaseq • 230 views
ADD COMMENTlink modified 4 months ago by Asaf7.3k • written 4 months ago by allerdrengen5520

I re-opened this one and deleted the previous thread. Please be sure to keep things now focused in this thread to avoid information being spread across multiple identical / similar threads.

ADD REPLYlink written 4 months ago by ATpoint32k

That is noted, thanks!

ADD REPLYlink written 4 months ago by allerdrengen5520
gravatar for Asaf
4 months ago by
Asaf7.3k wrote:

The simple solution - use both methods and compare them.

I think that the straightforward way to go is map to the contigs and use featureCounts. The mapping should be used in other downstream analyses as well such as binning or to test average contig coverage etc. Once you have the count matrix you can look for differentially abundant genes. You might want to run eggNOG mapper on your genes, prokka might not get the best homology group mapping. Once you have that you can combine it with the count matrix and collapse it to the homology group level, pretty simple with R or python.

ADD COMMENTlink written 4 months ago by Asaf7.3k

Hi Asaf, thanks for your response! A couple of questions:

Is count matrix referring to the contig alignment?

How would you suggest I normalize my genes? (Longer genes, will have more mapped reads than shorter genes)

ADD REPLYlink modified 4 months ago • written 4 months ago by allerdrengen5520

Count matrix is the matrix generated by featureCount i.e. the number of reads mapped to each gene. You shouldn't normalize the genes and use software like DESeq2 to compare samples. If you only have one sample and want to see which genes have less coverage then you should take the median coverage for each gene

ADD REPLYlink written 4 months ago by Asaf7.3k

Ah yes, that's what it is!

I have 8 big samples to compare. The goal is to look at the difference in gene/cog abundance between the samples. How do you suggest i approach that, in regards to normalization (and comparison, any tools)?

(Writing this comment from a friends user, due to comment limit. I will not write further comments before the limit is up. Just wanted to ask one more question before I'm off)

ADD REPLYlink modified 4 months ago • written 4 months ago by Hansen_86920

You won't find a strict answer. There are several tools, the most straightforward would be DESeq2 and edgeR but you have to make sure that their assumptions are met, meaning that most of the genes (COGs, KOs etc.) have the same level in all the samples.

ADD REPLYlink written 4 months ago by Asaf7.3k

So there is no universal way to normalize my genes? My samples could potentially differ pretty much in terms of abundance of genes (levels?), so those tools won't be able to help?

ADD REPLYlink written 4 months ago by allerdrengen5520

The main issue is normalization. Usually you could find a set of universal genes that will allow you to do proper normalization. Take a look at Musicc for instance:

ADD REPLYlink written 4 months ago by Asaf7.3k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1741 users visited in the last hour