Question

Mapping to contigs or predicted genes, for quantitative gene analysis (DNAseq)

1

4.4 years ago

allerdrengen55 ▴ 60

Hi!

A little background: I have a metagenomic sample (DNA-seq). I want to figure out the gene depth/abundance of all the genes in the sample. Since there are a lot of different (and identical) bacteria in the sample, I expect some of the genes to have a high abundance. Keep in mind that this is DNA-seq, so it's not an expression analysis, but rather a question of which genes the sample contains (and at what depth). Lastly, I want to group the genes into COGs while still preserving the quantitative information.

I started with an assembled contigs file and 2 FASTQ files (paired-end). I then predicted genes with Prokka (bacteria only) from the contigs file.

Now, I'm not sure whether I should map my reads to the predicted genes and then just count how many reads mapped to each gene, or map my reads to my contigs and then use featureCounts to work out how the predicted genes relate to the alignment. I guess the latter would give me more freedom (for instance, counting fragments instead of reads, as I'm dealing with paired-end data). However, I'm not too confident on the topic.

What are your suggestions?

map DNAseq alignment fastq contigs • 1.6k views

ADD COMMENT • link updated 4.4 years ago by Asaf 10k • written 4.4 years ago by allerdrengen55 ▴ 60

0

I re-opened this one and deleted the previous thread. Please keep the discussion focused in this thread from now on, to avoid information being spread across multiple identical or similar threads.

ADD REPLY • link 4.4 years ago by ATpoint 81k

1

That is noted, thanks!

ADD REPLY • link 4.4 years ago by allerdrengen55 ▴ 60

score 1 · Answer 1 · 2019-11-12

1

4.4 years ago

Asaf 10k

The simple solution - use both methods and compare them.

I think the straightforward way to go is to map to the contigs and use featureCounts. The same mapping can be reused in other downstream analyses, such as binning or checking average contig coverage. Once you have the count matrix you can look for differentially abundant genes. You might also want to run eggNOG-mapper on your genes, since Prokka might not give the best homology group assignments. Once you have those, you can combine them with the count matrix and collapse it to the homology group level, which is pretty simple in R or Python.
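A minimal sketch of that last collapsing step, assuming the per-gene counts and the gene-to-group annotation have already been loaded into plain dictionaries. The gene IDs, sample names, and the `collapse_to_groups` helper are hypothetical, and the sketch assumes each gene has at most one group; genes with several assignments would need a rule of your own.

```python
from collections import defaultdict


def collapse_to_groups(gene_counts, gene_to_group, unassigned="unassigned"):
    """Sum per-gene counts into per-group counts, sample by sample.

    gene_counts:   {gene_id: {sample: count}}
    gene_to_group: {gene_id: group_id}; genes missing here go to `unassigned`
    """
    group_counts = defaultdict(lambda: defaultdict(int))
    for gene, per_sample in gene_counts.items():
        group = gene_to_group.get(gene, unassigned)
        for sample, count in per_sample.items():
            group_counts[group][sample] += count
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {group: dict(samples) for group, samples in group_counts.items()}


# Hypothetical toy data standing in for a featureCounts table and an
# annotation table.
gene_counts = {
    "gene_1": {"S1": 10, "S2": 4},
    "gene_2": {"S1": 5, "S2": 6},
    "gene_3": {"S1": 7, "S2": 0},
}
gene_to_group = {"gene_1": "COG0001", "gene_2": "COG0001"}

cog_counts = collapse_to_groups(gene_counts, gene_to_group)
```

Keeping unannotated genes in their own bucket, rather than dropping them, preserves the per-sample totals if they are needed later for normalization.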

ADD COMMENT • link 4.4 years ago by Asaf 10k

0

Hi Asaf, thanks for your response! A couple of questions:

Does "count matrix" refer to the contig alignment?

How would you suggest I normalize my genes? (Longer genes will have more mapped reads than shorter genes.)

ADD REPLY • link 4.4 years ago by allerdrengen55 ▴ 60

0

The count matrix is the matrix generated by featureCounts, i.e. the number of reads mapped to each gene. You shouldn't normalize the genes yourself; instead, use software like DESeq2 to compare samples. If you only have one sample and want to see which genes have less coverage, then you should take the median coverage for each gene.
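For the single-sample case, the median coverage per gene could be computed roughly as follows. This is only a sketch: it assumes per-base depths are already available as one list per contig (for example parsed from a per-position depth table), and that gene coordinates are 1-based and inclusive as in GFF. The data and the `median_gene_coverage` helper are hypothetical.

```python
from statistics import median


def median_gene_coverage(depth_by_contig, genes):
    """Return {gene_id: median per-base depth over the gene}.

    depth_by_contig: {contig: [depth at position 1, depth at position 2, ...]}
    genes:           {gene_id: (contig, start, end)}, 1-based inclusive
    """
    coverage = {}
    for gene_id, (contig, start, end) in genes.items():
        # Convert 1-based inclusive coordinates to a Python slice.
        depths = depth_by_contig[contig][start - 1:end]
        coverage[gene_id] = median(depths)
    return coverage


# Hypothetical toy data: one short contig with two genes.
depth_by_contig = {"contig_1": [3, 3, 4, 10, 12, 11, 2, 2]}
genes = {
    "gene_A": ("contig_1", 1, 3),
    "gene_B": ("contig_1", 4, 6),
}

coverage = median_gene_coverage(depth_by_contig, genes)
```

Unlike a raw read count, the median depth does not grow with gene length, so it can be compared between genes in the same sample.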

ADD REPLY • link 4.4 years ago by Asaf 10k

0

Ah yes, that's what it is!

I have 8 big samples to compare. The goal is to look at the difference in gene/COG abundance between the samples. How do you suggest I approach that with regard to normalization (and comparison, any tools)?

(Writing this comment from a friend's account, due to the comment limit. I will not write further comments before the limit is up. I just wanted to ask one more question before I'm off.)

ADD REPLY • link 4.4 years ago by Hansen_869 ▴ 80

0

You won't find a strict answer. There are several tools; the most straightforward would be DESeq2 and edgeR, but you have to make sure that their assumptions are met, meaning that most of the genes (COGs, KOs, etc.) are at the same level in all the samples.
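To see why that assumption matters, here is a rough sketch of median-of-ratios size factors, the idea behind DESeq2's default normalization. It is an illustration written in plain Python, not DESeq2's code, and the feature names and counts are made up. Each sample's factor is the median, over features, of its count divided by the feature's geometric mean across samples, so the factors only make sense if the "typical" feature is unchanged between samples.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors.

    counts: {feature_id: [count in sample 0, count in sample 1, ...]}
    Features with a zero in any sample are skipped, because their
    geometric mean is zero and the ratio is undefined.
    """
    n_samples = len(next(iter(counts.values())))
    log_ratios = [[] for _ in range(n_samples)]
    for row in counts.values():
        if any(c <= 0 for c in row):
            continue
        log_geo_mean = sum(math.log(c) for c in row) / n_samples
        for j, c in enumerate(row):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    # Raises statistics.StatisticsError if every feature contained a zero.
    return [math.exp(median(r)) for r in log_ratios]


# Hypothetical toy data: sample 1 was sequenced twice as deeply as sample 0.
counts = {
    "COG_a": [10, 20],
    "COG_b": [30, 60],
    "COG_c": [5, 10],
    "COG_d": [8, 0],  # skipped: contains a zero
}

factors = size_factors(counts)
normalized = {f: [c / s for c, s in zip(row, factors)] for f, row in counts.items()}
```

In this toy case every usable feature differs by the same factor of two, so dividing by the size factors makes the two samples agree. If most features genuinely differed between samples, the median ratio would absorb part of that biological difference.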

ADD REPLY • link 4.4 years ago by Asaf 10k

0

So there is no universal way to normalize my genes? My samples could potentially differ quite a lot in terms of gene abundance (levels?), so those tools won't be able to help?

ADD REPLY • link 4.4 years ago by allerdrengen55 ▴ 60

0

The main issue is normalization. Usually you can find a set of universal genes that will allow you to do proper normalization. Take a look at MUSiCC, for instance: https://www.ncbi.nlm.nih.gov/pubmed/25885687
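The general idea of normalizing against universal genes can be sketched like this. It is a simplified illustration in the spirit of that approach, not a reimplementation of MUSiCC. It assumes you already have length-corrected abundances per group for one sample and a list of groups believed to be universal and single-copy. The IDs, values, and `normalize_by_universal` helper are hypothetical.

```python
from statistics import median


def normalize_by_universal(abundance, universal_ids):
    """Divide every abundance by the median abundance of the universal set.

    abundance:     {group_id: abundance in one sample}
    universal_ids: groups assumed to occur once per genome
    The result can be read as an approximate average copy number per genome.
    """
    reference = median(abundance[g] for g in universal_ids if g in abundance)
    if reference <= 0:
        raise ValueError("universal genes have zero median abundance")
    return {g: value / reference for g, value in abundance.items()}


# Hypothetical toy data for a single sample.
abundance = {
    "COG_u1": 20.0,
    "COG_u2": 22.0,
    "COG_u3": 18.0,
    "COG_x": 40.0,
}
universal_ids = ["COG_u1", "COG_u2", "COG_u3"]

copies_per_genome = normalize_by_universal(abundance, universal_ids)
```

Because each sample is scaled by its own reference set, this does not rely on most genes being at the same level across samples.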

ADD REPLY • link 4.4 years ago by Asaf 10k