Identifying gene outliers in a genome using composition analysis.
3 months ago

Hello. I have a FASTA file containing a mix of plant (Arabidopsis) and bacterial (E. coli) sequences, and I want to separate the bacterial sequences from the plant sequences using sequence composition analysis. I calculated the GC content and the average RSCU (relative synonymous codon usage) of each sequence, along with the mean GC content and mean RSCU across the whole set. I then used Tukey's fences to identify outliers with respect to GC content and, separately, with respect to RSCU. I call a sequence an outlier relative to the plant sequences only if it is both a GC outlier and an RSCU outlier. However, although the bacterial sequences are flagged as GC outliers, they are not flagged as RSCU outliers, and I cannot see why. Can someone explain? Are there other metrics, or approaches based on other metrics, that I could adopt? I am specifically looking for outlier detection based on sequence composition parameters, not phylogenetic approaches.
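One possible explanation, assuming "average RSCU" means a plain mean of the per-codon RSCU values of a sequence: RSCU is normalized within each synonymous codon family, so the values in a family always sum to the family size and their mean is 1 whatever the codon bias. A per-sequence average then carries almost no signal, whereas the full per-codon RSCU vector does. The sketch below illustrates this with two made-up toy sequences; the function name `rscu` and the sequences are hypothetical, and it assumes in-frame coding sequences and the standard genetic code.

```python
from collections import Counter
from itertools import product
from math import sqrt

# Standard genetic code, codons enumerated in TCAG order (assumption: the
# sequences are in-frame CDS translated with the standard table).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Group sense codons into synonymous families, one family per amino acid.
FAMILIES = {}
for codon, aa in CODON_TABLE.items():
    if aa != "*":
        FAMILIES.setdefault(aa, []).append(codon)


def rscu(cds):
    """Return {codon: RSCU} for one in-frame coding sequence."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    counts = Counter(cds[i:i + 3] for i in range(0, usable, 3))
    values = {}
    for aa, codons in FAMILIES.items():
        total = sum(counts[c] for c in codons)
        # Skip absent amino acids (RSCU undefined) and single-codon
        # families (Met, Trp), whose RSCU is always 1.
        if total == 0 or len(codons) == 1:
            continue
        for c in codons:
            # observed count / expected count under uniform synonymous usage
            values[c] = counts[c] * len(codons) / total
    return values


# Two toy sequences with opposite codon preferences for Ala and Lys.
seq_a = "GCT" * 3 + "GCA" + "AAA" * 3 + "AAG"
seq_b = "GCG" * 3 + "GCC" + "AAG" * 4

rscu_a, rscu_b = rscu(seq_a), rscu(seq_b)
mean_a = sum(rscu_a.values()) / len(rscu_a)
mean_b = sum(rscu_b.values()) / len(rscu_b)

# Euclidean distance between the per-codon RSCU vectors.
codons = sorted(set(rscu_a) | set(rscu_b))
dist = sqrt(sum((rscu_a.get(c, 0.0) - rscu_b.get(c, 0.0)) ** 2 for c in codons))

print(f"mean RSCU: {mean_a:.3f} vs {mean_b:.3f}; vector distance: {dist:.3f}")
```

If the average was instead taken only over codons that actually occur in the sequence, it would not be exactly 1, but it would still be compressed into a narrow range. Comparing the per-codon vectors (for example, distance to the mean vector, followed by Tukey's fences on the distances) is one way to keep the signal.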

Thank you.

RSCU composition GC outliers • 2.1k views
3 months ago
GenoMax 154k

"I want to separate the bacterial sequences from the plant sequences using sequence composition analyses."

Use CheckM; see the thread "Calculate tetranucleotide frequency deviation on python".

Other solutions are noted in the thread "How to assemble viral genomes when my data contains host DNA as well".
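As a rough illustration of the tetranucleotide idea, here is a minimal standard-library sketch. It is not CheckM's implementation, and CheckM's own calculation may differ. It assumes each sequence is summarized by its 256 tetranucleotide frequencies, scored by Euclidean distance to the mean profile, and then passed through Tukey's fences. The function names and the randomly generated toy sequences are hypothetical.

```python
import random
from collections import Counter
from itertools import product
from math import sqrt
from statistics import quantiles

KMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def tetra_freqs(seq):
    """Tetranucleotide frequencies over a sliding window (one strand only)."""
    seq = seq.upper()
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    total = sum(counts[k] for k in KMERS)  # windows containing N etc. are ignored
    if total == 0:
        return [0.0] * len(KMERS)
    return [counts[k] / total for k in KMERS]


def distance_to_mean(profiles):
    """Euclidean distance of each profile to the mean profile of the set."""
    n = len(profiles)
    mean = [sum(col) / n for col in zip(*profiles)]
    return [sqrt(sum((x - m) ** 2 for x, m in zip(p, mean))) for p in profiles]


def tukey_outliers(values, k=1.5):
    """Indices of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values) if v < low or v > high]


# Toy data: nine AT-rich sequences and one GC-rich sequence (index 9).
random.seed(0)
at_rich = ["".join(random.choices("ACGT", weights=[35, 15, 15, 35], k=5000))
           for _ in range(9)]
gc_rich = "".join(random.choices("ACGT", weights=[15, 35, 35, 15], k=5000))
sequences = at_rich + [gc_rich]

distances = distance_to_mean([tetra_freqs(s) for s in sequences])
outliers = tukey_outliers(distances)
print("outlier indices:", outliers)
```

On real data the mean profile is pulled toward whichever group dominates the file, so this only works when the contaminating sequences are a minority. Short sequences also give noisy profiles.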


Thank you. I will check them out.
