Question

Identifying gene outliers in a genome using composition analysis.

3 months ago

Shakunthala Natarajan • 0

Hello. I have a FASTA file containing plant (Arabidopsis) and bacterial (E. coli) sequences, and I want to separate the bacterial sequences from the plant sequences using sequence composition analyses. I calculated the GC content and the average RSCU (relative synonymous codon usage) of each sequence in the file, along with the overall mean GC content and mean RSCU across all sequences. I then used Tukey's fences to identify outliers with respect to GC content and, separately, outliers with respect to RSCU. I treat a sequence as an outlier relative to the plant sequences only if it is both a GC outlier and an RSCU outlier. However, although the bacterial sequences turn out to be GC outliers, they do not show up as RSCU outliers, and I cannot see why. Can someone explain? Are there other metrics, or approaches based on different metrics, that I could adapt? I am specifically looking for outlier detection based on sequence composition parameters, not phylogenetic approaches.

Thank you.
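
For readers following along, the procedure described in the question can be sketched roughly as below. This is only an illustration, not the poster's actual script: the function names (`gc_content`, `tukey_outliers`, `joint_outliers`) are hypothetical, the fence multiplier k = 1.5 is the conventional default and is assumed here, and the per-sequence RSCU summary is assumed to be computed elsewhere and passed in as a list of numbers.

```python
from statistics import quantiles


def gc_content(seq):
    """Fraction of G and C among the unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def tukey_outliers(values, k=1.5):
    """Indices of values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's fences)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return {i for i, v in enumerate(values) if v < low or v > high}


def joint_outliers(gc_values, rscu_values, k=1.5):
    """Indices flagged by BOTH metrics, i.e. the AND rule from the question."""
    return sorted(tukey_outliers(gc_values, k) & tukey_outliers(rscu_values, k))
```

Because the two outlier sets are intersected, a sequence that is flagged on GC content alone is never reported. One point that may be relevant to the behaviour described: by the definition of RSCU, the values within one amino acid's synonymous codon family average to 1 whenever that amino acid occurs, so a plain mean over all codons tends to sit close to 1 for any sequence and may carry little signal.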

RSCU composition GC outliers • 2.1k views

score 1 · Answer 1 · 2025-07-09

3 months ago

GenoMax 154k

"I want to separate the bacterial sequences from the plant sequences using sequence composition analyses."

Use CheckM (see: Calculate tetranucleotide frequency deviation on python).

Other solutions are also noted in: How to assemble viral genomes when my data contains host DNA as well.
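
To make the tetranucleotide idea mentioned above concrete, here is a minimal sketch of one way to compute it. It is not CheckM's implementation and not the code from the linked post. It assumes a simple definition: each sequence gets a 256-element tetranucleotide frequency profile, and its "deviation" is the Euclidean distance from the mean profile of the whole set. The names `tetra_freqs` and `deviation_from_mean` are hypothetical, and reverse complements are not merged here.

```python
from collections import Counter
from itertools import product
from math import sqrt

# All 256 tetranucleotides over A, C, G, T, in a fixed order.
KMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def tetra_freqs(seq):
    """Relative frequency of each tetranucleotide, using overlapping windows.

    Windows containing anything other than A, C, G, T are ignored.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    total = sum(counts[k] for k in KMERS)
    if total == 0:
        return [0.0] * len(KMERS)
    return [counts[k] / total for k in KMERS]


def deviation_from_mean(profiles):
    """Euclidean distance of each profile from the mean profile of the set."""
    n = len(profiles)
    mean = [sum(col) / n for col in zip(*profiles)]
    return [sqrt(sum((a - b) ** 2 for a, b in zip(p, mean))) for p in profiles]
```

The resulting per-sequence deviations could then be passed to the same kind of outlier test as the GC values. Short sequences give noisy profiles, so a minimum length cutoff is probably worth considering.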