Hello. I have a FASTA file containing a mix of plant (Arabidopsis) and bacterial (E. coli) sequences, and I want to separate the bacterial sequences from the plant sequences using sequence composition analysis.

For each sequence in the file, I calculated the GC content and the average RSCU (relative synonymous codon usage), and then the overall mean GC and mean RSCU across the entire set of sequences. I then used Tukey's fences to identify outliers with respect to GC content and, separately, outliers with respect to RSCU. I treat a sequence as an outlier relative to the plant sequences only if it is both a GC outlier and an RSCU outlier.

However, although the bacterial sequences are flagged as GC outliers, they are not flagged as RSCU outliers, and I cannot see why. Can someone explain? Are there other metrics, or approaches based on different metrics, that I could use? I am specifically looking for outlier detection based on sequence composition parameters, not phylogenetic approaches.
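For concreteness, the filtering step described above can be sketched roughly as follows. This is only a minimal sketch under assumptions, not the exact script: the function names (gc_content, tukey_outliers, joint_outliers) are hypothetical, the fence multiplier k = 1.5 is the conventional choice rather than something stated above, and the per-sequence average RSCU values are assumed to be computed elsewhere and passed in as a dictionary keyed by sequence ID.

```python
from statistics import quantiles


def gc_content(seq):
    """Fraction of G and C among the unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0  # no unambiguous bases, e.g. a sequence of only N
    return (seq.count("G") + seq.count("C")) / acgt


def tukey_outliers(value_by_id, k=1.5):
    """Return the IDs whose value lies outside [Q1 - k*IQR, Q3 + k*IQR].

    k = 1.5 is an assumption (the usual Tukey default).
    """
    q1, _, q3 = quantiles(value_by_id.values(), n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return {sid for sid, v in value_by_id.items() if v < lower or v > upper}


def joint_outliers(gc_by_id, rscu_by_id, k=1.5):
    """IDs flagged by BOTH metrics, matching the AND rule described above."""
    return tukey_outliers(gc_by_id, k) & tukey_outliers(rscu_by_id, k)
```

With this AND rule, a sequence that is flagged on GC content but not on average RSCU is dropped from the final set, which is the behaviour reported above for the bacterial sequences.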
Thank you.
Thank you. I will check them out.