Extracting related sequences from a FASTA file
2.9 years ago
ATCG ▴ 350

How can I

1. Compare long genomic sequences (e.g. 1-15 kb) and group them into families
2. Look for a specific k-mer within these sequences
3. Find the most frequently shared k-mers

Thank you!

You can use CD-HIT (cd-hit-est for nucleotide sequences) to cluster related sequences based on sequence identity. Identify the clusters and the sequences that belong to each one, then run motif-finding tools on each cluster in turn.
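To iterate over clusters you first need the cluster membership. Here is a minimal sketch that reads a CD-HIT `.clstr` file, assuming the usual layout of a `>Cluster N` header followed by member lines such as `0	1500nt, >seqA... *`. The function name `parse_clstr` and the example IDs are made up for illustration, and CD-HIT may truncate long sequence names in this file depending on its settings.

```python
import re
from collections import defaultdict

def parse_clstr(lines):
    """Map cluster number -> list of sequence IDs from CD-HIT .clstr lines.

    Assumes the common .clstr layout; this is not an official parser.
    """
    clusters = defaultdict(list)
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            # Header line, e.g. ">Cluster 12"
            current = line.split()[1]
        else:
            # Member line: the ID sits between ">" and "..."
            match = re.search(r">(.+?)\.\.\.", line)
            if match and current is not None:
                clusters[current].append(match.group(1))
    return dict(clusters)

# Hypothetical example content
example = [
    ">Cluster 0",
    "0\t1500nt, >seqA... *",
    "1\t1480nt, >seqB... at +/98.50%",
    ">Cluster 1",
    "0\t900nt, >seqC... *",
]
families = parse_clstr(example)
```

With a real file you would pass `open("clusters.clstr")` (path is a placeholder) instead of the list, then pull the sequences for each family out of the FASTA and hand them to your motif finder.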

You might consider using Mash distances and defining a sequence-similarity cutoff to group the sequences.

Mash distances inherently use k-mer distributions, I believe, so that approach would go a long way toward addressing all of these points at once.
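To make the k-mer idea concrete, here is a rough pure-Python sketch covering the three points in the question. It computes an exact Jaccard distance on k-mer sets, which is only a simplified stand-in for what Mash estimates from MinHash sketches; it is not Mash's algorithm and it ignores reverse complements. The function names and the toy sequences are hypothetical, and for 1-15 kb sequences you would want a larger k than the one used here.

```python
from collections import Counter

def kmers(seq, k):
    """Return the set of distinct k-mers in a sequence (forward strand only)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def jaccard_distance(a, b, k):
    """1 - |shared k-mers| / |all k-mers| for two sequences."""
    ka, kb = kmers(a, k), kmers(b, k)
    union = ka | kb
    if not union:
        return 0.0
    return 1 - len(ka & kb) / len(union)

def shared_kmer_counts(seqs, k):
    """Count, for each k-mer, how many sequences contain it at least once."""
    counts = Counter()
    for seq in seqs.values():
        counts.update(kmers(seq, k))
    return counts

def sequences_with_kmer(seqs, kmer):
    """Names of the sequences that contain a specific k-mer."""
    kmer = kmer.upper()
    return [name for name, seq in seqs.items() if kmer in seq.upper()]

# Toy data, for illustration only
seqs = {"a": "ACGTAC", "b": "ACGTTT", "c": "GGGGGG"}
dist_ab = jaccard_distance(seqs["a"], seqs["b"], 3)
dist_ac = jaccard_distance(seqs["a"], seqs["c"], 3)
counts = shared_kmer_counts(seqs, 3)
hits = sequences_with_kmer(seqs, "cgt")
```

Pairs whose distance falls below your chosen cutoff go into the same family, and `counts.most_common()` lists the k-mers shared by the most sequences.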