Extracting related sequences from a FASTA file
2.9 years ago
ATCG ▴ 350

How can I:

1. Compare long genomic sequences (e.g. 1-15 kb) and group them into families
2. Look for a specific k-mer within these sequences
3. Find the most frequently shared k-mers

Thank you!

Sequence comparison Data mining kmer • 684 views

You can use CD-HIT (cd-hit-est for nucleotide sequences) to cluster related sequences based on sequence identity. Identify the clusters, extract the sequences that belong to each cluster, and then run motif-finding tools on each cluster in turn.
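To make the "identify the sequences for each cluster" step concrete, here is a minimal sketch that reads a CD-HIT `.clstr` report and collects the sequence IDs per cluster. It assumes the usual layout (a `>Cluster N` header followed by member lines such as `0	1500nt, >seqA... *`); the function name `parse_clstr` and the example records are made up for illustration, so check the layout against your own output before relying on it.

```python
import re

# Member lines look like: "1<TAB>1480nt, >seqB... at +/97.50%"
# The ID is the text between ">" and the trailing "...".
MEMBER_RE = re.compile(r">(.+?)\.\.\.")


def parse_clstr(lines):
    """Return {cluster_name: [sequence_id, ...]} from .clstr lines."""
    clusters = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            current = line[1:]          # e.g. "Cluster 0"
            clusters[current] = []
        elif current is not None:
            match = MEMBER_RE.search(line)
            if match:
                clusters[current].append(match.group(1))
    return clusters


# Hypothetical .clstr content, only to show the expected shape.
example = """>Cluster 0
0\t1500nt, >seqA... *
1\t1480nt, >seqB... at +/97.50%
>Cluster 1
0\t900nt, >seqC... *
"""

clusters = parse_clstr(example.splitlines())
for name, members in clusters.items():
    print(name, members)
```

With the ID lists in hand, you can pull each cluster's records out of the original FASTA and pass them to the motif finder one cluster at a time. Note that CD-HIT truncates long headers in the `.clstr` file, so the IDs may need to be matched by prefix.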


You might consider using Mash distances and defining a distance cutoff to group similar sequences.

Mash estimates its distances from the fraction of k-mers two sequences share (via MinHash sketches of their k-mer sets), so that approach would go a long way toward addressing all of these points at once.
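As a rough illustration of how k-mers can cover all three points, the sketch below computes a Mash-style distance from exact k-mer sets, groups sequences whose distance falls under a cutoff, looks up a specific k-mer, and counts the k-mers shared by the most sequences. This is not Mash itself: it uses exact sets rather than MinHash sketches, ignores reverse complements, and compares all pairs, so it only suits small inputs. The function names, the toy sequences, `k = 4`, and the `0.1` cutoff are all assumptions chosen for the example.

```python
import math
from collections import Counter
from itertools import combinations


def kmers(seq, k):
    """Set of distinct k-mers in one sequence (forward strand only)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def mash_like_distance(a, b, k):
    """Mash distance formula applied to the exact Jaccard index j."""
    ka, kb = kmers(a, k), kmers(b, k)
    union = ka | kb
    if not union:
        return 1.0
    j = len(ka & kb) / len(union)
    if j == 0:
        return 1.0
    return -math.log(2 * j / (1 + j)) / k


def group_by_cutoff(seqs, k, cutoff):
    """Single-linkage grouping: link any pair with distance <= cutoff."""
    parent = {name: name for name in seqs}

    def find(name):
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for x, y in combinations(seqs, 2):
        if mash_like_distance(seqs[x], seqs[y], k) <= cutoff:
            parent[find(x)] = find(y)
    groups = {}
    for name in seqs:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(g) for g in groups.values())


def sequences_with_kmer(seqs, kmer):
    """Names of sequences that contain the given k-mer."""
    return sorted(n for n, s in seqs.items() if kmer.upper() in s.upper())


def most_shared_kmers(seqs, k, top=5):
    """K-mers ranked by how many sequences contain them."""
    counts = Counter()
    for seq in seqs.values():
        counts.update(kmers(seq, k))    # each sequence counts once
    return counts.most_common(top)


# Toy sequences, only for illustration.
seqs = {"s1": "ACGTACGTGG", "s2": "ACGTACGTGA", "s3": "TTTTTTCCCC"}
print(group_by_cutoff(seqs, k=4, cutoff=0.1))
print(sequences_with_kmer(seqs, "CGTG"))
print(most_shared_kmers(seqs, k=4, top=3))
```

For real 1-15 kb sequences you would use a longer k (Mash defaults to 21) and most likely run Mash itself rather than exact set comparison, then apply the same cutoff-and-group step to its reported distances.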