Question: Extracting related sequences from a FASTA file
4 months ago, DataFanatic wrote:

How can I

  1. Compare long genomic sequences (e.g. 1-15 kb) and group them into families
  2. Look for a specific k-mer within these sequences
  3. Find the most frequently shared k-mers

Thank you!

Modified 12 weeks ago by Biostar • written 4 months ago by DataFanatic

You can use CD-HIT to cluster related sequences (based on sequence identity). Identify the clusters, extract the sequences belonging to each cluster, and then run motif-finding tools on each cluster in turn.
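As a rough illustration of the "identify the sequences for each cluster" step, here is a minimal Python sketch that groups sequence names by cluster from a CD-HIT `.clstr` file. It assumes the usual `.clstr` layout (a `>Cluster N` header followed by member lines that contain `>name...`). The function name `parse_clstr` and the example records are made up for illustration and are not part of CD-HIT.

```python
import re
from collections import defaultdict


def parse_clstr(lines):
    """Return {cluster_id: [sequence names]} from CD-HIT .clstr lines.

    Assumes each cluster starts with a '>Cluster N' header and each
    member line contains the sequence name as '>name...'.
    """
    clusters = defaultdict(list)
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            # Header line, e.g. '>Cluster 0'
            current = int(line.split()[1])
        else:
            # Member line, e.g. '0  2799nt, >seqA... *'
            match = re.search(r">(.+?)\.\.\.", line)
            if match and current is not None:
                clusters[current].append(match.group(1))
    return dict(clusters)


# Hypothetical .clstr content, for illustration only.
example = """>Cluster 0
0\t2799nt, >seqA... *
1\t2750nt, >seqB... at +/98.50%
>Cluster 1
0\t1500nt, >seqC... *
""".splitlines()

clusters = parse_clstr(example)
for cluster_id, names in clusters.items():
    print(cluster_id, names)
```

With real data you would pass an open file handle instead of `example`, then write the sequences of each cluster to their own FASTA file before running a motif finder on them.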

Written 12 weeks ago by cpad0112

You might consider computing Mash distances between the sequences and defining a similarity cutoff to group them.

Mash estimates distances from shared k-mers (via MinHash sketches), I believe, so that approach would go a long way towards addressing all of these points at once.
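To make the k-mer idea concrete, here is a small pure-Python sketch, not Mash itself. It computes the exact Jaccard index between k-mer sets rather than a MinHash estimate, converts it with the distance formula from the Mash paper, D = -(1/k) ln(2j / (1 + j)), and also covers the other two points from the question: finding which sequences contain a given k-mer, and counting how many sequences share each k-mer. All function names and the toy sequences are hypothetical, and reverse complements are ignored for simplicity.

```python
import math
from collections import Counter


def kmers(seq, k):
    """Set of all k-mers in seq (forward strand only)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard index of two k-mer sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mash_like_distance(j, k):
    """Mash-style distance from a Jaccard index j and k-mer size k."""
    if j == 0:
        return 1.0  # no shared k-mers: treat as maximally distant
    return max(0.0, -math.log(2 * j / (1 + j)) / k)


def shared_kmer_counts(seqs, k):
    """Count, for each k-mer, how many sequences contain it."""
    counts = Counter()
    for seq in seqs.values():
        counts.update(kmers(seq, k))  # a set, so each sequence counts once
    return counts


# Toy sequences, for illustration only.
seqs = {"s1": "ACGTACGT", "s2": "ACGTTCGT", "s3": "GGGGCCCC"}
k = 4

j = jaccard(kmers(seqs["s1"], k), kmers(seqs["s2"], k))
print("Jaccard s1 vs s2:", j)
print("Distance s1 vs s2:", mash_like_distance(j, k))

# Which sequences contain a specific k-mer?
hits = [name for name, seq in seqs.items() if "TTCG" in seq.upper()]
print("Sequences containing TTCG:", hits)

# Most frequently shared k-mers across all sequences
counts = shared_kmer_counts(seqs, k)
print(counts.most_common(3))
```

For sequences of 1-15 kb, exact k-mer sets like this are still cheap to compute; sketching, as Mash does, matters mainly for much larger inputs. A distance cutoff for defining families would have to be chosen empirically for your data.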

Written 11 weeks ago by jrj.healey