Question: Extracting related sequences from a FASTA file
13 months ago by
DataFanatic150 wrote:

How can I

  1. Compare long genomic sequences (e.g. 1-15 kb) and group them into families
  2. Look for a specific k-mer within these sequences
  3. Find the most frequently shared k-mers

Thank you!

modified 11 months ago by Biostar ♦♦ 20 • written 13 months ago by DataFanatic150

You can use CD-HIT to cluster related sequences (based on sequence identity). Identify the clusters, extract the sequences belonging to each cluster, and then run motif-finding tools on each cluster in turn.

written 11 months ago by cpad011212k
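
The "identify the sequences for each cluster" step suggested above could be sketched roughly as follows. CD-HIT writes a `.clstr` file next to its output; the layout assumed here (a `>Cluster N` header followed by member lines that contain `>name...`) and the sequence names `seqA`, `seqB`, `seqC` are illustrative assumptions, so check them against your own output before relying on this.

```python
import re
from collections import defaultdict

# Assumed member-line layout: "0\t1500nt, >seqA... *"
# The sequence name is taken as the text between '>' and '...'.
MEMBER_RE = re.compile(r">(.+?)\.\.\.")


def parse_clstr(lines):
    """Group sequence names by cluster from the lines of a .clstr file."""
    clusters = defaultdict(list)
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            current = line[1:]  # e.g. "Cluster 0"
            continue
        match = MEMBER_RE.search(line)
        if match and current is not None:
            clusters[current].append(match.group(1))
    return dict(clusters)


# Hypothetical example input; real files come from your own CD-HIT run.
example = """>Cluster 0
0\t1500nt, >seqA... *
1\t1480nt, >seqB... at +/97.50%
>Cluster 1
0\t900nt, >seqC... *
""".splitlines()

families = parse_clstr(example)
for name, members in families.items():
    print(name, members)
```

Each list of names can then be used to pull the matching records out of the original FASTA file and pass them to a motif finder, one cluster at a time.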

You might consider using Mash distances and defining a sequence-similarity cutoff.

Mash distances are computed from the k-mer content of the sequences, I believe (via MinHash sketches of each sequence's k-mer set), so you'd go a long way towards addressing all three points at once with that approach.

written 11 months ago by Joe15k
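
To make the k-mer idea above concrete, here is a minimal pure-Python sketch that touches all three points of the question. It is not how Mash itself is implemented: it uses exact k-mer sets rather than MinHash sketches, looks at the forward strand only, and the toy sequences and `k = 4` are assumptions chosen for readability (for kb-scale sequences you would want a much larger k). The distance formula is the one Mash derives from the Jaccard index.

```python
import math
from collections import Counter


def kmers(seq, k):
    """Return the set of k-mers in seq (uppercased, forward strand only)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard index of two k-mer sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mash_distance(j, k):
    """Mash-style distance from a Jaccard index j: -1/k * ln(2j / (1 + j))."""
    if j == 0:
        return 1.0
    return -math.log(2 * j / (1 + j)) / k


def shared_kmer_counts(seqs, k):
    """For each k-mer, count how many sequences contain it at least once."""
    counts = Counter()
    for seq in seqs.values():
        counts.update(kmers(seq, k))
    return counts


# Hypothetical toy data; replace with sequences parsed from your FASTA file.
seqs = {"s1": "ACGTACGT", "s2": "ACGTTTGT", "s3": "GGGACGTA"}
k = 4

# Point 2: is a specific k-mer present in a sequence?
has_gtac = "GTAC" in kmers(seqs["s1"], k)

# Point 3: k-mers shared by the most sequences.
counts = shared_kmer_counts(seqs, k)
top = counts.most_common(1)

# Point 1: pairwise distance, to be compared against a chosen cutoff.
j12 = jaccard(kmers(seqs["s1"], k), kmers(seqs["s2"], k))
d12 = mash_distance(j12, k)
```

Pairs whose distance falls below the chosen cutoff can be treated as members of the same family, for example by single-linkage grouping.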