Question: Extracting related sequences from a FASTA file
16 months ago by
DataFanatic150 wrote:

How can I:

  1. Compare long genomic sequences (e.g. 1-15 kb) and group them into families
  2. Look for a specific k-mer within these sequences
  3. Find the most frequently shared k-mers

Thank you!

modified 14 months ago by Biostar ♦♦ 20 • written 16 months ago by DataFanatic150

You can use CD-HIT to cluster related sequences based on sequence identity. Identify the clusters, collect the sequences belonging to each cluster, and then run motif-finding tools on each cluster in turn.
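As a rough sketch of the "collect the sequences belonging to each cluster" step: for nucleotide input the tool is `cd-hit-est`, and a run such as `cd-hit-est -i seqs.fasta -o clusters -c 0.9` should write a `clusters.clstr` file listing cluster membership. The file names and the 0.9 identity threshold are placeholders, and the example text below is made up to mimic the `.clstr` layout as I understand it, so check it against your own output.

```python
import re
from collections import defaultdict

# Member lines look like "1<TAB>11800nt, >seqB... at +/97.50%";
# the sequence name sits between '>' and '...'.
_MEMBER = re.compile(r">(\S+?)\.\.\.")

def parse_clstr(lines):
    """Map cluster id -> list of sequence names from CD-HIT .clstr text."""
    clusters = defaultdict(list)
    current = None
    for line in lines:
        line = line.strip()
        if line.startswith(">Cluster"):
            # Header line, e.g. ">Cluster 0"
            current = int(line.split()[1])
        elif line and current is not None:
            match = _MEMBER.search(line)
            if match:
                clusters[current].append(match.group(1))
    return dict(clusters)

# Hypothetical .clstr content, for illustration only.
example = """\
>Cluster 0
0\t12000nt, >seqA... *
1\t11800nt, >seqB... at +/97.50%
>Cluster 1
0\t3000nt, >seqC... *
""".splitlines()

families = parse_clstr(example)
for cluster_id, members in families.items():
    print(cluster_id, members)
```

With a real run you would pass `open("clusters.clstr")` instead of `example`, then write each family's sequences to its own FASTA file before handing it to a motif finder.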

written 14 months ago by cpad011212k

You might consider using Mash distances and defining a similarity cutoff to group the sequences.

Mash distances are estimated from the k-mers that sequences share, I believe, so you'd go a long way toward addressing all these points at once with that approach.
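To make the k-mer idea concrete, here is a small pure-Python sketch. It is not Mash itself: Mash estimates the Jaccard index from MinHash sketches, whereas this computes it exactly from full k-mer sets, which is only practical for small inputs. The sequences, `k = 4`, and all function names are made up for illustration, and only the forward strand is considered.

```python
import math
from collections import Counter

def kmers(seq, k):
    """Set of all k-mers in seq (forward strand only)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def jaccard(a, b):
    """Exact Jaccard index of two k-mer sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def mash_distance(j, k):
    """Mash-style distance from a Jaccard index j and k-mer size k."""
    if j == 0:
        return 1.0
    return -math.log(2 * j / (1 + j)) / k

def shared_kmer_counts(seqs, k):
    """For each k-mer, count how many sequences contain it."""
    counts = Counter()
    for seq in seqs.values():
        counts.update(kmers(seq, k))
    return counts

# Toy sequences, far shorter than real 1-15 kb inputs.
seqs = {"s1": "ACGTACGTGG", "s2": "ACGTACGTCC", "s3": "TTTTACGTAA"}
k = 4

# Point 1: pairwise distance, to be compared against a chosen cutoff.
j12 = jaccard(kmers(seqs["s1"], k), kmers(seqs["s2"], k))
d12 = mash_distance(j12, k)

# Point 2: which sequences contain a specific k-mer.
hits = [name for name, s in seqs.items() if "TACG" in s.upper()]

# Point 3: k-mers present in the largest number of sequences.
counts = shared_kmer_counts(seqs, k)
top = counts.most_common(3)
print(j12, d12, hits, top)
```

For real data you would read the sequences from the FASTA file, pick a larger k, and probably let Mash compute the pairwise distances, keeping something like this only for the k-mer lookup and counting.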

written 14 months ago by Joe16k