Question: I have DNA motifs 6-12 bp long. Trying to get conservation measurements for them.
gravatar for ericbrenner
3.7 years ago by
ericbrenner0 wrote:

I have about 200 short nucleotide motifs (6-12 bp in length) from the human genome, and I'm trying to see how conserved they are across vertebrates. I was thinking that I'd need to make a bed file for each motif that lists all of its occurrences in the human genome. From there, I could map the beds to a bigwig files of PhastCons scores. Does that sound like the best approach? I'm getting stuck at the step of going from motifs to bed files. I've tried using BLAST to find all occurrences of motifs, but their short length is causing issues. I've tried messing with the e-value threshold, word size, and filter parameters, but I still don't get any hits. Is there a work-around for this issue, or should I just rethink my entire approach?

motif conservation • 937 views
ADD COMMENTlink modified 3.7 years ago by shenwei3565.7k • written 3.7 years ago by ericbrenner0

You could check the positions/sequences in ECR Browser.

ADD REPLYlink written 3.7 years ago by GenoMax95k
gravatar for shenwei356
3.7 years ago by
shenwei3565.7k wrote:

For locating motifs, try locate command (usage) of SeqKit.

For some motifs:

$ cat motifs.txt 

Convert to FASTA format which seqkit wants:

$ awk '{print ">"$1"\n"$1}' motifs.txt > motifs.fa
$ cat motifs.fa 


$ seqkit locate -i -d -f motifs.fa --bed human.fa  > motifs@human.bed

elapsed time: 14m:54s
peak rss: 1.96 GB

$ head -n 5 motifs@human.bed
1       20614   20621   ACTNGCT 0       +
1       21148   21155   ACTNGCT 0       +
1       21226   21233   ACTNGCT 0       +
1       27934   27941   ACTNGCT 0       +
1       34284   34291   ACTNGCT 0       +

It's fast for small genomes but still acceptable for human genome. Locating multiple motifs are not parallelized. But you can use GNU parallel:

parallel 'seqkit locate -i -d -p {} --bed human.fa  > motif_{}@human.bed' :::: motifs.txt
ADD COMMENTlink modified 3.7 years ago • written 3.7 years ago by shenwei3565.7k
gravatar for Alex Reynolds
3.7 years ago by
Alex Reynolds31k
Seattle, WA USA
Alex Reynolds31k wrote:

One approach:

  1. Use UCSC BLAT (site) to find hits, which are written to a PSL file.
  2. Use psl2bed to convert from PSL to BED.
  3. Grab phyloP signal for your genome build and alignment of choice from UCSC (e.g., human vs vertebrate).
  4. Convert phyloP signal from WIG to BED with wig2bed.
  5. Map BLAT-sourced hits to phyloP signal with bedmap --echo-map-score.

Once you have scores, you can do some command-line work to extract scores for classes of motifs, and then bring in data files to R or other visualization tools. You could plot hierarchically-clustered heatmaps of per-base phyloP signal per motif hit of the same length, to explore sub-groupings of motifs. Or scatterplots or violin plots of per-base conservation score measurements for classes of motifs. Etc.

ADD COMMENTlink written 3.7 years ago by Alex Reynolds31k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2081 users visited in the last hour