Question

I have DNA motifs 6-12 bp long. Trying to get conservation measurements for them.

0

Entering edit mode

6.9 years ago

ericbrenner • 0

I have about 200 short nucleotide motifs (6-12 bp in length) from the human genome, and I'm trying to see how conserved they are across vertebrates. I was thinking that I'd need to make a bed file for each motif that lists all of its occurrences in the human genome. From there, I could map the beds to a bigwig files of PhastCons scores. Does that sound like the best approach? I'm getting stuck at the step of going from motifs to bed files. I've tried using BLAST to find all occurrences of motifs, but their short length is causing issues. I've tried messing with the e-value threshold, word size, and filter parameters, but I still don't get any hits. Is there a work-around for this issue, or should I just rethink my entire approach?

motif conservation • 1.6k views

ADD COMMENT • link updated 6.9 years ago by shenwei356 8.4k • written 6.9 years ago by ericbrenner • 0

1

Entering edit mode

You could check the positions/sequences in ECR Browser.

ADD REPLY • link 6.9 years ago by GenoMax 141k

score 3 · Answer 1 · 2017-05-30

For locating motifs, try locate command (usage) of SeqKit.

For some motifs:

$ cat motifs.txt 
ACTNGCT
GCTGRTCGAAC

Convert to FASTA format which seqkit wants:

$ awk '{print ">"$1"\n"$1}' motifs.txt > motifs.fa
$ cat motifs.fa 
>ACTNGCT
ACTNGCT
>GCTGRTCGAAC
GCTGRTCGAAC

Locating:

$ seqkit locate -i -d -f motifs.fa --bed human.fa  > motifs@human.bed

elapsed time: 14m:54s
peak rss: 1.96 GB

$ head -n 5 motifs@human.bed
1       20614   20621   ACTNGCT 0       +
1       21148   21155   ACTNGCT 0       +
1       21226   21233   ACTNGCT 0       +
1       27934   27941   ACTNGCT 0       +
1       34284   34291   ACTNGCT 0       +

It's fast for small genomes but still acceptable for human genome. Locating multiple motifs are not parallelized. But you can use GNU parallel:

parallel 'seqkit locate -i -d -p {} --bed human.fa  > motif_{}@human.bed' :::: motifs.txt

score 2 · Answer 2 · 2017-05-30

One approach:

Use UCSC BLAT (site) to find hits, which are written to a PSL file.
Use psl2bed to convert from PSL to BED.
Grab phyloP signal for your genome build and alignment of choice from UCSC (e.g., human vs vertebrate).
Convert phyloP signal from WIG to BED with wig2bed.
Map BLAT-sourced hits to phyloP signal with bedmap --echo-map-score.

Once you have scores, you can do some command-line work to extract scores for classes of motifs, and then bring in data files to R or other visualization tools. You could plot hierarchically-clustered heatmaps of per-base phyloP signal per motif hit of the same length, to explore sub-groupings of motifs. Or scatterplots or violin plots of per-base conservation score measurements for classes of motifs. Etc.