Hi:
I have used PureClip to analyze the an eClip-seq dataset and get the bedfile. Now I want to get the binding motif of the RBP, so I installed MEME to call motif from the bedfile.
The head of my bed file is like this:
head file.bed
chr1 633991 633992 3 14.891 + [score_CL=14.891;score_E=55.5259;score_B=56.5133;score_UC=14.891]
chr1 633992 633993 3 13.9207 + [score_CL=13.9207;score_E=55.7616;score_B=56.7817;score_UC=13.9207]
chr1 633993 633994 3 8.62409 + [score_CL=8.62409;score_E=55.9413;score_B=44.2988;score_UC=8.62409]
chr1 633997 633998 3 2.33843 + [score_CL=2.33843;score_E=52.7415;score_B=28.4546;score_UC=2.33843]
chr1 633998 633999 3 5.65288 + [score_CL=5.65288;score_E=50.1933;score_B=32.0494;score_UC=5.65288]
chr1 633999 634000 3 5.54813 + [score_CL=5.54813;score_E=48.5419;score_B=31.9227;score_UC=5.54813]
chr1 634009 634010 3 0.880815 + [score_CL=0.880815;score_E=44.1986;score_B=22.3018;score_UC=0.880815]
chr1 634010 634011 3 0.782717 + [score_CL=0.782717;score_E=45.8173;score_B=22.1729;score_UC=0.782717]
chr1 1037335 1037336 3 0.189687 + [score_CL=0.189687;score_E=57.7691;score_B=25.1448;score_UC=0.189687]
chr1 1037336 1037337 3 0.277321 + [score_CL=0.277321;score_E=55.6412;score_B=25.2031;score_UC=0.277321]
So I get fasta file from the bed file using bedtools:
bedtools getfasta -fi GRCh38.primary_assembly.genome.fa -bed file.bed -s -fo file.fasta
and run MEME like this:
meme file.fasta -mod anr -o RBP_MEME_anr_motif6 -rna -nmotifs 6 >log015 2>error015
But I get nothing except the following feedback:
Writing results to output directory 'PureCLIP_HNRNPK_MEME_anr_motif6'.
BACKGROUND: using background model of order 0 PRIMARY (classic): n
13474 p0 13474 p1 0 p2 0 SEQUENCE GROUP USAGE-- Starts/EM: p0; Trim:
p0; pvalue: p0; nsites: p0,p1,p2 All sequences must be at least 8
characters long. Set '-w' or '-minw' or remove shorter sequences and
rerun.
From the bedfile I see the sequence is about 3bp long, how should I fill the condition of at least 8 characters long?
Can anyone help me on this? Thank you very much.
Thank you for the reply. I tried it but still have no results:
error is like:
Very much appreciated!
This means that your version of MEME linked to a system installed MPI library when it was compiled, but that library is no longer available on the system you are running it on.
If this is on your local machine, I recommend just re-installing/re-compiling MEME. If its on a shared system, then you will probably need to talk to the system admin.
Again, thank you for your time and patience. I just re-intalled it using :
and re-ran it gain:
meme PureCLIP.crosslink_sites.cov_inputSignal.fasta -mod anr -minw 3 -o PureCLIP_HNRNPK_MEME_anr_motif6 -rna -nmotif 6 >log016 2>error016
This time the output file is still empty, I checked the error it reads:
I changed the parameter to -nmotif 1, still has the same error.
the option is
-nmotifs
not-nmotif
en, it solves the problem. But it brought me back to the old problem:
Could it be the problems of my bed file? I see most of the coordinates has nucleotide length of 3. i set it to --minw 2, still echos the same error.
Its not quite the same error - note that it now says All sequences must be at least 2 characters long. That means there are some peaks that are only 1 characters long. If most of your peaks are short, you are never going to find any motifs.
I recommend extending the peaks as I suggested in 3 in my orignal answer. If you don't have a summit, just use the middle of the peak.
Thanks a lot for your teaching. I don't know how to do #3. But I will search and learn. I will get back to this question as soon as I understand what it means. Thanks again!
Hi:
I read the manual again and go to see the files I used. I guess the main problem might be my input files: After analysis using Pureclip, I got a bed file containing the I showed in the beginning: the first column is the chromosome number, the second column is the position of the and the third column is the position behind the crosslink site (usually it is the 2nd column +1). Then I transformed this bed file into a fasta file:
bedtools getfasta -fi file.fa -bed file.bed -s -fo file.fasta &
and use this fastfile to do the MEME call motif. The problem is that every sequence in the fasta file is a mononucleotide because the coordinates of the above mentioned bed file:. So I guess it is the problem of my input? Do I think in the right way?
Okay, so Pureclip is outputting signficant crosslinks, rather than peaks.
So I recommend you first turn these into regions.
Easist way to do this would be to extend each one 15nt in each direction (use
bedtools slop
to do this). Then merge overlapping regions (usebedtools merge
).Very much appreciated. I am going to do read the manual of these command and see what I can get.
Hi:
I want to ask a naive question on the step of analyzing CLIP-seq data. Usually, the beginning is to cut adaptor. My questions is after cutting the adaptor, the sequence for labeling each individual cDNA will also be removed because it is included in the adaptor, how will it be able to remove PCR duplicates later on?
Thank you very much,
Usually rather than discarding the UMI sequence, it is moved to the read name so that it can be used for deduplcating afterwards. Or at least, thats how I do it for iCLIP data.
I see. Thank you!
May I ask what package you used for iClip-seq analysis. I want to learn to use one whose output can be used for motif calling directly.
I wrote my own. Sorry.
I used UMI-tools (which was actaully written orignally for iCLIP) to process the barcodes onto the reads and de-multiplexed with reaper. I then mapped with STAR and deduplicated with UMI-tools. We then called peaks with my own peak caller. But there are other peak callers out there to call peaks from CLIP data.
That is so cool! I will read some reviews~
Thanks,
Cai