To plot distance histograms of protein binding sites to all types of pre-RNA junctions (intron-exon and exon-intron, etc.), bedtools closest requires post-processing to isolate pertinent information. A batch script that does that post-processing is linked below, and it also uses ngs.plot.r to create exon and genebody distance graphs.
EDIT: I have moved the source code to GitHub: https://github.com/ProfHariSeldon/closestBedPAR-CLIP
For instructions and a one-command batch script working example, download from directory: https://drive.google.com/open?id=1PnudvWR_xbyfqaJJve5R44PHeCyQBBRr file preRNAbinding.tar by right-clicking the TAR file and selecting "Download". Untarring that file creates a subdirectory
preRNAbinding, which contains the contents of the TAR file.
The TAR should be 1.5 GB.
tar -xvf preRNAbinding.tar cd preRNAbinding ls
Expect to see a working directory
batchFile and an example of a working directory after a completed run "afterRunningBatchFile".
cd batchFile ls vi README.txt
This readme contains the contents of this forum post plus more background information.
batchFile contains the batch script
preRNAbinding.s and the input files needed to generate output data files, after which the contents of directory
batchFile should be the same as the contents of
Step by step instructions to run
preRNAbinding.s are self-documented in comments in the SYNOPSIS and PREREQUISITES sections. Batch file
preRNAbinding.s post-processes the files in directory "batchFile" to create the data files that it passes to R Studio to plot the distance histograms.
preRNAbinding.s also runs
ngs.plot.r to create exon and genebody RNA binding protein binding frequency graphs.
To create the histograms: After running
preRNAbinding.s, R Studio must be manually run to use the
preRNAbinding.s output files to produce the HTML file containing the PNG histograms. Step by step instructions are documented in
Background: The plots of these calculations are included and explained in my masters thesis
Lipscomb_Thomas_Thesis.pdf included in directory
batchFile. To understand the structure of pre-RNA see
BEDTOOLS LIMITATIONS AND HOW SBATCH FILE
preRNAbinding.s OVERCOMES THEM:
- BEDTOOLS CLOSEST LIMITATIONS:
Limitation 1: Bedtools closest returns negative numbers for RNA binding protein binding sites upstream of a region and positive numbers for RNA binding protein binding sites downstream of a region (e.g. an exon and the regions upstream and downstream), but does not limit the results to only RNA binding protein binding sites within the regions flanking the center region (e.g. intron-exon-intron, the intronic regions flank the center exon region), which is what is needed. If the nearest RNA binding protein binding site is beyond the flanking region then that protein binding site is erroneously mixed in with the flanking region results (see Figure 7A of the thesis PDF
Lipscomb_Thomas_Thesis.pdf in the batchFile folder). Also, bedtools closest returns a zero for distance if the RNA binding protein site and preRNA region overlap, instead of the needed distance to either the start or end of region, whichever is closer.
Limitation 2: Erroneous double-counting of QKI binding sites inside the region can happen if the region is short (e.g. exons can be short), causing the two junction histograms to overlap (see Figure 8A of the thesis PDF in the batchFile folder).
- HOW BEDTOOLS CLOSEST LIMITATIONS ARE OVERCOME by
Solution to Limitation 1: When bedtools closest returns zero, signifying that the RNA binding protein bound within the region (e.g. the exon or the flanking introns), AWK code provided in
preRNAbinding.s measures the distance to the start or the end of the region, whichever is closer (see Figure 7B of the thesis PDF in the batchFile folder). Looking within adjacent regions prevents erroneous inclusion of more distant protein binding sites.
Solution to Limitation 2: Split the center region in half and split the flanking regions in half. This prevents erroneous double-counting, because no matter what the nucleotide distance on the x-axis of the junction histogram is, the junction histogram is given only the distances that were closer to either the start or the end of the center region, whichever the histogram includes (see Figure 8B of the thesis PDF in the batchFile folder). This solution fully applied to intron-exon and exon-intron histograms. This solution only partially applied to the other histograms, which split only the center region in half because the size of the flanking regions is unknown.
Thomas H. Lipscomb In partial fulfillment of New York University 2018 MS thesis.
Please report bugs to this forum post.
I may not be able to assist in debugging because I will no longer have access to the server I used to generate the example output directory
COPYRIGHT: freeware copyleft
I see a few factors that make the code look suspicious/asks for a lot more than usual:
1: I have moved the source code to GitHub: https://github.com/ProfHariSeldon/closestBedPAR-CLIP
2: Source code is 129 KB (.s, .awk, .py, and .rmd), input files are 291 MB (batchFile folder). The output of a sample run is included so that you may verify that your run on the sample input is running correctly. Output of the sample run including temporary files that trace the computation's progress is 1.12 GB (afterRunningBatchFile folder).
3: To run R Studio in the command line:
$ module load rstudio/1.1.383
$ Rscript -e "knitr::stitch_rmd('Hg38_QKI_closest_dist_freqV9.Rmd')"
4: The .s file is the main batch shell script. The .sh suffix is the more popular suffix, and many people choose to use no suffix.
5: I no longer have access to the high performance computing center at NYU because I graduated, so I am unable to test any code changes. I can answer questions.