Tutorial: Using Bedtools Closest (closestBed) to make distance histograms of protein binding frequency near all types of pre-RNA junctions (e.g. exon-intron and intron-exon)
0
gravatar for linuxborg2
3 months ago by
linuxborg20
linuxborg20 wrote:

To plot distance histograms of protein binding sites to all types of pre-RNA junctions (intron-exon and exon-intron, etc.), bedtools closest requires post-processing to isolate pertinent information. A batch script that does that post-processing is linked below, and it also uses ngs.plot.r to create exon and genebody distance graphs.

For instructions and a one-command batch script working example, download from directory: https://drive.google.com/open?id=1PnudvWR_xbyfqaJJve5R44PHeCyQBBRr file preRNAbinding.tar by right-clicking the TAR file and selecting "Download". Untarring that file creates a subdirectory preRNAbinding, which contains the contents of the TAR file.

To untar:

The TAR should be 1.5 GB.

tar -xvf preRNAbinding.tar
cd preRNAbinding
ls

Expect to see a working directory batchFile and an example of a working directory after a completed run "afterRunningBatchFile".

cd batchFile
ls
vi README.txt

This readme contains the contents of this forum post plus more background information.

batchFile contains the batch script preRNAbinding.s and the input files needed to generate output data files, after which the contents of directory batchFile should be the same as the contents of afterRunningBatchFile.

Step by step instructions to run preRNAbinding.s are self-documented in comments in the SYNOPSIS and PREREQUISITES sections. Batch file preRNAbinding.s post-processes the files in directory "batchFile" to create the data files that it passes to R Studio to plot the distance histograms. preRNAbinding.s also runs ngs.plot.r to create exon and genebody RNA binding protein binding frequency graphs.

To create the histograms: After running preRNAbinding.s, R Studio must be manually run to use the preRNAbinding.s output files to produce the HTML file containing the PNG histograms. Step by step instructions are documented in preRNAbinding.s.

Background: The plots of these calculations are included and explained in my masters thesis Lipscomb_Thomas_Thesis.pdf included in directory batchFile. To understand the structure of pre-RNA see Gene_structure_eukaryote_2_annotated.png in batchFile.

BEDTOOLS LIMITATIONS AND HOW SBATCH FILE preRNAbinding.s OVERCOMES THEM:

  • BEDTOOLS CLOSEST LIMITATIONS:

Limitation 1: Bedtools closest returns negative numbers for RNA binding protein binding sites upstream of a region and positive numbers for RNA binding protein binding sites downstream of a region (e.g. an exon and the regions upstream and downstream), but does not limit the results to only RNA binding protein binding sites within the regions flanking the center region (e.g. intron-exon-intron, the intronic regions flank the center exon region), which is what is needed. If the nearest RNA binding protein binding site is beyond the flanking region then that protein binding site is erroneously mixed in with the flanking region results (see Figure 7A of the thesis PDF Lipscomb_Thomas_Thesis.pdf in the batchFile folder). Also, bedtools closest returns a zero for distance if the RNA binding protein site and preRNA region overlap, instead of the needed distance to either the start or end of region, whichever is closer.

Limitation 2: Erroneous double-counting of QKI binding sites inside the region can happen if the region is short (e.g. exons can be short), causing the two junction histograms to overlap (see Figure 8A of the thesis PDF in the batchFile folder).

  • HOW BEDTOOLS CLOSEST LIMITATIONS ARE OVERCOME by preRNAbinding.s:

Solution to Limitation 1: When bedtools closest returns zero, signifying that the RNA binding protein bound within the region (e.g. the exon or the flanking introns), AWK code provided in preRNAbinding.s measures the distance to the start or the end of the region, whichever is closer (see Figure 7B of the thesis PDF in the batchFile folder). Looking within adjacent regions prevents erroneous inclusion of more distant protein binding sites.

Solution to Limitation 2: Split the center region in half and split the flanking regions in half. This prevents erroneous double-counting, because no matter what the nucleotide distance on the x-axis of the junction histogram is, the junction histogram is given only the distances that were closer to either the start or the end of the center region, whichever the histogram includes (see Figure 8B of the thesis PDF in the batchFile folder). This solution fully applied to intron-exon and exon-intron histograms. This solution only partially applied to the other histograms, which split only the center region in half because the size of the flanking regions is unknown.

AUTHOR:

Thomas H. Lipscomb In partial fulfillment of New York University 2018 MS thesis.

REPORTING BUGS:

Please report bugs to this forum post.

I may not be able to assist in debugging because I will no longer have access to the server I used to generate the example output directory afterRunningBatchFile.

COPYRIGHT: freeware copyleft

ADD COMMENTlink modified 3 months ago by RamRS21k • written 3 months ago by linuxborg20
3

I see a few factors that make the code look suspicious/asks for a lot more than usual:

  1. Source code is on Google drive, not on GitHub or any other Git repo
  2. Source code is >1 GB
  3. Needs R Studio and not just R
  4. Unknown .s file
  5. Author explicitly states they'd not be able to assist in debugging
ADD REPLYlink written 3 months ago by RamRS21k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 741 users visited in the last hour