Context: Sometimes, I get badly mapped RNA-Seq reads due to pseudo-genes (here close duplicates with similar sequences). Lots of software have an option to set the maximal size introns can take, but it often seems unreasonable (Tophat sets it at 500000bp I think). I would like to set it slightly below the maximal distance between any two homologous genes.
Question: I have as input a list of genes of interest in .gtf format. I want to find out what is the minimum distance between any of these two genes.
Is bedtools closest -io -a foo.gtf -b foo.gtf
a correct way to obtain this value?
General: I have reads which are split in two: the left part aligns on a gene, the right one on another gene (close by and with similar sequences). How to stop this?
As a final note, transcriptome assembly usually ignores any aberrant reads and follow the consensus.
Thanks for your advice!
Thanks for the idea, I am looking into it but the awk command is incorrect: it outputs chromosome name...
Here is a subset of my data:
Double-check the arguments passed to the closest-features binary. If you're getting chromosome names, then it is printing BED features as well as signed distances, which means you may be missing an argument.
I ended up using this:
Download the mm10 gene Ensembl Table from UCSC.
Subset it if you want; I use
grep -f ensemblIdList.txt tableUCSCGeneMm10.bed
and sort the resulting bed file. Then, the real issue:Your solution was nice, just had to fiddle with awk. closest-features and bedtools closest can both output the distance as a new (rightmost) column in your bed file. It can be grabbed by awk as
$NF
(i.e., last field).