Question: Finding the minimum distance between any two genes in a list / Coping with pseudo-genes in RNA-Seq data?
gravatar for cyril-cros
4.6 years ago by
cyril-cros890 wrote:

Context: Sometimes, I get badly mapped RNA-Seq reads due to pseudo-genes (here close duplicates with similar sequences). Lots of software have an option to set the maximal size introns can take, but it often seems unreasonable (Tophat sets it at 500000bp I think). I would like to set it slightly below the maximal distance between any two homologous genes.

Question: I have as input a list of genes of interest in .gtf format. I want to find out what is the minimum distance between any of these two genes.

Is "bedtools closest -io -a foo.gtf -b foo.gtf" a correct way to obtain this value?

General: I have reads which are split in two: the left part aligns on a gene, the right one on another gene (close by and with similar sequences). How to stop this?

As a final note, transcriptome assembly usually ignores any aberrant reads and follow the consensus.

Thanks for your advice!

rna-seq pseudo-genes genome • 1.9k views
ADD COMMENTlink modified 4.6 years ago by Alex Reynolds29k • written 4.6 years ago by cyril-cros890
gravatar for Alex Reynolds
4.6 years ago by
Alex Reynolds29k
Seattle, WA USA
Alex Reynolds29k wrote:

With BEDOPS gtf2bed and closest-features, it is pretty easy and efficient.

Convert the GTF file to BED with gtf2bed. Then use closest-features to report all the nearest distances, leaving out the reference and overlapping elements. Pipe the distances to awksort and head to grab the first, smallest absolute distance:

$ gtf2bed < foo.gtf > foo.bed
$ closest-features --closest --dist --no-ref --no-overlaps foo.bed foo.bed | awk '{ print ($1>=0)?$1:-$1 }' | sort -n | head -1 > minimum_distance.txt

If you want this in a self-contained one-liner:

$ closest-features --closest --dist --no-ref --no-overlaps <(gtf2bed < foo.gtf) <(gtf2bed < foo.gtf) | awk '{ print ($1>=0)?$1:-$1 }' | sort -n | head -1 > minimum_distance.txt

Converting only once will be faster, though.

You can preprocess input with bedops to filter out categories of elements, if desired, before using closest-features.

Depending on the reads you get back, if you first need to segregate by strand:

$ awk '$6 == "+"' foo.bed > foo.for.bed
$ awk '$6 == "-"' foo.bed > foo.rev.bed

Then you can run the above pipelines per strand.

ADD COMMENTlink modified 4.5 years ago • written 4.6 years ago by Alex Reynolds29k

Thanks for the idea, I am looking into it but the awk command is incorrect: it outputs chromosome name...

Here is a subset of my data:

~$ head subset.bed 
chr1    92479680    92479683    ENSMUST00000086837    0.000000    -    mm10_ensGene    stop_codon    .    gene_id "ENSMUST00000086837"; transcript_id "ENSMUST00000086837"; 
chr1    92479680    92480619    ENSMUST00000086837    0.000000    -    mm10_ensGene    exon    .    gene_id "ENSMUST00000086837"; transcript_id "ENSMUST00000086837"; 
chr1    92479683    92480619    ENSMUST00000086837    0.000000    -    mm10_ensGene    CDS    0    gene_id "ENSMUST00000086837"; transcript_id "ENSMUST00000086837"; 
chr1    92480616    92480619    ENSMUST00000086837    0.000000    -    mm10_ensGene    start_codon    .    gene_id "ENSMUST00000086837"; transcript_id "ENSMUST00000086837"; 
chr1    92490817    92490820    ENSMUST00000071521    0.000000    -    mm10_ensGene    stop_codon    .    gene_id "ENSMUST00000071521"; transcript_id "ENSMUST00000071521"; 


ADD REPLYlink written 4.5 years ago by cyril-cros890

Double-check the arguments passed to the closest-features binary. If you're getting chromosome names, then it is printing BED features as well as signed distances, which means you may be missing an argument.

ADD REPLYlink modified 4.5 years ago • written 4.5 years ago by Alex Reynolds29k

I ended up using this:
Download the mm10 gene Ensembl Table from UCSC.
Subset it if you want; I use "grep -f ensemblIdList.txt tableUCSCGeneMm10.bed" and sort the resulting bed file. Then, the real issue:

bedtools closest -s -d -io -N -a olfr_genes_sorted.bed -b olfr_genes_sorted.bed > output.bed
# Gets me: gene bed data | closest distinct match bed data | distance between the two
awk '{print $NF,"\t",$1, "\t",$4,"\t",$16}' output.bed | less
# Gets me distance | chromosome | Ensembl gene ID #1 | Ensembl gene ID #2 

Your solution was nice, just had to fiddle with awk. closest-features and bedtools closest can both output the distance as a new (rightmost) column in your bed file. It can be grabbed by awk as $NF (ie last field).

ADD REPLYlink modified 4.4 years ago • written 4.4 years ago by cyril-cros890
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1415 users visited in the last hour