Question

Finding the minimum distance between any two genes in a list / Coping with pseudo-genes in RNA-Seq data?

0

Entering edit mode

10.3 years ago

cyril-cros ▴ 950

Context: Sometimes, I get badly mapped RNA-Seq reads due to pseudo-genes (here close duplicates with similar sequences). Lots of software have an option to set the maximal size introns can take, but it often seems unreasonable (Tophat sets it at 500000bp I think). I would like to set it slightly below the maximal distance between any two homologous genes.

Question: I have as input a list of genes of interest in .gtf format. I want to find out what is the minimum distance between any of these two genes.

Is bedtools closest -io -a foo.gtf -b foo.gtf a correct way to obtain this value?

General: I have reads which are split in two: the left part aligns on a gene, the right one on another gene (close by and with similar sequences). How to stop this?

As a final note, transcriptome assembly usually ignores any aberrant reads and follow the consensus.

Thanks for your advice!

rna-seq genome pseudo-genes • 3.6k views

ADD COMMENT • link updated 2.4 years ago by Ram 45k • written 10.3 years ago by cyril-cros ▴ 950

Ram · Accepted Answer · 2015-03-25

1

Entering edit mode

10.3 years ago

Alex Reynolds 36k

With BEDOPS gtf2bed and closest-features, it is pretty easy and efficient.

Convert the GTF file to BED with gtf2bed. Then use closest-features to report all the nearest distances, leaving out the reference and overlapping elements. Pipe the distances to awk, sort and head to grab the first, smallest absolute distance:

$ gtf2bed < foo.gtf > foo.bed
$ closest-features --closest --dist --no-ref --no-overlaps foo.bed foo.bed | awk '{ print ($1>=0)?$1:-$1 }' | sort -n | head -1 > minimum_distance.txt

If you want this in a self-contained one-liner:

$ closest-features --closest --dist --no-ref --no-overlaps <(gtf2bed < foo.gtf) <(gtf2bed < foo.gtf) | awk '{ print ($1>=0)?$1:-$1 }' | sort -n | head -1 > minimum_distance.txt

Converting only once will be faster, though.

You can preprocess input with bedops to filter out categories of elements, if desired, before using closest-features.

Depending on the reads you get back, if you first need to segregate by strand:

$ awk '$6 == "+"' foo.bed > foo.for.bed
$ awk '$6 == "-"' foo.bed > foo.rev.bed

Then you can run the above pipelines per strand.

ADD COMMENT • link updated 3.1 years ago by Ram 45k • written 10.3 years ago by Alex Reynolds 36k

0

Entering edit mode

Thanks for the idea, I am looking into it but the awk command is incorrect: it outputs chromosome name...

Here is a subset of my data:

~$ head subset.bed
chr1    92479680    92479683    ENSMUST00000086837    0.000000    -    mm10_ensGene    stop_codon    .    gene_id "ENSMUST00000086837"; transcript_id "ENSMUST00000086837"; 
chr1    92479680    92480619    ENSMUST00000086837    0.000000    -    mm10_ensGene    exon    .    gene_id "ENSMUST00000086837"; transcript_id "ENSMUST00000086837"; 
chr1    92479683    92480619    ENSMUST00000086837    0.000000    -    mm10_ensGene    CDS    0    gene_id "ENSMUST00000086837"; transcript_id "ENSMUST00000086837"; 
chr1    92480616    92480619    ENSMUST00000086837    0.000000    -    mm10_ensGene    start_codon    .    gene_id "ENSMUST00000086837"; transcript_id "ENSMUST00000086837"; 
chr1    92490817    92490820    ENSMUST00000071521    0.000000    -    mm10_ensGene    stop_codon    .    gene_id "ENSMUST00000071521"; transcript_id "ENSMUST00000071521";

ADD REPLY • link updated 3.1 years ago by Ram 45k • written 10.2 years ago by cyril-cros ▴ 950

0

Entering edit mode

Double-check the arguments passed to the closest-features binary. If you're getting chromosome names, then it is printing BED features as well as signed distances, which means you may be missing an argument.

ADD REPLY • link 10.2 years ago by Alex Reynolds 36k

0

Entering edit mode

I ended up using this:

Download the mm10 gene Ensembl Table from UCSC.

Subset it if you want; I use grep -f ensemblIdList.txt tableUCSCGeneMm10.bed and sort the resulting bed file. Then, the real issue:

bedtools closest -s -d -io -N -a olfr_genes_sorted.bed -b olfr_genes_sorted.bed > output.bed
# Gets me: gene bed data | closest distinct match bed data | distance between the two
awk '{print $NF,"\t",$1, "\t",$4,"\t",$16}' output.bed | less
# Gets me distance | chromosome | Ensembl gene ID #1 | Ensembl gene ID #2

Your solution was nice, just had to fiddle with awk. closest-features and bedtools closest can both output the distance as a new (rightmost) column in your bed file. It can be grabbed by awk as $NF (i.e., last field).

ADD REPLY • link updated 2.4 years ago by Ram 45k • written 10.2 years ago by cyril-cros ▴ 950