Question: How to determine partial or complete gene overlap.
0
gravatar for hellbio
3.4 years ago by
hellbio370
hellbio370 wrote:

Dear all,

I have the gene annotations in gtf format as shown below:

chr1    stdin   exon    247829  252562  .       -       .       gene_id "ENPP1"; transcript_id "ENPP1"; exon_number "1";exon_id "ENPP1.1";
chr1    stdin   CDS     247832  252562  .       -       0       gene_id "ENPP1"; transcript_id "ENPP1"; exon_number "1"; exon_id "ENPP1.1";
chr1    stdin   exon    254466  254628  .       -       .       gene_id "ENPP1"; transcript_id "ENPP1"; exon_number "2"; exon_id "ENPP1.2";
chr1    stdin   CDS     254466  254628  .       -       1       gene_id "ENPP1"; transcript_id "ENPP1"; exon_number "2"; exon_id "ENPP1.2";
chr1    stdin   exon    247829  252562  .       -       .       gene_id "ENPP1_2"; transcript_id "ENPP1_2"; exon_number "1"; exon_id "ENPP1_2.1";
chr1    stdin   CDS     247829  252562  .       -       1       gene_id "ENPP1_2"; transcript_id "ENPP1_2"; exon_number "1"; exon_id "ENPP1_2.1";
chr1    stdin   exon    254466  254628  .       -       .       gene_id "ENPP1_2"; transcript_id "ENPP1_2"; exon_number "2"; exon_id "ENPP1_2.2";
chr1    stdin   CDS     254466  254628  .       -       2       gene_id "ENPP1_2"; transcript_id "ENPP1_2"; exon_number "2"; exon_id "ENPP1_2.2";

gene_id is duplicated for each transcript with a suffix "_1" , "_2".... Are there any tools to determine whether there is a partial or complete gene overlap with few Kb segments using the gene file in the above format?

next-gen • 1.1k views
ADD COMMENTlink modified 3.4 years ago by EagleEye6.5k • written 3.4 years ago by hellbio370
1
gravatar for EagleEye
3.4 years ago by
EagleEye6.5k
Sweden
EagleEye6.5k wrote:
echo `date`;
dir=(`dirname $1`)

base=(`basename $1`)

echo "1/7 Preparing files";


cat $1 | sed 's/ "/\t/g' | sed 's/"; /\t/g' | cut -f1,4,5,7,10,12,16 | sed 's/";//' | awk '!x[$5]++' | awk '{print $5"\t"$0;}' > $dir/ensg_annotation.tmp;

cat $1 | sed 's/ "/\t/g' | sed 's/"; /\t/g' | cut -f1,4,5,7,10,12,16 | sed 's/";//' | awk '{print $5"\t"$2;}' | awk 'BEGIN{FS="\t"}{ if( !seen[$1]++ ) order[++oidx] = $1; stuff[$1] = stuff[$1] $2 "," } END { for( i = 1; i <= oidx; i++ ) print order[i]"\t"stuff[order[i]] }' > $dir/ensg_start.tmp;

cat $1 | sed 's/ "/\t/g' | sed 's/"; /\t/g' | cut -f1,4,5,7,10,12,16 | sed 's/";//' | awk '{print $5"\t"$3;}' | awk 'BEGIN{FS="\t"}{ if( !seen[$1]++ ) order[++oidx] = $1; stuff[$1] = stuff[$1] $2 "," } END { for( i = 1; i <= oidx; i++ ) print order[i]"\t"stuff[order[i]] }' > $dir/ensg_end.tmp;

echo "2/7 sorting and joining files - Transcripts (Not applicable)";


echo "3/7 sorting and joining files - Genes";
### ENSG

sort -k1,1 ${dir}/ensg_start.tmp > ${dir}/ensg_start_sort.tmp;
sort -k1,1  ${dir}/ensg_end.tmp > ${dir}/ensg_end_sort.tmp;

join -j1 -t $'\t' ${dir}/ensg_start_sort.tmp ${dir}/ensg_end_sort.tmp > ${dir}/ensg_location.tmp


sort -k1,1 ${dir}/ensg_annotation.tmp > ${dir}/ensg_annotation_sort.tmp;
sort -k1,1 ${dir}/ensg_location.tmp > ${dir}/ensg_location_sort.tmp;

join -j1 -t $'\t' ${dir}/ensg_annotation_sort.tmp ${dir}/ensg_location_sort.tmp > ${dir}/ensg_annotation.txt;


echo "4/7 Distance measure - Transcripts (Not applicable)";

echo "5/7 Distance measure - Genes";

cat $dir/ensg_annotation.txt | awk 'function max(x){i=0;for(val in x){if(i<=x[val]){i=x[val];}}return i;}function min(x){i=max(x);for(val in x){if(i>x[val] && x[val]>1){i=x[val];}}return i;}{split($9,a,",") ; split($10,b,","); print $0"\t"min(a)"\t"max(b);}' | awk 'BEGIN{FS="\t"}{print $2"\t"$11"\t"$12"\t"$1"\t0\t"$5}' >  ${dir}/${base}_ensg_annotation.txt;

echo "6/7 Cleaning temporary files";

rm $dir/ensg_annotation.tmp $dir/ensg_start.tmp $dir/ensg_end.tmp $dir/ensg_annotation.txt $dir/ensg_location.tmp ${dir}/ensg_start_sort.tmp ${dir}/ensg_end_sort.tmp ${dir}/ensg_location_sort.tmp ${dir}/ensg_annotation_sort.tmp;

echo "7/7 Done";
echo `date`;
ADD COMMENTlink modified 3.4 years ago • written 3.4 years ago by EagleEye6.5k

Just have to tweak a little for non standard Ensembl. The above code will work for your test.genes. it outputs the bed file as below

chr1    247829  273988  ENPP1   0   -
chr1    247829  313574  ENPP1_2 0   -
chr1    247829  270611  ENPP1_3 0   -
ADD REPLYlink modified 3.4 years ago • written 3.4 years ago by EagleEye6.5k
0
gravatar for EagleEye
3.4 years ago by
EagleEye6.5k
Sweden
EagleEye6.5k wrote:

This might be useful,

https://github.com/bedops/bedops#Dependencies

https://bedops.readthedocs.io/en/latest/

Or convert gff to bed (gene level) then use bed tools

There is a simple script I wrote to convert GTFtoBED, use similar method.

A: Converting gtf format to bed format

I have noticed that your gff file is almost similar to ensembl format (above code only works for gencode), use this,

Script:

Sample GTF:

Sample output where you can easily get the bed columns:

ADD COMMENTlink modified 3.4 years ago • written 3.4 years ago by EagleEye6.5k

The script usage message shows: usage: dirname path usage: basename string [suffix] basename [-a] [-s suffix] string [...]

What does dirname mean? could it be the gtf file

ADD REPLYlink written 3.4 years ago by hellbio370

Use,

./script.sh path_to_gtf/INPUT.GTF

Tips: make the script executable first using

chmod a+x script.sh

ADD REPLYlink modified 3.4 years ago • written 3.4 years ago by EagleEye6.5k

Please see below post

ADD REPLYlink written 3.4 years ago by hellbio370
0
gravatar for hellbio
3.4 years ago by
hellbio370
hellbio370 wrote:

I did used the script as ./EnsemblGTFtoAnnotation.sh genes.gtf . However, the output looks weird and looks like below

3BHS_CANFA  chr17           +   3BHS_CANFA.1 56552945,56552945,56553120,
41_CANFA    chr2            -   41_CANFA.1  71472762,71472762,71476569,71476569,71479564,
41_CANFA_10 chr2            -   41_CANFA_10.1    71472762,71472765,71476569,71476569,71479564,71479564
41_CANFA_11 chr2            -   41_CANFA_11.1    71472762,71472762,71476569,71476569,71479564,71479564,71480800,
41_CANFA_12 chr2            -   41_CANFA_12.1   71472762,71472762,71476569,71476569,71479564,71479564,
ADD COMMENTlink written 3.4 years ago by hellbio370

Please edit and delete 18,22 from cut command in these lines

echo "1/7 Preparing files"; cat $1 | sed 's/ "/\t/g' | sed 's/"; /\t/g' | cut -f1,4,5,7,10,12,16,18,22 | sed 's/";//' | awk '!x[$6]++' | awk '{print $6"\t"$0;}' > $dir/enst_annotation.tmp;

cat $1 | sed 's/ "/\t/g' | sed 's/"; /\t/g' | cut -f1,4,5,7,10,12,16,18,22 | sed 's/";//' | awk '!x[$5]++' | awk '{print $5"\t"$0;}' > $dir/ensg_annotation.tmp;

cat $1 | sed 's/ "/\t/g' | sed 's/"; /\t/g' | cut -f1,4,5,7,10,12,16,18,22 | sed 's/";//' | awk '{print $6"\t"$2;}' | awk 'BEGIN{FS="\t"}{ if( !seen[$1]++ ) order[++oidx] = $1; stuff[$1] = stuff[$1] $2 "," } END { for( i = 1; i <= oidx; i++ ) print order[i]"\t"stuff[order[i]] }' > $dir/enst_start.tmp;

cat $1 | sed 's/ "/\t/g' | sed 's/"; /\t/g' | cut -f1,4,5,7,10,12,16,18,22 | sed 's/";//' | awk '{print $5"\t"$2;}' | awk 'BEGIN{FS="\t"}{ if( !seen[$1]++ ) order[++oidx] = $1; stuff[$1] = stuff[$1] $2 "," } END { for( i = 1; i <= oidx; i++ ) print order[i]"\t"stuff[order[i]] }' > $dir/ensg_start.tmp;

cat $1 | sed 's/ "/\t/g' | sed 's/"; /\t/g' | cut -f1,4,5,7,10,12,16,18,22 | sed 's/";//' | awk '{print $6"\t"$3;}' | awk 'BEGIN{FS="\t"}{ if( !seen[$1]++ ) order[++oidx] = $1; stuff[$1] = stuff[$1] $2 "," } END { for( i = 1; i <= oidx; i++ ) print order[i]"\t"stuff[order[i]] }' > $dir/enst_end.tmp;

cat $1 | sed 's/ "/\t/g' | sed 's/"; /\t/g' | cut -f1,4,5,7,10,12,16,18,22 | sed 's/";//' | awk '{print $5"\t"$3;}' | awk 'BEGIN{FS="\t"}{ if( !seen[$1]++ ) order[++oidx] = $1; stuff[$1] = stuff[$1] $2 "," } END { for( i = 1; i <= oidx; i++ ) print order[i]"\t"stuff[order[i]] }' > $dir/ensg_end.tmp;

ADD REPLYlink modified 3.4 years ago • written 3.4 years ago by EagleEye6.5k

Unfortunately,still the same!! Uploaded a sample file for one gene with few lines (http://www.fileconvoy.com/dfl.php?id=g9e89b42792cd42a1999835989774c2d90ccf95918) if it helps!!

ADD REPLYlink written 3.4 years ago by hellbio370
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 773 users visited in the last hour