How to extract genes upstream and downstream of a list of locations from a GTF file?
3.9 years ago

I have a list of locations in BED format:

chr2    55159107    55160004
chr3    40280597    40282177
chr4    74682484    74683574
chr4    76795449    76796456
chr6    10250838    10251741
chr6    20435795    20436466
chr6    31169498    31170294

I would like to identify the genes that lie within 1000 bp (1 kb) upstream and downstream of each of those locations, using a GTF file from a non-model plant.

GTF file:

SpoScf_00032    maker   exon    12116   12419   .   +   .   gene_id  transcript_id "Spo06120";
SpoScf_00032    maker   exon    14070   17062   .   +   .   gene_id  transcript_id "Spo06120";
SpoScf_00032    maker   exon    17626   17899   .   +   .   gene_id  transcript_id "Spo06120";
chr2    maker   CDS 15262965    15263150    .   +   0   gene_id  transcript_id "Spo26212";
chr2    maker   CDS 15264530    15264667    .   +   0   gene_id  transcript_id "Spo26212";
chr2    maker   CDS 15265433    15265885    .   +   0   gene_id  transcript_id "Spo26212";

bedtools window, intersect and closest don't answer my question because they look for overlaps.


You can create a bed_downstream file with chr, start-1000, start and another bed_upstream file with chr, end, end+1000.

Then you can run bedtools intersect twice: once on bed_downstream and once on bed_upstream.
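
A minimal sketch of that two-file approach with awk. The file names regions.bed, bed_downstream.bed and bed_upstream.bed are placeholders, and clamping negative starts to 0 is an added assumption so that intervals near the start of a scaffold remain valid BED. The file names follow the wording above; which side is biologically "upstream" depends on the strand of the gene.

```shell
# regions.bed is a hypothetical input holding two of the intervals from the question
printf 'chr2\t55159107\t55160004\nchr3\t40280597\t40282177\n' > regions.bed

# chr, start-1000, start (start clamped at 0: an assumption, not part of the reply)
awk 'BEGIN{OFS="\t"} {s=$2-1000; if(s<0) s=0; print $1, s, $2}' regions.bed > bed_downstream.bed

# chr, end, end+1000 (no check against the chromosome length is done here)
awk 'BEGIN{OFS="\t"} {print $1, $3, $3+1000}' regions.bed > bed_upstream.bed
```

Each output file can then be passed to bedtools intersect as the -a file, with the GTF as -b.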

3.9 years ago

Get the upstream/downstream regions of the BED file:

awk '{X=1000;B=(int($2)-1)-X;if(B<0) B=0;printf("%s\t%d\t%s\n",$1,B,$2);printf("%s\t%s\t%d\n",$1,$3,int($3)+X);}' the.bed

Then sort the output and use bedtools intersect.
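
A sketch of what that last step might look like. All file names (flanks.bed, flanks.sorted.bed, genes.gtf, flank_genes.txt) are hypothetical, the demo GTF line is shortened to a single transcript_id attribute, and the -wa -wb options are one reasonable choice rather than the only one. As far as I know, sorting is only strictly required when bedtools intersect is run with -sorted, but it keeps the output tidy.

```shell
# Hypothetical inputs: two flank intervals and a one-line GTF modelled on the question
printf 'chr2\t15264000\t15265000\nchr2\t15261964\t15262965\n' > flanks.bed
printf 'chr2\tmaker\tCDS\t15262965\t15263150\t.\t+\t0\ttranscript_id "Spo26212";\n' > genes.gtf

# Sort by chromosome, then numerically by start
sort -k1,1 -k2,2n flanks.bed > flanks.sorted.bed

# Run the intersect only if bedtools is available on PATH
if command -v bedtools >/dev/null 2>&1; then
  # -wa -wb: print each flank followed by the GTF record it overlaps
  bedtools intersect -a flanks.sorted.bed -b genes.gtf -wa -wb > flank_genes.txt
fi
```

Because the GTF in the question contains exon and CDS records, each hit is a feature rather than a gene; the transcript_id in the last column identifies which gene model it belongs to.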


Thanks so much.

Do I need to use any specific options with bedtools intersect?


Yes. Read the manual.


I used your awk script to get the upstream and downstream regions of my BED file, which looks like this:

BED file:

SpoScf_00500    226344  227695
SpoScf_00562    236367  239437

Result:

SpoScf_00500    0   226344
SpoScf_00500    227695  228695
SpoScf_00562    0   236367
SpoScf_00562    239437  240437

Is this correct? For the first line, shouldn't it be 225344 instead of 0?


Fixed: changed P->B in the awk script.
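
For reference, a quick check of the awk command as it now appears in the answer, run on the two example intervals. The input file name the.bed comes from the answer; the output name flanks.bed is a placeholder.

```shell
# Recreate the two example intervals from the comment above
printf 'SpoScf_00500\t226344\t227695\nSpoScf_00562\t236367\t239437\n' > the.bed

# The command from the answer, unchanged
awk '{X=1000;B=(int($2)-1)-X;if(B<0) B=0;printf("%s\t%d\t%s\n",$1,B,$2);printf("%s\t%s\t%d\n",$1,$3,int($3)+X);}' the.bed > flanks.bed

cat flanks.bed
# SpoScf_00500  225343  226344
# SpoScf_00500  227695  228695
# SpoScf_00562  235366  236367
# SpoScf_00562  239437  240437
```

The left flank starts at 225343 rather than 225344 because the script subtracts an extra 1 from the start. If exactly 1000 bp is wanted in 0-based BED coordinates, dropping that -1 should give 225344.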
