Question

Best tool for finding Boundary Pairs

0

Entering edit mode

7.4 years ago

Fatima ▴ 1000

Hello everyone,

I'm a newbie to this field! I need to find the boundary pairs, but I don't know exactly how to find them! I have the names of the first and the last gene!

Here is the definition:

Boundary Pairs (consists of either the upstream adjacent gene and the first gene of the Transcription Unit, or the last gene of the Transcription Unit and the adjacent downstream gene, providing that these genes are transcribed in the same direction)

Reference Genome: Ecoli

Upstream RNA-Seq • 1.5k views

ADD COMMENT • link updated 10 months ago by Ram 43k • written 7.4 years ago by Fatima ▴ 1000

1

Entering edit mode

Important for the others that will read: which organism are you working on? Is there a gene annotation, a reference sequence, a database available? Finding such pairs in human might be easy but not in a non annotated plant, for example.

ADD REPLY • link 7.4 years ago by Matteo Schiavinato ★ 3.6k

0

Entering edit mode

Yes, there's a reference genome (Ecoli)

ADD REPLY • link 7.4 years ago by Fatima ▴ 1000

1

Entering edit mode

What's a TU? Preferentially don't use abbreviations, those aren't clear for everyone.

ADD REPLY • link 7.4 years ago by WouterDeCoster 47k

1

Entering edit mode

Transcription unit. I modified the post.

ADD REPLY • link 7.4 years ago by Fatima ▴ 1000

1

Entering edit mode

I gave for granted that they were Transcriptional Units. Please correct me if I'm wrong @Afagh.

ADD REPLY • link 7.4 years ago by Matteo Schiavinato ★ 3.6k

0

Entering edit mode

@Macspider Yes, you are right :)

ADD REPLY • link 7.4 years ago by Fatima ▴ 1000

1

Entering edit mode

7.4 years ago

Matteo Schiavinato ★ 3.6k

Ok, so since you said that you have a reference genome and a reference annotation (given that it is E. coli) what you need is to find out which one is the closest gene to each one in your list. An example:

you have G1 and G2 which are the first and last of a TU. If you extract them from the GTF file of E. coli you will find the positions in terms of nucleotides (field 4 and 5 of the gene feature line).
you can write a little script to scan the annotation for the gene upstream G1, and the one downstream G2 (it shouldn't be too hard, just some dictionaries in python / hashes in perl or even lists.

Your boundary pairs for the TU you're considering will be [k-1, k] and [j, j+1] where k and j are the indexes of G1 and G2 in the list with all the gene coordinates.

You then iterate this rationale over all your genes in the list you have.

ADD COMMENT • link 7.4 years ago by Matteo Schiavinato ★ 3.6k

0

Entering edit mode

Thank you! So, I don't need to use the start site or -100 base and things like that? Okay, I'll write the script in python!

ADD REPLY • link 7.4 years ago by Fatima ▴ 1000

score 3 · Accepted Answer · 2016-11-22

a one-liner using sqlite3, putting all data in a table 'T1' and using 3 aliases a,b,c :-) Assuming there is no overlapping transcript. This is very slow, I've used a head -n 200 to get a subset. I should have used some sql indexes.

$ rm -f db.sqlite && curl -s "ftp://ftp.ensemblgenomes.org/pub/release-32/bacteria//gtf/bacteria_91_collection/escherichia_coli/Escherichia_coli.HUSEC2011CHR1.32.gtf.gz" | \
gunzip -c | awk -F '\t' '($3=="transcript")' |\
head -n 200 | cut -f 1,4,5,7 |\
awk 'BEGIN {printf("create table T1(C text,S int,E int,P int); begin transaction;\n");} {printf("insert into T1(C,S,E,P) values(\"%s\",%s,%s,%d);\n",$1,$2,$3,($4=="+"?1:0));} END {printf("commit;\nselect a.*,b.* from T1 as a, T1 as b left join T1 as c on c.C==a.C and a.E <= c.S and c.E <= b.S where a.C==b.C and a.E <= b.S and a.P==b.P group by a.C,a.S,a.E,a.P,b.C,b.E,b.S,b.P having count(c.C)==0;");}' |\
sqlite3 db.sqlite

Chromosome|349|2811|1|Chromosome|2813|3745|1
Chromosome|2813|3745|1|Chromosome|3746|5032|1
Chromosome|3746|5032|1|Chromosome|5246|5542|1
Chromosome|5696|6472|0|Chromosome|6542|7972|0
Chromosome|8320|9204|1|Chromosome|9319|9906|1
Chromosome|9941|10507|0|Chromosome|10656|11369|0
Chromosome|10656|11369|0|Chromosome|11395|11799|0
Chromosome|12176|14092|1|Chromosome|14181|15311|1
Chromosome|16153|17319|1|Chromosome|17379|18284|1
Chromosome|18323|19282|0|Chromosome|19295|20440|0
Chromosome|19295|20440|0|Chromosome|20873|21994|0
Chromosome|23423|24310|1|Chromosome|24353|27169|1
Chromosome|24353|27169|1|Chromosome|27169|27663|1
Chromosome|27169|27663|1|Chromosome|27751|28200|1
Chromosome|27751|28200|1|Chromosome|28202|29152|1
Chromosome|28202|29152|1|Chromosome|29218|30132|1
Chromosome|29218|30132|1|Chromosome|30299|31120|1
Chromosome|30299|31120|1|Chromosome|31399|31536|1
Chromosome|31399|31536|1|Chromosome|31576|32724|1
Chromosome|31576|32724|1|Chromosome|32742|35963|1
Chromosome|32742|35963|1|Chromosome|36224|36619|1
Chromosome|36705|37295|0|Chromosome|37301|38194|0
Chromosome|37301|38194|0|Chromosome|38195|39748|0
Chromosome|38195|39748|0|Chromosome|39822|41039|0
Chromosome|39822|41039|0|Chromosome|41168|42310|0
Chromosome|41168|42310|0|Chromosome|42341|43855|0
Chromosome|44329|45099|1|Chromosome|45114|46055|1
Chromosome|45114|46055|1|Chromosome|46106|47392|1
(...)