Best tool for finding Boundary Pairs
2
0
Entering edit mode
5.0 years ago
Fatima ▴ 960

Hello everyone,

I'm a newbie to this field! I need to find the boundary pairs, but I don't know exactly how to find them! I have the names of the first and the last gene!

Here is the definition:

Boundary Pairs (consists of either the upstream adjacent gene and the first gene of the Transcription Unit, or the last gene of the Transcription Unit and the adjacent downstream gene, providing that these genes are transcribed in the same direction)

Reference Genome: Ecoli

Upstream RNA-Seq Tool • 1.1k views
1
Entering edit mode

Important for the others that will read: which organism are you working on? Is there a gene annotation, a reference sequence, a database available? Finding such pairs in human might be easy but not in a non annotated plant, for example.

0
Entering edit mode

Yes, there's a reference genome (Ecoli)

1
Entering edit mode

What's a TU? Preferentially don't use abbreviations, those aren't clear for everyone.

1
Entering edit mode

Transcription unit. I modified the post.

1
Entering edit mode

I gave for granted that they were Transcriptional Units. Please correct me if I'm wrong @Afagh.

0
Entering edit mode

@Macspider Yes, you are right :)

3
Entering edit mode
5.0 years ago

a one-liner using sqlite3, putting all data in a table 'T1' and using 3 aliases a,b,c :-) Assuming there is no overlapping transcript. This is very slow, I've used a head -n 200 to get a subset. I should have used some sql indexes.

$rm -f db.sqlite && curl -s "ftp://ftp.ensemblgenomes.org/pub/release-32/bacteria//gtf/bacteria_91_collection/escherichia_coli/Escherichia_coli.HUSEC2011CHR1.32.gtf.gz" | \ gunzip -c | awk -F '\t' '($3=="transcript")' |\
head -n 200 | cut -f 1,4,5,7 |\
awk 'BEGIN {printf("create table T1(C text,S int,E int,P int); begin transaction;\n");} {printf("insert into T1(C,S,E,P) values(\"%s\",%s,%s,%d);\n",$1,$2,$3,($4=="+"?1:0));} END {printf("commit;\nselect a.*,b.* from T1 as a, T1 as b left join T1 as c on c.C==a.C and a.E <= c.S and c.E <= b.S where a.C==b.C and a.E <= b.S and a.P==b.P group by a.C,a.S,a.E,a.P,b.C,b.E,b.S,b.P having count(c.C)==0;");}' |\
sqlite3 db.sqlite

Chromosome|349|2811|1|Chromosome|2813|3745|1
Chromosome|2813|3745|1|Chromosome|3746|5032|1
Chromosome|3746|5032|1|Chromosome|5246|5542|1
Chromosome|5696|6472|0|Chromosome|6542|7972|0
Chromosome|8320|9204|1|Chromosome|9319|9906|1
Chromosome|9941|10507|0|Chromosome|10656|11369|0
Chromosome|10656|11369|0|Chromosome|11395|11799|0
Chromosome|12176|14092|1|Chromosome|14181|15311|1
Chromosome|16153|17319|1|Chromosome|17379|18284|1
Chromosome|18323|19282|0|Chromosome|19295|20440|0
Chromosome|19295|20440|0|Chromosome|20873|21994|0
Chromosome|23423|24310|1|Chromosome|24353|27169|1
Chromosome|24353|27169|1|Chromosome|27169|27663|1
Chromosome|27169|27663|1|Chromosome|27751|28200|1
Chromosome|27751|28200|1|Chromosome|28202|29152|1
Chromosome|28202|29152|1|Chromosome|29218|30132|1
Chromosome|29218|30132|1|Chromosome|30299|31120|1
Chromosome|30299|31120|1|Chromosome|31399|31536|1
Chromosome|31399|31536|1|Chromosome|31576|32724|1
Chromosome|31576|32724|1|Chromosome|32742|35963|1
Chromosome|32742|35963|1|Chromosome|36224|36619|1
Chromosome|36705|37295|0|Chromosome|37301|38194|0
Chromosome|37301|38194|0|Chromosome|38195|39748|0
Chromosome|38195|39748|0|Chromosome|39822|41039|0
Chromosome|39822|41039|0|Chromosome|41168|42310|0
Chromosome|41168|42310|0|Chromosome|42341|43855|0
Chromosome|44329|45099|1|Chromosome|45114|46055|1
Chromosome|45114|46055|1|Chromosome|46106|47392|1
(...)

1
Entering edit mode
5.0 years ago
Macspider ★ 3.4k

Ok, so since you said that you have a reference genome and a reference annotation (given that it is E. coli) what you need is to find out which one is the closest gene to each one in your list. An example:

• you have G1 and G2 which are the first and last of a TU. If you extract them from the GTF file of E. coli you will find the positions in terms of nucleotides (field 4 and 5 of the gene feature line).
• you can write a little script to scan the annotation for the gene upstream G1, and the one downstream G2 (it shouldn't be too hard, just some dictionaries in python / hashes in perl or even lists.

Your boundary pairs for the TU you're considering will be [k-1, k] and [j, j+1] where k and j are the indexes of G1 and G2 in the list with all the gene coordinates.

You then iterate this rationale over all your genes in the list you have.

0
Entering edit mode

Thank you! So, I don't need to use the start site or -100 base and things like that? Okay, I'll write the script in python!