Question

Selecting intergenic regions

0

Entering edit mode

5.9 years ago

vinayjrao ▴ 250

Hi,

I have gtf files for a list of organisms. I want to extract only the intergenic sequences for each of their corresponding fasta files. Currently, I have been able to use grep -o '^..........' ecoli.fa > 10.txt to grep out the first 10 characters if the first gene starts at pos 11.

I now need to find a way to automate the process for the remaining regions, as manual inspection is impossible. How can I do the same for a huge file, for example -

Gene      Start     End
Gene1    11        157
Gene2    2878    8548
Gene3    11254  12124

The process might require file handling to subtract End of a row from the Start of the next row, and that value could be coupled with grep to give a command like grep -o '^{2877.}' ecoli.gtf | tail -f 2720 > 2720.txt

P.S. I could make the gtf file containing the intergenic lengths, but I do need help with the file handling.

Thanks in advance

grep shell • 3.1k views

ADD COMMENT • link updated 13 months ago by Ram 43k • written 5.9 years ago by vinayjrao ▴ 250

1

Entering edit mode

Hello,

you could first create a bed file containing the coordinates for each intergenic regions with the language of your choice. Than you can use this bed file together with bedtools getfasta to get the sequences.

fin swimmer

ADD REPLY • link 5.9 years ago by finswimmer 16k

0

Entering edit mode

Thanks for the advice. It would certainly help, but to do the same, I first need to employ file handling to select the Start and End coordinates of the annotated genes. Would you happen to have a code to create the bed file?

Thanks.

ADD REPLY • link 5.9 years ago by vinayjrao ▴ 250

0

Entering edit mode

How does your file with gene coordinates looks like? In your first post you've said you have a gtf file. But the example you have posted doesn't looks like one.

fin swimmer

ADD REPLY • link 5.9 years ago by finswimmer 16k

0

Entering edit mode

Sorry for the confusion on my part. I only showed the relevant columns, but it is a proper gtf file.

GeneID Chr Start End Strand GeneInfo

ADD REPLY • link 5.9 years ago by vinayjrao ▴ 250

0

Entering edit mode

@ Pierre Lindenbaum , Devon Ryan , genomax , WouterDeCoster

What happens to that thread? Who closes it and why?

fin swimmer

ADD REPLY • link 5.9 years ago by finswimmer 16k

0

Entering edit mode

Only mods close threads when they are off-topic/spam etc. Users should just accept answers that are correct. That provides proper closure for the thread.

ADD REPLY • link 5.9 years ago by GenoMax 141k

0

Entering edit mode

Hello,

I have a GFF3 file. How can I get the intergenic regions and its positive and negative chain information?

Thank.

ADD REPLY • link 5.9 years ago by M201570116 ▴ 10

0

Entering edit mode

This is an unrelated question and should have been posted as a new thread. Also provide clear objective of what you are trying to achieve.

ADD REPLY • link 5.9 years ago by GenoMax 141k

score 2 · Accepted Answer · 2018-06-06

2

Entering edit mode

5.9 years ago

finswimmer 16k

Hello again,

this doesn't look like a proper gtf file. Nevertheless here is a python script that should create a bed file for the intergenic regions. It asumes the columns you gave, that your file is sorted by coordinates and that there is header line.

Run it like this:

$ python intergenic.py genes.gtf > intergenic.bed

ADD COMMENT • link 5.9 years ago by finswimmer 16k

0

Entering edit mode

Actually, I modified the gtf file I had earlier. I downloaded a gff file from NCBI, and converted it to gtf using gffread ecoli.gff -T -o ecoli.gtf. This gtf was modified to keep the columns mentioned in my previous comment.

Thanks a lot for all your help.

ADD REPLY • link 5.9 years ago by vinayjrao ▴ 250

0

Entering edit mode

I tried running the script, but line 22 sep="\t" throws an error saying invalid syntax. I haven't learnt python :(

ADD REPLY • link 5.9 years ago by vinayjrao ▴ 250

1

Entering edit mode

Hello,

than you are using python2. Add this line at the beginning of the python script and run again:

from __future__ import print_function

fin swimmer

ADD REPLY • link 5.9 years ago by finswimmer 16k

0

Entering edit mode

The script is perfectly running. Thanks a lot for your time. Could you please add it as an answer so that I can accept it.

Thanks again :)

ADD REPLY • link 5.9 years ago by vinayjrao ▴ 250

0

Entering edit mode

You're welcome.

I moved my post to an answer.

fin swimmer

ADD REPLY • link 5.9 years ago by finswimmer 16k