GFF-Ex Issues in extracting intergenic regions
1
0
Entering edit mode
3.9 years ago
Diesel • 0

Hi everyone,

Is there anybody who has experience in extracting intergenic regions with GFF-Ex?

I have successfully downloaded, installed and used this tool on my Linux machine. I am using it to extract the intergenic regions of Bacillus subtilis strain 168.

In my opinion (please correct me if I am wrong), there are some issues(at least for me) in the output file intergenic_sequence.fasta

This file does contain the intergenic sequences, however I suspect that not all are read in the correct direction. This file created by running the tool is a FASTA with all the sequences between each of the genes in my organism; each sequence is labelled with the 2 gene names of which the sequence it is inbetween. Therefore I assume it does capture all those sequences, but I think all those captured are from the forward strand. By opening the appropriate script however it seems they do try to take the strand of the genes in consideration, but I am no programming expert.

I have this suspicion by looking at 2 genes in my genome; this 2 genes are next to each other in the genome; the first gene however is in the reverse strand, and the second gene it is on the forward strand.

This means that the intergenic region upstream of each gene is the same (it's overlapping), but in the opposite direction for each gene.

(Gene1 Reverse Strand) ATCTCTCTCGC (Gene2 Forward strand)

Look at the example above, the shared intergenic is: 5 ATCTCTCTCGC 3 Therefore when I extract the intergenic sequences I would like to have in the output file 2 sequences, the same sequence but in the opposite direction, each attributed to the correct gene, like below:

> Gene 1
5 CGCTCTCTCTA 3
> Gene 2
3 ATCTCTCTCGC 5


Instead GFF-Ex only output this:

> Gene 1 BETWEEN Gene 2
5  ATCTCTCTCGC 3


and then goes to extract whatever is between Gene 2 and Gene 3 and so on.

Note: I simplify the output; in reality the name of the sequences have other IDs for the gene, but basically they give the 2 genes for which the sequence is in between. Also the numbers of the ends (5 and 3) are not there, I just added them to make this example more clear.

Maybe what I want is not the intended output of the intergenic_sequence script of GFF-Ex; I basically need to treat intergenic regions more like the full region upstream of each gene instead GFF-Ex seems to extract them as region between 2 genes, and it link each intergenic region to 2 genes.

I was looking at the "Upstream to promoter" output file aswell, and maybe that would be a better method for what I need; however before running this the tool asks to insert the length of the upstream sequences to extract and that is an issue for me, because I need the full upstream sequence, and they are all of different length.

Any solution, insight or clarification on this issue?

I have tried numerous method unsuccessfully, also using R.

In particular the package "genbankr" with the function "intergenic"; this package allows you to read in a genbank file and then extract the intergenic regions coordinates as a GRanges object; the problem here is that all the ranges are classed as unknown strand (all *, no + or - in the GRanges). Therefore when I then use this ranges for the extraction I can only extract the forward orientation. This is even more weird; I also tried the gaps function of GenomicFeatures, after converting my genbank file to a TxDb object; this ranges do have the the strand correctly extracted, however this function does not create a metadata column and I loose all the names of the genes attributed to the intergenic ranges, which is also an issue for me.

I can use Perl or R; does anybody have any suggestion on how to tackle this problem? Unfortunately my programming skills are limited, and I do not think parsing a GenBank file with my own script is a viable option. But once I have the ranges, I can write a function to extract those from a Fasta by myself.

bioinformatics gff-ex perl unix r • 1.6k views
0
Entering edit mode
3.8 years ago
achyR ▴ 10

I might be able to help you if you can be more clear in explaining "what do you need?"

Second, I have tested the GFF-Ex with some manually created gff files and here are the results:

# GFF-file

## sequence-region chromosome 1 6397126

chromosome PseudoCAP gene 339 1874 . + 0 geneA

chromosome PseudoCAP gene 2000 2200 . - 0 geneB

chromosome PseudoCAP gene 2300 2400 . + 0 geneC

Chromosome PseudoCAP gene 2600 2650 . - 0 geneD

Chromosome PseudoCAP gene 2625 2700 . + 0 geneE

Chromosome PseudoCAP gene 2780 2900 . - 0 geneF

# intergenic output file looks like this

chromosome_geneB*BETWEENchromosome_geneC ACTGATCGTGAAGGCAGGTCGCAGCCGGTTCACGTTGTCCACGCTGCCTGCCAACGATTTCCCGACGGTGGAAGAAGGCCCCGGCTCGCTGACGTTCAATC chromosome_geneABETWEENchromosome_geneB GAACCGGCAAGGTCTGGCGGCGTTCTACCACGCCGGCGACCAATTGCAGGGGTTTCAACAGGGCTTCGCGTTGAATGGTGAAATGCATGGTCTAGTCCCTTGCCTTGATAAGCTGCACTGGCTTCAT chromosome_geneCBETWEENChromosome_geneD CACCGGTGGTGGCGCGGATGTGATGCTGACCCAGCACGATGCTGACCATGCCGTCCTGCTCGGTCAGCAAACGCGCCATCTCCAGAATACCTTTACGAGGCACGATGACCTGATGACGGTCGGCGTGTTCGATATCGGCTGACATCGAGCACATTGCCAGACGGTGGCCATCGGTAGCCACTGCACGCAAGATTCCGGCAC Chromosome_geneEBETWEEN*Chromosome_geneF GGCCGTTAGCCAGTTGCAGACGAATGCCACGGTACTTCTCGTTGGACAGAATCGCGGTGCGGCTGAACGCTTCACGCAGGG

So, GFF-Ex do consider the strand information to fetch the intergenic regions between the two genes even if they are on the different strand. It is biologically logical as gene is always double stranded.

Note: there might be problems with the gff formatting which could fetch wrong results.