Is there anybody who has experience in extracting intergenic regions with GFF-Ex?
I have successfully downloaded, installed and used this tool on my Linux machine. I am using it to extract the intergenic regions of Bacillus subtilis strain 168.
In my opinion (please correct me if I am wrong), there are some issues(at least for me) in the output file intergenic_sequence.fasta
This file does contain the intergenic sequences, however I suspect that not all are read in the correct direction. This file created by running the tool is a FASTA with all the sequences between each of the genes in my organism; each sequence is labelled with the 2 gene names of which the sequence it is inbetween. Therefore I assume it does capture all those sequences, but I think all those captured are from the forward strand. By opening the appropriate script however it seems they do try to take the strand of the genes in consideration, but I am no programming expert.
I have this suspicion by looking at 2 genes in my genome; this 2 genes are next to each other in the genome; the first gene however is in the reverse strand, and the second gene it is on the forward strand.
This means that the intergenic region upstream of each gene is the same (it's overlapping), but in the opposite direction for each gene.
(Gene1 Reverse Strand) ATCTCTCTCGC (Gene2 Forward strand)
Look at the example above, the shared intergenic is: 5 ATCTCTCTCGC 3 Therefore when I extract the intergenic sequences I would like to have in the output file 2 sequences, the same sequence but in the opposite direction, each attributed to the correct gene, like below:
> Gene 1 5 CGCTCTCTCTA 3 > Gene 2 3 ATCTCTCTCGC 5
Instead GFF-Ex only output this:
> Gene 1 BETWEEN Gene 2 5 ATCTCTCTCGC 3
and then goes to extract whatever is between Gene 2 and Gene 3 and so on.
Note: I simplify the output; in reality the name of the sequences have other IDs for the gene, but basically they give the 2 genes for which the sequence is in between. Also the numbers of the ends (5 and 3) are not there, I just added them to make this example more clear.
Maybe what I want is not the intended output of the intergenic_sequence script of GFF-Ex; I basically need to treat intergenic regions more like the full region upstream of each gene instead GFF-Ex seems to extract them as region between 2 genes, and it link each intergenic region to 2 genes.
I was looking at the "Upstream to promoter" output file aswell, and maybe that would be a better method for what I need; however before running this the tool asks to insert the length of the upstream sequences to extract and that is an issue for me, because I need the full upstream sequence, and they are all of different length.
Any solution, insight or clarification on this issue?
I have tried numerous method unsuccessfully, also using R.
In particular the package "genbankr" with the function "intergenic"; this package allows you to read in a genbank file and then extract the intergenic regions coordinates as a GRanges object; the problem here is that all the ranges are classed as unknown strand (all *, no + or - in the GRanges). Therefore when I then use this ranges for the extraction I can only extract the forward orientation. This is even more weird; I also tried the gaps function of GenomicFeatures, after converting my genbank file to a TxDb object; this ranges do have the the strand correctly extracted, however this function does not create a metadata column and I loose all the names of the genes attributed to the intergenic ranges, which is also an issue for me.
I can use Perl or R; does anybody have any suggestion on how to tackle this problem? Unfortunately my programming skills are limited, and I do not think parsing a GenBank file with my own script is a viable option. But once I have the ranges, I can write a function to extract those from a Fasta by myself.