Question: Get intronic and intergenic sequences based on gff file
1
gravatar for biolab
4.9 years ago by
biolab1.1k
biolab1.1k wrote:

Dear all

I have a gff3 file with header shown below. The 3rd column indicates feature (exon, intron, UTR),the 4th and 5th columns indicate start and stop position, and the last column indicates detailed feature (eg which exon). I aim to extract intergenic and intronic sequences based on this file.

Note many genes overlap.How to achieve my goal?

Chr1 MSU_osa1r7 gene 2903 10817 . + . ID=LOC_Os01g01010;Name=LOC_Os01g01010;Note=TBC%20domain%20containing%20protein%2C%20expressed
Chr1 MSU_osa1r7 mRNA 2903 10817 . + . ID=LOC_Os01g01010.1;Name=LOC_Os01g01010.1;Parent=LOC_Os01g01010
Chr1 MSU_osa1r7 exon 2903 3268 . + . ID=LOC_Os01g01010.1:exon_1;Parent=LOC_Os01g01010.1
... ...
gff • 8.3k views
ADD COMMENTlink modified 13 months ago by zx87547.8k • written 4.9 years ago by biolab1.1k

Hello biolab!

Questions similar to yours can already be found at:

We have closed your question to allow us to keep similar content in the same thread.

If you disagree with this please tell us why in a reply below. We'll be happy to talk about it.

Cheers!

ADD REPLYlink written 4.9 years ago by Daniel Standage3.9k

Questions like biolab's get asked again and again in many different versions.  I think my answer below was specific to his exact questions (or at least one interpretation of it) which is not posed in your link....   and gave him specific advice on how to proceed.  I certainly would not have answered as I did to that other question which was in some ways broader.

ADD REPLYlink written 4.9 years ago by Malcolm.Cook1.0k

The question I linked to answers a very similar (albeit more general) question. I'm typically in favor of pointing users to these so that useful answers don't get spread across different threads.

However, I think I misread the question when I closed it: I thought it said intronic and exonic sequences, rather than intergenic sequences. I agree that inferring the intergenic sequence adds a dimension to this question that warrants a separate thread, so I'll reopen.

ADD REPLYlink written 4.9 years ago by Daniel Standage3.9k

Hi Daniel, thank you very much for the link provided.

ADD REPLYlink written 4.9 years ago by biolab1.1k
11
gravatar for ATpoint
14 months ago by
ATpoint19k
Germany
ATpoint19k wrote:

Question is a few years old, but still relevant:

Intergenic is the easiest, as it is simply the complement of all features in a GTF/GFF file with the rest of the genome.

Note: This definition of intergenic is based purely on the GFF file entries. Of course there are promoters and other regulatory regions in the genome, with e.g. promoters right upstream of the first exon, but this is not annotated in a typical GFF so here, intergenic is simply the complement of everything in the GFF. If one wants to include promoters, one could define a certain window upstream of the gene start coordinate.

Requires the GFF/GTF and a BED file with the chromosome sizes. The latter, you can get based on the chromSizes files for your genome, e.g. from UCSC. It contains two columns, the first is the chromosome name, the second the number of basepairs on that chr. Make a BED file out of it:

awk 'OFS="\t" {print $1, "0", $2}' chromSizes.txt | sort -k1,1 -k2,2n > chromSizes.bed

Sort the GFF file:

cat in.gff | awk '$1 ~ /^#/ {print $0;next} {print $0 | "sort -k1,1 -k4,4n -k5,5n"}' > in_sorted.gff

Get intergenic regions:

bedtools complement -i in_sorted.gff -g chromSizes.bed > intergenic_sorted.bed

Next, intron is the complement of intergenic and exonic regions. Note: This definition of intron is (IMHO) sufficient, because the other features of a GFF (CDS, UTR, Start/Stop Codon) are all sub-intervals of "exon". Hence, everything that is not intergenic or exon must be intron. First, extract exonic coordinates in BED format:

awk 'OFS="\t", $1 ~ /^#/ {print $0;next} {if ($3 == "exon") print $1, $4-1, $5}' in_sorted.gff > exon_sorted.bed

Now use BEDtools complement to get the introns:

bedtools complement -i <(cat exon_sorted.bed intergenic_sorted.bed | sort -k1,1 -k2,2n) -g chromSizes.bed > intron_sorted.bed

This of course can be customized based on the information in your GFF. Again, as mentioned above, if you want intergenic to be more precise, you could add custom features such as promoters, enhancers or other regulatory elements to the GFF, and then again take the complement of this file with the rest of the genome.

Edit 08/18: just learned that there is a nice functionality in R to get introns. Check this post.

ADD COMMENTlink modified 10 months ago • written 14 months ago by ATpoint19k
4
gravatar for Malcolm.Cook
4.9 years ago by
Malcolm.Cook1.0k
kansas, usa
Malcolm.Cook1.0k wrote:

Well, your problem is underspecified, since you don't say how you want to handle overlapping genes or transcripts.  

However, in your toolkit you might want to have bedtools.

For example, if by "intergenic" you want all intervals that are not spanned by any gene, then you will want to use pull out just the gene entries from your gff3 (iand then use `bedtools complement` which "returns all intervals in a genome that are not covered by at least one interval in the input BED/GFF/VCF file." (http://bedtools.readthedocs.org/en/latest/content/tools/complement.html)

You can then get the fasta for the resulting gff using `bedtools getfasta` which" extracts sequences from a FASTA file for each of the intervals defined in a BED/GFF/VCF file" (http://bedtools.readthedocs.org/en/latest/content/tools/getfasta.html)

Similarly, if you require that intronic regions must be introns in each and every transcript which spans it, you can extract your 'gene' and 'intron' features into separate gene.gff and intron.gff files, respectively, then use `bedtools merge` on each to create genic.gff and intronic.gff, and then  use `bedtools subtract` to subtract the genic regions from the intronic regions.  

If your definition of intronic and intergenic are otherwise, you'll probably find a way to compute them using bedtools.

ADD COMMENTlink written 4.9 years ago by Malcolm.Cook1.0k

Hi, Malcolm, thanks a lot for your helpful suggestions!

ADD REPLYlink written 4.9 years ago by biolab1.1k

Can you tell me how to put all intron sequences to a single file (specify each intron of one gene) using a genome sequence and a GTF/GFF file? Thank you very much.

ADD REPLYlink written 3.4 years ago by biomoon10
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1990 users visited in the last hour