Question: What is GFF3 and how to extract genomic sequences from it?
4
gravatar for MAPK
5.5 years ago by
MAPK1.4k
United States
MAPK1.4k wrote:

Hi everybody,

I am new in bioinformatics and recently I am learning all these new things which is very exciting. I now have another challenge in working with the genome data. I have Tribolium genome from beetle base: ftp://ftp.bioinformatics.ksu.edu/pub/BeetleBase/3.0/. I also have this GFF3 data from the same FTP site: ftp://ftp.bioinformatics.ksu.edu/pub/BeetleBase/3.0/GFF3/Official_Gene_GFF3/. I need to extract full length gene (intron, exons, noncoding regions...) using these two data. Could any of you please enlighten me on getting this done? Thank you for your help!

genomic sequence gene gff3 • 6.7k views
ADD COMMENTlink modified 5.1 years ago by Daniel Standage3.9k • written 5.5 years ago by MAPK1.4k

Given links doesn't seem to be working.

ADD REPLYlink written 5.5 years ago by PoGibas4.8k

Thanks for replying, I just corrected the link. 

ADD REPLYlink written 5.5 years ago by MAPK1.4k
4
gravatar for Alex Reynolds
5.5 years ago by
Alex Reynolds29k
Seattle, WA USA
Alex Reynolds29k wrote:

Generally, you could use conversion tools in genomic toolkits (like gff2bed) to extract data quickly:

$ gff2bed < annotations.gff | grep -w 'gene' | cut -f1-4 > genes.bed

For example:

$ gff2bed < ChLG2_official_gene.gff3 | grep -w 'gene' | cut -f1-4
ChLG2    4835    6220    TCOGS2:TC004531
ChLG2    10132    20674    TCOGS2:TC004850
ChLG2    24354    25482    TCOGS2:TC004532
ChLG2    25740    30300    TCOGS2:TC004849
ChLG2    30769    31550    TCOGS2:TC004533
ChLG2    35291    36284    TCOGS2:TC004534
ChLG2    36824    39385    TCOGS2:TC004535
ChLG2    64940    65360    TCOGS2:TC004848
ChLG2    76047    76227    TCOGS2:TC004847
ChLG2    77616    90710    TCOGS2:TC004846
...

Once you have these coordinate ranges for your genes, you can query each one against an indexed FASTA file. One approach is discussed in another answer in this thread.

You could do something like the following to get a cleaned-up FASTA file and index:

$ wget ftp://ftp.bioinformatics.ksu.edu//pub/BeetleBase/3.0/Tribolium_genome_sequence.fasta -O - \
   | sed '/^$/d' \
   > Tribolium_genome_sequence.fasta

$ samtools faidx Tribolium_genome_sequence.fasta

(It looks like there are some blank lines in the FASTA file, which causes samtools to segfault. The sed statement strips those lines.)

Once cleaned and indexed, you could do lookups of gene sequences, appending results to the fifth column of the genes.bed file:

$ awk ' \
    { \
        range = $1":"$2"-"$3; \
        system("samtools faidx Tribolium_genome_sequence.fasta "$range); \
    }' genes.bed \
    > genes.fa

$ awk ' \
    { \
        if (/^>/) { \
            printf("\n"); \
        } \
        else { \
            printf("%s", $0); \
        } \
    }' genes.fa \
    | tail -n +2 - \
    | paste genes.bed - \
    > answer.bed

The file answer.bed could be fairly large. It might be preferable to extract subsets of genes of interest from genes.bed using tools like BEDOPS bedops or bedmap, and then do a lookup on the subset, instead of the entire set. (On the other hand, if you build your answer file once, you can always just do repeated lookups on that end product.)

ADD COMMENTlink modified 3.6 years ago • written 5.5 years ago by Alex Reynolds29k

Thank you, Alex! could you please elaborate this"(It looks like there are some blank lines in the FASTA file, which causes samtools to segfault. The sedstatement strips those lines.)"

and how do I use sed?

ADD REPLYlink written 5.5 years ago by MAPK1.4k

The FASTA file from the KSU site has blank lines in it. That's bad FASTA. If you simply download the FASTA file and try to run it through samtools as it is, then samtools will crash (at least the version of samtools written at the time of answering this question).

The sed command strips out those lines, so that you can run samtools on a clean FASTA file and index it.

To clean the FASTA file, you can use sed on its own, or pipe input into it, as I show in the example above.

If this is unclear, read Wikipedia or other sources on standard input and output streams in UNIX, as you'll find basic input and output piping in use all over various modern bioinformatics analyses.

ADD REPLYlink modified 3.6 years ago • written 5.5 years ago by Alex Reynolds29k
3
gravatar for Devon Ryan
5.5 years ago by
Devon Ryan92k
Freiburg, Germany
Devon Ryan92k wrote:

For some of that the path of least resistance is to simply use gffread from cufflinks, which can directly accept an annotation and a genome sequence in a fasta file and be told to spit out full sequences. I don't recall gffread being able to output intronic sequences (maybe it can, though I doubt it since that'd normally not be useful). For that, one could simply tweak the R (my post) or perl (Alejandro Reyes' post) scripts in this post on seqanswers.

ADD COMMENTlink written 5.5 years ago by Devon Ryan92k

Seems the gffread link is now broken, just a head's up

ADD REPLYlink written 2.6 years ago by a1ultima710

Here are updated links to the program description and it's corresponding GitHub repository.

ADD REPLYlink written 15 months ago by mmfansler320
1
gravatar for Ashutosh Pandey
5.5 years ago by
Philadelphia
Ashutosh Pandey11k wrote:

You can get start and end coordinates of a gene from gff3 file. 

For example:

ChLG10	TCOGS2	gene	11381335	11383395	.	-	.	ID=TCOGS2:TC004355;Name=TC004355;Alias=GLEAN_04355

The start and end coordinates for this gene are 11381335 and 11383395 respectively.

Now you can extract the sequence from the reference fasta file using these coordinates. You can use samtools faidx command. See below for the faidx description from samtools.

  samtools faidx <ref.fasta> [region1 [...]]

Index reference sequence in the FASTA format or extract subsequence from indexed reference sequence. If no region is specified, faidx will index the file and create <ref.fasta>.faion the disk. If regions are speficified, the subsequences will be retrieved and printed to stdout in the FASTA format. The input file can be compressed in the RAZF format.

ADD COMMENTlink written 5.5 years ago by Ashutosh Pandey11k
0
gravatar for MAPK
5.5 years ago by
MAPK1.4k
United States
MAPK1.4k wrote:

I found this which is very useful:

https://usegalaxy.org/root/index

it has feature to fetch sequences using gff3 file.

ADD COMMENTlink written 5.5 years ago by MAPK1.4k
0
gravatar for Daniel Standage
5.1 years ago by
Daniel Standage3.9k
Davis, California, USA
Daniel Standage3.9k wrote:

The AEGeAn Toolkit includes the xtractore program, designed precisely for this task.

[standage@lappy AEGeAn] bin/xtractore -h

xtractore: extract sequences corresponding to annotated features from the
           given sequence file

Usage: xtractore [options] features.gff3 sequences.fasta
  Options:
    -h|--help             print this help message and exit
    -i|--idfile: FILE     file containing a list of feature IDs (1 per line
                          with no spaces); if provided, only features with
                          IDs in this file will be extracted
    -o|--outfile: FILE    file to which output sequences will be written;
                          default is terminal (stdout)
    -t|--type: STRING     feature type to extract; can be used multiple
                          times to extract features of multiple types
    -v|--verbose          print verbose warning and error messages
    -w|--width: INT       width of each line of sequence in the Fasta
                          output; default is 80; set to 0 for no
                          formatting

ADD COMMENTlink written 5.1 years ago by Daniel Standage3.9k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1600 users visited in the last hour