What is GFF3 and how to extract genomic sequences from it?
5
6
Entering edit mode
8.8 years ago
MAPK ★ 2.0k

Hi everybody,

I am new in bioinformatics and recently I am learning all these new things which is very exciting. I now have another challenge in working with the genome data. I have Tribolium genome from beetle base. I also have this GFF3 data from the same FTP site. I need to extract full length gene (intron, exons, noncoding regions...) using these two data. Could any of you please enlighten me on getting this done? Thank you for your help!

sequence gene Gff3 genomic • 11k views
0
Entering edit mode

Given links doesn't seem to be working.

0
Entering edit mode

4
Entering edit mode
8.8 years ago

Generally, you could use conversion tools in genomic toolkits (like gff2bed) to extract data quickly:

$gff2bed < annotations.gff | grep -w 'gene' | cut -f1-4 > genes.bed  For example: $ gff2bed < ChLG2_official_gene.gff3 | grep -w 'gene' | cut -f1-4
ChLG2    4835    6220    TCOGS2:TC004531
ChLG2    10132    20674    TCOGS2:TC004850
ChLG2    24354    25482    TCOGS2:TC004532
ChLG2    25740    30300    TCOGS2:TC004849
ChLG2    30769    31550    TCOGS2:TC004533
ChLG2    35291    36284    TCOGS2:TC004534
ChLG2    36824    39385    TCOGS2:TC004535
ChLG2    64940    65360    TCOGS2:TC004848
ChLG2    76047    76227    TCOGS2:TC004847
ChLG2    77616    90710    TCOGS2:TC004846
...


Once you have these coordinate ranges for your genes, you can query each one against an indexed FASTA file. One approach is discussed in another answer in this thread.

You could do something like the following to get a cleaned-up FASTA file and index:

$wget ftp://ftp.bioinformatics.ksu.edu//pub/BeetleBase/3.0/Tribolium_genome_sequence.fasta -O - \ | sed '/^$/d' \
> Tribolium_genome_sequence.fasta

$samtools faidx Tribolium_genome_sequence.fasta  (It looks like there are some blank lines in the FASTA file, which causes samtools to segfault. The sed statement strips those lines.) Once cleaned and indexed, you could do lookups of gene sequences, appending results to the fifth column of the genes.bed file: $ awk ' \
{ \
range = $1":"$2"-"$3; \ system("samtools faidx Tribolium_genome_sequence.fasta "$range); \
}' genes.bed \
> genes.fa

$awk ' \ { \ if (/^>/) { \ printf("\n"); \ } \ else { \ printf("%s",$0); \
} \
}' genes.fa \
| tail -n +2 - \
| paste genes.bed - \


The file answer.bed could be fairly large. It might be preferable to extract subsets of genes of interest from genes.bed using tools like BEDOPS bedops or bedmap, and then do a lookup on the subset, instead of the entire set. (On the other hand, if you build your answer file once, you can always just do repeated lookups on that end product.)

0
Entering edit mode

Thank you, Alex! could you please elaborate this "(It looks like there are some blank lines in the FASTA file, which causes samtools to segfault. The sedstatement strips those lines.)"

and how do I use sed?

0
Entering edit mode

The FASTA file from the KSU site has blank lines in it. That's bad FASTA. If you simply download the FASTA file and try to run it through samtools as it is, then samtools will crash (at least the version of samtools written at the time of answering this question).

The sed command strips out those lines, so that you can run samtools on a clean FASTA file and index it.

To clean the FASTA file, you can use sed on its own, or pipe input into it, as I show in the example above.

If this is unclear, read Wikipedia or other sources on standard input and output streams in UNIX, as you'll find basic input and output piping in use all over various modern bioinformatics analyses.

3
Entering edit mode
8.8 years ago

For some of that the path of least resistance is to simply use gffread from cufflinks, which can directly accept an annotation and a genome sequence in a fasta file and be told to spit out full sequences. I don't recall gffread being able to output intronic sequences (maybe it can, though I doubt it since that'd normally not be useful). For that, one could simply tweak the R (my post) or perl (Alejandro Reyes' post) scripts in this post on seqanswers.

0
Entering edit mode

0
Entering edit mode
1
Entering edit mode
8.8 years ago

You can get start and end coordinates of a gene from gff3 file.

For example:

ChLG10  TCOGS2  gene    11381335    11383395    .   -   .   ID=TCOGS2:TC004355;Name=TC004355;Alias=GLEAN_04355


The start and end coordinates for this gene are 11381335 and 11383395 respectively.

Now you can extract the sequence from the reference fasta file using these coordinates. You can use samtools faidx command. See below for the faidx description from samtools.

samtools faidx <ref.fasta> [region1 [...]]
Index reference sequence in the FASTA format or extract subsequence from indexed reference sequence. If no region is specified, faidx will index the file and create <ref.fasta>.faion the disk. If regions are speficified, the subsequences will be retrieved and printed to stdout in the FASTA format. The input file can be compressed in the RAZF format.

0
Entering edit mode
8.8 years ago
MAPK ★ 2.0k

I found this which is very useful:

https://usegalaxy.org/root/index

it has feature to fetch sequences using gff3 file.

0
Entering edit mode
8.4 years ago

The AEGeAn Toolkit includes the xtractore program, designed precisely for this task.

[standage@lappy AEGeAn] bin/xtractore -h

xtractore: extract sequences corresponding to annotated features from the
given sequence file

Usage: xtractore [options] features.gff3 sequences.fasta
Options:
-h|--help             print this help message and exit
-i|--idfile: FILE     file containing a list of feature IDs (1 per line
with no spaces); if provided, only features with
IDs in this file will be extracted
-o|--outfile: FILE    file to which output sequences will be written;
default is terminal (stdout)
-t|--type: STRING     feature type to extract; can be used multiple
times to extract features of multiple types
-v|--verbose          print verbose warning and error messages
-w|--width: INT       width of each line of sequence in the Fasta
output; default is 80; set to 0 for no
formatting