Generally, you could use conversion tools in genomic toolkits (like gff2bed) to extract data quickly:
$ gff2bed < annotations.gff | grep -w 'gene' | cut -f1-4 > genes.bed
For example:
$ gff2bed < ChLG2_official_gene.gff3 | grep -w 'gene' | cut -f1-4
ChLG2 4835 6220 TCOGS2:TC004531
ChLG2 10132 20674 TCOGS2:TC004850
ChLG2 24354 25482 TCOGS2:TC004532
ChLG2 25740 30300 TCOGS2:TC004849
ChLG2 30769 31550 TCOGS2:TC004533
ChLG2 35291 36284 TCOGS2:TC004534
ChLG2 36824 39385 TCOGS2:TC004535
ChLG2 64940 65360 TCOGS2:TC004848
ChLG2 76047 76227 TCOGS2:TC004847
ChLG2 77616 90710 TCOGS2:TC004846
...
Once you have these coordinate ranges for your genes, you can query each one against an indexed FASTA file. One approach is discussed in another answer in this thread.
You could do something like the following to get a cleaned-up FASTA file and index:
$ wget ftp://ftp.bioinformatics.ksu.edu//pub/BeetleBase/3.0/Tribolium_genome_sequence.fasta -O - \
| sed '/^$/d' \
> Tribolium_genome_sequence.fasta
$ samtools faidx Tribolium_genome_sequence.fasta
(It looks like there are some blank lines in the FASTA file, which causes samtools
to segfault. The sed
statement strips those lines.)
Once cleaned and indexed, you could do lookups of gene sequences, appending results to the fifth column of the genes.bed
file:
$ awk ' \
{ \
range = $1":"$2"-"$3; \
system("samtools faidx Tribolium_genome_sequence.fasta "$range); \
}' genes.bed \
> genes.fa
$ awk ' \
{ \
if (/^>/) { \
printf("\n"); \
} \
else { \
printf("%s", $0); \
} \
}' genes.fa \
| tail -n +2 - \
| paste genes.bed - \
> answer.bed
The file answer.bed
could be fairly large. It might be preferable to extract subsets of genes of interest from genes.bed
using tools like BEDOPS bedops
or bedmap
, and then do a lookup on the subset, instead of the entire set. (On the other hand, if you build your answer file once, you can always just do repeated lookups on that end product.)
Given links doesn't seem to be working.
Thanks for replying, I just corrected the link.