Question

Extract only locations of annotated genes from a reference genome

1

Entering edit mode

2.4 years ago

Ivan ▴ 60

I have a reference genome ref_genome.fna I got by downloading via datasets download genome via its accession number. GenBank assembly and RefSeq assembly are identical. When I use RefSeq accession number to find it in RefSeq database, I can see features that were annotated, and those are Gene; mRNA; CDS; ncRNA. When I go to it's annotation release, link here, I can see that 25,293 genes are protein coding. Likewise, annotation products are available on ftp site, link here.

What I want to do is extract the locations of those 25,293 protein coding genes (alongside any gene identifier) in a single file, eg.

chrom   chromStart   chromEnd   geneID
chr2       100000      155000   RK031
...         ...        ...      ...

For that purpose, what file do I need to download, and what tool do I need to use?

annotation RefSeq coding-genes • 936 views

ADD COMMENT • link updated 2.3 years ago by MirianT_NCBI ▴ 720 • written 2.4 years ago by Ivan ▴ 60

1

Entering edit mode

You can simply download the transcript and protein sequences for Astyanax mexicanus from NCBI.

Otherwise you can use AGAT (Extracting genomic feature sequences from GTF/GFF files with AGAT ) for extracting this information from genome file.

ADD REPLY • link 2.4 years ago by GenoMax 142k

score 0 · Answer 1 · 2022-01-04

Hi Ivan,
To create a table like you described in your post, you can use grep and awk to extract the fields from the feature table. Here's how I did it:

grep "protein_coding" GCF_000372685.2_Astyanax_mexicanus-2.0_feature_table.txt | awk -F '\t' '{print substr($5,0,5)$6,$8,$9,$16}' > GCF_000372685.2_Astyanax_mexicanus_protein_coding.txt

I used grep to extract all lines containing "protein_coding"; I piped the output to awk, and extracted the fields 5 (seq_type, and only the first five characters, so you have chrom), 6 (chromosome number), 8 (start), 9 (end) and 16 (GeneID). The fields 5 and 6 are concatenated, so you have only one column with the chromosome info. Those entries without chromosome info (unplaced scaffolds) will appear as unpla. I got only 25,117 protein coding from the feature table. Not sure about the discrepancy though.

I hope it helps :)