Question

Parse ncbiRefSeq (other like) annotations

2.2 years ago

onestop_data ▴ 330

Hi there, I'm trying to find a way to parse the annotation file http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/ncbiRefSeq.txt.gz and use the fields "chromStart, chromEnd, cdsStart, cdsEnd, exonStarts, exonEnds, strand" to get the start and end coordinates of the 5' UTR, 3' UTR, exons, and introns.

Is there a tool or python library that can help parse this type of data?

Here are all the fields: https://genome.ucsc.edu/cgi-bin/hgTables?db=hg19&hgta_group=genes&hgta_track=wgEncodeGencodeV19&hgta_table=wgEncodeGencodePolyaV19&hgta_doSchema=describe+table+schema

Thanks

refseq • 715 views

updated 2.2 years ago by vkkodali_ncbi ★ 3.7k • written 2.2 years ago by onestop_data ▴ 330

score 1 · Answer 1 · 2022-05-24

This is not quite a solution for parsing UCSC's ncbiRefSeq.txt.gz files, but if you are interested in RefSeq annotation, you may want to download the GFF3 file for GRCh38 (hg38) using NCBI Datasets; direct link here (click the 'Download' button and choose GFF3).

Once you have the GFF3 file, you can use the Python script add_utrs_to_gff.py from here to add the UTR annotations to it. From there, you can use a GFF3 parsing tool such as AGAT to extract the specific features you are interested in.
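If you would rather work from the UCSC file directly, the coordinates asked about can be derived from the genePred columns with plain Python. The following is only a minimal sketch, not a tested tool. It assumes the standard UCSC genePredExt column order with a leading bin column (bin, name, chrom, strand, txStart, txEnd, cdsStart, cdsEnd, exonCount, exonStarts, exonEnds, ...), where txStart/txEnd are the transcript bounds, and 0-based half-open coordinates. It also assumes non-coding rows have cdsStart equal to cdsEnd. The function names are made up for illustration.

```python
import gzip


def parse_genepred_line(line):
    """Split one genePredExt row (with a leading bin column) into a dict."""
    f = line.rstrip("\n").split("\t")
    return {
        "name": f[1],
        "chrom": f[2],
        "strand": f[3],
        "tx_start": int(f[4]),
        "tx_end": int(f[5]),
        "cds_start": int(f[6]),
        "cds_end": int(f[7]),
        # exonStarts/exonEnds are comma-separated with a trailing comma
        "exon_starts": [int(x) for x in f[9].split(",") if x],
        "exon_ends": [int(x) for x in f[10].split(",") if x],
    }


def transcript_features(tx):
    """Derive exons, introns and UTR pieces as (start, end) tuples."""
    exons = list(zip(tx["exon_starts"], tx["exon_ends"]))
    # An intron is the gap between two consecutive exons.
    introns = [(exons[i][1], exons[i + 1][0]) for i in range(len(exons) - 1)]

    cds_start, cds_end = tx["cds_start"], tx["cds_end"]
    left, right = [], []  # UTR pieces left/right of the CDS in genome order
    if cds_start < cds_end:  # assumption: cdsStart == cdsEnd means non-coding
        for start, end in exons:
            if start < cds_start:
                left.append((start, min(end, cds_start)))
            if end > cds_end:
                right.append((max(start, cds_end), end))

    # On the minus strand the 5' UTR lies to the right of the CDS.
    if tx["strand"] == "+":
        five_utr, three_utr = left, right
    else:
        five_utr, three_utr = right, left

    return {
        "exons": exons,
        "introns": introns,
        "five_utr": five_utr,
        "three_utr": three_utr,
    }


def read_genepred(path):
    """Yield one parsed transcript per row of a gzipped genePred file."""
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.strip():
                yield parse_genepred_line(line)
```

Usage would be something like `for tx in read_genepred("ncbiRefSeq.txt.gz"): feats = transcript_features(tx)`. To write the results out as GFF or another 1-based format, add 1 to each start coordinate.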