Parse ncbiRefSeq (and similar) annotations
23 months ago
onestop_data ▴ 330

Hi there - I'm trying to find a way to parse an annotation file: http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/ncbiRefSeq.txt.gz

I'd like to use the fields "chromStart, chromEnd, cdsStart, cdsEnd, exonStarts, exonEnds, strand" to get the start and end coordinates of the 5' UTRs, 3' UTRs, exons, and introns.

Is there a tool or Python library that can help parse this type of data?

Here are all the fields: https://genome.ucsc.edu/cgi-bin/hgTables?db=hg19&hgta_group=genes&hgta_track=wgEncodeGencodeV19&hgta_table=wgEncodeGencodePolyaV19&hgta_doSchema=describe+table+schema

Thanks

refseq • 639 views
23 months ago
vkkodali_ncbi ★ 3.7k

Not quite a solution for parsing UCSC's ncbiRefSeq.txt.gz files, but if you are interested in RefSeq annotation, you may want to download the GFF3 file for GRCh38 (hg38) using NCBI Datasets; direct link here (click the 'Download' button and choose GFF3).

Once you have the GFF3 file, you can use the Python script add_utrs_to_gff.py from here to add the UTR annotations to it. From there, you can use a GFF3 parsing tool such as AGAT to extract the specific data you are interested in.
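If you would rather work directly from the UCSC table, the features asked about can be derived from the row fields with plain Python. What follows is only a minimal sketch, and it rests on some assumptions: that the file uses UCSC's extended genePred layout with a leading bin column (bin, name, chrom, strand, txStart, txEnd, cdsStart, cdsEnd, exonCount, exonStarts, exonEnds, ...), that coordinates are 0-based and half-open, and that non-coding transcripts have cdsStart equal to cdsEnd. The function names are hypothetical, not part of any library. Please check the column order against the table schema before relying on it.

```python
import gzip


def parse_genepred_line(line):
    """Split one tab-separated row into the fields used below.

    Column positions assume a leading 'bin' column (an assumption;
    verify against the table schema).
    """
    f = line.rstrip("\n").split("\t")
    return {
        "name": f[1],
        "chrom": f[2],
        "strand": f[3],
        "cds_start": int(f[6]),
        "cds_end": int(f[7]),
        # exonStarts/exonEnds are comma-separated with a trailing comma
        "exon_starts": [int(x) for x in f[9].split(",") if x],
        "exon_ends": [int(x) for x in f[10].split(",") if x],
    }


def transcript_features(rec):
    """Return exons, introns and UTR pieces as (start, end) tuples."""
    exons = list(zip(rec["exon_starts"], rec["exon_ends"]))
    # An intron is the gap between two consecutive exons.
    introns = [(exons[i][1], exons[i + 1][0]) for i in range(len(exons) - 1)]

    cds_start, cds_end = rec["cds_start"], rec["cds_end"]
    left, right = [], []
    if cds_start < cds_end:  # skip transcripts without a CDS
        for start, end in exons:
            if start < cds_start:  # exon part before the CDS
                left.append((start, min(end, cds_start)))
            if end > cds_end:  # exon part after the CDS
                right.append((max(start, cds_end), end))

    # On the minus strand the 5' UTR lies at the higher coordinates.
    if rec["strand"] == "+":
        five, three = left, right
    else:
        five, three = right, left
    return {
        "exons": exons,
        "introns": introns,
        "five_prime_utr": five,
        "three_prime_utr": three,
    }


def read_genepred(path):
    """Yield one parsed record per row of a gzipped genePred table."""
    with gzip.open(path, "rt") as handle:
        for line in handle:
            yield parse_genepred_line(line)
```

To use it, loop over `read_genepred("ncbiRefSeq.txt.gz")` and call `transcript_features` on each record. The UTR pieces are returned per exon, so a UTR that spans several exons comes back as several intervals.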
