RefSeq transcripts *.rna.fna.gz files
6.0 years ago
seqall • 0

A very basic question:

What's the difference between the "*.rna.fna.gz" files at: ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/mRNA_Prot/

and "GRCh38_latest_rna.fna.gz" at: ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh38_latest/refseq_identifiers/?

Are the RNA sequences in "GRCh38_latest_rna.fna.gz" exactly the same as the ones in the combination of all "*.rna.fna.gz"?

Thanks a lot!

sequence RNA-Seq • 2.7k views

From what I can tell, the files in the directory ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh38_latest/refseq_identifiers/ are just symbolic links to data elsewhere on the NCBI servers. The links are updated so that they always point to the most recent files.

GRCh38_latest_rna.fna.gz contains the sequences of all RefSeq transcripts (coding NM/XM and non-coding NR/XR). Counting the headers by accession prefix gives:

grep -e ">" GRCh38_latest_rna.fna | cut -f1 -d "_" | sort | uniq -c
50052 >NM
15544 >NR
63555 >XM
30847 >XR

The files in the other directory are split into several chunks, for whatever reason, but the sequences in them are exactly the same as those in GRCh38_latest_rna.fna.gz. According to the README files, these chunks also contain all RefSeq transcripts.
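Header counts alone do not show whether the sequences themselves match. A minimal sketch of one way to check this is below: it flattens each FASTA file to one line per record and compares checksums. The function names `fasta_to_tab` and `same_records` are hypothetical, and the sketch assumes the first word of each header is the accession and that line wrapping may differ between the two files.

```shell
# Flatten a FASTA file to one "accession<TAB>sequence" line per record,
# sorted by accession. Only the first word of each header is kept, so
# differences in descriptions or line wrapping are ignored.
fasta_to_tab() {
  awk '
    /^>/ { if (id != "") print id "\t" seq
           id = substr($1, 2); seq = ""; next }
         { seq = seq $0 }
    END  { if (id != "") print id "\t" seq }
  ' "$1" | LC_ALL=C sort
}

# Print "identical" if both files hold the same accessions with the same
# sequences, otherwise "different".
same_records() {
  if [ "$(fasta_to_tab "$1" | cksum)" = "$(fasta_to_tab "$2" | cksum)" ]; then
    echo "identical"
  else
    echo "different"
  fi
}
```

For example, `same_records GRCh38_latest_rna.fna combined.fna`, where `combined.fna` stands for the decompressed, concatenated chunks.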


Thanks a lot for the info, Kevin! After investigating and asking about these two sets, it turns out that the "*.rna.fna.gz" files at ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/mRNA_Prot/ are the latest and contain newer versions of some transcripts. Counting their headers by accession prefix gives:

grep -e ">" human.all.rna.fna | cut -f1 -d "_" | sort | uniq -c
50102 >NM
15535 >NR
63532 >XM
30843 >XR
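A combined file such as human.all.rna.fna can be built by decompressing and concatenating the chunks. The sketch below is one way to do that. The function name `combine_chunks` is hypothetical, and the pattern `human.*.rna.fna.gz` is an assumption about how the chunk files are named after download.

```shell
# Decompress every chunk in a directory and concatenate them into one FASTA file.
# Assumption: the chunks are named human.<something>.rna.fna.gz.
combine_chunks() {
  chunk_dir=$1
  out=$2
  gzip -dc "$chunk_dir"/human.*.rna.fna.gz > "$out"
}
```

For example, `combine_chunks mRNA_Prot human.all.rna.fna`, where `mRNA_Prot` stands for a local copy of the FTP directory.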

I then compared all the transcript IDs:

grep -e ">" GRCh38_latest_rna.fna | cut -f1 -d " " | sort > refseq_ids_GRCh38_latest
grep -e ">" human.all.rna.fna | cut -f1 -d " " | sort > refseq_ids_human_all
diff refseq_ids_human_all refseq_ids_GRCh38_latest

Part of the differences (lines marked "<" come from human.all.rna.fna, lines marked ">" from GRCh38_latest_rna.fna):

41c41
< >NM_000054.5
---
> >NM_000054.4
83c83
< >NM_000097.6
---
> >NM_000097.5
468c468
< >NM_000500.8
---
> >NM_000500.7
1944,1945c1944,1945
< >NM_001008703.3
< >NM_001008704.3
---
> >NM_001008703.2
> >NM_001008704.2
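The raw diff mixes version bumps with transcripts that exist in only one file. A rough sketch for listing only the accessions whose version changed is below. The function names `accession_versions` and `version_changes` are hypothetical, and the sketch assumes every header starts with an ID of the form ACCESSION.VERSION followed by a space.

```shell
# Print "accession<TAB>version" for every header in a FASTA file,
# sorted by accession.
accession_versions() {
  grep '^>' "$1" | cut -d ' ' -f 1 | sed 's/^>//' \
    | awk -F '.' '{ print $1 "\t" $2 }' | LC_ALL=C sort -k1,1
}

# List accessions present in both files whose versions differ.
# Output columns: accession, version in file 1, version in file 2.
version_changes() {
  a=$(mktemp); b=$(mktemp)
  accession_versions "$1" > "$a"
  accession_versions "$2" > "$b"
  LC_ALL=C join "$a" "$b" | awk '$2 != $3 { print $1 "\t" $2 "\t" $3 }'
  rm -f "$a" "$b"
}
```

For example, `version_changes human.all.rna.fna GRCh38_latest_rna.fna` would print one line per accession that appears in both files with different versions.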

Interesting. Perhaps feed this back to the NCBI (if not already done).
