Question: Retrieve reference sequence GenBank for mRNA transcript of interest
speycast wrote:

Hi,

I want to retrieve reference sequence GenBank files for genes, restricted to a transcript of interest, using E-utils. For example, the GFM1 gene has two mRNA transcripts in the region, XM_005247840.1 and NM_024996.5, and I only want the mRNA and CDS for NM_024996.5 in my GenBank reference files.

This is what I have so far:

esearch -db gene -query "GFM1[gene] AND human[orgn] AND alive[prop]" | efetch -format docsum | xtract -pattern DocumentSummary -block LocationHistType -if ChrAccVer -equals NC_000003.11 -tab "\n" -element ChrAccVer,ChrStart,ChrStop | awk -F '\t' '{{OFS = "\t"} if ($2 < $3) {print $1, $2+1, $3+1} else {print $1, $2+1, $3+1}}' | xargs -n 3 sh -c 'efetch -db nucleotide -id "$0" -seq_start "$1" -seq_stop "$2"'

But it gives me both transcripts, and I only want the one of interest.

Any help would be greatly appreciated!

genbank refseq mrna ncbi
modified 7 months ago by Biostar • written 9 months ago by speycast

There are multiple entries in RefSeq:

$  esearch -db gene -query "GFM1[gene] AND human[orgn] AND alive[prop]" | elink -db gene -target nuccore -name gene_nuccore_refseqrna | efetch -format fasta | grep ">"
>NM_024996.7 Homo sapiens G elongation factor mitochondrial 1 (GFM1), transcript variant 2, mRNA; nuclear gene for mitochondrial product
>NM_001374357.1 Homo sapiens G elongation factor mitochondrial 1 (GFM1), transcript variant 6, mRNA; nuclear gene for mitochondrial product
>NM_001374355.1 Homo sapiens G elongation factor mitochondrial 1 (GFM1), transcript variant 4, mRNA; nuclear gene for mitochondrial product
>NM_001374361.1 Homo sapiens G elongation factor mitochondrial 1 (GFM1), transcript variant 10, mRNA; nuclear gene for mitochondrial product
>NM_001374356.1 Homo sapiens G elongation factor mitochondrial 1 (GFM1), transcript variant 5, mRNA; nuclear gene for mitochondrial product
>NM_001374358.1 Homo sapiens G elongation factor mitochondrial 1 (GFM1), transcript variant 7, mRNA; nuclear gene for mitochondrial product
>NM_001374360.1 Homo sapiens G elongation factor mitochondrial 1 (GFM1), transcript variant 9, mRNA; nuclear gene for mitochondrial product
>NM_001374359.1 Homo sapiens G elongation factor mitochondrial 1 (GFM1), transcript variant 8, mRNA; nuclear gene for mitochondrial product
>NM_001308166.1 Homo sapiens G elongation factor mitochondrial 1 (GFM1), transcript variant 3, mRNA
>NR_164502.1 Homo sapiens G elongation factor mitochondrial 1 (GFM1), transcript variant 14, non-coding RNA
>NR_164499.1 Homo sapiens G elongation factor mitochondrial 1 (GFM1), transcript variant 11, non-coding RNA
>NR_164500.1 Homo sapiens G elongation factor mitochondrial 1 (GFM1), transcript variant 12, non-coding RNA
>NR_164501.1 Homo sapiens G elongation factor mitochondrial 1 (GFM1), transcript variant 13, non-coding RNA
>NM_001308164.1 Homo sapiens G elongation factor mitochondrial 1 (GFM1), transcript variant 1, mRNA
written 9 months ago by GenoMax

Thanks genomax, I'm new to this. I'd like .gb files for the GRCh37 assembly. With my command above, I was able to get the .gb file, but it contains both mRNA transcripts. I'm having trouble downloading a .gb reference for just NM_024996.5.

written 9 months ago by speycast

You can get the GenBank format sequence by doing:

$ efetch -db nuccore -id "NM_024996" -format gb > NM_024996.gbk

These are RefSeq accessions and they are not tied to a particular genome build.
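
A minimal sketch of the same call wrapped in a small helper, in case it is useful. The function name `fetch_refseq_gb` is hypothetical and not part of Entrez Direct; only `efetch -db nuccore -id ... -format gb` comes from the command above. A bare accession such as NM_024996 returns whatever version is current (NM_024996.7 in the listing earlier in this thread), so the helper checks the VERSION line of the downloaded flatfile against what was requested. Whether an older version such as NM_024996.5 can still be fetched by its full accession.version is an assumption to verify, not something this sketch guarantees.

```shell
# Hypothetical helper (not part of Entrez Direct): fetch one RefSeq record
# as a GenBank flatfile and confirm which accession.version came back.
# Usage: fetch_refseq_gb ACCESSION[.VERSION]
fetch_refseq_gb() {
    acc="$1"
    if [ -z "$acc" ]; then
        echo "usage: fetch_refseq_gb ACCESSION[.VERSION]" >&2
        return 1
    fi
    out="${acc}.gbk"
    efetch -db nuccore -id "$acc" -format gb > "$out" || return 1
    # The VERSION line holds accession.version. A bare accession accepts
    # any version; a versioned accession must match exactly.
    if awk -v a="$acc" '
        $1 == "VERSION" && ($2 == a || index($2, a ".") == 1) { ok = 1 }
        END { exit !ok }' "$out"; then
        echo "wrote $out"
    else
        echo "warning: $out does not contain $acc" >&2
        return 2
    fi
}
```

Calling `fetch_refseq_gb NM_024996.5` would then either write NM_024996.5.gbk or warn that a different version was returned.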

modified 9 months ago • written 9 months ago by GenoMax

Yes, I understand. But I do need the genomic reference sequence for the gene that encodes this transcript, with chromosomal (g.) numbering relative to the gene.

The GenBank file should contain the gene GFM1 on build 37 (NC_000003.11) and the mRNA transcript NM_024996.5.

Thanks..

written 9 months ago by speycast

Entrez Direct will only return the latest data, and human build 37 (GRCh37) is no longer actively annotated. RefSeq releases an update for it only once in a while. The latest update is here: https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/9606/105.20190906/

You can download the GenBank flatfiles from that location as well as data in other formats. Keep in mind though that the data there are current as of the annotation update release date. Any additional updates to the RefSeqs made since that release are currently only available for GRCh38. It's not uncommon to come across RefSeq transcripts that are live but not present in the GRCh37 data, because those RefSeq transcripts were created after the GRCh37 105.20190906 annotation release.
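
Once a genomic GenBank flatfile is in hand, a rough way to keep only the mRNA and CDS of one transcript is to filter the feature table. The sketch below is illustrative only: `filter_gb_transcript` is a hypothetical name, and it assumes the standard flatfile layout (feature keys indented 5 spaces, qualifiers indented further), that mRNA features carry a `/transcript_id` qualifier, and that CDS features carry a `/protein_id` qualifier. The protein accession paired with NM_024996.5 is not given in this thread and has to be looked up, for example from the CDS feature of the transcript record. All other feature types (source, gene, exon and so on) pass through unchanged.

```shell
# Hypothetical filter: keep only the mRNA feature with the given
# /transcript_id and the CDS feature with the given /protein_id.
# Reads a GenBank flatfile on stdin, writes the filtered file to stdout.
# Usage: filter_gb_transcript TRANSCRIPT_ACC.VER PROTEIN_ACC.VER < in.gb
filter_gb_transcript() {
    awk -v tx="$1" -v prot="$2" '
    # Print the buffered feature unless it is an unwanted mRNA or CDS.
    function emit() {
        if (buf != "") {
            keep = 1
            if (key == "mRNA")
                keep = (index(buf, "/transcript_id=\"" tx "\"") > 0)
            else if (key == "CDS")
                keep = (index(buf, "/protein_id=\"" prot "\"") > 0)
            if (keep) printf "%s", buf
        }
        buf = ""; key = ""
    }
    /^FEATURES/       { infeat = 1; print; next }
    # A line starting in column 1 (ORIGIN, CONTIG, //) ends the table.
    infeat && /^[^ ]/ { emit(); infeat = 0 }
    !infeat           { print; next }
    # Exactly 5 leading spaces marks the start of a new feature.
    match($0, /^ +/) && RLENGTH == 5 { emit(); key = $1 }
                      { buf = buf $0 "\n" }
    END               { emit() }
    '
}

# Example call (file names and the NP_ accession are placeholders):
#   filter_gb_transcript NM_024996.5 NP_XXXXXX.X < GFM1_region.gb > GFM1_NM_024996.5.gb
```

Because the feature table state resets at each FEATURES line, the same filter should also work on a multi-record flatfile piped through `gunzip -c`.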

written 7 months ago by vkkodali