Question: Retrieve FASTA sequences for all genes from GRCh38.p13
0
gravatar for lokimoto
12 months ago by
lokimoto0
lokimoto0 wrote:

Hi,

I try to get all sequences for a large (about 17.000) list of NCBI gene Ids.

What I figured is one can use http://www.ensembl.org/biomart/martview/7cf551cc4abf51e75bb4f1d84477681e and then download the sequences but this is slow and only works for 500 gene ids at a time.

Is there somewhere a database / file available which I can use to retrieve the nucleotide sequences for the corresponding genes?

I figured there is https://www.ncbi.nlm.nih.gov/sites/batchentrez which can yield for genes:

 gene_id -> genomic_nucleotide_accession.version:start_position_on_the_genomic_accession-end_position_on_the_genomic_accession"

Then I used the "nucleotide" batch search. But when I enter e.g.

NC_000003.12:155869430-155944020

I get

An illegal character in a token. Possible wrong file format. Request processing canceled.

I figured one can download the genome from here ftp://ftp.ncbi.nih.gov/genomes/Homo_sapiens/current/GCF_000001405.39_GRCh38.p13/ but I am lost a bit here. In the readme it states

*_genomic.fna.gz file FASTA format of the genomic sequence(s) in the assembly. Repetitive sequences in eukaryotes are masked to lower-case (see below).

So I downloaded this file and I also grepped a bit in the file and can view for example NC_000003.12. But now I need also to retrieve the positions which is super slow when doing it the neive waw for example in python.

So my question is: How should I approach this task?

sequence assembly • 832 views
ADD COMMENTlink modified 12 months ago by vkkodali2.1k • written 12 months ago by lokimoto0
2
gravatar for vkkodali
12 months ago by
vkkodali2.1k
United States
vkkodali2.1k wrote:

I suggest you to take a look at the GCF_000001405.39_GRCh38.p13_feature_table.txt.gz file located in the FTP path you have specified. It is a tab-delimited table with information about all of the features annotated on the genome. If you have a list of NCBI GeneIDs, I'd join that list on GeneID field of the feature_table.txt.gz file, filter the rows that have gene in the first column and create a BED-like file using the data in columsn 7, 8, 9 and 10. Make sure to subtract 1 from the start/end positions depending on the strand because BED uses 0-based coordinate system whereas the coordinates in the feature_table.txt file are 1-based. Then, you can use bedtools getfasta in combination with the genome sequence FASTA files to extract the gene sequence in FASTA format.

ADD COMMENTlink written 12 months ago by vkkodali2.1k

Awesome thank you very much!! The only problem I am getting is, that the sequences seems to not match. E.g. looking at Gene ID 30849, I get the location from feature_table: "NC_000003.12:130678934 -130746829"

Now when I look up the gene at NCBI, I get this sequence: https://www.ncbi.nlm.nih.gov/nuccore/NC_000003.12?report=fasta&from=130678934&to=130746829&strand=true starting with

GAGGTCAGACCGGTTGCTTTCCCGGGAGTTCGGCG

But when I do

bedtools getfasta -fi GCF_000001405.39_GRCh38.p13_genomic.fna -bed test_gene_30849.bed

I get the sequence starting with

ACAGTTTAACTGCTTTGAATAgcatttttaatagcaaaaatgtGTTTATACTATAA

Also I don't know why there are small letters?

ADD REPLYlink written 12 months ago by lokimoto0
1

Your link is showing reverse complement sequence because the gene is on the opposite strand. bedtools getfasta has an option to fetch strand specific sequence if you include the strand in your input bed. I suggest you give that a try.

Also I don't know why there are small letters?

These are masked sequences. Take a look at the README file for more information about it.

ADD REPLYlink modified 12 months ago • written 12 months ago by vkkodali2.1k

Perfect, now everything works, thank you!

ADD REPLYlink written 12 months ago by lokimoto0
1
gravatar for piechota.marcin
12 months ago by
European Union
piechota.marcin70 wrote:
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/Homo_sapiens/current/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_assembly_structure/Primary_Assembly/assembled_chromosomes/FASTA/chr1.fna.gz
gzip -d chr1.fna.gz
samtools faidx chr1.fna
samtools faidx chr1.fna NC_000001.11:100000-100050

You can use samtools to create index for fasta file. Hope the above code helps. You can use xargs to iterate over bed file. I can put some code using xargs as a follow up if required.

ADD COMMENTlink modified 12 months ago • written 12 months ago by piechota.marcin70

Thank you very much! Similar to the sulution vkkodali posted I have problems finding the correct sequence.

E.g. looking at Gene ID 30849, the location is "NC_000003.12:130678934 -130746829"

from NCBI, I get this sequence: https://www.ncbi.nlm.nih.gov/nuccore/NC_000003.12?report=fasta&from=130678934&to=130746829&strand=true starting with

GAGGTCAGACCGGTTGCTTTCCCGGGAGTTCGGCG

But when I do

samtools faidx chr3.fna NC_000003.12:130678934-130746829

I get the sequence starting with

TACAGTTTAACTGCTTTGAATAGCATTTTTAATAGCAAAAATGTGTTTATACTATAATAG

Can you give me hints on why this is the case? (I used ftp://ftp.ncbi.nlm.nih.gov/genomes/Homo_sapiens/current/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_assembly_structure/Primary_Assembly/assembled_chromosomes/FASTA/chr3.fna.gz)

ADD REPLYlink modified 12 months ago • written 12 months ago by lokimoto0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2075 users visited in the last hour