Question

Is it possible to download GenBank files from NCBI with organism names as the file names?

0

4.0 years ago

K.Gee ▴ 40

Hello, BIOstars!

I am looking to download a specific set of full genomes from NCBI (https://www.ncbi.nlm.nih.gov/assembly/). I searched for my organism of interest and then, using the 'Download assemblies' option, I was able to download all the available GenBank files. My issue is that those files have the accession number or project number as the file name, so I am asking whether it is possible to download the same files with the organism name as the file name instead.

Thanks in advance

NCBI genebank files • 1.2k views

ADD COMMENT • link 4.0 years ago by K.Gee ▴ 40

1

As you know, the answer to your specific question is no.

Since you have already downloaded the files with accession number names, it should be easy to rename them after the fact.

Possible solutions:
https://stackoverflow.com/questions/54078687/automatically-rename-fasta-files-with-the-id-of-the-first-sequence-in-each-file
https://stackoverflow.com/questions/53094543/rename-genome-fasta-files-with-part-of-sequence-header

ADD REPLY • link 4.0 years ago by GenoMax 141k

0

OK, I see. Thanks for your response. I was looking for a way to do this before downloading my files, but it's OK. The links you posted are useful for getting the idea. I think if I edit the commands for the FASTA files, I can apply them to the GenBank ones. ;-)

ADD REPLY • link 4.0 years ago by K.Gee ▴ 40

1

You could script the initial downloads so that files are renamed right after they are downloaded, but if you already have the files in hand, the solutions above avoid having to re-download the data.
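To illustrate the "rename right after download" idea, here is a minimal sketch, not a tested NCBI workflow. `organism_name` and `fetch_as_organism` are hypothetical helper names, the URL is whatever you pass in, and the sketch assumes the URL points to an uncompressed GenBank flat file (NCBI usually serves `.gbff.gz`, so a `gunzip` step would be needed there).

```shell
# Sketch only: download one GenBank flat file and name it after its organism.
# Both function names are made up for this example.

organism_name() {
    # First ORGANISM line only; drop the keyword, trim trailing spaces,
    # and replace the remaining spaces with underscores.
    sed -n 's/^ *ORGANISM *//p' "$1" | head -n 1 | sed 's/ *$//; s/ /_/g'
}

fetch_as_organism() (
    # Runs in a subshell, so the variables below do not leak to the caller.
    url="$1"
    tmp="$(mktemp)" || exit 1
    # -f: fail on HTTP errors, -sS: quiet but still report errors, -L: follow redirects
    if ! curl -fsSL -o "$tmp" "$url"; then
        rm -f "$tmp"
        exit 1
    fi
    name="$(organism_name "$tmp")"
    if [ -z "$name" ]; then
        echo "No ORGANISM line found in '$url'" >&2
        rm -f "$tmp"
        exit 1
    fi
    if [ -e "$name.gbff" ]; then
        echo "File '$name.gbff' already exists! Skipped '$url'" >&2
        rm -f "$tmp"
        exit 1
    fi
    mv "$tmp" "$name.gbff"
)
```

You would call it once per file, e.g. `fetch_as_organism "$url"` inside a loop over a list of URLs; the renamed file lands in the current directory.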

ADD REPLY • link 4.0 years ago by GenoMax 141k

0

4.0 years ago

K.Gee ▴ 40

I made an update...

  `for f in *.gbff; do d="$(grep ORGANISM "$f" | awk '{first=$1; $1=""; print $0}' | sed 's/^ *//; s/ /_/g').gbff"; if [ ! -f "$d" ]; then mv "$f" "$d"; else echo "File '$d' already exists! Skipped '$f'"; fi; done`

* Explanation: Usually, GenBank files contain 2-3 names in the ORGANISM field (depending on the nomenclature). My first command (the one using `awk '{print $2,$3,$4}'`) works when we have 3 names and we don't mind having spaces in the file name.

For example, let's say we have a Homo Homo sapiens GenBank file: with the first command we will obtain Homo Homo sapiens.gbff. The issue with the first command is that grep returns the line ORGANISM Homo Homo sapiens, which contains the word ORGANISM and some leading spaces. (I tried the -oP option to hide the word ORGANISM but it didn't work for me, so I used $2 to print the first word after ORGANISM.)

In this code:

By adding   `awk '{first=$1; $1=""; print $0}' | sed 's/^ *//; s/ /_/g'`

--> ORGANISM is "hidden" and all the spaces are replaced with _, so the output will be: Homo_Homo_sapiens.gbff

So it is ideal when we have a folder of GenBank files whose organism names have different numbers of words.
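For anyone adapting this, below is a hedged sketch of the same idea wrapped in a function. It is my own variation, not a drop-in replacement: `rename_gbff_by_organism` is a made-up name, it only looks at the first ORGANISM line (a multi-record .gbff file repeats the field once per record, which would otherwise put several lines into the file name), and it replaces every character outside letters, digits, `.`, `_` and `-` with `_`.

```shell
# Sketch: rename every *.gbff file in a directory after its first ORGANISM line.
# rename_gbff_by_organism is a hypothetical name chosen for this example.

rename_gbff_by_organism() (
    # Subshell body: the cd and the variables stay local to this call.
    cd "${1:-.}" || exit 1
    for f in *.gbff; do
        [ -e "$f" ] || continue   # the glob matched nothing
        # Keep only the first ORGANISM line and drop the keyword itself.
        name="$(sed -n 's/^ *ORGANISM *//p' "$f" | head -n 1)"
        # Trim trailing blanks, then make the rest safe for a file name.
        name="$(printf '%s' "$name" | sed 's/[[:space:]]*$//; s/[^A-Za-z0-9._-]/_/g')"
        if [ -z "$name" ]; then
            echo "No ORGANISM line in '$f', skipped" >&2
            continue
        fi
        d="$name.gbff"
        if [ "$f" = "$d" ]; then
            continue              # already has the right name
        elif [ -e "$d" ]; then
            echo "File '$d' already exists! Skipped '$f'" >&2
        else
            mv -- "$f" "$d"
        fi
    done
)
```

Run it as `rename_gbff_by_organism /path/to/folder` (or with no argument for the current directory). Files that would collide, for example two assemblies of the same species, are reported and left untouched, same as in the one-liner.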

ADD COMMENT • link 4.0 years ago by K.Gee ▴ 40

score 1 · Accepted Answer · 2020-04-06

1

4.0 years ago

K.Gee ▴ 40

I got what I wanted. I am posting the command I used for anybody who needs it in the future. Thanks again, @genomax, for the links.

  `for f in *.gbff; do d="$(grep ORGANISM "$f" | awk '{print $2,$3,$4}').gbff"; if [ ! -f "$d" ]; then mv "$f" "$d"; else echo "File '$d' already exists! Skipped '$f'"; fi; done`

ADD COMMENT • link 4.0 years ago by K.Gee ▴ 40