Question: Batch rename RefSeq assembly for the corresponding organism name
8 months ago, genomes_and_MGEs (0) wrote:

Hey everyone,

I just downloaded several genomes from NCBI Assembly, say all E. coli genomes. After unzipping, I have several files named by their RefSeq accession. I want to batch rename these files, replacing each accession with the corresponding organism name. For example, I would like to rename GCF_000005845.2.genomic.fna to Escherichia coli str. K-12. Could you please help me with this? Thank you.

assembly genome • 355 views
modified 5 months ago • written 8 months ago by genomes_and_MGEs (0)

This might not work for your data. Can you show us the content of the headers of your fastas?

How do you want to handle the cases where 2 sequences share the same strain name?
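One way to see whether this matters for a given download is to list the organism names that occur more than once. This is only a minimal sketch, assuming a tab-separated table with the accession in column 1 and the organism name in column 2; the function name `shared_names` and the file name `table.tsv` are made up for illustration.

```shell
# List organism names that appear for more than one accession.
# Assumes a tab-separated table: accession <TAB> organism name.
shared_names() {
    cut -f2 "$1" | sort | uniq -d
}
```

Running `shared_names table.tsv` prints each repeated name once; empty output means every name is unique.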

written 8 months ago by jrj.healey (13k)

How do you want to handle the cases where 2 sequences share the same strain name?

Perhaps prepend the organism name to the GCF accession that's already in the filename?

You can get the organism name for a given GCF accession in a two-column format using Entrez Direct as shown below. I removed dots and replaced all spaces in the organism name with underscores so that the final filenames will be more manageable.

esearch -db assembly -q 'GCF_000005845.2' | esummary | xtract -pattern DocumentSummary -element AssemblyAccession Organism | sed -r 's/ /_/g; s/\.//g'
GCF_0000058452  Escherichia_coli_str_K-12_substr_MG1655_(E_coli)
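Note that the sed call above also strips the dot from the accession version (GCF_000005845.2 becomes GCF_0000058452). If you would rather keep the accession intact, something like the following might work. It is a sketch, assuming the xtract output is tab-separated (accession, then organism); `clean_names` is a hypothetical helper name, and dropping parentheses is an extra choice of mine so the names need no shell quoting.

```shell
# Clean only the organism column; leave the accession (column 1) untouched.
# Input on stdin: accession <TAB> organism name, one pair per line.
clean_names() {
    awk -F'\t' 'BEGIN { OFS = "\t" } {
        gsub(/[ ]+/, "_", $2)   # spaces -> underscores
        gsub(/[.()]/, "", $2)   # drop dots and parentheses
        print $1, $2
    }'
}
```

It would be used by piping the xtract output into `clean_names` in place of the sed step.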
modified 8 months ago • written 8 months ago by vkkodali (1.1k)

Hey guys,

Thank you for your answers. I have a folder containing several individual genome FASTA files. Each file may be a multi-FASTA or a single complete genome. Either way, each file corresponds to one strain and is named after that strain's RefSeq accession. Your command should work fine for my data. I have a list.txt containing all the RefSeq accessions. Is it possible to run this command with the list as the query? Also, once I have the two-column output, how can I batch rename the files from RefSeq accession to organism name? I guess I should write a Python or Perl script for that, but I'm no pro in bioinformatics :D Thank you guys again for your time. Cheers
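For the list.txt part, a plain shell loop over the file is probably enough. The sketch below assumes one accession per line and that Entrez Direct (esearch, esummary, xtract) is installed; the function names `lookup_organism` and `build_table` are invented for this example. The `< /dev/null` is there because, in at least some Entrez Direct versions, esearch reads standard input and would otherwise swallow the rest of the list after the first accession.

```shell
# Look up one accession; prints: accession <TAB> organism name.
# Requires Entrez Direct (esearch, esummary, xtract) on PATH.
lookup_organism() {
    # </dev/null keeps esearch from consuming the loop's input
    esearch -db assembly -query "$1" < /dev/null |
        esummary |
        xtract -pattern DocumentSummary -element AssemblyAccession Organism
}

# Read accessions from a file (one per line) and print one table row each.
build_table() {
    while IFS= read -r acc; do
        [ -n "$acc" ] || continue   # skip blank lines
        lookup_organism "$acc"
    done < "$1"
}
```

Something like `build_table list.txt > table.tsv` would then give the two-column table for all accessions in one go, at the cost of one NCBI request per line.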

written 8 months ago by genomes_and_MGEs (0)

This is not particularly clear to me. Can you show us a small example of the file structure you have? You can use the tree program to get an easy output (you may need to install it from apt or similar)

written 8 months ago by jrj.healey (13k)

You don't need a python/perl script for this. You can do this in bash. I made a few assumptions:
1. You are going to use Assembly Accession + Organism name as your new filename.
2. All your accessions are unique, i.e., you don't have duplicate accessions with distinct versions such as Acc1.Ver1, Acc1.Ver2, etc.
3. You will manage to get the formatting as you want in the filenames.txt file using Entrez Direct and standard Unix commands.

$ ls GCF*
GCF_000001234.1.genomic.fna  GCF_000005678.1.genomic.fna
$ cat filenames.txt 
GCF_000001234.1_Organism_1
GCF_000005678.1_Organism_2
$ mkdir -p renamed_files ; for f in GCF* ; do x=$(echo "$f" | cut -f1 -d '.') ; of=$(grep -F "$x" filenames.txt) ; cp "$f" "renamed_files/$of.genomic.fna" ; done
$ ls renamed_files/
GCF_000001234.1_Organism_1.genomic.fna  GCF_000005678.1_Organism_2.genomic.fna
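If assumption 2 might not hold, or some accessions are missing from filenames.txt, a slightly more defensive version of the loop could look like this. It is a sketch, not a drop-in replacement: `copy_with_map` is a hypothetical name, and it assumes each line of the map file starts with the versioned accession followed by an underscore, as in the listing above. Files with no entry, or with more than one, are reported and skipped instead of being copied under a broken name.

```shell
# Copy each GCF* file in the current directory to renamed_files/ under the
# name found in a map file (one "accession_Organism" entry per line).
# Files with zero or several matching entries are reported and skipped.
copy_with_map() {
    map=$1
    mkdir -p renamed_files
    for f in GCF*; do
        [ -f "$f" ] || continue
        # Pull out the versioned accession, e.g. GCF_000001234.1
        acc=$(printf '%s\n' "$f" | sed -E 's/^(GCF_[0-9]+\.[0-9]+).*/\1/')
        n=$(grep -c -F -- "${acc}_" "$map")
        n=${n:-0}
        if [ "$n" -ne 1 ]; then
            echo "skipping $f: $n entries for $acc" >&2
            continue
        fi
        new=$(grep -F -- "${acc}_" "$map")
        cp -- "$f" "renamed_files/${new}.genomic.fna"
    done
}
```

Called as `copy_with_map filenames.txt` from the folder holding the GCF files. The originals are left in place, same as in the loop above.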
written 8 months ago by vkkodali (1.1k)

So, here's the partial content of my folder

.
|__GCF_003047065.1_ASM304706v1_genomic.fna
|__GCF_002863405.1_ASM286340v1_genomic.fna
|__GCF_000159355.1_ASM15935v1_genomic.fna
|__GCF_000159335.1_ASM15933v1_genomic.fna
...

For each GCF file there is a unique organism name, and I want to fetch it so that I can rename each file to the corresponding organism name. So maybe I should run the first esearch command you provided to get the two-column output. However, that command only works with a single query. Can you show me a way to get a column for all GCF files at once?

Then maybe I can use that output as the txt file in the loop you provided.
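Getting the accession column itself does not need NCBI at all, since the accession is the first two underscore-separated fields of each filename shown in the listing. A small sketch under that assumption (the function name `list_accessions` is made up):

```shell
# Print the assembly accession (e.g. GCF_003047065.1) for every GCF* file
# in a directory, one per line. The directory argument defaults to ".".
list_accessions() {
    for f in "${1:-.}"/GCF*; do
        [ -f "$f" ] || continue
        basename "$f" | cut -f1,2 -d '_'
    done
}
```

For example, `list_accessions . > list.txt` would write one accession per file in the current folder.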

modified 8 months ago by finswimmer (11k) • written 8 months ago by genomes_and_MGEs (0)

That's great, thanks!

written 8 months ago by genomes_and_MGEs (0)

Hey guys,

Another question: some of the outputs don't have the strain name. I guess the reason is that the organism name doesn't include that info. For example, see https://www.ncbi.nlm.nih.gov/assembly/GCF_003290365.1/. If I use

for f in GCF* ; do term=$(echo $f | cut -f1,2 -d'_') ; esearch -db assembly -q $term | esummary | xtract -pattern DocumentSummary -sep ' ' -element Organism,Strain,AssemblyAccession | sed 's/ /_/g' ; done > filenames.txt

the strain name doesn't appear in filenames.txt. Could you please let me know what I'm doing wrong?

Cheers

modified 5 months ago by genomax (69k) • written 5 months ago by genomes_and_MGEs (0)

If you have another question, please ask another question. Answers are for answers to the main question only.

written 5 months ago by jrj.healey (13k)
8 months ago, vkkodali (1.1k), United States, wrote:

Here are the steps:

$ ls -1 GCF*
GCF_000159335.1_ASM15933v1_genomic.fna
GCF_000159355.1_ASM15935v1_genomic.fna
GCF_002863405.1_ASM286340v1_genomic.fna
GCF_003047065.1_ASM304706v1_genomic.fna
$ for f in GCF* ; do term=$(echo $f | cut -f1,2 -d'_') ; esearch -db assembly -q $term | esummary | xtract -pattern DocumentSummary -sep ' ' -element AssemblyAccession,Organism | sed 's/ /_/g' ; done > filenames.txt
$ cat filenames.txt
GCF_000159335.1_Lactobacillus_jensenii_JV-V16_(firmicutes)
GCF_000159355.1_Lactobacillus_johnsonii_ATCC_33200_(firmicutes)
GCF_002863405.1_Lactobacillus_jensenii_(firmicutes)
GCF_003047065.1_Lactobacillus_acidophilus_(firmicutes)
$ mkdir -p renamed_files ; for f in GCF* ; do x=$(echo "$f" | cut -f1,2 -d '_') ; of=$(grep -F "$x" filenames.txt) ; cp "$f" "renamed_files/$of.genomic.fna" ; done
$ ls -1 renamed_files/
'GCF_000159335.1_Lactobacillus_jensenii_JV-V16_(firmicutes).genomic.fna'
'GCF_000159355.1_Lactobacillus_johnsonii_ATCC_33200_(firmicutes).genomic.fna'
'GCF_002863405.1_Lactobacillus_jensenii_(firmicutes).genomic.fna'
'GCF_003047065.1_Lactobacillus_acidophilus_(firmicutes).genomic.fna'
written 8 months ago by vkkodali (1.1k)