Question: Batch rename RefSeq assembly files to the corresponding organism name
joaobotelho90 wrote:

Hey everyone,

I just downloaded several genomes from NCBI Assembly. Let's say I downloaded all E. coli genomes. After unzipping, I have several files with the RefSeq accession as the file name. My goal is to batch rename those files, replacing each accession with the corresponding organism name. For example, I would like to rename the file GCF_000005845.2.genomic.fna to Escherichia coli str. K-12. Could you please help me with this? Thank you

ADD COMMENTlink modified 6 days ago • written 11 days ago by joaobotelho90

This might not work for your data. Can you show us the content of the headers of your fastas?

How do you want to handle the cases where 2 sequences share the same strain name?

ADD REPLYlink written 11 days ago by jrj.healey8.8k
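
One way to check for the name collisions raised here is to list organism names that occur more than once before renaming anything. This is a minimal sketch, assuming a tab-separated file with the accession in column 1 and the organism name in column 2; the function name `duplicate_names` and the file name `table.tsv` are hypothetical.

```shell
# Hypothetical helper: print every organism name that appears more than once.
# Assumed input: tab-separated file, accession in column 1, organism in column 2.
duplicate_names() {
    cut -f2 "$1" | sort | uniq -d   # uniq -d keeps one copy of each repeated line
}

# Usage (table.tsv is a placeholder name):
#   duplicate_names table.tsv
```

If this prints anything, renaming by organism name alone would overwrite files, so some unique part (such as the accession) needs to stay in the filename.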

"How do you want to handle the cases where 2 sequences share the same strain name?"

Perhaps prepend the organism name to the GCF accession that's already in the filename?

You can get the organism name for a given GCF accession in a two-column format using Entrez Direct as shown below. I removed dots and replaced all spaces in the organism name with underscores so that the final filenames will be more manageable.

esearch -db assembly -q 'GCF_000005845.2' | esummary | xtract -pattern DocumentSummary -element AssemblyAccession Organism | sed -r 's/ /_/g; s/\.//g'
GCF_0000058452  Escherichia_coli_str_K-12_substr_MG1655_(E_coli)
ADD REPLYlink modified 11 days ago • written 11 days ago by vkkodali450
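
Note that the sed expression above also strips the dot from the accession version (GCF_000005845.2 becomes GCF_0000058452). If the version dot should survive, the cleanup can be limited to the organism column. A minimal sketch, assuming the input is tab-separated accession and organism as produced by xtract; the function name `sanitize_names` is hypothetical.

```shell
# Hypothetical helper: reads "accession<TAB>organism" lines on stdin and
# cleans only the organism column, leaving the accession untouched.
sanitize_names() {
    awk -F '\t' 'BEGIN { OFS = "\t" } {
        name = $2
        gsub(/\./, "", name)   # drop dots from the organism name only
        gsub(/ /, "_", name)   # replace spaces with underscores
        print $1, name
    }'
}

# Example with the record from the post above:
printf 'GCF_000005845.2\tEscherichia coli str. K-12 substr. MG1655 (E. coli)\n' | sanitize_names
```

For that record the accession column stays GCF_000005845.2 and the organism column becomes Escherichia_coli_str_K-12_substr_MG1655_(E_coli).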

Hey guys,

Thank you for your answers. So, I have a folder containing several individual genome fasta files. Each file may be a multi-fasta or a complete genome. Either way, each file belongs to a single strain and is named after that strain's RefSeq accession. Your command should work fine for my data. I have a list.txt containing all the RefSeq accessions. Is it possible to run this command with the list as the query? Also, once I have the two-column output, how can I write a command to batch rename each file from its RefSeq accession to the given organism name? I guess I should write a python or perl script for that, but I'm no pro in bioinformatics :D Thank you guys again for your time. Cheers

ADD REPLYlink written 8 days ago by joaobotelho90
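
For the list.txt part of the question, one possible approach is to loop over the list and run the same Entrez Direct pipeline once per accession. This is only a sketch: it assumes Entrez Direct (esearch, esummary, xtract) is installed and that list.txt holds one accession per line; the function name `accessions_to_table` and the output name `table.tsv` are hypothetical.

```shell
# Hypothetical helper: print "accession<TAB>organism" for every accession
# listed in the file given as the first argument (one accession per line).
accessions_to_table() (
    list=$1
    while IFS= read -r acc; do
        [ -n "$acc" ] || continue   # skip blank lines
        # </dev/null keeps esearch from reading the rest of the list on stdin
        esearch -db assembly -query "$acc" < /dev/null |
            esummary |
            xtract -pattern DocumentSummary -element AssemblyAccession Organism
    done < "$list"
)

# Usage:
#   accessions_to_table list.txt > table.tsv
```

One request per accession is slow for long lists, but it keeps the mapping between query and result easy to follow.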

This is not particularly clear to me. Can you show us a small example of the file structure you have? You can use the tree program to get an easy output (you may need to install it from apt or similar)

ADD REPLYlink written 8 days ago by jrj.healey8.8k

You don't need a python/perl script for this. You can do this in bash. I made a few assumptions:

1. You are going to use Assembly Accession + Organism name as your new filename.
2. All your accessions are unique, i.e., you don't have duplicate accessions with distinct versions such as Acc1.Ver1, Acc1.Ver2, etc.
3. You will manage to get the formatting as you want in the filenames.txt file using Entrez Direct and standard Unix commands.

$ ls GCF*
GCF_000001234.1.genomic.fna  GCF_000005678.1.genomic.fna
$ cat filenames.txt 
GCF_000001234.1_Organism_1
GCF_000005678.1_Organism_2
$ mkdir -p renamed_files ; for f in GCF* ; do x=$(echo "$f" | cut -f1 -d '.') ; of=$(grep "$x" filenames.txt) ; cp "$f" "renamed_files/$of.genomic.fna" ; done
$ ls renamed_files/
GCF_000001234.1_Organism_1.genomic.fna  GCF_000005678.1_Organism_2.genomic.fna
ADD REPLYlink written 8 days ago by vkkodali450
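
The loop above copies a file even when grep finds no line, or more than one line, for an accession. A more cautious variant could check for exactly one match first. This is a sketch under the same assumptions as the post (filenames.txt holds one "accession_organism" entry per line); the function name `safe_copy_renamed` is hypothetical.

```shell
# Hypothetical helper: copy every GCF* file in the current directory into
# OUTDIR under the name listed in NAMES, skipping ambiguous or missing entries.
# Usage: safe_copy_renamed NAMES OUTDIR
safe_copy_renamed() (
    names=$1
    outdir=$2
    mkdir -p "$outdir"
    for f in GCF*; do
        [ -e "$f" ] || continue   # glob matched nothing
        # Pull out the accession with its version, e.g. GCF_000001234.1
        acc=$(printf '%s\n' "$f" | sed 's/^\(GC[AF]_[0-9]*\.[0-9]*\).*/\1/')
        # -F matches the accession literally; the trailing "_" keeps
        # version .1 from also matching version .10
        n=$(grep -c -F -- "${acc}_" "$names" || true)
        if [ "$n" -ne 1 ]; then
            echo "skipping $f: $n matches for $acc" >&2
            continue
        fi
        new=$(grep -F -- "${acc}_" "$names")
        cp "$f" "$outdir/$new.genomic.fna"
    done
)

# Usage:
#   safe_copy_renamed filenames.txt renamed_files
```

Files that are skipped are reported on stderr, so nothing is silently copied under an empty or multi-line name.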

So, here's the partial content of my folder

.
|__GCF_003047065.1_ASM304706v1_genomic.fna
|__GCF_002863405.1_ASM286340v1_genomic.fna
|__GCF_000159355.1_ASM15935v1_genomic.fna
|__GCF_000159335.1_ASM15933v1_genomic.fna
...

For each GCF file, there's a unique organism name, and I want to fetch it so that I can rename each GCF file after the corresponding organism. So, maybe I should run the first esearch command you provided to retrieve the two-column output. However, that command only works with a single query. Can you suggest a way to get the output for all GCF files at once?

Then, maybe I can use this column as the txt file in the loop you provided

ADD REPLYlink modified 7 days ago by finswimmer7.0k • written 7 days ago by joaobotelho90
vkkodali450 wrote:

Here are the steps:

$ ls -1 GCF*
GCF_000159335.1_ASM15933v1_genomic.fna
GCF_000159355.1_ASM15935v1_genomic.fna
GCF_002863405.1_ASM286340v1_genomic.fna
GCF_003047065.1_ASM304706v1_genomic.fna
$ for f in GCF* ; do term=$(echo "$f" | cut -f1,2 -d'_') ; esearch -db assembly -query "$term" | esummary | xtract -pattern DocumentSummary -sep ' ' -element AssemblyAccession,Organism | sed 's/ /_/g' ; done > filenames.txt
$ cat filenames.txt
GCF_000159335.1_Lactobacillus_jensenii_JV-V16_(firmicutes)
GCF_000159355.1_Lactobacillus_johnsonii_ATCC_33200_(firmicutes)
GCF_002863405.1_Lactobacillus_jensenii_(firmicutes)
GCF_003047065.1_Lactobacillus_acidophilus_(firmicutes)
$ mkdir -p renamed_files ; for f in GCF* ; do x=$(echo "$f" | cut -f1,2 -d '_') ; of=$(grep -F "$x" filenames.txt) ; cp "$f" "renamed_files/$of.genomic.fna" ; done
$ ls -1 renamed_files/
'GCF_000159335.1_Lactobacillus_jensenii_JV-V16_(firmicutes).genomic.fna'
'GCF_000159355.1_Lactobacillus_johnsonii_ATCC_33200_(firmicutes).genomic.fna'
'GCF_002863405.1_Lactobacillus_jensenii_(firmicutes).genomic.fna'
'GCF_003047065.1_Lactobacillus_acidophilus_(firmicutes).genomic.fna'
ADD COMMENTlink written 7 days ago by vkkodali450
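
The quotes in the ls output above appear because the new names contain parentheses, which have to be quoted or escaped in the shell. If that is a nuisance, the entries in filenames.txt could be cleaned before the copy step. A minimal sketch, assuming the parenthesised group (such as "(firmicutes)") is not needed in the filename; the function name `clean_name` is hypothetical.

```shell
# Hypothetical helper: reads names on stdin, drops any "_(...)" group and
# replaces remaining characters outside A-Z, a-z, 0-9, ".", "_" and "-"
# with underscores.
clean_name() {
    sed -e 's/_*([^)]*)//g' -e 's/[^A-Za-z0-9._-]/_/g'
}

# Example with one entry from filenames.txt:
printf '%s\n' 'GCF_000159335.1_Lactobacillus_jensenii_JV-V16_(firmicutes)' | clean_name
# prints GCF_000159335.1_Lactobacillus_jensenii_JV-V16
```

Because the accession stays at the front of each name, dropping the group does not make two entries collide.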
joaobotelho90 wrote:

That's great, thanks!

ADD COMMENTlink written 6 days ago by joaobotelho90