Good morning,
I am trying to separate my *.genomic.fna refseq files into individual genomes and name those files by Genus_species_strain. Headers have a variety of different formats. For example:
>NZ_SADZ01000474.1 Ochrobactrum sp. AV Scaffold_480, whole genome shotgun sequence
>NZ_PHFU01000092.1 Xylella fastidiosa subsp. fastidiosa strain CFBP8351 Xf_LSV462693, whole genome shotgun sequence
>NZ_SACX01000165.1 Escherichia coli strain JEONG-9595 NODE_13_length_103668_cov_11.7541_ID_25, whole genome shotgun sequence
To be clear, one genome often has several sequences, and I want all these in a single file. For example:
...
>NZ_SADZ01000789.1 Ochrobactrum sp. AV Scaffold_806, whole genome shotgun sequence
>NZ_SADZ01000790.1 Ochrobactrum sp. AV Scaffold_807, whole genome shotgun sequence
>NZ_SADZ01000791.1 Ochrobactrum sp. AV Scaffold_808, whole genome shotgun sequence
The variation in these headers prevented me from figuring out how to modify this awk command to work for all situations.
awk '{if(substr($0,1,1) == ">"){split(substr($0,2,length($0)),a," "); split(a[NF-5],b,"/"); filename=$2"_"$3"_"b[1]".fna"};print $0 > filename }' *.fna
If someone has a awk command that would work, I would really appreciate it.
Alternatively, a metadata file is available that I think could be used to separate the files. The file is: ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/assembly_summary_refseq.txt
Columns 8 and 9 of this tab-delimited include Genus, species, and strain information. Notice the "/"! Notice that some strain information is redundant in column 8 and 9.
Column 8 Column 9
Brucella sp. 83/13 strain=83/13
Mycobacterium kansasii ATCC 12478 strain=ATCC 12478
Bacteroides coprophilus DSM 18228 = JCM 13818 strain=DSM 18228
Ideally, I would like the file names to be Genus_species_strain, but I don't mind if the file names look like:
Brucella_sp._83/13
Brucella_sp._83_13
or
Bacteroides_coprophilus_DSM_18228_=_JCM_13818
Bacteroides_coprophilus_DSM_18228
Thank you, Jason
Please don't use
/
and=
in file names. That is asking for trouble.Hi genomax! Hope you are doing well :) I thought that was the case. Thank you for confirming.
Please forgive my columns. I tried to space them because tab didn't work, but they came out funny in the post.
could you post the lines from assembly_summary text file, matching ochrobactrum, xylella and E. coli strain? Currently, it is easy to split the Organism specific fasta files and file names will have accession names i.e all the Ochrobactrum sequences would go to folder titled
NZ_SADZ
which is not is desired here. If you can post corresponding lines from the file, that would be helpful.Certainly, thank you :)
Please use
ADD COMMENT/ADD REPLY
when responding to existing posts to keep threads logically organized.SUBMIT ANSWER
is for new answers to original questionSomehow I missed the "GCF_004011925.1" in the first column of the O. sp. AV. Just added it.
Please take a back up of your input fasta files and read explanation for each function before you proceed. This is a temporary solution (4 step) :
Following is the explanation for each step:
Note 4: Step 4 creates files like
Ochrobactrum sp. AV strain=AV.fasta
. This kind of naming is not advisable.with OP data output tree:
Good morning cpad0112,
I can see that this is progressing in a great direction.
However, I'm getting the error below in my stderr and the "stdin.id_" still precedes my filenames (e.g. stdin.id_UJAD.fna). Bad : modifier in $ (0).
Why am I getting this error and what can be done about it?
Thank you, Jason
You can break the step 4 function into two lines:
Step 1 prints the file name before and after name change. Once you cross checked every thing is fine, remove
-n
option. This would removestdin.id_
string from all the file names. In addition, please verify following:If you still have issues, please post few file names and corresponding lines from mapping.txt
Good afternoon cpad0112,
The error is still occurring. ...
[INFO] write 269 sequences to file: split_fasta/stdin.id_MIFJ.fasta
[INFO] write 72 sequences to file: split_fasta/stdin.id_UIZP.fasta
[INFO] write 95 sequences to file: split_fasta/stdin.id_UIWI.fasta
Bad : modifier in $ (0).
Also, it looks like 4 letters will not be enough. This seems like a minor change as long as its okay to have LLLL## for some and LLLLLL for others (by L I mean letter). The mapping.txt also has sections where there is no code! Only the genus species [strain] are on a line (see below). Furthermore the mapping lines sometimes only include the genus and species without the strain. This will not work for my purposes. Any idea how these hurdles can be jumped?
MPDG00000000.2 Shewanella sp. SACH
MIJW00000000.1 Flavobacterium psychrophilum
MIJV00000000.1 Flavobacterium psychrophilum
JNBW00000000.1 Borreliella bissettii
MPHG00000000.1 Bacillus obstructivus
JSVH00000000.1 Enterobacter kobei
Neomicrococcus aestuarii
Chelatococcus daeguensis
Microbacterium paludicola
Acinetobacter baumannii
MPIN00000000.1 Cystobacter ferrugineus
LFCS00000000.1 Aeromonas sp. SCS5 ...
CAAKCJ000000000.1 Clostridioides difficile
CAAKCI000000000.1 Clostridioides difficile
CAAKCM000000000.1 Clostridioides difficile
CAAKGY000000000.1 Arthrobacter sp. DR-2P
CAAKGZ000000000.1 Pseudomonas sp. FG-3G
CAAKHA000000000.1 Bacillus thuringiensis serovar israelensis
CAAKMM000000000.1 Pseudomonas aeruginosa
CAAKMP000000000.1 Pseudomonas aeruginosa ...
At what step are you encountering this error ? @ jmwhitha. It seems file pattern is different for organism like Neomicrococcus aestuarii. From the original file, could you print the llines for Neomicrococcus aestuarii? In addition, file also seem to have more than 4 letter identifiers. Let me update the script. Btw, what shell are you using (c-shell or any derived shells)?
Try this for grouping the organisms: (Step 1 in OP):
Now the code doesn't rely on the number of letters. It's generalized now. Let me know if you are able to get fasta for organisms with more than four letter code. For last step, please try using zsh or bash.
The error occurs with this command: while read before after; do mv ${before:0:4}.fna "$after.fna";done < mapping.txt
I am now noticing that some have 4 and some have 6 letters and 31889 entries appear to have no identifier (wgs_master). Would the seqkit step still work with entries that lack an identifier?
Sorry, I'm starting to think the summary file is not as useful as I thought initially. Column 9 has strain info for most entries, but the field is blank for 18790 entries. The isolate field has strain info for some of those 18790, but not all.
If I just do the following, will I at least get individual genomes in my files or will something go wrong because of 4 letter entries and no letter entries? seqkit seq -w 0 *.fa | seqkit split -i --id-regexp "^\w{3}(.{6})" -O split_fasta
Code
$ seqkit seq -w 0 *.fa | seqkit split -i --id-regexp "^\w{3}(.{6})" -O split_fasta
will only pick up 6 letter entries.To catch all entries irrespective of the length (4,5,6, ...), please run following code
$ seqkit seq -w 0 *.fa | seqkit split -i --id-regexp "^\w+_([A-z]+)[0-9]+" -O split_fasta
Please run these commands again and let me know if code works (bash/zsh, seqkit):
If 4 th column is empty or NA, code doesn't work as there is no mapping between mapping text (from NCBI) and fasta files. Instead you can remove rows with "na" from mapping.txt.. Between fasta headers and mapping text, only alphanumeric code is common(4 column), as I understand.
I'm using tcsh
For these steps, I get the following error on the while step.
Missing }.
Can you please try the below code:
code does following: change shell to bash, run the code and exit the bash shell. (assuming that you have bash shell. If you don't have access to bash, please try zsh)
Hmm.
Still showing Missing }.
Just to be clear, I do have bash-4.2 available.
Bash version on my system is 5.0.17(1). I am not sure if that makes difference as read is a basic function. Can you please try this:
This is shell independent. Please mind the file extensions. Replace the extension as appropriate. Parallel version I am using is 20200722
With the awk command, all the mv commands went to stdout. awk -F "\t" '{sub(/[0-9]+.[0-9]/,"",$1); print "mv",$1".fna",$2".fna"}' mapping.txt
mv # See ftp://ftp.ncbi.nlm.nih.gov/genomes/README_assembly_summary.txt for a description of the columns in this file..fna .fna
mv wgs_master.fna organism_name.fna
mv .fna Drosophila melanogaster.fna
mv .fna Homo sapiens.fna
mv .fna Mus musculus.fna
mv .fna Arabidopsis thaliana.fna
mv AABR.fna Rattus norvegicus.fna
mv AAGU.fna Loxodonta africana.fna
...
With the parallel command between bash and exit, I got the error below.
parallel --colsep '\t' --dry-run mv {=1s/[0-9]+.[0-9]//=}.fasta {2}.fasta :::: mapping.txt
parallel: Command not found.
parallel
is a program that needs to be installed separately. It is not part of unix OS.With the parallel command, all the mv commands went to stdout.
mv =1s/[0-9]+.[0-9]//=.fna 2.fna #\ \ \ See\ ftp://ftp.ncbi.nlm.nih.gov/genomes/README_assembly_summary.txt\ for\ a\ description\ of\ the\ columns\ in\ this\ file.
mv =1s/[0-9]+.[0-9]//=.fna 2.fna wgs_master organism_name
mv =1s/[0-9]+.[0-9]//=.fna 2.fna '' Drosophila\ melanogaster
mv =1s/[0-9]+.[0-9]//=.fna 2.fna '' Homo\ sapiens
mv =1s/[0-9]+.[0-9]//=.fna 2.fna '' Mus\ musculus
mv =1s/[0-9]+.[0-9]//=.fna 2.fna '' Arabidopsis\ thaliana
mv =1s/[0-9]+.[0-9]//=.fna 2.fna AABR00000000.7 Rattus\ norvegicus
...
Parallel and awk commands are dry-run functions. From the dry-run output, it seems your mapping text has multiple issues. Please fix mapping.txt as per your requirement. Do not install parallel from repos. Parallel version in most of the repos is old.