Parse multifasta files based on a values in two columns in a metadata file
0
0
Entering edit mode
3.6 years ago
jmwhitha • 0

Good morning,

I am trying to separate my *.genomic.fna refseq files into individual genomes and name those files by Genus_species_strain. Headers have a variety of different formats. For example:

>NZ_SADZ01000474.1 Ochrobactrum sp. AV Scaffold_480, whole genome shotgun sequence
>NZ_PHFU01000092.1 Xylella fastidiosa subsp. fastidiosa strain CFBP8351 Xf_LSV462693, whole genome shotgun sequence
>NZ_SACX01000165.1 Escherichia coli strain JEONG-9595 NODE_13_length_103668_cov_11.7541_ID_25, whole genome shotgun sequence

To be clear, one genome often has several sequences, and I want all these in a single file. For example:

...
>NZ_SADZ01000789.1 Ochrobactrum sp. AV Scaffold_806, whole genome shotgun sequence
>NZ_SADZ01000790.1 Ochrobactrum sp. AV Scaffold_807, whole genome shotgun sequence
>NZ_SADZ01000791.1 Ochrobactrum sp. AV Scaffold_808, whole genome shotgun sequence

The variation in these headers prevented me from figuring out how to modify this awk command to work for all situations.

awk '{if(substr($0,1,1) == ">"){split(substr($0,2,length($0)),a," "); split(a[NF-5],b,"/"); filename=$2"_"$3"_"b[1]".fna"};print $0 > filename }' *.fna

If someone has a awk command that would work, I would really appreciate it.

Alternatively, a metadata file is available that I think could be used to separate the files. The file is: ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/assembly_summary_refseq.txt

Columns 8 and 9 of this tab-delimited include Genus, species, and strain information. Notice the "/"! Notice that some strain information is redundant in column 8 and 9.

Column 8                                                                         Column 9   
Brucella sp. 83/13  strain=83/13
Mycobacterium kansasii ATCC 12478   strain=ATCC 12478
Bacteroides coprophilus DSM 18228 = JCM 13818   strain=DSM 18228

Ideally, I would like the file names to be Genus_species_strain, but I don't mind if the file names look like:

Brucella_sp._83/13
Brucella_sp._83_13

or

Bacteroides_coprophilus_DSM_18228_=_JCM_13818
Bacteroides_coprophilus_DSM_18228

Thank you, Jason

parse mutlifasta RefSeq database strain • 1.7k views
ADD COMMENT
0
Entering edit mode

Please don't use / and = in file names. That is asking for trouble.

ADD REPLY
0
Entering edit mode

Hi genomax! Hope you are doing well :) I thought that was the case. Thank you for confirming.

Please forgive my columns. I tried to space them because tab didn't work, but they came out funny in the post.

ADD REPLY
0
Entering edit mode

could you post the lines from assembly_summary text file, matching ochrobactrum, xylella and E. coli strain? Currently, it is easy to split the Organism specific fasta files and file names will have accession names i.e all the Ochrobactrum sequences would go to folder titled NZ_SADZ which is not is desired here. If you can post corresponding lines from the file, that would be helpful.

ADD REPLY
0
Entering edit mode

Certainly, thank you :)

GCF_004011925.1    PRJNA224116  SAMN10688388    SADZ00000000.1  na  2500158 2500158 Ochrobactrum sp. AV strain=AV       latest  Scaffold    Major   Full    1/11/2019   ASM401192v1 National Environmental Engineering Research Institute   GCA_004011925.1 identical   ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/004/011/925/GCF_004011925.1_ASM401192v1

GCF_004016405.1 PRJNA224116 SAMN07999362    PHFU00000000.1  na  644356  2371    Xylella fastidiosa subsp. fastidiosa    strain=CFBP8351     latest  Contig  Major   Full    1/14/2019   ASM401640v1 INRA    GCA_004016405.1 identical   ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/004/016/405/GCF_004016405.1_ASM401640v1

GCF_004022065.1 PRJNA224116 SAMN04160771    SACX00000000.1  na  562 562 Escherichia coli    strain=JEONG-9595       latest  Contig  Major   Full    1/14/2019   ASM402206v1 US Food and Drug Administration GCA_004022065.1 identical   ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/004/022/065/GCF_004022065.1_ASM402206v1
ADD REPLY
1
Entering edit mode

Please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized. SUBMIT ANSWER is for new answers to original question

ADD REPLY
0
Entering edit mode

Somehow I missed the "GCF_004011925.1" in the first column of the O. sp. AV. Just added it.

ADD REPLY
0
Entering edit mode

Please take a back up of your input fasta files and read explanation for each function before you proceed. This is a temporary solution (4 step) :

 1. seqkit seq -w 0 *.fa | seqkit split -i  --id-regexp "^\w{3}(.{4})" -O split_fasta
 2. cd split_fasta
 3. cut -f4,8 <path>/assembly_summary_refseq.txt > mapping.txt
 4. while read before after; do  mv stdin.id_${before:0:4}.fasta "$after.fasta";done < mapping.txt

Following is the explanation for each step:

1. Seqkit concatenates all the fasta files and splits the files based on 4 letters , after first three characters (from NZ_SADZ01000474.1 to SADZ) of the headers. All the files will have following names: test.id_fourlettersfromheader.fa  and are stored in `split_fasta` folder (for eg. test.id_PHFU.fa). Seqkit automagically creates output folder. If you are in doubt, you can run seqkit function in `--dry-run` mode. 
3. cut function prints only 4 and 8 columns (tab separated) and stores the 4 letter code and strain names in "mapping test" file. Please format the 8th column as per your requirements for new file names.
4. Changes the files names as per OP with restricted characters. Add `echo` before `mv` to check loop output.

Note 4: Step 4 creates files like Ochrobactrum sp. AV strain=AV.fasta. This kind of naming is not advisable.

with OP data output tree:

tree .                  
.
├── Escherichia coli strain=JEONG-9595.fasta
├── mapping.txt
├── Ochrobactrum sp. AV strain=AV.fasta
└── Xylella fastidiosa subsp. fastidiosa strain=CFBP8351.fasta
ADD REPLY
0
Entering edit mode

Good morning cpad0112,

I can see that this is progressing in a great direction.

However, I'm getting the error below in my stderr and the "stdin.id_" still precedes my filenames (e.g. stdin.id_UJAD.fna). Bad : modifier in $ (0).

Why am I getting this error and what can be done about it?

Thank you, Jason

ADD REPLY
0
Entering edit mode

You can break the step 4 function into two lines:

 1. rename -n "s/stdin.id_//g" *.fna 
 2. while read before after; do mv ${before:0:4}.fna "$after.fasta";done < mapping.txt

Step 1 prints the file name before and after name change. Once you cross checked every thing is fine, remove -n option. This would remove stdin.id_ string from all the file names. In addition, please verify following:

  • the file extensions (.fa vs .fna vs .fasta) and change accordingly.
  • Field separators (tab vs spaces or mix) in mapping.txt

If you still have issues, please post few file names and corresponding lines from mapping.txt

ADD REPLY
0
Entering edit mode

Good afternoon cpad0112,

The error is still occurring. ...

[INFO] write 269 sequences to file: split_fasta/stdin.id_MIFJ.fasta

[INFO] write 72 sequences to file: split_fasta/stdin.id_UIZP.fasta

[INFO] write 95 sequences to file: split_fasta/stdin.id_UIWI.fasta

Bad : modifier in $ (0).

Also, it looks like 4 letters will not be enough. This seems like a minor change as long as its okay to have LLLL## for some and LLLLLL for others (by L I mean letter). The mapping.txt also has sections where there is no code! Only the genus species [strain] are on a line (see below). Furthermore the mapping lines sometimes only include the genus and species without the strain. This will not work for my purposes. Any idea how these hurdles can be jumped?

MPDG00000000.2 Shewanella sp. SACH

MIJW00000000.1 Flavobacterium psychrophilum

MIJV00000000.1 Flavobacterium psychrophilum

JNBW00000000.1 Borreliella bissettii

MPHG00000000.1 Bacillus obstructivus

JSVH00000000.1 Enterobacter kobei

Neomicrococcus aestuarii

Chelatococcus daeguensis

Microbacterium paludicola

Acinetobacter baumannii

MPIN00000000.1 Cystobacter ferrugineus

LFCS00000000.1 Aeromonas sp. SCS5 ...

CAAKCJ000000000.1 Clostridioides difficile

CAAKCI000000000.1 Clostridioides difficile

CAAKCM000000000.1 Clostridioides difficile

CAAKGY000000000.1 Arthrobacter sp. DR-2P

CAAKGZ000000000.1 Pseudomonas sp. FG-3G

CAAKHA000000000.1 Bacillus thuringiensis serovar israelensis

CAAKMM000000000.1 Pseudomonas aeruginosa

CAAKMP000000000.1 Pseudomonas aeruginosa ...

ADD REPLY
0
Entering edit mode

At what step are you encountering this error ? @ jmwhitha. It seems file pattern is different for organism like Neomicrococcus aestuarii. From the original file, could you print the llines for Neomicrococcus aestuarii? In addition, file also seem to have more than 4 letter identifiers. Let me update the script. Btw, what shell are you using (c-shell or any derived shells)?

Try this for grouping the organisms: (Step 1 in OP):

$ seqkit seq -w 0 *.fa | seqkit split -i  --id-regexp "^\w+_([A-z]+)[0-9]+" -O split_fasta

Now the code doesn't rely on the number of letters. It's generalized now. Let me know if you are able to get fasta for organisms with more than four letter code. For last step, please try using zsh or bash.

ADD REPLY
0
Entering edit mode

The error occurs with this command: while read before after; do mv ${before:0:4}.fna "$after.fna";done < mapping.txt

I am now noticing that some have 4 and some have 6 letters and 31889 entries appear to have no identifier (wgs_master). Would the seqkit step still work with entries that lack an identifier?

GCF_001887225.1 PRJNA224116 SAMN03135879    JSVH00000000.1  na  208224  208224  Enterobacter kobei  strain=BH-18        latest  Contig  Major   Full    11/29/2016  ASM188722v1 Tianjin University of Science and Technology    GCA_001887225.1 identical   ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/887/225/GCF_001887225.1_ASM188722v1
GCF_001887245.1 PRJNA224116 SAMN05728458        representative genome   556325  556325  Neomicrococcus aestuarii    strain=B18      latest  Complete Genome Major   Full    11/29/2016  ASM188724v1 Southeast Chongqing Academy of Agricultural Sciences, Chongqing, 40030 P. R. China  GCA_001887245.1 identical   ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/887/245/GCF_001887245.1_ASM188724v1
GCF_001887265.1 PRJNA224116 SAMN06016468        na  444444  444444  Chelatococcus daeguensis    strain=TAD1     latest  Complete Genome Major   Full    11/29/2016  ASM188726v1 Fujian Agriculture and Forestry University  GCA_001887265.1 identical   ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/887/265/GCF_001887265.1_ASM188726v1
GCF_001887285.1 PRJNA224116 SAMN06019996        na  300019  300019  Microbacterium paludicola   strain=CC3      latest  Complete Genome Major   Full    11/29/2016  ASM188728v1 Jiangsu Normal University   GCA_001887285.1 identical   ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/887/285/GCF_001887285.1_ASM188728v1
GCF_001887305.1 PRJNA224116 SAMN06046790        na  470 470 Acinetobacter baumannii strain=HRAB-85      latest  Complete Genome Major   Full    11/29/2016  ASM188730v1 307th Hospital of PLA   GCA_001887305.1 identical   ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/887/305/GCF_001887305.1_ASM188730v1
GCF_001887355.1 PRJNA224116 SAMN06011414    MPIN00000000.1  representative genome   83449   83449   Cystobacter ferrugineus strain=Cbfe23       latest  Contig  Major   Full    11/29/2016  ASM188735v1 University of Mississippi   GCA_001887355.1 identical   ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/887/355/GCF_001887355.1_ASM188735v1
GCF_001887375.1 PRJNA224116 SAMN03734924    LFCS00000000.1  na  1519205 1519205 Aeromonas sp. SCS5  strain=SCS5     latest  Contig  Major   Full    11/29/2016  ASM188737v1 Chinese Academy of Sciences GCA_001887375.1 identical   ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/887/375/GCF_001887375.1_ASM188737v1
ADD REPLY
0
Entering edit mode

Sorry, I'm starting to think the summary file is not as useful as I thought initially. Column 9 has strain info for most entries, but the field is blank for 18790 entries. The isolate field has strain info for some of those 18790, but not all.

If I just do the following, will I at least get individual genomes in my files or will something go wrong because of 4 letter entries and no letter entries? seqkit seq -w 0 *.fa | seqkit split -i --id-regexp "^\w{3}(.{6})" -O split_fasta

ADD REPLY
0
Entering edit mode

Code $ seqkit seq -w 0 *.fa | seqkit split -i --id-regexp "^\w{3}(.{6})" -O split_fasta will only pick up 6 letter entries.

To catch all entries irrespective of the length (4,5,6, ...), please run following code$ seqkit seq -w 0 *.fa | seqkit split -i --id-regexp "^\w+_([A-z]+)[0-9]+" -O split_fasta

ADD REPLY
0
Entering edit mode

Please run these commands again and let me know if code works (bash/zsh, seqkit):

$ seqkit seq -w 0 *.fa | seqkit split -i  --id-regexp "^\w+_([A-z]+)[0-9]+" -O split_fasta
$ cd split_fasta
$ rename  "s/stdin.id_//g" *.fna 
$ cut -f4,8 <path>/assembly_summary_refseq.txt > mapping.txt
$ while read before after; do  mv ${before%%[0-9]*\.[0-9]}.fasta "$after.fasta";done < mapping.txt

If 4 th column is empty or NA, code doesn't work as there is no mapping between mapping text (from NCBI) and fasta files. Instead you can remove rows with "na" from mapping.txt.. Between fasta headers and mapping text, only alphanumeric code is common(4 column), as I understand.

ADD REPLY
0
Entering edit mode

I'm using tcsh

For these steps, I get the following error on the while step.

Missing }.

ADD REPLY
0
Entering edit mode

Can you please try the below code:

$ bash
$ while read before after; do  mv ${before%%[0-9]*\.[0-9]}.fasta "$after.fasta";done < mapping.txt
$exit

code does following: change shell to bash, run the code and exit the bash shell. (assuming that you have bash shell. If you don't have access to bash, please try zsh)

ADD REPLY
0
Entering edit mode

Hmm.

Still showing Missing }.

Just to be clear, I do have bash-4.2 available.

ADD REPLY
0
Entering edit mode

Bash version on my system is 5.0.17(1). I am not sure if that makes difference as read is a basic function. Can you please try this:

$ parallel --colsep '\t' --dry-run mv {=1s/\[0-9\]\+\.\[0-9\]//=}.fasta {2}.fasta :::: mapping.txt

This is shell independent. Please mind the file extensions. Replace the extension as appropriate. Parallel version I am using is 20200722

ADD REPLY
0
Entering edit mode

With the awk command, all the mv commands went to stdout. awk -F "\t" '{sub(/[0-9]+.[0-9]/,"",$1); print "mv",$1".fna",$2".fna"}' mapping.txt

mv # See ftp://ftp.ncbi.nlm.nih.gov/genomes/README_assembly_summary.txt for a description of the columns in this file..fna .fna

mv wgs_master.fna organism_name.fna

mv .fna Drosophila melanogaster.fna

mv .fna Homo sapiens.fna

mv .fna Mus musculus.fna

mv .fna Arabidopsis thaliana.fna

mv AABR.fna Rattus norvegicus.fna

mv AAGU.fna Loxodonta africana.fna

...

With the parallel command between bash and exit, I got the error below.

parallel --colsep '\t' --dry-run mv {=1s/[0-9]+.[0-9]//=}.fasta {2}.fasta :::: mapping.txt

parallel: Command not found.

ADD REPLY
0
Entering edit mode

parallel is a program that needs to be installed separately. It is not part of unix OS.

ADD REPLY
0
Entering edit mode

With the parallel command, all the mv commands went to stdout.

mv =1s/[0-9]+.[0-9]//=.fna 2.fna #\ \ \ See\ ftp://ftp.ncbi.nlm.nih.gov/genomes/README_assembly_summary.txt\ for\ a\ description\ of\ the\ columns\ in\ this\ file.

mv =1s/[0-9]+.[0-9]//=.fna 2.fna wgs_master organism_name

mv =1s/[0-9]+.[0-9]//=.fna 2.fna '' Drosophila\ melanogaster

mv =1s/[0-9]+.[0-9]//=.fna 2.fna '' Homo\ sapiens

mv =1s/[0-9]+.[0-9]//=.fna 2.fna '' Mus\ musculus

mv =1s/[0-9]+.[0-9]//=.fna 2.fna '' Arabidopsis\ thaliana

mv =1s/[0-9]+.[0-9]//=.fna 2.fna AABR00000000.7 Rattus\ norvegicus

...

ADD REPLY
0
Entering edit mode

Parallel and awk commands are dry-run functions. From the dry-run output, it seems your mapping text has multiple issues. Please fix mapping.txt as per your requirement. Do not install parallel from repos. Parallel version in most of the repos is old.

ADD REPLY

Login before adding your answer.

Traffic: 1832 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6