Split file into multiple individual files based on specific attributes
15 months ago

Hi there,

I have a file with two columns as follows:

Yersinia_ruckeri_GCA_008086925.1.fna    ./Yersinia_ruckeri_GCA_008086925.1.fna
Yersinia_ruckeri_GCA_017498685.1.fna    ./Yersinia_ruckeri_GCA_017498685.1.fna
Yersinia_pestis_E_GCA_013421535.1.fna     ./Yersinia_pestis_E_GCA_013421535.1.fna
Yersinia_pestis_E_GCA_013421545.1.fna     ./Yersinia_pestis_E_GCA_013421545.1.fna
Yersinia_pestis_E_GCA_015274745.1.fna     ./Yersinia_pestis_E_GCA_015274745.1.fna

...

My goal is to split the file into multiple individual files corresponding to each bacterial species, each one containing only the file names (col1) and file paths (col2) for the corresponding species. In the example above, I would like to create two files, one named Yersinia_ruckeri.txt, containing the two entries

Yersinia_ruckeri_GCA_008086925.1.fna    ./Yersinia_ruckeri_GCA_008086925.1.fna
Yersinia_ruckeri_GCA_017498685.1.fna    ./Yersinia_ruckeri_GCA_017498685.1.fna

And another file named Yersinia_pestis_E.txt, with the three entries

Yersinia_pestis_E_GCA_013421535.1.fna     ./Yersinia_pestis_E_GCA_013421535.1.fna
Yersinia_pestis_E_GCA_013421545.1.fna     ./Yersinia_pestis_E_GCA_013421545.1.fna
Yersinia_pestis_E_GCA_015274745.1.fna     ./Yersinia_pestis_E_GCA_015274745.1.fna

Thanks in advance!


Someone will probably give you a ready-made answer, but have you thought of cutting the columns apart and then using grep to separate the file names by species? You could then paste the ruckeri and pestis files back together if you actually want both columns. I'm not sure what the significance of the two columns is.


Thanks for the quick reply. The thing is that I have multiple species, and the number of underscore-separated fields before 'GCA' varies between them. In the example above, the species names have two and three fields (Yersinia_ruckeri and Yersinia_pestis_E), respectively. I would have to find a way to group the entries in the large file by whatever appears before 'GCA' and split them into multiple files accordingly.
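
To make the grouping key concrete: it is everything in column 1 before `_GCA_`, however many underscore-separated fields that is. A minimal sketch for listing the species prefixes and how many entries each has, assuming every name in column 1 contains `_GCA_` (`input.txt` is a placeholder file name, filled here with the five rows from the question):

```shell
# Sample rows copied from the question; input.txt is a placeholder file name.
cat > input.txt <<'EOF'
Yersinia_ruckeri_GCA_008086925.1.fna    ./Yersinia_ruckeri_GCA_008086925.1.fna
Yersinia_ruckeri_GCA_017498685.1.fna    ./Yersinia_ruckeri_GCA_017498685.1.fna
Yersinia_pestis_E_GCA_013421535.1.fna     ./Yersinia_pestis_E_GCA_013421535.1.fna
Yersinia_pestis_E_GCA_013421545.1.fna     ./Yersinia_pestis_E_GCA_013421545.1.fna
Yersinia_pestis_E_GCA_015274745.1.fna     ./Yersinia_pestis_E_GCA_015274745.1.fna
EOF

# Drop "_GCA_" and everything after it from column 1, leaving the species prefix,
# then count how many rows share each prefix.
awk '{ sub(/_GCA_.*/, "", $1); print $1 }' input.txt | sort | uniq -c
```

For the five sample rows this reports two prefixes, Yersinia_pestis_E with three entries and Yersinia_ruckeri with two. It is only a check of the key, not the split itself.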

15 months ago

Something like:

awk '{S="";N=split($1,a,/_/);for(i=1;i+1<N;i++) S=sprintf("%s%s%s",S,(i==1?"":"_"),a[i]); S=sprintf("%s.txt",S); print >> S}' input.txt
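
The one-liner above builds the output name from every underscore-separated field except the last two, so it relies on `GCA` and the accession being the final two fields of column 1. A variant sketch, not the answer's own method, that cuts at `_GCA_` instead and closes each output file after writing, which may help when there are more species than the awk implementation allows open files. It assumes every name in column 1 contains `_GCA_`; `input.txt` is a placeholder file name:

```shell
# Sample rows copied from the question; input.txt is a placeholder file name.
cat > input.txt <<'EOF'
Yersinia_ruckeri_GCA_008086925.1.fna    ./Yersinia_ruckeri_GCA_008086925.1.fna
Yersinia_ruckeri_GCA_017498685.1.fna    ./Yersinia_ruckeri_GCA_017498685.1.fna
Yersinia_pestis_E_GCA_013421535.1.fna     ./Yersinia_pestis_E_GCA_013421535.1.fna
Yersinia_pestis_E_GCA_013421545.1.fna     ./Yersinia_pestis_E_GCA_013421545.1.fna
Yersinia_pestis_E_GCA_015274745.1.fna     ./Yersinia_pestis_E_GCA_015274745.1.fna
EOF

# The script appends, so remove output left over from an earlier run.
rm -f Yersinia_ruckeri.txt Yersinia_pestis_E.txt

awk '{
    key = $1                   # work on a copy so the original line is printed unchanged
    sub(/_GCA_.*/, "", key)    # keep only the species prefix before "_GCA_"
    out = key ".txt"           # e.g. Yersinia_pestis_E.txt
    print >> out               # append the full two-column line
    close(out)                 # release the file handle before the next line
}' input.txt
```

Because both versions append with `>>`, rerunning them without deleting the old `.txt` files first will duplicate entries.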