How to separate a big file into smaller ones using Linux commands?
4.4 years ago

I have a file that looks like the one below. I want to split it at each '//', so that every chunk from one '//' up to the next '//' is written to its own file, using Linux commands. How can I do that?

Position    Code    Kinase  Peptide Score   Cutoff

//sp|P62754|RS6_MOUSE 40S ribosomal protein S6 OS=Mus musculus OX=10090 GN=Rps6 PE=1 SV=1

96  S   AGC TGERKRKSVRGCIVD 4.005   0.386

139 S   AGC RLGPKRASRIRKLFN 3.514   0.386

181 T   AGC PKIQRLVTPRVLQHK 0.728   0.386

235 S   AGC IAKRRRLSSLRASTS 8.053   0.386

236 S   AGC AKRRRLSSLRASTSK 6.778   0.386

//tr|E9QK41|E9QK41_MOUSE Actin-binding LIM protein 1 OS=Mus musculus OX=10090 GN=Ablim1 PE=1 SV=1

32  S   AGC ERASLRNSHRRLLIE 0.591   0.386

56  T   AGC PAHRRRGTVIHLVYL 3.773   0.386

77  S   AGC PPELRFSSYDPSVAH 0.978   0.386

355 S   AGC PPNIPRSSSDFFYPK 1.563   0.386

356 S   AGC PNIPRSSSDFFYPKS 0.403   0.386

393 T   AGC NKNPRQPTRTSSESI 1.234   0.386
LINUX BASH • 1.1k views
ADD COMMENT

Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time.

Thank you!

ADD REPLY
4.4 years ago
Ram 43k

Use awk. Set the input record separator (RS) to // and print each record to a separate file. You should be able to get to the solution from these hints, by reading the awk manual, and by searching online. This will be a good exercise for learning awk and some shell scripting.
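
To make the hint concrete, here is a minimal sketch of how awk carves up the input once RS is set to //. It only lists the records and does not write any files. The sample is an abbreviated copy of the data in the question, and the rs_demo directory is just an illustrative name. A multi-character RS is assumed to be available (gawk and mawk support it; POSIX only guarantees a single character).

```shell
# Abbreviated copy of the question's data, written to a scratch directory.
mkdir -p rs_demo
printf '%s\n' \
  'Position Code Kinase Peptide Score Cutoff' \
  '//sp|P62754|RS6_MOUSE 40S ribosomal protein S6' \
  '96 S AGC TGERKRKSVRGCIVD 4.005 0.386' \
  '139 S AGC RLGPKRASRIRKLFN 3.514 0.386' \
  '//tr|E9QK41|E9QK41_MOUSE Actin-binding LIM protein 1' \
  '32 S AGC ERASLRNSHRRLLIE 0.591 0.386' \
  > rs_demo/input.txt

# RS = "//"  : every "//" ends one record and starts the next.
# FS = "\n"  : fields are lines, so $1 is the first line of a record.
awk 'BEGIN { RS = "//"; FS = "\n" } { print NR ": " $1 }' \
  rs_demo/input.txt > rs_demo/records.txt

cat rs_demo/records.txt
```

With this sample, record 1 is the column-header line that sits before the first //, and each later record starts with a protein header, so the record number is always one ahead of the chunk number.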

ADD COMMENT

awk '/ // / {n++}{print > "output" n ".txt"}' <filename.txt> works, but it creates twice as many files as expected: the input has 13 '//' separators, yet the split produces 26 files.

ADD REPLY

What is showing up in the extra files? Are they empty, or do they have exactly the same content as the other files?

ADD REPLY

They are copies of the first 13 files.

ADD REPLY

You don't need n; you can use NR instead. Try "output_"NR-1".txt" in place of "output" n ".txt". I use NR-1 because, in my trial run, awk picked up the part before the first // as the first record even though my input file started with // (so the first record was empty). Also, try setting RS to "//" so things are more explicit.

Also, you've not set the initial value for n explicitly. Verbose, explicit programming is always better than relying on system defaults or assumptions.
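
A minimal sketch of this suggestion, assuming gawk or mawk (a multi-character RS is an extension to POSIX awk). The nr_demo directory and the shortened sample rows are illustrative only. The file name is built in parentheses because an unparenthesized concatenation after > is parsed differently by different awk implementations, and close() is called so that a large input does not exhaust the open-file limit.

```shell
# Abbreviated copy of the question's data, written to a scratch directory.
mkdir -p nr_demo
printf '%s\n' \
  'Position Code Kinase Peptide Score Cutoff' \
  '//sp|P62754|RS6_MOUSE 40S ribosomal protein S6' \
  '96 S AGC TGERKRKSVRGCIVD 4.005 0.386' \
  '//tr|E9QK41|E9QK41_MOUSE Actin-binding LIM protein 1' \
  '32 S AGC ERASLRNSHRRLLIE 0.591 0.386' \
  > nr_demo/input.txt

(
  cd nr_demo || exit 1
  awk '
    BEGIN { RS = "//" }            # explicit record separator
    NR > 1 {                       # record 1 is the text before the first "//"
      out = "output_" (NR - 1) ".txt"
      printf "//%s", $0 > out      # put back the "//" that RS consumed
      close(out)                   # release the file handle right away
    }
  ' input.txt
)

ls nr_demo
```

The NR > 1 guard skips whatever precedes the first //, whether that is an empty record or, as in the question's file, the column-header line.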

ADD REPLY

Thank you! awk '/^>/ {n++}{print > "output" n ".txt"}' <filename> worked fine.
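
For readers whose headers still start with // rather than >, the same counter idea might look like the sketch below. The slashes have to be escaped inside awk's /.../ regex. The counter_demo directory and the shortened sample rows are illustrative, and the explicit n = 0 and the close() call are additions, not part of the command above.

```shell
# Abbreviated copy of the question's data, written to a scratch directory.
mkdir -p counter_demo
printf '%s\n' \
  'Position Code Kinase Peptide Score Cutoff' \
  '//sp|P62754|RS6_MOUSE 40S ribosomal protein S6' \
  '96 S AGC TGERKRKSVRGCIVD 4.005 0.386' \
  '139 S AGC RLGPKRASRIRKLFN 3.514 0.386' \
  '//tr|E9QK41|E9QK41_MOUSE Actin-binding LIM protein 1' \
  '32 S AGC ERASLRNSHRRLLIE 0.591 0.386' \
  > counter_demo/input.txt

(
  cd counter_demo || exit 1
  awk '
    BEGIN { n = 0 }                # explicit start value for the counter
    /^\/\// {                      # a line starting with "//" opens a new chunk
      if (out != "") close(out)    # finished with the previous chunk
      n++
      out = "output" n ".txt"
    }
    n > 0 { print > out }          # lines before the first "//" are skipped
  ' input.txt
)

ls counter_demo
```

Because the default line-based records are kept, this version runs on any POSIX awk and leaves the // header line intact in each output file.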

ADD REPLY

I have another question. I have a file that looks like this:

>sp|Q5T4S7|UBR4_HUMAN E3 ubiquitin-protein ligase UBR4 OS=Homo sapiens OX=9606 GN=UBR4 PE=1 SV=1                    
362 S   AGC AQQVRTGSTSSKEDD 2.108   0.386

364 S   AGC QVRTGSTSSKEDDYE 0.556   0.386

555 S   AGC LQRQRKGSMSSDASA 4.466   0.386

625 S   AGC ESSPRVKSPSKQAPG 2.518   0.386

904 T   AGC DSNSRRATTPLYHGF 3.45    0.386

1049    S   AGC SSRLRISSYVNWIKD 0.972   0.386

1473    T   AGC AWLTRMTTSPPKDSD 0.463   0.386

1504    S   AGC TYIVRENSQVGEGVC 1.114   0.386

1787    S   AGC EEKPKKSSLCRTVEG 1.593   0.386

1941    T   AGC DSSKRKLTLTRLASA 1.859   0.386

I used cut -d'|' -f2 output2.txt | head -1 to output Q5T4S7. Now I want to use this text to rename the same file. How do I do that?

ADD REPLY
cut -d'|' -f2 output.txt | head -1 | xargs -n 1 sh -c 'mv output.txt $0'
ADD REPLY

Redirection and renaming in the same pipeline: is this even possible? Would it not be better to do this in two separate steps, or at least to use command substitution (a subshell) for the cut instead of the mv?

NEW_NAME=$(head -n 1 output.txt | cut -d'|' -f2)
mv output.txt "$NEW_NAME"

or

mv output.txt "$(head -n 1 output.txt | cut -d'|' -f2)"

PS: I also reordered cut | head to head | cut, so that cut only has to process the first line instead of the whole file.
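
If every split file needs the same treatment, the head | cut step can go inside a loop. This is only a sketch under assumptions: the rename_demo directory, the two tiny sample files, and the choice to keep a .txt extension are all illustrative (the commands above rename to the bare accession), and the second header is a made-up ">" version of an entry from the original question.

```shell
# Two small stand-ins for the split files.
mkdir -p rename_demo
printf '%s\n' \
  '>sp|Q5T4S7|UBR4_HUMAN E3 ubiquitin-protein ligase UBR4' \
  '362 S AGC AQQVRTGSTSSKEDD 2.108 0.386' \
  > rename_demo/output1.txt
printf '%s\n' \
  '>sp|P62754|RS6_MOUSE 40S ribosomal protein S6' \
  '96 S AGC TGERKRKSVRGCIVD 4.005 0.386' \
  > rename_demo/output2.txt

for f in rename_demo/output*.txt; do
  # -s makes cut print nothing when the first line has no "|" at all.
  acc=$(head -n 1 "$f" | cut -s -d'|' -f2)
  # Leave the file alone if no accession could be extracted.
  [ -n "$acc" ] || continue
  mv -- "$f" "rename_demo/${acc}.txt"
done

ls rename_demo
```

Quoting "$f" and "$acc" keeps the loop safe for names containing spaces. Note that two files sharing an accession would overwrite each other.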

ADD REPLY

Yes, it does work. But your solution is more elegant.

ADD REPLY

So many ways to get things done. TIL xargs sh -c can work on the exact file that gives xargs its input. It makes sense though, as we are not doing xargs mv.
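
A small sketch of why $0 holds the new name in the xargs sh -c form: xargs appends each input item after the quoted script, and sh -c assigns the first word after the script to $0. The output file xargs_demo.txt is only there so the result can be inspected.

```shell
# xargs turns the input line into:  sh -c 'script' Q5T4S7
# so inside the script the item is available as $0.
printf 'Q5T4S7\n' | xargs -n 1 sh -c 'echo "new name: $0"' > xargs_demo.txt

# Supplying a placeholder word for $0 lets the item arrive as $1 instead.
printf 'Q5T4S7\n' | xargs -n 1 sh -c 'echo "new name: $1"' sh >> xargs_demo.txt

cat xargs_demo.txt
```

Both commands write the same line. The second form is a common convention because it keeps $0 as a program name and the data in $1.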

ADD REPLY
4.4 years ago
csplit input.txt '/^\/\//'  '{*}'
ADD COMMENT

If you could explain what the different / characters mean, that would be very helpful. I am new to Linux.
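
For reference, an annotated sketch of the csplit pattern, assuming GNU csplit from coreutils. The -s and -f options, the chunk_ prefix, and the csplit_demo directory are additions for the demo and are not part of the command above.

```shell
# Abbreviated copy of the question's data, written to a scratch directory.
mkdir -p csplit_demo
printf '%s\n' \
  'Position Code Kinase Peptide Score Cutoff' \
  '//sp|P62754|RS6_MOUSE 40S ribosomal protein S6' \
  '96 S AGC TGERKRKSVRGCIVD 4.005 0.386' \
  '//tr|E9QK41|E9QK41_MOUSE Actin-binding LIM protein 1' \
  '32 S AGC ERASLRNSHRRLLIE 0.591 0.386' \
  > csplit_demo/input.txt

(
  cd csplit_demo || exit 1
  # /.../   the outer slashes delimit the regular expression
  # ^       match only at the start of a line
  # \/\/    two literal slashes, escaped so they are not read as delimiters
  # {*}     GNU extension: repeat the pattern as many times as it matches
  # -s      do not print the size of each piece
  # -f      prefix for the pieces (the default prefix is "xx")
  csplit -s -f chunk_ input.txt '/^\/\//' '{*}'
)

ls csplit_demo
```

The first piece (chunk_00 here, xx00 with the default prefix) holds everything before the first // line, which in this file is the column-header line. The following pieces each start with a // header.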

ADD REPLY
