Question: How to separate a big file into smaller ones using linux commands?
8 months ago by
akansha.gitanjali10 wrote:

I have a file that looks like the one below. I want to split it into chunks, each running from one '//' to the next '//', and write each chunk to a separate file using Linux commands. How can I do that?

Position    Code    Kinase  Peptide Score   Cutoff

//sp|P62754|RS6_MOUSE 40S ribosomal protein S6 OS=Mus musculus OX=10090 GN=Rps6 PE=1 SV=1

96  S   AGC TGERKRKSVRGCIVD 4.005   0.386

139 S   AGC RLGPKRASRIRKLFN 3.514   0.386

181 T   AGC PKIQRLVTPRVLQHK 0.728   0.386

235 S   AGC IAKRRRLSSLRASTS 8.053   0.386

236 S   AGC AKRRRLSSLRASTSK 6.778   0.386

//tr|E9QK41|E9QK41_MOUSE Actin-binding LIM protein 1 OS=Mus musculus OX=10090 GN=Ablim1 PE=1 SV=1

32  S   AGC ERASLRNSHRRLLIE 0.591   0.386

56  T   AGC PAHRRRGTVIHLVYL 3.773   0.386

77  S   AGC PPELRFSSYDPSVAH 0.978   0.386

355 S   AGC PPNIPRSSSDFFYPK 1.563   0.386

356 S   AGC PNIPRSSSDFFYPKS 0.403   0.386

393 T   AGC NKNPRQPTRTSSESI 1.234   0.386
bash linux • 247 views
modified 8 months ago by Pierre Lindenbaum129k • written 8 months ago by akansha.gitanjali10

Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time.

Thank you!

written 8 months ago by genomax87k
8 months ago by
RamRS28k
Houston, TX
RamRS28k wrote:

Use awk. Set Input Record Separator to // and print each record to a separate file. You should be able to get to the solution from these hints and by reading the awk manual + searching online. This will be a good exercise to learn awk and some shell scripting.
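A minimal sketch of this hint, assuming an awk that accepts a multi-character RS (gawk and mawk do; POSIX leaves the behaviour unspecified). The file name demo_input.txt, the output prefix demo_record_, and the shortened sample lines are hypothetical. Record 1 is whatever precedes the first //, here the column header, so it is skipped.

```shell
# Hypothetical sample: a header line followed by two '//' records.
printf '%s\n' \
  'Position Code Kinase Peptide Score Cutoff' \
  '//sp|P62754|RS6_MOUSE 40S ribosomal protein S6' \
  '96 S AGC TGERKRKSVRGCIVD 4.005 0.386' \
  '139 S AGC RLGPKRASRIRKLFN 3.514 0.386' \
  '//tr|E9QK41|E9QK41_MOUSE Actin-binding LIM protein 1' \
  '32 S AGC ERASLRNSHRRLLIE 0.591 0.386' > demo_input.txt

# RS = "//" ends a record at every '//'; ORS = "" avoids adding a newline,
# because each record already carries its own line breaks.
# NR > 1 skips the text before the first '//' (the header line here).
awk 'BEGIN { RS = "//"; ORS = "" }
     NR > 1 {
         out = "demo_record_" (NR - 1) ".txt"
         print "//" $0 > out   # put back the separator that RS consumed
         close(out)            # avoid running out of open file handles
     }' demo_input.txt

ls demo_record_*.txt
```

Because RS matches // anywhere in the text, a // in the middle of a line would also start a new record; matching the separator only at the start of a line needs a line-based pattern instead.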

written 8 months ago by RamRS28k

awk '/ // / {n++}{print > "output" n ".txt"}' <filename.txt> works, but it creates twice as many files as expected. I have 13 '//' separators, but the split produces 26 files.

modified 8 months ago by genomax87k • written 8 months ago by akansha.gitanjali10

What is showing up in the extra files? Are they empty, or do they have exactly the same content?

written 8 months ago by genomax87k

They are copies of the first 13 files.

written 8 months ago by akansha.gitanjali10

If you set RS to "//", you don't need n; you can use NR instead. Try "output_" (NR-1) ".txt" in place of "output" n ".txt". I subtract 1 because in my trial run awk treated the part before the first // as the first record, even though my input file started with // (so the first record was empty). Setting RS to "//" also makes the splitting more explicit.

Also, you haven't set the initial value of n explicitly. Verbose, explicit programming is always better than relying on system defaults or assumptions.
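A sketch that follows the advice on being explicit while staying line-based: n is initialised in BEGIN, the separator is matched only at the start of a line, and each output file is closed before the next one is opened. It should run in any POSIX awk. The names demo_lines.txt and demo_chunk_ and the shortened sample lines are hypothetical.

```shell
# Hypothetical sample: a header line followed by two '//' chunks.
printf '%s\n' \
  'Position Code Kinase' \
  '//sp|P62754|RS6_MOUSE' \
  '96 S AGC' \
  '//tr|E9QK41|E9QK41_MOUSE' \
  '32 S AGC' \
  '56 T AGC' > demo_lines.txt

awk 'BEGIN { n = 0; out = "" }
     /^\/\// {                       # a line starting with // opens a new chunk
         if (out != "") close(out)
         n++
         out = "demo_chunk_" n ".txt"
     }
     n > 0 { print > out }           # lines before the first // are skipped
    ' demo_lines.txt

ls demo_chunk_*.txt
```

Building the file name in a variable first also sidesteps the ambiguity some awk implementations have with print > "output" n ".txt".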

modified 8 months ago • written 8 months ago by RamRS28k

Thank you! awk '/^>/ {n++}{print > "output" n ".txt"}' <filename> worked fine.

modified 8 months ago • written 8 months ago by akansha.gitanjali10

I have another question. I have a file that looks like this:

>sp|Q5T4S7|UBR4_HUMAN E3 ubiquitin-protein ligase UBR4 OS=Homo sapiens OX=9606 GN=UBR4 PE=1 SV=1                    
362 S   AGC AQQVRTGSTSSKEDD 2.108   0.386

364 S   AGC QVRTGSTSSKEDDYE 0.556   0.386

555 S   AGC LQRQRKGSMSSDASA 4.466   0.386

625 S   AGC ESSPRVKSPSKQAPG 2.518   0.386

904 T   AGC DSNSRRATTPLYHGF 3.45    0.386

1049    S   AGC SSRLRISSYVNWIKD 0.972   0.386

1473    T   AGC AWLTRMTTSPPKDSD 0.463   0.386

1504    S   AGC TYIVRENSQVGEGVC 1.114   0.386

1787    S   AGC EEKPKKSSLCRTVEG 1.593   0.386

1941    T   AGC DSSKRKLTLTRLASA 1.859   0.386

I used cut -d'|' -f2 output2.txt | head -1 to output Q5T4S7. Now I want to use this text to rename the same file. How do I do that?

modified 8 months ago by genomax87k • written 8 months ago by akansha.gitanjali10
cut -d'|' -f2 output.txt | head -1 | xargs -n 1 sh -c 'mv output.txt $0'
modified 8 months ago • written 8 months ago by genomax87k

Redirection and renaming in the same pipeline: is this even possible? Isn't it better to do this in two separate steps, or at least to use a subshell for the cut instead of the mv?

NEW_NAME=$(head -n 1 output.txt | cut -d'|' -f2)
mv output.txt "$NEW_NAME"

or

mv output.txt "$(head -n 1 output.txt | cut -d'|' -f2)"

PS: I also reordered cut | head to head | cut, so that cut only processes the first line instead of the whole file.
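The same idea can be applied to every split file in one go. This is only a sketch: the demo_out*.txt names and the shortened sample lines are hypothetical, and appending .txt to the new name is my assumption (the commands above rename to the bare ID).

```shell
# Two hypothetical split files whose first line is a '>' header.
printf '%s\n' '>sp|Q5T4S7|UBR4_HUMAN E3 ubiquitin-protein ligase UBR4' \
  '362 S AGC AQQVRTGSTSSKEDD 2.108 0.386' > demo_out1.txt
printf '%s\n' '>sp|P62754|RS6_MOUSE 40S ribosomal protein S6' \
  '96 S AGC TGERKRKSVRGCIVD 4.005 0.386' > demo_out2.txt

for f in demo_out*.txt; do
    # Second '|'-delimited field of the first line, e.g. Q5T4S7.
    id=$(head -n 1 "$f" | cut -d'|' -f2)
    # An empty first line would give an empty name, so skip such files.
    if [ -n "$id" ]; then
        mv -- "$f" "$id.txt"
    fi
done

ls Q5T4S7.txt P62754.txt
```

Note that two files with the same ID would overwrite each other; mv -n (where available) refuses to overwrite an existing target.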

modified 8 months ago • written 8 months ago by RamRS28k

Yes, it does work, but your solution is more elegant.

written 8 months ago by genomax87k

So many ways to get things done. TIL xargs sh -c can work on the exact file that gives xargs its input. It makes sense though, as we are not doing xargs mv.

written 8 months ago by RamRS28k
8 months ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum129k wrote:
csplit input.txt '/^\/\//'  '{*}'
written 8 months ago by Pierre Lindenbaum129k

If you could explain what the different / characters mean, that would be very helpful. I am new to Linux.

written 8 months ago by akansha.gitanjali10

https://www.howtoforge.com/linux-csplit-command/
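In short, for the pattern '/^\/\//': the first and last / delimit the regular expression, ^ anchors it to the start of a line, and each \/ is a literal slash escaped with a backslash, so the pattern matches lines that begin with //. The '{*}' argument repeats the split for as long as the pattern keeps matching. A small sketch, assuming GNU csplit from coreutils ('{*}' and -z are GNU extensions); the file name demo_csplit.txt, the prefix demo_part_, and the shortened sample lines are hypothetical.

```shell
# Hypothetical sample: a header line followed by two '//' chunks.
printf '%s\n' \
  'Position Code Kinase' \
  '//sp|P62754|RS6_MOUSE' \
  '96 S AGC' \
  '//tr|E9QK41|E9QK41_MOUSE' \
  '32 S AGC' > demo_csplit.txt

# -s: do not print byte counts; -z: drop empty output files;
# -f: output prefix (the default is xx). Pieces are numbered 00, 01, ...
csplit -s -z -f demo_part_ demo_csplit.txt '/^\/\//' '{*}'

ls demo_part_*
```

Here demo_part_00 holds the lines before the first // (the header), and each later piece starts with its own // line.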

written 8 months ago by Pierre Lindenbaum129k