unix, awk or sed question on splitting lines
geneart$$ ▴ 50 · 3.9 years ago

So here is my problem: conceptually I know how to do it, but I am still learning to script, so I don't quite know how to do this efficiently!

Any help with any of these options, or a computationally better solution, would be appreciated! I looked up quite a few one-liner options in awk, sed, grep, etc., but they mostly cover getting common lines between multiple files. I am trying to get a one-liner that works on the contents of the same file across multiple lines. I have implemented option 2, but it is not very elegant, hence this post.

I have this file-A with the following contents:

HWI-X00545:36:C1V15ACPX:2:1206:21033:18295
HWI-X00545:36:C1V15ACPX:2:2210:7893:55181
HWI-X00545:36:C1V15ACPX:2:2306:18502:73182

I want to get just the common parts of all the lines in file-A.

option 1: I could count the number of ":" and print everything before the fourth ":", not including the fourth ":" itself (I could not figure out how to do that; see the sketch after the note below), and then uniq

option 2: I could split lines at ":", then awk the columns and join them. (This feels cumbersome; there must be a better way, and the easiest would be option 3.)

 sed 's/:/\t/g' trial_header.txt | awk '{print $1":"$2":"$3":"$4}' | uniq
 HWI-X00545:36:C1V15ACPX:2

option 3: just print the common parts of all lines in file-A and uniq it! (Not sure how to do this either! sort and uniq work when entire lines are identical, not when only parts of the lines match??)

Note: BTW, I cannot do a character count like cut -c -25, as that works for just this one file. I have other files with a varying number of characters before the fourth ":", so I can't put it in a for loop.
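To make option 1 concrete, here is roughly what I am imagining: keep the first four ':'-separated fields, drop the rest, then uniq. (An untested sketch; it assumes a sed that supports -E for extended regexes, which GNU and BSD sed both do.)

    sed -E 's/^(([^:]*:){3}[^:]*).*/\1/' trial_header.txt | uniq

And option 2 could presumably skip the sed step entirely, since awk can split on ':' and rejoin by itself:

    awk -F: -v OFS=: '{print $1,$2,$3,$4}' trial_header.txt | uniq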

Thank you in advance!

unix awk sed

Given that this is a question of the best way to achieve a given task, I have changed the label from forum to question. Forums are for open-ended discussions that likely don't have a definitive answer.


I don't understand your question:

I want to get just the common parts of all the lines in file-A.

what would be the output for the common parts in

HWI-X00545:36:C1V15ACPX:2:1206:21033:18295
HWI-X00545:36:C1V15ACPX:2:2210:7893:55181
HWI-X00545:36:C1V15ACPX:2:2306:18502:73182

?

HWI-X00545:36:C1V15ACPX:2
HWI-X00545:36:C1V15ACPX:2
HWI-X00545:36:C1V15ACPX:2

That would be the output; after uniq it would be HWI-X00545:36:C1V15ACPX:2

Mensur Dlakic ★ 27k · 3.9 years ago

I don't think it can be done much shorter than your solution, but here goes:

cut --fields=1-4 --delimiter=: trial_header.txt | uniq

or

cut --fields=1-4 --delimiter=: trial_header.txt | sort -u
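
For what it's worth, the short-option spelling is equivalent:

    cut -d: -f1-4 trial_header.txt | uniq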

Thank you! This worked great on the trial file, which has only 3 lines. My original file has 76780139 such lines, so sort -u takes a long time, but your first option was slightly faster.

Now I am wondering: if I use this in a for loop across many other files (I have about 174 files to run this on, doing the same thing as here), it will take even more time. Is there a computationally faster solution? I appreciate your time and help!


Honestly, I think you have enough solutions to just go ahead and do it. If one solution saves you half an hour over the other, and you end up waiting two hours on this forum for someone to come up with a winner, that really wouldn't be any savings.

The difference between uniq and sort -u is that the former only collapses duplicate lines that are adjacent (i.e., the input is already sorted), while the latter works in any case. That's why uniq is faster but not a universal solution.
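
A quick demonstration of that difference:

    $ printf 'a\nb\na\n' | uniq      # the two a's are not adjacent, so both survive
    a
    b
    a
    $ printf 'a\nb\na\n' | sort -u   # sorting makes duplicates adjacent first
    a
    b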

Pierre Lindenbaum · 3.9 years ago
awk -F ':' '{for(i=1;i+1<=NF;i++) {for(L=i;L<=NF;L++) {for(j=i;j<=L;j++) {printf("%s:",$j);} printf("\n"); }}}' input.txt | sort | uniq  -c | LC_ALL=C sort -n  | tail -1

  3 HWI-X00545:36:C1V15ACPX:2:
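
For readability, the same one-liner expanded with comments (behavior unchanged): it prints every run of consecutive ':'-separated fields on every line, counts how often each run occurs across the whole file, and keeps the last line after a numeric sort. That is the most frequent run; ties on the count fall back to a bytewise comparison, which is why the longest HWI... run wins.

    awk -F ':' '{
        for (i = 1; i + 1 <= NF; i++) {     # every start field i (up to NF-1)
            for (L = i; L <= NF; L++) {     # every end field L >= i
                for (j = i; j <= L; j++) {  # print the run $i..$L,
                    printf("%s:", $j)       # each field followed by ":"
                }
                printf("\n")
            }
        }
    }' input.txt | sort | uniq -c | LC_ALL=C sort -n | tail -1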

Thank you for your time, Pierre! I have to acknowledge I don't completely understand your code, but here is what I got from reading it; correct me if I am wrong: it looks like the code in the { } runs on each line (hence 3 times, since I have 3 lines), then you do an equivalent of uniq and also print the number of lines in the file.

1. Please could you explain what you have within your { }?
2. As I mentioned in my comment on Mensur's solution, I have 76780139 lines, so I am not sure this would be any faster than the sort | uniq command?

Any further input is appreciated!
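
On the speed question: since the goal is just the longest common prefix of all lines, you can avoid sorting the 76 million lines (let alone every sub-run of them) and instead make a single pass that maintains a running common prefix. A minimal sketch (character-level; trim the result at the last ':' if you only want whole fields):

    awk '
    NR == 1 { prefix = $0; next }
    {
        # shrink the running prefix until it is also a prefix of this line
        while (substr($0, 1, length(prefix)) != prefix)
            prefix = substr(prefix, 1, length(prefix) - 1)
    }
    END { print prefix }
    ' trial_header.txt

Each file is read exactly once, so for the 174 files this drops into a plain loop such as `for f in *.txt; do awk ... "$f"; done` (the *.txt glob is a placeholder for whatever matches your header files).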

GenoMax 141k · 3.9 years ago

Sounds like you are trying to find out how many flowcells' worth of data are mixed into the data file you have? You have already received answers on how to do the parsing, so I won't repeat that here.

The following comments are based on the relevant fields you need to track, with ':' as the separator:

  1. If they were all run on the same sequencer, then you can ignore ALL fields except number 3 (C1V15ACPX). This will tell you whether more than one flowcell's data is in your file.
  2. If more than one sequencer is involved, then also track field 1 (sequencer serial) along with field 3 (flowcell serial). This will tell you the serial numbers of all sequencers and flowcells.
  3. If you want to track data at the level of which lane it was run on, then you need to summarize field number 4 in addition. This will tell you whether data from more than one lane is in your file.

Doing cumulative counts of occurrences would give you the number of reads for any of the above three.
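
For example, assuming a file of header lines like file-A above, those counts could be tallied with cut and uniq -c (field numbers as described in the list):

    # reads per flowcell (field 3)
    cut -d: -f3 file-A | sort | uniq -c

    # reads per flowcell and lane (fields 3 and 4)
    cut -d: -f3,4 file-A | sort | uniq -c

On a fastq file directly, select the header lines first (every 4th line, starting at line 1, for standard 4-line records): awk 'NR % 4 == 1' reads.fastq | cut -d: -f3 | sort | uniq -c, where reads.fastq is a placeholder name.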

If there is data from only one flowcell in each of these files, you can find the number of reads in a fastq file using: How to count fastq reads. The first three fields will be identical for all reads in the file; the first four, if the data is from a single lane.
