Unique list based on multiple columns
3
0
4.6 years ago
Sam ▴ 140

Hi

How can I obtain unique reads based on two different columns? Thanks

input:
A, 1
A, 2
A, 2
B, 1
B, 2
B, 1
C, 1
C, 3
C, 3
output:
A, 1
A, 2
B, 2
B, 1
C, 1
C, 3

awk sort • 1.5k views
2
4.6 years ago
JC 13k

Use sort and uniq commands:

sort myfile | uniq > output
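
As a quick check against the sample in the question, here is a minimal sketch. The file names input.txt and output.txt are assumptions for illustration, not names from the original post.

```shell
# Recreate the sample input from the question (file name is an assumption).
printf '%s\n' 'A, 1' 'A, 2' 'A, 2' 'B, 1' 'B, 2' 'B, 1' 'C, 1' 'C, 3' 'C, 3' > input.txt

# sort puts identical lines next to each other; uniq then drops adjacent repeats.
sort input.txt | uniq > output.txt

cat output.txt
```

The result contains the same six lines as the requested output, but in sorted order, so "B, 1" comes before "B, 2".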

0

Thanks for your code, but I want to obtain unique reads according to two different columns in my input file. Please check my example.

1

The code above generates results you ask for. If that example data does not represent real data then you need to provide an appropriate example.

0

If the columns are not at the beginning of the table, you can extract the columns using cut:

cut -f2,4 myfile | sort | uniq > output
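
Note that cut splits on TAB by default, so for comma-separated input the delimiter has to be given with -d. A minimal sketch, assuming a hypothetical file mirna.csv: the first row is modeled on the line posted in this thread (spaces around the commas removed), and the other two rows are invented for illustration.

```shell
# Hypothetical comma-separated rows; only the first is modeled on the thread's example.
cat > mirna.csv <<'EOF'
MIRT000415,hsa-let-7a-5p,Homosapiens,CDK6,1021,Homosapiens,Luciferase reporter assay
MIRT000415,hsa-let-7a-5p,Homosapiens,CDK6,1021,Homosapiens,Western blot
MIRT000000,hsa-miR-1,Homosapiens,CDK6,1021,Homosapiens,Western blot
EOF

# -d',' sets the field delimiter; -f2,4 keeps only fields 2 and 4.
cut -d',' -f2,4 mirna.csv | sort | uniq > pairs.txt

cat pairs.txt
```

This leaves one line per distinct pair, but only the two extracted columns survive; the rest of each row is discarded.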

0
4.6 years ago

Here is a way that gets around some issues with other approaches:

$ awk '!a[$0]++' input.txt > output.txt


Here's what output would look like, from your example:

$ cat output.txt
A, 1
A, 2
B, 1
B, 2
C, 1
C, 3

If your input looks like something else, then this approach would need modifications.

0

My input format is the same as below, and I need to obtain unique reads according to the 2nd and 4th columns:

MIRT000415 , hsa-let-7a-5p, Homosapiens, CDK6, 1021, Homosapiens, Luciferase reporter assay

1

In that case, use the following modification:

$ awk -v FS=',' '!a[$2$4]++' input.txt > output.txt


This will report the first line seen for the combination of the 2nd and 4th columns. Second and subsequent "hits" are not reported.
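
One thing that may be worth checking with this kind of key: writing the two fields side by side concatenates them with no separator, so different field pairs can produce the same key. A minimal sketch with made-up rows (the file names and values are hypothetical), comparing plain concatenation with a comma subscript, which awk joins using its SUBSEP character:

```shell
# Hypothetical rows: fields 2 and 4 differ, but both concatenate to "abc".
printf '%s\n' 'r1,ab,x,c' 'r2,a,x,bc' > collide.csv

# a[key]++ is 0 the first time a key is seen, so !a[key]++ is true only then
# and awk prints that line. Here both rows map to the key "abc", so r2 is dropped.
awk -F',' '!a[$2$4]++' collide.csv > concat.txt

# With a comma subscript the parts are joined by SUBSEP, so the keys stay distinct.
awk -F',' '!a[$2,$4]++' collide.csv > subsep.txt

grep -c '' concat.txt   # number of lines kept with the concatenated key
grep -c '' subsep.txt   # number of lines kept with the SUBSEP-joined key
```

If the fields in your data can never overlap this way, both forms give the same result.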

If you want to instead use sort, you will need to use some additional options:

$ sort -u -k2,2 -k4,4 -t, input.txt > output.txt

Without reading the man pages, I'm unsure if sort is stable, so you might get a different answer on repeated runs.

In addition to flexibility on the keys used for filtering, the awk approach runs much faster on very large input (at the expense of memory), so if you're working with whole-genome scale input, then you may want to use awk instead of sort | uniq or sort -u -based approaches.

0

Hi Alex, could you please help me with this post? compare two text file

0

The answer here should work, I think: C: compare two text file

0

No, unfortunately, I've already tested them.

0

It would perhaps be easier to help if you posted your two files somewhere public (pastebin, Dropbox, etc.), and explained more explicitly what your filters are.

0

Can I have your email address?

0

You could just use pastebin: https://pastebin.com/

0

Please check this link: https://mega.nz/fm/V6413RBB

0

I'm sorry but I will not be signing up for an account with that site. Just use pastebin or publish to a public folder in Dropbox or similar, if you want to.

0

Problem finally solved; it was due to a blank line in the text1 file. Thanks for your time.

0

Then accept the answer(s) that worked (use the green check mark against the answer) to provide closure to this thread.

0
4.6 years ago
aka001 ▴ 190

Based on the example in one of your comments, you can do it with this:

awk -F',' '!seen[$2,$4]++' your_file.txt


You might have problems later on when there are multiple lines with the same 2nd and 4th columns but different values in some other columns. However, as you didn't mention it, the awk command above will work fine.
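
To see beforehand whether that caveat applies to a given file, one option is to count how many distinct lines share each key. This is only a sketch under assumptions: the file name assays.csv and all of its rows are made up for illustration.

```shell
# Hypothetical rows: r1 and r2 share fields 2 and 4 but differ in the last column;
# the r3 row is an exact duplicate and should not count as a conflict.
cat > assays.csv <<'EOF'
r1,let-7a,h,CDK6,assayA
r2,let-7a,h,CDK6,assayB
r3,miR-1,h,MET,assayA
r3,miR-1,h,MET,assayA
EOF

# Count distinct full lines per (field 2, field 4) key,
# then report only the keys that have more than one distinct line.
awk -F',' '
  !line_seen[$0]++ { n[$2,$4]++ }
  END {
    for (k in n) if (n[k] > 1) {
      split(k, part, SUBSEP)
      print part[1] "," part[2] ": " n[k] " distinct lines"
    }
  }
' assays.csv > ambiguous.txt

cat ambiguous.txt
```

Any key listed here is one where keeping only the first line would silently drop rows that differ in other columns.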