How can I find matches between two csv files according to multiple columns using awk/bash?
1
0
Entering edit mode
2.7 years ago
Gargantu8 • 0

*Reposted question with no images, removed useless information.

I want to compare File 1 and 2 in two ways:

  • First, match "subject id" in File 2 with "seqnames" in File 1
  • Then, see if File 2's s.start-s.end range overlaps or falls within File 1's start and end range.

If such matches occur, I would like to print out columns 2-23 from File 1, and then every column but #2 (subject id) from File 2 to the right of that.

File 1:

strand  seqnames      start  end    width  [18moreColumns]
1       Ha412HOChr01  1000   10000  9000    ...
2       Ha412HOChr01  19000  21000  2000    ...
3       Ha412HOChr08  2000   11000  9000    ...
4       Ha412HOChr17  92500  94400  1900    ...

File 2:

query id  subject id    % identity  alignment  mismatch  gap open  s.start  s.end  [3 more columns]
XP_02299  Ha412HOChr01  100         335        0         0         2800     3200   ...
XP_02356  Ha412HOChr01  99.86       189        2         1         18900    19800  ...
XP_04271  Ha412HOChr03  98.21       431        4         0         2100     2432   ...
XP_05788  Ha412HOChr08  93.29       38         0         0         7600     8210   ...
XP_06354  Ha412HOChr08  97.89       98         0         0         32100    34980  ...

Output:

seqnames      start  end    width  [18moreColumns]  query id  % identity  alignment  mismatch  gap open  s.start  s.end  [3 more columns]
Ha412HOChr01  1000   10000  9000    ...             XP_02299  100         335        0         0         2800     3200   ...
Ha412HOChr01  19000  21000  2000    ...             XP_02356  99.86       189        2         1         18900    19800  ...
Ha412HOChr08  2000   11000  9000    ...             XP_05788  93.29       38         0         0         7600     8210   ...
awk genomics bash • 1.6k views
ADD COMMENT
0
Entering edit mode

using sqldf in R:

> sqldf('select * from df1 join df2 on df1.seqnames= df2.subject_id where 
+     df2.s_start between df1.start and df1.end OR
+     df2.s_end between df1.start and df1.end')

  strand     seqnames start   end width query_id   subject_id x_identity alignment mismatch gap_open s_start s_end
1      1 Ha412HOChr01  1000 10000  9000 XP_02299 Ha412HOChr01     100.00       335        0        0    2800  3200
2      2 Ha412HOChr01 19000 21000  2000 XP_02356 Ha412HOChr01      99.86       189        2        1   18900 19800
3      3 Ha412HOChr08  2000 11000  9000 XP_05788 Ha412HOChr08      93.29        38        0        0    7600  8210

using fuzzyjoin in R (default inner_join):

> names(df2)[c(2,7,8)]=c("seqnames","start","end")
> genome_join(df1, df2, by=c("seqnames","start","end"))
  strand   seqnames.x start.x end.x width query_id   seqnames.y x_identity alignment mismatch gap_open start.y end.y
1      1 Ha412HOChr01    1000 10000  9000 XP_02299 Ha412HOChr01     100.00       335        0        0    2800  3200
2      2 Ha412HOChr01   19000 21000  2000 XP_02356 Ha412HOChr01      99.86       189        2        1   18900 19800
3      3 Ha412HOChr08    2000 11000  9000 XP_05788 Ha412HOChr08      93.29        38        0        0    7600  8210

Please do not use special characters in column headers (%, space, dot etc).

ADD REPLY
0
Entering edit mode

This is purely informatics based question, rather bioinformatics!

When in doubt with tabular data, pandas is best used.

ADD REPLY
0
Entering edit mode
2.7 years ago
Tm ★ 1.1k

You can use intersectbed for this purpose. It can work on BAM/BED/GFF/VCF formats. However if you don't have these formats as can be seen by file shared by you, then try to make files in a way that you have IDs to be compared in 1st column and start, end in 2nd and 3rd column and rest information in following columns.

ADD COMMENT

Login before adding your answer.

Traffic: 1954 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6