Question

Deleted: How can I find matches between two CSV files according to multiple columns using awk/bash?

2.7 years ago

Gargantu8 • 0

Hello! I am working on identifying genomic regions that code for notable genes that have been knocked out completely in my samples. File 1 is a list of locations in the genome where notable genes occur. File 2 is a list of locations in my samples' genomes that have complete knockouts (2 means diploid/expected, 0 means no copies of that region). Columns 6-23 correspond to my 18 samples.

I want to compare File 1 and File 2 in two steps:

First, match "subject id" in File 1 with "seqnames" in File 2.
Then, check whether File 1's s.start to s.end range overlaps or falls within File 2's start to end range.

If such matches occur, I would like to print out columns 2-23 from File 2, and then every column from File 1 to the right of that.
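For illustration, here is a minimal sketch of how the two steps above might look in awk. It rests on several assumptions: both files are plain comma-separated with a single header row and no quoted commas, the header names are spelled exactly as written in this post, and File 1 is small enough to hold in memory. The sample data and the `overlap_join` function name are made up for the demonstration. Columns are located by header name rather than position, and `first=2` reflects the request to print File 2 starting at column 2.

```shell
#!/usr/bin/env bash
# Sketch only: tiny made-up inputs stand in for the real files.
set -euo pipefail
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT

# Hypothetical File 1 (gene locations); header names follow the question.
cat > "$tmp/file1.csv" <<'EOF'
query id,subject id,s.start,s.end
geneA,chr1,150,250
geneB,chr1,900,950
geneC,chr2,80,40
EOF

# Hypothetical File 2 (knockout regions), trimmed to two sample columns.
cat > "$tmp/file2.csv" <<'EOF'
seqnames,start,end,width,strand,S1,S2
chr1,100,200,101,*,0,2
chr2,1,50,50,*,2,0
EOF

# overlap_join FILE1 FILE2: hypothetical helper, not an existing tool.
overlap_join() {
  awk -F',' -v OFS=',' -v first=2 '
    # Return the 1-based index of the header cell equal to name, or 0.
    function col(name,   i) {
      for (i = 1; i <= NF; i++) if ($i == name) return i
      return 0
    }
    # File 1 header: find the key columns by name.
    NR == FNR && FNR == 1 {
      id = col("subject id"); s = col("s.start"); e = col("s.end")
      if (!id || !s || !e) { print "File 1: missing column" > "/dev/stderr"; exit 1 }
      next
    }
    # File 1 rows: store each gene; swap so lo <= hi (reverse-strand hits).
    NR == FNR {
      n++
      chrom[n] = $id
      lo[n] = ($s + 0 <= $e + 0) ? $s + 0 : $e + 0
      hi[n] = ($s + 0 <= $e + 0) ? $e + 0 : $s + 0
      row[n] = $0
      next
    }
    # File 2 header: find the key columns by name.
    FNR == 1 {
      sq = col("seqnames"); st = col("start"); en = col("end")
      if (!sq || !st || !en) { print "File 2: missing column" > "/dev/stderr"; exit 1 }
      next
    }
    # File 2 rows: print columns first..NF, then the whole File 1 row,
    # for every gene on the same sequence whose range overlaps.
    {
      out = $first
      for (i = first + 1; i <= NF; i++) out = out OFS $i
      for (g = 1; g <= n; g++)
        if (chrom[g] == $sq && lo[g] <= $en + 0 && hi[g] >= $st + 0)
          print out, row[g]
    }
  ' "$1" "$2"
}

overlap_join "$tmp/file1.csv" "$tmp/file2.csv"
```

The overlap test `lo <= end && hi >= start` covers both partial overlap and full containment, so no separate "falls within" check is needed. Every region is compared against every stored gene, which is fine for short lists but would be slow for very large inputs.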

The closest solution I have found so far is an awk solution on StackExchange. In this post, "Y" would be the equivalent of "seqnames/subject id" and Ymin/Ymax would correspond to "start/end" from File 2. Y1 would be similar to "s.start/s.end" from File 1. I just haven't had success in expanding on that for my solution.

I am thinking an awk/bash solution similar to the StackExchange post would make the most sense, but any and all suggestions would be greatly appreciated! Cheers.

An example output would be something like this:

Output Example

*I made up this example. "seqnames" from File 2 matches "subject id" from File 1. "s.start" and/or "s.end" from File 1 overlaps/falls within "start" and "end" from File 2. Because of this match,

File 1: AnthocyaninGeneLocationsExample

File 2: cn.MOPSLocationsExample

awk genomics bash • 485 views
