Hello! I am working on identifying genomic regions that code for notable genes that have been knocked out completely in my samples. File 1 is a list of locations in the genome where notable genes occur. File 2 is a list of locations in my samples' genomes that have complete knockouts (2 means diploid/expected, 0 means no copies of that region). Columns 6-23 correspond to my 18 samples.
I want to compare File 1 and File 2 in two steps:
- First, match "subject id" in File 1 with "seqnames" in File 2.
- Then, check whether File 1's s.start-s.end range overlaps or falls within File 2's start-end range.
If such matches occur, I would like to print out columns 2-23 from File 2, and then every column from File 1 to the right of that.
The closest solution I have found so far is an awk solution on StackExchange. In that post, "Y" would be the equivalent of "seqnames/subject id", and Ymin/Ymax would correspond to "start/end" from File 2. Y1 would be similar to "s.start/s.end" from File 1. I just haven't had success in adapting it to my files.
I am thinking an awk/bash solution similar to the StackExchange post would make the most sense, but any and all suggestions would be greatly appreciated! Cheers.
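To make the matching rule concrete, here is a rough Python sketch of the logic described above. It is only an illustration under several assumptions that are not confirmed by my real files: both files are tab-separated with one header line, "seqnames", "start" and "end" are columns 2, 3 and 4 of File 2, and "subject id", "s.start" and "s.end" are columns 2, 9 and 10 of File 1. The function names (`overlaps`, `join_overlaps`, `read_rows`) and the column constants are made up for this example and would need adjusting.

```python
import csv
import sys
from collections import defaultdict

# ASSUMED 0-based column positions; adjust these to the real file layouts.
F1_SUBJECT, F1_S_START, F1_S_END = 1, 8, 9   # File 1: subject id, s.start, s.end
F2_SEQNAMES, F2_START, F2_END = 1, 2, 3      # File 2: seqnames, start, end


def overlaps(a_start, a_end, b_start, b_end):
    """Return True if the two closed ranges share at least one position."""
    # Sort each pair in case a range is given in reverse order
    # (some tools report start > end for minus-strand hits).
    a_lo, a_hi = sorted((a_start, a_end))
    b_lo, b_hi = sorted((b_start, b_end))
    return a_lo <= b_hi and b_lo <= a_hi


def join_overlaps(file1_rows, file2_rows):
    """Yield File 2 columns 2-23 followed by every File 1 column per match."""
    # Group File 2 regions by sequence name so each gene is only compared
    # against regions on the same sequence.
    regions = defaultdict(list)
    for region in file2_rows:
        regions[region[F2_SEQNAMES]].append(region)

    for gene in file1_rows:
        for region in regions.get(gene[F1_SUBJECT], []):
            if overlaps(int(gene[F1_S_START]), int(gene[F1_S_END]),
                        int(region[F2_START]), int(region[F2_END])):
                # region[1:23] is columns 2-23 in 1-based numbering.
                yield region[1:23] + gene


def read_rows(path):
    """Read a tab-separated file, skipping one header line (assumed)."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        next(reader, None)
        return [row for row in reader if row]


if __name__ == "__main__" and len(sys.argv) == 3:
    # Hypothetical usage: python overlap_sketch.py file1.tsv file2.tsv
    file1_rows = read_rows(sys.argv[1])
    file2_rows = read_rows(sys.argv[2])
    writer = csv.writer(sys.stdout, delimiter="\t")
    writer.writerows(join_overlaps(file1_rows, file2_rows))
```

The overlap test treats a File 1 range that falls entirely within a File 2 range as a special case of overlap, so both conditions are covered by one comparison. A gene that overlaps several File 2 regions produces one output line per region.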
An example output would be something like this: *I made up this example. "seqnames" from File 2 matches "subject id" from File 1. "s.start" and/or "s.end" from File 1 overlaps/falls within "start" and "end" from File 2. Because of this match,
File 1:
File 2: