Hi everyone, I'm carrying on a whole genome alignment analysis between two assemblies belonging to the same specie and obtained with different sequencing methods. In brief, they show a great different in the estimated genome size, with one of them being more in agreement with what expected from flow cytometry and I was interested in characterize the regions of the bigger one (here called as the query genome) that cannot be aligned with the smaller one ( the reference genome). I have already aligned the two assemblies using Mummer v.4 and I have produced a coord file with the following code:
show-coords -THrcl out.qdelta > out.qcoord
In the output are present the alignment coordinates of the reference and the query in a format like this:
[S1] [E1] [S2] [E2] [LEN 1] [LEN 2] [LEN R] [LEN Q] [TAGS]
5080 5651 12880 13453 572 574 58439140 50106 CM018522.1 tig00016503
5083 6078 39063 40043 996 981 58439140 62467 CM018522.1 tig00016642
5217 6079 20844 21712 863 869 58439140 22868 CM018522.1 tig00015536
5330 6079 36215 35465 750 751 58439140 61697 CM018522.1 tig00015298
where [S1] and [E1] represent respectively the start and the end alignment coordinates of the reference contig/scaffold/chromosome while [S2] and [E2] of the query contig/scaffold/chromosome.
I was wondering if there were a fast way to produce something like a BED file with the coordinates of the genomic regions that do not overlap with these intervals from which I can make a bit of filtering and extract them. E.g. Lets suppose that the contig named "tig00016503" ( first row) is 15,000 pb in length, the expected output should be something like this:
tig00016503 1 12879
tig00016503 13454 15000
I have made some attempts with some bash codes and the intersect function from the bedtools package, but I have not found an easy way without producing a large amount of intermediate files.
Thank you very much in advance for your help!
Thank you a lot, I completely miss that bedtools function! To much text for such a simple answer.