I have a list of RefSeq transcription start sites (TSS) obtained from the UCSC Table Browser. I want to remove any TSS that lies within X bp of another TSS on the same list. This would include both genes with multiple TSS and genes on opposite strands that are in a head-to-head arrangement. I say X bp because I will be filtering at different distances depending on the analysis. How do I go about this? Is there a tool?
Ideally, I would be able to separately filter out instances of multiple TSS for a given gene and nearby TSS from a gene on the opposite strand.
I apologize as this has probably been answered before, but I have been unable to uncover an answer using the search function.
As you can see, in this example there are two genes oriented head to head in close proximity. They both happen to be paralogs of Hsp70, but they could be any two genes. One gene is on the + strand, the other on the - strand. The TSS of the two genes in this example are close together, less than 2 kb apart. Both TSS would show up in a list of RefSeq TSS for Drosophila.
The data I am currently working with are reads from a ChIP-seq experiment in mouse. While the mouse genome is less compact, there are certainly similar examples. These types of genes can yield specific artifacts in my analysis, and I need to filter them out.
The TSS list I am working with is structured as follows:
#name	chrom	strand	txStart
NM_001008533	chr1	-	134199214
NM_001039510	chr1	-	134199214
NM_001282945	chr1	-	134199214
NM_175642	chr1	-	25067475
This list contains genes from both strands; however, the first five lines all happen to be - strand TSS. You can also see the second case I want to be able to filter out: genes with multiple TSS, as with the gene represented in the first three lines of this sample file.
Both situations, genes that lie near other genes in the genome and genes with multiple TSS, need to be filtered out for my analysis.
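To make the request concrete, here is a rough Python sketch of the kind of filter I have in mind. It is only an illustration under my own assumptions, not an existing tool: the function name `filter_isolated_tss` and its `mode` argument are made up, transcripts that share an identical chrom, strand, and txStart are collapsed into a single site rather than counted as neighbors of each other, and because my table has no gene symbol column, "same strand" stands in for "same gene", which is only an approximation.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


def filter_isolated_tss(records, max_dist, mode="any"):
    """Keep TSS sites that have no other distinct TSS within max_dist bp.

    records: iterable of (name, chrom, strand, tx_start) tuples.
    mode: "any"      -> compare against TSS on both strands,
          "same"     -> compare only against TSS on the same strand,
          "opposite" -> compare only against TSS on the other strand.
    Returns a sorted list of (chrom, strand, tx_start, [names]).
    """
    # Collapse transcripts that share chrom, strand and position into one site.
    sites = defaultdict(list)
    for name, chrom, strand, pos in records:
        sites[(chrom, strand, int(pos))].append(name)

    # Sorted, unique positions per (chrom, strand) for binary search.
    positions = defaultdict(list)
    for chrom, strand, pos in sites:
        positions[(chrom, strand)].append(pos)
    for key in positions:
        positions[key].sort()

    def count_within(chrom, strand, pos):
        # Number of sites on (chrom, strand) with |site - pos| <= max_dist.
        lst = positions.get((chrom, strand), [])
        return bisect_right(lst, pos + max_dist) - bisect_left(lst, pos - max_dist)

    kept = []
    for (chrom, strand, pos), names in sites.items():
        other = "-" if strand == "+" else "+"
        same_n = count_within(chrom, strand, pos) - 1  # exclude the site itself
        opp_n = count_within(chrom, other, pos)
        if mode == "same":
            crowded = same_n > 0
        elif mode == "opposite":
            crowded = opp_n > 0
        else:
            crowded = (same_n + opp_n) > 0
        if not crowded:
            kept.append((chrom, strand, pos, names))
    return sorted(kept)
```

With the four sample rows above and `max_dist=2000`, this would keep two sites: chr1 - 25067475 (NM_175642) and chr1 - 134199214 (the three transcripts collapsed together). Running it once with `mode="same"` and once with `mode="opposite"` is how I imagine filtering the two situations separately.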
Thanks everyone for your time.