Question

List of well-separated hg19 genes (bed)

0

Entering edit mode

7.2 years ago

bradbenjamin526 • 0

I am trying to use deeptools to analyze ChIP-seq data to characterize elongation behavior of Pol II and other factors. However, I have been unable to find a bed file of hg.19 genes that are well separated from neighbors. I would like the cut-off upstream of promoter and downstream of TES to be ~5kb. I have used the UCSC table browser to get a list of all hg19 genes, which enables me to perform some decent analysis, but I can tell my data is being distorted by noise from neighboring genes.

Is there a way to get a list of such well-separated hg19 genes? If not UCSC table browser, is there another resource that people use for getting lists of genes?

ChIP-Seq gene lists transcription Pol II • 2.5k views

ADD COMMENT • link updated 7.2 years ago by Alex Reynolds 35k • written 7.2 years ago by bradbenjamin526 • 0

0

Entering edit mode

What do you mean by "well-separated" genes? Genes are where they are at (to the best of our current knowledge). Unless you start cherry-picking ones you like how can they be well-separated?

ADD REPLY • link 7.2 years ago by GenoMax 141k

0

Entering edit mode

Thanks for the quick response!!

Very true, but genes that are close together screw up genome-wide ChIP profiles, especially genes that are arranged head-to-head or tail-to-tail. I guess examples are the best way to explain what I'm trying to say!

Check out the gene TRIB1: http://imgur.com/jk8beqp. It has no genes within 8 kb upstream or downstream. You can ascribe the ChIP signal seen in both Pol II and Pol II CTR phospho-Ser2 (pS2) to discrete events within transcription: initiation, elongation, termination. There's a peak of total Pol II at the promoter, and a second small peak in the termination zone downstream of the end of the gene. This second peak corresponds to Ser2 phosphorylation, which only has a single peak. This is a good model gene to understand phosphorylation.

Alternatively, check out Cdk4: http://imgur.com/oWsgbvq. We cannot see a termination zone peak of Pol II because it is obscured by the promoter-proximal peak of TSPAN31. We can't necessarily assign the pS2 peak to the termination zone of Cdk4, because we also have to consider behavior on TSPAN31. This would be a good gene to NOT include in a genome-wide profile.

So, I would like to get a big list of genes that include the well-separated genes that have no neighboring or overlapping genes, so that we can make confident calls of Pol II and TF behavior. I don't think this is cherry-picking, I think it is providing clarity. Would you agree?

ADD REPLY • link 7.2 years ago by bradbenjamin526 • 0

score 1 · Answer 1 · 2017-02-03

If you want to combine the genes if they are very close to each other, such that all transcript in the file will have a minimum distance from each other, you can use bedtools merge. The "-d" argument allows to you control how close they are if you want to merge them.

If you want to find genes that have a good distance from the other genes, you can first use bedtools complement. This allows you to locate regions without genes in the file. Next, you can use some scripts, for exmaple Python or R, to find all large consecutive complements that occur in your file. The genes that you want are all located in the middle of these large consecutive complements.

Here is a simple R script for this purpose:

ComFile <- as.matrix(read.table("YourComplement.bed", sep="\t", header=F))

#The genes' minimum distance from the other genes. (Edit this line)
minDistance <- 3000

#Chr, Start and End of your wanted genes
Chr <- c()
Start <- c()
End <- c()
Previous_Com_Length <-  as.numeric(ComFile[1,3]) - as.numeric(ComFile[1,2])
Previous_Chr <-  as.character(ComFile[1,1])
for (i in 2:nrow(ComFile)) {
    This_Com_Length <-  as.numeric(ComFile[i,3]) - as.numeric(ComFile[i,2])
    This_Chr <- as.character(ComFile[i,1])
    if (This_Chr == Previous_Chr) {
        if (Previous_Com_Length >= minDistance && This_Com_Length >= minDistance) {
            Chr <- c(Chr, This_Chr)
            Start <- c(Start, as.numeric(ComFile[i-1,3]))
            End <- c(End, as.numeric(ComFile[i,2]))
        }
    }
    Previous_Com_Length <- This_Com_Length
    Previous_Chr <- This_Chr
}

Answer <- data.frame(Chr=Chr, Start=Start, End=End)

write.table(Answer, "YourGenes.bed", quote=FALSE, sep="\t", col.names=FALSE, row.names=FALSE)

Note: the reason I prefer using bedtools complement, instead of directly working on the current gene list, is because you can provide chromosome lengths to complement first. This makes the script a bit less complicated.

I hope this helps,

score 1 · Answer 2 · 2017-02-03

If you want a list of genes that are at least 5kb from adjacent genes, you could combine BEDTools 'flank' and 'subtract' commands (note: requires gene start and end positions in BED or GFF format, plus GENOME file):

bedtools flank -i GENES.BED -g GENOME -b 5000 > FLANKS.BED
bedtools subtract -a GENES.BED -b FLANKS.BED -A > SEPARATED_5K.BED

score 0 · Answer 3 · 2017-02-03

Via wget, awk and BEDOPS sort-bed, you can quickly build a sorted BED file of RefSeq genes:

$ wget -qO- http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/refGene.txt.gz | gunzip -c | awk '{ print $3"\t"$5"\t"$6"\t"$13"\t.\t"$4 }' | sort-bed - > refseq.hg19.bed

Once you have a sorted BED file of genes, you can use the following BEDOPS one-liner to quickly filter out genes that overlap within 5kb:

$ bedmap --delim '\t' --count --echo --range 5000 refseq.hg19.bed | awk '$1==1' | cut -f2- > refseq.hg19.filtered.bed

Then you can use refseq.hg19.filtered.bed as an input for your downstream work.

Unix standard input and output streams are a powerful way to string Unix and BEDOPS tools together to get results fast, without wasting time and energy making a lot of intermediate files that you have to clean up, afterwards.