You can browse the ENCODE project ChIP-Seq data for all available genomic sites using the UCSC Genome Browser; you will also find other hints about possible regulatory elements from the same project. For reference, see http://www.ncbi.nlm.nih.gov/pubmed/21526222
In addition to the good reasons listed above for why binding can be significant outside of promoter regions, there is also the bad reason: experimental error.
Depending on the TF, it can certainly bind outside of the proximal enhancer/promoter regions. The androgen receptor and estrogen receptor are great examples of this. I think MYC will bind to intergenic sites as well. There is a solid body of literature on ChIP-Seq with these factors. The ER papers on ChIA-PET clearly demonstrate TF binding at sites very far away from promoters. Also, the suggestion to look at ENCODE is an excellent one.
Yes, however ChIP-Seq is a high-throughput experiment which generates a large number of results, and thus provides many opportunities for rare errors to occur; and like anything else, it is more difficult to perform in vivo than in vitro. I just wanted to remind Dr. Ferguson to evaluate his results with a healthy dose of skepticism. :-)
Ok great discussion guys! What do you guys think about using motif analysis to reduce potential false positives for ChIP-Seq transcription factor binding? In theory, one would assume that 50-70% of the positive TF peaks from the ChIP-Seq contain the factor's motif sequence. A simplistic example would be that 50% of peaks from a GR ChIP-Seq should ideally contain consensus GRE sites. What do you guys think? Additionally, would it not make sense to rank those peaks based on their proximity to the TSS?
Dr Ferguson, I think it is certainly worth comparing motif positions to your ChIP-Seq results, if you have well-defined motifs. It's my impression that most ChIP-Seq papers that have motifs in hand do such comparisons, however I'm not sure how much correspondence one expects to find between the two. A motif in the DNA represents a potential for binding, if the cell is in an appropriate state; no organism is likely to have cells in every appropriate state at once, so one can expect some potential binding sites not to be occupied. There are also other determinants for protein localization on DNA, such as chromatin state and where that particular segment of DNA lies in the nucleus; in general there are frequently many more occurrences of a binding motif than are believed to be used in any cell state. In the published studies I've seen, the regions identified by ChIP-Seq are a low fraction of all occurrences of a binding motif in the genome, and the possibility that such frequent binding motifs cooccur with the ChIP-Seq results by chance must be taken into account.
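To make the chance co-occurrence point concrete, here is a minimal sketch of one way to ask whether motif-containing regions are over-represented among ChIP-Seq peaks: a one-sided hypergeometric test. The framing (a pool of candidate regions made up of the peaks plus matched background regions) and all of the counts below are assumptions for illustration, not values from any study.

```python
from math import comb


def hypergeom_upper_tail(k, n_peaks, n_motif_regions, n_total):
    """P(X >= k) when n_peaks regions are drawn without replacement from
    n_total regions, of which n_motif_regions contain the motif."""
    upper = min(n_peaks, n_motif_regions)
    total = comb(n_total, n_peaks)
    # math.comb returns 0 when the second argument exceeds the first,
    # so impossible combinations contribute nothing to the sum.
    favourable = sum(
        comb(n_motif_regions, i) * comb(n_total - n_motif_regions, n_peaks - i)
        for i in range(k, upper + 1)
    )
    return favourable / total


# Hypothetical counts: 10,000 candidate regions, 2,000 of them with a motif
# occurrence; 500 regions are ChIP-Seq peaks and 150 of those carry the motif.
# By chance alone one would expect about 500 * 2000 / 10000 = 100.
p_value = hypergeom_upper_tail(150, 500, 2000, 10000)
print(f"P(at least 150 motif-containing peaks by chance) = {p_value:.3g}")
```

The result depends heavily on how the background regions are chosen (length, GC content, mappability), so this is only a starting point for judging whether the overlap is more than chance.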
It would certainly be possible to use motif occurrence to limit the set of binding loci you identified by ChIP-Seq to those containing a motif occurrence; however if you do that, any position the protein actually binds that is not identified as containing an occurrence of the motif is erroneously excluded. The level of error you will encounter is strongly protein-dependent, because different proteins have stronger or weaker dependence on the DNA sequence to determine binding. Motifs are usually identified with a scoring matrix and a cutoff to convert the scores to "present/absent" decisions; if you are going to restrict interest to loci containing motifs, it might be useful to determine how well motif scores correlate with the ChIP-Seq loci. If there is a good correspondence between high scores and ChIP-Seq identification, perhaps your motif is a useful screen (and perhaps you could even use the decay of that correlation to choose an appropriate motif score cutoff, although that might be a bit of circular reasoning on my part). However, if I had to guess up front, I would predict you won't find the correlation as strong as you might want.
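As a rough illustration of the "scoring matrix plus cutoff" idea, here is a small sketch that builds a log-odds matrix from base counts, takes the best-scoring window in a sequence, and converts it to a present/absent call. The count matrix, the uniform background, the pseudocount, and the cutoff are all made-up assumptions; a real analysis would use a curated matrix for the factor, scan both strands, and pick the cutoff empirically.

```python
import math

# Hypothetical base counts for a toy 4-bp motif with consensus ACGT.
COUNTS = [
    {"A": 8, "C": 1, "G": 1, "T": 0},
    {"A": 1, "C": 8, "G": 0, "T": 1},
    {"A": 0, "C": 1, "G": 8, "T": 1},
    {"A": 1, "C": 0, "G": 1, "T": 8},
]


def build_log_odds(counts, pseudocount=1.0, background=0.25):
    """Convert per-position base counts into log2 odds against a uniform background."""
    matrix = []
    for column in counts:
        total = sum(column.values()) + 4 * pseudocount
        matrix.append(
            {
                base: math.log2((column[base] + pseudocount) / total / background)
                for base in "ACGT"
            }
        )
    return matrix


def best_motif_score(sequence, matrix):
    """Best score over all windows on the given strand, or None if no window fits."""
    width = len(matrix)
    best = None
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(base not in "ACGT" for base in window):
            continue  # skip windows containing N or other ambiguity codes
        score = sum(matrix[i][base] for i, base in enumerate(window))
        if best is None or score > best:
            best = score
    return best


def has_motif(sequence, matrix, cutoff):
    """Present/absent decision: does any window score at or above the cutoff?"""
    best = best_motif_score(sequence, matrix)
    return best is not None and best >= cutoff


log_odds = build_log_odds(COUNTS)
print(has_motif("TTACGTTT", log_odds, cutoff=4.0))
print(has_motif("TTTTTTTT", log_odds, cutoff=4.0))
```

Keeping `best_motif_score` separate from the cutoff makes it easy to compare the raw scores against ChIP-Seq peak strength before committing to any threshold.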
The question about how to use the position of the site relative to the promoter region is a thorny one. I'm a computational biologist, so I deal with the question of how to score various attributes of sequence sites all the time. There are statistical reasons for scoring each attribute of a site, including its position, on how frequently the site has that attribute (actually, the score should be the log of the frequency (probability) so that adding scores from independent attributes corresponds to multiplying the probabilities). Thus, whether it is a good idea to rate sites by some function of their distance from the transcription start site depends on whether a bound protein tends to have the property you're interested in (e.g. whether it activates or represses transcription) more frequently when it is close to the transcription start site (TSS) than when it is far away.
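A tiny sketch of the log-score point: if each attribute of a site is scored as the log of its probability, then adding the scores is the same as multiplying the probabilities, provided the attributes are independent. The attribute names and probabilities here are invented purely for illustration.

```python
import math


def log_score(probabilities):
    """Sum of log probabilities; equals the log of their product."""
    return sum(math.log(p) for p in probabilities)


# Hypothetical per-attribute probabilities for a single site.
site_attributes = {
    "motif_match": 0.02,
    "position_class": 0.30,
    "chromatin_state": 0.10,
}

total = log_score(site_attributes.values())
product = math.prod(site_attributes.values())

# Adding log scores matches taking the log of the multiplied probabilities.
print(math.isclose(total, math.log(product)))
```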
I seriously doubt that distance is strongly correlated to activity for transcription factors, at least over more than 1 kb. In the cases that I've seen studied in detail, there is a proximal, basal promoter region that is typically just upstream of the TSS; in complex multicellular organisms like mammals, there are also typically several regions containing binding sites for proteins that regulate various aspects of when and in what cell types transcription is activated. These are usually quite close to the basal promoter, however they can be either upstream or downstream of the basal promoter, overlapping the first exon and/or intron, and if that first intron is short (which it usually isn't), may extend even farther into the transcribed DNA. In fact, in contrast to the fact that most computational studies tend to treat the first kb upstream of the TSS as "the promoter region", large-scale ChIP-Seq studies have actually tended to show that transcription factor binding sites are slightly _more frequent_ immediately downstream of the TSS than they are immediately upstream.
Beyond these proximal promoter regions, many if not most genes are also regulated by binding at enhancer sites; proteins bound at enhancers come into contact with proteins in the proximal promoter region to influence transcription. As far as I know, there is no significant relationship between an enhancer's distance from the promoter along the chromosome and its functional relevance. Chromatin is structured into loops within loops at various levels, and this higher-order structure appears to be regulated; thus distance along the linear chromosome is a very poor representation of how close an enhancer is, in the nucleus, to the proximal promoter region.
The upshot of this comment, which is already too long, is this: if one knows what the proximal promoter region actually is for a particular gene, then it makes sense to rate sites within that region higher than more distant sites; however within the proximal promoter I am not convinced that "closer is better", and outside the proximal promoter I am convinced that distance has no relevance at all. The standard assumption (among computational biologists) that the promoter region is the first kb upstream of the TSS is convenient, but isn't supported by any data I know of.
Thank you for your excellent points! Several counterpoints:
"I seriously doubt that distance is strongly correlated to activity for transcription factors"
Check out the paper below... they essentially illustrate a strong correlation between proximity to the TSS and gene expression up to 10 kb from the TSS:
A quantitative model of transcriptional regulation reveals the influence of binding location on expression.
MacIsaac KD, Lo KA, Gordon W, Motola S, Mazor T, Fraenkel E.
I agree with you. I generally count the proximal promoter as within 250 bp upstream or downstream of the TSS... ideally I like to do the analysis at 250 bp, 1 kb, and 3 kb from the TSS.
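For what it's worth, the 250 bp / 1 kb / 3 kb binning can be sketched roughly like this. It assumes peak summits and TSS positions on a single chromosome, a non-empty TSS list, and unsigned distance (strand is ignored, so upstream and downstream are not distinguished); all coordinates are made up.

```python
from bisect import bisect_left
from collections import Counter

WINDOWS = (250, 1000, 3000)  # distance cutoffs in bp


def distance_to_nearest_tss(peak_pos, sorted_tss):
    """Unsigned distance from a peak summit to the closest TSS (list must be sorted)."""
    i = bisect_left(sorted_tss, peak_pos)
    # Only the TSS just before and just after the peak can be the nearest one.
    candidates = sorted_tss[max(i - 1, 0):i + 1]
    return min(abs(peak_pos - tss) for tss in candidates)


def window_label(distance, windows=WINDOWS):
    """Smallest window that contains the distance, or a catch-all label beyond the last."""
    for width in windows:
        if distance <= width:
            return f"<={width}bp"
    return f">{windows[-1]}bp"


def count_by_window(peaks, tss_positions):
    """Tally peaks by the smallest distance window they fall into."""
    sorted_tss = sorted(tss_positions)
    return Counter(
        window_label(distance_to_nearest_tss(peak, sorted_tss)) for peak in peaks
    )


# Hypothetical coordinates on one chromosome.
tss_positions = [10000, 50000]
peak_summits = [10100, 10800, 47500, 30000]
print(count_by_window(peak_summits, tss_positions))
```

Each peak is counted once, in the tightest window it fits, so the bins are mutually exclusive rather than cumulative.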
In regard to the enhancer point... I guess it would then make sense to pair that transcription factor ChIP-Seq with a chromatin mark for enhancers (H3K4me1) or an enhancer-associated coactivator (p300/CBP)?
Thanks for your excellent insights and comments Max!
Vladimir, I don't have one handy; the comment is based on a talk I attended a few years ago at the University of Washington, in which a graph of ChIP-Seq hits for a variety of transcription factors was plotted relative to the nearest TSS. Unfortunately I can't recall who the speaker was--someone from the Broad Institute, if I recall correctly--but I asked about the pattern and he said they had noted it, too.
I will look, and if I can find a reference I'll post it here.
I think it depends on the transcription factor. CTCF, as the clearest example, marks insulators, from Drosophila to man: http://www.plosgenetics.org/article/info%3Adoi%2F10.1371%2Fjournal.pgen.0030112 http://www.ncbi.nlm.nih.gov/pubmed/22244452
Based on what I have been able to find, I should retract my statement. In the first place, I can't recall whether the graph I saw referred to ChIP-Seq or ChIP-chip data, and there are some good analyses of ChIP-chip data from Eytan Domany's group (e.g. Nucleic Acids Research, 2008, Vol. 36, No. 21 6795–6805) that show that probe density on the chip can make plots of raw occupancy data have the appearance I described, but after careful analysis the data supports highly localized binding _upstream_ of the TSS. I haven't read this just-published paper ("Integration of 198 ChIP-seq Datasets Reveals Human cis-Regulatory Regions", see http://online.liebertpub.com/doi/abs/10.1089/cmb.2012.0100), but it is an analysis across a large number of transcription factors and should be particularly relevant.