How to find and retrieve promoter sequences from genome databases (Source: http://www.protocol-online.org/forums/page/index.html/_/bioinformatics-and-biostatistics/how-to-find-promoter-sequence-for-methylation-s-r25)
Promoter sequences are usually the sequence immediately upstream of the transcription start site (TSS) or first exon. If we know the TSS of a gene, we can locate the promoter with reasonable confidence even without experimental characterization. For many organisms, such as human and mouse, the genome is well annotated and the TSS is well defined, so promoter sequence retrieval is an easy task. There are three major genome browsers: NCBI, Ensembl and UCSC. For our purpose, Ensembl provides the most convenient interface. Here is an example:
1. Go to the Ensembl website: http://www.ensembl.org/index.html
2. Choose an organism, such as human: http://www.ensembl.o...iens/Info/Index
3. Search for your gene, such as BRCA2: http://www.ensembl.o...ns;idx=;q=brca2
4. Click the right hit on the search result page and it will bring you to the gene summary page. For example, the link to the BRCA2 gene is http://www.ensembl.o...ns;idx=;q=brca2
5. On the left, under "Gene Summary", click "Sequence". The sequence of the gene, including the 5' flanking region, exons, introns and 3' flanking region, will be displayed.
6. The exons are highlighted with a pink background and red text. The sequence in front of the first exon is the promoter sequence.
7. By default, 600 bp of 5'-flanking sequence (promoter) is displayed. If you want more, click "Configure this page" in the lower left column. A popup window opens where you can enter the size of the 5' flanking sequence (upstream). You can enter, for example, "1000" and then save the configuration.
8. Sometimes there are discrepancies between the Ensembl and UCSC annotations regarding the TSS. To make sure the first exon given by Ensembl is right, copy the promoter sequence.
9. Go to the UCSC BLAT search at http://genome.ucsc.e...t?command=start, choose the right genome (e.g., human) and paste the sequence there. On the result page, click "browser" for the first hit; this will bring you to the genome browser page, where the query sequence is aligned with the UCSC genome sequence. Zoom out a bit and you will be able to determine whether the promoter sequence matches the UCSC annotation. If it matches, the sequence is very likely the right one. Here is the BRCA2 promoter sequence aligned to the BRCA2 gene.
10. In the UCSC genome browser, you can turn on the CpG island track. If there is a CpG island in the promoter sequence, the sequence is highly likely to be a true promoter. In the above example (BRCA2), a CpG island is displayed in the proximal promoter.
11. Beware that some genes have alternative promoters. Finding those sequences requires extensive bioinformatic and experimental analysis.
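Step 10 uses a CpG island as supporting evidence for a promoter. For a quick offline check of a sequence you have already copied, a rough sketch like the one below can help. The function names are hypothetical, and the thresholds (at least 200 bp, GC content of at least 50%, observed/expected CpG ratio of at least 0.6) are commonly cited criteria that I am assuming here; they do not come from this thread, and the test is applied to the whole sequence rather than with a sliding window as a browser track would do.

```python
def cpg_stats(seq):
    """Return (GC fraction, observed/expected CpG ratio) for a DNA string."""
    s = seq.upper()
    n = len(s)
    if n == 0:
        return 0.0, 0.0
    c = s.count("C")
    g = s.count("G")
    cpg = s.count("CG")  # "CG" cannot overlap itself, so count() is exact
    gc_fraction = (c + g) / n
    # Expected CpG count under independence is (C * G) / N.
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return gc_fraction, obs_exp


def looks_like_cpg_island(seq, min_len=200, min_gc=0.5, min_obs_exp=0.6):
    """Whole-sequence test against the assumed thresholds (not a window scan)."""
    gc_fraction, obs_exp = cpg_stats(seq)
    return len(seq) >= min_len and gc_fraction >= min_gc and obs_exp >= min_obs_exp
```

A sequence that fails this check can still be a real promoter, since not all promoters contain CpG islands; treat the result only as one piece of evidence alongside the annotation.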
If you are working on an organism that has a sequenced genome, you can just BLAST your gene against the genome and select the region upstream of the gene or miRNA. If you don't have a genome sequence, it's a bit trickier, as you will have to do genome walking to get the upstream sequence.
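When selecting the upstream region by hand, the easy mistake is the strand: for a gene on the minus strand, "upstream" lies at higher genomic coordinates and the sequence must be reverse-complemented. Here is a minimal sketch, assuming you already have the chromosome (or contig) sequence as a string and know the 1-based TSS position and strand. The function names and the toy sequence are made up for illustration.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Reverse-complement a DNA string (A, C, G, T, N; case is preserved)."""
    return seq.translate(_COMPLEMENT)[::-1]


def upstream_region(chrom_seq, tss, strand, flank=1000):
    """Return up to `flank` bases upstream of a 1-based TSS, read 5'->3'
    in the orientation of the gene. The region is clipped at sequence ends."""
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    if not 1 <= tss <= len(chrom_seq):
        raise ValueError("tss is outside the sequence")
    if strand == "+":
        # Plus strand: upstream means lower coordinates, ending just before the TSS.
        start = max(0, tss - 1 - flank)
        return chrom_seq[start:tss - 1]
    # Minus strand: upstream means higher coordinates, then flip to gene orientation.
    end = min(len(chrom_seq), tss + flank)
    return reverse_complement(chrom_seq[tss:end])


toy = "ACGTTGCAAGGCCTTA"
print(upstream_region(toy, 9, "+", flank=4))  # TGCA
print(upstream_region(toy, 8, "-", flank=4))  # GCCT
```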
You could use a genome browser (a platform like Artemis gives you examples on how to load the mouse and human genomes: http://www.sanger.ac.uk/resources/software/artemis/ngs/) and then copy a region downstream of the open reading frame (ORF) of your gene/miRNA of interest. This is important: it has to be downstream of the ORF, since you are trying to go in the 3'->5' direction on the mRNA of your gene of interest. Then use something like the NCBI online tool for designing the primers: http://www.ncbi.nlm.nih.gov/tools/primer-blast/. This helps ensure that your primer will not form dimers, etc.
Once you are done with this, you will have to run your reaction on a purified mRNA sample. A protocol like Rapid Amplification of cDNA Ends (RACE) would get the job done. In your reaction mix, your primer will bind downstream of the ORF on the mRNA of your gene/miRNA of interest and extend from there in the upstream direction until it falls off. It falls off at the 5' end of the transcript, which is the transcription start site; the promoter lies immediately upstream of that point in the genome. Then you just sequence and align the PCR product. I have done this process many times on bacterial genomes, but the idea is still the same. Here is the RACE protocol for eukaryotes: http://www.biomarket.cc/UpFiles/product/ExactSTART%20Eukaryotic%20mRNA%205-%20&%203-RACE%20Kit.pdf
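Because the primer has to anneal to the mRNA and extend toward its 5' end, it must be the reverse complement of the transcript (sense) sequence you copied. The sketch below shows only that step, plus a rough melting temperature from the Wallace rule (2 °C per A/T, 4 °C per G/C), which is a crude estimate meant for short oligos. The function names are hypothetical, and this is no substitute for a proper design tool such as Primer-BLAST.

```python
def reverse_complement(seq):
    """Reverse-complement a DNA string containing only A, C, G, T."""
    pairs = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(pairs[base] for base in reversed(seq.upper()))


def antisense_primer(sense_target):
    """Primer that anneals to the given stretch of mRNA.

    `sense_target` is the transcript sequence, written as RNA (U) or DNA (T).
    """
    dna = sense_target.upper().replace("U", "T")
    return reverse_complement(dna)


def wallace_tm(primer):
    """Rough Tm in degrees C by the Wallace rule: 2*(A+T) + 4*(G+C)."""
    p = primer.upper()
    at = p.count("A") + p.count("T")
    gc = p.count("G") + p.count("C")
    return 2 * at + 4 * gc


primer = antisense_primer("AUGGCC")
print(primer, wallace_tm(primer))  # GGCCAT 20
```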
I usually use the UCSC genome browser to get DNA sequences: http://genome.ucsc.edu
You can select your animal species and look for your gene of interest. When you click on the gene, you go to its genomic sequence, and there you can find the sequence of the gene promoter.
The promoter sequence can be determined by combining several strategies: RNA-seq data, ChIP-seq data and gene prediction. RNA-seq data together with gene annotation help identify the transcription start site. ChIP-seq data help determine the promoter regions bound by transcription factors. Finally, sequence conservation can support the identification of the promoter. This approach is straightforward for protein-coding genes, but for miRNA genes it is harder and requires care: many miRNA primary transcripts are very long, so the promoter can lie a long distance from the mature miRNA.
We have developed a technique called Bru-Seq (PNAS, 110:2240-45, 2013), based on the labeling and analysis of newly made RNA, that allows you to map the start sites of even unstable RNAs such as primary miRNA transcripts.
As was already mentioned, when nothing is known about a particular promoter, people usually select 2-3 kb upstream of the ORF and assume that this region contains the promoter. There are some software tools that allow you to predict specific motifs in the DNA that might be binding regions for transcription factors. Usually, by deleting these specific regions, people can assess what the regulators of a particular gene are. Often people express the putative promoter region with consecutive deletions in order to narrow down the region that is necessary to trigger transcription. This is a very general reply because I work in plants, but if I can help with something in more detail, let me know. Good luck.
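To plan the consecutive deletions mentioned above, it can help to list the fragments in silico first. This is a minimal sketch, assuming the putative promoter is given 5'->3' with its last base just upstream of the TSS (or ORF), so every fragment keeps the same 3' end and loses sequence from the 5' side. The function name, step size and toy sequence are made up for illustration.

```python
def five_prime_deletions(promoter, step, min_length):
    """Return (label, fragment) pairs for a 5' deletion series.

    Each fragment keeps the 3' (TSS-proximal) end of `promoter` and is
    `step` bases shorter than the previous one, stopping at `min_length`.
    The label is the 5' boundary relative to the 3' end (e.g. -500).
    """
    if step <= 0:
        raise ValueError("step must be positive")
    fragments = []
    for length in range(len(promoter), min_length - 1, -step):
        fragments.append((-length, promoter[-length:]))
    return fragments


toy_promoter = "A" * 10 + "C" * 10 + "G" * 10
for label, fragment in five_prime_deletions(toy_promoter, step=10, min_length=10):
    print(label, fragment)
```

With the toy input this lists three constructs, labeled -30, -20 and -10. In a real experiment each fragment would be cloned upstream of a reporter, and the loss of activity between two neighbouring constructs points to the region needed for transcription.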
Hi Dr. Moustapha. Thank you for the comprehensive explanation. Having followed all the steps you outline for confirming whether the promoter sequence retrieved from Ensembl matches the one in the UCSC browser, I would like to ask how this match is determined. Is it by comparing my sequence from the BLAT search with the RefSeq gene? When I followed the example you gave with the BRCA2 gene, the RefSeq gene I get from the UCSC BLAT output is ZAR1L... and I retrieved my promoter from the BRCA2 gene... does it mean I did not identify the correct BRCA2 gene? The Ensembl ID for the BRCA2 gene I retrieved is ENSG00000139618. Thank you
I am trying to find the promoter of a miRNA. Reading different papers, I saw that some authors use the genomic sequence (as downloaded from the 5' UTR in Ensembl), while others use miRStart to download sequences upstream of the TSS. But which TSS is the best, given that different TSSs could be involved depending on tissue specificity or other parameters? Moreover, promoter sequences from Ensembl always match some sequence upstream of a TSS (from miRStart). Is it correct to choose the simpler method, such as the Ensembl promoters?