Is it valid to normalize gene expression to the average between biological repeats? Could that approach be used for large data sets where most genes show some degree of variation, or is there a strict requirement for a reference gene?
There is no law regulating/dictating what to use as the reference to normalize on. For gene expression, the aim of normalization is to make different samples comparable if you can't ensure that the same amount of (biological) material was used in all reactions. Thus you may normalize to anything that is linearly related to the amount of (biological) material. It is irrelevant whether this reference is the cell number, the content of total RNA, ribosomal RNA, DNA or protein, the number of ribosomes, the volume of cytoplasm, the expression of another gene, or the average of several different protein contents or gene expressions. Whatever you can think of (but some things are easier to determine than others!). Only - of course! - the amount *must not* depend on the treatment or group structure of your experiment.
As Didier said, normalizing to the expression of reference genes is hoped to decrease the inter-experiment variance (between PCR cyclers, master mixes, detection chemistries, ...), because these influences should cancel out if both the gene of interest and the reference are measured within the same experiment(al setup).
For large (e.g. whole-array) datasets, there are a number of normalization algorithms that normalize to a baseline (as opposed to something like the quantile normalization used in RMA/GCRMA). Kernel-density-type normalizations use a baseline, for example (as does any transformation+normalization scheme of that kind). Which baseline is used is up to you: a reference sample, the mean or median of controls or of some other group of samples, or the mean or median of all samples. There is no universal "best", and when using a baseline normalization approach one often needs to try a few variations to see which performs best (you can use graphics like MVA plots, for example, to see the effect of a specific normalization scheme).
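To make the baseline idea concrete, here is a minimal sketch of one simple variant (median scaling against a chosen reference column, or against the per-gene median over all samples as a pseudo-baseline). The function name and the genes × samples layout are assumptions for illustration, not any package's API, and this is deliberately simpler than a kernel-density scheme:

```python
import numpy as np

def baseline_scale(expr, baseline_col=None):
    """Scale each sample (column) so its median matches the baseline median.

    expr:         genes x samples matrix of intensities (linear scale)
    baseline_col: column index of the reference sample; if None, the
                  per-gene median over all samples is the pseudo-baseline
    """
    expr = np.asarray(expr, dtype=float)
    if baseline_col is None:
        baseline = np.median(expr, axis=1)       # pseudo-reference profile
    else:
        baseline = expr[:, baseline_col]
    target = np.median(baseline)                 # baseline's overall level
    factors = target / np.median(expr, axis=0)   # one scale factor per sample
    return expr * factors                        # broadcasts over columns
```

Swapping `baseline_col` between a reference sample, the controls' median, or `None` is exactly the kind of variation you would compare with MVA plots.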
For smaller data sets such as qPCR sets (e.g. real-time TaqMan or similar assays), one usually includes multiple housekeeping genes and then normalizes against the mean of those (or at least against the subset that shows relatively flat expression over all samples). As long as your housekeeping genes worked as intended (mainly, show little to no change in expression over all samples), normalizing to their mean will do better than a global normalization, at least in my experience with small (~50-200 genes) qPCR datasets. Small data sets are more likely to have large differences in expression over the limited set of genes, which will skew a global normalization.
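For concreteness, a minimal ΔCt-style sketch of that idea, assuming raw Ct values in a genes × samples array (the name and layout are illustrative, not from any particular toolkit):

```python
import numpy as np

def delta_ct(ct, hk_rows):
    """Subtract each sample's mean housekeeping Ct from every gene's Ct.

    ct:      genes x samples matrix of raw Ct values
    hk_rows: row indices of the housekeeping genes
    """
    ct = np.asarray(ct, dtype=float)
    hk_mean = ct[hk_rows].mean(axis=0)   # per-sample reference level
    return ct - hk_mean                  # delta-Ct; lower = higher expression
```

Restricting `hk_rows` to the housekeepers with flat expression curves is the screening step described above.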
You need to standardize to a reference gene so you can compare between samples. If you standardize to the average, then you still don't know whether the variation is real (biological) or due to extrinsic factors like pipetting error. Usually taking the geometric mean of several "housekeeping" genes is considered good enough. You could also measure the mRNA concentration of your samples using RiboGreen and use that value to standardize. I attached a good paper about standardizing with multiple reference genes.
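A minimal sketch of that geometric-mean step (a geNorm-style normalization factor), assuming relative quantities on a linear scale in a genes × samples array; since Ct is a log2 scale, this is equivalent to subtracting the arithmetic mean Ct of the reference genes. Names are illustrative:

```python
import numpy as np

def geo_mean_norm(rel_quant, ref_rows):
    """Divide each gene's relative quantity by the per-sample geometric
    mean of the reference genes.

    rel_quant: genes x samples matrix of relative quantities (linear scale)
    ref_rows:  row indices of the reference ("housekeeping") genes
    """
    rq = np.asarray(rel_quant, dtype=float)
    nf = np.exp(np.log(rq[ref_rows]).mean(axis=0))  # geometric mean per sample
    return rq / nf
```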
To my mind, this is equivalent to the general discussion of controls in experiments: internal standards or controls versus "external" ones (and the mean can be considered external standardization, because it rests on a comparison across experiments).
Both methods have advantages and disadvantages. The use of internal standard genes developed greatly with quantitative PCR, to decrease inter-experiment interference; but it requires all other potentially interfering factors to be kept constant, which is also difficult to achieve.
With any type of normalization, be aware of the potential introduction of bias. Recently it has emerged that c-Myc induces a general amplification of transcription, and that most studies using standard normalization fail to observe this effect and so obtain results that are simply wrong (see: http://www.sciencemag.org/content/338/6107/593.long).
There is a relatively new paper (Pereira et al. (2018) - Comparison of normalization methods for the analysis of metagenomic gene abundance data) on this topic. The authors compare a variety of normalization methods on a mock dataset and evaluate the TDR and FDR of differentially expressed genes. Probably worth a read.