No, sometimes the L1 and L2 can be referenced simultaneously, with the data being supplied by whichever cache answers first. This allows the search of the L2 cache to be done in parallel with that of the L1 cache, which decreases search time compared to performing the searches sequentially.
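As a back-of-the-envelope illustration of why the parallel search helps (a minimal sketch; the latencies and hit rate are made-up assumptions, not numbers from any particular CPU):

    /* Toy model of sequential vs. parallel L1/L2 lookup.
       Latencies and hit rate are illustrative assumptions only. */
    #include <stdio.h>

    int main(void) {
        double l1_lat = 4.0;    /* assumed L1 lookup latency, cycles */
        double l2_lat = 12.0;   /* assumed L2 lookup latency, cycles */
        double l1_hit = 0.95;   /* assumed L1 hit rate               */

        /* Sequential: the L2 search starts only after the L1 lookup misses. */
        double sequential = l1_lat + (1.0 - l1_hit) * l2_lat;

        /* Parallel: the L2 search starts at the same time as the L1 search,
           so an L1 miss pays only the L2 latency, not the sum of both.      */
        double parallel = l1_hit * l1_lat + (1.0 - l1_hit) * l2_lat;

        printf("sequential: %.2f cycles, parallel: %.2f cycles\n",
               sequential, parallel);
        return 0;
    }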
The L1 cache is the fastest, and the smallest. It has to be fast, and speed-of-light and other latency considerations require that it be within a certain distance of the registers. This limits the physical size of the L1 cache, and therefore its capacity in bytes (KBytes rather than MBytes).
The L1 cache is usually direct mapped (cache associativity of 1), so only one line of the tag cache needs to be searched. The L2 cache is usually at least 4-way associative, which means it takes longer to search. The slower L2 cache can therefore be further from the registers, and hence physically larger, and hence larger in MBytes.
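A minimal sketch of what the extra associativity costs at lookup time (the line size, set count, and field split are assumptions for illustration, not taken from a real design): the direct-mapped lookup compares exactly one tag, while the 4-way lookup has to compare four.

    /* Toy tag lookup for a direct-mapped vs. a 4-way set-associative cache.
       Valid bits and the data arrays are omitted; all sizes are assumed. */
    #include <stdint.h>
    #include <stdbool.h>

    #define LINE_BITS 6                      /* 64-byte lines (assumed)    */
    #define SET_BITS  9                      /* 512 sets (assumed)         */
    #define SETS      (1u << SET_BITS)
    #define WAYS      4                      /* associativity of the "L2"  */

    static uint32_t dm_tags[SETS];           /* direct-mapped: one tag/set */
    static uint32_t sa_tags[SETS][WAYS];     /* 4-way: four tags per set   */

    static bool lookup_direct_mapped(uint32_t addr) {
        uint32_t set = (addr >> LINE_BITS) & (SETS - 1);
        uint32_t tag = addr >> (LINE_BITS + SET_BITS);
        return dm_tags[set] == tag;          /* exactly one comparison     */
    }

    static bool lookup_4way(uint32_t addr) {
        uint32_t set = (addr >> LINE_BITS) & (SETS - 1);
        uint32_t tag = addr >> (LINE_BITS + SET_BITS);
        for (int way = 0; way < WAYS; way++) /* up to WAYS comparisons,    */
            if (sa_tags[set][way] == tag)    /* done in parallel in HW     */
                return true;
        return false;
    }

    int main(void) {
        uint32_t addr = 0x12345678u;         /* arbitrary example address  */
        /* Install the line, then look it up again (a fill followed by a hit). */
        uint32_t set = (addr >> LINE_BITS) & (SETS - 1);
        dm_tags[set] = addr >> (LINE_BITS + SET_BITS);
        return lookup_direct_mapped(addr) && !lookup_4way(addr) ? 0 : 1;
    }

In hardware the four comparisons happen at the same time, but they still cost extra comparators, a wider tag read, and a way-select multiplexer, which is where the additional latency and power go.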
There is no theoretical reason. The most compelling practical reason for this is cost. Maybe a minor practical reason is that it is hard to implement a fast search for a larger cache.
The thing is that fast memory is very expensive; otherwise we would build computers with enough registers to hold several gigabytes of data. Because the processor logic for billions of registers would be quite complex (and thus expensive), we use an L1 cache instead. This cache is still quite fast, but a larger amount of this kind of fast memory would again cost a lot, so the trade-off is again a small size.

In the early days of computing it took only about 2 CPU cycles to access RAM (which by today's standards was also really small). With every increase in CPU speed another level of cache is needed to compensate for the speed difference between CPU registers and RAM. Large, slow memory costs about as much as small, fast memory. This is always a trade-off between size, speed, and money; you cannot optimize for all three at the same time.

The main reason to have a cache is to hide the latency to the RAM (or to the L3 cache, or to the L2 cache). Sure, this only works for programs that exhibit memory locality. Otherwise you would spend roughly 200 CPU cycles doing no work at all, just waiting for data from the RAM.
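To make the locality point concrete, here is the classic example (the array size is an arbitrary assumption): both loops below do the same work, but the first walks memory sequentially and mostly hits in cache, while the second strides by a whole row on every step and tends to miss.

    /* Same arithmetic, different memory locality. On typical hardware the
       row-major loop runs noticeably faster because consecutive accesses
       land in the same cache line; this is a demo, not a benchmark.       */
    #include <stdio.h>

    #define N 2048
    static double a[N][N];

    int main(void) {
        double sum = 0.0;

        /* Good locality: the inner loop touches consecutive addresses. */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                sum += a[i][j];

        /* Poor locality: the inner loop strides by N * sizeof(double). */
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                sum += a[i][j];

        printf("%f\n", sum);
        return 0;
    }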
CPU caches are small pools of memory that store information the CPU is most likely to need next. Which information is loaded into cache depends on sophisticated algorithms and certain assumptions about programming code. The goal of the cache system is to ensure that the CPU has the next bit of data it will need already loaded into cache by the time it goes looking for it (this is called a cache hit).
A cache miss, on the other hand, means the CPU has to go scampering off to find the data elsewhere. This is where the L2 cache comes into play — while it’s slower, it’s also much larger. Some processors use an inclusive cache design (meaning data stored in the L1 cache is also duplicated in the L2 cache) while others are exclusive (meaning the two caches never share data). If data can’t be found in the L2 cache, the CPU continues down the chain to L3.
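A rough sketch of that fall-through behaviour (the latencies and the "contains" predicates are placeholder assumptions, not how real hardware decides anything):

    /* Toy model of a cache-lookup chain: try each level in turn and
       charge its latency; a miss falls through to the next level.
       Latencies are illustrative guesses, not measured values.      */
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdio.h>

    struct level {
        const char *name;
        int latency_cycles;                  /* assumed access latency */
        bool (*contains)(unsigned long addr);
    };

    /* Placeholder predicates standing in for real tag lookups. */
    static bool in_l1(unsigned long a)  { return (a & 0xFF) == 0; }
    static bool in_l2(unsigned long a)  { return (a & 0x0F) == 0; }
    static bool in_l3(unsigned long a)  { return (a & 0x03) == 0; }
    static bool in_ram(unsigned long a) { (void)a; return true;   }

    static const struct level hierarchy[] = {
        { "L1",  4,   in_l1  },
        { "L2",  12,  in_l2  },
        { "L3",  40,  in_l3  },
        { "RAM", 200, in_ram },
    };

    /* Returns the total cycles spent before the data is found. */
    static int access_cost(unsigned long addr) {
        int cycles = 0;
        for (size_t i = 0; i < sizeof hierarchy / sizeof hierarchy[0]; i++) {
            cycles += hierarchy[i].latency_cycles;
            if (hierarchy[i].contains(addr))
                return cycles;               /* hit at this level */
        }
        return cycles;                       /* RAM always "hits" */
    }

    int main(void) {
        printf("cost: %d cycles\n", access_cost(0x1234));
        return 0;
    }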
Generally L1, as it's the first level between the processor and the slow memory subsystems. There is no rule for how many levels of cache there need be, and they are not all implemented in the same fashion, yielding different performance from one architecture to another, so you might as well look at overall performance of the processor in the specific tasks you want to target, instead of basing your decisions on cache alone.
As multicore processors became more common, L3 cache started appearing more frequently on consumer hardware. These new chips used L3 as more than just a larger, slower backstop for L2. In addition to this function, the L3 cache is often shared between all of the cores on a single piece of silicon. That’s in contrast to the L1 and L2 caches, both of which tend to be private and dedicated to the needs of each particular core.
To clarify the exact difference, you should familiarize yourself with the cache control protocols. A good start is to read the book Computer Architecture ....
Just go one level below the L1 cache and you will find the answer. Why is the number of registers smaller than the size of the L1 cache? Think about it that way and you will get the answer.
Coyle's initial answer is close, but some of the details are a bit off. He is correct that the primary answer is size, though cache latency is determined more by electrical loading issues than by maximum signal propagation speed. A great way to learn about this is to read the technical reports describing the CACTI cache modeling software -- starting with the report for the initial version and moving forward chronologically. These are available at http://www.hpl.hp.com/research/cacti/
Recent Intel processors demonstrate that there is no need for the L1 associativity to be smaller than L2 associativity. Both are 8-way associative in the last 3 generations of Intel processors (Nehalem/Westmere, Sandy Bridge/Ivy Bridge, and Haswell/Broadwell), with 32 KiB L1 Data Caches and 256 KiB L2 Caches used by all three generations.
There is definitely a cost for associativity -- you can query the tags for the "ways" in parallel (using more power) or query them sequentially (resulting in higher latency). (Numerous other options are available, including dynamic prediction of the "way(s)" to be queried first, but these two approaches describe the bounding physics.) The same principle applies to the "cost" of multiple levels of cache -- you can query the tags of the various levels in parallel (using more power) or query them sequentially (resulting in higher latency). It is also possible to *start* the query of larger caches in parallel with the query of the smaller caches, then *cancel* the query of the larger cache if you get a hit in the smaller cache. This was done for the L3 cache in the AMD Family 10h processors, for example. The CACTI technical reports describe similar partial overlap of cache tag and cache data accesses -- again trading increased power consumption for decreased latency.
So the "theoretical" reason is physics -- bigger is slower, primarily due to the need to drive more address lines across wider arrays.
The "practical" reason is also physics, but involves more subtle trade-offs between performance and power consumption that (typically) lead to different decisions at each level of the cache hierarchy. For example, an approach that decreases the L1 Data cache latency from 5 cycles to 4 cycles will likely have a large enough performance benefit to justify a modest increase in area and power consumption, but applying that technique to reduce L2 latency from 14 cycles to 13 cycles will have a very small performance benefit, and so can only be used if the area and power increases are very small.
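A back-of-the-envelope AMAT (average memory access time) calculation makes that asymmetry concrete; the hit rates and latencies below are assumptions chosen only to illustrate the point, not measurements of any real processor.

    /* AMAT = L1_time + L1_miss_rate * (L2_time + L2_miss_rate * mem_time)
       All numbers are illustrative assumptions.                           */
    #include <stdio.h>

    static double amat(double l1, double l2, double mem,
                       double l1_miss, double l2_miss) {
        return l1 + l1_miss * (l2 + l2_miss * mem);
    }

    int main(void) {
        double l1_miss = 0.05, l2_miss = 0.20, mem = 200.0;

        double base      = amat(5.0, 14.0, mem, l1_miss, l2_miss);
        double faster_l1 = amat(4.0, 14.0, mem, l1_miss, l2_miss);
        double faster_l2 = amat(5.0, 13.0, mem, l1_miss, l2_miss);

        printf("baseline    : %.2f cycles\n", base);
        printf("L1 5 -> 4   : %.2f cycles (saves %.2f)\n",
               faster_l1, base - faster_l1);
        printf("L2 14 -> 13 : %.2f cycles (saves %.2f)\n",
               faster_l2, base - faster_l2);
        return 0;
    }

Under these assumed rates, the L1 saving is paid back on every single access, while the L2 saving only helps the small fraction of accesses that miss L1, which is why the second change buys far less performance for its area and power.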