No; sometimes the L1 and L2 can be referenced simultaneously, with the answer supplied by whichever level responds first. This allows the search of the L2 cache to be done in parallel with that of the L1 cache, which decreases the total access time compared to performing the searches sequentially.
The L1 cache is the fastest and smallest. It has to be fast, and speed-of-light and other latency considerations require that it be within a certain distance of the registers. This limits the physical size of the L1 cache, and therefore its capacity (KBytes rather than MBytes).
The L1 cache is usually direct mapped (cache associativity of 1), so only one line of the tag store needs to be searched. The L2 is usually at least 4-way set associative, which means it takes longer to search. The slower L2 cache can therefore be further from the registers, and hence physically larger, and hence larger in MBytes.
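To make the mapping concrete, here is a small sketch of how an address is split into a set index and a tag for caches of different associativity. The sizes (32 KiB capacity, 64-byte lines) are illustrative assumptions, not taken from any particular processor:

```python
LINE_SIZE = 64  # bytes per cache line (assumed)

def cache_lookup_sets(address, cache_size, associativity):
    """Return (set_index, tag) for an address in a set-associative cache.

    With associativity 1 (direct mapped), each set holds one line, so a
    lookup compares exactly one tag. With 4-way associativity there are
    a quarter as many sets, and a lookup must compare 4 tags per set.
    """
    num_lines = cache_size // LINE_SIZE
    num_sets = num_lines // associativity
    line_number = address // LINE_SIZE
    set_index = line_number % num_sets
    tag = line_number // num_sets
    return set_index, tag

# Same address, two organizations:
addr = 1165 * LINE_SIZE
direct_mapped = cache_lookup_sets(addr, 32 * 1024, 1)  # 512 sets, 1 tag compare
four_way = cache_lookup_sets(addr, 32 * 1024, 4)       # 128 sets, 4 tag compares
```

The point of the sketch: the direct-mapped lookup resolves after a single tag comparison, while the 4-way lookup must compare all four tags in the selected set, which is part of why higher associativity tends to cost latency.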
There is no theoretical reason. The most compelling practical reason is cost. A minor practical reason is that it is hard to implement a fast search for a larger cache.
The thing is that fast memory is very expensive. Otherwise we would build computers with a number of registers equivalent to several gigabytes of data. Because the processor logic for billions of registers would be quite complex (and thus expensive), we use an L1 cache instead. This cache is still quite fast, but a larger amount of this kind of fast memory would again cost a lot, so the trade-off is again a small size.

In the early days of computing it took only about 2 CPU cycles to access RAM (which by today's standards was also really small). With every increase in CPU speed, another level of cache is needed to compensate for the speed difference between CPU registers and RAM. A large, slow memory costs about as much as a small, fast one. This is always a trade-off between size, speed, and money; you cannot optimize for all three at the same time.

The main reason to have a cache is to hide the latency to RAM (or to the L3 cache, or to the L2 cache). Sure, this only works for programs that exhibit memory locality. Otherwise, you would spend roughly 200 CPU cycles doing no work at all, just waiting for data from RAM.
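The latency-hiding argument above can be made quantitative with the standard average-memory-access-time (AMAT) formula. The latencies and miss rates below are illustrative assumptions (the ~200-cycle RAM figure comes from the answer above; the rest are made up for the example):

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time: every access pays the hit time,
    and a miss_rate fraction of accesses also pay the miss penalty."""
    return hit_time + miss_rate * miss_penalty

# Assumed latencies in CPU cycles (illustrative, not measured):
L1_LATENCY = 4
L2_LATENCY = 12
RAM_LATENCY = 200

# The levels nest: an L1 miss pays the L2's average time,
# and an L2 miss pays the full trip to RAM.
l2_amat = amat(L2_LATENCY, 0.20, RAM_LATENCY)  # 12 + 0.2 * 200 = 52.0
l1_amat = amat(L1_LATENCY, 0.05, l2_amat)      # 4 + 0.05 * 52 = 6.6
```

Under these assumed numbers, the hierarchy turns a 200-cycle RAM round trip into an average of under 7 cycles per access, which is exactly the latency hiding the answer describes; remove the locality (raise the miss rates) and the average collapses back toward 200.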
CPU caches are small pools of memory that store information the CPU is most likely to need next. Which information is loaded into cache depends on sophisticated algorithms and certain assumptions about programming code. The goal of the cache system is to ensure that the CPU has the next bit of data it will need already loaded into cache by the time it goes looking for it (also called a cache hit).
A cache miss, on the other hand, means the CPU has to go scampering off to find the data elsewhere. This is where the L2 cache comes into play — while it’s slower, it’s also much larger. Some processors use an inclusive cache design (meaning data stored in the L1 cache is also duplicated in the L2 cache) while others are exclusive (meaning the two caches never share data). If data can’t be found in the L2 cache, the CPU continues down the chain to L3.
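The hit/miss walk down the hierarchy can be sketched as a tiny two-level simulation. This is a toy model under stated assumptions: LRU replacement, an inclusive L1/L2 design (as described above), and line addresses standing in for real data; the `Cache` class and its capacities are hypothetical:

```python
from collections import OrderedDict

class Cache:
    """Tiny LRU cache level; tracks which line addresses are resident."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()

    def lookup(self, line):
        if line in self.lines:
            self.lines.move_to_end(line)  # mark as most recently used
            return True
        return False

    def fill(self, line):
        self.lines[line] = True
        self.lines.move_to_end(line)
        if len(self.lines) > self.capacity:
            self.lines.popitem(last=False)  # evict least recently used line

def access(l1, l2, line):
    """Walk the hierarchy; on this inclusive design, a miss fills both levels."""
    if l1.lookup(line):
        return "L1 hit"
    if l2.lookup(line):
        l1.fill(line)   # promote the line back into L1
        return "L2 hit"
    l2.fill(line)       # inclusive: the line lands in L2 as well as L1
    l1.fill(line)
    return "miss"
```

For example, with `l1 = Cache(2)` and `l2 = Cache(4)`, touching lines 0, 1, 2 in order evicts line 0 from the small L1, but a later access to line 0 still finds it in the larger L2: the slower level acting as the backstop described above.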
Generally L1, as it's the first level between the processor and the slower memory subsystems. There is no rule for how many levels of cache there need to be, and they are not all implemented in the same fashion, yielding different performance from one architecture to another. So you might as well look at the overall performance of the processor on the specific tasks you want to target, instead of basing your decision on cache alone.
As multicore processors became more common, L3 cache started appearing more frequently on consumer hardware. These new chips used L3 as more than just a larger, slower backstop for L2. In addition to this function, the L3 cache is often shared between all of the cores on a single piece of silicon. That’s in contrast to the L1 and L2 caches, both of which tend to be private and dedicated to the needs of each particular core.
To clarify the exact difference, you should familiarize yourself with the cache control protocols. A good start is to read the book Computer Architecture ....
Just look one level below the L1 cache and you can find the answer. Why is the number of registers smaller than the L1 cache? Think about it that way and you will get the answer.