Category Archives: technical

How Did Researchers Predict the End of the Road for Conventional Microarchitectures 14 Years Ago?

Over the past few decades, we have witnessed drastic changes in microprocessors. If I had to summarize, there are two milestone transitions: from single core to multicore, and from multicore toward more specialized accelerators. While it is clear that so-called dark silicon is the reason we are gradually shifting from the multicore era to the specialization regime, it is not as clean and easy to explain why we moved from single core to multicore. Continue reading

A Framework For Understanding Cache Coherency


The correctness of a shared-memory processor means that each processor (or, from a programmer’s perspective, each process/thread) has a coherent, consistent view of memory: all of them agree on the value of any particular memory location, so at no point can there be two “correct” values for the same address. Whatever a cache coherency protocol does internally, it has to guarantee a consistent memory view across all the processors and memory. Continue reading
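The “no two correct values” requirement is often stated as the single-writer/multiple-reader (SWMR) invariant. A minimal sketch of checking that invariant, using illustrative MSI-style state names of my own choosing (not any specific protocol’s implementation):

```python
# Hypothetical sketch: the single-writer / multiple-reader (SWMR) invariant
# that any cache coherency protocol must uphold for each memory block.

def swmr_holds(copies):
    """copies: list of (state, value) pairs, one per cache.
    States: 'M' (modified, writable), 'S' (shared, read-only), 'I' (invalid)."""
    modified = [v for s, v in copies if s == 'M']
    shared = [v for s, v in copies if s == 'S']
    # At most one writable copy, and never alongside readable copies.
    if len(modified) > 1 or (modified and shared):
        return False
    # All readable copies must agree on a single value.
    valid_values = {v for s, v in copies if s != 'I'}
    return len(valid_values) <= 1

# One writer, everyone else invalidated: coherent.
assert swmr_holds([('M', 42), ('I', 7), ('I', 0)])
# Two caches holding different "correct" values of one address: incoherent.
assert not swmr_holds([('S', 42), ('S', 7)])
```

Any protocol that preserves this invariant at every step gives each processor a consistent memory view; the protocols differ only in how they maintain it.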

What is the Optimal Logic Depth Per Pipeline Stage?

Two factors determine performance: cycles per second (clock rate, or frequency) and instructions per cycle (IPC).

There are two ways to increase frequency: 1) faster transistors through process generations, and 2) less logic in each pipeline stage (measured in the number of FO4 delays).

This seminal paper assumes the context of a single process generation, that is, the FO4 delay stays constant. Continue reading
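The trade-off behind the question can be seen with a toy model (my own simplification, not the paper’s equations): the clock period, in FO4 delays, is the useful logic per stage plus a fixed latch/clock-skew overhead per stage, so frequency gains diminish as pipelines deepen while IPC falls due to hazards. The constants below are illustrative assumptions:

```python
# Toy model: deeper pipelining shrinks the logic per stage, but a fixed
# per-stage overhead (latch delay + clock skew) is paid every stage.

def clock_period_fo4(stages, total_logic=180, overhead=3):
    """Clock period in FO4 delays for a given pipeline depth.
    total_logic and overhead are assumed, illustrative values."""
    return total_logic / stages + overhead

def relative_frequency(stages, **kw):
    """Frequency is the reciprocal of the clock period."""
    return 1.0 / clock_period_fo4(stages, **kw)

for stages in (5, 10, 20, 40):
    print(f"{stages:2d} stages: period = {clock_period_fo4(stages):5.1f} FO4")
```

Doubling the depth from 20 to 40 stages only cuts the period from 12 to 7.5 FO4 here, because the overhead term dominates; combined with the IPC loss of a deeper pipeline, this is why an optimal logic depth per stage exists at all.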

Token Coherence: Decoupling Performance and Correctness

I really like the idea of Token Coherence. It makes so much sense to me, especially when I think about it using my framework for understanding cache coherency.

Workload trends suggest that we want to enable cache-to-cache transfers to minimize latency, so we need a direct communication scheme. However, due to the distributed nature of direct communication, it is hard to enforce order; traditionally, we can only enforce a total order (the strongest assumption) using an ordered interconnect (e.g., a bus). For example, destination-set prediction uses the direct communication scheme but has to rely on an ordered network, which limits performance. Meanwhile, technology trends suggest that we want to use unordered interconnects. Continue reading
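Token Coherence resolves this tension by decoupling safety from ordering: safety comes from counting tokens, not from an ordered interconnect. A minimal sketch of the counting rule (names and structure are mine; the real protocol also has an owner token and persistent requests for forward progress, omitted here):

```python
# Illustrative sketch of Token Coherence's safety rule: each block has a
# fixed number of tokens; a cache may read with at least one token and
# may write only with all of them. Token conservation alone guarantees
# no reader ever coexists with a writer, on any (even unordered) network.

TOTAL_TOKENS = 4  # assumed number of tokens for one block

class Cache:
    def __init__(self):
        self.tokens = 0

    def can_read(self):
        return self.tokens >= 1            # at least one token => may read

    def can_write(self):
        return self.tokens == TOTAL_TOKENS  # all tokens => may write

def send_tokens(src, dst, n):
    """Tokens move between caches in messages; order of arrival is irrelevant."""
    assert src.tokens >= n
    src.tokens -= n
    dst.tokens += n

a, b = Cache(), Cache()
a.tokens = TOTAL_TOKENS
assert a.can_write() and not b.can_read()
send_tokens(a, b, 1)       # hand one token over: both may now read...
assert a.can_read() and b.can_read()
assert not a.can_write()   # ...but no one may write
```

Because correctness rests only on conservation of tokens, performance policies (who to ask, when to broadcast) can be tuned freely without re-proving safety.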

Predicting the Communication Scheme to Improve the Bandwidth/Latency Trade-off in Cache Coherency

The destination-set prediction (DSP) paper focuses on how to choose the communication scheme in a cache coherency implementation. The communication scheme is the middle layer in my framework for understanding cache coherency.

Direct communication, i.e., broadcast, achieves low latency but requires high bandwidth. This is what snoopy implementations do.

Indirect communication, i.e., through a centralized point, requires low bandwidth but incurs high latency. This is what directory-based implementations do. Continue reading
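The trade-off can be made concrete with a back-of-the-envelope model (my own toy accounting, not the paper’s) of a single cache miss in an N-processor system:

```python
# Toy cost model: message count stands in for bandwidth demand,
# hops on the critical path stand in for latency.

def broadcast_cost(n):
    """Direct/snoopy scheme: ask every other node at once."""
    messages = n - 1           # request fans out to all other nodes
    critical_path_hops = 2     # request out, data back
    return messages, critical_path_hops

def directory_cost(n):
    """Indirect/directory scheme: ask the directory, which forwards to the owner."""
    messages = 3               # requester -> directory -> owner -> requester
    critical_path_hops = 3     # the indirection adds a hop of latency
    return messages, critical_path_hops

for n in (4, 16, 64):
    bm, bh = broadcast_cost(n)
    dm, dh = directory_cost(n)
    print(f"N={n:2d}: broadcast {bm:2d} msgs / {bh} hops, "
          f"directory {dm} msgs / {dh} hops")
```

Broadcast bandwidth grows with N while its latency stays flat; the directory’s bandwidth is constant but every miss pays the extra hop. Predicting the destination set is an attempt to get the low latency of the former at closer to the bandwidth of the latter.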