Over the past few decades, we have witnessed drastic changes in microprocessors. If I had to summarize, there are two milestone transitions: from single core to multicore, and from multicore toward more specialized accelerators. While it is clear that so-called dark silicon is the reason we are gradually shifting from the multicore era to the specialization regime, it is not as clean or easy to explain why we moved from single core to multicore. Continue reading
There are three critical issues in ACMP scheduling: the scheduling unit, the scheduling objective, and the scheduling heuristic.
The entire scheduling problem is essentially to use certain scheduling heuristics to decide the best ACMP execution configuration for each scheduling unit such that the scheduling objective is achieved. Continue reading
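To make the decomposition concrete, here is a minimal sketch (all names, metrics, and the threshold below are illustrative assumptions, not taken from any particular paper) that separates the three issues: a scheduling unit type, a plug-in heuristic, and an objective used to evaluate the resulting assignment.

```python
# Toy decomposition of ACMP scheduling into its three issues:
#   scheduling unit      -> SchedulingUnit (here, a program phase with metrics)
#   scheduling heuristic -> heuristic(), maps a unit to a core type
#   scheduling objective -> objective(), scores the overall assignment
# Everything here (names, metrics, the 1.5x threshold) is illustrative.
from dataclasses import dataclass

@dataclass
class SchedulingUnit:
    name: str
    ipc_little: float  # estimated IPC on the little core
    ipc_big: float     # estimated IPC on the big core

def heuristic(unit: SchedulingUnit) -> str:
    """Send a unit to the big core only if its estimated speedup is large."""
    return "big" if unit.ipc_big / unit.ipc_little > 1.5 else "little"

def objective(assignment) -> float:
    """Scheduling objective: total achieved IPC across all units."""
    return sum(u.ipc_big if core == "big" else u.ipc_little
               for u, core in assignment)

units = [SchedulingUnit("phase0", ipc_little=0.6, ipc_big=0.8),
         SchedulingUnit("phase1", ipc_little=0.9, ipc_big=2.1)]
assignment = [(u, heuristic(u)) for u in units]
print(assignment, objective(assignment))
```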
Correctness for a shared-memory processor means that each processor, or each process/thread from a programmer's perspective, has a coherent/consistent view of memory: all of them agree on the value of a particular memory location, so it is not possible for two different "correct" values of the same address to exist at any given point. Whatever the cache coherence protocol does, it has to guarantee a consistent memory view across all processors and memory. Continue reading
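As a rough illustration of that invariant (a toy checker, not any real coherence protocol), one can think of each address as having a single current value at any point in time, which every processor's read must observe:

```python
# Toy model of the coherence invariant described above: at any point there is
# exactly one "correct" value per address, and every processor's read of that
# address must return it. Purely illustrative; not a protocol implementation.
class CoherenceChecker:
    def __init__(self):
        self.current = {}  # address -> last written value

    def write(self, proc, addr, value):
        # A write installs the new single "correct" value for this address.
        self.current[addr] = value

    def read(self, proc, addr, observed):
        # Any processor reading the address must see that value.
        expected = self.current.get(addr)
        assert observed == expected, (
            f"P{proc} observed {observed} at {addr:#x}, expected {expected}")

chk = CoherenceChecker()
chk.write(proc=0, addr=0x40, value=1)
chk.read(proc=1, addr=0x40, observed=1)   # fine: P1 sees the latest write
# chk.read(proc=2, addr=0x40, observed=0) # would trip the assertion
```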
There are two things that affect performance: cycles per second (clock rate, or frequency), and instructions per cycle (IPC).
There are two ways to increase frequency: 1) faster transistors through process generations, and 2) less logic in each pipeline stage (measured in the number of FO4 delays per stage).
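As a quick sanity check on how these factors combine (this is the standard textbook relation, not something specific to the post), instruction throughput is the product of IPC and frequency, and frequency is bounded by the logic depth of a pipeline stage:

```latex
\text{Performance (instructions/sec)} = \text{IPC} \times f,
\qquad f = \frac{1}{t_{\text{cycle}}},
\qquad t_{\text{cycle}} \approx (\#\text{FO4 per stage}) \times t_{\text{FO4}}
```

Process scaling shrinks the FO4 delay itself (faster transistors), while putting less logic in each stage reduces the number of FO4 delays per stage; both raise f.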
Before the PIE paper, ACMP scheduling used memory (or compute) intensity as the scheduling heuristic. The assumption is that memory intensity implies a high exploitable MLP ratio, and high compute intensity implies a high exploitable ILP ratio. If both ratios are low, scheduling to the little core is a good idea. Continue reading
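A minimal sketch of such a pre-PIE, intensity-based heuristic might look like the following (the metric names and thresholds are made up purely for illustration and follow the assumption stated above, not the PIE model itself):

```python
# Illustrative pre-PIE style heuristic: classify a workload by memory and
# compute intensity and pick a core type. Thresholds are arbitrary examples.
def intensity_heuristic(mpki: float, ipc: float,
                        mem_thresh: float = 10.0,
                        ipc_thresh: float = 1.0) -> str:
    memory_intensive = mpki >= mem_thresh    # proxy for exploitable MLP
    compute_intensive = ipc >= ipc_thresh    # proxy for exploitable ILP
    if memory_intensive or compute_intensive:
        # Assumed to benefit from the big core's MLP/ILP resources.
        return "big"
    # Neither much MLP nor much ILP to exploit: the little core suffices.
    return "little"

print(intensity_heuristic(mpki=20.0, ipc=0.5))  # memory-intensive -> "big"
print(intensity_heuristic(mpki=1.0,  ipc=0.4))  # neither -> "little"
```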
The workload trend suggests that we want to enable cache-to-cache transfers to minimize latency, so we need to use a direct communication scheme. However, due to the distributed nature of direct communication, it is hard to enforce order; traditionally we could only enforce a total order (the strongest assumption) using an ordered interconnect (e.g., a bus). For example, destination-set prediction uses the direct communication scheme but has to use an ordered network, which limits performance. The technology trend suggests that we want to use unordered interconnects. Continue reading
Direct communication, i.e., broadcast, achieves low latency but requires high bandwidth because every request is broadcast. This is what a snoopy implementation does.
Indirect communication, i.e., centralized, requires low bandwidth but incurs high latency. This is what a directory-based implementation does. Continue reading
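To make the bandwidth/latency trade-off concrete, here is a back-of-the-envelope sketch (a toy model with made-up node counts and hop latencies, not a protocol implementation) counting the messages and hops a single read miss costs under each scheme, assuming the data is held dirty in exactly one other cache:

```python
# Toy comparison of the two schemes for one read miss to a dirty remote line.
# Node count and per-hop latency are illustrative assumptions.
N_CORES = 16
HOP_LATENCY = 20  # cycles per network traversal (illustrative)

def snoopy_read_miss():
    # Direct/broadcast: the request goes to everyone, the owner replies.
    messages = (N_CORES - 1) + 1        # broadcast request + one data reply
    latency = 2 * HOP_LATENCY           # requestor -> owner -> requestor
    return messages, latency

def directory_read_miss():
    # Indirect/centralized: request to directory, forward to owner, data back.
    messages = 3                        # request + forward + data reply
    latency = 3 * HOP_LATENCY           # three hops through the directory
    return messages, latency

for name, fn in [("snoopy", snoopy_read_miss), ("directory", directory_read_miss)]:
    msgs, lat = fn()
    print(f"{name:9s}: {msgs:2d} messages, {lat:3d} cycles")
```

The broadcast scheme spends more messages (bandwidth) to save a hop of latency, while the directory scheme spends an extra hop of latency to keep the message count low, which is exactly the trade-off described above.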