Driven by the demand for higher performance and energy efficiency, novel technologies including non-volatile memory (e.g., spin-transfer torque RAM (STT-RAM)), 3D integration, and near-threshold voltage computing (NTC) are increasingly being deployed in state-of-the-art throughput processors, i.e., general-purpose graphics processing units (GPGPUs). Because these technologies were not designed for dependable computing, the reliability challenges that have long been a crucial issue in conventional throughput architecture design become the major obstacle to integrating them into next-generation throughput processors. The paramount reliability challenges in throughput processors are soft errors induced by particle strikes, hard errors driven by aging effects, and manufacturing process variations. The novel technologies have both positive and negative impacts on these three reliability issues. For example, STT-RAM has a positive impact on the soft-error rate (SER) because it is generally regarded as immune to soft errors; in an NTC environment, however, the same amount of process variation has a much larger negative impact on circuit delay. Furthermore, 3D integration causes thermal problems in layers far from the heat sink and thus increases the rate of aging-induced errors.
To develop robust, high-performance, and energy-efficient throughput processors, it is critical to harness the benefits of these novel technologies and overcome their reliability shortcomings. However, their reliability implications have been largely ignored in prior studies of fault-detection and fault-tolerance techniques for throughput processors. There is therefore a pressing need to investigate innovative techniques that exploit the unique features of throughput processors to characterize and improve the reliability of next-generation throughput architectures that employ these new technologies.
Intellectual Merits: The proposed CAREER research will construct new foundations for vulnerability characterization and prediction, error detection, and fault tolerance against the dominant reliability challenges in throughput processors integrated with novel technologies. The project has four research goals: (1) modeling and analyzing the vulnerability of throughput processors enabled by novel technologies (e.g., STT-RAM, NTC, and 3D integration) in the presence of soft errors, aging effects, and process variations; (2) developing a fast and accurate predictive model to forecast the vulnerability phase behavior of throughput processors under the new technologies; (3) leveraging the unique characteristics of throughput processors to develop lightweight error detection mechanisms for the NTC environment; and (4) exploring the opportunities and challenges that the novel technologies introduce for cost-effectively tolerating various types of errors in next-generation throughput architecture design.
Broader Impact: If successful, the proposed research will significantly advance the ability to architect reliable throughput processors in future technologies beyond CMOS, making it possible to sustain Moore’s Law without suffering the negative effects of various fault mechanisms. The research will thereby help throughput architecture design make the most effective use of advanced process technologies to meet the increasing demand for high-performance, power-efficient, and reliable computing, and hence benefit numerous real-life applications. Moreover, the techniques to be developed will strengthen the case for applying throughput processors across a wide range of computing scales, from mobile computing to cloud computing, which could then gain speedups of hundreds of times from parallel computing, and will increase the use of throughput processors to support supercomputing in science and engineering (e.g., finance, medicine, biology, petroleum, aerospace, and geology). The integrated education and outreach plan will help the PIs educate a broad spectrum of students. This includes engaging high-school students and undergraduates from minority-serving institutions in research, expanding the computer engineering curriculum with GPGPU reliability modeling and optimization techniques, attracting women and under-represented groups to graduate education, disseminating research infrastructure for education and training, and collaborating with the GPU R&D industry.
Key Words: state-of-the-art throughput processors; GPGPUs; reliability; novel technologies