Updated on 2025/02/28

SASAKI HIROSHI
 
Organization
School of Engineering
Title
Associate Professor
Contact information
Email address
External link

Degree

  • Doctorate ( 2008.3   The University of Tokyo )

Research Areas

  • Informatics / Computer system  / Computer Architecture

  • Informatics / Information security  / Computer Security

Education

Research History

  • Tokyo Institute of Technology   Associate Professor

    2020.4 -

    Country:Japan

  • Department of Computer Science, Columbia University   Associate Research Scientist

    2016.4 - 2020.3

    Country:United States

  • Department of Computer Science, Columbia University   Visiting Research Scientist

    2014.4 - 2016.3

    Country:United States

  • IBM T. J. Watson Research Center   Visiting Research Scientist

    2013.7 - 2014.3

    Country:United States

  • Kyushu University   Research Associate Professor

    2011.8 - 2014.3

    Country:Japan

  • The University of Tokyo   Research Assistant Professor

    2010.4 - 2011.7

    Country:Japan

  • The University of Tokyo   Research Assistant Professor

    2008.4 - 2010.3

    Country:Japan


Papers

  • RAPLET: Demystifying Publish/Subscribe Latency for ROS Applications Reviewed

    Keisuke Nishimura, Takahiro Ishikawa, Hiroshi Sasaki, Shinpei Kato

    In Proceedings of the IEEE 27th International Conference on Embedded and Real-Time Computing Systems and Applications (RTCSA)   41 - 50   2021


    Publishing type:Research paper (international conference proceedings)   Publisher:IEEE  

    DOI: 10.1109/rtcsa52859.2021.00013


  • Practical Byte-Granular Memory Blacklisting using Califorms. Reviewed

    Hiroshi Sasaki, Miguel A. Arroyo, M. Tarek Ibn Ziad, Koustubha Bhat, Kanad Sinha, Simha Sethumadhavan

    In Proceedings of the 52nd IEEE/ACM International Symposium on Microarchitecture (MICRO)   558 - 571   2019


    Publishing type:Research paper (international conference proceedings)  

    DOI: 10.1145/3352460.3358299


  • Why Do Programs Have Heavy Tails? Reviewed

    Hiroshi Sasaki, Fang-Hsiang Su, Teruo Tanimoto, Simha Sethumadhavan

    In Proceedings of the 2017 IEEE International Symposium on Workload Characterization (IISWC)   135 - 145   2017


    Publishing type:Research paper (international conference proceedings)  

    DOI: 10.1109/IISWC.2017.8167771


  • Characterization and Mitigation of Power Contention across Multiprogrammed Workloads Reviewed

    Hiroshi Sasaki, Alper Buyuktosunoglu, Augusto Vega, Pradip Bose

    In Proceedings of the 2016 IEEE International Symposium on Workload Characterization (IISWC)   55 - 64   2016


    Language:English   Publishing type:Research paper (international conference proceedings)   Publisher:IEEE  

    Shared resource contention has been a major performance issue for CMPs. In this paper we focus on power, which is one of the most valuable shared resources of CMPs. We believe it is important to study power contention, especially with the prevalence of power capping features among modern commercial microprocessors. When multiple processes compete for power in such systems, the power management system attempts to mitigate the contention (i.e., reduce the power consumption) by slowing down the processor, which results in degraded total system performance. We characterize this phenomenon using a real testbed with an Intel processor with power capping capability realized by the RAPL technology. We observe noticeable performance degradation for SPEC CPU2006, especially at tighter power caps. In order to solve this problem, we develop a shared resource-aware scheduling algorithm that improves system performance by mitigating the contention for power and the shared memory subsystem at the same time. Evaluation results across a variety of multiprogrammed workloads show performance improvements over a state-of-the-art scheduling policy which only considers memory subsystem contention. In addition, we present a guard mechanism implemented on top of the proposed scheduler that greatly improves performance when there is severe power contention that introduces performance anomalies.

    DOI: 10.1109/IISWC.2016.7581266

    Web of Science


  • Power and Performance Characterization and Modeling of GPU-Accelerated Systems. Reviewed

    Yuki Abe, Hiroshi Sasaki, Shinpei Kato, Koji Inoue, Masato Edahiro, Martin Peres

    In Proceedings of the 28th IEEE International Parallel and Distributed Processing Symposium (IPDPS)   113 - 122   2014


    Publishing type:Research paper (international conference proceedings)  

    DOI: 10.1109/IPDPS.2014.23


  • McRouter: Multicast within a Router for High Performance Network-on-Chips Reviewed

    Yuan He, Hiroshi Sasaki, Shinobu Miwa, Hiroshi Nakamura

    In Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques (PACT)   319 - 329   2013


    Language:English   Publishing type:Research paper (international conference proceedings)   Publisher:IEEE  

    The inevitable advent of the multi-core era has driven an increasing demand for low latency on-chip interconnection networks (or NoCs). Being a critical part of the memory hierarchy for modern chip multi-processors (CMPs), these networks face stringent design constraints to provide fast communication with tight power budget. Modern NoC's first-order concern is clearly its latency, while we also find that internal bandwidth of its routers is relatively plentiful; thus, we present a low latency router design utilizing a technique we call "multicast within a router" or McRouter, which allows productive utilization of remaining bandwidth inside a NoC router. McRouter allows a single cycle transfer of flits which shortens the communication latency when there is enough remaining bandwidth within the router. The key idea is to transmit a header flit to all possible output ports (multicast) so that it is always transmitted to the correct output port without relying on route computation. In addition, we find it is affordable with marginal power overhead while still being a stand-alone design by maintaining portability and modularity (unlike look-ahead routing based designs). Our evaluation with application traffic shows that McRouter helps achieving system speed-ups of 1.28, 1.17 and 1.05 over the conventional router (CR), the VSA router (VSAR) and the prediction router (PR), respectively.

    DOI: 10.1109/PACT.2013.6618828

    Web of Science


    Other Link: http://doi.ieeecomputersociety.org/10.1109/PACT.2013.6618828

  • Coordinated Power-Performance Optimization in Manycores. Reviewed

    Hiroshi Sasaki, Satoshi Imamura, Koji Inoue

    In Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques (PACT)   51 - 61   2013


    Publishing type:Research paper (international conference proceedings)  

    DOI: 10.1109/PACT.2013.6618803


    Other Link: http://doi.ieeecomputersociety.org/10.1109/PACT.2013.6618803

  • Scalability-Based Manycore Partitioning Reviewed

    Hiroshi Sasaki, Teruo Tanimoto, Koji Inoue, Hiroshi Nakamura

    In Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques (PACT)   107 - 116   2012


    Language:English   Publishing type:Research paper (international conference proceedings)   Publisher:IEEE  

    Multicore processors have been popular for years, and the industry is gradually shifting towards the era of manycore processors. Single-thread performance of microprocessors is no longer growing at its historical rate, but the many active processes in a computer system and the continuing development of multi-threaded applications benefit from the growing core counts to sustain system throughput. This trend brings a situation where a number of parallel applications are simultaneously executed on a single system. Since multi-threaded applications try to maximize their throughput by utilizing the whole system, each of them usually creates an equal or larger number of threads compared to the underlying logical core count. This introduces a much greater number of threads to be co-scheduled in the entire system. However, each program has different characteristics (or scalability) and contends with the others for shared resources, namely the CPU cores and memory hierarchies. Therefore, it is clear that OS thread scheduling will play a major role in achieving high system performance under such conditions. We develop a sophisticated scheduler that (1) dynamically predicts the scalability of programs via hardware performance monitoring units, (2) decides the optimal number of cores to be allocated to each program, and (3) allocates the cores to programs while maximizing system utilization to achieve fair and maximum performance. The evaluation results on a 48-core AMD Opteron system show improvements over the Linux scheduler for a variety of multiprogramming workloads.

    DOI: 10.1145/2370816.2370833

    Web of Science


  • Practical Byte-Granular Memory Blacklisting using Califorms. Reviewed

    Hiroshi Sasaki, Miguel A. Arroyo, M. Tarek Ibn Ziad, Koustubha Bhat, Kanad Sinha, Simha Sethumadhavan

    CoRR   abs/1906.01838   2019


    Publishing type:Research paper (scientific journal)  


  • Evaluating Energy-Efficiency of DRAM Channel Interleaving Schemes for Multithreaded Programs Reviewed

    Satoshi Imamura, Yuichiro Yasui, Koji Inoue, Takatsugu Ono, Hiroshi Sasaki, Katsuki Fujisawa

    IEICE Transactions on Information and Systems   E101D ( 9 )   2247 - 2257   2018.9


    Language:English   Publishing type:Research paper (scientific journal)   Publisher:IEICE-INST ELECTRONICS INFORMATION COMMUNICATIONS ENG  

    The power consumption of server platforms has been increasing as the amount of hardware resources equipped on them is increased. Especially, the capacity of DRAM continues to grow, and it is not rare that DRAM consumes higher power than processors on modern servers. Therefore, a reduction in the DRAM energy consumption is a critical challenge to reduce the system-level energy consumption. Although it is well known that improving row buffer locality (RBL) and bank-level parallelism (BLP) is effective to reduce the DRAM energy consumption, our preliminary evaluation on a real server demonstrates that RBL is generally low across 15 multithreaded benchmarks. In this paper, we investigate the memory access patterns of these benchmarks using a simulator and observe that cache line-grained channel interleaving schemes, which are widely applied to modern servers including multiple memory channels, hurt the RBL each of the benchmarks potentially possesses. In order to address this problem, we focus on a row-grained channel interleaving scheme and compare it with three cache line-grained schemes. Our evaluation shows that it reduces the DRAM energy consumption by 16.7%, 12.3%, and 5.5% on average (up to 34.7%, 28.2%, and 12.0%) compared to the other schemes, respectively.

    DOI: 10.1587/transinf.2017EDP7296

    Web of Science


  • Power-Efficient Breadth-First Search with DRAM Row Buffer Locality-Aware Address Mapping Reviewed

    Satoshi Imamura, Yuichiro Yasui, Koji Inoue, Takatsugu Ono, Hiroshi Sasaki, Katsuki Fujisawa

    In 2016 High Performance Graph Data Management and Processing Workshop (HPGDMP)   17 - 24   2017.1


    Publishing type:Research paper (international conference proceedings)  

    Graph analysis applications have been widely used in real services such as road-traffic analysis and social network services. Breadth-first search (BFS) is one of the most representative algorithms for such applications; therefore, many researchers have tuned it to maximize performance. On the other hand, owing to the strict power constraints of modern HPC systems, it is necessary to improve power efficiency (i.e., performance per watt) when executing BFS. In this work, we focus on the power efficiency of DRAM and investigate the memory access pattern of a state-of-the-art BFS implementation using a cycle-accurate processor simulator. The results reveal that the conventional address mapping schemes of modern memory controllers do not efficiently exploit row buffers in DRAM. Thus, we propose a new scheme called per-row channel interleaving and improve the DRAM power efficiency by 30.3% compared to a conventional scheme for a certain simulator setting. Moreover, we demonstrate that this proposed scheme is effective for various configurations of memory controllers.

    DOI: 10.1109/HPGDMP.2016.010

    Scopus


  • Enhanced Dependence Graph Model for Critical Path Analysis on Modern Out-of-Order Processors. Reviewed

    Teruo Tanimoto, Takatsugu Ono, Koji Inoue, Hiroshi Sasaki

    IEEE Computer Architecture Letters.   16 ( 2 )   111 - 114   2017


    Publishing type:Research paper (scientific journal)  

    DOI: 10.1109/LCA.2017.2684813


  • Heavy Tails in Program Structure. Reviewed

    Hiroshi Sasaki, Fang-Hsiang Su, Teruo Tanimoto, Simha Sethumadhavan

    IEEE Computer Architecture Letters.   16 ( 1 )   34 - 37   2017


    Language:English   Publishing type:Research paper (scientific journal)   Publisher:Institute of Electrical and Electronics Engineers Inc.  

    Designing and optimizing computer systems require deep understanding of the underlying system behavior. Historically many important observations that led to the development of essential hardware and software optimizations were driven by empirical observations about program behavior. In this paper, we report an interesting property of program structures by viewing dynamic program execution as a changing network. By analyzing the communication network created as a result of dynamic program execution, we find that communication patterns follow heavy-tailed distributions. In other words, a few instructions have consumers that are orders of magnitude larger than most instructions in a program. Surprisingly, these heavy-tailed distributions follow the iconic power law previously seen in man-made and natural networks. We provide empirical measurements based on the SPEC CPU2006 benchmarks to validate our findings as well as perform semantic analysis of the source code to reveal the causes of such behavior.

    DOI: 10.1109/LCA.2016.2574350

    Scopus


  • Mitigating Power Contention: A Scheduling Based Approach. Reviewed

    Hiroshi Sasaki, Alper Buyuktosunoglu, Augusto Vega, Pradip Bose

    IEEE Computer Architecture Letters.   16 ( 1 )   60 - 63   2017


    Publishing type:Research paper (scientific journal)  

    DOI: 10.1109/LCA.2016.2572080


  • A Runtime Optimization Selection Framework to Realize Energy Efficient Networks-on-Chip Reviewed

    Yuan He, Masaaki Kondo, Takashi Nakada, Hiroshi Sasaki, Shinobu Miwa, Hiroshi Nakamura

    IEICE Transactions on Information and Systems   E99D ( 12 )   2881 - 2890   2016.12


    Language:English   Publishing type:Research paper (scientific journal)   Publisher:IEICE-INST ELECTRONICS INFORMATION COMMUNICATIONS ENG  

    Networks-on-Chip (or NoCs, for short) play important roles in modern and future multi-core processors as they are highly related to both performance and power consumption of the entire chip. Up to date, many optimization techniques have been developed to improve NoC's bandwidth, latency and power consumption. But a clear answer to how energy efficiency is affected with these optimization techniques is yet to be found since each of these optimization techniques comes with its own benefits and overheads while there are also too many of them. Thus, here comes the problem of when and how such optimization techniques should be applied. In order to solve this problem, we build a runtime framework to throttle these optimization techniques based on concise performance and energy models. With the help of this framework, we can successfully establish adaptive selections over multiple optimization techniques to further improve performance or energy efficiency of the network at runtime.

    DOI: 10.1587/transinf.2016PAP0026

    Web of Science


  • A scalability analysis of many cores and on-chip mesh networks on the TILE-Gx platform Reviewed

    Ye Liu, Hiroshi Sasaki, Shinpei Kato, Masato Edahiro

    In Proceedings of the 10th IEEE International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)   46 - 52   2016


    Language:English   Publishing type:Research paper (international conference proceedings)   Publisher:IEEE  

    TILE-Gx processors that have emerged in recent years can be considered as the representative of prevailing manycore processors. The available TILE-Gx processors are featured with directory-based cache coherence protocol, two-dimensional mesh networks and up to 72 on-chip cores. In this paper, we study and analyze problems of performance scalability and network collision of many-core processors using the TILE-Gx36 processor.
    We find that most multi-threaded programs from the PARSEC benchmark suite, which aim at shared-memory on-chip processors, cannot scale well on Linux as the number of cores increases. Meanwhile, applications compiled with Pthreads get affected by the approach of task-to-core assignment. The results also show that current multi-threaded applications do not entirely utilize the hardware resources on the TILE-Gx36 processor. Moreover, OS designers might need to pay attention to the memory allocation if memory striping is not supported, because huge memory accesses to only one memory controller can burden the two-dimensional mesh network. This observation appears if cores access the farther memory controllers intensively as well.

    DOI: 10.1109/MCSoC.2016.40

    Web of Science


  • Runtime Multi-Optimizations for Energy Efficient On-chip Interconnections Reviewed

    Yuan He, Masaaki Kondo, Takashi Nakada, Hiroshi Sasaki, Shinobu Miwa, Hiroshi Nakamura

    In Proceedings of the 33nd IEEE International Conference on Computer Design (ICCD)   455 - 458   2015


    Language:English   Publishing type:Research paper (international conference proceedings)   Publisher:IEEE  

    On-chip interconnection (or NoC) is a major performance and power contributor to modern and future multicore processors. So far, many optimization techniques have been developed to improve its bandwidth, latency and power consumption. But it is not clear how energy efficiency is affected since an optimization technique normally comes with overheads. This paper thus attempts to address when and how such optimization techniques should be applied and tuned to help achieve better energy efficiency. We firstly model the performance and energy impacts of representative NoC optimization techniques. These models help us more easily understand the consequences when applying these optimization techniques and their combinations under different circumstances. Moreover, based on such modeling, we propose and implement an adaptive control over these NoC optimization techniques to improve both performance and energy efficiency of the network. Our results show that, this proposal can achieve an average improvement of 26% and 57% on network performance and energy delay product, respectively.

    DOI: 10.1109/ICCD.2015.7357147

    Web of Science


  • A Flexible Hardware Barrier Mechanism for Many-Core Processors Reviewed

    Takeshi Soga, Hiroshi Sasaki, Tomoya Hirao, Masaaki Kondo, Koji Inoue

    In Proceedings of the 20th Asia and South Pacific Design Automation Conference (ASP-DAC)   61 - 68   2015


    Language:English   Publishing type:Research paper (international conference proceedings)   Publisher:IEEE  

    This paper proposes a new hardware barrier mechanism which offers the flexibility to select which cores should join the synchronization, allowing for executing multiple multi-threaded applications by dividing a many-core processor into several groups. Experimental results based on an RTL simulation show that our hardware barrier achieves a 66-fold reduction in latency over typical software based implementations, with a hardware overhead of the processor of only 1.8%. Additionally, we demonstrate that the proposed mechanism is sufficiently flexible to cover a variety of core groups with minimal hardware overhead.

    DOI: 10.1109/ASPDAC.2015.7058982

    Web of Science


  • Power-Capped DVFS and Thread Allocation with ANN Models on Modern NUMA Systems. Reviewed

    Satoshi Imamura, Hiroshi Sasaki, Koji Inoue, Dimitrios S. Nikolopoulos

    In Proceedings of the 32nd IEEE International Conference on Computer Design (ICCD)   324 - 331   2014


    Publishing type:Research paper (international conference proceedings)  

    DOI: 10.1109/ICCD.2014.6974701


  • SMYLEref: A Reference Architecture for Manycore-Processor SoCs Invited Reviewed

    Masaaki Kondo, S. T. Nguyen, Tomoya Hirao, Takeshi Soga, Hiroshi Sasaki, Koji Inoue

    In Proceedings of the 18th Asia and South Pacific Design Automation Conference (ASP-DAC)   561 - 564   2013


    Language:English   Publishing type:Research paper (international conference proceedings)   Publisher:IEEE  

    Nowadays, the trend of developing micro-processors with tens of cores brings a promising prospect for embedded systems. Realizing a high performance and low power many-core processor is becoming a primary technical challenge. We are currently developing a many-core processor architecture for embedded systems as a part of a NEDO project. This paper introduces the many-core architecture called SMYLEref along with the concept of Virtual Accelerator on Many-core, in which many cores on a chip are utilized as a hardware platform for realizing multiple virtual accelerators. We are developing its prototype system with off-the-shelf FPGA evaluation boards. In this paper, we introduce the architecture of SMYLEref and the details of the prototype system. In addition, several initial experiments with the prototype system are also presented.

    DOI: 10.1109/ASPDAC.2013.6509656

    Web of Science


  • Line Sharing Cache: Exploring Cache Capacity with Frequent Line Value Locality Reviewed

    Keitarou Oka, Hiroshi Sasaki, Koji Inoue

    In Proceedings of the 18th Asia and South Pacific Design Automation Conference (ASP-DAC)   669 - 674   2013


    Language:English   Publishing type:Research paper (international conference proceedings)   Publisher:IEEE  

    This paper proposes a new last level cache architecture called line sharing cache (LSC), which can reduce the number of cache misses without increasing the size of the cache memory. It stores lines which contain the identical value in a single line entry, which enables to store greater amount of lines. Evaluation results show performance improvements of up to 35% across a set of SPEC CPU2000 benchmarks.

    DOI: 10.1109/ASPDAC.2013.6509677

    Web of Science


  • Predict-More Router: A Low Latency NoC Router with More Route Predictions. Reviewed

    Yuan He, Hiroshi Sasaki, Shinobu Miwa, Hiroshi Nakamura

    In Proceedings of the 2013 IEEE International Parallel and Distributed Processing Workshops and Phd Forum (IPDPSW), Communication Architecture for Scalable Systems (CASS)   842 - 850   2013


    Language:English   Publishing type:Research paper (international conference proceedings)   Publisher:IEEE Computer Society  

    Network-on-Chip (NoC) is a critical part of the memory hierarchy of emerging multicores. Lowering its communication latency while preserving its bandwidth is key to achieving high system performance. By now, one of the most effective methods that helps achieve this goal is the prediction router (PR). PR works by predicting the route an incoming packet may be transferred to: it speculatively allocates resources (virtual channels and the switch crossbar) to the packet and traverses the packet's flits using this predicted route in a single cycle without waiting for route computation; however, if the prediction misses, the packet will then be processed in the conventional pipeline (in our work, four cycles) and the speculatively allocated router resources will be wasted. Obviously, prediction accuracy contributes to the amount of successful predictions, latency reduction and bandwidth consumption. We find that predictions hit around 65% for most applications even under the best algorithm, so in such cases PR can at most accelerate about 65% of the packets while the remaining 35% will consume extra router resources and bandwidth. In order to increase the prediction accuracy, we propose a technique which makes use of multiple prediction algorithms at the same time for one incoming packet. Such a prediction is more accurate. With this proposal, we design and implement the predict-more router (PmR). While effectively increasing the prediction accuracy, PmR also helps utilize the remaining bandwidth within the router more productively. When both PmR and PR are evaluated under their best algorithm(s), we find that PmR is over 15% higher in prediction accuracy than PR, which helps PmR outperform PR by 3.5% on average in speeding-up the system. We also find that although PmR creates more contentions in prediction, these contentions can be well resolved and are kept within the router so both router internal bandwidth and link bandwidth are not exacerbated with it.

    DOI: 10.1109/IPDPSW.2013.40

    Scopus


  • Power and Performance of GPU-Accelerated Systems: A Closer Look. Reviewed

    Yuki Abe, Hiroshi Sasaki, Shinpei Kato, Koji Inoue, Masato Edahiro, Martin Peres

    In Proceedings of the 2013 IEEE International Symposium on Workload Characterization (IISWC)   109 - 110   2013


    Publishing type:Research paper (international conference proceedings)  

    DOI: 10.1109/IISWC.2013.6704675


  • Power and Performance Analysis of GPU-Accelerated Systems. Reviewed

    Yuki Abe, Hiroshi Sasaki, Martin Peres, Koji Inoue, Kazuaki J. Murakami, Shinpei Kato

    In 2012 Workshop on Power-Aware Computing and Systems (HotPower)   2012


    Publishing type:Research paper (international conference proceedings)  


  • Performance Evaluation of 3D Stacked Multi-Core Processors with Temperature Consideration. Reviewed

    Takaaki Hanada, Hiroshi Sasaki, Koji Inoue, Kazuaki J. Murakami

    In Proceedings of the 2011 IEEE International 3D Systems Integration Conference (3DIC)   1 - 5   2012


    Publishing type:Research paper (international conference proceedings)  

    DOI: 10.1109/3DIC.2012.6263025


  • Energy-Efficient Dynamic Instruction Scheduling Logic Through Instruction Grouping Reviewed

    Hiroshi Sasaki, Masaaki Kondo, Hiroshi Nakamura

    IEEE Transactions on Very Large Scale Integration Systems (TVLSI)   17 ( 6 )   848 - 852   2009.6


    Language:English   Publishing type:Research paper (scientific journal)   Publisher:IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC  

    Dynamic instruction scheduling logic is quite complex and dissipates significant energy in microprocessors that support superscalar and out-of-order execution. We propose a novel microarchitectural technique to reduce the complexity and energy consumption of the dynamic instruction scheduling logic. The proposed method groups several instructions as a single issue unit and reduces the required number of ports and the size of the structure. This paper describes the microarchitecture mechanisms and shows evaluation results for energy savings and performance. These results reveal that the proposed technique can greatly reduce energy with almost no performance degradation, compared to the conventional dynamic instruction scheduling logic.

    DOI: 10.1109/TVLSI.2009.2013397

    Web of Science


  • Power-Performance Modeling of Heterogeneous Cluster-Based Web Servers. Reviewed

    Hiroshi Sasaki, Takatsugu Oya, Masaaki Kondo, Hiroshi Nakamura

    In Proceedings of the 2009 20th IEEE/ACM International Conference on Grid Computing (Grid)   35 ( 1 )   225 - 231   2009


    Publishing type:Research paper (international conference proceedings)  

    DOI: 10.1109/GRID.2009.5353057


  • Cooperative Shared Resource Access Control for Low-Power Chip Multiprocessors Reviewed

    Noriko Takagi, Hiroshi Sasaki, Masaaki Kondo, Hiroshi Nakamura

    In Proceedings of the 14th ACM/IEEE International Symposium on Low Power Electronics and Design (ISLPED)   177 - 182   2009


    Language:English   Publishing type:Research paper (international conference proceedings)   Publisher:ASSOC COMPUTING MACHINERY  

    In a single-chip multiprocessor (CMP), the last-level cache and its lower memory hierarchy components are typically shared by multiple processors. Conflicts in these resources lead to poor overall performance of the CMP and/or unpredictable performance of the individual cores. If applications on different; cores have different performance constraints, even though these constraints can be satisfied by dynamic voltage and frequency scaling (DVFS) control of each core, conflicts in shared resources will lead to increased power consumption. Therefore, in the present paper, we derive a condition whereby, under resource conflicts, the total power consumption is minimized by a newly developed power consumption model and propose a method by which to minimize the power consumption of CMPs by cooperative access control of multiple shared resources and DVFS control. Experimental results reveal that the proposed technique can reduce power consumption by 15% on average in a dual-core CMP and by 13% in a quad-core CMP, as compared to the case in which only DVFS control is applied.

    DOI: 10.1145/1594233.1594278

    Web of Science


  • Improving Fairness, Throughput and Energy-Efficiency on a Chip Multiprocessor through DVFS. Reviewed

    Masaaki Kondo, Hiroshi Sasaki, Hiroshi Nakamura

    SIGARCH Computer Architecture News   35 ( 1 )   31 - 38   2007


    Publishing type:Research paper (scientific journal)  

    DOI: 10.1145/1241601.1241609


  • An Intra-Task DVFS Technique Based on Statistical Analysis of Hardware Events. Reviewed

    Hiroshi Sasaki, Yoshimichi Ikeda, Masaaki Kondo, Hiroshi Nakamura

    In Proceedings of the 4th ACM International Conference on Computing Frontiers (CF)   123 - 130   2007


    Publishing type:Research paper (international conference proceedings)  

    DOI: 10.1145/1242531.1242551


  • Energy-Efficient Dynamic Instruction Scheduling Logic through Instruction Grouping. Reviewed

    Hiroshi Sasaki, Masaaki Kondo, Hiroshi Nakamura

    In Proceedings of the 2006 ACM International Symposium on Low Power Electronics and Design (ISLPED)   43 - 48   2006


    Publishing type:Research paper (international conference proceedings)  

    DOI: 10.1145/1165573.1165585


  • Dynamic Instruction Cascading on GALS Microprocessors Reviewed

    Hiroshi Sasaki, Masaaki Kondo, Hiroshi Nakamura

    In 2005 International Workshop on Power and Timing Modeling, Optimization and Simulation (PATMOS)   30 - 39   2005


    Language:English   Publishing type:Research paper (international conference proceedings)   Publisher:SPRINGER-VERLAG BERLIN  

    As the difficulty and cost of distributing a single global clock throughout a processor grow generation by generation, Globally-Asynchronous Locally-Synchronous (GALS) designs are an alternative approach to conventional synchronous processors.
    In this paper, we propose Dynamic Instruction Cascading (DIC). DIC is a technique to execute two dependent instructions in one cycle by scaling down the clock frequency. Lowering the clock frequency enables the signal to reach farther, thereby computing two instructions in one cycle becomes possible. DIC is effectively applied to GALS processors because lowering only the clock frequency of the target domain is needed and therefore unwanted performance degradation will be prevented.
    The results showed average performance improvement of 7% on SPEC CPU2000 Integer and MediaBench applications when assuming that DIC is possible by lowering the clock frequency to 80%.

    DOI: 10.1007/11556930_4

    Web of Science


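Several of the papers above, including Scalability-Based Manycore Partitioning (PACT 2012), center on dividing the cores of a manycore machine among co-running parallel programs according to each program's scalability. As a minimal illustration of that idea, and not a reproduction of any paper's actual algorithm, the following sketch greedily hands each spare core to whichever program gains the most marginal speedup; the Amdahl's-law curves and parallel fractions are hypothetical stand-ins for the scalability predictions the paper derives from hardware performance counters.

```python
def speedup(parallel_fraction: float, cores: int) -> float:
    """Amdahl's-law speedup of a program run on `cores` cores."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cores)

def partition_cores(parallel_fractions, total_cores):
    """Give every program one core, then assign each remaining core to
    the program with the largest marginal speedup gain."""
    alloc = [1] * len(parallel_fractions)
    for _ in range(total_cores - len(parallel_fractions)):
        gains = [speedup(p, c + 1) - speedup(p, c)
                 for p, c in zip(parallel_fractions, alloc)]
        alloc[gains.index(max(gains))] += 1
    return alloc

if __name__ == "__main__":
    # Hypothetical workload mix: 99%-, 80%-, and 40%-parallel programs
    # sharing a 16-core machine; the most scalable program gets the most cores.
    print(partition_cores([0.99, 0.80, 0.40], 16))
```

The published scheduler additionally re-estimates scalability at runtime and rebalances allocations; this static greedy pass only conveys the core-allocation objective.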

Research Projects

  • A Low-Cost, Flexible, and Highly Reliable Memory Architecture Based on Network Analysis

    Grant number:22K19771  2022 - 2024

    Japan Society for the Promotion of Science  Grants-in-Aid for Scientific Research  Challenging Research (Exploratory)

    SASAKI Hiroshi


    Authorship:Principal investigator 

    Grant amount: ¥6,500,000 ( Direct Cost: ¥5,000,000, Indirect Cost: ¥1,500,000 )


  • Research and Development of a RISC-V System Design Platform

    2021 - 2024

    New Energy and Industrial Technology Development Organization (NEDO)  Development of AI Chip and Next-Generation Computing Technologies for High-Efficiency and High-Speed Processing / R&D Item 4: Technology Development for Accelerating the Industrial Application of AI Edge Computing


    Authorship:Coinvestigator(s) 


  • Research on Energy-Secure Computer Systems

    Grant number:26700004  2014 - 2018

    Japan Society for the Promotion of Science  Grants-in-Aid for Scientific Research  Grant-in-Aid for Young Scientists (A)

    SASAKI Hiroshi


    Authorship:Principal investigator 

    Grant amount: ¥4,290,000 ( Direct Cost: ¥3,300,000, Indirect Cost: ¥990,000 )


  • Research on Hardware/Software Cooperative Techniques for Safe, Stable, and Power-Efficient Computer Systems

    2014 - 2016

    Japan Society for the Promotion of Science  Overseas Research Fellowships 


    Authorship:Principal investigator 


  • Dynamic optimization of CMPs based on statistical analysis

    Grant number:21700054  2009 - 2010

    Japan Society for the Promotion of Science  Grants-in-Aid for Scientific Research  Grant-in-Aid for Young Scientists (B)

    SASAKI Hiroshi


    Grant amount: ¥4,420,000 ( Direct Cost: ¥3,400,000, Indirect Cost: ¥1,020,000 )

    In a chip multiprocessor (CMP) architecture, multiple cores usually share resources in the memory hierarchy including the last-level cache, the memory bus, and the DRAM memory banks. We derive the condition under which the total CPU power consumption becomes minimum by constructing a power consumption model under resource conflicts, and propose a novel dynamic optimization method that minimizes the power consumption through cooperative access control of multiple shared resources combined with DVFS.

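The project summary above, like the DVFS papers in the publication list, revolves around trading clock frequency and supply voltage against performance to reduce power. As a rough, self-contained sketch of that trade-off rather than any project's actual method, the code below picks, from a table of hypothetical voltage/frequency operating points, the point that minimizes a simple f·V² dynamic-energy model while still meeting a runtime deadline; all numbers are illustrative, not measured values.

```python
# Hypothetical (frequency GHz, voltage V) operating points, fastest first.
OPERATING_POINTS = [(2.0, 1.10), (1.6, 1.00), (1.2, 0.90), (0.8, 0.80)]

def runtime(giga_cycles: float, freq_ghz: float) -> float:
    """Seconds needed to execute `giga_cycles` at `freq_ghz`."""
    return giga_cycles / freq_ghz

def energy(giga_cycles: float, freq_ghz: float, volt: float) -> float:
    """Relative dynamic energy: power (~ f * V^2) times runtime."""
    return (freq_ghz * volt ** 2) * runtime(giga_cycles, freq_ghz)

def pick_point(giga_cycles: float, deadline_s: float):
    """Lowest-energy point that still meets the deadline; if none does,
    fall back to the fastest point."""
    feasible = [p for p in OPERATING_POINTS
                if runtime(giga_cycles, p[0]) <= deadline_s]
    if not feasible:
        return OPERATING_POINTS[0]
    return min(feasible, key=lambda p: energy(giga_cycles, p[0], p[1]))

if __name__ == "__main__":
    # A 2.4 giga-cycle task with 2.5 s of slack can afford a lower point.
    print(pick_point(giga_cycles=2.4, deadline_s=2.5))
```

Under this simple model energy per task is proportional to V² alone, so the selector naturally gravitates to the lowest-voltage point whose frequency still meets the deadline, which is the intuition behind deadline-driven DVFS.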