Faculty Profiles - KOBAYASHI RYOHEI

KOBAYASHI RYOHEI

Organization

Institute of Integrated Research Supercomputing Research Center Associate Professor

Homepage

https://sites.google.com/site/ryokbya/

Profile

Ryohei Kobayashi received his Ph.D. in Engineering from Tokyo Institute of Technology in 2016. He then joined the Center for Computational Sciences (CCS), University of Tsukuba as an Assistant Professor in April 2016 and served until September 2024. Since October 2024, he has been an Associate Professor at the Supercomputing Research Center (SCRC), Institute of Integrated Research, Institute of Science Tokyo. He has also held a concurrent appointment as a Visiting Researcher with the Processor Research Team, RIKEN Center for Computational Science (R-CCS) since July 2021. His research interests include computer systems, high-performance computing (HPC), accelerators (GPU/FPGA), and reconfigurable computing, with a focus on GPU–FPGA cooperative computing and FPGA systems for HPC. He leads the Advanced Computing ACceleration (AC2) Laboratory, which advances hardware–software co-design for massively parallel systems based on accelerators. His honors include the HPC in Asia Poster Award (ISC 2018) and the IEICE CPSY Young Presentation Award (2015). He has served in program roles such as Proceedings Chair for HPC Asia 2026 and Publicity Co-Chair for IEEE Cluster 2025. He is a member of ACM, IEEE/IEEE CS, IPSJ, and IEICE.

Degree

Doctor of Engineering ( 2016.3 Tokyo Institute of Technology )
Master of Engineering ( 2013.3 Tokyo Institute of Technology )

Research Interests

Accelerator
FPGA
GPU
Accelerating Parallel Applications
Reconfigurable Computing System

Research Areas

Informatics / High performance computing
Informatics / Computer system

Education

Tokyo Institute of Technology Graduate School of Information Science and Engineering Department of Computer Science, Ph.D. course

2013.4 - 2016.3

Country： Japan

Tokyo Institute of Technology Graduate School of Information Science and Engineering Department of Computer Science, Master's course

2011.4 - 2013.3

Country： Japan

Sophia University Faculty of Science and Technology Electrical and Electronics Engineering

2007.4 - 2011.3

Country： Japan


Research History

Institute of Science Tokyo Supercomputing Research Center, Institute of Integrated Research Associate Professor

2024.10

RIKEN Center for Computational Science Processor Research Team Visiting Researcher

2021.7

University of Tsukuba Center for Computational Sciences Assistant Professor

2016.4 - 2024.9

Country：Japan


Professional Memberships

Institute of Electrical and Electronics Engineers

2020.5

Association for Computing Machinery

2017.11

The Institute of Electronics, Information and Communication Engineers (IEICE)

2014.4

Information Processing Society of Japan

2011.4


Committee Memberships

SC25 Program Committee Member

2025.3 - 2025.11

Committee type：Academic society

xSIG 2025 Program Committee Member

2025.3 - 2025.8

Committee type：Academic society

FPL 2025 (35th International Conference on Field Programmable Logic & Applications) Program Committee Member

2025.2 - 2025.9

Committee type：Academic society

Euro-PAR 2025: 31st International European Conference on Parallel and Distributed Computing Program Committee Member

2025.2 - 2025.8

Committee type：Academic society

32nd Reconfigurable Architectures Workshop (RAW 2025) Program Committee Member

2024.11 - 2025.6

Committee type：Academic society

CCGRID2025: The 25th IEEE International Symposium on Cluster, Cloud, and Internet Computing Program Committee Member

2024.10 - 2025.5

Committee type：Academic society

COOL Chips 28 Program Committee Vice Chair

2024.9 - 2025.4

Committee type：Academic society

SCA/HPCAsia 2026 Proceedings Chair

2024.8 - 2026.1

Committee type：Academic society

IEEE Cluster 2025 Publicity Co-Chair

2024.8 - 2025.9

Committee type：Academic society

SupercomputingAsia 2025 Program Committee Member

2024.8 - 2025.3

International Symposium on Highly Efficient Accelerators and Reconfigurable Technologies (HEART 2025) Program Committee Member

2024.7 - 2025.5

Committee type：Academic society

"Special Section on Low-Power and High-Speed Chips and Systems" Editorial Committee, Editorial Secretary

2024.5 - 2026.6

Information Processing Society of Japan, Special Interest Group on High Performance Computing, Steering Committee Secretary

2024.4 - 2028.5

Innovative High Performance Computing Infrastructure (HPCI) Interdisciplinary Joint Research Working Group Member

2024.4 - 2026.3

Committee type：Academic society

FPL 2024 (34th International Conference on Field Programmable Logic & Applications) Program Committee Member

2024.4 - 2024.9

ICPP2024 Program Committee Member

2024.4 - 2024.8

xSIG 2024 Program Committee Member

2024.4 - 2024.8

Euro-PAR 2024: 30th International European Conference on Parallel and Distributed Computing Program Committee Member

2024.3 - 2024.8

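Joint Usage/Research Center for Interdisciplinary Large-scale Information Infrastructures (JHPCN) Joint Research Project Review Committee Member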

2024.1 - 2025.12

Committee type：Academic society

2nd Workshop on FPGA Technologies for Adaptive Computing (FTAC 2024) Program Committee Member

2024.1 - 2024.6

"Special Section on Forefront Computing" Editorial Committee, Editorial Secretary

2023.11 - 2026.1

Term: November 15, 2023 to January 1, 2025 (until publication of the special issue)

International Symposium on Highly Efficient Accelerators and Reconfigurable Technologies (HEART 2024) Program Committee Member

2023.11 - 2024.6

ICS 2024: International Conference on Supercomputing 2024 Web Liaison

2023.7 - 2024.6

COOL Chips 27 Program Committee Vice Chair

2023.7 - 2024.4

IEICE ISS Society Journal Editorial Secretary

2023.6 - 2025.6

IEICE Technical Committee on Computer Systems Secretary

2023.6 - 2025.6

IEEE Cluster 2024 Digital Chair

2023.6 - 2024.9

IEEE RTCSA/NVMSA2023 Poster Chair

2023.6 - 2023.9

Adaptive Computing Research Initiative (ACRi) Publicity and Events WG Deputy Group Leader

2023.4 - 2026.3

FPT’23 Program Committee Member

2023.4 - 2023.12

CANDAR2023 CSA Program Committee Member

2023.4 - 2023.11

CANDAR2023 Program Committee Member

2023.4 - 2023.11

Open Accelerated Computing Summit 2023 Review Committee Member

2023.4 - 2023.10

"Low-Power and High-Speed Chips" Special Section Editorial Committee, Editorial Secretary

2023.3 - 2024.6

xSIG 2023 Program Committee Member

2023.1 - 2023.8

Committee type：Academic society

Summer United Workshops on Parallel, Distributed and Cooperative Processing (SWoPP) Executive Committee Member

2022.12 - 2026.12

FPL 2023 (33rd International Conference on Field Programmable Logic & Applications) Program Committee Member

2022.12 - 2023.9

Summer United Workshops on Parallel, Distributed and Cooperative Processing (SWoPP) Organizing Committee Member (Observer)

2022.11 - 2023.12

International Symposium on Highly Efficient Accelerators and Reconfigurable Technologies (HEART 2023) Program Committee Member

2022.11 - 2023.6

COOL Chips 26 Program Committee Vice Chair

2022.9 - 2023.4

HPC Asia 2023 Proceedings Chair

2022.5 - 2023.3

Committee type：Academic society

FPT’23 Publication Chair

2022.4 - 2023.12

CANDAR2022 Program Committee Member

2022.4 - 2022.11

CANDAR2022 CSA Program Committee Member

2022.3 - 2022.11

FPL 2022 (32nd International Conference on Field Programmable Logic & Applications) Publicity Co-chair

2022.2 - 2022.9

Committee type：Academic society

xSIG 2022 Program Committee Member

2022.2 - 2022.7

Committee type：Academic society

FPL 2022 (32nd International Conference on Field Programmable Logic & Applications) Program Committee Member

2022.1 - 2022.9

Committee type：Academic society

"Special Section on Forefront Computing" Editorial Committee, Editorial Committee Member

2021.11 - 2023.12

International Symposium on Highly Efficient Accelerators and Reconfigurable Technologies (HEART 2022) Publication Chair

2021.8 - 2022.6

Committee type：Academic society

Summer United Workshops on Parallel, Distributed and Cooperative Processing (SWoPP) Organizing Committee Chair

2021.7 - 2022.10

Committee type：Academic society

COOL Chips 25 Program Committee Vice Chair

2021.7 - 2022.4

Committee type：Academic society

IEICE Transactions (English edition) Editorial Committee Member

2021.6 - 2025.6

Committee type：Academic society

IEICE ISS Society Journal Editorial Committee Member

2021.6 - 2025.6

Committee type：Academic society

IEICE Technical Committee on Computer Systems Assistant Secretary

2021.6 - 2023.6

Committee type：Academic society

FPGA for HPC Workshop 2021 (HPC FPGA 2021) Program Committee Member

2021.6 - 2021.9

Committee type：Academic society

FPGA for HPC Workshop 2021 (HPC FPGA 2021) Organizing Deputy Co-Chairs

2021.6 - 2021.9

Committee type：Academic society

International Symposium on Highly Efficient Accelerators and Reconfigurable Technologies (HEART 2021) Publication Chair

2021.5 - 2021.6

Committee type：Academic society

Information Processing Society of Japan, Special Interest Group on High Performance Computing, Steering Committee Member

2021.4 - 2024.3

CANDAR 2021 Program Committee Member

2021.4 - 2021.11

Committee type：Academic society

CANDAR2021 CSA Program Committee Member

2021.4 - 2021.11

Committee type：Academic society

FPGA Technologies for Adaptive Computing (IEEE MCSoC 2021 Special Session) Program Committee Member

2021.1 - 2021.12

HPC Asia 2022 Digital Chair

2020.12 - 2022.1

Committee type：Academic society

xSIG 2021 Program Committee Member

2020.12 - 2021.7

Committee type：Academic society

Summer United Workshops on Parallel, Distributed and Cooperative Processing (SWoPP) Organizing Committee Vice Chair

2020.8 - 2021.7

Committee type：Academic society

COOL Chips 24 Program Committee Member

2020.8 - 2021.4

Committee type：Academic society

CANDAR 2020 Program Committee Member

2020.4 - 2020.11

Committee type：Academic society

CANDAR2020 CSA Program Committee Member

2020.3 - 2020.11

Committee type：Academic society

IEEE Cluster 2020 Registration Chair

2020.2 - 2020.9

Committee type：Academic society

xSIG 2020 Program Committee Member

2020.1 - 2020.7

Committee type：Academic society

SC20 Program Committee Member

2019.10 - 2020.11

Committee type：Academic society

Summer United Workshops on Parallel, Distributed and Cooperative Processing (SWoPP) Organizing Committee Member (in charge of the social gathering)

2019.7 - 2020.7

Committee type：Academic society

COOL Chips 23 Program Committee Member

2019.7 - 2020.4

Committee type：Academic society

FPT'19 Program Committee Member

2019.5 - 2019.12

Committee type：Academic society

CANDAR2019 CSA Program Committee Member

2019.4 - 2019.11

Committee type：Academic society

xSIG 2019 Program Committee Member

2019.1 - 2019.5

Committee type：Academic society

COOL Chips 22 Program Committee Member

2018.12 - 2019.4

Committee type：Academic society

ICPP2019 Publicity Chair

2018.6 - 2019.8

Committee type：Academic society

FPT'18 Program Committee Member

2018.6 - 2018.12

Committee type：Academic society

CANDAR2018 CSA Program Committee Member

2018.4 - 2018.11

Committee type：Academic society

Summer United Workshops on Parallel, Distributed and Cooperative Processing (SWoPP) Executive Committee Member

2018.1 - 2021.7

Committee type：Academic society

IEICE Technical Committee on Reconfigurable Systems Committee Member

2017.6 - 2029.6

Committee type：Academic society

IEICE Technical Committee on Computer Systems Committee Member

2017.6 - 2022.6

Committee type：Academic society

International Symposium on Computing and Networking (CANDAR2017) Program Committee Member

2017.4 - 2017.11

Committee type：Academic society


Papers

Initial Evaluation of a CXL Memory Pool Experimental System

遠藤敏夫, 坂本龍一, 野村哲弘, 小林諒平, 大辻弘貴, 加藤純, 古藤明音, 三輪真弘

研究報告ハイパフォーマンスコンピューティング（HPC） 2025-HPC-200 ( 24 ) 1 - 7 2025.7

Language：Japanese Publishing type：Research paper (conference, symposium, etc.) Publisher：Information Processing Society of Japan

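In HPC and cloud systems, a large amount of memory is statically allocated to each node, so rising procurement cost and power consumption have become serious problems. One promising solution is a memory pool system based on the Compute Express Link (CXL) 2.0 specification, which makes it possible to share memory resources efficiently among multiple nodes and allocate them flexibly. In this study, we built a 1 TiB CXL memory pool using H3's Falcon C5022 module and evaluated its performance on real hardware with servers equipped with Intel Granite Rapids CPUs. Specifically, we measured memory access latency with Intel Memory Latency Checker v3.11 and evaluated bandwidth with the STREAM benchmark, and used the results to quantitatively characterize the performance of CXL memory pool technology. Finally, based on these findings, we discuss practical guidelines for the optimal design and operation of CXL memory pools.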

Accelerating Deep Learning Inference with a Parallel FPGA System Reviewed

Takumi Suzuki, Ryohei Kobayashi, Norihisa Fujita, Taisuke Boku

Proceedings of the 15th International Symposium on Highly Efficient Accelerators and Reconfigurable Technologies 49 - 56 2025.5

Authorship：Corresponding author Language：English Publishing type：Research paper (international conference proceedings) Publisher：ACM

Deep learning has experienced rapid growth in applications such as image recognition and natural language processing, resulting in increasingly complex models that require more processing power and energy. While GPUs are widely used for training due to their highly parallel computing power and wide memory bandwidth, FPGAs offer a compelling alternative for inference tasks where stable, low-latency performance is essential. FPGAs allow for fine-grained hardware tuning and dedicated pipeline implementations, which can be leveraged to build multi-FPGA systems that seamlessly fuse computation and communication for Convolutional Neural Network (CNN) acceleration. However, existing multi-FPGA approaches typically require advanced hardware knowledge and are often implemented as dedicated systems, creating significant barriers for general-purpose application developers accustomed to high-level programming environments such as MPI with the host CPU. In this study, we propose a multi-FPGA-based deep learning inference accelerator that operates at the OpenCL abstraction level, enabling software engineers without extensive hardware expertise to partition and deploy CNN models, such as ResNet-50, across multiple FPGAs. Our approach combines both model and data parallelism to achieve high throughput while maintaining controlled latency. Experimental results show that our design increases throughput by a factor of 12 with only a 1.9-fold increase in latency compared to a baseline. This work paves the way for more accessible FPGA-based acceleration solutions for deep learning inference in real-world applications.

DOI： 10.1145/3728179.3728186

Proposal and Preliminary Evaluation of an Iteration-Level Approximate Computing Method

和田康孝, 小林諒平, 森江善之, 坂本龍一

研究報告ハイパフォーマンスコンピューティング（HPC） 2025-HPC-199 ( 6 ) 1 - 5 2025.5

Language：Japanese Publishing type：Research paper (conference, symposium, etc.) Publisher：Information Processing Society of Japan

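Approximate Computing (AC), which optimizes the trade-off among computational performance, power consumption, and accuracy of results by changing arithmetic precision, is a promising way to obtain performance beyond the usual limits under constraints such as power consumption. To benefit from AC in applications that are sensitive to arithmetic precision, such as HPC applications, precision must be tuned at fine granularity for each element of the application instead of using one uniform precision throughout. This paper describes an iteration-level AC method that applies AC by exploiting structures characteristic of HPC applications, such as time-evolution loops, and presents preliminary evaluation results.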

Evaluation of Trade-Off Between Compression Ratio and Hardware Cost for Adaptive Bandwidth Compression Hardware Platform Reviewed International coauthorship International journal

Tomohiro Ueno, Kaito Kitazume, Masato Kiyama, Kazutomo Yoshii, Kento Sato, Norihisa Fujita, Ryohei Kobayashi, Taisuke Boku, Kentaro Sano

2025 IEEE Symposium on Low-Power and High-Speed Chips and Systems (COOL CHIPS) 1 - 6 2025.4

Language：English Publishing type：Research paper (international conference proceedings) Publisher：IEEE

Hardware-based, high-throughput data compression is a promising approach to reduce data movement costs in large-scale systems and networks, thereby improving overall performance and power efficiency. However, since a data compression algorithm is only effective if it suits the characteristics of the target data, data compression hardware with a fixed algorithm is unrealistic for use in general-purpose environments. To address this challenge, we have been researching adaptive bandwidth compression (ABC) hardware that flexibly provides an effective algorithm depending on the input data. This paper presents the design and the Chisel-based implementation of the ABC hardware platform for encoding compressed data and generating output blocks, and introduces a quantization parameter to reduce the circuit area. Our evaluation shows that the proposed quantization parameter can not only reduce the hardware cost, but also control the trade-off between the effective compression ratio and the hardware cost. In addition, based on the evaluation results, we discuss the design optimization of the ABC hardware platform.

DOI： 10.1109/coolchips65488.2025.11018561

Implementation and Evaluation of Flow Control for the Parallel Inter-FPGA Communication Framework CIRCUS

北爪開人, 藤田典久, 小林諒平, 朴泰祐

研究報告ハイパフォーマンスコンピューティング（HPC） 2025-HPC-198 ( 58 ) 1 - 9 2025.3

Language：Japanese Publishing type：Research paper (conference, symposium, etc.)

Preliminary Evaluation Toward Performance Modeling of High-Throughput Asynchronous Collective Communication

森江善之, 和田康孝, 小林諒平, 坂本龍一, 南里豪志

研究報告ハイパフォーマンスコンピューティング（HPC） 2025-HPC-198 ( 49 ) 1 - 6 2025.3

Language：Japanese Publishing type：Research paper (conference, symposium, etc.)

Performance Evaluation of a GPU-Accelerated General Relativistic Radiation Magnetohydrodynamics Simulation Code

小林諒平, 高橋博之, 額田彰, 朝比奈雄太, 朴泰祐, 大須賀健

研究報告ハイパフォーマンスコンピューティング（HPC） 2025-HPC-198 ( 60 ) 1 - 8 2025.3

Authorship：Lead author,　Corresponding author Language：Japanese Publishing type：Research paper (conference, symposium, etc.)

Accelerating General Relativistic Radiation Magnetohydrodynamic Simulations with GPUs Reviewed

Ryohei Kobayashi, Hiroyuki R. Takahashi, Akira Nukada, Yuta Asahina, Taisuke Boku, Ken Ohsuga

Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region 72 - 79 2025.2

Authorship：Lead author,　Corresponding author Language：English Publishing type：Research paper (international conference proceedings) Publisher：ACM

DOI： 10.1145/3712031.3712032

CHARM-SYCL & IRIS: A Tool Chain for Performance Portability on Extremely Heterogeneous Systems Reviewed International coauthorship

Norihisa Fujita, Beau Johnston, Narasinga Rao Miniskar, Ryohei Kobayashi, Mohammad Alaul Haque Monil, Keita Teranishi, Seyong Lee, Jeffrey S. Vetter, Taisuke Boku

2024 IEEE 20th International Conference on e-Science (e-Science) 1 - 10 2024.9

Language：English Publishing type：Research paper (international conference proceedings)

DOI： 10.1109/e-Science62913.2024.10678717

Other Link： https://dblp.uni-trier.de/rec/conf/eScience/2024
Chisel Implementation and Evaluation of an Adaptive Bandwidth Compression Hardware Platform

北爪開人, 上野知洋, 吉井一友, 木山真人, 藤田典久, 小林諒平, 佐野健太郎, 朴泰祐

電子情報通信学会技術研究報告 = IEICE technical report : 信学技報 124 ( 188 ) 41 - 46 2024.9

Language：Japanese Publishing type：Research paper (conference, symposium, etc.)

Using SYCLomatic to Migrate CUDA Code to oneAPI Adapting NVIDIA GPU Reviewed

Wentao Liang, Norihisa Fujita, Ryohei Kobayashi, Taisuke Boku

2024 IEEE International Conference on Cluster Computing Workshops (CLUSTER Workshops) 192 - 193 2024.9

Language：English Publishing type：Research paper (international conference proceedings)

DOI： 10.1109/CLUSTERWorkshops61563.2024.00054

Other Link： https://dblp.uni-trier.de/rec/conf/cluster/2024w
Preliminary Performance Evaluation of Grace-Hopper GH200 Reviewed

Toshihiro Hanawa, Kengo Nakajima, Yohei Miki, Takashi Shimokawabe, Kazuya Yamazaki, Shinji Sumimoto, Osamu Tatebe, Taisuke Boku, Daisuke Takahashi, Akira Nukada, Norihisa Fujita, Ryohei Kobayashi, Hiroto Tadano, Akira Naruse

2024 IEEE International Conference on Cluster Computing Workshops (CLUSTER Workshops) 184 - 185 2024.9

Language：English Publishing type：Research paper (international conference proceedings)

DOI： 10.1109/CLUSTERWorkshops61563.2024.00050

Other Link： https://dblp.uni-trier.de/rec/conf/cluster/2024w
Preliminary Evaluation of Kyokko for Inter-FPGA Communication Framework CIRCUS Reviewed

Kaito Kitazume, Norihisa Fujita, Ryohei Kobayashi, Taisuke Boku

2024 IEEE International Conference on Cluster Computing Workshops (CLUSTER Workshops) 194 - 195 2024.9

Language：English Publishing type：Research paper (international conference proceedings)

DOI： 10.1109/CLUSTERWorkshops61563.2024.00055

Other Link： https://dblp.uni-trier.de/rec/conf/cluster/2024w
Improving Performance on Replica-Exchange Molecular Dynamics Simulations by Optimizing GPU Core Utilization Reviewed

Taisuke Boku, Masatake Sugita, Ryohei Kobayashi, Shinnosuke Furuya, Takuya Fujie, Masahito Ohue, Yutaka Akiyama

ICPP '24: Proceedings of the 53rd International Conference on Parallel Processing 1082 - 1091 2024.8

Language：English Publishing type：Research paper (international conference proceedings) Publisher：Association for Computing Machinery

While GPUs are the main accelerators in high-performance computing systems, their performance depends on how well the large number of cores on each device is used in parallel. Typically, a loop with many iterations is assigned to a device and its iterations are mapped onto the cores, so there must be enough iterations to fill the thousands of cores in a high-end GPU.

Advanced GPUs such as the NVIDIA H100 provide techniques such as Multi-Process Service (MPS) and Multi-Instance GPU (MIG), which divide the GPU cores among multiple user processes, to improve core utilization even when the degree of parallelism is small. We apply MPS to a practical molecular dynamics (MD) simulation with the AMBER software to improve GPU core utilization and save computational resources. The critical issue is to analyze the core utilization and overhead of running multiple processes on one GPU device, together with multi-GPU and multi-node parallel execution, for overall performance improvement.

In this paper, we introduce a method for applying MPS to AMBER to simulate the membrane permeation process of a drug candidate peptide with a two-dimensional replica-exchange method on an advanced supercomputer with NVIDIA H100 GPUs. We applied several optimizations to the parameter settings on NVIDIA H100 and V100 GPUs and investigated their performance behavior. We found that GPU core utilization improves by up to a factor of two compared with a simple process assignment method.

DOI： 10.1145/3673038.3673097

Evaluation of a Multi-Task Mini-Benchmark in Diverse Environments and Performance Portability International coauthorship

藤田典久, Beau Johnston, 小林諒平, Mohammad Alaul Haque Monil, Narasinga Rao Miniskar, Keita Teranishi, Seyong Lee, Jeffrey S. Vetter, 朴泰祐

研究報告ハイパフォーマンスコンピューティング（HPC） 2024-HPC-195 ( 3 ) 1 - 10 2024.8

Language：Japanese Publishing type：Research paper (conference, symposium, etc.)

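As HPC systems grow more diverse, application portability has become an important issue in making use of varied systems. This paper proposes CHARM-SYCL, a programming environment that handles multiple accelerators in a unified manner, as a development environment for achieving application portability. CHARM-SYCL can generate kernels for multiple accelerators from a single code base, and it can also use the IRIS library developed at ORNL as a backend. IRIS has a high-performance scheduler and can run compute tasks on multiple accelerators; combining CHARM-SYCL with IRIS achieves high application portability. In this paper, we apply the CHARM-SYCL development environment to a benchmark code for Monte Carlo simulation and show that the proposed system achieves high application portability.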

Preliminary Performance Evaluation of GH200

塙敏博, 建部修見, 中島研吾, 朴泰祐, 三木洋平, 下川辺隆史, 山崎一哉, 住元真司, 高橋大介, 額田彰, 藤田典久, 小林諒平, 多田野寛人, 田浦健次朗, 細川颯介, 髙橋淳一郎, 成瀬彰

研究報告ハイパフォーマンスコンピューティング（HPC） 2024-HPC-195 ( 4 ) 1 - 11 2024.8

Language：Japanese Publishing type：Research paper (conference, symposium, etc.)

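The Joint Center for Advanced High Performance Computing (JCAHPC) is preparing to introduce Miyabi, which is scheduled to begin operation in January 2025. The 1,120 Miyabi-G compute nodes are equipped with the GH200 Grace-Hopper Superchip, making Miyabi the first supercomputer in Japan to adopt the GH200. This paper reports the results of various preliminary performance evaluations carried out on a GH200 experimental system.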

High-Performance Computing with GPU-FPGA Cooperation

小林諒平

DAシンポジウム2024論文集 2024 293 - 293 2024.8

Authorship：Corresponding author Language：Japanese Publishing type：Research paper (conference, symposium, etc.)

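Given the performance demanded of supercomputers, the limits on available power capacity, and the recent trend toward decarbonization, improving the power efficiency of supercomputers is an urgent issue, and the use of accelerators is becoming mainstream in high-performance computing as a solution. The most widely used accelerator today is the GPU (Graphics Processing Unit), but efficient GPU computation is subject to various constraints, such as a very large and highly uniform degree of spatial parallelism, uniform memory access, and relatively small volumes of parallel communication data, so a GPU alone sometimes cannot fully accelerate an application. We have therefore been pursuing an approach that implements, on an FPGA (Field Programmable Gate Array), hardware for accelerating the computations that are inefficient on GPUs, and uses both devices in a complementary manner to improve overall application performance. This talk introduces data transfer techniques and programming models for GPU-FPGA cooperation, as well as a case study in which an astrophysics application was accelerated by using a GPU and an FPGA together.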

A Proposal for FPGA-Based Regular Path Queries Focusing on Label Frequency

溝谷祐大, 小林諒平, 藤田典久, 朴泰祐, 天笠俊之

The 16th Forum on Data Engineering and Information Management (DEIM2024) 1 - 8 2024.2

Language：Japanese Publishing type：Research paper (scientific journal)

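Graph analysis has been actively pursued in recent years, and a wide variety of information is extracted from graphs. Within graph analysis, the regular path query (RPQ) is one method for retrieving the data a user wants. An RPQ is a query over graph data whose edges carry labels: it searches for paths in the graph that have a specified sequence of labels and, when such a path exists, returns its start and end nodes to the user. The challenge is the computation time of RPQ evaluation. As the data targeted by analysis has grown, the graphs targeted by RPQs are also expected to grow, and evaluation on the diverse, large-scale graphs found in the real world is expected to take a very long time. Hardware accelerators such as FPGAs (Field Programmable Gate Arrays) are attracting attention for processing such large data. An FPGA is a hardware chip on which arbitrary circuits can be implemented repeatedly by programming. Existing work on accelerating RPQs with FPGAs has shortcomings: in some cases the FPGA's circuit resources cannot be fully used, and extension to multiple FPGAs is difficult. This study therefore proposes a method that processes RPQs in parallel using multiple kernels. Because each kernel is implemented as an independent circuit inside the FPGA and can run in parallel, using multiple kernels makes better use of the FPGA's circuitry and makes it easier to extend the method to multiple FPGAs in the future. To implement the multi-kernel method, the proposed approach focuses on label frequency: labels that occur infrequently are defined as rare labels, and splitting the graph and the query at rare labels enables RPQ processing with multiple kernels. In the evaluation, we confirmed that RPQ evaluation time becomes shorter when more labels are defined as rare and when more rare labels appear in the query. We also confirmed that, under certain conditions, the method evaluates RPQs faster than the baseline method of Miura et al.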
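The RPQ definition in the abstract can be made concrete with a minimal software sketch in Python. This is not the FPGA design from the paper: it handles only the simplest query form, a fixed sequence of edge labels, and the toy graph, the label names, and the function name `eval_label_sequence` are hypothetical illustrations.

```python
from collections import defaultdict


def eval_label_sequence(edges, labels):
    """Return the set of (start, end) node pairs joined by a path whose
    edge labels match `labels` in order (fixed-sequence queries only)."""
    # Index edges by label: label -> source -> set of destinations.
    by_label = defaultdict(lambda: defaultdict(set))
    nodes = set()
    for src, label, dst in edges:
        by_label[label][src].add(dst)
        nodes.update((src, dst))

    # frontier maps each start node to the nodes reached so far.
    frontier = {n: {n} for n in nodes}
    for label in labels:
        step = by_label.get(label, {})
        next_frontier = {}
        for start, current in frontier.items():
            reached = set()
            for node in current:
                reached |= step.get(node, set())
            if reached:  # drop start nodes whose paths died out
                next_frontier[start] = reached
        frontier = next_frontier

    return {(s, e) for s, ends in frontier.items() for e in ends}


# Hypothetical toy graph: (source, label, destination) triples.
edges = [
    ("a", "knows", "b"),
    ("b", "knows", "c"),
    ("b", "worksAt", "x"),
    ("c", "worksAt", "y"),
]
result = eval_label_sequence(edges, ["knows", "worksAt"])
```

Here `result` holds the start and end nodes of every path labeled `knows` then `worksAt`. A full RPQ engine would also need alternation and repetition in the label pattern, which this sketch leaves out.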

Using Intel oneAPI for multi-hybrid acceleration programming with GPU and FPGA coupling Reviewed

Wentao Liang, Norihisa Fujita, Ryohei Kobayashi, Taisuke Boku

Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops (HPCAsia '24 Workshops) 69 - 76 2024.1

Language：English Publishing type：Research paper (international conference proceedings) Publisher：Association for Computing Machinery

Intel oneAPI is a programming framework that accepts various accelerators such as GPUs, FPGAs, and multi-core CPUs, with a focus on HPC applications. Users can apply their code written in a single language, DPC++, to this heterogeneous programming environment. However, in practice, it is not easy to apply to different accelerators, especially for non-Intel devices such as NVIDIA and AMD GPUs. We have successfully constructed a oneAPI environment set to utilize the single DPC++ programming to handle true multi-hetero acceleration including NVIDIA GPU and Intel FPGA simultaneously. In this paper, we will show how this is done and what kind of applications can be targeted.

DOI： 10.1145/3636480.3637220

Using Intel oneAPI for multi-hybrid acceleration programming with GPU and FPGA coupling

Wentao Liang, Norihisa Fujita, Ryohei Kobayashi, Taisuke Boku

研究報告ハイパフォーマンスコンピューティング（HPC） 2023-HPC-192 ( 16 ) 1 - 7 2023.11

Language：English Publishing type：Research paper (conference, symposium, etc.)

Intel oneAPI is a programming framework that accepts various accelerators such as GPUs, FPGAs, and multi-core CPUs, with a focus on HPC applications. Users can apply their code written in a single language, DPC++, to this heterogeneous programming environment. However, in practice, it is not easy to apply to different accelerators, especially for non-Intel devices such as NVIDIA and AMD GPUs. We have successfully constructed a oneAPI environment set to utilize the single DPC++ programming to handle true multi-hetero acceleration including NVIDIA GPU and Intel FPGA simultaneously. In this paper, we will show how this is done and what kind of applications can be targeted.

CHARM-SYCL: New Unified Programming Environment for Multiple Accelerator Types Reviewed International coauthorship

Norihisa Fujita, Beau Johnston, Ryohei Kobayashi, Keita Teranishi, Seyong Lee, Taisuke Boku, Jeffrey S. Vetter

SC-W '23: Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis 1651 - 1661 2023.11

Language：English Publishing type：Research paper (international conference proceedings) Publisher：Association for Computing Machinery

Addressing performance portability across diverse accelerator architectures has emerged as a major challenge in the development of application and programming systems for high-performance computing environments. Although recent programming systems that focus on performance portability have significantly improved productivity in an effort to meet this challenge, the problem becomes notably more complex when compute nodes are equipped with multiple accelerator types, each with unique performance attributes, optimal data layout, and binary formats. To navigate the intricacies of multi-accelerator programming, we propose CHARM-SYCL as an extension of our CHARM multi-accelerator execution environment [27]. This environment will combine our SYCL-based performance-portability programming front end with a back end for extremely heterogeneous architectures as implemented with the IRIS runtime from Oak Ridge National Laboratory. Our preliminary evaluation indicates a potential productivity boost and reasonable performance compared to vendor-specific programming systems and runtimes.

DOI： 10.1145/3624062.3624244

GPU+FPGA Multi-device Programming System by OpenACC Reviewed International coauthorship

綱島隆太, 小林諒平, 藤田典久, 朴泰祐, Seyong Lee, Jeffrey S. Vetter, 村井均, 中尾昌広, 辻美和子, 佐藤三久

IPSJ Transactions on Advanced Computing System 16 ( 2 ) 1 - 15 2023.11

Language：Japanese Publishing type：Research paper (scientific journal)

In the HPC field, FPGAs have recently been attracting attention as another possible solution alongside GPUs, which are the main players in accelerated supercomputing. Since the performance characteristics of the two devices differ significantly, we believe that more efficient acceleration can be achieved by using them in combination. However, at present there is no practical language system that lets users easily write code for GPUs and FPGAs with a common notation. NVIDIA's GPUs have a high market share, so many GPU applications are written in CUDA or OpenACC, but these languages cannot be used when porting part of the code to an FPGA; there are only a few research compilers that process OpenACC for FPGAs. Moreover, even if each accelerator can be programmed independently, a programming framework that connects them grammatically and semantically is needed. Against this background, we launched the concept of CAMP (Cooperative Acceleration by Multi-device Programming) and have been developing a language processing system that programs both accelerators in a unified manner using OpenACC, a directive-based accelerator programming API. In this paper, we describe an evaluation of this system with a real astrophysics simulation application. We achieved up to 10 times the performance of a GPU-only solution.

researchmap
Performance improvement by enhancing spatial parallelism on FPGA for HPC applications. Reviewed

Yuka Sano, Taisuke Boku, Mitsuhisa Sato, Miwako Tsuji, Norihisa Fujita, Ryohei Kobayashi

2023 IEEE International Conference on Cluster Computing Workshops (CLUSTER Workshops) 58 - 59 2023.10

Language：English Publishing type：Research paper (international conference proceedings)

DOI： 10.1109/CLUSTERWorkshops61457.2023.00024

researchmap

Other Link： https://dblp.uni-trier.de/rec/conf/cluster/2023w
Castと通信の並列実行のための予備実験

森江, 善之, 和田, 康孝, 小林, 諒平, 坂本, 龍一

研究報告ハイパフォーマンスコンピューティング（HPC） 2023-HPC-191 ( 14 ) 1 - 6 2023.9

Language：Japanese Publishing type：Research paper (conference, symposium, etc.)

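Applying Approximate Computing (AC) on HPC systems is now important for trading off power consumption against effective performance. In data transfer on HPC systems, data precision determines the total message volume, so applying precision-reducing AC to data transfer is highly effective, and it matters most in applications with frequent large-message communication. Applying AC to data transfer first requires establishing a technique that improves performance by overlapping cast processing with communication. Once such an overlap method is established, data can be split into chunks so that cast processing and communication run concurrently as a pipelined transfer. Realizing this transfer method enables further gains in communication performance and further reductions in power consumption. In this report, we therefore conducted preliminary experiments to investigate the requirements for effectively overlapping cast processing with communication. The results show that the choice of communication protocol affects whether cast processing and communication can be overlapped. We also found that, with the Rendezvous protocol used as-is, cast processing and communication are sometimes not overlapped. To handle this situation, communication can be progressed either by using a communication thread dedicated to progressing it or by periodically calling a communication function such as MPI_Test() from the main thread.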
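The pipelined transfer described in this abstract can be illustrated with a small sketch. It is not the authors' code: Python threads and a queue stand in for MPI and a communication thread, and the names `pipelined_cast_send`, `chunk_size`, and `send` are hypothetical. It only shows the idea of casting one chunk to lower precision while an earlier chunk is being sent.

```python
import queue
import threading
from array import array


def pipelined_cast_send(data, chunk_size, send):
    """Cast float64 chunks to float32 while a helper thread sends earlier chunks."""
    outbox = queue.Queue(maxsize=2)  # small buffer keeps the two stages in step

    def sender():
        # Stands in for a communication thread that progresses transfers.
        while True:
            chunk = outbox.get()
            if chunk is None:  # sentinel: no more chunks
                break
            send(chunk)

    worker = threading.Thread(target=sender)
    worker.start()
    for start in range(0, len(data), chunk_size):
        # Cast stage: array("f") stores each value as a 32-bit float.
        outbox.put(array("f", list(data[start:start + chunk_size])))
    outbox.put(None)
    worker.join()


# Hypothetical usage: the "network" is just a receive buffer in this process.
received = array("f")
source = array("d", [i * 0.1 for i in range(10)])
pipelined_cast_send(source, chunk_size=4, send=received.extend)
```

In a real MPI program the sender role would be played by non-blocking sends that are progressed by a communication thread or by periodic `MPI_Test()` calls, as the abstract notes.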

researchmap
細粒度なApproximate Computing適用に向けた演算精度変更による影響の評価

和田, 康孝, 森江, 善之, 小林, 諒平, 坂本, 龍一

研究報告ハイパフォーマンスコンピューティング（HPC） 2023-HPC-191 ( 13 ) 1 - 7 2023.9

Language：Japanese Publishing type：Research paper (conference, symposium, etc.)

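To apply Approximate Computing techniques to HPC applications, which inherently demand high computational precision, and to optimize the tradeoffs among precision, execution performance, power consumption, and related factors, the degree of precision control must be optimized according to the characteristics of each task and each piece of data in the application. In this report, we evaluate how dynamically changing the computational precision in several benchmarks affects execution performance and computed results, and we examine how fine-grained Approximate Computing can be applied to HPC applications.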

researchmap
OpenACC Unified Programming Environment for Multi-hybrid Acceleration with GPU and FPGA Reviewed International coauthorship

Boku, Taisuke, Tsunashima, Ryuta, Kobayashi, Ryohei, Fujita, Norihisa, Lee, Seyong, Vetter, Jeffrey S, Murai, Hitoshi, Nakao, Masahiro, Tsuji, Miwako, Sato, Mitsuhisa

ISC High Performance 2023: High Performance Computing 13999 662 - 674 2023.8

Language：English Publishing type：Research paper (international conference proceedings) Publisher：Springer Nature Switzerland

Accelerated computing, for example with GPUs, plays a central role in HPC today. However, some complicated applications contain parts with different performance behavior that are hard to handle with a single type of accelerator, and the GPU is not the perfect solution in these cases. We are developing a framework and transpiler for HPC applications, named MHOAT (Multi-Hybrid OpenACC Translator), that allows users to write code in the single notation of OpenACC and compile it for multi-hybrid accelerators. MHOAT parses the original code with directives to identify the target accelerator devices, currently supporting NVIDIA GPUs and Intel FPGAs, dispatches these device-specific code portions to backend compilers such as the NVIDIA HPC SDK for GPUs and the OpenARC research compiler for FPGAs, and then assembles the binaries into the final object with an FPGA bitstream file. In this paper, we present the concept, design, and implementation of MHOAT and its performance evaluation with a practical astrophysics simulation code, where we successfully enhanced the performance up to 10 times faster than the GPU-only solution.

DOI： 10.1007/978-3-031-40843-4_49

researchmap
NVIDIA H100 GPUにおけるグラフニューラルネットワークの学習精度と実行性能評価

小林, 諒平, 藤田, 典久, 朴, 泰祐, 天笠, 俊之

研究報告ハイパフォーマンスコンピューティング（HPC） 2023-HPC-190 ( 17 ) 1 - 8 2023.7

Authorship：Corresponding author Language：Japanese Publishing type：Research paper (conference, symposium, etc.)

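Graph neural networks (GNNs) have attracted attention, along with the progress of deep learning, as a method for analyzing the graph-structured data that underpins today's information society. With data growing in scale and machine learning applications diversifying in recent years, methods that improve GNN training accuracy and shorten training time are in demand. This report presents an evaluation of how training time and accuracy evolve across representative graph datasets and GNN implementations on the NVIDIA H100 GPU, the latest GPU currently offered by NVIDIA. The experiments confirm that GNN models running on the NVIDIA H100 GPU train 1.6 to 1.7 times faster than on the NVIDIA Tesla V100 GPU.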

researchmap
SYCLに基づく複数の演算加速装置を統一的に扱えるプログラミング手法の提案 International coauthorship

藤田, 典久, 小林, 諒平, Beau, Johnston, Narasinga, Rao Miniskar, Seyong, Lee, Keita, Teranishi, Jeffrey, S. Vetter, 朴, 泰祐

研究報告ハイパフォーマンスコンピューティング（HPC） 2023-HPC-190 ( 1 ) 1 - 13 2023.7

Language：Japanese Publishing type：Research paper (conference, symposium, etc.)

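We call the use of multiple accelerators with different characteristics, each applied where it fits best, the CHARM (Cooperative Heterogeneous Acceleration with Reconfigurable Multidevices) concept. Using multiple accelerator types under CHARM requires programming that uses a different language for each accelerator and then combines them so that the different device types operate efficiently together, and such programs are not easy to write. To solve this problem of CHARM programming, this study proposes “CHARM-SYCL”, a SYCL-based system that handles multiple accelerators in a unified manner. The CHARM-SYCL runtime supports IRIS, a task runtime system developed at Oak Ridge National Laboratory, and uses IRIS to support multiple device types. This manuscript reports the implementation details and performance evaluation of CHARM-SYCL.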

researchmap
Pegasusビッグメモリスーパコンピュータの性能評価

建部, 修見, 平賀, 弘平, 前田, 宗則, 藤田, 典久, 小林, 諒平, 額田, 彰

研究報告ハイパフォーマンスコンピューティング（HPC） 2023-HPC-190 ( 7 ) 1 - 12 2023.7

Language：Japanese Publishing type：Research paper (conference, symposium, etc.)

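Pegasus is a supercomputer that was installed at the Center for Computational Sciences, University of Tsukuba, in December 2022 and began full operation in April 2023. It was among the first systems to adopt the latest Intel CPUs and NVIDIA GPUs and delivers 6.5 PFlops of computing performance. To promote large-volume data analysis and large-scale AI, it introduces non-volatile memory on a large scale. Each compute node offers 2 TiB of large-capacity memory, and that region can also be used as ultra-fast storage. This report gives an overview of Pegasus and reports on its performance.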

researchmap
輻射輸送シミュレーションのためのFPGAとGPUによるスクラッチパッドメモリの効率と有効性の分析

古川, 和輝, 山口, 佳樹, 横野, 智也, 吉川, 耕司, 藤田, 典久, 小林, 諒平, 安倍, 牧人, 朴, 泰祐, 梅村, 雅之

IEICE-RECONF2023-6 123 ( 71 ) 29 - 34 2023.6

Language：Japanese Publishing type：Research paper (conference, symposium, etc.)

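The ART (Authentic Radiation Transfer) scheme included in a cosmic radiative transfer simulation code is computationally heavy and memory-bound, and acceleration with accelerators is anticipated. In this study, we devised a scratchpad memory mechanism specific to the ART scheme and named it PRISM (PRefetchable and Instantly accessible Scratchpad Memory). We implemented PRISM on both an FPGA and a GPU and compared it with the original implementation: when the simulation space is small the FPGA is faster, with a speedup of up to 1.8 times, and when it is large the GPU is faster, with a speedup of up to 5.4 times.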

researchmap
HPC利用に向けたFPGA間シリアル通信コントローラKyokkoのIntel FPGAへの実装

北爪, 開人, 藤田, 典久, 小林, 諒平, 朴, 泰祐

研究報告ハイパフォーマンスコンピューティング（HPC） 2023-HPC-189 ( 4 ) 1 - 9 2023.5

Language：Japanese Publishing type：Research paper (conference, symposium, etc.)

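FPGAs (Field-Programmable Gate Arrays) are attracting attention as accelerators for high-performance computing. While FPGAs are becoming more useful thanks to high-level synthesis and the arrival of FPGA boards with high-speed optical interfaces, the environment for parallel computing with FPGAs in high-performance computing is still maturing. As part of these efforts, the Center for Computational Sciences at the University of Tsukuba is developing CIRCUS (Communication Integrated Reconfigurable CompUting System), a framework that enables high-speed inter-FPGA communication through OpenCL-based high-level synthesis so that parallel computation can run across multiple FPGAs; however, the current CIRCUS does not implement flow control. Because this stems from the lack of flow control in the inter-FPGA communication protocol used in its communication part, this study replaces the communication part with a protocol that includes flow control to solve the problem. In this report, we evaluate the performance of Kyokko, an open-source communication protocol, as the replacement for the CIRCUS communication part. We implement Kyokko on the Bittware 520N, an FPGA board carrying an Intel Stratix 10 GX H-Tile capable of up to 100 Gbps per port, and evaluate bandwidth, latency, and flow control. In the experiments, Kyokko showed a high efficiency exceeding 99.98% and bandwidth close to the theoretical performance. The latency of sending and receiving data was low: about 170 ns without channel bonding and about 180 ns with 4-channel bonding. The flow-control latency was about 310 ns without channel bonding and about 320 ns with 4-channel bonding, which shows that processing upon receipt of an NFC message is extremely fast.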

researchmap
Implementation and Performance Evaluation of Memory System Using Addressable Cache for HPC Applications on HBM2 Equipped FPGAs Reviewed

Fujita, Norihisa, Kobayashi, Ryohei, Yamaguchi, Yoshiki, Boku, Taisuke

Euro-Par 2022: Parallel Processing Workshops 13835 121 - 132 2023.5

Language：English Publishing type：Research paper (international conference proceedings) Publisher：Springer Nature Switzerland

When we apply field programmable gate arrays (FPGAs) as HPC accelerators, their memory bandwidth presents a significant challenge because it is not comparable to those of other HPC accelerators. In this paper, we propose a memory system for HBM2-equipped FPGAs and HPC applications that uses block RAMs as an addressable cache implemented between HBM2 and an application. This architecture enables bulk data transfer between HBM2 and the cache and allows an application to utilize fast random access on BRAMs. This study demonstrates the implementation and performance evaluation of our new memory system for HPC and HBM2 on an FPGA. Furthermore, we describe the API that can be used to control this system from the host. We implement RISC-V cores in an FPGA as controllers to realize fine-grain data transfer control and to prevent overheads derived from the PCI Express bus. The proposed system is implemented on eight memory channels and achieves a bandwidth of 102.7 GB/s. It exceeds the memory bandwidth of conventional FPGA boards with four channels of DDR4 memory despite using only 8 of the 32 channels of the HBM2.

DOI： 10.1007/978-3-031-31209-0_9

researchmap
Accelerating Radiative Transfer Simulation on NVIDIA GPUs with OpenACC Reviewed

Kobayashi, Ryohei, Fujita, Norihisa, Yamaguchi, Yoshiki, Boku, Taisuke, Yoshikawa, Kohji, Abe, Makito, Umemura, Masayuki

PDCAT 2022: Parallel and Distributed Computing, Applications and Technologies 13798 344 - 358 2023.4

Authorship：Corresponding author Language：English Publishing type：Research paper (international conference proceedings) Publisher：Springer Nature Switzerland

To accelerate multiphysics applications, the use of not only GPUs but also FPGAs has been emerging. Multiphysics applications are simulations involving multiple physical models and multiple simultaneous physical phenomena. Operations with different performance characteristics appear in the simulation, making the acceleration of simulation speed using only GPUs difficult. Therefore, we aim to improve the overall performance of the application by using FPGAs to accelerate operations with characteristics which cause lower GPU efficiency. However, the application is currently implemented through multilingual programming, where the computation kernel running on the GPU is written in CUDA and the computation kernel running on the FPGA is written in OpenCL. This method imposes a heavy burden on programmers; therefore, we are currently working on a programming environment that enables the use of both accelerators with OpenACC in a GPU–FPGA equipped high-performance computing (HPC) cluster system. To this end, we port the entire code from the CUDA-OpenCL mixture to OpenACC only. On this basis, this study quantitatively investigates the performance of the OpenACC GPU implementation compared to the CUDA implementation for ARGOT, a radiative transfer simulation code for fundamental astrophysics which is a multiphysics application. We observe that the OpenACC implementation achieves performance and scalability comparable to the CUDA implementation on the Cygnus supercomputer equipped with NVIDIA V100 GPUs.

DOI： 10.1007/978-3-031-29927-8_27

researchmap
FPGA間通信フレームワークCIRCUSを利用した複数FPGAによるグラフ幅優先探索の提案

溝谷, 祐大, 小林, 諒平, 藤田, 典久, 朴, 泰祐, 天笠, 俊之

第15回データ工学と情報マネジメントに関するフォーラム (DEIM 2023) 2023.3

Language：Japanese Publishing type：Research paper (conference, symposium, etc.)

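A graph structure is a data structure that represents various data as nodes and edges, and it is useful for expressing relationships among the many kinds of data around us. Graph analysis is widely practiced and yields a variety of information from graphs. Among graph analysis algorithms, breadth-first search is the most widely used. Breadth-first search is a graph traversal algorithm applied in a broad range of fields, such as testing and verification of digital circuits and analysis of road networks. However, as graphs have grown in scale in recent years, breadth-first search often incurs a large computational cost. In addition, the many irregular memory accesses prevent effective use of memory bandwidth. We therefore focus on FPGAs. An FPGA is a hardware chip on which arbitrary circuits can be implemented repeatedly through programming. Its performance characteristic is that it can perform highly parallel processing by exploiting the parallelism of its circuits. FPGAs can also use optical links for external communication. Because these optical links are connected directly to circuits on the FPGA, an FPGA can communicate with other FPGAs at ultra-low latency. CIRCUS, an inter-FPGA communication framework, is a technology that exploits this feature. In this study, we use CIRCUS to implement breadth-first search across multiple FPGAs.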
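As a rough illustration of breadth-first search spread over several devices, the following sketch simulates a level-synchronous BFS in plain Python. The abstract does not state how vertices are partitioned or exchanged, so the modulo ownership rule, the per-device inboxes, and the name `partitioned_bfs` are assumptions made for illustration; the CIRCUS API itself is not modeled.

```python
def partitioned_bfs(adjacency, source, num_devices):
    """Level-synchronous BFS with vertices owned by vertex_id % num_devices."""

    def owner(vertex):
        # Assumed partitioning rule; a real design may use a different one.
        return vertex % num_devices

    level = {source: 0}
    frontiers = [[] for _ in range(num_devices)]
    frontiers[owner(source)].append(source)
    depth = 0
    while any(frontiers):
        # Each device expands its own frontier and addresses every
        # discovered neighbor to the device that owns it.
        inboxes = [[] for _ in range(num_devices)]
        for dev in range(num_devices):
            for v in frontiers[dev]:
                for w in adjacency.get(v, []):
                    inboxes[owner(w)].append(w)
        depth += 1
        # Owners keep only the vertices they have not visited yet.
        frontiers = [[] for _ in range(num_devices)]
        for dev in range(num_devices):
            for w in inboxes[dev]:
                if w not in level:
                    level[w] = depth
                    frontiers[dev].append(w)
    return level


graph = {0: [1, 2], 1: [3], 2: [3], 3: [4], 4: []}
levels = partitioned_bfs(graph, source=0, num_devices=2)
```

The inbox exchange is the step where device-to-device communication would take place on real hardware.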

researchmap
FPGA高位合成における演算性能向上のための空間並列性記述に関する研究

佐野, 由佳, 小林, 諒平, 藤田, 典久, 朴, 泰祐, 佐藤, 三久

研究報告ハイパフォーマンスコンピューティング（HPC） 2023-HPC-188 ( 22 ) 1 - 10 2023.3

Language：Japanese Publishing type：Research paper (conference, symposium, etc.)

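In today's high-performance computing systems, GPUs (Graphic Processing Units), which offer high computing performance and memory bandwidth, are actively adopted as accelerators for HPC applications. However, GPU acceleration is built to perform well when the GPU's many cores are all put to use and execute homogeneous, SIMD (Single Instruction Multiple Data)-like processing, so GPUs cannot deliver their full performance on computations with low parallelism, on operations that require complex processing such as conditional branches, or in applications with frequent communication. An approach that offloads such GPU-unfriendly computations to FPGAs (Field-Programmable Gate Arrays), which can flexibly build application-specific computation pipelines and memory systems through circuit reconfiguration, is therefore attracting attention. Current GPU programming environments include user-friendly directive-based environments, typified by OpenACC, but directive-based programming environments for FPGAs cannot yet be called mature. For this reason, in joint research between the RIKEN Center for Computational Science (R-CCS) and the Center for Computational Sciences (CCS), University of Tsukuba, we are working on adapting the Omni OpenACC compiler for FPGA programming environments. In this study, as groundwork for examining compiler-driven performance optimization, we evaluate and examine techniques for improving the computational performance of FPGA programming with high-level synthesis. Specifically, for a CG (Conjugate Gradient) code written in OpenCL, we try various techniques for increasing the number of computing elements, such as pipelining, loop unrolling, and concurrent execution of multiple kernels. We then vary the loop unroll factor and the number of concurrently executing kernels and evaluate FLOPS and BRAM (Block Random Access Memory) utilization. FPGA speedup comes fundamentally from pipelined processing; we increase the number of operations per clock cycle, examine effects such as the impact on BRAM usage, and look for strategies for performance optimization. On FPGAs, however, the operating frequency changes with the loop unrolling depth, the number of arithmetic units used, and memory usage, and complex tradeoffs exist among them, so increasing the number of concurrent operations does not necessarily improve performance. For the CG code implemented on an Intel Stratix10 FPGA in this work, we found that with one kernel, unrolling the loop 8 times gave the highest performance. With two kernels, unrolling the loop 8 times gave the highest performance once operating frequency was taken into account, but memory usage increased substantially. Memory usage must be kept low when other applications are to be implemented on the same FPGA at the same time, and in that case two kernels with the loop unrolled 4 times gave the highest performance.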

researchmap
Implementation and Performance Evaluation of Collective Communications Using CIRCUS on Multiple FPGAs Reviewed

Kikuchi, Kohei, Fujita, Norihisa, Kobayashi, Ryohei, Boku, Taisuke

HPC Asia '23 Workshops: Proceedings of the HPC Asia 2023 Workshops 15 - 23 2023.2

Language：English Publishing type：Research paper (international conference proceedings) Publisher：Association for Computing Machinery

In the high-performance computing domain, the Field Programmable Gate Array (FPGA) is a novel accelerator that exhibits high flexibility and performance characteristics distinct from other accelerators such as the Graphics Processing Unit (GPU). Recent high-end FPGAs are equipped with multiple channels of high-speed optical links, each with up to 100 Gbps of performance. This is a crucial feature when constructing PC clusters with FPGAs as accelerators; however, it is not easy to use from user kernels because it is implemented at a low level as simple direct communication between neighboring FPGAs.

To provide a communication feature between FPGAs for accelerated PC clusters, we developed a communication system named CIRCUS, which provides a user-friendly API usable from OpenCL and is equipped with a routing function for multi-hop communication over a multi-dimensional torus network of FPGAs. However, the current CIRCUS provides only point-to-point communication between source and destination FPGAs. In an ordinary parallel processing environment such as MPI, users program message passing for parallel algorithms with various collective communication functions, for instance Allreduce and Allgather. In this paper, we implement a collective communication function over CIRCUS for user-friendly programming of ordinary parallel algorithms on FPGAs. As the first target, we implement the Allreduce function, which is the most essential and important one. The paper briefly describes the CIRCUS system, followed by the design, implementation, and preliminary performance evaluation on Intel Stratix10 FPGAs.
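To make concrete how an Allreduce can be built from point-to-point transfers alone, here is a minimal simulation of the well-known ring Allreduce. The abstract does not say which algorithm the paper uses, so the ring scheme is a swapped-in, illustrative choice, and `ring_allreduce` is a hypothetical name; each list entry stands for one FPGA that only ever sends to its ring neighbor.

```python
def ring_allreduce(values):
    """Simulate a sum-Allreduce over a ring using only neighbor-to-neighbor sends.

    values[r] is the vector held by rank r; its length must equal the rank
    count so that each rank is responsible for exactly one element (chunk).
    """
    n = len(values)
    data = [list(v) for v in values]  # per-rank working copies

    # Phase 1 (reduce-scatter): after n-1 steps, rank r holds the complete
    # sum of chunk (r + 1) % n.
    for step in range(n - 1):
        sends = [(r, (r - step) % n, data[r][(r - step) % n]) for r in range(n)]
        for r, chunk, payload in sends:
            data[(r + 1) % n][chunk] += payload

    # Phase 2 (allgather): circulate the finished chunks around the ring.
    for step in range(n - 1):
        sends = [(r, (r + 1 - step) % n, data[r][(r + 1 - step) % n])
                 for r in range(n)]
        for r, chunk, payload in sends:
            data[(r + 1) % n][chunk] = payload
    return data


result = ring_allreduce([[1, 2, 3], [10, 20, 30], [100, 200, 300]])
```

Collecting all sends of a step before applying them mimics the ranks exchanging data simultaneously.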

DOI： 10.1145/3581576.3581602

researchmap
GPU–FPGA-accelerated Radiative Transfer Simulation with Inter-FPGA Communication Reviewed

Kobayashi, Ryohei, Fujita, Norihisa, Yamaguchi, Yoshiki, Boku, Taisuke, Yoshikawa, Kohji, Abe, Makito, Umemura, Masayuki

HPC Asia '23: Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region 117 - 125 2023.2

Authorship：Corresponding author Language：English Publishing type：Research paper (international conference proceedings) Publisher：Association for Computing Machinery

The complementary use of graphics processing units (GPUs) and field programmable gate arrays (FPGAs) is a major topic of interest in the high-performance computing (HPC) field. GPU–FPGA-accelerated computing is an effective tool for multiphysics simulations, which encompass multiple physical models and simultaneous physical phenomena. Because the constituent operations in multiphysics simulations exhibit varying characteristics, accelerating these operations solely using GPUs is often challenging. Hence, FPGAs are frequently implemented for this purpose. The objective of the present study was to further improve application performance by employing both GPUs and FPGAs in a complementary manner. Recently, this approach has been applied to the radiative transfer simulation code for astrophysics known as ARGOT, with evaluation results quantitatively demonstrating the resulting improvement in performance. However, the evaluation results in question came from the use of a single node equipped with both a GPU and FPGA. In this study, we extended the GPU–FPGA-accelerated ARGOT code to operate on multiple nodes using the message passing interface (MPI) and an FPGA-to-FPGA communication technology scheme called Communication Integrated Reconfigurable CompUting System (CIRCUS). We evaluated the performance of the ARGOT code with multiple GPUs and FPGAs under weak scaling conditions, and found it to achieve up to 12.8x speedup compared to the GPU-only execution.

DOI： 10.1145/3578178.3578231

researchmap
Cygnus - World First Multihybrid Accelerated Cluster with GPU and FPGA Coupling Reviewed

Boku, Taisuke, Fujita, Norihisa, Kobayashi, Ryohei, Tatebe, Osamu

ICPP Workshops '22: Workshop Proceedings of the 51st International Conference on Parallel Processing ( 8 ) 1 - 8 2023.1

Language：English Publishing type：Research paper (international conference proceedings) Publisher：Association for Computing Machinery

In this paper, we describe the concept, system architecture, supporting system software, and applications of Cygnus, our world-first supercomputer with multihybrid accelerators using GPU and FPGA coupling, which runs at the Center for Computational Sciences, University of Tsukuba. Although Cygnus is constructed with over 80 computation nodes as a GPU-accelerated PC cluster, a special group of 32 nodes, named the Albireo part, is configured as a multihybrid accelerated computing system. Each node of the Albireo part is equipped with four NVIDIA V100 GPU cards and two Intel Stratix10 FPGA cards in addition to two sockets of Intel Xeon Gold CPUs, and all nodes are connected by four lanes of InfiniBand HDR100 interconnection HCAs with full bisection bandwidth through NVIDIA HDR200 switches. Besides this ordinary interconnection network, all FPGA cards in the Albireo part are connected by a special 2-dimensional torus network with direct optical links on each FPGA, forming an FPGA-centric interconnection network with very high throughput and low latency.

To the best of our knowledge, Cygnus is the world’s first production-level PC cluster to realize multihybrid acceleration with the GPU and FPGA combination. Unlike other GPU-accelerated clusters, users can program parallel codes where each process exploits either or both of the GPU and FPGA devices based on the characteristics of their applications. Because the programming method for such a complicated system has not been standardized, we developed various supporting system software, such as an inter-FPGA network routing system, a DMA engine for GPU-FPGA direct communication managed by the FPGA, and a multihybrid accelerated programming framework. Further, we developed the first real application on Cygnus for fundamental astrophysics simulation to fully utilize GPU and FPGA together for very efficient acceleration.

We describe the overall concept and construction of the Cygnus cluster with a brief introduction of the several underlying hardware and software research studies that have already been published. We summarize how such a concept of GPU/FPGA coworking will usher in a new era of accelerated supercomputing.

DOI： 10.1145/3547276.3548629

researchmap
Data Transfer API and its Performance Model for Rank-Level Approximate Computing on HPC Systems Reviewed

Morie, Yoshiyuki, Wada, Yasutaka, Kobayashi, Ryohei, Sakamoto, Ryuichi

IJNC 13 ( 1 ) 48 - 61 2023.1

Language：English Publishing type：Research paper (scientific journal) Publisher：IJNC Editorial Committee

The application of approximate computing (AC) in optimizing tradeoffs among performance, power consumption, and accuracy of computation results can be improved by adjusting data precision in applications. The importance of AC has increased over the years as it is used to maximize performance even with limited power budget and hardware resources in high performance computing (HPC) systems that require more precise computations. To apply AC for HPC applications effectively, we must consider the character of each message passing interface (MPI) rank in an application and optimize it by adjusting its data precision. This rank-level AC ensures that ranks and threads in an application run with data precision and perform data transfer while converting the precision of target data. In this paper, we have proposed and evaluated data pack/unpack application programming interfaces (APIs), which are applicable for standard MPI programs run on HPC systems, for converting the precision of target data. The proposed APIs enable us to express data transfer among ranks with different precisions. In addition, we have also developed a reasonable performance model to select an appropriate data transfer API for maximizing performance with rank-level AC based on performance evaluation with various HPC systems.
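The idea of pack/unpack calls that convert precision around a transfer can be sketched as follows. This is not the API proposed in the paper: `pack_as`, `unpack_as`, and the `FORMATS` table are hypothetical names, and Python's standard `struct` module stands in for the conversion that would sit next to MPI send and receive calls.

```python
import struct

# Assumed mapping from a precision label to a struct format character.
FORMATS = {"fp64": "d", "fp32": "f", "fp16": "e"}


def pack_as(values, precision):
    """Convert values to the wire precision and return the bytes to send."""
    return struct.pack(f"<{len(values)}{FORMATS[precision]}", *values)


def unpack_as(buffer, precision):
    """Decode a received buffer; Python floats hold the result as fp64."""
    size = struct.calcsize("<" + FORMATS[precision])
    count = len(buffer) // size
    return list(struct.unpack(f"<{count}{FORMATS[precision]}", buffer))


values = [1.0, 0.5, 3.14159265358979]
wire = pack_as(values, "fp32")       # a high-precision rank sends at fp32
restored = unpack_as(wire, "fp32")   # the receiver widens back to fp64
```

Sending at fp32 halves the message volume relative to fp64, at the cost of rounding each value to the nearest representable 32-bit float.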

DOI： 10.15803/ijnc.13.1_48

researchmap
An FPGA-based Accelerator for Regular Path Queries over Edge-labeled Graphs Reviewed

Miura, Kento, Kobayashi, Ryohei, Amagasa, Toshiyuki, Kitagawa, Hiroyuki, Fujita, Norihisa, Boku, Taisuke

2022 IEEE International Conference on Big Data (Big Data) 415 - 422 2022.12

Language：English Publishing type：Research paper (international conference proceedings)

Edge-labeled directed graphs are commonly used to represent various information in different applications, such as social networks, knowledge graphs, etc., and regular path queries (RPQs) allow us to extract pairs of nodes that are reachable from one to another through a labeled path matching with the query pattern represented as a regular expression. It is useful for us to extract complicated or semantically meaningful information from a graph, but it gives rise to a challenge when dealing with large graphs. This is due to the long execution time caused by the explosive growth of intermediate results, but, on the other hand, some applications require fast query executions. To address this problem, we propose an FPGA-based RPQ accelerator. The idea is to exploit FPGA’s parallelism in traversing the target graph and matching the regular path expression in parallel with the pipeline manner. To validate the performance of the proposed method, we conducted a set of experiments. From the results, we observed that the proposed method achieves shorter elapsed times for RPQs against social graphs extracted from the real world, up to three orders of magnitude compared with baseline methods.
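For readers unfamiliar with regular path queries, a small software sketch may help. It evaluates an RPQ by breadth-first search over pairs of (graph node, automaton state); this is a standard textbook formulation, not the FPGA design from the paper, and the function name, the dictionary encodings, and the sample graph are all assumptions for illustration.

```python
from collections import deque


def rpq_reachable(edges, nfa, start_state, accept_states, source):
    """Return nodes reachable from source along a path whose labels the NFA accepts.

    edges: {node: [(label, next_node), ...]}
    nfa:   {(state, label): [next_state, ...]}
    """
    seen = {(source, start_state)}
    frontier = deque(seen)
    answers = set()
    while frontier:
        node, state = frontier.popleft()
        if state in accept_states:
            answers.add(node)
        # Follow a graph edge and an automaton transition with the same label.
        for label, nxt in edges.get(node, []):
            for nstate in nfa.get((state, label), []):
                if (nxt, nstate) not in seen:
                    seen.add((nxt, nstate))
                    frontier.append((nxt, nstate))
    return answers


# Query "knows+ worksAt" encoded as an NFA with states 0, 1, 2 (2 accepts).
nfa = {(0, "knows"): [1], (1, "knows"): [1], (1, "worksAt"): [2]}
edges = {
    "a": [("knows", "b"), ("worksAt", "Y")],
    "b": [("knows", "c")],
    "c": [("worksAt", "X")],
}
matches = rpq_reachable(edges, nfa, start_state=0, accept_states={2}, source="a")
```

The set of visited (node, state) pairs is the intermediate result whose growth the abstract identifies as the cost on large graphs.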

DOI： 10.1109/BigData55660.2022.10020406

researchmap
並列FPGA環境における通信システムCIRCUSを用いた集団通信の実装と性能評価

菊池, 航平, 藤田, 典久, 小林, 諒平, 朴, 泰祐

研究報告ハイパフォーマンスコンピューティング（HPC） 2022-HPC-187 ( 7 ) 1 - 8 2022.11

Language：Japanese Publishing type：Research paper (conference, symposium, etc.)

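In recent years, FPGAs (Field Programmable Gate Arrays) have attracted attention as a new class of HPC accelerator. FPGAs have high-speed serial I/O interfaces and can communicate with other FPGAs directly through them. The ability to obtain high communication bandwidth at low latency through direct communication is unique to FPGAs and is expected to be highly effective when FPGAs are used in parallel to scale up problem sizes or improve performance. To support the development of HPC applications that run on parallel FPGAs, the Center for Computational Sciences at the University of Tsukuba is developing the inter-FPGA communication framework CIRCUS (Communication Integrated Reconfigurable CompUting System). CIRCUS provides a router function for the FPGA network and a communication API, which make it possible to describe inter-FPGA communication from OpenCL programs. However, CIRCUS currently supports only point-to-point communication patterns and does not implement collective communication such as that found in MPI, the widely used communication library. The goal of this research is to provide HPC users of parallel FPGAs with a high-performance, user-friendly collective communication API that runs on top of CIRCUS. Toward this goal, this report designs and implements Allreduce communication using CIRCUS. The implementation runs correctly on four FPGAs, but we found that performance is degraded because CIRCUS communication lacks a flow-control function. Working around this problem requires complex programming and cannot avoid extra overhead. To solve it, we plan to replace the inter-FPGA communication controller with one that supports flow control.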

researchmap
An Open-source FPGA Library for Data Sorting Reviewed

Kobayashi, Ryohei, Miura, Kento, Fujita, Norihisa, Boku, Taisuke, Amagasa, Toshiyuki

Journal of Information Processing 30 ( No. 0 ) 766 - 777 2022.10

Authorship：Corresponding author Language：English Publishing type：Research paper (scientific journal) Publisher：Information Processing Society of Japan

Field-programmable gate arrays (FPGAs) have garnered significant interest in research on high-performance computing because their flexibility enables the building of application-specific computation pipelines and data supply systems. In addition to the flexibility, toolchains for the development of FPGAs in OpenCL have been developed and offered by FPGA vendors that reduce the programming effort required. However, the high level of abstraction in the OpenCL-based development approach is a disadvantage, making it difficult to perform fine-grained performance tuning. In this paper, we present one of the methodologies to achieve both the reduction of FPGA programming cost and the provision of high performance. We focus on data sorting, which is a basic arithmetic operation, and we introduce a sorting library that can be used with the OpenCL programming model for FPGAs. Our sorting library has so far only supported integer data, but in this paper, we propose a new method that supports floating-point data. It consumes at least twice as many hardware resources compared to the merge sort restructured for the OpenCL programming model for FPGAs. However, its operating frequency is 1.08x higher and its sorting throughput is three orders of magnitude greater than the baseline. The source code of our sorting library is open source, and it can be used by application developers around the world.

DOI： 10.2197/ipsjjip.30.766

researchmap
Performance Evaluation of Data Transfer API for Rank Level Approximate Computing on HPC Systems Reviewed

Morie, Yoshiyuki, Wada, Yasutaka, Kobayashi, Ryohei, Sakamoto, Ryuichi

2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) 445 - 448 2022.8

Language：English Publishing type：Research paper (international conference proceedings)

Approximate computing (AC) has attracted much attention to optimize tradeoffs among performance, power consumption, and computation results accuracy by adjusting data precision in applications. Even on HPC systems, AC is demanded to maximize performance under the limited power budget and hardware resources. To apply AC for HPC applications, we need to consider the character of each MPI rank in an application and optimize it with its appropriate data precision. However, we also need to perform data transfer while converting the precision of the target data. This paper proposes data pack/unpack APIs, which are applicable for standard MPI programs for HPC systems, for converting the data precision of the target data, and shows its performance evaluation. We can express data transfer among ranks with different data precision with the proposed APIs. The performance evaluation reveals the break-even point to apply AC for HPC applications from the perspective of data transfer volume.

DOI： 10.1109/IPDPSW55747.2022.00082

researchmap

Other Link： https://dblp.uni-trier.de/rec/conf/ipps/2022w
並列化に伴うデータ空間の分割とそれによるアクセスパターンの変化がもたらすHBMの振る舞い調査

瀬口, 知洋, 中井, 榛希, 山口, 佳樹, 藤田, 典久, 小林, 諒平, 朴, 泰祐

IEICE-CPSY2022-15 IEICE-122 ( 133 ) 83 - 88 2022.7

Language：Japanese Publishing type：Research paper (conference, symposium, etc.)

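Field Programmable Gate Arrays (FPGAs), whose computing circuits can be electrically reconfigured to match application requirements, have continued to develop since they first appeared as substitutes for glue logic and as prototyping devices. Their computing performance and functionality have improved greatly with advances in semiconductor manufacturing and packaging technologies. In addition, the maturing of integrated development environments, through the adoption of high-level synthesis and other means, and the resulting simplification of design have greatly lowered the cost of adopting FPGAs, and FPGAs are now widely used in information systems. As a result, FPGAs are beginning to be recognized as devices that attract as much attention as GPUs and AI chips, and as devices whose adoption can be expected to pay off in areas such as higher computing performance and better performance per unit of power. In recent years, FPGA products that integrate high-bandwidth memory (High Bandwidth Memory: HBM) in the same package have been increasing in the high-performance computing field, and this trend is spreading to low-cost embedded FPGA products as well. Meanwhile, in the GPU field, which has a head start in adopting HBM, discussion of HBM's effective access performance is beginning. This report therefore organizes the issues around combining high-level description with HBM use on FPGAs, and discusses the potential for efficient acceleration by raising open problems for future FPGA design and development.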

researchmap
GPU・FPGA複合型演算加速クラスタを用いた宇宙輻射輸送コードARGOTの多ノード並列化

小林, 諒平, 藤田, 典久, 山口, 佳樹, 朴, 泰祐, 吉川, 耕司, 安部, 牧人, 梅村, 雅之

研究報告ハイパフォーマンスコンピューティング（HPC） 2022-HPC-185 ( 1 ) 1 - 6 2022.7

Authorship：Corresponding author Language：Japanese Publishing type：Research paper (conference, symposium, etc.)

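We are studying GPU-FPGA hybrid systems that couple GPUs (Graphics Processing Units), which offer high computing performance and memory bandwidth, with FPGAs (Field Programmable Gate Arrays), which excel in combined computation and communication performance, and use the two in a complementary manner. GPU-FPGA hybrid acceleration is needed because we expect it to be effective for multiphysics applications, which are simulations that involve multiple physical models and multiple simultaneously occurring physical phenomena. In multiphysics, computations with a variety of characteristics appear within a simulation, so acceleration with GPUs alone can be difficult. We therefore aim to improve overall application performance by using FPGAs to accelerate computations whose characteristics GPUs cannot handle well. We have previously applied this concept to the cosmic radiative transfer simulation code ARGOT and, by evaluating the resulting performance gains, quantitatively demonstrated the usefulness of using both devices together. However, the GPU-FPGA cooperative acceleration realized so far assumed the use of only a single node equipped with both a GPU and an FPGA. In this study, we extend the single-node GPU-FPGA cooperative ARGOT code to run on multiple nodes using MPI and CIRCUS (Communication Integrated Reconfigurable CompUting System), an inter-FPGA communication technology, and report on the implementation method.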

researchmap
Performance Evaluation on GPU-FPGA Accelerated Computing Considering Interconnections between Accelerators Reviewed

Sano, Yuka, Kobayashi, Ryohei, Fujita, Norihisa, Boku, Taisuke

The Proceedings of the 12th International Symposium on Highly Efficient Accelerators and Reconfigurable Technologies (HEART 2022) 10 - 16 2022.6

Authorship：Corresponding author Language：English Publishing type：Research paper (international conference proceedings) Publisher：Association for Computing Machinery

HPC systems are often equipped with graphic processing units (GPUs) as accelerators because of their high computing capability. GPUs are powerful computing devices; however, they operate inefficiently on applications that employ partially poor parallelism, non-regular computation, or frequent inter-node communication. To address these shortcomings of GPUs, field-programmable gate arrays (FPGAs) have been emerging in the HPC domain because their reconfigurable capabilities enable the construction of application-specific pipelined hardware and memory systems. Several studies have focused on improving overall application performance by combining GPUs and FPGAs, and the platforms for achieving this have adopted the approach of hosting these two devices on a single compute node; however, the necessity of this approach has not been discussed.

In this study, we evaluated the necessity of hosting both devices on a single compute node quantitatively, using an astrophysics application that performs radiative transfer to simulate the early-stage universe after the Big Bang. The application runs on a compute node equipped with a GPU and an FPGA, and the GPU and FPGA computation kernels are launched from a single CPU (process) in the application. We modified the code to enable the launch of the GPU and FPGA computation kernels from separate message-passing interface (MPI) processes. Each MPI process was assigned to one of two compute nodes, which were equipped with only a GPU and only an FPGA, respectively, to run the application, and the execution performance of the application was compared against that of the original GPU-FPGA accelerated application. The results revealed that the performance degradation compared to the original GPU-FPGA accelerated application was approximately 2 to 3%, thereby demonstrating quantitatively that even if both devices are mounted on different compute nodes, this is acceptable in practical use depending on the characteristics of the application.

DOI： 10.1145/3535044.3535046

ノードを跨いだGPU・FPGA複合型演算加速による宇宙物理シミュレーションの実装と評価

佐野, 由佳, 小林, 諒平, 藤田, 典久, 朴, 泰祐

研究報告ハイパフォーマンスコンピューティング（HPC） 2022-HPC-184 ( 6 ) 1 - 7 2022.5

Authorship：Corresponding author Language：Japanese Publishing type：Research paper (conference, symposium, etc.)

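In recent high-performance computing systems, GPUs (Graphics Processing Units), which offer high computational performance and memory bandwidth, have been actively adopted as accelerators. However, not every application suits GPUs: an application that partially contains computations unsuited to GPUs, such as those whose parallelism is insufficient for the number of cores or that involve conditional branching, cannot fully exploit GPU performance. One approach that has therefore been attempted is to offload such GPU-unfriendly computations to an FPGA (Field-Programmable Gate Array), on which application-specific computation pipelines and memory systems can be built flexibly, and to improve overall application performance by using the GPU and FPGA complementarily. Several studies have run applications on GPUs and FPGAs together, and the platforms used so far have mounted both devices on the same compute node. However, the necessity of that configuration has not been examined in detail. In this study, we used an astrophysics application that simulates the formation of astronomical objects in the early universe with both a GPU and an FPGA, and quantitatively evaluated whether both devices need to be attached to the same machine. We reimplemented the existing code using MPI (Message Passing Interface) so that it runs in a configuration where the GPU and FPGA are separated. We then evaluated application performance in the configuration where the GPU and FPGA are attached to the same machine and in the configuration where they are separated. The evaluation showed that, compared with the same-machine configuration, running the application in the separated configuration kept the performance degradation to 2–3%. These results demonstrate quantitatively that, when a GPU and an FPGA are used for cooperative computation, fast cooperative computation is possible even when they are attached to different machines, depending on the characteristics of the application.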

oneAPIを用いたGPU・FPGA混載ノードにおける宇宙物理シミュレーションコードARGOTの実装

柏野, 隆太, 小林, 諒平, 藤田, 典久, 朴, 泰祐

研究報告ハイパフォーマンスコンピューティング（HPC） 2022-HPC-183 ( 12 ) 1 - 9 2022.3

Language：Japanese Publishing type：Research paper (conference, symposium, etc.)

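The GPU (Graphics Processing Unit) is one of the most widely used accelerators in the HPC field. However, in scientific computing based on multiphysics, diverse workloads appear within a single simulation, and acceleration using GPUs alone is insufficient. Targeting such complex physical simulations, we aim at acceleration through the combined use of GPUs and FPGAs (Field Programmable Gate Arrays), and are developing hardware, programming systems, and applications under the concept of CHARM (Cooperative Heterogeneous Acceleration by Reconfigurable Multidevices). A major challenge here is how to program these multiple devices. oneAPI, proposed by Intel and attracting attention in recent years, provides a single-language platform through DPC++, which is based on SYCL, and enables cooperative programming across multiple devices. In this paper, we implement ARGOT, an astrophysics simulation code that uses a GPU and an FPGA, with oneAPI and report on its performance evaluation. A distinguishing feature of this work is that, unlike typical usage, oneAPI is used not for development in DPC++ alone but as a framework for combining existing partial code written in CUDA and OpenCL. As a result, we found that oneAPI makes it possible to implement programs in which multiple devices cooperate not only through DPC++ programming but also by reusing existing source code written in other languages such as CUDA and OpenCL.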

GPUクラスタを用いた宇宙輻射輸送コードARGOTのOpenACC実装と性能評価

小林, 諒平, 藤田, 典久, 山口, 佳樹, 朴, 泰祐, 吉川, 耕司, 安部, 牧人, 梅村, 雅之

研究報告ハイパフォーマンスコンピューティング（HPC） 2022-HPC-183 ( 17 ) 1 - 8 2022.3

Authorship：Corresponding author Language：Japanese Publishing type：Research paper (conference, symposium, etc.)

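We are conducting research on GPU-FPGA hybrid systems, in which a GPU (Graphics Processing Unit) with high computational performance and memory bandwidth is coupled with an FPGA (Field Programmable Gate Array), which excels in computation and communication performance, so that the two are used complementarily. GPU-FPGA hybrid acceleration is needed because we expect it to be effective for multiphysics applications, that is, simulations involving multiple physical models or multiple simultaneously occurring physical phenomena. In multiphysics, computations with various characteristics appear within a simulation, so acceleration with GPUs alone can be difficult. We therefore aim to improve overall application performance by using FPGAs to accelerate computations whose characteristics GPUs alone cannot handle. However, the implementation approach is multilingual programming with multiple programming languages, in which the kernels running on the GPU are written in CUDA and those running on the FPGA in OpenCL. Because such a programming model places a heavy burden on programmers, a programming environment that realizes GPU-FPGA cooperation with higher usability is required. With this in mind, as a preliminary evaluation toward highly usable GPU-FPGA cooperation, we previously implemented the ARGOT code, which simulates the formation of astronomical objects in the early universe, with OpenACC and compared its single-node performance against an OpenMP-based CPU implementation and a CUDA-based GPU implementation. Because the OpenACC implementation was found to achieve performance comparable to the CUDA-based GPU implementation, in this paper we extend it to multiple nodes and multiple GPUs on a GPU cluster and report on its performance evaluation.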

HBM2 搭載 FPGA のための Addressable Cache を用いた HPC 向けメモリシステムの性能評価

藤田, 典久, 小林, 諒平, 山口, 佳樹, 朴, 泰祐

研究報告ハイパフォーマンスコンピューティング（HPC） 2022-HPC-183 ( 9 ) 1 - 10 2022.3

Language：Japanese Publishing type：Research paper (conference, symposium, etc.)

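In the field of high-performance computing, Field Programmable Gate Arrays (FPGAs) are attracting attention as a new class of accelerator. Compared with other accelerators, FPGAs have the weakness of low external memory bandwidth, which is one of the barriers to using FPGAs in HPC. Some of the latest high-performance FPGAs are equipped with High Bandwidth Memory 2 (HBM2), and using it is expected to broaden the use of FPGAs in HPC. However, FPGAs have no memory network or cache as fixed functions, so a memory circuit that can bring out the performance of HBM2 must be developed separately. In this paper, we present the implementation and performance evaluation of the HBM2 memory system for HPC that we are developing. We also report on the design and implementation of an API for using this system. An FPGA is an accelerator that can operate autonomously, and the API for this system takes advantage of that characteristic.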

OpenACCによる宇宙物理シミュレーションのGPU＋FPGA協調計算の実装 International coauthorship

綱島, 隆太, 小林, 諒平, 藤田, 典久, 朴, 泰祐, Lee, Seyong, Vetter, Jeffrey S, 村井, 均, 中尾, 昌広, 辻, 美和子, 佐藤, 三久

研究報告ハイパフォーマンスコンピューティング（HPC） 2022-HPC-183 ( 11 ) 1 - 9 2022.3

Language：Japanese Publishing type：Research paper (conference, symposium, etc.)

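In recent years, GPUs and FPGAs have been attracting attention as accelerators in the HPC field. FPGAs in particular are expected to perform well even on workloads that GPUs handle poorly, and we are researching next-generation supercomputers that integrate the two. However, no standard method or language exists for programming GPUs and FPGAs in combination. Because NVIDIA currently dominates the GPU share in HPC, GPU processing is mainly written in CUDA. For FPGAs, on the other hand, high-level synthesis technology has made it possible to use OpenCL instead of hardware description languages. Programming with a combination of the two places a heavy burden on application programmers. OpenCL can also be used to program GPUs, but most existing applications are either already written in CUDA or exist only as CPU versions, so rewriting them in OpenCL requires considerable effort. If code is to be rewritten in another language anyway, it is ideal to write it in a more general and more abstract form. We are therefore developing MHOAT (Multi-Hybrid OpenACC Translator), an environment that programs both accelerators in a unified way using OpenACC, a directive-based API, under the concept of CAMP (Cooperative Acceleration by Multi-device Programming). In this paper, we describe the implementation of mixed GPU and FPGA acceleration with MHOAT for the ARGOT code, a real application in astrophysics.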

Multi-hetero Acceleration by GPU and FPGA for Astrophysics Simulation on oneAPI Environment Reviewed

Kashino, Ryuta, Kobayashi, Ryohei, Fujita, Norihisa, Boku, Taisuke

HPCAsia2022: International Conference on High Performance Computing in Asia-Pacific Region 84 - 93 2022.1

Authorship：Corresponding author Language：English Publishing type：Research paper (international conference proceedings) Publisher：Association for Computing Machinery

GPU (Graphics Processing Unit) computing is one of the most popular accelerating methods for various high-performance computing applications. For scientific computations based on multi-physical phenomena, however, a single device solution on a GPU is insufficient, where the single timescale or degree of parallelism is not simply supported by a simple GPU-only solution. We have been researching a combination of a GPU and FPGA (Field Programmable Gate Array) for such complex physical simulations. The most challenging issue is how to program these multiple devices using a single code.

oneAPI, recently provided by Intel, is a programming paradigm supporting such a solution on a single language platform using DPC++ based on SYCL 2020. However, there are no practical applications utilizing its full features or supporting heterogeneous multi-device programming to demonstrate its potential capability. In this study, we present the implementation and performance evaluation of our astrophysics code ARGOT, to which we applied the oneAPI solution with a GPU and an FPGA. To realize our concept of Cooperative Heterogeneous Acceleration by Reconfigurable Multidevices, also known as CHARM, as a type of next-generation accelerated supercomputing for complex multi-physical simulations, this study was conducted on our multi-heterogeneous accelerated cluster machine running at the University of Tsukuba.

Through this research, we found that the current oneAPI framework is effective not only for its typical programming in DPC++ but also for utilizing traditionally developed applications coded in several other languages, such as CUDA or OpenCL, to support multiple types of accelerators. As an example of a real application, we successfully implemented and executed an early-stage universe simulation with a fundamental astrophysics code that utilizes both the GPU and the FPGA effectively. In this paper, we demonstrate the actual procedure for programming multi-device acceleration with this method over oneAPI.

DOI： 10.1145/3492805.3492817

An Efficient RTL Buffering Scheme for an FPGA-Accelerated Simulation of Diffuse Radiative Transfer Reviewed

Furukawa, Kazuki, Yokono, Tomoya, Yamaguchi, Yoshiki, Yoshikawa, Kohji, Fujita, Norihisa, Kobayashi, Ryohei, Boku, Taisuke, Umemura, Masayuki

2021 International Conference on Field-Programmable Technology (ICFPT) 1 - 9 2021.12

Language：English Publishing type：Research paper (international conference proceedings)

This paper proposes an efficient buffering approach for implementing radiative transfer equations that bridges the performance gap between processing elements and HBM memory bandwidth. The radiative transfer equation originally describes a fundamental physical process in astrophysics. It has also attracted considerable attention in recent years because of its wealth of applications, such as medical bioimaging. However, its acceleration requires a complicated memory access pattern with low latency, and earlier studies revealed that conventional memory access based on software control is unsuited to this computation. Thus, this article introduces an HBM FPGA and proposes an application-specific buffering mechanism called PRISM (PRefetchable and Instantly accessible Scratchpad Memory) to efficiently bridge the computational unit and the HBM. The proposed approach was evaluated on a Xilinx Alveo U280 FPGA, and the experimental results are also discussed.

DOI： 10.1109/ICFPT52863.2021.9609944

HBM2 Memory System for HPC Applications on an FPGA Reviewed

Fujita, Norihisa, Kobayashi, Ryohei, Yamaguchi, Yoshiki, Boku, Taisuke

Proceedings of 2021 IEEE International Conference on Cluster Computing (CLUSTER) 783 - 786 2021.10

Language：English Publishing type：Research paper (international conference proceedings) Publisher：IEEE

Field Programmable Gate Arrays (FPGAs) have been targeted as a new accelerator in the HPC field. This is because the barrier to using FPGAs has been gradually lowered by the widespread use of high-level synthesis (HLS) technology. However, the bandwidth of external memory in FPGAs is much lower than that of other accelerators widely used in HPC, such as NVIDIA V100 GPUs. The latest FPGAs can use High Bandwidth Memory 2 (HBM2), which has a memory bandwidth of up to 512 GB/s, and we therefore believe FPGAs will be a viable option for speeding up applications. Unlike CPUs and GPUs, however, FPGAs do not have caches and memory networks to exploit the full potential of HBM2, which may limit the efficiency of the application. In this paper, we propose a memory system for HBM2 and HPC applications. We show the prototype implementation of the system and evaluate its performance. We also demonstrate the use of the proposed system from an application written in C++ and developed with high-level synthesis (HLS).

DOI： 10.1109/Cluster48925.2021.00116

演算精度の動的制御によるApproximate Computingの実現に向けた予備評価

和田, 康孝, 小林, 諒平, 坂本, 龍一, 森江, 善之

研究報告ハイパフォーマンスコンピューティング（HPC） 2021-HPC-181 ( 2 ) 1 - 6 2021.9

Language：Japanese Publishing type：Research paper (conference, symposium, etc.)

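Approximate computing techniques, which optimize the trade-off between computational precision and execution performance, power consumption, or the like, are beginning to spread. By applying approximate computing, an application can obtain results of necessary and sufficient precision while maximizing execution performance or reducing power consumption. To extend these benefits further, approximate computing needs to be applied in a variety of settings, such as systems equipped with accelerators like GPGPUs and FPGAs and systems built by connecting multiple nodes with different configurations. In particular, it is considered important to adjust computational precision dynamically at application run time according to the structure of the application and the state of the system. Against this background, this paper assumes that computational precision is changed and adjusted dynamically at run time, and evaluates the effect on, and trade-off between, execution performance and computational results when this is applied at the application level.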

HBM-FPGA implementation of a large-scale radiative transfer simulation of diffuse photon and its subjects

古川, 和輝, 横野, 智也, 山口, 佳樹, 吉川, 耕司, 藤田, 典久, 小林, 諒平, 朴, 泰祐, 梅村, 雅之

Proceedings of Forum on Information Technology 1 27 - 32 2021.8

Language：Japanese Publishing type：Research paper (conference, symposium, etc.)

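We discuss the acceleration of ARGOT (Accelerated Radiative transfer on Grids using Oct-Tree; Center for Computational Sciences, University of Tsukuba), a cosmic radiative transfer simulation, using an HBM-FPGA. For this simulation, memory bandwidth is known to become a bottleneck, making sufficient acceleration difficult even with large-capacity, high-bandwidth memory such as HBM. In this work, to raise memory access efficiency, we therefore incorporate fine-grained dataflow control into the computation buffer to reduce the number of memory accesses, aiming at a dramatic improvement in computation speed. In this report, focusing on the property that each isotropically diffusing ray travels in a straight line, we show that highly efficient stream computation can be realized by partitioning the computational space into triangular pyramids and optimizing their update order.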

oneAPIを用いたGPU・FPGA混載ノードにおけるヘテロ演算加速プログラム開発

柏野, 隆太, 小林, 諒平, 藤田, 典久, 朴, 泰祐

研究報告ハイパフォーマンスコンピューティング（HPC） 2021-HPC-180 ( 8 ) 1 - 9 2021.7

Language：Japanese Publishing type：Research paper (conference, symposium, etc.)

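We aim to improve overall application performance by complementarily using GPUs, which excel in memory bandwidth and in computational performance based on spatial parallelism, and FPGAs, which excel in computational performance through pipeline parallelism and in communication performance. We call this concept CHARM (Cooperative Heterogeneous Acceleration with Reconfigurable Multidevices), and it is expected to work effectively for diverse HPC workloads. However, GPUs and FPGAs are generally accelerators developed in different program development environments, which places a heavy burden on developers. A unified development environment that resolves this development complexity is therefore needed. The oneAPI development environment provided by Intel is expected to be effective for this problem. oneAPI provides a unified language across different accelerators and an API that executes the individual offloading modules in an integrated manner. In this paper, we report on a method for developing heterogeneous acceleration programs with oneAPI, targeting two accelerators: an NVIDIA GPU and an Intel FPGA.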

FPGAにおけるHPCアプリケーション向けHBM2メモリシステムの提案と実装

藤田, 典久, 小林, 諒平, 山口, 佳樹, 朴, 泰祐

研究報告ハイパフォーマンスコンピューティング（HPC） 2021-HPC-180 ( 27 ) 1 - 9 2021.7

Language：Japanese Publishing type：Research paper (conference, symposium, etc.)

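In the field of high-performance computing, Field Programmable Gate Arrays (FPGAs) are attracting attention as a new class of accelerator. In recent years, High Level Synthesis (HLS) development environments have been maturing, and development in languages such as C and C++ is becoming possible. Low external memory bandwidth has been a problem for FPGAs and at times a barrier to their use in HPC, but vendors have begun to release FPGA chips equipped with High Bandwidth Memory 2 (HBM2), which offer memory bandwidth of up to 512 GB/s. However, FPGAs lack facilities for using memory, such as caches and memory networks, which is one of the challenges in using HBM2 on FPGAs. In this paper, we propose and implement an HBM2 memory system suited to HPC applications and report on its performance evaluation. We also show that the proposed system can be used from kernels written with high-level synthesis.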

FPGA向け浮動小数点数型ソーティングライブラリの提案と実装

小林, 諒平, 三浦, 賢人, 藤田, 典久, 朴, 泰祐, 天笠, 俊之

IEICE-CPSY2021-8 IEICE-121 ( 116 ) 43 - 48 2021.7

Authorship：Corresponding author Language：Japanese Publishing type：Research paper (conference, symposium, etc.)

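We have been focusing on data sorting, a basic arithmetic operation, and developing a sorting library usable from OpenCL, a programming model for FPGAs (Field-Programmable Gate Arrays). In this paper, we report on the proposal and implementation of a mechanism that supports floating-point data. The proposed sorting library is built by combining three hardware sorting algorithms. Compared with a merge sort algorithm reimplemented for the OpenCL programming model, it consumes more than twice the overall hardware resources but achieves sorting performance higher by three orders of magnitude or more.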

コンパクション処理を活用した正規パス問合わせアクセラレータのFPGA実装

小林, 諒平, 三浦賢人, 藤田典久, 朴泰祐, 天笠俊之

IEICE-RECONF2021-12 IEICE-121 ( 59 ) 62 - 67 2021.6

Language：Japanese Publishing type：Research paper (conference, symposium, etc.)

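Graph structures are effective data structures for representing a wide variety of everyday data. With the spread of big data analytics and the like, graph-structured data is now used in many fields. One way to extract the data a user wants from such graph-structured data is the regular path query (RPQ), which searches the graph for paths with a specified sequence of edges and returns the start and end nodes of those paths. In this work, we propose a method for processing RPQ evaluation in a pipelined manner and its FPGA implementation. In an evaluation of the implemented RPQ accelerator, it achieved a speedup of up to about three orders of magnitude over the comparison method. We also propose an extension that makes it possible to handle larger graphs and confirmed that it operates correctly on real hardware.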

A Sorting Library for FPGA Implementation in OpenCL Programming Reviewed

Kobayashi, Ryohei, Miura, Kento, Fujita, Norihisa, Boku, Taisuke, Amagasa, Toshiyuki

Proceedings of the 11th International Symposium on Highly Efficient Accelerators and Reconfigurable Technologies (HEART '21). ( 10 ) 1 - 6 2021.6

Authorship：Corresponding author Language：English Publishing type：Research paper (international conference proceedings)

In this study, we focus on data sorting, which is a basic arithmetic operation, and we present a sorting library that can be used with the OpenCL programming model for field-programmable gate arrays (FPGAs). Our sorting library is built by combining three hardware sorting algorithms. It consumes more than twice the overall hardware resources compared to the merge sort restructured for the OpenCL programming model for FPGAs. However, its operating frequency is 1.09x higher and its sorting throughput is three orders of magnitude greater than the baseline.

DOI： 10.1145/3468044.3468054

HBM2メモリを持つFPGAボードの性能評価

藤田, 典久, 小林, 諒平, 山口, 佳樹, 朴, 泰祐

研究報告ハイパフォーマンスコンピューティング（HPC） 2021-HPC-178 ( 6 ) 1 - 8 2021.3

Language：Japanese Publishing type：Research paper (conference, symposium, etc.)

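In recent years, a technology called High Level Synthesis (HLS) has been advancing, lowering the barrier to Field Programmable Gate Array (FPGA) development. However, the memory bandwidth of FPGAs is lower than that of other accelerators, which has at times been a barrier to using FPGAs in the HPC field. Vendors have now begun to release FPGA chips equipped with High Bandwidth Memory 2 (HBM2), which offer memory bandwidth of up to 512 GB/s. Although a gap remains, with performance about one quarter that of Graphics Processing Unit (GPU) accelerators, the situation is improving from one in which performance differed by an order of magnitude or more. In this paper, we describe a performance evaluation of the HBM2 memory on an Intel Stratix 10 FPGA and a method for applying it to HPC applications.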

FPGA/GPU協調によるネットワーク型不正侵入検知システムの構築

菊地, 駿太, 池上, 努, Akram, ben Ahmed, 工藤, 知宏, 小林, 諒平, 藤田, 典久, 朴, 泰祐

電子情報通信学会技術研究報告コンピュータシステム 120 ( 338 ) 113 - 118 2021.1

Language：Japanese Publishing type：Research paper (conference, symposium, etc.)

Heterogeneous computing is becoming common these days. In this study, we built a network intrusion detection system (NIDS) as a proof-of-concept application in which an FPGA and a GPU cooperate. An NIDS monitors the network and raises an alert when an input matches a malicious packet. In this system, the FPGA handles input of more than 100 Gbps, which other processing units cannot handle. The FPGA pre-filters suspicious packets because it is well suited to simple tasks. The GPU then performs exact pattern matching, which can handle patterns of various lengths. Increasing the processing throughput is left as future work.

Performance Evaluation of OpenCL-Enabled Inter-FPGA Optical Link Communication Framework CIRCUS and SMI Reviewed

Kashino, Ryuta, Kobayashi, Ryohei, Fujita, Norihisa, Boku, Taisuke

HPC Asia 2021: The International Conference on High Performance Computing in Asia-Pacific Region 23 - 31 2021.1

Language：English Publishing type：Research paper (international conference proceedings)

In recent years, Field Programmable Gate Arrays (FPGAs) have attracted much attention as accelerators in the research area of High-Performance Computing (HPC). One of the strong features of current FPGA devices is their ability to achieve high-bandwidth communication performance with direct optical links to construct multi-FPGA platforms, as well as their adjustability. However, FPGAs are not easy to program for user applications. With more user-friendly programming environments, FPGAs can be applied to various HPC applications on multi-FPGA platforms. Of the several studies aimed at using the FPGA communication feature from high-level synthesis, we focus on two systems: Communication Integrated Reconfigurable CompUting System (CIRCUS) and Streaming Message Interface (SMI), which are available on an Intel FPGA with direct optical links with a peak performance of 40–100 Gbps. In both systems, a user can access the optical link in OpenCL kernels, where high-level programming for HPC applications is possible. In this paper, we introduce them for practical cases and compare their implementations and performance in real systems. Our evaluation shows that the CIRCUS system for single point-to-point communication achieves a bandwidth of up to 90 Gbps with a 100-Gbps optical link using OpenCL code. It is 2.7 times faster than the SMI system implemented on the same platform, and we also confirmed that broadcast data transfer among four FPGAs using CIRCUS achieves a bandwidth of up to 31 Gbps, which is 5.3 times faster than that achieved using SMI. In addition, we determined the main cause of the performance bottleneck on SMI when it is applied to a 100-Gbps platform and compared it with the CIRCUS implementation.

DOI： 10.1145/3432261.3432266

OpenCL-enabled Parallel Raytracing for Astrophysical Application on Multiple FPGAs with Optical Links Reviewed

Fujita, Norihisa, Kobayashi, Ryohei, Yamaguchi, Yoshiki, Boku, Taisuke, Yoshikawa, Kohji, Abe, Makito, Umemura, Masayuki

2020 IEEE/ACM International Workshop on Heterogeneous High-performance Reconfigurable Computing (H2RC) 48 - 55 2020.12

Language：English Publishing type：Research paper (international conference proceedings)

DOI： 10.1109/H2RC51942.2020.00011

Multi-Hybrid Accelerated Simulation by GPU and FPGA on Radiative Transfer Simulation in Astrophysics Reviewed

Kobayashi, Ryohei, Fujita, Norihisa, Yamaguchi, Yoshiki, Boku, Taisuke, Yoshikawa, Kohji, Abe, Makito, Umemura, Masayuki

Journal of Information Processing 28 ( 0 ) 1073 - 1089 2020.12

Language：English Publishing type：Research paper (scientific journal) Publisher：Information Processing Society of Japan

Field-programmable gate arrays (FPGAs) have garnered significant interest in research on high-performance computing because their computation and communication capabilities have drastically improved in recent years due to advances in semiconductor integration technologies that rely on Moore's Law. In addition to improving FPGA performance, toolchains for the development of FPGAs in OpenCL have been developed and offered by FPGA vendors that reduce the programming effort required. These improvements reveal the possibility of implementing a concept to enable on-the-fly offloading computation at which CPUs/GPUs perform poorly to FPGAs while performing low-latency data movement. We think that this concept is key to improving the performance of heterogeneous supercomputers using accelerators such as the GPU. In this paper, we propose a GPU-FPGA-accelerated simulation based on the concept and show our implementation with CUDA and OpenCL mixed programming for the proposed method. The results of experiments show that our proposed method can always achieve a better performance than GPU-based implementation and we believe that realizing GPU-FPGA-accelerated simulation is the most significant

DOI： 10.2197/ipsjjip.28.1073

OpenACCとOpenCLの混合記述によるGPU-FPGAデバイス間連携

小林, 諒平, 藤田, 典久, 朴, 泰祐

研究報告ハイパフォーマンスコンピューティング（HPC） 2020-HPC-177 ( 12 ) 1 - 7 2020.12

Authorship：Corresponding author Language：Japanese Publishing type：Research paper (conference, symposium, etc.)

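We are conducting research on GPU-FPGA hybrid systems, in which a GPU (Graphics Processing Unit) with high computational performance and memory bandwidth is coupled with an FPGA (Field Programmable Gate Array), which excels in computation and communication performance, so that the two are used complementarily. GPU-FPGA hybrid acceleration is needed because we expect it to be effective for multiphysics applications, that is, simulations involving multiple physical models or multiple simultaneously occurring physical phenomena. In multiphysics, computations with various characteristics appear within a simulation, so acceleration with GPUs alone can be difficult. We therefore aim to improve overall application performance by using FPGAs to accelerate computations whose characteristics GPUs alone cannot handle. However, the implementation approach is multilingual programming with multiple programming languages, in which the kernels running on the GPU are written in CUDA and those running on the FPGA in OpenCL. Because such a programming model places a heavy burden on programmers, a programming environment that realizes GPU-FPGA cooperation with higher usability is required. With this in mind, as a preliminary evaluation toward highly usable GPU-FPGA cooperation, this paper presents a framework that makes the GPU and FPGA cooperate through a combination of OpenCL and OpenACC, a programming model with a higher level of abstraction than CUDA, with the aim of improving performance.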

OpenACCによるGPUデバイスメモリ管理についての考察

渡邉, 孔英, 菊池, 航平, 柏野, 隆太, 綱島, 隆太, 藤田, 典久, 小林, 諒平, 朴, 泰祐

研究報告ハイパフォーマンスコンピューティング（HPC） 2020-HPC-177 ( 13 ) 1 - 9 2020.12

Language：Japanese Publishing type：Research paper (conference, symposium, etc.)

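When accelerating an application by porting it to GPUs, data movement between CPU memory and GPU memory must be managed. When a program written in OpenACC is compiled with the PGI compiler, one can choose whether data movement is managed automatically or described by the programmer. In this study, we experimentally compared and examined data movement management by both methods and their performance. We found that, depending on the data access pattern, automatic data movement management can reduce data transfers and help improve speed.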

Performance Evaluation of Parallel FPGA System for OpenCL Programming Reviewed

藤田, 典久, 小林, 諒平, 山口, 佳樹, 上野, 知洋, 佐野, 健太郎, 朴, 泰祐

IPSJ Transactions on Advanced Computing System 13 ( 3 ) 13 - 28 2020.11

Language：Japanese Publishing type：Research paper (scientific journal)

A Field Programmable Gate Array (FPGA) is a kind of reconfigurable hardware. We focus on the FPGA's powerful interconnection network capability. We have been proposing Communication Integrated Reconfigurable CompUting System (CIRCUS), which is an inter-FPGA communication framework. The CIRCUS system allows us to describe a fused pipeline combining communication and computation in the OpenCL language. The Center for Computational Sciences, University of Tsukuba operates the Cygnus supercomputer. Each node of Cygnus has 2 FPGAs. In this paper, we describe the design and implementation of the CIRCUS system and show the results of its performance evaluation on Cygnus.

Toward OpenACC-enabled GPU-FPGA Accelerated Computing

Norihisa Fujita, Ryohei Kobayashi, Yoshiki Yamaguchi, Kohji Yoshikawa, Makito Abe, Masayuki Umemura

Proceedings - IEEE International Conference on Cluster Computing, ICCC 2020- 422 - 423 2020.9

Language：English Publishing type：Research paper (international conference proceedings) Publisher：Institute of Electrical and Electronics Engineers Inc.

DOI： 10.1109/CLUSTER49012.2020.00060

再結合光子の輻射輸送大規模計算に向けたHBM-FPGA実装への考察

古川, 和輝, 横野, 智也, 山口, 佳樹, 吉川, 耕司, 藤田, 典久, 小林, 諒平, 朴, 泰祐, 梅村, 雅之

情報科学技術フォーラム講演論文集 1 21 - 26 2020.9

Language：Japanese Publishing type：Research paper (conference, symposium, etc.)

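One of the projects at the Center for Computational Sciences, University of Tsukuba, is the elucidation of astronomical phenomena using cosmic radiative transfer simulation. This simulation performs its computation using the ARGOT (Accelerated Radiative transfer on Grids using Oct-Tree) method, which consists of computations of the energy from stars and from the interstellar medium. The computation scheme for the latter, ART (Authentic Radiation Transfer), is expected to gain a dramatic speedup from FPGA implementation because random memory access is possible, but a speedup that greatly exceeds the GPU implementation has not yet been achieved. In this study, we therefore discuss accelerating the computation unit, including its memory system, together with a review of the computation method.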

FPGAに組み込まれたHBMの効率的な利用とその考察

古川, 和輝, 横野, 智也, 山口, 佳樹, 吉川, 耕司, 藤田, 典久, 小林, 諒平, 朴, 泰祐, 梅村, 雅之

電子情報通信学会技術研究報告 (信学技報) 120 ( 168 ) 30 - 35 2020.9

Language：Japanese Publishing type：Research paper (conference, symposium, etc.)

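As acceleration using multiple FPGAs draws expectations in high-performance computing, a concept called AiS (Accelerators in Switch) is attracting attention. AiS proposes embedding application-specific computation units in the communication fabric that connects the FPGAs, realizing a tightly coupled communication-and-computation mechanism and thereby improving system performance. At the Center for Computational Sciences, University of Tsukuba, the cosmic radiative transfer simulation code ARGOT (Accelerated Radiative transfer on Grids using Oct-Tree) has been developed, and research is under way to speed up the simulation system by applying AiS to it. In this study, we propose accelerating the ART (Authentic Radiation Transfer) scheme of ARGOT on an FPGA. Because ART handles a three-dimensional grid space, the resulting near-random memory access control can be expected to be addressed by an FPGA. On the other hand, the enormous volume of mesh data accessed during computation is difficult to hold in on-FPGA BRAM and similar memories, which has been a cause of performance degradation. In this paper, we therefore focus on HBM (High Bandwidth Memory) and propose an implementation of the ART scheme using it. We first discuss the memory access performance of HBM on the Xilinx Alveo U280. We then assume that on-chip RAM (BRAM and URAM) is used as an SPM (Scratchpad Memory) when reading mesh data from HBM, and discuss verification of the SPM access rate at which memory access does not become a bottleneck, along with techniques for reducing the number of external memory accesses.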

Stratix 10 FPGAを用いたray-tracing法による輻射輸送計算の高速化

藤田, 典久, 小林, 諒平, 山口, 佳樹, 朴, 泰祐, 吉川, 耕司, 安部, 牧人, 梅村, 雅之

研究報告ハイパフォーマンスコンピューティング（HPC） 2020-HPC-175 ( 8 ) 1 - 10 2020.7

Language：Japanese Publishing type：Research paper (conference, symposium, etc.)

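In our previous work, we implemented the Authentic Radiative Transfer (ART) method, used in cosmic radiative transfer problems, on an Arria 10 FPGA and evaluated its performance. In this paper, we optimize the ART method for the Stratix 10 FPGA, the latest Intel Field Programmable Gate Array (FPGA), and evaluate its performance. We also realize parallel computation using Communication Integrated Reconfigurable CompUting System (CIRCUS), the inter-FPGA communication framework we have proposed, and evaluate performance when multiple FPGAs are used.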

Accelerating Radiative Transfer Simulation with GPU-FPGA Cooperative Computation Reviewed

Kobayashi, Ryohei, Fujita, Norihisa, Yamaguchi, Yoshiki, Boku, Taisuke, Yoshikawa, Kohji, Abe, Makito, Umemura, Masayuki

2020 IEEE 31st International Conference on Application-specific Systems, Architectures and Processors (ASAP) 9 - 16 2020.7

Language：English Publishing type：Research paper (international conference proceedings) Publisher：IEEE

Field-programmable gate arrays (FPGAs) have garnered significant interest in research on high-performance computing. This is ascribed to the drastic improvement in their computational and communication capabilities in recent years owing to advances in semiconductor integration technologies that rely on Moore’s Law. In addition to these performance improvements, toolchains for the development of FPGAs in OpenCL have been offered by FPGA vendors to reduce the programming effort required. These improvements suggest the possibility of implementing the concept of enabling on-the-fly offloading computation at which CPUs/GPUs perform poorly relative to FPGAs while performing low-latency data transfers. We consider this concept to be of key importance to improve the performance of heterogeneous supercomputers that employ accelerators such as a GPU. In this study, we propose GPU–FPGA-accelerated simulation based on this concept and demonstrate the implementation of the proposed method with CUDA and OpenCL mixed programming. The experimental results showed that our proposed method can increase the performance by up to 17.4× compared with GPU-based implementation. This performance is stil

DOI： 10.1109/ASAP49362.2020.00011

Other Link： https://dblp.uni-trier.de/db/conf/asap/asap2020.html#KobayashiFYBYAU20
Performance Evaluation of Pipelined Communication Combined with Computation in OpenCL Programming on FPGA Reviewed

Fujita, Norihisa, Kobayashi, Ryohei, Yamaguchi, Yoshiki, Ueno, Tomohiro, Sano, Kentaro, Boku, Taisuke

2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) 450 - 459 2020.7

Language：English Publishing type：Research paper (international conference proceedings) Publisher：IEEE

In recent years, many High Performance Computing (HPC) researchers have been attracted to utilizing Field Programmable Gate Arrays (FPGAs) for HPC applications. Thanks to their I/O capabilities, FPGAs can be used for communication as well as computation. HPC scientists have been unable to utilize FPGAs for their applications because of the difficulty of FPGA development; however, High Level Synthesis (HLS) allows them to use FPGAs at reasonable cost. In this study, we propose a Communication Integrated Reconfigurable CompUting System (CIRCUS) that enables us to utilize the high-speed interconnection of FPGAs from OpenCL. CIRCUS makes a fused single pipeline combining the computation and the communication, which hides the communication latency by completely overlapping them. In this paper, we present the details of the implementation and the evaluation results using two benchmarks: a pingpong benchmark and an allreduce benchmark.

DOI： 10.1109/IPDPSW50202.2020.00083

Other Link： https://dblp.uni-trier.de/db/conf/ipps/ipdps2020w.html#FujitaKYUSB20
OpenCL対応FPGA間光リンク接続フレームワークCIRCUSとSMIの性能評価

柏野, 隆太, 小林, 諒平, 藤田, 典久, 朴, 泰祐

研究報告ハイパフォーマンスコンピューティング（HPC） 2020-HPC-175 ( 16 ) 1 - 8 2020.7

Language：Japanese Publishing type：Research paper (conference, symposium, etc.)

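In recent years, expectations for FPGAs have been rising in the high-performance computing field. With high-level synthesis lowering the barrier to development and with the potential for strong communication performance, FPGAs may work effectively even for kinds of applications that conventional systems cannot accelerate. To make the most of these FPGA features, a communication framework specialized for FPGAs is needed. Such research has already been carried out: CIRCUS has been proposed by the University of Tsukuba and SMI by ETH Zurich. Both make 40–100 Gbps optical links usable from OpenCL and are expected to become important components for the future use of FPGAs in HPC. In this report, we evaluate these two methods, CIRCUS and SMI, on real hardware and compare their characteristics.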

宇宙輻射輸送コードARGOTのOpenACCによるGPU実装

小林, 諒平, 藤田, 典久, 山口, 佳樹, 朴, 泰祐, 吉川, 耕司, 安部, 牧人, 梅村, 雅之

研究報告ハイパフォーマンスコンピューティング（HPC） 2020-HPC-175 ( 7 ) 1 - 7 2020.7

Authorship：Corresponding author Language：Japanese Publishing type：Research paper (conference, symposium, etc.)

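We are conducting research on GPU-FPGA hybrid systems, in which a GPU (Graphics Processing Unit) with high computational performance and memory bandwidth is coupled with an FPGA (Field Programmable Gate Array), which excels in computation and communication performance, so that the two are used complementarily. GPU-FPGA hybrid acceleration is needed because we expect it to be effective for multiphysics applications, that is, simulations involving multiple physical models or multiple simultaneously occurring physical phenomena. In multiphysics, computations with various characteristics appear within a simulation, so acceleration with GPUs alone can be difficult. We therefore aim to improve overall application performance by using FPGAs to accelerate computations whose characteristics GPUs alone cannot handle. However, the implementation approach is multilingual programming with multiple programming languages, in which the kernels running on the GPU are written in CUDA and those running on the FPGA in OpenCL. Because such a programming model places a heavy burden on programmers, a programming environment that realizes GPU-FPGA cooperation with higher usability is required. With this in mind, as a preliminary evaluation toward highly usable GPU-FPGA cooperation, in this paper we implement a program that simulates the formation of astronomical objects in the early universe with OpenACC and compare its performance with an OpenMP-based CPU implementation and a CUDA-based GPU implementation.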

Design and Performance Evaluation of Inter-FPGA Communication using High Level Synthesis

藤田, 典久, 小林, 諒平, 山口, 佳樹, 上野, 知洋, 佐野, 健太郎, 朴, 泰祐

計算工学講演会論文集 Proceedings of the Conference on Computational Engineering and Science / 日本計算工学会編 25 6p 2020.6

Language：Japanese Publishing type：Research paper (conference, symposium, etc.)

Multi-hybrid Accelerated Computing with GPU and Reconfigurable System

小林, 諒平, 藤田, 典久, 山口, 佳樹, 朴, 泰祐, 吉川, 耕司, 安部, 牧人, 梅村, 雅之

計算工学講演会論文集 Proceedings of the Conference on Computational Engineering and Science / 日本計算工学会編 25 6p 2020.6

　More details

Language：Japanese Publishing type：Research paper (conference, symposium, etc.)

Field-programmable gate arrays (FPGAs) have garnered significant interest in research on high performance computing because their computation and communication capabilities have drastically improved in recent years due to advances in semiconductor integration technologies that rely on Moore’s Law. In addition to improving FPGA performance, toolchains for the development of FPGAs in OpenCL have been developed and offered by FPGA vendors that reduce the programming effort required. These improvements reveal the possibility of implementing a concept to enable on-the-fly offloading computation at which CPUs/GPUs perform poorly to FPGAs while performing low-latency data movement. We think that this concept is key to improving the performance of heterogeneous supercomputers using accelerators such as the GPU. In this paper, we propose a GPU-FPGA-accelerated simulation based on the concept and show our implementation with OpenCL-enabled GPU–FPGA DMA method. The results of experiments show that our proposed method can always achieve better performance than GPU-based implementation and we believe that realizing GPU–FPGA-accelerated simulation is the most significant difference be

researchmap
スーパーコンピュータCygnus上におけるFPGA間パイプライン通信の性能評価

藤田, 典久, 小林, 諒平, 山口, 佳樹, 上野, 知洋, 佐野, 健太郎, 朴, 泰祐

研究報告ハイパフォーマンスコンピューティング（HPC） 2020-HPC-173 ( 24 ) 1 - 11 2020.3

　More details

Language：Japanese Publishing type：Research paper (conference, symposium, etc.)

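The Field Programmable Gate Array (FPGA) is one type of reconfigurable hardware. We focus on the powerful external communication capabilities of FPGAs. FPGA development used to require low-level descriptions and was costly, but this is being resolved by High Level Synthesis (HLS) technology. We have proposed an inter-FPGA communication framework called Communication Integrated Reconfigurable CompUting System (CIRCUS). With CIRCUS, pipelines that integrate communication and computation can be written in OpenCL. The Center for Computational Sciences, University of Tsukuba operates the supercomputer Cygnus, which has two FPGA boards per node; in this paper, we evaluate and report the communication performance of CIRCUS on Cygnus.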

J-GLOBAL

researchmap
GPU・FPGA複合演算加速による宇宙輻射輸送コードARGOTの性能評価

小林, 諒平, 藤田, 典久, 中道, 安祐未, 山口, 佳樹, 朴, 泰祐, 吉川, 耕司, 安部, 牧人, 梅村, 雅之

研究報告ハイパフォーマンスコンピューティング（HPC） 2020-HPC-173 ( 8 ) 1 - 11 2020.3

　More details

Authorship：Corresponding author Language：Japanese Publishing type：Research paper (conference, symposium, etc.)

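We have been studying GPU-FPGA hybrid systems in which a GPU (Graphics Processing Unit), which offers high computational performance and memory bandwidth, is coupled with an FPGA (Field Programmable Gate Array), which excels in computation and communication performance, so that the two are used in a complementary manner. GPU-FPGA combined acceleration is needed because we expect it to be effective for multiphysics applications, that is, simulations that involve multiple physical models or multiple simultaneously occurring physical phenomena. In multiphysics, computations with various characteristics appear within a single simulation, so acceleration with GPUs alone can be difficult. We therefore aim to improve the performance of the whole application by using FPGAs to accelerate computations whose characteristics GPUs cannot handle well. In this paper, we target ARGOT, a space radiative transfer simulation code, as an example of multiphysics. ARGOT includes two types of radiative transfer problems: radiation from point sources and radiation from spatially distributed sources. By using the GPU kernel already implemented in the ARGOT program for the ARGOT method computation, we optimize the ARGOT code by distributing its main computations between the GPU and the FPGA, assigning each to the device that suits it. For data transfer between the GPU and the FPGA, we use the OpenCL-controllable GPU-FPGA DMA transfer that we proposed previously. In our evaluation, the ARGOT code with functions distributed between the GPU and the FPGA achieved up to 10.4 times higher performance than the ARGOT code without such distribution.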

researchmap
OpenCL-enabled GPU-FPGA Accelerated Computing with Inter-FPGA Communication Reviewed

Kobayashi, Ryohei, Fujita, Norihisa, Yamaguchi, Yoshiki, Nakamichi, Ayumi, Boku, Taisuke

Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops 17 - 20 2020.1

　More details

Language：English Publishing type：Research paper (international conference proceedings)

DOI： 10.1145/3373271.3373275

Web of Science

researchmap
OpenCL対応GPU・FPGAデバイス間連携機構による宇宙輻射輸送コードの演算加速

小林, 諒平, 藤田, 典久, 中道, 安祐未, 山口, 佳樹, 朴, 泰祐, 吉川, 耕司, 安部, 牧人, 梅村, 雅之

研究報告ハイパフォーマンスコンピューティング（HPC） 2019-HPC-172 ( 8 ) 1 - 9 2019.12

　More details

Authorship：Corresponding author Language：Japanese Publishing type：Research paper (conference, symposium, etc.)

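We have been studying GPU-FPGA hybrid systems in which a GPU (Graphics Processing Unit), which offers high computational performance and memory bandwidth, is coupled with an FPGA (Field Programmable Gate Array), which excels in computation and communication performance, so that the two are used in a complementary manner. GPU-FPGA combined acceleration is needed because we expect it to be effective for multiphysics applications, that is, simulations that involve multiple physical models or multiple simultaneously occurring physical phenomena. In multiphysics, computations with various characteristics appear within a single simulation, so some of them are hard to accelerate with GPUs alone. We therefore aim to improve the performance of the whole application by using FPGAs to accelerate computations whose characteristics GPUs cannot handle well. In this paper, we target ARGOT, a space radiative transfer simulation code, as an example of multiphysics. ARGOT includes two types of radiative transfer problems: radiation from point sources and radiation from spatially distributed sources. By using the GPU kernel already implemented in the ARGOT program for the ARGOT method computation, we optimize the ARGOT code by distributing its main computations between the GPU and the FPGA, assigning each to the device that suits it. For data transfer between the GPU and the FPGA, we use the OpenCL-controllable GPU-FPGA DMA transfer that we proposed previously. In our evaluation, the ARGOT code with functions distributed between the GPU and the FPGA achieved up to 3 times higher performance than the ARGOT code without such distribution.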

researchmap
GPU-FPGA協調プログラミングを実現するコンパイラの開発 International coauthorship

綱島, 隆太, 小林, 諒平, 藤田, 典久, 中道, 安祐未, 朴, 泰祐, Lee, Seyong, Vetter, Jeffrey, 村井, 均, 佐藤, 三久

研究報告ハイパフォーマンスコンピューティング（HPC） 2019-HPC-172 ( 11 ) 1 - 10 2019.12

　More details

Language：Japanese Publishing type：Research paper (conference, symposium, etc.)

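In recent years, many of the top-level machines in the high-performance computing (HPC) field have been large-scale computing clusters equipped with accelerators. Graphics Processing Units (GPUs), which offer high computational performance and memory bandwidth, are the main accelerators in use, but there remain computations that GPUs handle poorly, such as processing with frequent conditional branches and processing with little parallelism that cannot exploit a large number of cores, and these hinder performance improvement. To address this problem, we are pursuing an approach that implements circuits for the processing GPUs handle poorly on a Field Programmable Gate Array (FPGA), an integrated circuit in which arbitrary logic circuits can be programmed, and offloads that processing to the FPGA as appropriate, thereby improving the performance of the whole application. However, the GPU and FPGA kernels must be developed in different programming languages, CUDA and OpenCL respectively, and such multilingual programming is a heavy burden on users. In this study, we therefore examine a programming environment based on OpenACC that enables integrated control of both accelerators on a computer system equipped with a GPU and an FPGA. In this report, we developed a source-to-source compiler that splits a single program written in OpenACC into files targeted at a GPU compiler and an FPGA compiler, and verified whether cooperation between the two accelerators can be realized by a single executable that finally links them. As a result, we confirmed that the developed compiler generates, from a single program written with a unified application programming interface (API), a single executable in which the CPU, GPU, and FPGA compute cooperatively, and that cooperation between the two accelerators can be realized.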

J-GLOBAL

researchmap
再構成可能なハードウェアを用いた演算と通信を融合する手法の提案と性能評価

藤田, 典久, 小林, 諒平, 山口, 佳樹, 朴, 泰祐

研究報告ハイパフォーマンスコンピューティング（HPC） 2019-HPC-171 ( 6 ) 1 - 9 2019.9

　More details

Language：Japanese Publishing type：Research paper (conference, symposium, etc.)

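In recent years, the Field Programmable Gate Array (FPGA), a form of reconfigurable hardware, has been attracting attention in the high-performance computing field as a next-generation accelerator. The barrier to using FPGAs for high-performance computing has been the difficulty of development, but this problem is being resolved as high-level synthesis techniques advance. The latest FPGAs offer communication performance of up to 100 Gbps × 4, and we focus on this powerful communication capability. Although the absolute performance of an FPGA is lower than that of other accelerators, we believe that FPGAs can be applied to a wider range of problems by combining their computation and communication capabilities. The purpose of this study is to realize a parallel processing system by operating the communication mechanism from FPGA applications written with high-level synthesis. We evaluate not only communication throughput and latency but also whether a pipeline integrating communication and computation is constructed inside the FPGA, and show that parallel computation is possible with FPGA applications written with high-level synthesis. We have been developing a system called CoE as an environment for direct communication between FPGAs; it achieved a maximum bandwidth of 90.7 Gbps and a minimum latency of 429.2 ns. The pipeline evaluation also gave good results, confirming that a pipeline integrating communication and computation was constructed.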

researchmap
Parallel Processing on FPGA Combining Computation and Communication in OpenCL Programming Reviewed

Fujita, Norihisa, Kobayashi, Ryohei, Yamaguchi, Yoshiki, Boku, Taisuke

2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) 479 - 488 2019.7

　More details

Language：English Publishing type：Research paper (international conference proceedings) Publisher：IEEE

In recent years, Field Programmable Gate Array (FPGA) has been a topic of interest in High Performance Computing (HPC) research. Although the biggest problem in utilizing FPGAs for HPC applications is the difficulty of developing for FPGAs, this problem is being solved by High Level Synthesis (HLS). We focus on very high-performance inter-FPGA communication capabilities. The absolute floating-point performance of an FPGA is lower than that of other common accelerators such as GPUs. However, we consider that we can apply FPGAs to a wide variety of HPC applications if we can combine computations and communications on an FPGA. The purpose of this paper is to implement a parallel processing system running applications implemented by HLS combining computations and communications in FPGAs. We propose the Channel over Ethernet (CoE) system that connects multiple FPGAs directly for OpenCL parallel programming. "Channel" is one of the new extensions provided by the Intel OpenCL environment. Channels are ordinarily used for inter-kernel communication inside an FPGA, but we extend them to external communication through the CoE system. In this paper, we introduce two benchmarks as a demonstration of the CoE system. We achieved 29.77 Gbps in throughput (approximately 75% of the theoretical peak of 40 Gbps) and 950 ns in latency on our system using the pingpong benchmark, which was implemented on an Intel Arria 10 FPGA. In addition, we evaluated the Himeno benchmark, which is a kind of 3D Computational Fluid Dynamics (CFD) code, on the system, and we achieved 23689 MFLOPS with 4 FPGAs on a problem of size M. We also observe strong scalability, with a 3.93 times speedup compared to a single FPGA run on the same problem size.

DOI： 10.1109/IPDPSW.2019.00089

researchmap
Optimization on Astrophysical Radiative Transfer Code for FPGAs with OpenCL Reviewed

藤田, 典久, 小林, 諒平, 山口, 佳樹, 朴, 泰祐, 吉川, 耕司, 安部, 牧人, 梅村, 雅之

IPSJ Transactions on Advanced Computing Systems 12 ( 3 ) 64 - 75 2019.7

　More details

Language：Japanese Publishing type：Research paper (scientific journal)

One of the recent challenges faced by HPC is how to apply FPGA technology to accelerate a next-generation supercomputer as an efficient method of achieving high performance and low power consumption. GPU is the most commonly used accelerator for HPC supported by regularly executed high degree of parallel operations which causes performance bottleneck in some cases. On the other hand, there are great opportunities to flexibly and efficiently utilize FPGAs in reconfigurable circuits to fit various computing situations. However, it is not easy for application developers to implement FPGA logic circuits for their applications and algorithms, which generally require complicated hardware logic descriptions. Because of the progress made in the FPGA development environment in recent years, the HLS development environment using the OpenCL language has become popular. Based on our experience describing kernels using OpenCL, we found that a more aggressive programming strategy is necessary to realize true high performance based on a “co-design” concept to implement the necessary features and operations to fit the target application in an FPGA design. In this paper, we optimize the ART method used in space radiative transfer problems on an FPGA using OpenCL. Using a co-designed method for the optimized programming of a specific application with OpenCL for an FPGA, we achieved a performance that is 4.9 times faster than that of a multicore CPU implementation, and almost the same performance as a GPU implementation. Considering the current advanced FPGAs with interconnection features, we believe that their parallelized implementation with multiple FPGAs will achieve a higher performance than GPU.

researchmap
OpenCL対応FPGA間通信機能によるGPU・FPGA複合型演算加速

小林, 諒平, 藤田, 典久, 山口, 佳樹, 中道, 安祐未, 朴, 泰祐

研究報告ハイパフォーマンスコンピューティング（HPC） 2019-HPC-170 ( 5 ) 1 - 9 2019.7

　More details

Authorship：Corresponding author Language：Japanese Publishing type：Research paper (conference, symposium, etc.)

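We have been studying GPU-FPGA hybrid systems in which a GPU (Graphics Processing Unit), which offers high computational performance and memory bandwidth, is coupled with an FPGA (Field Programmable Gate Array), which excels in computation and communication performance, so that the two are used in a complementary manner. On a system equipped with different hardware such as GPUs and FPGAs, an important issue is how to program the computations executed on each device and make all devices operate cooperatively. In this paper, we therefore propose a GPU-FPGA cooperation method across multiple nodes that combines an inter-FPGA communication technique and a GPU-FPGA DMA transfer technique, both controllable from OpenCL code. GPU-FPGA DMA transfer is realized by mapping the global memory of the GPU device into the PCIe address space and finally writing a descriptor, created within an OpenCL kernel on the basis of the address mapping result, to the PCIe DMA controller in the FPGA. Inter-FPGA communication is realized by a system in which hardware that performs Ethernet communication, implemented in Verilog HDL, is connected through an I/O Channel to the control module for that hardware (an OpenCL kernel). Using the proposed method, we implemented a pingpong benchmark between GPUs on different nodes and confirmed that it operates correctly.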

researchmap
GPU・FPGA複合演算加速による輻射流体シミュレーションコードARGOTの実装

中道, 安祐未, 藤田, 典久, 小林, 諒平, 朴, 泰祐, 吉川, 耕司, 梅村, 雅之

研究報告ハイパフォーマンスコンピューティング（HPC） 2019-HPC-170 ( 22 ) 1 - 5 2019.7

　More details

Language：Japanese Publishing type：Research paper (conference, symposium, etc.)

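In recent years, large-scale computing clusters equipped with accelerators have become one of the mainstream approaches in the high-performance computing (HPC) field. Graphics Processing Units (GPUs) are the main accelerators in use, but Field Programmable Gate Arrays (FPGAs) are attracting growing attention in HPC for their processing flexibility and high power efficiency. We therefore aim to further improve the performance of real applications with a combined GPU+FPGA system in which the FPGA performs the computations that GPUs handle poorly. In our previous presentation, we discussed a development method and environment for programs that realize GPU+FPGA hybrid acceleration on a computer equipped with both a GPU and an FPGA. Having established a method for making the GPU and the FPGA cooperate, in this study we use that method to implement the radiation hydrodynamics simulation code ARGOT. The code has conventionally been accelerated using CPUs and GPUs; in this study, based on the characteristics of the algorithms, we incorporated OpenCL-based source code for the algorithm that can be accelerated further on an FPGA. Although the implementation is not yet complete, we discuss the implementation.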

researchmap
GPU-FPGA Heterogeneous Computing with OpenCL-Enabled Direct Memory Access Reviewed

Kobayashi, Ryohei, Fujita, Norihisa, Yamaguchi, Yoshiki, Nakamichi, Ayumi, Boku, Taisuke

2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) 489 - 498 2019.7

　More details

Authorship：Corresponding author Language：English Publishing type：Research paper (international conference proceedings) Publisher：IEEE

Field-programmable gate arrays (FPGAs) have garnered significant interest in research on high-performance computing because their computation and communication capabilities have drastically improved in recent years due to advances in semiconductor integration technologies that rely on Moore's Law. In addition to improving FPGA performance, toolchains for the development of FPGAs in OpenCL have been developed and offered by FPGA vendors that reduce the programming effort required. These improvements reveal the possibility of implementing a concept to enable on-the-fly offloading computation at which CPUs/GPUs perform poorly to FPGAs while performing low-latency data movement. We think that this concept is key to improving the performance of heterogeneous supercomputers using accelerators such as the GPU. In this paper, we propose an OpenCL-enabled data movement method to directly access the global memory of the GPU and show how to implement cooperative GPU-FPGA computation using it. The results of experiments show that our proposed method can achieve a latency of 0.59 μs and a data transfer rate as high as 7.0 GB/s between the GPU and the FPGA, thus confirming that it is effective at realizing high-performance cooperative GPU-FPGA computation.

DOI： 10.1109/IPDPSW.2019.00090

researchmap
GPU-FPGA協調計算を記述するためのプログラミング環境に関する研究

綱島, 隆太, 小林, 諒平, 藤田, 典久, 中道, 安祐未, 朴, 泰祐

研究報告ハイパフォーマンスコンピューティング（HPC） 2019-HPC-169 ( 10 ) 1 - 9 2019.5

　More details

Language：Japanese Publishing type：Research paper (conference, symposium, etc.)

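In recent years, many of the top-level machines in the high-performance computing (HPC) field have been large-scale computing clusters equipped with accelerators. Graphics Processing Units (GPUs), which offer high computational performance and memory bandwidth, are the main accelerators in use, but there remain computations that GPUs handle poorly, such as processing with frequent conditional branches and processing with little parallelism that cannot use a large number of cores, and these hinder performance improvement. To address this problem, we are pursuing an approach that implements circuits for the processing GPUs handle poorly on a Field Programmable Gate Array (FPGA), an integrated circuit in which arbitrary logic circuits can be programmed, and offloads that processing to the FPGA as appropriate, thereby improving the performance of the whole application. However, the GPU and FPGA kernels must be developed in different programming languages, CUDA and OpenCL respectively, and such multilingual programming is a heavy burden on users. In this study, we therefore examine a programming environment based on OpenACC that enables integrated control of both accelerators on a computer system equipped with a GPU and an FPGA. In this report, we verified whether the two accelerators can cooperate when separate GPU-targeted and FPGA-targeted files written in OpenACC are linked at compile time. As a result, we confirmed that GPU-FPGA cooperative computation can be realized with OpenACC descriptions alone.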

J-GLOBAL

researchmap
高位設計と低位設計の違いとFPGA演算性能の関係について

横野, 智也, 山口, 佳樹, 藤田, 典久, 小林, 諒平, 朴, 泰祐, 吉川, 耕司, 安部, 牧人, 梅村, 雅之

情報処理学会第81回全国大会講演論文集 2019.3

　More details

Language：Japanese Publishing type：Research paper (conference, symposium, etc.)

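Now that the circuit scale of a single FPGA chip exceeds one million system gates, it is becoming difficult to grasp all of its behavior and achieve complete optimization through RTL (Register Transfer Level) design. Attention is therefore turning to HLS (High Level Synthesis) design using high-level description languages. HLS design and development environments are maturing, including Intel's Intel SDK for OpenCL and Xilinx's Vivado HLS and SDAccel. In environments such as data centers, where many users share the system and multiple FPGAs operate in parallel, continuing to treat RTL design as the only option is unrealistic from the standpoint of usability. On the other hand, from the standpoint of high-performance computing, making HLS design the only option is considered premature at present. In this paper, we therefore evaluate and discuss the current state of HDL design and HLS design even-handedly, and examine next-generation heterogeneous high-performance computing and the potential role of FPGAs in it.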

researchmap
GPU・FPGA混載ノードにおけるヘテロ演算加速プログラム環境に関する研究

中道, 安祐未, 小林, 諒平, 藤田, 典久, 朴, 泰祐

研究報告ハイパフォーマンスコンピューティング（HPC） 2019-HPC-168 ( 10 ) 1 - 7 2019.2

　More details

Language：Japanese Publishing type：Research paper (conference, symposium, etc.)

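In recent years, large-scale computing clusters equipped with accelerators have become one of the mainstream approaches in the high-performance computing (HPC) field. Graphics Processing Units (GPUs) are the main accelerators in use, but Field Programmable Gate Arrays (FPGAs) are attracting growing attention in HPC for their processing flexibility and high power efficiency. We therefore aim to further improve the performance of real applications with a combined GPU + FPGA system in which the FPGA performs the computations that GPUs handle poorly. In this study, we discuss a development method and environment for programs that realize GPU + FPGA hybrid acceleration on a computer equipped with both a GPU and an FPGA.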

researchmap
異デバイス間でのPCIe通信を実現するOpenCL対応FPGAモジュールの提案と検証

小林, 諒平, 藤田, 典久, 山口, 佳樹, 朴, 泰祐

IEICE-RECONF2018-63 IEICE-118 ( 432 ) 107 - 112 2019.1

　More details

Authorship：Corresponding author Language：Japanese Publishing type：Research paper (conference, symposium, etc.)

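We have been studying GPU-FPGA hybrid systems in which a GPU (Graphics Processing Unit), which offers high computational performance and memory bandwidth, is coupled with an FPGA (Field Programmable Gate Array), which excels in computation and communication performance, so that the two are used in a complementary manner. On a system equipped with different hardware such as GPUs and FPGAs, an important issue is how to program the computations executed on each device and make all devices operate cooperatively. In this paper, we therefore propose inter-device data transfer that can be controlled from OpenCL code. A descriptor created on the basis of the PCIe address mapping result of the GPU device memory is sent to the FPGA and written to the PCIe DMA controller in the FPGA, which realizes data transfer between the global memory of the GPU device and the external memory of the FPGA device without going through the CPU. Evaluating the proposed method in terms of communication latency and communication bandwidth, we observed a performance difference of up to 33.3 times in latency and up to 2.0 times in bandwidth compared with the conventional method.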

researchmap
OpenCL-enabled high performance direct memory access for GPU-FPGA cooperative computation Reviewed

Kobayashi, Ryohei, Fujita, Norihisa, Yamaguchi, Yoshiki, Boku, Taisuke

Proceedings of the HPC Asia 2019 Workshops 6 - 9 2019.1

　More details

Authorship：Corresponding author Language：English Publishing type：Research paper (international conference proceedings) Publisher：Association for Computing Machinery

Field programmable gate arrays (FPGAs) have gained attention in high-performance computing (HPC) research because their computation and communication capabilities have dramatically improved in recent years as a result of improvements to semiconductor integration technologies that depend on Moore's Law. In addition to FPGA performance improvements, OpenCL-based FPGA development toolchains have been developed and offered by FPGA vendors, which reduces the programming effort required as compared to the past. These improvements reveal the possibilities of realizing a concept to enable on-the-fly offloading computation at which CPUs/GPUs perform poorly to FPGAs while performing low-latency data movement. We think that this concept is one of the keys to further improving the performance of modern heterogeneous supercomputers using accelerators like GPUs. In this paper, we propose a high-performance GPU-FPGA data communication method using OpenCL and Verilog HDL mixed programming in order to make both devices work together smoothly. OpenCL is used to program application algorithms and data movement control, while Verilog HDL is used to implement low-level components for memory copies between the two devices. Experimental results using toy programs showed that our proposed method achieves a latency of 0.6 μs and a data transfer rate of as much as 6.9 GB/s between the GPU and the FPGA, thus confirming that the proposed method is effective at realizing high-performance GPU-FPGA cooperative computation.

DOI： 10.1145/3317576.3317581

researchmap
OpenCLとVerilog HDLの混合記述によるGPU-FPGAデバイス間連携

小林, 諒平, 藤田, 典久, 山口, 佳樹, 朴, 泰祐

研究報告ハイパフォーマンスコンピューティング（HPC） 2018-HPC-167 ( 11 ) 1 - 10 2018.12

　More details

Authorship：Corresponding author Language：Japanese Publishing type：Research paper (conference, symposium, etc.)

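We have been studying GPU-FPGA hybrid systems in which a GPU (Graphics Processing Unit), which offers high computational performance and memory bandwidth, is coupled with an FPGA (Field Programmable Gate Array), which excels in computation and communication performance, so that the two are used in a complementary manner. On a system equipped with different hardware such as GPUs and FPGAs, an important issue is how to program the computations executed on each device and make all devices operate cooperatively. In this paper, we therefore propose inter-device data transfer for efficiently coordinating GPU programming and FPGA programming. A descriptor created on the basis of the PCIe address mapping result of the GPU device memory is sent to the FPGA and written to the PCIe DMA controller in the FPGA, which realizes data transfer between the global memory of the GPU device and the external memory of the FPGA device without going through the CPU. Evaluating the proposed method in terms of communication latency and communication bandwidth, we observed a performance difference of up to 83 times in latency and up to 2.4 times in bandwidth compared with the conventional method.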

researchmap
OpenCLによるFPGA上の演算と通信を融合した並列処理システムの実装及び性能評価

藤田, 典久, 小林, 諒平, 山口, 佳樹, 朴, 泰祐

研究報告ハイパフォーマンスコンピューティング（HPC） 2018-HPC-167 ( 9 ) 1 - 9 2018.12

　More details

Language：Japanese Publishing type：Research paper (conference, symposium, etc.)

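In recent years, the Field Programmable Gate Array (FPGA), a form of reconfigurable hardware, has been attracting attention in the high-performance computing field as a next-generation accelerator. The barrier to using FPGAs for high-performance computing has been the difficulty of development, but this problem is being resolved as high-level synthesis techniques advance. The latest FPGAs offer communication performance of up to 100 Gbps × 4, and we focus on this powerful communication capability. Although the absolute performance of an FPGA is lower than that of other accelerators, we believe that FPGAs can be applied to a wider range of problems by combining their computation and communication capabilities. The purpose of this study is to realize a parallel processing system by operating the communication mechanism from FPGA applications written with high-level synthesis. We evaluate not only communication throughput and latency but also performance using the Himeno benchmark, and show that parallel computation is possible with FPGA applications written with high-level synthesis. We have been developing a system called Channel over Ethernet (CoE) as an environment for direct communication between FPGAs; it achieved a maximum bandwidth of 7.13 Gbps and a latency of 980 ns for 4-byte communication. On the Himeno benchmark, we obtained 22659 MFLOPS when running problem size M on 4 FPGAs, a good strong-scaling result of 3.61 times the performance of 1 FPGA.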

researchmap
FPGAによる宇宙輻射輸送シミュレーションの演算加速

横野, 智也, 藤田, 典久, 山口, 佳樹, 大畠, 佑真, 小林, 諒平, 朴, 泰祐, 吉川, 耕司, 安部, 牧人, 梅村, 雅之

IEICE-RECONF2018-25 118 ( 215 ) 35 - 40 2018.9

　More details

Language：Japanese Publishing type：Research paper (conference, symposium, etc.)

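We have previously proposed an architecture called TCA (Tightly Coupled Accelerators), which tightly couples accelerators and communicates with low latency, and have developed PEACH2 (PCI Express Adaptive Communication Hub Ver.2) as a TCA implementation using an FPGA (Field Programmable Gate Array). Building on this work, we are now studying a concept called AiS (Accelerators in Switch) as an architecture that takes the TCA concept further. AiS is a next-generation parallel acceleration mechanism that embeds application-specific computation units in the communication mechanism and realizes tighter cooperation between computation and communication inside the FPGA. In this paper, as an evaluation toward realizing AiS, we implement the ART (Authentic Radiation Transfer) method, used in the space radiative transfer simulation ARGOT (Accelerated Radiative transfer on Grids using Oct-Tree), on different FPGAs (Xilinx/Intel) and evaluate it. This is because the simulation contains, in nearly equal proportion, parts that are accelerated by accelerators such as GPUs and parts that are not, so cooperative computation with an architecture different from the GPU is required. When the ART method was implemented on FPGAs, both devices achieved a speedup compared with the CPU.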

researchmap
GPU-FPGA複合システムにおけるデバイス間連携機構

小林, 諒平, 阿部, 昂之, 藤田, 典久, 山口, 佳樹, 朴, 泰祐

研究報告ハイパフォーマンスコンピューティング（HPC） 2018-HPC-165 ( 26 ) 1 - 8 2018.7

　More details

Authorship：Corresponding author Language：Japanese Publishing type：Research paper (conference, symposium, etc.)

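We have been studying GPU-FPGA hybrid systems in which a GPU (Graphics Processing Unit), which offers high computational performance and memory bandwidth, is coupled with an FPGA (Field Programmable Gate Array), which excels in computation and communication performance, so that the two are used in a complementary manner. On a system equipped with different hardware such as GPUs and FPGAs, an important issue is how to program the computations executed on each device and make all devices operate cooperatively. In this paper, we therefore propose inter-device data transfer for efficiently coordinating GPU programming and FPGA programming. A descriptor created on the basis of the PCIe address mapping result of the GPU device memory is sent to the FPGA and written to the PCIe DMA controller in the FPGA, which realizes data transfer between the global memory of the GPU device and the internal memory of the FPGA device without going through the CPU. Evaluating the proposed method in terms of communication latency and communication bandwidth, we observed a performance difference of up to 8.4 times in latency and up to 3.7 times in bandwidth compared with the conventional method.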

researchmap
並列FPGAシステムにおけるOpenCLを用いた宇宙輻射輸送コードの演算加速

藤田, 典久, 小林, 諒平, 山口, 佳樹, 朴, 泰祐, 吉川, 耕司, 安部, 牧人, 梅村, 雅之

研究報告ハイパフォーマンスコンピューティング（HPC） 2018-HPC-165 ( 27 ) 1 - 8 2018.7

　More details

Language：Japanese Publishing type：Research paper (conference, symposium, etc.)

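One of the challenges in High Performance Computing (HPC) attracting attention in recent years is how to use Field Programmable Gate Array (FPGA) technology to achieve high performance and low power consumption in next-generation supercomputer systems. Conventionally, it has been difficult for software developers to develop FPGA circuits using a Hardware Description Language (HDL), but with recent advances in FPGA development environments, the use of high-level synthesis is becoming common and FPGA development without writing HDL is becoming possible. In this study, we describe in OpenCL and optimize for FPGAs the Authentic Radiation Transfer (ART) method, an algorithm used in Accelerated Radiative transfer on Grids using Oct-Tree (ARGOT), a program that solves the radiative transfer important to research on the early universe; as future work, we also describe how to parallelize the ART computation across multiple FPGAs. In our previous work, only problems small enough to fit in the Block RAM (BRAM) inside the FPGA could be solved, which did not cover the problem sizes that ARGOT actually needs to compute, but by also using large-capacity DDR memory, practical problem sizes can now be solved on the FPGA. We compared performance among the CPU, GPU, and FPGA and achieved a 6.9 times speedup over the CPU and performance comparable to the GPU. Although the FPGA implementation performs on par with the GPU, an FPGA, which can operate the communication mechanism by itself, has lower communication overhead than a GPU, so we expect its performance in parallel computation to exceed that of the GPU; we plan to implement parallel FPGA computation in the future.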

researchmap
Accelerating Space Radiative Transfer on FPGA using OpenCL Reviewed

Fujita, Norihisa, Kobayashi, Ryohei, Yamaguchi, Yoshiki, Oobata, Yuma, Boku, Taisuke, Abe, Makito, Yoshikawa, Kohji, Umemura, Masayuki

HEART 2018 Proceedings of the 9th International Symposium on Highly-Efficient Accelerators and Reconfigurable Technologies Article No. 6 6:1 - 6:7 2018.6

　More details

Language：English Publishing type：Research paper (international conference proceedings)

One of the recent challenges faced by High-Performance Computing (HPC) is how to apply Field-Programmable Gate Array (FPGA) technology to accelerate a next-generation supercomputer as an efficient method of achieving high performance and low power consumption. Graphics Processing Unit (GPU) is the most commonly used accelerator for HPC supported by regularly executed high degree of parallel operations which causes performance bottleneck in some cases. On the other hand, there are great opportunities to flexibly and efficiently utilize FPGAs in logic circuits to fit various computing situations. However, it is not easy for application developers to implement FPGA logic circuits for their applications and algorithms, which generally require complicated hardware logic descriptions. Because of the progress made in the FPGA development environment in recent years, the High-Level Synthesis (HLS) development environment using the OpenCL language has become popular. Based on our experience describing kernels using OpenCL, we found that a more aggressive programming strategy is necessary to realize true high performance based on a "codesign" concept to implement the necessary features and operations to fit the target application in an FPGA design. In this paper, we optimize the Authentic Radiation Transfer (ART) method on an FPGA using OpenCL. We also discuss a method to parallelize its computation in an FPGA and a method to optimize the OpenCL code on FPGAs. Using a codesigned method for the optimized programming of a specific application with OpenCL for an FPGA, we achieved a performance that is 6.9 times faster than that of a CPU implementation using OpenMP, and almost the same performance as a GPU implementation using CUDA. The ART code should work on a larger configuration with multiple FPGAs requiring interconnections between them. Considering the current advanced FPGAs with interconnection features, we believe that their parallelized implementation with multiple FPGAs will achieve a higher performance than GPU.

DOI： 10.1145/3241793.3241799

CiNii Research

researchmap
複数のFPGAによる分散ソーティングの実現に向けた予備評価

小林, 諒平, 藤田, 典久, 大畠, 佑真, 山口, 佳樹, 朴, 泰祐

Technical report of IEICE. EA 118 ( 63 ) 65 - 70 2018.5

　More details

Authorship：Corresponding author Language：Japanese Publishing type：Research paper (conference, symposium, etc.) Publisher：The Institute of Electronics, Information and Communication Engineers

researchmap
ArchHDL: A novel hardware RTL modeling and high-speed simulation environment Reviewed

Shimpei Sato, Ryohei Kobayashi, Kenji Kise

IEICE Transactions on Information and Systems E101D ( 2 ) 344 - 353 2018.2

　More details

Language：English Publishing type：Research paper (scientific journal) Publisher：Institute of Electronics, Information and Communication Engineers (IEICE)

DOI： 10.1587/transinf.2017RCP0012

Scopus

researchmap
宇宙輻射輸送計算におけるHDL設計とOpenCL設計の比較

横野, 智也, 藤田, 典久, 山口, 佳樹, 大畠, 佑真, 小林, 諒平, 朴, 泰祐, 吉川, 耕司, 安部, 牧人, 梅村, 雅之

情報処理学会研究報告ハイパフォーマンスコンピューティング（HPC） 2018-HPC-163 ( 24 ) 1 - 8 2018.2

　More details

Language：Japanese Publishing type：Research paper (conference, symposium, etc.)

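Higher semiconductor integration has made FPGAs larger, more capable, and cheaper, and their adoption is now being considered not only for embedded systems but also for high-performance computing. However, FPGA development is still dominated by design in hardware description languages (HDLs), and the usability of FPGAs is severely constrained by the difficulty of development. For high-performance computing applications of FPGAs, design with high-level descriptions such as C and OpenCL is an option, but although there are qualitative discussions of aspects such as development efficiency, few reports compare computational performance quantitatively. In this paper, we therefore use space radiative transfer computation as a benchmark to compare high-level design (HLS design in OpenCL) with low-level design (RTL design in Verilog HDL), and discuss the usability and computational performance of FPGAs from the viewpoint of high-performance computing applications. Specifically, we implemented on FPGAs the ART (Authentic Radiation Transfer) method, which solves the radiative transfer of recombination photons in simulations of primordial galaxy formation, and compared the computational performance. Setting aside fine tuning of the arithmetic circuits and system-level design including external interfaces, and despite the difference in devices used (XILINX and Intel), we confirmed that comparable performance can be obtained regardless of the description method.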

researchmap
OpenCL-ready high speed FPGA network for reconfigurable high performance computing Reviewed

Ryohei Kobayashi, Yuma Oobata, Norihisa Fujita, Yoshiki Yamaguchi, Taisuke Boku

ACM International Conference Proceeding Series 192 - 201 2018.1

　More details

Language：English Publishing type：Research paper (international conference proceedings) Publisher：Association for Computing Machinery

DOI： 10.1145/3149457.3149479

Scopus

researchmap
OpenCLを用いたFPGAによる宇宙輻射輸送シミュレーションの演算加速

藤田, 典久, 小林, 諒平, 山口, 佳樹, 大畠, 佑真, 朴, 泰祐, 吉川, 耕司, 安部, 牧人, 梅村, 雅之

情報処理学会研究報告ハイパフォーマンスコンピューティング（HPC） 2017-HPC-161 ( 12 ) 1 - 9 2017.9

　More details

Language：Japanese Publishing type：Research paper (conference, symposium, etc.)

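We have previously proposed an architecture called TCA (Tightly Coupled Accelerators), which tightly couples accelerators and communicates with low latency, and have developed PEACH2 (PCI Express Adaptive Communication Hub Ver. 2) as a TCA implementation using an FPGA (Field Programmable Gate Array). Building on this work, we are now studying a concept called AiS (Accelerators in Switch) as an architecture that takes the TCA concept further. AiS is a next-generation parallel acceleration mechanism that embeds application-specific computation units in the communication mechanism and realizes tighter cooperation between computation and communication inside the FPGA. Research on embedding computation units in PEACH2 has been carried out before, but PEACH2 is written entirely in Verilog HDL (Hardware Description Language), so the computation part of AiS also had to be written in Verilog HDL; the development cost was high, and only FPGA experts could do the development. Recent advances in FPGA development environments have made it possible to realize AiS in a more general environment, fast communication mechanisms of 40 Gbps and 100 Gbps are now available, and high-level synthesis, a technique that synthesizes circuits from languages used for software, has become widespread. For Intel FPGAs there is a high-level synthesis toolchain based on OpenCL, which enables not only circuit generation from the OpenCL language but also control of the FPGA through the OpenCL API. However, it is known that OpenCL code written and optimized for CPUs or GPUs does not perform well when used as is, so how to optimize for FPGAs is the challenge. In this paper, using Intel FPGA SDK for OpenCL, a high-level synthesis development environment for Intel FPGAs, we optimize for FPGAs the ART method used in the space radiative transfer simulation code ARGOT. We describe how parallel computation is performed inside the FPGA and what FPGA-oriented optimizations are applied in implementing the ART method. In a performance evaluation on an Intel Arria 10 FPGA, we obtained a 14.6 times speedup over the CPU implementation; the implementation used 63% of the circuit resources and ran at an operating frequency of 236.11 MHz.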

researchmap
OpenCLとVerilog HDLの混合記述によるFPGA間Ethernet接続

大畠, 佑真, 小林, 諒平, 藤田, 典久, 山口, 佳樹, 朴, 泰祐

情報処理学会研究報告ハイパフォーマンスコンピューティング（HPC） 2017.7

　More details

Language：Japanese Publishing type：Research paper (conference, symposium, etc.)

researchmap
A High Performance FPGA-Based Sorting Accelerator with a Data Compression Mechanism Reviewed

Ryohei Kobayashi, Kenji Kise

IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS E100D ( 5 ) 1003 - 1015 2017.5

　More details

Language：English Publishing type：Research paper (scientific journal)

DOI： 10.1587/transinf.2016EDP7383

Web of Science

researchmap
高位合成によるFPGAの高性能計算へ適用 Reviewed

大畠, 佑真, 藤田, 典久, 小林, 諒平, 山口, 佳樹, 朴, 泰祐

ハイパフォーマンスコンピューティングと計算科学シンポジウム論文集 2017.5

　More details

Language：Japanese Publishing type：Research paper (conference, symposium, etc.)

researchmap
OpenCLとVerilog HDLの混合記述によるFPGAプログラミング

藤田, 典久, 大畠, 佑真, 小林, 諒平, 山口, 佳樹, 朴, 泰祐

情報処理学会研究報告ハイパフォーマンスコンピューティング（HPC） 2017-HPC-158 ( 16 ) 1 - 9 2017.3

　More details

Language：Japanese Publishing type：Research paper (conference, symposium, etc.)

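We have proposed a concept called TCA (Tightly Coupled Accelerators) as a mechanism for improving inter-node communication performance between accelerators in PC clusters equipped with accelerators such as GPUs. By implementing its prototype on an FPGA (Field Programmable Gate Array), we also advocate Accelerator in Switch, a new concept that not only fuses acceleration and communication but also embeds application-specific computation functions in the communication mechanism. In recent years, advances in FPGA hardware and development environments have made it possible to realize Accelerator in Switch in a more general environment. On the hardware side, fast external links such as 40Gb / 100Gb Ethernet are now provided, and high-level synthesis, which allows C, C++, OpenCL, and other languages to be used for FPGA development, is spreading. Against this background, an environment for bringing Accelerator in Switch to application users is taking shape. In this paper, we examine a method of programming Accelerator in Switch with OpenCL and Verilog HDL together, using Verilog HDL descriptions in parallel to supplement functions that cannot be written in OpenCL. We aim for higher-performance and more efficient circuit implementations by examining how to connect external peripherals such as communication mechanisms to OpenCL, and by replacing memory accesses and core computations with Verilog HDL and packaging them as libraries. As one example, when we turned a dot-product computation into a library, the program with the mixed description achieved an effective performance of about 90% of the theoretical peak, exceeding that of the program written in OpenCL alone. In addition, as an operation on external peripherals, we confirmed that hardware mounted on the board can be controlled from OpenCL.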

researchmap
A High-speed Verilog HDL Simulation Method using a Lightweight Translator Reviewed

Kobayashi, Ryohei, Misono, Tomohiro, Kise, Kenji

ACM SIGARCH Computer Architecture News - HEART '16 44 ( 4 ) 26 - 31 2016.9

　More details

Authorship：Corresponding author Language：English Publishing type：Research paper (international conference proceedings)

Designing with Hardware Description Languages (HDLs) is still the de facto standard way to develop FPGA-based custom computing systems, and RTL simulation is an important step in ensuring that the designed hardware behavior meets the design specification. In this paper, we propose a new high-speed Verilog HDL simulation method. It is based on two previously proposed techniques: ArchHDL and Pyverilog. ArchHDL is used as the simulation engine because the RTL simulation it provides can be parallelized with OpenMP. We use Pyverilog to develop a code translator that converts Verilog HDL source code into ArchHDL code, which keeps the translator's implementation lightweight. We compare the proposed method with Synopsys VCS, and the experimental results show that the proposed method produces the same RTL simulation behavior as Synopsys VCS and runs up to 5.8x faster.

DOI： 10.1145/3039902.3039908

researchmap
Effective Parallel Simulation of ArchHDL under Manycore Environment Reviewed

Tomohiro Misono, Ryohei Kobayashi, Shimpei Sato, Kenji Kise

Proceedings - 2015 3rd International Symposium on Computing and Networking, CANDAR 2015 140 - 146 2016.3

　More details

Language：English Publishing type：Research paper (international conference proceedings) Publisher：Institute of Electrical and Electronics Engineers Inc.

DOI： 10.1109/CANDAR.2015.93

Scopus

researchmap
世界最速のFPGAソーティングアクセラレータの初期検討

臼井, 琢真, 眞下, 達, 松田, 裕貴, 小林, 諒平, 吉瀬, 謙二

第78回全国大会講演論文集 2016 ( 1 ) 149 - 150 2016.3

　More details

Language：Japanese Publishing type：Research paper (conference, symposium, etc.)

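Sorting is a very important computation kernel used in a wide range of applications such as databases, image processing, and data compression. Many acceleration techniques have therefore been proposed, some of which use FPGAs. Because an FPGA is an LSI whose internal configuration can be freely designed by the user, implementing application-specific compute circuits and data supply mechanisms can yield accelerators with higher computation performance than CPUs and GPUs. In this paper, we examine approaches toward realizing the world's fastest FPGA-based sorting accelerator.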

researchmap
Frix: Feasible and Reconfigurable IBM PC Compatible SoC

Matsuda, Yuki, Ogawa, Eri, Misono, Tomohiro, Kobayashi, Ryohei, Kise, Kenji

第78回全国大会講演論文集 2016 ( 1 ) 151 - 152 2016.3

　More details

Language：English Publishing type：Research paper (conference, symposium, etc.)

In order to develop high performance computer systems effectively, environments for evaluating architectural ideas are required. For this purpose, software-based simulators are often used, but they suffer from slow simulation speed. To achieve fast simulation, hardware environments are desirable. We propose Frix (Feasible and Reconfigurable IBM PC Compatible SoC), an FPGA-based evaluation environment with an x86 soft processor. Frix can boot general-purpose operating systems, including FreeDOS and TinyCore. The source code of Frix is written in Verilog HDL and released as open source. In this paper, we detail the design of Frix and show how to use it for research and education.

researchmap
SSDの並列性を引き出すI/Oスケジューラ

奥村, 開里, 小林, 諒平, 吉瀬, 謙二

研究報告システムソフトウェアとオペレーティング・システム（OS） 2015-OS-135 ( 14 ) 1 - 8 2015.11

　More details

Language：Japanese Publishing type：Research paper (conference, symposium, etc.)

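In recent years, Solid State Drives (SSDs) have come into use not only in personal computers but across a wide range of settings such as cloud storage and data centers. To improve performance, SSDs process I/O in parallel across multiple channels and multiple chips per channel, but no SSD scheduler that takes this into account is built into the OS. In this paper, we therefore propose the Alleviate Conflict (AC) scheduler, which aims to reduce latency and improve throughput by extracting the parallelism of SSDs. We implemented the proposed scheduler in Linux and evaluated SSD bandwidth and latency using a variety of I/O request patterns. For an I/O access pattern close to that of a web server, we compared the proposed I/O scheduler with the Noop, Deadline, and CFQ schedulers commonly used in the Linux kernel. It achieved 4% higher bandwidth and 15% lower latency than Noop, 7% higher bandwidth and 7% lower latency than Deadline, and 34% higher bandwidth and 40% lower latency than CFQ.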

researchmap
FPGAを用いた世界最速のソーティングハードウェアの実現に向けた試み

小林, 諒平, 吉瀬, 謙二

IEICE-RECONF2015-12 115 ( 109 ) 65 - 70 2015.6

　More details

Authorship：Corresponding author Language：Japanese Publishing type：Research paper (conference, symposium, etc.)

Sorting is an extremely important computation kernel that has been accelerated with FPGAs in many fields, such as databases, image processing, and data compression. FPGA-based accelerators can achieve higher computation performance than CPUs and GPUs, because designers can implement circuits that realize application-specific pipelined hardware and data supply systems. In this paper, we introduce several approaches to realizing the fastest FPGA-based sorting hardware in the world, and discuss our present system in comparison with prior work. Based on these approaches and a performance model, we show how to design sorting hardware whose performance equals that of the prior work with half the hardware resources.

researchmap
FPGAベースのソーティングアクセラレータの設計と実装

小林, 諒平, 吉瀬, 謙二

IEICE-CPSY2015-5 115 ( 7 ) 25 - 30 2015.4

　More details

Authorship：Corresponding author Language：Japanese Publishing type：Research paper (conference, symposium, etc.)

Sorting is an extremely important computation kernel that researchers in many fields, such as databases, image processing, and data compression, have tried to accelerate. We propose an FPGA-based accelerator that executes sorting at high speed. FPGA-based accelerators can achieve higher computation performance than CPUs and GPUs, because designers can implement circuits that realize application-specific pipelined hardware and data supply systems. Our proposed FPGA accelerator uses two approaches: “Sorting Network” and “Merge Sorter Tree”. In this paper, we detail the design and implementation of the proposed sorting accelerator and evaluate its performance. As a result, the sorting speed of the proposed hardware is up to 10.06x that of an Intel Core i7-4770 operating at 3.4GHz.
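
As a rough illustration of the two approaches named in the abstract above, the sketch below sorts four-key chunks with a fixed compare-exchange network and then merges the sorted runs through a binary tree of two-way mergers. It is a software analogy under stated assumptions, not the authors' hardware design: the chunk size of four and the names `NETWORK_4`, `sort4`, `merge_tree`, and `sort_keys` are hypothetical.

```python
from heapq import merge

# Hypothetical 4-input sorting network: a fixed list of compare-exchange pairs.
NETWORK_4 = [(0, 1), (2, 3), (0, 2), (1, 3), (1, 2)]


def sort4(chunk):
    """Sort exactly four keys with a data-independent compare-exchange sequence."""
    v = list(chunk)
    for i, j in NETWORK_4:
        if v[i] > v[j]:
            v[i], v[j] = v[j], v[i]
    return v


def merge_tree(runs):
    """Merge sorted runs pairwise, level by level, like a binary tree of 2-way mergers."""
    runs = [list(r) for r in runs]
    while len(runs) > 1:
        next_level = []
        for k in range(0, len(runs), 2):
            # Each pair of neighbouring runs feeds one 2-way merger.
            next_level.append(list(merge(*runs[k:k + 2])))
        runs = next_level
    return runs[0] if runs else []


def sort_keys(keys):
    """Pad to a multiple of four, sort each chunk with the network, then merge the runs."""
    pad = (-len(keys)) % 4
    padded = list(keys) + [float("inf")] * pad  # padding sinks to the end
    runs = [sort4(padded[i:i + 4]) for i in range(0, len(padded), 4)]
    return merge_tree(runs)[:len(keys)]
```

The network stage performs the same comparisons regardless of the data, which is what makes it map naturally onto pipelined hardware; the merge stage then combines the short sorted runs into one sequence.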

researchmap
Ultra High-speed FPGA Accelerator for Sorting Application

Kobayashi, Ryohei, Kise, Kenji

第77回全国大会講演論文集 2015 ( 1 ) 25 - 26 2015.3

　More details

Authorship：Corresponding author Language：English Publishing type：Research paper (conference, symposium, etc.)

FPGA accelerators can obtain higher computation performance and better power efficiency than CPUs and GPUs, because designers can implement circuits that realize application-specific pipelined hardware and data supply systems. In this paper, we propose an approach to accelerating sorting with a large FPGA. Sorting is an extremely important computation kernel that researchers in many fields have tried to accelerate. We design and implement the proposed FPGA accelerator and evaluate its performance against a modern desktop computer. From this evaluation, we show how sorting is accelerated.

researchmap
USB3.0接続の手軽で高速なFPGAアクセラレータの設計と実装

臼井, 琢真, 小林, 諒平, 吉瀬, 謙二

IEICE-RECONF2014-78 114 ( 428 ) 205 - 210 2015.1

　More details

Language：Japanese Publishing type：Research paper (conference, symposium, etc.)

FPGA accelerators can obtain higher computation performance and better power efficiency than CPUs and GPUs, because designers can implement circuits that realize application-specific pipelined hardware and data supply systems. In this paper, we propose a portable and high-speed FPGA accelerator employing USB3.0, a versatile, high-speed data transfer interface. We chose sorting as a practical application and designed and implemented a sorting accelerator. To show the high portability of the proposed FPGA accelerator, we evaluated it with several computers, including desktop and laptop PCs. As a result, the effective sorting performance of the proposed FPGA accelerator is 1.28 and 2.60 times higher than that of an Intel Core i7-3770K operating at 3.5GHz and an Intel i3-4010U operating at 1.83GHz, respectively. From this evaluation, we show that the proposed FPGA accelerator has high portability.

researchmap
Reconfigurable IBM PC Compatible SoC for Computer Architecture Education and Research Reviewed

Eri Ogawa, Yuki Matsuda, Tomohiro Misono, Ryohei Kobayashi, Kenji Kise

2015 IEEE 9TH INTERNATIONAL SYMPOSIUM ON EMBEDDED MULTICORE/MANYCORE SYSTEMS-ON-CHIP (MCSOC) 65 - 72 2015

　More details

Language：English Publishing type：Research paper (international conference proceedings)

DOI： 10.1109/MCSoC.2015.35

Web of Science

researchmap
FACE: Fast and Customizable Sorting Accelerator for Heterogeneous Many-core Systems Reviewed

Ryohei Kobayashi, Kenji Kise

2015 IEEE 9TH INTERNATIONAL SYMPOSIUM ON EMBEDDED MULTICORE/MANYCORE SYSTEMS-ON-CHIP (MCSOC) 49 - 56 2015

　More details

Language：English Publishing type：Research paper (international conference proceedings)

DOI： 10.1109/MCSoC.2015.40

Web of Science

researchmap
A challenge of portable and high-speed FPGA accelerator Reviewed

Takuma Usui, Ryohei Kobayashi, Kenji Kise

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 9040 383 - 392 2015

　More details

Language：English Publishing type：Research paper (international conference proceedings) Publisher：Springer Verlag

DOI： 10.1007/978-3-319-16214-0_34

Scopus

researchmap
3bOS: A flexible and lightweight embedded OS operated using only 3 buttons Reviewed International coauthorship

ImmanuelV, Encarnacion, Kobayashi, Ryohei, Kise, Kenji

組込みシステムシンポジウム2014論文集 2014 126 - 131 2014.10

　More details

Language：English Publishing type：Research paper (conference, symposium, etc.)

An embedded system we developed, the MieruEMB system, is used as an educational kit for learning implementation skills and knowledge regarding embedded systems. In this paper, we present 3bOS, a simple and easily customizable embedded OS running on the MieruEMB system. 3bOS comes with a three-button interface and a built-in file explorer for FAT file systems. 3bOS is capable of running ELF executables, providing approximately 400 KB of memory for an application. It also supports basic graphics functions. This embedded OS is written in C and consists of only around 800 lines of code. Because of its simplicity, users can easily understand how it runs on the MieruEMB system and can easily modify it. We show the design, the implementation, and the features of 3bOS, and conclude that 3bOS is usable for educational purposes.

researchmap
FPGAの消費電力を削減するHDLコーディング手法の検討

Kobayashi, Ryohei, Kise, Kenji

第76回全国大会講演論文集 2014 ( 1 ) 25 - 26 2014.3

　More details

Authorship：Corresponding author Language：English Publishing type：Research paper (conference, symposium, etc.)

The advantages of using FPGAs (Field Programmable Gate Arrays) are easy design changes, low respin costs, and shorter development time. However, these benefits come with disadvantages: higher power consumption, larger silicon area, and lower operating speed compared with ASICs. In particular, higher power consumption not only raises packaging costs, shortens chip lifetimes, and requires expensive cooling systems, but also decreases system reliability. Therefore, it is important to reduce the power consumption of FPGAs. In this paper, we compare HDL (Hardware Description Language) coding styles that have already been proposed to reduce FPGA power consumption, and seek a more effective approach.

researchmap
Scalable Stencil-computation Accelerator by Employing Multiple Small FPGAs Reviewed

小林, 諒平, 吉瀬, 謙二

IPSJ Transactions on Advanced Computing System 6 ( 4 ) 1 - 13 2013.10

　More details

Language：Japanese Publishing type：Research paper (scientific journal)

Stencil computation is one of the typical scientific computing kernels. It is applied in diverse areas such as earthquake simulation, digital signal processing, and fluid calculation. We have proposed a high-performance architecture for 2D stencil computation and implemented it by employing many small FPGAs. We developed the system in stages. First, we implemented a software simulator in C++ that emulates stencil computation on multiple FPGA nodes with cycle-level accuracy. Second, we implemented the circuits in Verilog HDL based on the software simulator. We then implemented the circuits on an FPGA array and verified the array. We evaluated the performance, scalability, and power consumption of the developed FPGA array. As a result, we established the validity of the proposed architecture, since the FPGA array operated successfully. The FPGA array with 100 FPGAs achieved about 0.6 GFlop/sW. This performance-per-watt value is about 3.8 times better than that of a typical CPU card.

researchmap
Development of Scalable Stencil-Computation Accelerator Based on Multiple Small FPGAs Reviewed

小林, 諒平, 高前田(山崎), 伸也, 吉瀬, 謙二

先進的計算基盤システムシンポジウム論文集 2013 179 - 187 2013.5

　More details

Authorship：Corresponding author Language：Japanese Publishing type：Research paper (conference, symposium, etc.)

Stencil computation is one of the typical scientific computing kernels. It is applied in diverse areas such as earthquake simulation, digital signal processing, and fluid calculation. We have proposed a high-performance architecture for 2D stencil computation and implemented it by using multiple small FPGAs. We developed the system in stages. First, we implemented a software simulator in C++ that emulates stencil computation on multiple FPGA nodes with cycle-level accuracy. Second, we implemented the circuits in Verilog HDL based on the software simulator. We then implemented the circuits on an FPGA array and verified the array. We evaluated the performance, scalability, and power consumption of the developed FPGA array. As a result, we established the validity of the proposed architecture, since the FPGA array operated successfully. The FPGA array with 100 FPGAs achieved about 0.6 GFlop/sW. This performance-per-watt value is about 3.8 times better than that of a typical CPU card.

researchmap
Design of Synchronization Mechanism to Conquer the Clock Oscillator Variation for High Performance Stencil Computation Accelerator

Kobayashi, Ryohei, Takamaeda-Yamazaki, Shinya, Kise, Kenji

第75回全国大会講演論文集 2013 ( 1 ) 133 - 134 2013.3

　More details

Language：English Publishing type：Research paper (conference, symposium, etc.)

Stencil computation is one of the typical scientific computing kernels. It is applied in diverse areas such as earthquake simulation and seismic imaging for the oil and gas exploration industry. We have proposed an effective stencil computation method and architecture that employs multiple small FPGAs in a 2D-mesh topology. However, when we implemented the stencil computation accelerator, we found that it does not operate stably because of clock oscillator variation. This variation arises because each FPGA node in the accelerator has its own clock domain. In this paper, we evaluate the clock oscillator variation quantitatively and describe the design of a synchronization mechanism that overcomes the variation so that the accelerator operates successfully.

CiNii Books

researchmap
メッシュ接続FPGAアレーを用いた高性能ステンシル計算機の設計と実装

小林, 諒平, 高前田(山崎), 伸也, 吉瀬, 謙二

IEICE-RECONF2012-88 112 ( 377 ) 159 - 164 2013.1

　More details

Authorship：Corresponding author Language：Japanese Publishing type：Research paper (conference, symposium, etc.)

We are developing an effective stencil computation accelerator that uses a 2D-mesh architecture connecting multiple small FPGAs. During development, we encountered a problem: the system produces incorrect computation results when multiple FPGA nodes are used. The cause is clock period variation. This paper presents a quantitative evaluation of clock variation for every FPGA node and describes the design and implementation of a mechanism that lets the system operate correctly.

researchmap
メッシュ接続FPGAアレーにおける高性能ステンシル計算 Reviewed

小林, 諒平, 佐野, 伸太郎, 高前田(山崎), 伸也, 吉瀬, 謙二

先進的計算基盤システムシンポジウム論文集 2012 142 - 149 2012.5

　More details

Authorship：Corresponding author Language：Japanese Publishing type：Research paper (conference, symposium, etc.)

The FPGA (Field Programmable Gate Array) is a remarkable device for easily developing custom hardware accelerators with high performance. In this paper, we propose a scalable stencil computation mechanism employing multiple small FPGAs. Stencil computation is one of the most important kernels in scientific computations. This paper describes the architecture of our multi-FPGA-based stencil computation system with a 2D-mesh topology and the details of its initial implementation. To eliminate the handshaking overhead between neighboring FPGAs, the computation order is customized for each FPGA to increase the overlap of computation and communication. The evaluation shows that a single FPGA node achieves 2.24 GFlop/s at 0.16 GHz with 2.37 W power consumption. We also estimated the performance of a 256-FPGA system, which achieves 537 GFlop/s with an efficiency of 0.94 GFlop/sW.

researchmap
メッシュ接続FPGAアレーにおけるステンシル計算の検討

小林, 諒平, 佐野, 伸太郎, 高前田(山崎), 伸也, 吉瀬, 謙二

第74回全国大会講演論文集 2012 ( 1 ) 107 - 108 2012.3

　More details

Authorship：Corresponding author Language：Japanese Publishing type：Research paper (conference, symposium, etc.)

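In recent years, because FPGAs allow dedicated hardware to be configured flexibly, research on using FPGAs as accelerators for scientific computing has been active. In this study, we examine stencil computation, one of the important computation kernels in scientific and technical computing, on a mesh-connected FPGA array.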

researchmap
Towards a Low-Power Accelerator of Many FPGAs for Stencil Computations Reviewed

Ryohei Kobayashi, Shinya Takamaeda-Yamazaki, Kenji Kise

2012 THIRD INTERNATIONAL CONFERENCE ON NETWORKING AND COMPUTING (ICNC 2012) 343 - 349 2012

　More details

Language：English Publishing type：Research paper (international conference proceedings)

DOI： 10.1109/ICNC.2012.67

Web of Science

researchmap


Books

Interface 2017年2月号緊急特集本家ARMのIoTワールド入門

Kobayashi, Ryohei（ Role： Contributor, 計算力時代到来...スパコン技術研究コーナソート専用コンピュータ最前線）

CQ出版社 2017.2

　More details

Language：Japanese Book type：Scholarly book

researchmap
Interface 2016年12月号 IoT＆スパコン！ラズパイ時代の自分用コンピュータ作り

Kobayashi, Ryohei（ Role： Contributor, 第6章 ビッグデータ時代にますます重要！ハードウェア・データ処理に挑戦）

CQ出版社 2016.12

　More details

Language：Japanese Book type：Scholarly book

researchmap
Interface 2016年12月号 IoT＆スパコン！ラズパイ時代の自分用コンピュータ作り

Kobayashi, Ryohei（ Role： Contributor, 第6章 Appendix 2 基本演算の高速化が重要！ハードウェア並列ソート・アルゴリズム）

CQ出版社 2016.12

　More details

Language：Japanese Book type：Scholarly book

researchmap

Presentations

これからのHPC研究に向けて Invited

小林諒平

第201回ハイパフォーマンスコンピューティング研究発表会 2025.9 情報処理学会ハイパフォーマンスコンピューティング（HPC）研究会

　More details

Event date： 2025.9

Language：Japanese Presentation type：Symposium, workshop panel (nominated)

Venue：金沢商工会議所 Country：Japan

researchmap
【SWEST/ACRi 共同企画セッション】FPGAが変えるスーパーコンピューティングの世界 Invited

小林諒平

第27回組込みシステム技術に関するサマーワークショップ(SWEST27) 2025.8 SWEST 実行委員会

　More details

Event date： 2025.8

Language：Japanese Presentation type：Oral presentation (invited, special)

Venue：下呂温泉水明館 Country：Japan

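FPGAs, known as "soft hardware," can become almost anything: their reconfigurability lets the logic circuit configuration be changed flexibly to suit the task, and they are therefore also used as accelerators in high-performance computing (HPC) systems, including supercomputers. However, FPGAs differ from GPUs and CPUs in programming model, compute performance, and resource constraints, so they are not a universal fit for every application; it is important to assess an application's parallelism, computation patterns, and problem-size characteristics, and then choose the best accelerator and circuit design. Drawing on findings from our research to date, this session explains practical methods for adopting FPGAs according to application characteristics and gives attendees an opportunity to form a clearer picture of how FPGAs could apply to their own applications.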

researchmap
HPC研究の過去とこれからーAI/量子時代のHPC研究ー Invited

小林諒平

2025年並列／分散／協調処理に関するサマー・ワークショップ（SWoPP 2025） 2025.8

　More details

Event date： 2025.8

Language：Japanese Presentation type：Symposium, workshop panel (nominated)

Country：Japan

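At SWoPP2025, the HPC research group holds its 200th meeting. To mark the occasion, we look back on past meetings and discuss with the panelists the outlook for the HPC research needed in the AI/quantum era. In particular, we discuss the challenges in the HPC field that the younger generation will take an interest in and carry forward. In addition to people involved in past HPC meetings, younger researchers are invited as panelists.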

researchmap
CXLメモリプール実験システムの初期評価

遠藤敏夫, 坂本龍一, 野村哲弘, 小林諒平, 大辻弘貴, 加藤純, 古藤明音, 三輪真弘

第200回ハイパフォーマンスコンピューティング研究発表会（SWoPP2025） 2025.8

　More details

Event date： 2025.8

Language：Japanese Presentation type：Oral presentation (general)

Country：Japan

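In HPC and cloud systems, large-capacity memory is statically allocated to each node, so rising deployment cost and power consumption have become serious problems. As one solution, memory pool systems based on the Compute Express Link (CXL) 2.0 specification are attracting attention, as they enable memory resources to be shared efficiently and allocated flexibly across multiple nodes. In this study, we built a 1 TiB CXL memory pool using H3's Falcon C5022 module and evaluated its performance on real hardware, on a server with an Intel Granite Rapids CPU. Specifically, we measured memory access latency with Intel Memory Latency Checker v3.11 and evaluated bandwidth with the STREAM benchmark, and used the results to quantify the performance characteristics of CXL memory pool technology. Finally, based on these findings, we discuss practical guidelines for the optimal design and operation of CXL memory pools.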

researchmap
Accelerating Deep Learning Inference with a Parallel FPGA System International conference

Takumi Suzuki, Ryohei Kobayashi, Norihisa Fujita, Taisuke Boku

HEART '25: Proceedings of the 15th International Symposium on Highly Efficient Accelerators and Reconfigurable Technologies 2025.5

　More details

Event date： 2025.5

Language：English Presentation type：Oral presentation (general)

Country：Japan

researchmap
イタレーションレベルApproximate Computing手法の提案と予備評価

和田康孝, 小林諒平, 森江善之, 坂本龍一

第199回ハイパフォーマンスコンピューティング研究発表会 2025.5

　More details

Event date： 2025.5

Language：Japanese Presentation type：Oral presentation (general)

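Approximate Computing (AC), which optimizes the trade-off among compute performance, power consumption, and result accuracy by changing arithmetic precision, is a promising way to push performance beyond existing limits under constraints such as power consumption. To benefit from AC in applications that are sensitive to arithmetic precision, such as HPC applications, it is necessary to tune precision at fine granularity for each element of the application rather than using a single precision throughout. In this paper, we describe an iteration-level AC method that applies AC by exploiting structures characteristic of HPC applications, such as time-evolution loops, and present preliminary evaluation results.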

researchmap
Evaluation of Trade-off between Compression Ratio and Hardware Cost for Adaptive Bandwidth Compression Hardware Platform International conference

Tomohiro Ueno, Kaito Kitazume, Masato Kiyama, Kazutomo Yoshii, Kento Sato, Norihisa Fujita, Ryohei Kobayashi, Taisuke Boku, Kentaro Sano

IEEE Symposium on Low-Power and High-Speed Chips and Systems (COOL Chips 28) 2025.4

　More details

Event date： 2025.4

Language：English Presentation type：Oral presentation (general)

researchmap
高スループット非同期集団通信の性能モデル化に向けた予備評価

森江善之, 和田康孝, 小林諒平, 坂本龍一, 南里豪志

第198回ハイパフォーマンスコンピューティング・第14回量子ソフトウェア合同研究発表会 2025.3

　More details

Event date： 2025.3

Language：Japanese Presentation type：Oral presentation (general)

researchmap
並列FPGA間通信フレームワークCIRCUSへのフロー制御の実装と評価

北爪開人, 藤田典久, 小林諒平, 朴泰祐

第198回ハイパフォーマンスコンピューティング・第14回量子ソフトウェア合同研究発表会 2025.3

　More details

Event date： 2025.3

Language：Japanese Presentation type：Oral presentation (general)

researchmap
GPU演算加速による一般相対論的輻射磁気流体シミュレーションコードの性能評価

小林諒平, 高橋博之, 額田彰, 朝比奈雄太, 朴泰祐, 大須賀健

第198回ハイパフォーマンスコンピューティング・第14回量子ソフトウェア合同研究発表会 2025.3

　More details

Event date： 2025.3

Language：Japanese Presentation type：Oral presentation (general)

researchmap
Accelerating General Relativistic Radiation Magnetohydrodynamic Simulations with GPUs International conference

Ryohei Kobayashi, Hiroyuki R. Takahashi, Akira Nukada, Yuta Asahina, Taisuke Boku, Ken Ohsuga

HPC Asia 2025: International Conference on High Performance Computing in Asia-Pacific Region 2025.2

　More details

Event date： 2025.2

Language：English Presentation type：Oral presentation (general)

researchmap
「富岳」Next時代のアクセラレータに向けた自動チューニング技術 Invited

Ryohei Kobayashi

第16回自動チューニング技術の現状と応用に関するシンポジウム（ATTA2024） 2024.12

　More details

Event date： 2024.12

Language：Japanese Presentation type：Symposium, workshop panel (nominated)

researchmap
Preliminary Evaluation of Kyokko for Inter-FPGA Communication Framework CIRCUS International conference

Kaito Kitazume, Norihisa Fujita, Ryohei Kobayashi, Taisuke Boku

IEEE Cluster 2024 2024.9

　More details

Event date： 2024.9

Language：English Presentation type：Poster presentation

researchmap
Preliminary Performance Evaluation of Grace-Hopper GH200 International conference

Toshihiro Hanawa, Kengo Nakajima, Yohei Miki, Takashi Shimokawabe, Kazuya Yamazaki, Shinji Sumimoto, Osamu Tatebe, Taisuke Boku, Daisuke Takahashi, Akira Nukada, Norihisa Fujita, Ryohei Kobayashi, Hiroto Tadano, Akira Naruse

IEEE Cluster 2024 2024.9

　More details

Event date： 2024.9

Language：English Presentation type：Poster presentation

researchmap
Using SYCLomatic to Migrate CUDA Code to oneAPI Adapting NVIDIA GPU International conference

Wentao Liang, Norihisa Fujita, Ryohei Kobayashi, Taisuke Boku

IEEE Cluster 2024 2024.9

　More details

Event date： 2024.9

Language：English Presentation type：Poster presentation

researchmap
適応型帯域圧縮ハードウェアプラットフォームのChisel実装と評価

北爪開人, 上野知洋, 吉井一友, 木山真人, 藤田典久, 小林諒平, 佐野健太郎, 朴泰祐

2024年9月リコンフィギャラブルシステム研究会 2024.9

　More details

Event date： 2024.9

Language：Japanese Presentation type：Oral presentation (general)

researchmap
CHARM-SYCL & IRIS: A Tool Chain for Performance Portability on Extremely Heterogeneous Systems International conference

Norihisa Fujita, Beau Johnston, Narasinga Rao Miniskar, Ryohei Kobayashi, Mohammad Alaul Haque Monil, Keita Teranishi, Seyong Lee, Jeffrey S. Vetter, Taisuke Boku

20th IEEE International Conference on e-Science 2024 2024.9

　More details

Event date： 2024.9

Language：English Presentation type：Oral presentation (general)

researchmap
GPU・FPGA連携による高性能計算 Invited

小林, 諒平

DAシンポジウム2024 −システムとLSIの設計技術− 2024.8

　More details

Event date： 2024.8

Language：Japanese Presentation type：Oral presentation (invited, special)

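Given the performance demanded of supercomputers, the limits on available power capacity, and the recent move toward decarbonization, improving the power efficiency of supercomputers is an urgent issue, and the use of accelerators is becoming mainstream in high-performance computing as the answer. The most widely used accelerator today is the GPU (Graphics Processing Unit), but efficient computation on GPUs is subject to various constraints, such as a very large and highly uniform degree of spatial parallelism, uniform memory access, and a relatively small volume of parallel communication data, so GPUs alone sometimes cannot fully accelerate an application. We have therefore pursued an approach in which hardware that accelerates computations inefficient on GPUs is implemented on an FPGA (Field Programmable Gate Array), and the GPU and FPGA are used in a complementary manner to improve whole-application performance. This talk introduces data transfer techniques and programming models for GPU-FPGA cooperation, and a case study of accelerating an astrophysics application by using GPUs and FPGAs together.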

researchmap
Improving Performance on Replica-Exchange Molecular Dynamics Simulations by Optimizing GPU Core Utilization International conference

Boku, Taisuke, Sugita, Masatake, Kobayashi, Ryohei, Furuya, Shinnosuke, Fujie, Takuya, Ohue, Masahito, Akiyama, Yutaka

The 53rd International Conference on Parallel Processing (ICPP 2024) 2024.8

　More details

Event date： 2024.8

Language：English Presentation type：Oral presentation (general)

While GPUs are the main accelerators in high performance computing systems, their performance depends on how well the large number of cores on each device is used in parallel. Typically, a loop with many iterations is assigned to a device and its iterations are mapped onto the cores, so there must be enough iterations to fill the thousands of cores in a high-end GPU.

Advanced GPUs such as the NVIDIA H100 provide several techniques, such as Multi-Process Service (MPS) and Multi-Instance GPU (MIG), that divide the GPU cores among multiple user processes to improve core utilization even when the degree of parallelism is small. We apply MPS to a practical Molecular Dynamics (MD) simulation with the AMBER software to improve the efficiency of GPU core utilization and save computation resources. The critical issue here is to analyze the core utilization and overhead when running multiple processes on a GPU device, as well as multi-GPU and multi-node parallel execution, for overall performance improvement.

In this paper, we introduce a method for applying MPS to AMBER to simulate the membrane permeation process of a drug candidate peptide by a two-dimensional replica-exchange method on an advanced supercomputer with NVIDIA H100 GPUs. We applied several optimizations to the parameter settings on NVIDIA H100 and V100 GPUs and investigated their performance behavior. Finally, we found that GPU core utilization improves by up to a factor of two compared with a simple process assignment method.

researchmap
GH200の予備性能評価

塙, 敏博, 建部, 修見, 中島, 研吾, 朴, 泰祐, 三木, 洋平, 下川辺, 隆史, 山崎, 一哉, 住元, 真司, 高橋, 大介, 額田, 彰, 藤田, 典久, 小林, 諒平, 多田野, 寛人, 田浦, 健次朗, 細川, 颯介, 髙橋, 淳一郎, 成瀬, 彰

第195回ハイパフォーマンスコンピューティング研究発表会（SWoPP2024） 2024.8

　More details

Event date： 2024.8

Language：Japanese Presentation type：Oral presentation (general)

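The Joint Center for Advanced High Performance Computing (JCAHPC) is preparing to deploy Miyabi, which begins operation in January 2025. The 1,120 Miyabi-G compute nodes are equipped with the GH200 Grace-Hopper Superchip, making Miyabi the first supercomputer in Japan to adopt the GH200. In this paper, we report the results of various preliminary performance evaluations carried out on a GH200 test system.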

researchmap
多様な環境におけるマルチ・タスク・ミニベンチマークの評価とPerformance Portability

藤田, 典久, Beau, Johnston, 小林, 諒平, Mohammad, Alaul, Haque Monil, Narasinga, Rao Miniskar, Keita, Teranishi, Seyong, Lee, Jeffrey, S. Vetter, 朴, 泰祐

第195回ハイパフォーマンスコンピューティング研究発表会（SWoPP2024） 2024.8

　More details

Event date： 2024.8

Language：Japanese Presentation type：Oral presentation (general)

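As HPC systems grow more diverse, application portability has become an important issue in making use of varied systems. In this paper, we propose CHARM-SYCL, a programming environment that handles multiple accelerators in a unified way, as a development environment for achieving application portability. CHARM-SYCL can generate kernels for multiple accelerators from a single code base and can also use the IRIS library, developed at ORNL, as a backend. IRIS has a high-performance scheduler and can execute compute tasks on multiple accelerators; combining CHARM-SYCL and IRIS achieves high application portability. We apply the CHARM-SYCL development environment to a benchmark code for Monte Carlo simulation and show that the proposed system achieves high application portability.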

researchmap
次世代スパコンに期待すること Invited

小林, 諒平

2024年並列／分散／協調処理に関するサマー・ワークショップ（SWoPP 2024） 2024.8

　More details

Event date： 2024.8

Language：Japanese Presentation type：Symposium, workshop panel (nominated)

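In this BoF, early-career researchers in the HPC field take the stage for a panel discussion on what next-generation supercomputers should look like, as seen by the young researchers who will lead Japan's HPC community.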

researchmap
Preliminary Evaluation of Flow Control on the Inter-FPGA Communication Framework CIRCUS International conference

Kitazume, Kaito, Fujita, Norihisa, Kobayashi, Ryohei, Boku, Taisuke

2nd Workshop on FPGA Technologies for Adaptive Computing (FTAC 2024) 2024.6

　More details

Event date： 2024.6

Language：English Presentation type：Oral presentation (general)

Field-Programmable Gate Arrays (FPGAs) are gaining attention as computational acceleration devices in the field of high-performance computing. The usefulness of FPGAs has increased with the appearance of FPGA boards with high-speed optical interfaces and of high-level synthesis. On the other hand, the environment for using FPGAs in parallel computing for high-performance computing is still under development. As part of these efforts, the Center for Computational Sciences at the University of Tsukuba is developing a framework called CIRCUS (Communication Integrated Reconfigurable CompUting System). This framework aims to enable fast communication between multiple FPGAs using OpenCL-based high-level synthesis. However, CIRCUS currently has no flow control, because the FPGA communication protocol used within its communication module does not provide it. To solve this problem, our research focuses on replacing the communication module with a protocol that includes flow control. In this paper, we evaluate the performance of the open-source communication controller Kyokko as a replacement for CIRCUS's communication module. We implement Kyokko on an Intel Stratix 10 GX H-tile FPGA board (BittWare 520N), which supports communication speeds of up to 100 Gbps per port.

researchmap
Accelerating Deep Learning Inference with Multiple FPGAs International conference

Suzuki, Takumi, Kobayashi, Ryohei, Fujita, Norihisa, Boku, Taisuke

2nd Workshop on FPGA Technologies for Adaptive Computing (FTAC 2024) 2024.6

　More details

Event date： 2024.6

Language：English Presentation type：Oral presentation (general)

The demand for fast and power-efficient deep learning solutions is growing. In response, methods for partitioning and implementing deep learning inference models across multiple FPGAs are gaining traction. This research aims to partition and implement deep learning inference models across multiple FPGAs within the Cygnus supercomputer. We use OpenCL as the programming language, implement the ResNet-50 model, and apply 8-bit integer quantization. So far, we have implemented the quantized ResNet-50 model on a single FPGA within the PPX server. Its performance is about 1,600 times slower than that reported in related research. Future plans therefore include speeding up the single-FPGA implementation and partitioning the model across multiple FPGAs.

Unified Programming Environment for Multiple Accelerator Types with Programming, Performance and Compiler Portability International conference

Fujita, Norihisa, Johnston, Beau, Kobayashi, Ryohei, Teranishi, Keita, Lee, Seyong, Boku, Taisuke, Vetter, Jeffrey S

ISC High Performance 2024 2024.5


Event date： 2024.5

Language：English Presentation type：Poster presentation

Ensuring performance portability across a range of accelerator architectures presents a significant challenge when developing application and programming systems for high-performance computing (HPC) environments. This challenge becomes even more pronounced within computing nodes that incorporate multiple accelerator types. Each of these accelerators is distinguished by its specific performance attributes, optimal data layouts, programming interfaces, and program binaries. Navigating the complexity of multi-accelerator programming has motivated us to create the CHARM (Cooperative Heterogeneous Acceleration with Reconfigurable Multidevices) framework, which transparently selects the suitable computations for each accelerator in a given HPC system.
CHARM-SYCL [1] is a unified programming environment for multiple accelerator types, based on this concept, that attacks the diversity problem in HPC systems. SYCL serves as the single programming environment, and applications compatible with many accelerator types can be built into a single executable binary file. The CHARM-SYCL runtime uses the IRIS framework [2] as its backend for accelerators. IRIS is a task-based runtime system developed at ORNL to support multiple accelerator types. It uniformly supports many accelerators and has an internal scheduler that dynamically distributes compute tasks to multiple devices according to the scheduling policy specified by the application. Our goal is to realize portability of the accelerator programming environment. We aim for three types of portability: programming portability, performance portability, and compiler portability. In this poster, we will demonstrate the unification and portability of our proposed programming environment for multiple accelerator types.
References
[1] Norihisa Fujita, Beau Johnston, Ryohei Kobayashi, Keita Teranishi, Seyong Lee, Taisuke Boku, and Jeffrey S. Vetter. 2023. CHARM-SYCL: New Unified Programming Environment for Multiple Accelerator Types. In Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis (SC-W 2023), November 12–17, 2023, Denver, CO, USA. ACM, New York, NY, USA, 11 pages. https://doi.org/10.1145/3624062.3624244
[2] Jungwon Kim, Seyong Lee, Beau Johnston, and Jeffrey S. Vetter. 2021. IRIS: A Portable Runtime System Exploiting Multiple Heterogeneous Programming Systems. In Proceedings of the 25th IEEE High Performance Extreme Computing Conference (HPEC’21). 1–8. https://doi.org/10.1109/HPEC49654.2021.9622873

A Proposal for Regular Path Queries Using FPGAs Focusing on Label Occurrence Frequency

溝谷, 祐大, 小林, 諒平, 藤田, 典久, 朴, 泰祐, 天笠, 俊之

第16回データ工学と情報マネジメントに関するフォーラム(DEIM2024) 2024.2


Event date： 2024.2 - 2024.3

Language：Japanese Presentation type：Oral presentation (general)

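In recent years, graph analysis has been actively pursued, and a wide range of information is being extracted from graphs. Within graph analysis, the regular path query (RPQ) is one method for retrieving the data a user wants. An RPQ is a query over graph data whose edges carry labels: it searches for paths in the graph that have a specified sequence of labels and, when such a path exists, returns its start and end nodes to the user. The challenge here is the computation time of RPQ evaluation. As the data targeted by analysis has grown in recent years, the graphs targeted by RPQs are also expected to grow, and execution is expected to take a very long time for the diverse, large-scale graphs found in the real world. To process such large-scale data, hardware accelerators such as FPGAs (Field Programmable Gate Arrays) are attracting attention. An FPGA is a hardware chip on which arbitrary circuits can be implemented repeatedly through programming. Existing work on accelerating RPQs with FPGAs has two issues: in some cases the FPGA's circuit resources cannot all be used effectively, and extension to multiple FPGAs is difficult. In this study, we therefore propose a method that performs RPQ processing in parallel using multiple kernels. With multiple kernels, each kernel is implemented as an independent circuit inside the FPGA and can operate in parallel, so the FPGA's circuitry can be used more effectively and the method becomes easier to extend to multiple FPGAs in the future. To implement the multi-kernel method, the proposed method focuses on the occurrence frequency of labels. Labels with a low occurrence frequency are defined as rare labels, and partitioning the graph and the query by rare labels makes RPQ processing with multiple kernels possible. In the evaluation experiments, we confirmed that the time required for RPQ evaluation becomes shorter when the number of labels defined as rare and the number of rare labels appearing in the query are large. We also confirmed that, under certain conditions, the method evaluates RPQs faster than the comparison method, that of Miura et al.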

Unified Programming Environment for Multiple Accelerator Types with Portability International conference

Fujita, Norihisa, Johnston, Beau, Kobayashi, Ryohei, Teranishi, Keita, Lee, Seyong, Boku, Taisuke, Vetter, Jeffrey

The 6th R-CCS International Symposium 2024.1


Event date： 2024.1

Language：English Presentation type：Poster presentation

Ensuring performance portability across a range of accelerator architectures presents a significant challenge when developing application and programming systems for high performance computing (HPC) environments. This challenge becomes even more pronounced within computing nodes that incorporate multiple accelerator types. Each of these accelerators is distinguished by its specific performance attributes, optimal data layouts, programming interfaces, and program binaries. Navigating the complexity of multi-accelerator programming has motivated us to create the CHARM (Cooperative Heterogeneous Acceleration with Reconfigurable Multidevices) framework, which transparently selects the suitable computations for each accelerator in a given HPC system.
CHARM-SYCL is a unified programming environment for multiple accelerator types, based on this concept, that attacks the diversity problem in HPC systems. SYCL serves as the single programming environment, and applications compatible with many accelerator types can be built into a single executable binary file. The CHARM-SYCL runtime uses the IRIS framework as its backend for accelerators. IRIS is a task-based runtime system developed at ORNL to support multiple accelerator types. It uniformly supports many accelerators and has an internal scheduler that dynamically distributes compute tasks to multiple devices according to the scheduling policy specified by the application.
Unlike other operating systems, Linux has a distribution culture. As a result, it is difficult to run the same binary on different distributions because they ship different versions of the Linux kernel, compilers, and libraries. Beyond the differences between distributions, different systems usually have different configurations, such as the type of CPUs or accelerators. This forces users to compile and install the CHARM-SYCL compiler on each individual system to avoid compatibility problems. This process is very troublesome for computational scientists, who are not computer professionals, so we want to make installation as simple as possible. To solve this problem, we propose the compiler portable mode of the CHARM-SYCL compiler. It is a special configuration mode selected when the compiler itself is built. It maximizes compatibility and allows the same compiler binary to run on the major Linux distributions used in HPC systems.
In this poster, we will demonstrate the unification and portability for multiple accelerator types of our proposed system.

Enhancing spatial parallelism on loop structure for FPGA International conference

Sano, Yuka, Boku, Taisuke, Fujita, Norihisa, Kobayashi, Ryohei, Sato, Mitsuhisa, Tsuji, Miwako

HPC Asia 2024 2024.1


Event date： 2024.1

Language：English Presentation type：Poster presentation

In today's HPC systems, GPUs with high computational performance and memory bandwidth are the leading players. However, GPU-based acceleration is designed to excel when many computation cores are utilized and computation proceeds in a SIMD/STMD manner. One alternative solution is the FPGA (Field Programmable Gate Array).

FPGA devices can now be programmed in high-level languages. However, the programmer needs advanced optimization skills to exploit their potential performance. To solve this problem, we have been developing an OpenACC-ready compiler for FPGAs. This research is based on the Omni OpenACC compiler and is carried out in collaboration between the Center for Computational Sciences at the University of Tsukuba (CCS) and the RIKEN Center for Computational Science (R-CCS).
In this study, we evaluate and examine high-level synthesis-based FPGA programming techniques towards the compiler-based performance optimization. We try various techniques to increase the number of computational elements by spatial parallelism, such as pipelining, loop unrolling, and simultaneous execution of multiple kernels. Here we target the CG (Conjugate Gradient) method code for matrix calculation described in OpenCL.
Based on the optimization methods obtained in this research, we are implementing the functionality to generate OpenCL code from OpenACC using the Omni OpenACC compiler. This feature will provide existing FPGA programmers with a more straightforward programming environment than OpenCL. Additionally, the programming approach of adding directives to sequential code is expected to reduce the amount of code and development time. Furthermore, FPGA acceleration efforts are expected to expand to applications that have been reluctant to use FPGA-based acceleration until now.
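For readers unfamiliar with the CG (Conjugate Gradient) method targeted in this study, the following is a minimal reference sketch in plain Python. It is not the authors' OpenCL code; the function names and the dense-matrix representation are assumptions chosen for brevity. The dot products and the matrix-vector product inside the loop are the kind of regular inner loops to which pipelining and loop unrolling are typically applied.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def matvec(A, v):
    """Dense matrix-vector product; A is a list of rows."""
    return [dot(row, v) for row in A]


def conjugate_gradient(A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for a symmetric positive-definite matrix A."""
    x = [0.0] * len(b)
    r = list(b)              # residual b - A x, with x starting at zero
    p = list(r)              # initial search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:
            break
        Ap = matvec(A, p)
        alpha = rs_old / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x
```

For example, `conjugate_gradient([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0])` approximates the exact solution (1/11, 7/11).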

Using Intel oneAPI for multi-hybrid acceleration programming with GPU and FPGA coupling International conference

Liang, Wentao, Fujita, Norihisa, Kobayashi, Ryohei, Boku, Taisuke

International Workshop on Intel eXtreme Performance Users Group (IXPUG) 2024.1


Event date： 2024.1

Language：English Presentation type：Poster presentation

Intel oneAPI is a programming framework that supports various accelerators such as GPUs, FPGAs, and multi-core CPUs, with a focus on HPC applications. Users can apply code written in a single language, DPC++, across this heterogeneous programming environment. In practice, however, it is not easy to target different accelerators, especially non-Intel devices such as NVIDIA and AMD GPUs. We have successfully constructed a oneAPI environment in which a single DPC++ program drives true multi-hetero acceleration, using an NVIDIA GPU and an Intel FPGA simultaneously. In this paper, we show how this is done and what kinds of applications can be targeted.

Using Intel oneAPI for multi-hybrid acceleration programming with GPU and FPGA coupling

Liang, Wentao, Fujita, Norihisa, Kobayashi, Ryohei, Boku, Taisuke

第247回システム・アーキテクチャ・第192回ハイパフォーマンスコンピューティング合同研究発表会 2023.12


Event date： 2023.12

Language：English Presentation type：Oral presentation (general)

Intel oneAPI is a programming framework that supports various accelerators such as GPUs, FPGAs, and multi-core CPUs, with a focus on HPC applications. Users can apply code written in a single language, DPC++, across this heterogeneous programming environment. In practice, however, it is not easy to target different accelerators, especially non-Intel devices such as NVIDIA and AMD GPUs. We have successfully constructed a oneAPI environment in which a single DPC++ program drives true multi-hetero acceleration, using an NVIDIA GPU and an Intel FPGA simultaneously. In this paper, we show how this is done and what kinds of applications can be targeted.

The HPC Frontier in Graph Neural Networks Invited

小林, 諒平

液体・ガラスへのデータ駆動アプローチ ~ グラフニューラルネットワークとその周辺 ~


Event date： 2023.11

CHARM-SYCL: New Unified Programming Environment for Multiple Accelerator Types International conference

Fujita, Norihisa, Johnston, Beau, Kobayashi, Ryohei, Teranishi, Keita, Lee, Seyong, Boku, Taisuke, Vetter, Jeffrey S

RSDHA: 3rd Workshop on Redefining Scalability for Diversely Heterogeneous Architectures


Event date： 2023.11

Addressing performance portability across diverse accelerator architectures has emerged as a major challenge in the development of application and programming systems for high-performance computing environments. Although recent programming systems that focus on performance portability have significantly improved productivity in an effort to meet this challenge, the problem becomes notably more complex when compute nodes are equipped with multiple accelerator types, each with unique performance attributes, optimal data layouts, and binary formats. To navigate the intricacies of multi-accelerator programming, we propose CHARM-SYCL as an extension of our CHARM multi-accelerator execution environment [27]. This environment will combine our SYCL-based performance-portability programming front end with a back end for extremely heterogeneous architectures as implemented with the IRIS runtime from Oak Ridge National Laboratory. Our preliminary evaluation indicates a potential productivity boost and reasonable performance compared to vendor-specific programming systems and runtimes.

Performance improvement by enhancing spatial parallelism on FPGA for HPC applications International conference

Sano, Yuka, Boku, Taisuke, Sato, Mitsuhisa, Tsuji, Miwako, Fujita, Norihisa, Kobayashi, Ryohei

IEEE Cluster 2023


Event date： 2023.10 - 2023.11

In today’s HPC systems, GPUs with high computational performance and memory bandwidth under relatively low power consumption are the leading players. However, GPU-based acceleration is designed to excel when many computation cores are utilized and synchronized computation proceeds in a SIMD/STMD manner over a large number of uniform data array elements. Therefore, a GPU may not fully exploit its computational performance in calculations with low parallelism, complex operations involving conditional branching, or parallel applications whose frequent inter-node communication interrupts continuous computing on GPU devices. One alternative solution for accelerated computing is the FPGA (Field Programmable Gate Array), especially given recent devices that contain a large number of logic elements, high memory bandwidth, and even multiple channels of high-speed optical interconnection interfaces reaching up to 100 Gbps each. The performance of an FPGA is based on pipeline parallelism, which enables the computation stream to continue even through conditional branches.

Preliminary Experiments for Parallel Execution of Cast and Communication

森江, 善之, 和田, 康孝, 小林, 諒平, 坂本, 龍一

第191回ハイパフォーマンスコンピューティング研究発表会


Event date： 2023.9

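Applying Approximate Computing (AC) in HPC systems is currently important for trading off the power consumption and effective performance of computer systems. Furthermore, for data transfer in HPC systems, data precision determines the total message volume, so applying AC, which reduces data precision, to data transfer is highly effective, and it becomes especially important for applications with frequent large-message communication. To apply AC to data transfer, a technique for improving performance by overlapping cast processing with communication must first be established. This is because, once a way to overlap cast processing and communication is established, a pipelined transfer method becomes available in which the data is split so that cast processing and communication run concurrently. Realizing this data transfer method enables further improvement in communication performance and reduction in power consumption. In this paper, we therefore conducted preliminary experiments to investigate the requirements for effectively overlapping cast processing and communication. The results show that the choice of communication protocol affects whether cast processing and communication can be overlapped. We also found that, with the Rendezvous protocol, cast processing and communication are sometimes not overlapped as is. To handle this situation, we found that communication can be advanced either by using a dedicated communication thread or by periodically calling communication functions such as MPI_Test() in the main thread.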
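The benefit of splitting data into chunks so that cast processing and communication overlap can be estimated with a simple two-stage pipeline model. The sketch below is only an idealized back-of-the-envelope model under stated assumptions (constant per-chunk cast and send times, perfect overlap, no protocol overhead); the function names are hypothetical and it does not reproduce the paper's experiments.

```python
def sequential_time(n_chunks, t_cast, t_send):
    """Total time when each chunk is cast and then sent, with no overlap."""
    return n_chunks * (t_cast + t_send)


def pipelined_time(n_chunks, t_cast, t_send):
    """Total time when chunk i is being sent while chunk i + 1 is being cast.

    Idealized two-stage pipeline: the first chunk pays both stages in full,
    and every further chunk adds only the slower of the two stages.
    """
    if n_chunks == 0:
        return 0.0
    return t_cast + t_send + (n_chunks - 1) * max(t_cast, t_send)
```

With 4 chunks, a cast time of 1.0 and a send time of 2.0 per chunk (arbitrary units), the sequential time is 12.0 and the pipelined time is 9.0. As the abstract notes, whether this overlap is actually obtained depends on the communication protocol and on how communication progress is driven.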

Evaluating the Impact of Changing Arithmetic Precision Toward Fine-Grained Application of Approximate Computing

和田, 康孝, 森江, 善之, 小林, 諒平, 坂本, 龍一

第191回ハイパフォーマンスコンピューティング研究発表会


Event date： 2023.9

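To apply Approximate Computing techniques to HPC applications, which inherently demand high arithmetic precision, and to optimize the trade-off among arithmetic precision, execution performance, power consumption, and so on, the degree of precision control must be optimized according to the characteristics of each task and each piece of data in the application. In this paper, we evaluate the impact on execution performance and computed results of dynamically changing arithmetic precision in several benchmarks, and examine the fine-grained application of Approximate Computing techniques to HPC applications.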

Performance Evaluation of the Pegasus Big Memory Supercomputer

建部, 修見, 平賀, 弘平, 前田, 宗則, 藤田, 典久, 小林, 諒平, 額田, 彰

第190回ハイパフォーマンスコンピューティング研究発表会（SWoPP2023）


Event date： 2023.8

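Pegasus is a supercomputer that was installed at the Center for Computational Sciences, University of Tsukuba in December 2022 and entered full operation in April 2023. It was among the first to adopt the latest CPUs and GPUs from Intel and NVIDIA and delivers 6.5 PFlops of computational performance. To advance large-volume data analysis and large-scale AI, non-volatile memory was introduced at large scale. Each compute node offers 2 TiB of large-capacity memory, and that region can also be used as ultrafast storage. This report gives an overview of Pegasus and reports on its performance.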

Evaluation of Training Accuracy and Execution Performance of Graph Neural Networks on the NVIDIA H100 GPU

小林, 諒平, 藤田, 典久, 朴, 泰祐, 天笠, 俊之

第190回ハイパフォーマンスコンピューティング研究発表会（SWoPP2023）


Event date： 2023.8

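Graph neural networks (GNNs) are attracting attention, along with the development of deep learning, as a method for analyzing the graph-structured data that underpins today's information society. Given the recent growth in data scale and the diversification of machine learning applications, methods that improve GNN training accuracy and shorten training time are desired. In this paper, we report an evaluation of how training time and accuracy evolve across representative graph datasets and GNN implementations, carried out on the NVIDIA H100 GPU, the latest GPU currently offered by NVIDIA. The evaluation experiments confirmed that GNN models running on the NVIDIA H100 GPU train 1.6 to 1.7 times faster than when running on the NVIDIA Tesla V100 GPU.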

A Proposal for a SYCL-Based Programming Method That Handles Multiple Accelerators in a Unified Manner

藤田, 典久, 小林, 諒平, Johnston, Beau, Miniskar, Narasinga Rao, Lee, Seyong, Teranishi, Keita, Vetter, Jeffrey S., 朴, 泰祐

第190回ハイパフォーマンスコンピューティング研究発表会（SWoPP2023）


Event date： 2023.8

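We call the use of multiple accelerators with different characteristics, each where it fits best, the CHARM (Cooperative Heterogeneous Acceleration with Reconfigurable Multidevices) concept. In CHARM, using multiple types of accelerators requires programming that uses a different language for each accelerator and then combines them to run multiple device types efficiently, but writing such programs is not easy. In this study, to solve the problems of CHARM programming, we propose "CHARM-SYCL", a SYCL-based system that can handle multiple accelerators in a unified manner. The CHARM-SYCL runtime supports IRIS, a task runtime system being developed at Oak Ridge National Laboratory, and uses IRIS to support multiple device types. This manuscript reports the implementation details and performance evaluation of CHARM-SYCL.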

Accelerating astrophysics simulation with GPUs and FPGAs Invited International conference

Kobayashi, Ryohei

ADAC (Accelerated Data Analytics and Computing Institute) ~ Applications Working Group Monthly Seminar ~


Event date： 2023.6

The use of graphics processing units (GPUs) has become very popular owing to their good peak performance and high memory bandwidth; however, they do not work well for applications with partially poor parallelism or frequent inter-node communication. Field-programmable gate arrays (FPGAs) have garnered significant interest in high-performance computing research as their computational and communication capabilities have drastically improved in recent years. GPU-FPGA coupling could be ideal for multiphysics problems, where a simulation includes various computations that are difficult to accelerate with GPUs alone. Currently, researchers at the University of Tsukuba are conducting research and development on an approach to holistic acceleration of applications in HPC cluster systems equipped with GPUs and FPGAs, making full use of both accelerators. This talk will present the outline of the programming environment, implementation, and performance evaluation of a GPU-FPGA-accelerated application for astrophysics simulations.

Analysis of the Efficiency and Effectiveness of Scratchpad Memory on FPGAs and GPUs for Radiative Transfer Simulation

古川, 和輝, 山口, 佳樹, 横野, 智也, 吉川, 耕司, 藤田, 典久, 小林, 諒平, 安倍, 牧人, 朴, 泰祐, 梅村, 雅之

リコンフィギャラブルシステム研究会（RECONF）


Event date： 2023.6

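The ART (Authentic Radiation Transfer) scheme included in a cosmic radiative transfer simulation code is computationally intensive and memory-bound, and acceleration by accelerators is anticipated. In this study, we devised a scratchpad memory mechanism specific to the ART scheme and named it PRISM (PRefetchable and Instantly accessible Scratchpad Memory). We implemented PRISM on both an FPGA and a GPU and compared them with the original implementation. When the simulation space is small, the FPGA is faster, achieving up to a 1.8x speedup; when it is large, the GPU is faster, achieving up to a 5.4x speedup.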

OpenACC Unified Programming Environment for Multi-hybrid Acceleration with GPU and FPGA International conference

Boku, Taisuke, Tsunashima, Ryuta, Kobayashi, Ryohei, Fujita, Norihisa, Lee, Seyong, Vetter, Jeffrey S, Murai, Hitoshi, Nakao, Masahiro, Tsuji, Miwako, Sato, Mitsuhisa

2023 WORKSHOP: HPC ON HETEROGENEOUS HARDWARE (H3)


Event date： 2023.5

Accelerated computing, such as with GPUs, plays a central role in HPC nowadays. However, some complicated applications whose parts show different performance behavior are hard to accelerate with a single type of accelerator, and the GPU is not the perfect solution in these cases. We are developing a framework and transpiler, named MHOAT (Multi-Hybrid OpenACC Translator), that lets users write HPC application code in the single notation of OpenACC and compile it for multi-hybrid accelerators. MHOAT parses the original code with directives to identify the target accelerator devices, currently NVIDIA GPUs and Intel FPGAs, dispatches the device-specific parts of the code to backend compilers such as the NVIDIA HPC SDK for GPUs and the OpenARC research compiler for FPGAs, and then assembles the binaries into the final object together with the FPGA bitstream file. In this paper, we present the concept, design, and implementation of MHOAT and a performance evaluation on a practical astrophysics simulation code, where we achieved performance up to 10 times faster than the GPU-only solution.

Implementation of the Inter-FPGA Serial Communication Controller Kyokko on Intel FPGAs for HPC Use

北爪, 開人, 藤田, 典久, 小林, 諒平, 朴, 泰祐

第189回ハイパフォーマンスコンピューティング研究発表会


Event date： 2023.5

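FPGAs (Field-Programmable Gate Arrays) are attracting attention as acceleration devices in high-performance computing. While the usefulness of FPGAs is increasing with the arrival of high-level synthesis and FPGA boards equipped with high-speed optical interfaces, the environment for parallel computing with FPGAs in high-performance computing is still under development. As part of these efforts, the Center for Computational Sciences at the University of Tsukuba is developing CIRCUS (Communication Integrated Reconfigurable CompUting System), a framework that enables high-speed communication between FPGAs through OpenCL-based high-level synthesis in order to perform parallel computation on multiple FPGAs. However, CIRCUS currently has the problem that flow control is not implemented. Because this problem stems from the absence of flow control in the inter-FPGA communication protocol used in the communication module, this study solves it by replacing the communication module with a protocol that includes flow control. In this paper, we evaluate the performance of Kyokko, an open-source communication protocol, as the protocol to replace CIRCUS's communication module. We implement Kyokko on the BittWare 520N, an FPGA board equipped with an Intel Stratix 10 GX H-tile that supports up to 100 Gbps per port, and evaluate bandwidth, latency, and flow control. In the experiments, Kyokko showed high efficiency exceeding 99.98% and bandwidth close to the theoretical performance. The latency of sending and receiving data was low: about 170 ns without channel bonding and about 180 ns with 4-channel bonding. The flow control latency was about 310 ns without channel bonding and about 320 ns with 4-channel bonding, which shows that processing upon receiving an NFC message is extremely fast.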
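To illustrate why a link without flow control is a problem, and why the latency of the flow control message matters, the following is a small cycle-based simulation of a sender, a bounded receive FIFO, and an optional pause request. It is a generic XON/XOFF-style sketch under stated assumptions, not Kyokko's actual protocol or its NFC message format; all names and parameters are hypothetical.

```python
from collections import deque


def simulate_link(n_words, fifo_depth, drain_every, xoff_threshold=None, latency=2):
    """Return (delivered, dropped) for a sender that pushes one word per cycle.

    The consumer removes one word from the FIFO every `drain_every` cycles.
    With xoff_threshold=None there is no flow control. Otherwise the receiver
    requests a pause whenever FIFO occupancy >= xoff_threshold, and the
    request takes effect at the sender `latency` cycles later.
    """
    occupancy = delivered = dropped = sent = 0
    pause_pipe = deque([False] * latency)  # pause requests travelling back
    cycle = 0
    while sent < n_words or occupancy > 0:
        cycle += 1
        pause_pipe.append(xoff_threshold is not None and occupancy >= xoff_threshold)
        paused = pause_pipe.popleft()
        if sent < n_words and not paused:
            sent += 1
            if occupancy < fifo_depth:
                occupancy += 1
            else:
                dropped += 1               # FIFO overflow: the word is lost
        if cycle % drain_every == 0 and occupancy > 0:
            occupancy -= 1
            delivered += 1
    return delivered, dropped
```

In this model the FIFO needs headroom above the threshold of at least the number of words that can still arrive while the pause request is in flight, which is why a short flow control latency allows smaller buffers.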

A Study on Describing Spatial Parallelism to Improve Computational Performance in FPGA High-Level Synthesis

佐野, 由佳, 小林, 諒平, 藤田, 典久, 朴, 泰祐, 佐藤, 三久

第188回ハイパフォーマンスコンピューティング研究発表会


Event date： 2023.3

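In today's high-performance computing systems, GPUs (Graphics Processing Units), with their high computational performance and memory bandwidth, are being actively adopted as accelerators for HPC applications. However, GPU acceleration is built to perform well when the GPU's many cores are utilized and they carry out uniform, SIMD (Single Instruction Multiple Data)-like processing, so GPUs cannot deliver their full computational performance for computations with low parallelism, operations requiring complex processing such as conditional branches, or applications with frequent communication. For this reason, attention is turning to offloading the operations that are ill-suited to GPUs to FPGAs (Field-Programmable Gate Arrays), which can flexibly build application-specific computation pipelines and memory systems through circuit reconfiguration. Current GPU programming environments include user-friendly directive-based environments typified by OpenACC, but directive-based programming environments for FPGAs are not yet mature. We are therefore conducting research, in a collaboration between the RIKEN Center for Computational Science (R-CCS) and the Center for Computational Sciences (CCS) at the University of Tsukuba, to improve the Omni OpenACC compiler for FPGA programming environments. In this study, as material for examining compiler-based performance optimization methods, we evaluate and examine methods for improving the computational performance of FPGA programming with high-level synthesis. Specifically, for CG (Conjugate Gradient) method code written in OpenCL, we try various methods for increasing the number of computational elements, such as pipelining, loop unrolling, and simultaneous execution of multiple kernels. We then vary the loop unroll factor and the number of concurrently executed kernels, and evaluate FLOPS and BRAM (Block Random Access Memory) utilization. FPGA speedup is basically obtained through pipelining; we increase the number of operations per clock cycle, examine effects such as the impact on BRAM usage, and look for strategies for performance optimization. On FPGAs, however, the operating frequency changes with the depth of loop unrolling, the number of arithmetic units used, and memory usage, and complex trade-offs exist among them, so increasing the number of concurrently executed operations does not necessarily improve performance. For the CG code implemented here on an Intel Stratix10 FPGA, we found that, with one kernel, an unroll factor of 8 gave the highest performance. With two kernels, an unroll factor of 8 gave the highest performance in relation to operating frequency, but memory usage increased substantially. Memory usage must be kept down when the design is to be implemented on the same FPGA together with other applications, and in that case two kernels with an unroll factor of 4 gave the highest performance.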

A Proposal for Graph Breadth-First Search on Multiple FPGAs Using the Inter-FPGA Communication Framework CIRCUS

溝谷, 祐大, 小林, 諒平, 藤田, 典久, 朴, 泰祐, 天笠, 俊之

第15回データ工学と情報マネジメントに関するフォーラム (DEIM 2023)


Event date： 2023.3

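A graph is a data structure that represents various data as nodes and edges, and it is useful for expressing relationships among the diverse data around us. Graph analysis is actively pursued, and a wide range of information is being extracted from graphs. Among graph analysis algorithms, breadth-first search is the most widely used. Breadth-first search is a type of graph traversal algorithm applied in a wide range of fields, such as testing and verification of digital circuits and analysis of road networks. In recent years, however, the growth of graphs has often made breadth-first search very costly to compute. There is also the problem that memory bandwidth cannot be used effectively because of the many irregular memory accesses. We therefore focused on FPGAs. An FPGA is a hardware chip on which arbitrary circuits can be implemented repeatedly through programming. Its performance characteristic is that highly parallel processing is possible by exploiting the parallelism of its circuits. FPGAs can also use optical links for external communication. Because these optical links are connected directly to the circuits on the FPGA, communication with other FPGAs is possible at ultra-low latency. CIRCUS, an inter-FPGA communication framework, is a technology that exploits this feature. In this study, we use CIRCUS to implement breadth-first search on multiple FPGAs.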
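One common way to spread breadth-first search over several devices is level-synchronous BFS with the vertices partitioned among the devices. The sketch below simulates that idea in plain Python. The modulo ownership rule, the function name, and the in-memory "outboxes" standing in for inter-FPGA messages are all assumptions for illustration; the abstract does not describe the partitioning actually used with CIRCUS.

```python
def partitioned_bfs(adj, source, n_parts):
    """Level-synchronous BFS over integer vertices split across n_parts owners.

    adj maps a vertex to a list of neighbours. Vertex v is owned by partition
    v % n_parts (an assumed rule). Returns a dict of vertex -> BFS level.
    """
    def owner(v):
        return v % n_parts

    level = {source: 0}
    frontiers = [[] for _ in range(n_parts)]
    frontiers[owner(source)].append(source)
    depth = 0
    while any(frontiers):
        # Expansion: each partition scans its local frontier and buckets
        # the discovered neighbours by the partition that owns them.
        outboxes = [[[] for _ in range(n_parts)] for _ in range(n_parts)]
        for p in range(n_parts):
            for u in frontiers[p]:
                for v in adj.get(u, ()):
                    outboxes[p][owner(v)].append(v)
        depth += 1
        # Exchange: each owner receives its candidates and keeps the new ones.
        next_frontiers = [[] for _ in range(n_parts)]
        for q in range(n_parts):
            for p in range(n_parts):
                for v in outboxes[p][q]:
                    if v not in level:
                        level[v] = depth
                        next_frontiers[q].append(v)
        frontiers = next_frontiers
    return level
```

The exchange step is where low-latency links between devices matter, since every BFS level requires one round of communication.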

GPU–FPGA-accelerated Radiative Transfer Simulation with Inter-FPGA Communication International conference

Kobayashi, Ryohei, Fujita, Norihisa, Yamaguchi, Yoshiki, Boku, Taisuke, Yoshikawa, Kohji, Abe, Makito, Umemura, Masayuki

HPC Asia '23: International Conference on High Performance Computing in Asia-Pacific Region


Event date： 2023.2 - 2023.3

The complementary use of graphics processing units (GPUs) and field programmable gate arrays (FPGAs) is a major topic of interest in the high-performance computing (HPC) field. GPU–FPGA-accelerated computing is an effective tool for multiphysics simulations, which encompass multiple physical models and simultaneous physical phenomena. Because the constituent operations in multiphysics simulations exhibit varying characteristics, accelerating these operations solely using GPUs is often challenging. Hence, FPGAs are frequently implemented for this purpose. The objective of the present study was to further improve application performance by employing both GPUs and FPGAs in a complementary manner. Recently, this approach has been applied to the radiative transfer simulation code for astrophysics known as ARGOT, with evaluation results quantitatively demonstrating the resulting improvement in performance. However, the evaluation results in question came from the use of a single node equipped with both a GPU and FPGA. In this study, we extended the GPU–FPGA-accelerated ARGOT code to operate on multiple nodes using the message passing interface (MPI) and an FPGA-to-FPGA communication technology scheme called Communication Integrated Reconfigurable CompUting System (CIRCUS). We evaluated the performance of the ARGOT code with multiple GPUs and FPGAs under weak scaling conditions, and found it to achieve up to 12.8x speedup compared to the GPU-only execution.

Implementation and Performance Evaluation of Collective Communications Using CIRCUS on Multiple FPGAs International conference

Kikuchi, Kohei, Fujita, Norihisa, Kobayashi, Ryohei, Boku, Taisuke

International Workshop on Intel eXtreme Performance Users Group (IXPUG) co-located with HPC Asia 2023


Event date： 2023.2

In the high-performance computing domain, the Field Programmable Gate Array (FPGA) is a novel accelerator that exhibits high flexibility and performance characteristics distinct from other accelerators such as the Graphics Processing Unit (GPU). Recent high-end FPGAs are equipped with multiple channels of high-speed optical links, each offering up to 100 Gbps. This is a crucial feature when constructing PC clusters with FPGAs as accelerators; however, it is not easy to use from user kernels because it is implemented at a low level as simple direct communication between neighboring FPGAs.

To provide communication between FPGAs in accelerated PC clusters, we developed a communication system named CIRCUS, which offers a user-friendly API callable from OpenCL and is equipped with a routing function for multi-hop communication over a multi-dimensional torus network of FPGAs. However, CIRCUS currently provides only point-to-point communication between source and destination FPGAs. In an ordinary parallel processing environment such as MPI, users program message passing for parallel algorithms with various collective communication functions, for instance Allreduce and Allgather. In this paper, we implement collective communication over CIRCUS for user-friendly programming of ordinary parallel algorithms on FPGAs. As the first target, we implement the Allreduce function, which is the most essential and important one. The paper briefly describes the CIRCUS system, followed by the design, implementation, and preliminary performance evaluation on Intel Stratix10 FPGAs.
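As a reminder of what Allreduce computes and how it can be built from point-to-point transfers alone, the sketch below simulates the well-known ring algorithm (reduce-scatter followed by allgather) with summation. The abstract does not say which Allreduce algorithm is implemented over CIRCUS, so the ring algorithm and the function name are assumptions chosen for illustration.

```python
def ring_allreduce(data):
    """Simulate a sum-Allreduce over a ring of len(data) nodes.

    data[i] is node i's input vector. All vectors must have the same length,
    divisible by the number of nodes. Returns the per-node result vectors.
    """
    p = len(data)
    size = len(data[0]) // p
    buf = [list(vec) for vec in data]

    def chunk(c):
        return range(c * size, (c + 1) * size)

    # Reduce-scatter: in step s, node i sends chunk (i - s) % p to node i + 1,
    # which adds it to its own copy of that chunk.
    for s in range(p - 1):
        sends = [(i, (i - s) % p) for i in range(p)]
        payloads = [[buf[i][k] for k in chunk(c)] for i, c in sends]
        for (i, c), payload in zip(sends, payloads):
            dst = (i + 1) % p
            for k, val in zip(chunk(c), payload):
                buf[dst][k] += val

    # Node i now holds the complete sum of chunk (i + 1) % p.
    # Allgather: in step s, node i forwards chunk (i + 1 - s) % p to node i + 1.
    for s in range(p - 1):
        sends = [(i, (i + 1 - s) % p) for i in range(p)]
        payloads = [[buf[i][k] for k in chunk(c)] for i, c in sends]
        for (i, c), payload in zip(sends, payloads):
            dst = (i + 1) % p
            for k, val in zip(chunk(c), payload):
                buf[dst][k] = val
    return buf
```

Each node only ever talks to its ring neighbor, which maps naturally onto direct links between adjacent devices.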

An FPGA-based Accelerator for Regular Path Queries over Edge-labeled Graphs International conference

Miura, Kento, Kobayashi, Ryohei, Amagasa, Toshiyuki, Kitagawa, Hiroyuki, Fujita, Norihisa, Boku, Taisuke

2022 IEEE International Conference on Big Data (Big Data) 2022.12


Event date： 2022.12

Language：English Presentation type：Oral presentation (general)

Venue：Japan Osaka

Edge-labeled directed graphs are commonly used to represent various information in different applications, such as social networks and knowledge graphs. Regular path queries (RPQs) extract pairs of nodes where one is reachable from the other through a labeled path that matches a query pattern given as a regular expression. RPQs are useful for extracting complicated or semantically meaningful information from a graph, but they pose a challenge on large graphs: execution times become long because of the explosive growth of intermediate results, while some applications require fast query execution. To address this problem, we propose an FPGA-based RPQ accelerator. The idea is to exploit the FPGA’s parallelism to traverse the target graph and match the regular path expression in parallel in a pipelined manner. To validate the performance of the proposed method, we conducted a set of experiments. The results show that, for RPQs against social graphs extracted from the real world, the proposed method achieves elapsed times up to three orders of magnitude shorter than baseline methods.
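To make the RPQ semantics concrete, the sketch below evaluates a query on the CPU with a textbook approach: a breadth-first search over pairs of (graph node, automaton state). It is not the FPGA design proposed in the paper. The query is assumed to be supplied already compiled into a nondeterministic finite automaton, and the function name and data layout are hypothetical.

```python
from collections import deque


def evaluate_rpq(edges, nfa, start_state, accept_states):
    """Return all (start node, end node) pairs joined by a matching path.

    edges: iterable of (src, label, dst) triples.
    nfa: dict mapping state -> {label: [next states]}.
    """
    adj = {}
    nodes = set()
    for src, label, dst in edges:
        adj.setdefault(src, []).append((label, dst))
        nodes.update((src, dst))

    results = set()
    for origin in nodes:
        seen = {(origin, start_state)}
        queue = deque(seen)
        while queue:
            node, state = queue.popleft()
            if state in accept_states:
                results.add((origin, node))
            for label, next_node in adj.get(node, ()):
                for next_state in nfa.get(state, {}).get(label, ()):
                    pair = (next_node, next_state)
                    if pair not in seen:
                        seen.add(pair)
                        queue.append(pair)
    return results
```

For the query `a b*`, the automaton can be written as `{0: {"a": [1]}, 1: {"b": [1]}}` with start state 0 and accepting state 1. The set of visited (node, state) pairs is the intermediate result whose growth the abstract identifies as the bottleneck.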

Accelerating Radiative Transfer Simulation on NVIDIA GPUs with OpenACC International conference

Kobayashi, Ryohei, Fujita, Norihisa, Yamaguchi, Yoshiki, Boku, Taisuke, Yoshikawa, Kohji, Abe, Makito, Umemura, Masayuki

The 23rd International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT’22)


Event date： 2022.12

To accelerate multiphysics applications, the use of FPGAs alongside GPUs has been emerging. Multiphysics applications are simulations involving multiple physical models and multiple simultaneous physical phenomena. Operations with different performance characteristics appear in the simulation, making it difficult to accelerate the simulation using only GPUs. Therefore, we aim to improve the overall performance of the application by using FPGAs to accelerate operations whose characteristics lower GPU efficiency. However, the application is currently implemented through multilingual programming, where the computation kernel running on the GPU is written in CUDA and the computation kernel running on the FPGA is written in OpenCL. This method imposes a heavy burden on programmers; therefore, we are currently working on a programming environment that enables the use of both accelerators with OpenACC in a GPU–FPGA-equipped high-performance computing (HPC) cluster system. To this end, we port the entire code from the CUDA-OpenCL mixture to OpenACC only. On this basis, this study quantitatively investigates the performance of the OpenACC GPU implementation compared to the CUDA implementation for ARGOT, a radiative transfer simulation code for fundamental astrophysics, which is a multiphysics application. We observe that the OpenACC implementation achieves performance and scalability comparable to the CUDA implementation on the Cygnus supercomputer equipped with NVIDIA V100 GPUs.

Implementation and Performance Evaluation of Collective Communication Using the CIRCUS Communication System in a Parallel FPGA Environment

菊池, 航平, 藤田, 典久, 小林, 諒平, 朴, 泰祐

第187回ハイパフォーマンスコンピューティング研究発表会 2022.12


Event date： 2022.12

Language：Japanese Presentation type：Oral presentation (general)

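In recent years, FPGAs (Field Programmable Gate Arrays) have attracted attention as a new kind of HPC accelerator. FPGAs have high-speed serial I/O interfaces and can communicate with one another directly through them. The ability to handle high communication bandwidth at low latency through direct communication is unique to FPGAs and is expected to be very powerful when FPGAs are used in parallel to scale up problem size or improve performance. To support the development of HPC applications that run on parallel FPGAs, the Center for Computational Sciences at the University of Tsukuba is developing CIRCUS (Communication Integrated Reconfigurable CompUting System), an inter-FPGA communication framework. CIRCUS provides a router function for the FPGA network and a communication API, making it possible to describe inter-FPGA communication from OpenCL programs. At present, however, the only communication pattern CIRCUS supports is point-to-point communication, and collective communication such as that found in MPI, which is widely used as a communication library, is not implemented. The goal of this research is to provide HPC users of parallel FPGAs with a high-performance, user-friendly collective communication API that runs on top of CIRCUS. Toward this goal, this paper designs and implements Allreduce communication using CIRCUS. The implementation works correctly on four FPGAs, but we found that performance is degraded because CIRCUS communication lacks a flow control function. Working around this problem requires complicated programming, and extra overhead is unavoidable. To solve the problem, we plan to replace the inter-FPGA communication controller with one that supports flow control.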

researchmap
Cygnus - World First Multihybrid Accelerated Cluster with GPU and FPGA Coupling International conference

Boku, Taisuke, Fujita, Norihisa, Kobayashi, Ryohei, Tatebe, Osamu

2nd International Workshop on Deployment and Use of Accelerators (DUAC) - co-located with the 51st International Conference on Parallel Processing - 2022.8 DUAC2022 Organization Committee

　More details

Event date： 2022.8 - 2022.9

Language：English Presentation type：Oral presentation (general)

Venue：France Bordeaux

In this paper, we describe the concept, system architecture, supporting system software, and applications of Cygnus, our world-first supercomputer with multihybrid accelerators using GPU and FPGA coupling, which runs at the Center for Computational Sciences, University of Tsukuba. Although Cygnus as a whole is a GPU-accelerated PC cluster with over 80 computation nodes, a special group of 32 nodes, named the Albireo part, is configured as a multihybrid accelerated computing system. Each node of the Albireo part is equipped with four NVIDIA V100 GPU cards and two Intel Stratix10 FPGA cards in addition to two sockets of Intel Xeon Gold CPUs, and all nodes are connected by four lanes of InfiniBand HDR100 HCAs with full bisection bandwidth through NVIDIA HDR200 switches. Besides this ordinary interconnection network, all FPGA cards in the Albireo part are connected by a special two-dimensional torus network with direct optical links on each FPGA, forming an FPGA-centric interconnection network with very high throughput and low latency.

To the best of our knowledge, Cygnus is the world’s first production-level PC cluster to realize multihybrid acceleration with the GPU and FPGA combination. Unlike other GPU-accelerated clusters, users can program parallel codes in which each process exploits either or both of the GPU and FPGA devices based on the characteristics of their applications. Because the programming method for such a complicated system has not been standardized, we developed various supporting system software, such as an inter-FPGA network routing system, a DMA engine for GPU-FPGA direct communication managed by the FPGA, and a multihybrid accelerated programming framework. Further, we developed the first real application on Cygnus, a fundamental astrophysics simulation that fully utilizes the GPU and FPGA together for very efficient acceleration.

We describe the overall concept and construction of the Cygnus cluster with a brief introduction to several underlying hardware and software research studies that have already been published. We summarize how this concept of GPU/FPGA coworking will usher in a new era of accelerated supercomputing.

researchmap
OpenACC-Enabled GPU-FPGA Accelerated Computing for Astrophysics Simulation Invited International conference

Kobayashi, Ryohei

OpenACC and Hackathons Asia-Pacific Summit 2022 2022.8

　More details

Event date： 2022.8

Language：English Presentation type：Oral presentation (general)

There are a variety of accelerators available to the high-performance computing (HPC) community. The use of graphics processing units (GPUs) has become very popular owing to their good peak performance and high memory bandwidth; however, they do not work well for applications that employ partially poor parallelism or frequent inter-node communication. Field-programmable gate arrays (FPGAs) have garnered significant interest in HPC research as their computational and communication capabilities have drastically improved in recent years. GPU-FPGA coupling could be ideal for multiphysics problems, where a simulation includes various computations that are difficult to accelerate by GPU alone.

Currently, researchers at the University of Tsukuba are working on a programming environment that enables the use of both accelerators in a GPU-FPGA-equipped HPC cluster system with OpenACC. This talk will present the outline of the programming environment, implementation, and performance evaluation of a GPU-FPGA-accelerated application for astrophysics simulations.

researchmap
Implementation and Performance Evaluation of Memory System Using Addressable Cache for HPC Applications on HBM2 Equipped FPGAs International conference

Fujita, Norihisa, Kobayashi, Ryohei, Yamaguchi, Yoshiki, Boku, Taisuke

HeteroPar 2022: Twentieth International Workshop on Algorithms, Models and Tools for Parallel Computing on Heterogeneous Platforms

　More details

Event date： 2022.8

When we apply field programmable gate arrays (FPGAs) as HPC accelerators, their memory bandwidth presents a significant challenge because it is not comparable to that of other HPC accelerators. In this paper, we propose a memory system for HBM2-equipped FPGAs and HPC applications that uses block RAMs (BRAMs) as an addressable cache implemented between HBM2 and an application. This architecture enables bulk data transfer between HBM2 and the cache and allows an application to utilize fast random access on BRAMs. This study demonstrates the implementation and performance evaluation of our new memory system for HPC and HBM2 on an FPGA. Furthermore, we describe the API that can be used to control this system from the host. We implement RISC-V cores in an FPGA as controllers to realize fine-grained data transfer control and to avoid overheads derived from the PCI Express bus. The proposed system is implemented on eight memory channels and achieves a bandwidth of 102.7 GB/s. It exceeds the memory bandwidth of conventional FPGA boards with four channels of DDR4 memory despite using only 8 of the 32 HBM2 channels.
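
The bandwidth figures above can be put in perspective with a little arithmetic. The sketch below assumes a 512 GB/s HBM2 peak spread evenly over the 32 channels and a 76.8 GB/s figure for four channels of DDR4; both numbers are taken from related abstracts on this page rather than from this one, and the even per-channel split is an assumption, so the derived efficiency should be read as an estimate.

```python
# Figures stated in this abstract.
MEASURED_GBPS = 102.7   # achieved bandwidth on eight HBM2 channels
USED_CHANNELS = 8
HBM2_CHANNELS = 32

# Figures assumed from related abstracts on this page.
HBM2_PEAK_GBPS = 512.0  # assumed peak over all 32 channels
DDR4_4CH_GBPS = 76.8    # assumed four-channel DDR4 bandwidth

# Assumption: the peak is divided evenly among the channels.
per_channel_gbps = HBM2_PEAK_GBPS / HBM2_CHANNELS   # 16.0
peak_used_gbps = per_channel_gbps * USED_CHANNELS   # 128.0

efficiency = MEASURED_GBPS / peak_used_gbps         # about 0.80
ratio_vs_ddr4 = MEASURED_GBPS / DDR4_4CH_GBPS       # about 1.34

print(f"peak of {USED_CHANNELS} channels: {peak_used_gbps:.1f} GB/s")
print(f"estimated efficiency: {efficiency:.1%}")
print(f"ratio to 4-channel DDR4: {ratio_vs_ddr4:.2f}x")
```

Under these assumptions the system reaches roughly 80% of the peak of the eight channels it uses, and about 1.3 times the four-channel DDR4 figure.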

researchmap
Multi-Node Parallelization of the Cosmic Radiative Transfer Code ARGOT on a GPU-FPGA Hybrid Accelerated Cluster

小林, 諒平, 藤田, 典久, 山口, 佳樹, 朴, 泰祐, 吉川, 耕司, 安部, 牧人, 梅村, 雅之

第185回ハイパフォーマンスコンピューティング研究発表会（SWoPP2022） 2022.7

　More details

Event date： 2022.7

Language：Japanese Presentation type：Oral presentation (general)

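We are conducting research on GPU-FPGA hybrid systems, in which GPUs (Graphics Processing Units), which offer high computational performance and memory bandwidth, are coupled with FPGAs (Field Programmable Gate Arrays), which excel in computation and communication performance, so that the two are used complementarily. GPU-FPGA hybrid acceleration is needed because we expect it to be effective for multiphysics applications, that is, simulations involving multiple physical models and multiple simultaneous physical phenomena. In multiphysics, operations with various characteristics appear within a simulation, so acceleration with GPUs alone can be difficult. We therefore aim to improve overall application performance by using FPGAs to accelerate operations whose characteristics GPUs cannot handle well. We have previously applied this concept to the cosmic radiative transfer simulation code ARGOT and, by evaluating the resulting performance gains, quantitatively demonstrated the usefulness of using both devices together. However, the GPU-FPGA cooperative acceleration realized so far assumed the use of only a single node equipped with both a GPU and an FPGA. In this work, we extend the single-node GPU-FPGA cooperative ARGOT code to run on multiple nodes using MPI and the inter-FPGA communication technology CIRCUS (Communication Integrated Reconfigurable CompUting System), and report on its implementation.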

researchmap
Investigation of HBM Behavior Caused by Data-Space Partitioning for Parallelization and the Resulting Changes in Access Patterns

瀬口, 知洋, 中井, 榛希, 山口, 佳樹, 藤田, 典久, 小林, 諒平, 朴, 泰祐

SWoPP2022: 並列／分散／協調システムとディペンダブルコンピューティングおよび一般 2022.7

　More details

Event date： 2022.7

Language：Japanese Presentation type：Oral presentation (general)

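The Field Programmable Gate Array (FPGA), whose compute circuits can be electrically reconfigured to match application requirements, has continued to evolve since its birth as a substitute for glue logic and as a prototyping device. Advances in semiconductor manufacturing and packaging technologies have greatly improved its computational performance and functionality. In addition, the maturation of integrated development environments, for example through the adoption of high-level synthesis, and the resulting simplification of design have greatly lowered the cost of adopting FPGAs, and FPGAs have come to be widely used in information systems. As a result, FPGAs are beginning to be recognized as devices that attract as much attention as GPUs and AI chips, and as devices whose adoption can be expected to deliver sufficient benefits such as higher computational performance and better performance per watt. In recent years, FPGA products that integrate High Bandwidth Memory (HBM) in the same package have been increasing in the high-performance computing field, and this trend is spreading to low-cost embedded FPGA products as well. Meanwhile, in the GPU field, which has a head start in adopting HBM, discussion of the effective access performance of HBM is beginning. This report therefore organizes the issues in combining high-level descriptions with HBM on FPGAs and, by raising problems for future FPGA design and development, discusses the potential for efficient acceleration.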

researchmap
Performance Evaluation on GPU-FPGA Accelerated Computing Considering Interconnections between Accelerators International conference

Sano, Yuka, Kobayashi, Ryohei, Fujita, Norihisa, Boku, Taisuke

HEART2022: International Symposium on Highly-Efficient Accelerators and Reconfigurable Technologies 2022.6 HEART2022 Organization Committee

　More details

Event date： 2022.6

Language：English Presentation type：Oral presentation (general)

Venue：Japan Center for Computational Sciences, University of Tsukuba.

HPC systems are often equipped with graphics processing units (GPUs) as accelerators because of their high computing capability. GPUs are powerful computing devices; however, they operate inefficiently on applications that employ partially poor parallelism, non-regular computation, or frequent inter-node communication. To address these shortcomings of GPUs, field-programmable gate arrays (FPGAs) have been emerging in the HPC domain because their reconfigurable capabilities enable the construction of application-specific pipelined hardware and memory systems. Several studies have focused on improving overall application performance by combining GPUs and FPGAs, and the platforms for achieving this have adopted the approach of hosting these two devices on a single compute node; however, the necessity of this approach has not been discussed.

In this study, we evaluated this necessity quantitatively using an astrophysics application that performs radiative transfer to simulate the early-stage universe after the Big Bang. The application runs on a compute node equipped with a GPU and an FPGA, and the GPU and FPGA computation kernels are launched from a single CPU process in the application. We modified the code to enable the launch of the GPU and FPGA computation kernels from separate message-passing interface (MPI) processes. The MPI processes were assigned to two compute nodes, one equipped only with a GPU and the other only with an FPGA, and the execution performance of the application was compared against that of the original GPU-FPGA accelerated application. The results revealed that the performance degradation compared to the original GPU-FPGA accelerated application was approximately 2 to 3%, thereby demonstrating quantitatively that mounting the two devices on different compute nodes is acceptable in practical use, depending on the characteristics of the application.

researchmap
Performance Evaluation of Data Transfer API for Rank Level Approximate Computing on HPC Systems International conference

Morie, Yoshiyuki, Wada, Yasutaka, Kobayashi, Ryohei, Sakamoto, Ryuichi

24th Workshop on Advances in Parallel and Distributed Computational Models 2022.5

　More details

Event date： 2022.5

Language：English Presentation type：Oral presentation (general)

Approximate computing (AC) has attracted much attention as a way to optimize the tradeoffs among performance, power consumption, and the accuracy of computation results by adjusting data precision in applications. Even on HPC systems, AC is in demand to maximize performance under limited power budgets and hardware resources. To apply AC to HPC applications, we need to consider the characteristics of each MPI rank in an application and optimize each rank with an appropriate data precision. However, we also need to perform data transfer while converting the precision of the target data. This paper proposes data pack/unpack APIs, applicable to standard MPI programs for HPC systems, that convert the precision of the target data, and presents their performance evaluation. With the proposed APIs, we can express data transfer among ranks with different data precisions. The performance evaluation reveals the break-even point for applying AC to HPC applications from the perspective of data transfer volume.
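
To make the idea of precision-converting pack/unpack concrete, here is a minimal Python sketch that narrows double-precision values to single precision before "sending" and widens them again on receipt. It is not the API proposed in the paper: the function names `pack_reduced` and `unpack_restored` are hypothetical, no MPI call is involved, and the standard-library `struct` module stands in for the wire format.

```python
import struct


def pack_reduced(values):
    """Pack Python floats (float64) as little-endian float32 bytes (hypothetical helper)."""
    return struct.pack(f"<{len(values)}f", *values)


def unpack_restored(buf):
    """Unpack little-endian float32 bytes back into Python floats (hypothetical helper)."""
    count = len(buf) // 4
    return list(struct.unpack(f"<{count}f", buf))


data = [1.0, 0.1, 3.141592653589793]

wire = pack_reduced(data)            # what the sending rank would transmit
restored = unpack_restored(wire)     # what the receiving rank would see

full_size = len(struct.pack(f"<{len(data)}d", *data))  # size without conversion
max_error = max(abs(a - b) for a, b in zip(data, restored))

print(len(wire), full_size)  # 12 24
print(max_error < 1e-6)      # True
```

The transferred volume is halved, at the cost of a conversion pass on each side and a small rounding error. Whether that trade pays off depends on message size and conversion cost, which is the kind of break-even the abstract refers to.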

researchmap
2nd ACRi Panel Discussion: Young Researchers' Honest Opinions on the Good and Bad of the FPGA Industry Invited

小林, 諒平

第２回ACRi討論会：若手研究者の本音～FPGA業界の良いとこ／悪いとこ～ 2022.5

　More details

Event date： 2022.5

Language：Japanese Presentation type：Symposium, workshop panel (nominated)

researchmap
Implementation and Evaluation of an Astrophysics Simulation with Cross-Node GPU-FPGA Hybrid Acceleration

佐野, 由佳, 小林, 諒平, 藤田, 典久, 朴, 泰祐

第184回ハイパフォーマンスコンピューティング研究発表会 2022.5

　More details

Event date： 2022.5

Language：Japanese Presentation type：Oral presentation (general)

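In recent high-performance computing systems, GPUs (Graphics Processing Units), which offer high computational performance and memory bandwidth, are being actively adopted as accelerators. However, not every application suits GPUs: applications that partly contain operations ill-suited to GPUs, such as those whose parallelism is insufficient for the number of cores or those involving conditional branches, cannot fully exploit GPU performance. An approach has therefore been explored in which such GPU-unfriendly operations are offloaded to FPGAs (Field-Programmable Gate Arrays), which can flexibly build application-specific computation pipelines and memory systems, so that GPUs and FPGAs are used complementarily to improve overall application performance. Several studies have run applications using GPUs and FPGAs together, and the platforms used so far have been systems that host both devices on the same compute node. However, the necessity of that configuration has not been examined in detail. In this study, we therefore used an astrophysics application that simulates the formation of astronomical objects in the early universe with a GPU and an FPGA to quantitatively evaluate whether both devices need to be attached to the same machine. We reimplemented the existing code using MPI (Message Passing Interface) and modified it to run in a configuration in which the GPU and the FPGA are separated. We then evaluated application performance in the configuration where the GPU and FPGA are attached to the same machine and in the configuration where they are separated. The evaluation showed that running the application in the separated configuration limited the performance degradation to 2 to 3% compared with the same-machine configuration. These results quantitatively show that, when a GPU and an FPGA are used for cooperative computation, fast cooperative computation is possible even when they are attached to different machines, depending on the characteristics of the application.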

researchmap
Implementation of the Astrophysics Simulation Code ARGOT on a GPU-FPGA Equipped Node Using oneAPI

柏野, 隆太, 小林, 諒平, 藤田, 典久, 朴, 泰祐

第183回ハイパフォーマンスコンピューティング研究発表会 2022.3

　More details

Event date： 2022.3

Language：Japanese Presentation type：Oral presentation (general)

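The GPU (Graphics Processing Unit) is one of the most widely used accelerators in the HPC field. However, in scientific computing based on multiphysics, diverse workloads appear within a single simulation, and acceleration using GPUs alone is insufficient. Targeting such complex physical simulations, we aim to accelerate them by combining GPUs and FPGAs (Field Programmable Gate Arrays), and we are developing hardware, programming systems, and applications under the concept of CHARM (Cooperative Heterogeneous Acceleration by Reconfigurable Multidevices). A major challenge here is how to program these multiple devices. oneAPI, proposed by Intel and attracting attention in recent years, provides a single-language platform through DPC++, which is based on SYCL, and enables cooperative programming across multiple devices. In this paper, we implement the astrophysics simulation code ARGOT, which uses a GPU and an FPGA, with oneAPI and report its performance evaluation. A distinctive feature of this work is that, unlike the typical use of oneAPI, we do not develop with DPC++ alone but use oneAPI as a framework for combining existing partial code written in CUDA and OpenCL. As a result, we found that oneAPI makes it possible to implement programs in which multiple devices cooperate not only through programming in DPC++ but also by reusing existing source code written in other languages such as CUDA and OpenCL.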

researchmap
Implementation of GPU+FPGA Cooperative Computing for an Astrophysics Simulation with OpenACC

綱島, 隆太, 小林, 諒平, 藤田, 典久, 朴, 泰祐, Lee, Seyong, Vetter, Jeffrey S, 村井, 均, 中尾, 昌広, 辻, 美和子, 佐藤, 三久

第183回ハイパフォーマンスコンピューティング研究発表会 2022.3

　More details

Event date： 2022.3

Language：Japanese Presentation type：Oral presentation (general)

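In recent years, GPUs and FPGAs have been attracting attention as accelerators in the HPC field. FPGAs in particular are expected to perform well even on processing that GPUs handle poorly, and we are researching next-generation supercomputers that integrate the two. However, there is no standard method or language for programming that combines GPUs and FPGAs. Because NVIDIA currently dominates the GPU share in HPC, GPU processing is mainly written in CUDA. For FPGAs, on the other hand, high-level synthesis technology has made it possible to use OpenCL in place of hardware description languages. Programming with a combination of the two places a heavy burden on application programmers. OpenCL can also be used to program GPUs, but many existing applications are already written in CUDA or exist only as CPU versions, so rewriting them in OpenCL requires considerable effort. If code is to be rewritten in another language at all, it is ideal to write it in a more general and more abstract form. We are therefore developing MHOAT (Multi-Hybrid OpenACC Translator), an environment for programming both accelerators in a unified manner using OpenACC, a directive-based API, under the concept of CAMP (Cooperative Acceleration by Multi-device Programming). In this paper, we describe the implementation of mixed GPU and FPGA acceleration with MHOAT, targeting the ARGOT code, a real application in astrophysics.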

researchmap
Performance Evaluation of a Memory System for HPC Using an Addressable Cache for HBM2-Equipped FPGAs

藤田, 典久, 小林, 諒平, 山口, 佳樹, 朴, 泰祐

第183回ハイパフォーマンスコンピューティング研究発表会 2022.3

　More details

Event date： 2022.3

Language：Japanese Presentation type：Oral presentation (general)

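Field Programmable Gate Arrays (FPGAs) are attracting attention as a new class of accelerator in the high-performance computing field. Compared with other accelerators, FPGAs have the weakness of low external memory bandwidth, which is one of the barriers to using FPGAs in HPC. Some of the latest high-performance FPGAs are equipped with High Bandwidth Memory 2 (HBM2), and using them is expected to broaden the use of FPGAs in HPC. However, FPGAs have no memory network or cache as fixed functions, so a memory circuit capable of delivering HBM2 performance must be developed separately. In this paper, we present the implementation and performance evaluation of the HBM2 memory system for HPC that we are developing. We also report on the design and implementation of an API for using this system. An FPGA is an accelerator that can operate autonomously, and the API for this system takes advantage of that feature.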

researchmap
OpenACC Implementation and Performance Evaluation of the Cosmic Radiative Transfer Code ARGOT on a GPU Cluster

小林, 諒平, 藤田, 典久, 山口, 佳樹, 朴, 泰祐, 吉川, 耕司, 安部, 牧人, 梅村, 雅之

第183回ハイパフォーマンスコンピューティング研究発表会 2022.3

　More details

Event date： 2022.3

Language：Japanese Presentation type：Oral presentation (general)

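We are conducting research on GPU-FPGA hybrid systems, in which GPUs (Graphics Processing Units), which offer high computational performance and memory bandwidth, are coupled with FPGAs (Field Programmable Gate Arrays), which excel in computation and communication performance, so that the two are used complementarily. GPU-FPGA hybrid acceleration is needed because we expect it to be effective for multiphysics applications, that is, simulations involving multiple physical models and multiple simultaneous physical phenomena. In multiphysics, operations with various characteristics appear within a simulation, so acceleration with GPUs alone can be difficult. We therefore aim to improve overall application performance by using FPGAs to accelerate operations whose characteristics GPUs cannot handle well. However, the current implementation approach is multilingual programming using multiple programming languages, in which the computation kernels running on the GPU are written in CUDA and those running on the FPGA are written in OpenCL. Such a programming model imposes a heavy burden on programmers, so a programming environment that realizes GPU-FPGA cooperation with higher usability is needed. With this in mind, as a preliminary evaluation toward highly usable GPU-FPGA cooperation, we implemented the ARGOT code, which simulates the formation of astronomical objects in the early universe, with OpenACC and evaluated its single-node performance against an OpenMP-based CPU implementation and a CUDA-based GPU implementation. Because the results showed that it achieves performance comparable to the CUDA-based GPU implementation, in this paper we extend the OpenACC implementation to multiple nodes and multiple GPUs on a GPU cluster and report its performance evaluation.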

researchmap
GPU and FPGA Unified Programming of Astrophysics Real Application with OpenACC International conference

Tsunashima, Ryuta, Kobayashi, Ryohei, Fujita, Norihisa, Boku, Taisuke, Lee, Seyong, Vetter, Jeffrey S, Murai, Hitoshi, Nakao, Masahiro, Sato, Mitsuhisa

The 4th R-CCS International Symposium 2022.2

　More details

Event date： 2022.2

Language：English Presentation type：Poster presentation

In recent years, the power consumption of HPC systems, especially extremely large-scale ones, has become a serious problem. Accelerators such as GPUs are a popular solution, as shown by the share of GPU-equipped systems in the recent TOP500 list, where 7 of the top 10 systems are equipped with GPUs. The GPU achieves excellent parallel processing performance through a high degree of SIMD parallelism and very high memory bandwidth from memory such as HBM2. However, the GPU is not a perfect solution: its performance degrades in cases involving conditional branching, data dependencies, or low data parallelism. The FPGA (Field Programmable Gate Array), on the other hand, has attracted attention in recent years as another accelerator alongside the GPU. The performance characteristics of the FPGA are quite different from those of the GPU and the CPU, and there is room to apply it where the GPU does not work efficiently. Because an FPGA can be reconfigured as many times as needed, it is possible to construct an optimal circuit for each application. Since the achievable degree of parallelism is far higher on the GPU, the theoretical peak FLOPS of the FPGA is lower than that of the GPU, but the FPGA is expected to deliver higher performance in the cases mentioned above where the GPU does not work well. In addition, today’s high-end FPGAs have very high-speed external communication interfaces, which enable self-controlled communication between FPGAs across multiple computation nodes. We have been researching a combined accelerated computing platform with both GPUs and FPGAs under the concept called CHARM (Cooperative Heterogeneous Acceleration with Reconfigurable Multidevices), toward the development of a multi-hybrid accelerated system as a next-generation supercomputer platform in the post-Moore era. However, mainstream programming languages do not support programming that combines GPUs and FPGAs.
Several languages can be used to offload computations to the GPU, but since NVIDIA dominates the GPU share in HPC, GPU processing is mainly written in CUDA. For FPGAs, on the other hand, HLS (High-Level Synthesis) technology has made it possible to use OpenCL instead of an HDL (Hardware Description Language). Programming with a combination of the two places a heavy burden on application programmers and will slow the progress of scientific research. OpenCL can also be used to write GPU programs, but many existing applications are already written in CUDA or have only a CPU version, so rewriting them in OpenCL would be a considerable burden. Therefore, we are developing a programming environment for both accelerators using OpenACC under the concept of CAMP (Cooperative Acceleration by Multi-device Programming). Like OpenMP, OpenACC is an API that adds directives to CPU programs, so less rewriting effort is needed to offload computations to the GPU and FPGA, and the burden on programmers can be greatly reduced. In our research so far, we have developed MHOAT (Multi-Hybrid OpenACC Translator), a prototype source-to-source OpenACC compiler that uses the PGI compiler by NVIDIA and OpenARC by ORNL, which supports OpenACC for FPGAs, to realize a unified OpenACC programming environment for CHARM. In this presentation, we show that GPU + FPGA cooperative computing with OpenACC is possible by compiling and executing ARGOT, a real astrophysics simulation application, with MHOAT. Furthermore, as future work, we plan to eliminate the programming limitations caused by the complexity of MHOAT's back-end compilers by directly implementing OpenACC-to-OpenCL compilation for the GPU and FPGA in the Omni Compiler Infrastructure, which is used to implement MHOAT.

researchmap
Efficiency and Effectiveness Analysis of a Scratchpad Memory on FPGA and GPU for Diffuse Radiation Transfer Simulation International conference

FURUKAWA, Kazuki, YAMAGUCHI, Yoshiki, YOSHIKAWA, Kohji, KOBAYASHI, Ryohei, FUJITA, Norihisa, BOKU, Taisuke, UMEMURA, Masayuki

The International Conference on High Performance Computing in Asia-Pacific Region (HPC Asia 2022) 2022.1

　More details

Event date： 2022.1

Language：English Presentation type：Poster presentation

Venue：Online Online

Radiation hydrodynamics is a fundamental scientific concept for unveiling cosmic physical processes in astrophysics. Its enormous computational cost calls for a specialized approach accelerated by FPGAs and GPUs. This project therefore targets implementing ARGOT (Accelerated Radiative transfer on Grids using Oct-Tree) on them. Specifically, a part of ARGOT, the ART (Authentic Radiation Transfer) scheme, was accelerated on both architectures. The ART scheme handles rays that represent radiation and travel through the simulation space in straight lines, parallel to one another. The scheme performs both sequential and parallel computation on the ray-traced meshes. Memory throughput has been the most critical factor in accelerating ART because of its complicated and enormous memory accesses. This project therefore proposes a buffering scheme for ART, usable on both FPGAs and GPUs, to achieve sufficient acceleration. Its efficiency and effectiveness are also discussed.

researchmap
FPGA Memory System for HPC Applications using Addressable Cache International conference

Fujita, Norihisa, Kobayashi, Ryohei, Yamaguchi, Yoshiki, Boku, Taisuke

HPC Asia 2022 – International Conference on High Performance Computing in Asia‐Pacific Region 2022.1

　More details

Event date： 2022.1

Language：English Presentation type：Poster presentation

In our previous work, we implemented an astrophysics application for the early universe on a Field Programmable Gate Array (FPGA) cluster. We concluded that FPGAs can run parallel applications efficiently thanks to their high-performance direct inter-FPGA communication. However, the memory bandwidth of an FPGA is the bottleneck when implementing an HPC application on it. The FPGA board used in that work has only 4 channels of DDR4 memory (76.8GB/s), whereas other accelerators have more than 1TB/s of memory bandwidth.
The Intel Stratix 10 MX FPGA has High Bandwidth Memory 2 (HBM2), which provides up to 512GB/s of memory bandwidth. HBM2 aggregates many slow memory channels to achieve high performance. Although an FPGA does not have a sophisticated memory network, it must handle all memory channels simultaneously to obtain maximum performance from HBM2. This is a big challenge for FPGAs equipped with HBM2.
We propose a new memory system for FPGA and HPC applications using addressable caches. We believe that an automatic cache system like those of CPUs is not suitable for HPC FPGAs, because an automatic cache system consumes a lot of resources and reduces the resources available for computation. The addressable caches have data copy controllers to transfer data between the caches and memories. Because these caches are not automatic, data copies are described explicitly. Our system also has crossbars to maximize the performance and flexibility of data transfer to and from HBM2. In this poster, we show the work-in-progress implementation and the performance evaluation of the proposed memory system.

researchmap
OpenACC Implementation of Radiative Transfer Simulation Code International conference

Kobayashi, Ryohei, Fujita, Norihisa, Yamaguchi, Yoshiki, Boku, Taisuke, Yoshikawa, Kohji, Abe, Makito, Umemura, Masayuki

HPC Asia 2022 – International Conference on High Performance Computing in Asia‐Pacific Region 2022.1

　More details

Event date： 2022.1

Language：English Presentation type：Poster presentation

Graphics processing units (GPUs) offer good peak performance and high memory bandwidth, and they have been widely used in high-performance computing (HPC) systems as accelerators. However, they are not suitable for all applications, and they do not work efficiently for some. One such application is multiphysics simulation. Multiphysics is defined as coupled processes or systems involving more than one simultaneously occurring physical field, together with the study of and knowledge about these processes and systems. Multiphysics applications therefore perform simulations with multiple interacting physical properties, and a simulation contains various computations, some of which may not suit GPUs. Because of that, accelerating a simulation with GPUs alone is quite difficult, which is why we combine a GPU and an FPGA and make the FPGA cover the computations that do not suit the GPU. We call this concept Cooperative Heterogeneous Acceleration with Reconfigurable Multidevices (CHARM) and have been working on GPU-FPGA-accelerated computation for radiative transfer simulation in astrophysics as a proof of concept.
We are currently working on a programming environment that enables the use of both accelerators in a GPU-FPGA equipped HPC cluster system with OpenACC. Toward this, we investigated the performance of an OpenACC-based GPU implementation of the simulation code by comparing it with an OpenMP-based CPU implementation and a CUDA-based GPU implementation, and confirmed that there is almost no difference between the CUDA and OpenACC implementations.

researchmap
Multi-hetero Acceleration by GPU and FPGA for Astrophysics Simulation on oneAPI Environment International conference

Kashino, Ryuta, Kobayashi, Ryohei, Fujita, Norihisa, Boku, Taisuke

HPC Asia 2022 – International Conference on High Performance Computing in Asia‐Pacific Region 2022.1

　More details

Event date： 2022.1

Language：English Presentation type：Oral presentation (general)

GPU (Graphics Processing Unit) computing is one of the most popular acceleration methods for various high-performance computing applications. For scientific computations based on multi-physical phenomena, however, a single-device solution on a GPU is insufficient, because the timescales and degrees of parallelism involved are not easily handled by a GPU-only solution. We have been researching a combination of a GPU and an FPGA (Field Programmable Gate Array) for such complex physical simulations. The most challenging issue is how to program these multiple devices using a single code.

oneAPI, recently provided by Intel, is a programming paradigm supporting such a solution on a single-language platform using DPC++ based on SYCL 2020. However, there are no practical applications utilizing its full features or supporting heterogeneous multi-device programming to demonstrate its potential capability. In this study, we present the implementation and performance evaluation of our astrophysics code ARGOT, to which we apply the oneAPI solution with a GPU and an FPGA. To realize our concept of Cooperative Heterogeneous Acceleration by Reconfigurable Multidevices, also known as CHARM, as a type of next-generation accelerated supercomputing for complex multi-physical simulations, this study was conducted on our multi-heterogeneous accelerated cluster machine running at the University of Tsukuba.

Through this research, we found that the current oneAPI framework is effective not only for its typical programming in DPC++ but also for utilizing traditionally developed applications coded in several other languages, such as CUDA or OpenCL, to support multiple types of accelerators. As an example of a real application, we successfully implemented and executed an early-stage universe simulation with a fundamental astrophysics code that utilizes both the GPU and the FPGA effectively. In this paper, we demonstrate the actual procedure for programming multi-device acceleration over oneAPI with this method.

researchmap
An Efficient RTL Buffering Scheme for an FPGA-Accelerated Simulation of Diffuse Radiative Transfer International conference

Furukawa, Kazuki, Yokono, Tomoya, Yamaguchi, Yoshiki, Yoshikawa, Kohji, Fujita, Norihisa, Kobayashi, Ryohei, Boku, Taisuke, Umemura, Masayuki

International Conference on Field Programmable Technology 2021.12 FPT Steering Committee

　More details

Event date： 2021.12

Language：English Presentation type：Oral presentation (general)

Venue：New Zealand Online

In recent decades, FPGA-based HPC systems have been in the limelight. As an FPGA-accelerated application, the Accelerated Radiative transfer on Grids using Oct-Tree (ARGOT) program has been under development as a cosmic radiative transfer simulation code at the Center for Computational Sciences, University of Tsukuba. The ARGOT program was originally a GPU-accelerated program. However, it includes the Authentic Radiation Transfer (ART) scheme for diffuse photons, which is based on a ray-tracing method for parallel rays and accounts for more than 90% of the total computation time. After examining the characteristics of the ART scheme, we expect FPGA acceleration to be better suited to it than GPU acceleration. Therefore, in this paper, we focus on implementing the ART scheme on a large-scale FPGA. The prime issue for ART on an FPGA is found to be its high memory bandwidth requirement, because each processing element (PE) needs large mesh data while operating. As a result, the PEs together require terabyte-class aggregate external memory bandwidth, depending on the degree of parallelism. To meet this demand, we first conducted a mini-benchmark to investigate the use of HBM and the Xilinx HBM Subsystem on an HBM-equipped FPGA board, the Xilinx Alveo U280 Accelerator Card. However, the results revealed that HBM does not necessarily meet the bandwidth requirements of the ART operation. To deal with this limitation, we propose an application-specific buffering mechanism named the "PRefetchable and Instantly accessible Scratchpad Memory" (PRISM). We use the UltraRAM on the Virtex UltraScale+ FPGA chip as this local storage. PRISM stores the mesh data of a triangular-prism-shaped subspace cut out from the simulation space. In addition, to reduce routing congestion, we designed it to be tightly connected with the ART accelerators. The evaluation with 16 PEs shows that PRISM reduces memory access loads to less than 10% of those of a DMA model that accesses HBM.
We also discuss an intermediate data storage system using DDR4 SDRAM to cope with multi-FPGA operation.

researchmap
Preliminary Evaluation toward Realizing Approximate Computing through Dynamic Control of Computational Precision

和田, 康孝, 小林, 諒平, 坂本, 龍一, 森江, 善之

第181回ハイパフォーマンスコンピューティング研究発表会 2021.9

　More details

Event date： 2021.9

Language：Japanese Presentation type：Oral presentation (general)

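Approximate computing techniques, which optimize the tradeoff between computational precision and execution performance or power consumption, are beginning to spread. By using approximate computing, an application can obtain results of necessary and sufficient precision while maximizing execution performance or reducing power consumption. To further extend these benefits, approximate computing must be applied in a way that fits a variety of situations, such as systems equipped with accelerators like GPGPUs and FPGAs, and systems built by connecting multiple nodes with different configurations. In particular, it is likely to be important to adjust computational precision dynamically at application run time according to the structure of the application and the state of the system. Against this background, this paper assumes that computational precision is changed and adjusted dynamically at run time, and evaluates the impact on, and tradeoff between, execution performance and computational results when this is applied at the application level.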

HBM2 Memory System for HPC Applications on an FPGA International conference

Fujita, Norihisa, Kobayashi, Ryohei, Yamaguchi, Yoshiki, Boku, Taisuke

FPGA for HPC Workshop 2021 (HPC FPGA 2021)

Event date： 2021.9

Field Programmable Gate Arrays (FPGAs) have been targeted as a new accelerator in the HPC field because the barrier to using them has gradually been lowered by the widespread use of high-level synthesis (HLS). The external memory bandwidth of FPGAs, however, has been much lower than that of other accelerators widely used in HPC, such as NVIDIA V100 GPUs. The latest FPGAs can use High Bandwidth Memory 2 (HBM2), which provides up to 512 GB/s of memory bandwidth, so we believe FPGAs will be a viable option for speeding up applications. Unlike CPUs and GPUs, though, FPGAs do not have caches or memory networks to exploit the full potential of HBM2, which may limit application efficiency. In this paper, we propose a memory system for HBM2 and HPC applications. We present a prototype implementation of the system and evaluate its performance. We also demonstrate the use of the proposed system from an application written in C++ and developed with HLS.

FPGA-cluster deployment and adoption to application users Invited International conference

Kobayashi, Ryohei

FPGA for HPC Workshop 2021 (HPC FPGA 2021) 2021.9

Event date： 2021.9

Language：English Presentation type：Symposium, workshop panel (nominated)

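HBM-FPGA implementation of a large-scale radiative transfer simulation of diffuse photon and its subjects

古川, 和輝, 横野, 智也, 山口, 佳樹, 吉川, 耕司, 藤田, 典久, 小林, 諒平, 朴, 泰祐, 梅村, 雅之

Forum on Information Technology 2021 2021.8 Information Processing Society of Japan

Event date： 2021.8

Language：Japanese Presentation type：Oral presentation (general)

Venue：Online

We discuss accelerating the cosmic radiative transfer simulation ARGOT (Accelerated Radiative transfer on Grids using Oct-Tree; Center for Computational Sciences, University of Tsukuba) using an HBM-FPGA. This simulation is known to be difficult to accelerate sufficiently because memory bandwidth becomes the bottleneck even with large, high-bandwidth memory such as HBM. To improve memory access efficiency, this study incorporates fine-grained dataflow control into the computation buffers to reduce the number of memory accesses, aiming at a dramatic improvement in computation speed. In this report, focusing on the property that each isotropically diffusing ray travels in a straight line, we show that highly efficient stream computation can be achieved by partitioning the computational space into triangular pyramids and optimizing their update order.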

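Developing Heterogeneous Acceleration Programs with oneAPI on Nodes Equipped with Both GPUs and FPGAs

柏野, 隆太, 小林, 諒平, 藤田, 典久, 朴, 泰祐

2021 Summer Workshop on Parallel/Distributed/Cooperative Processing (SWoPP2021) 2021.7

Event date： 2021.7

Language：Japanese Presentation type：Oral presentation (general)

We aim to improve overall application performance by complementarily using GPUs, which excel in memory bandwidth and in compute performance based on spatial parallelism, and FPGAs, which excel in communication performance and in compute performance based on pipeline parallelism. We call this concept CHARM (Cooperative Heterogeneous Acceleration with Reconfigurable Multidevices) and expect it to work effectively for diverse HPC workloads. However, GPUs and FPGAs are generally programmed in different development environments, which places a heavy burden on developers. A unified development environment that resolves this complexity is therefore needed. The oneAPI development environment provided by Intel is expected to be effective here. oneAPI provides a unified language across different accelerators and an API for executing the offloaded modules in an integrated manner. In this paper, we report a method for developing heterogeneous acceleration programs with oneAPI targeting two accelerators, an NVIDIA GPU and an Intel FPGA.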

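Proposal and Implementation of an HBM2 Memory System for HPC Applications on FPGAs

藤田, 典久, 小林, 諒平, 山口, 佳樹, 朴, 泰祐

2021 Summer Workshop on Parallel/Distributed/Cooperative Processing (SWoPP2021) 2021.7

Event date： 2021.7

Language：Japanese Presentation type：Oral presentation (general)

Field Programmable Gate Arrays (FPGAs) are attracting attention as a new accelerator in high-performance computing. In recent years, High Level Synthesis (HLS) development environments have matured, making development in languages such as C and C++ increasingly practical. Weak external memory bandwidth has been a barrier to using FPGAs in HPC, but vendors have begun releasing FPGA chips equipped with High Bandwidth Memory 2 (HBM2), which offer up to 512 GB/s of memory bandwidth. However, FPGAs lack features for using memory, such as caches and memory networks, and this is one of the challenges in using HBM2 on FPGAs. In this paper, we propose and implement an HBM2 memory system suited to HPC applications and report its performance evaluation. We also show that the proposed system can be used from kernels written with HLS.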

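Proposal and Implementation of a Floating-Point Sorting Library for FPGAs

小林, 諒平, 三浦, 賢人, 藤田, 典久, 朴, 泰祐, 天笠, 俊之

SWoPP2021: Parallel/Distributed/Cooperative Systems, Dependable Computing, and General Topics 2021.7

Event date： 2021.7

Language：Japanese Presentation type：Oral presentation (general)

We have focused on data sorting, a basic arithmetic operation, and have been developing a sorting library usable from OpenCL, a programming model for FPGAs (Field-Programmable Gate Arrays). In this paper, we report the proposal and implementation of a mechanism that supports floating-point data. The proposed sorting library is built by combining three hardware sorting algorithms. Compared with a merge sort algorithm reimplemented for the OpenCL programming model, it consumes more than twice the overall hardware resources but achieves sorting performance more than three orders of magnitude higher.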

Performance Evaluation of HBM2 on an Intel Stratix 10 MX device for HPC Applications International conference

Fujita, Norihisa, Kobayashi, Ryohei, Yamaguchi, Yoshiki, Boku, Taisuke

ISC 2021: International Supercomputing Conference

Event date： 2021.6 - 2021.7

High Level Synthesis (HLS) reduces the difficulty of developing FPGA hardware by enabling us to use software languages such as C, C++, and OpenCL to create FPGA hardware logic. HLS also enables HPC application developers to implement their applications on FPGA systems. However, the memory bandwidth of an FPGA is lower than that of other accelerators used in HPC, such as GPUs. An FPGA board for HPC with DDR4 memory had a maximum memory bandwidth of only 76.8 GB/s. Recently, FPGA chips with High Bandwidth Memory 2 (HBM2) have become available; HBM2 uses a 3D-stacked memory structure and aggregates many channels to obtain high bandwidth. The latest Intel Stratix 10 offers up to 512 GB/s of memory bandwidth. This is still around a quarter of the bandwidth of a GPU, but the ratio is much better than before. In this poster, we evaluate the performance of HBM2 in an Intel Stratix 10 MX FPGA. We implement a tester module that supports not only sequential access but also stride access, which is widely used in HPC applications. In addition to the performance evaluation, we discuss how to utilize the many memory channels of HBM2 and propose our memory subsystem for HBM2 and HPC applications on an FPGA.
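
To illustrate why stride access deserves its own test alongside sequential access, the sketch below generates both address patterns and counts how the accesses fall across memory channels. It is a toy model, not the poster's tester module: the channel count (32), the 256-byte interleaving, and the function names are assumptions chosen for illustration and do not describe the actual address mapping of the Stratix 10 MX.

```python
from collections import Counter


def access_addresses(base, count, elem_bytes, stride_elems=1):
    """Byte addresses of `count` accesses; stride_elems=1 means sequential."""
    return [base + i * stride_elems * elem_bytes for i in range(count)]


def channel_histogram(addrs, num_channels, interleave_bytes):
    """Count accesses per channel under simple block interleaving (assumed)."""
    return Counter((a // interleave_bytes) % num_channels for a in addrs)


NUM_CHANNELS = 32   # hypothetical number of memory channels
INTERLEAVE = 256    # hypothetical interleaving granularity in bytes

sequential = access_addresses(0, 1024, 32)
strided = access_addresses(0, 1024, 32, stride_elems=256)

seq_hist = channel_histogram(sequential, NUM_CHANNELS, INTERLEAVE)
str_hist = channel_histogram(strided, NUM_CHANNELS, INTERLEAVE)

print("channels used (sequential):", len(seq_hist))
print("channels used (strided):   ", len(str_hist))
```

Under these assumed parameters the sequential pattern spreads evenly over all channels, while a stride that is a multiple of `NUM_CHANNELS * INTERLEAVE` bytes lands on a single channel. That kind of imbalance is what a memory subsystem in front of HBM2 has to absorb.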

A Sorting Library for FPGA Implementation in OpenCL Programming International conference

Kobayashi, Ryohei, Miura, Kento, Fujita, Norihisa, Boku, Taisuke, Amagasa, Toshiyuki

International Symposium on Highly Efficient Accelerators and Reconfigurable Technologies (HEART 2021) 2021.6

Event date： 2021.6

Language：English Presentation type：Oral presentation (general)

In this study, we focus on data sorting, which is a basic arithmetic operation, and we present a sorting library that can be used with the OpenCL programming model for field-programmable gate arrays (FPGAs). Our sorting library is built by combining three hardware sorting algorithms. It consumes more than twice the overall hardware resources compared to the merge sort restructured for the OpenCL programming model for FPGAs. However, its operating frequency is 1.09x higher and its sorting throughput is three orders of magnitude greater than the baseline.

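FPGA Implementation of a Regular Path Query Accelerator Using Compaction

小林, 諒平, 三浦賢人, 藤田典久, 朴泰祐, 天笠俊之

Technical Meeting on Reconfigurable Systems, June 2021 2021.6

Event date： 2021.6

Language：Japanese Presentation type：Oral presentation (general)

Graph structures are effective for representing a wide variety of everyday data. With the spread of big data analytics, graph-structured data is now used in many fields. One way to extract the data a user wants from such a graph is the regular path query (RPQ), which searches the graph for paths with a specified sequence of edges and returns the start and end nodes of those paths. In this study, we propose a method for evaluating RPQs in a pipelined manner and its FPGA implementation. In our evaluation, the implemented RPQ accelerator achieved a speedup of up to about three orders of magnitude over the baseline method. We also propose an extension that handles larger graphs and confirmed that it operates correctly on real hardware.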
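
As a software illustration of what such a query returns, the sketch below evaluates a fixed sequence of edge labels on a small edge-labeled graph and reports the (start, end) node pairs. It is a plain reference evaluation, assuming the query is a simple label sequence; it is not the pipelined FPGA design of the paper, and the function name and sample graph are hypothetical.

```python
from collections import defaultdict


def eval_label_sequence(edges, labels):
    """Return {(start, end)} for paths whose edge labels match `labels` in order.

    edges:  iterable of (src, label, dst) triples
    labels: list of edge labels, e.g. ["knows", "worksAt"]
    """
    if not labels:
        return set()
    # Index edges as label -> src -> {dst}.
    by_label = defaultdict(lambda: defaultdict(set))
    for src, label, dst in edges:
        by_label[label][src].add(dst)
    # Stage 0: every edge carrying the first label starts a candidate path.
    pairs = {(src, dst)
             for src, dsts in by_label[labels[0]].items()
             for dst in dsts}
    # Each later label is one stage: extend the paths, keep (start, current).
    for label in labels[1:]:
        step = by_label[label]
        pairs = {(start, nxt)
                 for start, cur in pairs
                 for nxt in step.get(cur, ())}
    return pairs


graph = [
    ("a", "knows", "b"),
    ("b", "worksAt", "c"),
    ("a", "knows", "d"),
    ("d", "worksAt", "c"),
    ("d", "knows", "a"),
]
print(eval_label_sequence(graph, ["knows", "worksAt"]))
```

Each label in the query corresponds to one stage that consumes the pairs produced by the previous stage, which is the structure that lends itself to pipelined evaluation.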

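Performance Evaluation of an FPGA Board with HBM2 Memory

藤田, 典久, 小林, 諒平, 山口, 佳樹, 朴, 泰祐

178th High Performance Computing Research Meeting 2021.3

Event date： 2021.3

Language：Japanese Presentation type：Oral presentation (general)

In recent years, High Level Synthesis (HLS) has advanced, lowering the barrier to Field Programmable Gate Array (FPGA) development. However, the memory bandwidth of FPGAs is lower than that of other accelerators, which has been a barrier to using FPGAs in HPC. Vendors have now begun releasing FPGA chips equipped with High Bandwidth Memory 2 (HBM2), which offer up to 512 GB/s of memory bandwidth. Although this is still about one quarter of the bandwidth of Graphics Processing Unit (GPU) accelerators, it is an improvement over the previous gap of more than an order of magnitude. In this paper, we describe a performance evaluation of the HBM2 memory on an Intel Stratix 10 FPGA and a method for applying it to HPC applications.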

Preliminary Evaluation of Multi-hybrid Acceleration for Radiative Transfer Simulation by OpenACC International conference

Kobayashi, Ryohei, Fujita, Norihisa, Yamaguchi, Yoshiki, Boku, Taisuke, Yoshikawa, Kohji, Abe, Makito, Umemura, Masayuki

The 3rd R-CCS international symposium 2021.2

Event date： 2021.2

Language：English Presentation type：Poster presentation

Graphics processing units (GPUs) offer good peak performance and high memory bandwidth, and they have been widely used as accelerators in high-performance computing (HPC) systems. However, they are not suitable for all applications, and some applications do not run efficiently on them. One such application is multiphysics simulation. Multiphysics is defined as coupled processes or systems involving more than one simultaneously occurring physical field, together with the study of and knowledge about these processes and systems. Multiphysics applications therefore simulate multiple interacting physical properties, a single simulation contains various kinds of computation, and some of them may be unsuited to GPUs. For this reason, accelerating a simulation with GPUs alone is quite difficult, which is why we combine GPUs and FPGAs and let the FPGA cover the computations unsuited to GPUs. We call this concept Cooperative Heterogeneous Acceleration with Reconfigurable Multidevices (CHARM) and have been working on GPU-FPGA-accelerated computation for radiative transfer simulation in astrophysics as a proof of concept.
The implementation method of GPU-FPGA-accelerated computation is a mixture of CUDA and OpenCL programming: the computation kernels running on GPUs are written in CUDA and those running on FPGAs are written in OpenCL. We do not write all computation kernels in OpenCL for three reasons. First, since most existing HPC applications are CUDA-based, rewriting the entire code in OpenCL is very burdensome for programmers. Second, even though OpenCL is designed to run applications in a heterogeneous environment, using GPUs and FPGAs at the same time requires compiling and linking the computation kernels separately with the OpenCL compiler for GPUs and the OpenCL compiler for FPGAs, which is essentially the same situation as mixing the CUDA and OpenCL programming environments. Finally, most GPUs used in HPC are made by NVIDIA, and it is easier to maximize GPU performance with CUDA, a programming model that follows the GPU architecture. For these reasons, we use a mixture of CUDA and OpenCL programming. On the other hand, such multilingual programming imposes a heavy burden on programmers, so a programming environment with higher usability is required.
We are currently working on a programming environment that uses OpenACC to drive both accelerators in a GPU-FPGA-equipped HPC cluster system. Since OpenACC is a directive-based programming model, directives can tell the compiler which part of the application should be offloaded to which accelerator. In addition, Oak Ridge National Laboratory (ORNL) is developing a compiler that allows computation kernels for FPGAs as well as GPUs to be written in OpenACC. We are collaborating with ORNL with the goal of realizing cooperative computation on both accelerators in a GPU-FPGA-equipped HPC cluster system, and as part of this collaboration we use the compiler being developed by ORNL to realize the highly usable GPU-FPGA-accelerated computation described above.
Given this background, we implement the radiative transfer simulation code with OpenACC and evaluate its performance against an OpenMP-based CPU implementation and a CUDA-based GPU implementation. Moreover, we introduce a data transfer method between the GPU and the FPGA for realizing highly usable GPU-FPGA-accelerated computation. With this method, GPU computation kernels that are assumed to exchange data with FPGAs by PCIe DMA can be implemented in OpenACC.

Multi-device Programming Environment for GPU and FPGA Cooperative Acceleration International conference

Tsunashima, Ryuta, Kobayashi, Ryohei, Fujita, Norihisa, Boku, Taisuke, Lee, Seyong, Vetter, Jeffrey, Murai, Hitoshi, Nakao, Masahiro, Sato, Mitsuhisa

The 3rd R-CCS international symposium 2021.2

Event date： 2021.2

Language：English Presentation type：Poster presentation

In recent High Performance Computing (HPC), hardware acceleration is becoming common because accelerators offer a high performance-to-power ratio and the flexibility to attach to the host CPU through a universal bus such as PCIe. In particular, the Graphics Processing Unit (GPU) provides very high parallel processing performance and high memory bandwidth, making it the most popular accelerator. However, GPU performance depends heavily on a large degree of SIMD parallelism, and it is difficult to sustain high performance on programs with frequent branch operations or with a low degree of parallelism even in part of the code.
By contrast, the Field Programmable Gate Array (FPGA) has received attention as a different type of accelerator from the GPU. An FPGA is a device whose circuit can be reconfigured any number of times to fit the target application. Its high performance comes mainly from pipelined operation and from circuits optimized for the computation at hand, even one with frequent conditional branches. We have been focusing on the flexibility of FPGAs to compensate for the weaknesses of GPUs.
However, application users of a coupled GPU and FPGA system face considerable programming effort in traditional programming environments. In HPC, the most popular language for GPUs is CUDA by NVIDIA, but CUDA is still difficult for application users. Therefore, OpenACC, a framework with a higher level of abstraction than CUDA, has become popular in recent years. Traditional FPGA programming, on the other hand, uses a hardware description language such as Verilog HDL or VHDL, which is a burden for application users. Recently, High Level Synthesis (HLS) has become available even on high-end FPGAs, so application users can write code in OpenCL. Moreover, several recent studies have enabled OpenACC coding for FPGAs. In this study, we propose a new unified programming environment for multi-device cooperative computation, aimed at the next-generation accelerated supercomputer framework, called Cooperative Acceleration by Multi-device Programming, or CAMP for short. We provide a unified programming system based on OpenACC for a platform equipped with both a GPU and an FPGA. To realize this concept, we have developed a programming environment called the Multi-Hybrid OpenACC Translator, or MHOAT for short. We show the basic concept and prototype system of MHOAT, with an evaluation of both the amount of code and the computing performance.

Implementation and Performance Evaluation of Space Radiative Transfer Code on multiple-FPGAs International conference

Fujita, Norihisa, Kobayashi, Ryohei, Yamaguchi, Yoshiki, Boku, Taisuke, Yoshikawa, Kohji, Abe, Makito, Umemura, Masayuki

The 3rd R-CCS international symposium 2021.2

Event date： 2021.2

Language：English Presentation type：Poster presentation

In recent years, Field Programmable Gate Arrays (FPGAs) have been widely studied for High Performance Computing (HPC). Traditionally, FPGA hardware had to be described in low-level languages such as Verilog HDL or VHDL, which are difficult for HPC researchers to use. High-Level Synthesis (HLS) development environments ease this problem by allowing software languages such as C, C++, and OpenCL to be used for FPGA development.
As an HLS development environment, we use the Intel FPGA SDK for OpenCL in this study. The SDK allows us to describe FPGA hardware in the OpenCL language. Moreover, the SDK provides FPGA-specific extensions to the OpenCL language for optimization on an FPGA. One of these extensions is called a “channel”. It connects two OpenCL kernels inside an FPGA and transfers data between them directly. As a result, transferring data over channels is much faster and more efficient than the traditional method of using external memory as a buffer. However, the channel extension supports only intra-FPGA communication, not inter-FPGA communication. Therefore, we have proposed the Communication Integrated Reconfigurable CompUting System (CIRCUS), an inter-FPGA communication framework for OpenCL that allows inter-FPGA communication over channels.
The Center for Computational Sciences (CCS) at the University of Tsukuba has been developing the Accelerated Radiative transfer on Grids using Oct-Tree (ARGOT) program, which solves the space radiative transfer problem in the early stage of the universe. ARGOT combines two algorithms to solve the radiative transfer problem. The ARGOT method computes the radiative transfer from a point source such as a star. The Authentic Radiation Transfer (ART) method computes the radiative transfer from sources diffused in space. The ART method takes approximately 90% of the computation time and is the dominant part of the ARGOT program. In this poster, we optimize the ART method for the Intel Stratix 10 FPGA and apply the CIRCUS communication framework to it.
We use the Pre-PACS-X (PPX) cluster for performance evaluation, which is a development platform for the future supercomputer at CCS. It has four Bittware 520N FPGA boards equipped with Intel Stratix 10 FPGAs. Each board has four external QSFP28 ports supporting inter-FPGA communication at up to 100 Gbps. We build a 2x2 2D-torus network connecting the four boards with optical cables at 100 Gbps.
We evaluate the performance of the FPGA implementation using CIRCUS in comparison with the CUDA+MPI implementation. Because the FPGA implementation is preliminary, the problem size that the FPGA can solve is restricted: it supports only a problem size of 32^3 per FPGA. Therefore, we evaluate weak scaling with one, two, and four accelerators (GPU or FPGA). The FPGA implementation is 5.70, 8.41, and 10.6 times faster than the GPU implementation on one, two, and four nodes, respectively. The parallelized implementation of the ART method using CIRCUS shows better efficiency than the CUDA implementation: the FPGA achieves a parallel efficiency of 0.924 on four nodes, whereas the GPU achieves 0.492.
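
The reported speedups and efficiencies can be cross-checked with a little arithmetic. The sketch below assumes the usual weak-scaling definition of parallel efficiency, E(N) = T(1) / T(N); the abstract does not state the definition, so this is an assumption, and the function names are hypothetical. Under that definition the FPGA-over-GPU speedup on N nodes equals the single-node speedup multiplied by the ratio of the two efficiencies.

```python
def weak_scaling_efficiency(t_single: float, t_parallel: float) -> float:
    """E(N) = T(1) / T(N) when the problem size per node is fixed."""
    return t_single / t_parallel


def speedup_at_scale(speedup_single: float, eff_fast: float, eff_slow: float) -> float:
    """Speedup of the faster device on N nodes.

    T_slow(N) / T_fast(N)
      = (T_slow(1) / eff_slow) / (T_fast(1) / eff_fast)
      = speedup_single * eff_fast / eff_slow
    """
    return speedup_single * eff_fast / eff_slow


# Figures quoted in the abstract: 5.70x on one node,
# efficiencies 0.924 (FPGA) and 0.492 (GPU) on four nodes.
estimate = speedup_at_scale(5.70, 0.924, 0.492)
print(f"estimated 4-node speedup: {estimate:.2f}")
```

The estimate comes out at roughly 10.7, close to the reported 10.6. The small gap is plausibly due to rounding in the published figures.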

Building a Network Intrusion Detection System with FPGA/GPU Cooperation

菊地, 駿太, 池上, 努, Akram, ben Ahmed, 工藤, 知宏, 小林, 諒平, 藤田, 典久, 朴, 泰祐

2021-01-CPSY-RECONF-VLD-ARC-SLDM 2021.1

Event date： 2021.1

Language：Japanese Presentation type：Oral presentation (general)

Heterogeneous computing is becoming common. In this study, we built a network intrusion detection system (NIDS) as a proof-of-concept application in which an FPGA and a GPU cooperate. An NIDS monitors the network and raises an alert when an input matches a malicious packet. In our system, the FPGA handles input of more than 100 Gbps, which other processing units cannot handle. Because FPGAs are suited to simple tasks, the FPGA pre-filters suspicious packets. The GPU then performs exact pattern matching, which can handle patterns of various lengths. Future work is to increase the processing throughput.

Performance Evaluation of OpenCL-Enabled Inter-FPGA Optical Link Communication Framework CIRCUS and SMI International conference

Kashino, Ryuta, Kobayashi, Ryohei, Fujita, Norihisa, Boku, Taisuke

HPC Asia 2021 (International Conference on High Performance Computing in Asia-Pacific Region) 2021.1

Event date： 2021.1

Language：English Presentation type：Oral presentation (general)

In recent years, Field Programmable Gate Arrays (FPGAs) have attracted much attention as accelerators in High Performance Computing (HPC) research. One of the strong features of current FPGA devices, besides their adjustability, is their ability to achieve high-bandwidth communication over direct optical links to construct multi-FPGA platforms. However, programming FPGAs for user applications is not easy. With more user-friendly programming environments, FPGAs could be applied to various HPC applications on multi-FPGA platforms. Of the several studies aimed at making the FPGA communication feature usable from high-level synthesis, we focus on two systems, the Communication Integrated Reconfigurable CompUting System (CIRCUS) and the Streaming Message Interface (SMI), both of which are available on Intel FPGAs with direct optical links with a peak performance of 40 to 100 Gbps. In both systems, a user can access the optical link from OpenCL kernels, which makes high-level programming of HPC applications possible. In this paper, we introduce them for practical cases and compare their implementations and performance on real systems. Our evaluation shows that CIRCUS achieves a bandwidth of up to 90 Gbps for single point-to-point communication over a 100-Gbps optical link using OpenCL code, 2.7 times faster than SMI implemented on the same platform. We also confirmed that broadcast data transfer among four FPGAs using CIRCUS reaches up to 31 Gbps, 5.3 times faster than that achieved using SMI. In addition, we determined the main cause of the performance bottleneck of SMI when it is applied to a 100-Gbps platform and compared it with the CIRCUS implementation.

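GPU-FPGA Device Cooperation Through Mixed OpenACC and OpenCL Programming

小林, 諒平, 藤田, 典久, 朴, 泰祐

177th High Performance Computing Research Meeting 2020.12

Event date： 2020.12

Language：Japanese Presentation type：Oral presentation (general)

We are studying GPU-FPGA hybrid systems in which a GPU (Graphics Processing Unit), which offers high compute performance and memory bandwidth, works together with an FPGA (Field Programmable Gate Array), which excels in computation and communication performance, so that the two are used complementarily. GPU-FPGA cooperative acceleration is needed because we expect it to be effective for multiphysics applications, that is, simulations involving multiple physical models or multiple simultaneously occurring physical phenomena. Because computations with various characteristics appear within a multiphysics simulation, acceleration with GPUs alone can be difficult. We therefore aim to improve overall application performance by using FPGAs to accelerate the computations that GPUs cannot handle well. However, the current implementation approach is multilingual programming, in which kernels running on the GPU are written in CUDA and kernels running on the FPGA are written in OpenCL. Such a programming model places a heavy burden on programmers, so a programming environment that enables GPU-FPGA cooperation with higher usability is needed. With this in mind, as a preliminary evaluation toward highly usable GPU-FPGA cooperation, this paper presents a framework that makes the GPU and FPGA cooperate through a combination of OpenACC, a programming model with a higher level of abstraction than CUDA, and OpenCL, with the aim of improving performance.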

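A Study of GPU Device Memory Management with OpenACC

渡邉, 孔英, 菊池, 航平, 柏野, 隆太, 綱島, 隆太, 藤田, 典久, 小林, 諒平, 朴, 泰祐

177th High Performance Computing Research Meeting 2020.12

Event date： 2020.12

Language：Japanese Presentation type：Oral presentation (general)

When accelerating an application by porting it to a GPU, data movement between CPU memory and GPU memory must be managed. When a program written in OpenACC is compiled with the PGI compiler, the programmer can choose to have data movement managed automatically or to describe it explicitly. In this study, we experimentally compared and examined data movement management by both methods and the resulting performance. We found that, depending on the data access pattern, automatic management can reduce data transfers and help improve performance.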

Toward OpenACC-enabled GPU-FPGA Accelerated Computing International conference

Fujita, Norihisa, Kobayashi, Ryohei, Yamaguchi, Yoshiki, Yoshikawa, Kohji, Abe, Makito, Umemura, Masayuki

2020 IEEE International Conference on Cluster Computing (CLUSTER 2020) 2020.9

Event date： 2020.9

Language：English Presentation type：Poster presentation

Field-programmable gate arrays (FPGAs) have garnered significant interest in research on high-performance computing because their computation and communication capabilities have drastically improved in recent years due to advances in semiconductor integration technologies that rely on Moore's Law. These improvements reveal the possibility of implementing a concept to enable on-the-fly offloading computation at which CPUs/GPUs perform poorly to FPGAs while performing low-latency data movement. We think that this concept is key to improving the performance of heterogeneous supercomputers using accelerators such as the GPU. In this paper, we propose a GPU-FPGA-accelerated simulation based on the concept and show preliminary results of the proposed concept.

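Efficient Use of HBM Integrated into FPGAs: A Study

古川, 和輝, 横野, 智也, 山口, 佳樹, 吉川, 耕司, 藤田, 典久, 小林, 諒平, 朴, 泰祐, 梅村, 雅之

IEICE Technical Committee on Reconfigurable Systems (RECONF) 2020.9 Institute of Electronics, Information and Communication Engineers (IEICE), Technical Committee on Reconfigurable Systems

Event date： 2020.9

Language：Japanese Presentation type：Oral presentation (general)

Venue：Japan, Online (Zoom)

As acceleration with multiple FPGAs draws expectations in high-performance computing, the concept of AiS (Accelerators in Switch) is attracting attention. AiS proposes embedding application-specific compute units in the communication fabric connecting FPGAs, realizing a tightly coupled communication-and-computation mechanism and thereby improving system performance. The Center for Computational Sciences at the University of Tsukuba has developed the cosmic radiative transfer simulation code ARGOT (Accelerated Radiative transfer on Grids using Oct-Tree), and research is under way to speed up the simulation system by applying AiS to it. In this study, we propose accelerating the ART (Authentic Radiation Transfer) scheme of ARGOT on FPGAs. Because ART operates on a three-dimensional grid space, the resulting near-random memory access control is expected to be handled well by FPGAs. On the other hand, the enormous volume of mesh data accessed during computation is difficult to hold in on-chip BRAM and similar resources, which has caused performance degradation. This paper therefore focuses on HBM (High Bandwidth Memory) and proposes an implementation of the ART scheme using it. We first discuss the memory access performance of HBM on the Xilinx Alveo U280. We then assume that on-chip RAM (BRAM and URAM) is used as an SPM (Scratchpad Memory) when reading mesh data from HBM, examine the SPM access rate at which memory access does not become a bottleneck, and discuss techniques for reducing the number of external memory accesses.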

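Considerations on an HBM-FPGA Implementation for Large-Scale Radiative Transfer Computation of Recombination Photons

古川, 和輝, 横野, 智也, 山口, 佳樹, 吉川, 耕司, 藤田, 典久, 小林, 諒平, 朴, 泰祐, 梅村, 雅之

19th Forum on Information Technology 2020.9

Event date： 2020.9

Language：Japanese Presentation type：Oral presentation (general)

One project at the Center for Computational Sciences, University of Tsukuba, aims to elucidate astrophysical phenomena through cosmic radiative transfer simulation. The simulation uses the ARGOT (Accelerated Radiative transfer on Grids using Oct-Tree) method, which consists of energy computations for radiation from stars and from the interstellar medium. The scheme for the latter, ART (Authentic Radiation Transfer), is expected to gain a dramatic speedup from FPGA implementation because FPGAs allow random memory access, but a speedup that substantially exceeds the GPU implementation has not yet been achieved. In this study, we therefore discuss speeding up the acceleration unit, including its memory system and a review of the computation method.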

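Accelerating Radiative Transfer Computation Based on the Ray-Tracing Method Using Stratix 10 FPGAs

藤田, 典久, 小林, 諒平, 山口, 佳樹, 朴, 泰祐, 吉川, 耕司, 安部, 牧人, 梅村, 雅之

175th High Performance Computing Research Meeting (SWoPP2020) 2020.7

Event date： 2020.7

Language：Japanese Presentation type：Oral presentation (general)

In our previous work, we implemented the Authentic Radiative Transfer (ART) method, used for cosmic radiative transfer problems, on an Arria 10 FPGA and evaluated its performance. In this paper, we optimize the ART method for the Stratix 10 FPGA, the latest Intel Field Programmable Gate Array (FPGA), and evaluate its performance. We also realize parallel computation using the Communication Integrated Reconfigurable CompUting System (CIRCUS), the inter-FPGA communication framework we have proposed, and evaluate performance when using multiple FPGAs.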

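Performance Evaluation of CIRCUS and SMI, OpenCL-Enabled Inter-FPGA Optical Link Frameworks

柏野, 隆太, 小林, 諒平, 藤田, 典久, 朴, 泰祐

175th High Performance Computing Research Meeting (SWoPP2020) 2020.7

Event date： 2020.7

Language：Japanese Presentation type：Oral presentation (general)

Expectations for FPGAs have been rising in high-performance computing in recent years. With the development barrier lowered by high-level synthesis and with the ability to provide strong communication performance, FPGAs may be effective even for kinds of applications that conventional systems cannot accelerate. To make full use of these features, a communication framework specialized for FPGAs is needed. Such work already exists: CIRCUS has been proposed by the University of Tsukuba and SMI by ETH Zurich. Both make 40 to 100 Gbps optical links usable from OpenCL and are expected to become important components for future use of FPGAs in HPC. In this report, we evaluate these two approaches, CIRCUS and SMI, on real hardware and compare their characteristics.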

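GPU Implementation of the Cosmic Radiative Transfer Code ARGOT with OpenACC

小林, 諒平, 藤田, 典久, 山口, 佳樹, 朴, 泰祐, 吉川, 耕司, 安部, 牧人, 梅村, 雅之

175th High Performance Computing Research Meeting (SWoPP2020) 2020.7

Event date： 2020.7

Language：Japanese Presentation type：Oral presentation (general)

We are studying GPU-FPGA hybrid systems in which a GPU (Graphics Processing Unit), which offers high compute performance and memory bandwidth, works together with an FPGA (Field Programmable Gate Array), which excels in computation and communication performance, so that the two are used complementarily. GPU-FPGA cooperative acceleration is needed because we expect it to be effective for multiphysics applications, that is, simulations involving multiple physical models or multiple simultaneously occurring physical phenomena. Because computations with various characteristics appear within a multiphysics simulation, acceleration with GPUs alone can be difficult. We therefore aim to improve overall application performance by using FPGAs to accelerate the computations that GPUs cannot handle well. However, the current implementation approach is multilingual programming, in which kernels running on the GPU are written in CUDA and kernels running on the FPGA are written in OpenCL. Such a programming model places a heavy burden on programmers, so a programming environment that enables GPU-FPGA cooperation with higher usability is needed. With this in mind, as a preliminary evaluation toward highly usable GPU-FPGA cooperation, this paper implements a program that simulates the formation of astronomical objects in the early universe using OpenACC and compares its performance with an OpenMP-based CPU implementation and a CUDA-based GPU implementation.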

OpenACC unified programming environment for GPU and FPGA multi-hybrid acceleration International conference

Tsunashima, Ryuta, Kobayashi, Ryohei, Fujita, Norihisa, Boku, Taisuke, Lee, Seyong, Vetter, Jeffrey S., Murai, Hitoshi, Nakao, Masahiro, Sato, Mitsuhisa

HLPP 2020: 13th International Symposium on High-level Parallel Programming and Applications 2020.7

Event date： 2020.7

Language：English Presentation type：Oral presentation (general)

Accelerating Radiative Transfer Simulation with GPU-FPGA Cooperative Computation International conference

Kobayashi, Ryohei, Fujita, Norihisa, Yamaguchi, Yoshiki, Boku, Taisuke, Yoshikawa, Kohji, Abe, Makito, Umemura, Masayuki

IEEE 31st International Conference on Application-specific Systems, Architectures and Processors (ASAP) 2020.7

Event date： 2020.7

Language：English Presentation type：Oral presentation (general)

Venue：United Kingdom Manchester

Field-programmable gate arrays (FPGAs) have garnered significant interest in research on high-performance computing. This is ascribed to the drastic improvement in their computational and communication capabilities in recent years owing to advances in semiconductor integration technologies that rely on Moore’s Law. In addition to these performance improvements, toolchains for the development of FPGAs in OpenCL have been offered by FPGA vendors to reduce the programming effort required. These improvements suggest the possibility of implementing the concept of enabling on-the-fly offloading computation at which CPUs/GPUs perform poorly relative to FPGAs while performing low-latency data transfers. We consider this concept to be of key importance to improve the performance of heterogeneous supercomputers that employ accelerators such as a GPU. In this study, we propose GPU–FPGA-accelerated simulation based on this concept and demonstrate the implementation of the proposed method with CUDA and OpenCL mixed programming. The experimental results showed that our proposed method can increase the performance by up to 17.4× compared with GPU-based implementation. This performance is still 1.32× higher even when solving problems with the largest size, which is the fastest problem size for GPU-based implementation. We consider the realization of GPU–FPGA-accelerated simulation to be the most significant difference between our work and previous studies.

Performance Evaluation of Pipelined Communication Combined with Computation in OpenCL Programming on FPGA International conference

Fujita, Norihisa, Kobayashi, Ryohei, Yamaguchi, Yoshiki, Ueno, Tomohiro, Sano, Kentaro, Boku, Taisuke

The Tenth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES) 2020.5

Event date： 2020.5

Language：English Presentation type：Oral presentation (general)

Venue：USA New Orleans, Louisiana

In recent years, many High Performance Computing (HPC) researchers have been attracted to utilizing Field Programmable Gate Arrays (FPGAs) for HPC applications. Thanks to their I/O capabilities, FPGAs can be used for communication as well as computation. HPC scientists have been unable to utilize FPGAs for their applications because of the difficulty of FPGA development; however, High Level Synthesis (HLS) allows them to do so at a reasonable cost. In this study, we propose the Communication Integrated Reconfigurable CompUting System (CIRCUS), which enables us to utilize the high-speed interconnection of FPGAs from OpenCL. CIRCUS builds a single fused pipeline combining computation and communication, which hides the communication latency by completely overlapping the two. In this paper, we present the details of the implementation and the evaluation results using two benchmarks: a pingpong benchmark and an allreduce benchmark.

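Performance Evaluation of the Cosmic Radiative Transfer Code ARGOT with GPU-FPGA Cooperative Acceleration

小林, 諒平, 藤田, 典久, 中道, 安祐未, 山口, 佳樹, 朴, 泰祐, 吉川, 耕司, 安部, 牧人, 梅村, 雅之

174th High Performance Computing Research Meeting 2020.5

Event date： 2020.5

Language：Japanese Presentation type：Oral presentation (general)

We are studying GPU-FPGA hybrid systems in which a GPU (Graphics Processing Unit), which offers high compute performance and memory bandwidth, works together with an FPGA (Field Programmable Gate Array), which excels in computation and communication performance, so that the two are used complementarily. GPU-FPGA cooperative acceleration is needed because we expect it to be effective for multiphysics applications, that is, simulations involving multiple physical models or multiple simultaneously occurring physical phenomena. Because computations with various characteristics appear within a multiphysics simulation, acceleration with GPUs alone can be difficult. We therefore aim to improve overall application performance by using FPGAs to accelerate the computations that GPUs cannot handle well. This paper targets ARGOT, a cosmic radiative transfer simulation code, as an example of multiphysics. ARGOT involves two kinds of radiative transfer problems: from point sources and from sources distributed in space. For the ARGOT method, we use the GPU kernels already implemented in the ARGOT program, and we optimize the ARGOT code by distributing the main computations between the GPU and the FPGA according to their suitability. For data transfer between the GPU and the FPGA, we use the OpenCL-controllable GPU-FPGA DMA transfer that we proposed previously. In our evaluation, the ARGOT code with functions distributed appropriately across the GPU and FPGA achieved up to a 10.4x performance improvement over the ARGOT code without such distribution.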

スーパーコンピュータCygnus上におけるFPGA間パイプライン通信の性能評価

藤田, 典久, 小林, 諒平, 山口, 佳樹, 上野, 知洋, 佐野, 健太郎, 朴, 泰祐

第174回ハイパフォーマンスコンピューティング研究発表会 2020.5

Event date： 2020.5

Language：Japanese Presentation type：Oral presentation (general)

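The Field Programmable Gate Array (FPGA) is one kind of reconfigurable hardware. We focus on the powerful external communication capabilities of FPGAs. FPGA development used to be costly because it required low-level descriptions, but High Level Synthesis (HLS) technology is resolving this. We propose an inter-FPGA communication framework called the Communication Integrated Reconfigurable CompUting System (CIRCUS). With CIRCUS, a pipeline that unifies communication and computation can be described in OpenCL. The Center for Computational Sciences, University of Tsukuba, operates the Cygnus supercomputer, which carries two FPGA boards per node; in this paper, we evaluate and report the communication performance of CIRCUS on Cygnus.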

Pipelined Communication Combined with Computation in OpenCL Programming on FPGA International conference

Fujita, Norihisa, Kobayashi, Ryohei, Yamaguchi, Yoshiki, Ueno, Tomohiro, Sano, Kentaro, Boku, Taisuke

The 2nd R-CCS international symposium 2020.2

Event date： 2020.2

Language：English Presentation type：Poster presentation

Venue：Japan Kobe

In recent years, many High Performance Computing (HPC) researchers have become interested in using Field Programmable Gate Arrays (FPGAs) for HPC applications. FPGAs can be used for communication as well as computation thanks to their I/O capabilities. HPC scientists have been unable to use FPGAs in their applications because FPGA development is difficult; however, High Level Synthesis (HLS) allows them to use FPGAs at a reasonable cost.

In this study, we propose the Communication Integrated Reconfigurable CompUting System (CIRCUS), which lets us use the high-speed interconnection of FPGAs from OpenCL HLS. CIRCUS builds a single fused pipeline that combines computation and communication and hides communication latency by completely overlapping the two. In this poster, we propose and evaluate the CIRCUS system for high-speed inter-FPGA communication in OpenCL. CIRCUS extends the channels used for intra-FPGA communication to inter-FPGA communication. By using channels, CIRCUS can create a fused pipeline for both computation and communication, so computation and communication can be overlapped completely at clock-cycle resolution. Because this characteristic is unique to FPGAs, we believe we can accelerate HPC applications on FPGAs by combining computation and communication.

We used the Cygnus supercomputer operated by the Center for Computational Sciences, University of Tsukuba, for the performance evaluation. Cygnus is a multi-heterogeneous system with a total of 80 nodes: 48 Deneb nodes and 32 Albireo nodes. The Deneb nodes are CPU + GPU nodes (no FPGAs), and the Albireo nodes are CPU + GPU + FPGA nodes. An Albireo node is equipped with four Intel Xeon CPUs, two NVIDIA V100 GPUs, four Mellanox InfiniBand HDR100 HCAs, and two Bittware (formerly Nallatech) 520N FPGA boards. Each Bittware 520N FPGA board carries an Intel Stratix10 FPGA, 32GB of DDR4 external memory, and four QSFP28 external ports supporting up to 100Gbps.

In total, there are 64 FPGAs (32 Albireo nodes x 2 FPGAs per node). Cygnus connects them with Mellanox 100Gbps optical cables in an 8x8 2D-torus network dedicated to the FPGAs. The InfiniBand network can still be used independently for CPU or GPU applications. We used up to 16 FPGAs in the following evaluations.

We used three benchmarks to evaluate the CIRCUS system: pingpong benchmark, allreduce benchmark, and Himeno benchmark (19-point stencil computation). According to the pingpong benchmark results, the minimum latency was 0.5μs, and the maximum throughput was 90.2Gbps, and the additional latency per hop was approximately 0.23μs. We used an allreduce-like program to measure the overlapping effect. The maximum throughput was 90.2Gbps, which was the same throughput as the pingpong benchmark result. This result showed that we can make a successful communication-computation combined pipeline. Finally, we evaluated Himeno benchmark performance. We applied CIRCUS communication to the halo and allreduce communication in the benchmark. Strong-scalability was observed in the case of the problem size L, with 94.2% parallel efficiency. We consider this result to be a validation for the implementation of CIRCUS communication to HPC applications.
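
The topology and latency figures above can be put together in a short sketch. This is only an illustration under stated assumptions, not the poster's measurement method: the row-major FPGA numbering, the function names, and the reading of the 0.5 μs minimum as the cost of a single hop are all hypothetical.

```python
TORUS_DIM = 8            # 8x8 2D torus of 64 FPGAs, as described above
MIN_LATENCY_US = 0.5     # minimum pingpong latency reported above
PER_HOP_US = 0.23        # approximate additional latency per hop reported above


def torus_hops(src, dst, dim=TORUS_DIM):
    """Shortest hop count between two FPGAs on a dim x dim torus.

    FPGA ids are assumed to be row-major (id = row * dim + col); this
    numbering is hypothetical and chosen only for the example.
    """
    src_row, src_col = divmod(src, dim)
    dst_row, dst_col = divmod(dst, dim)
    d_row = abs(src_row - dst_row)
    d_col = abs(src_col - dst_col)
    # Each dimension wraps around, so take the shorter direction.
    return min(d_row, dim - d_row) + min(d_col, dim - d_col)


def estimated_latency_us(src, dst):
    """Linear estimate: minimum latency for the first hop, fixed cost per extra hop."""
    hops = torus_hops(src, dst)
    if hops == 0:
        return 0.0  # same FPGA: no network traversal in this model
    return MIN_LATENCY_US + PER_HOP_US * (hops - 1)


if __name__ == "__main__":
    for dst in (1, 7, 36, 63):
        print(dst, torus_hops(0, dst), round(estimated_latency_us(0, dst), 2))
```

Under these assumptions the farthest pair of FPGAs on the 8x8 torus is 8 hops apart (4 in each dimension), so the wraparound links halve the worst-case distance compared with a plain 8x8 mesh.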

Accelerating Radiative Transfer Simulation with GPU-FPGA cooperative computation International conference

Kobayashi, Ryohei, Fujita, Norihisa, Nakamichi, Ayumi, Yamaguchi, Yoshiki, Boku, Taisuke, Yoshikawa, Kohji, Abe, Makito, Umemura, Masayuki

The 2nd R-CCS international symposium 2020.2

Event date： 2020.2

Language：English Presentation type：Poster presentation

Venue：Japan Kobe

Graphics processing units (GPUs) offer good peak performance and high memory bandwidth. They have been widely used in high-performance computing (HPC) systems as accelerators. However, enabling the execution of parallel applications on such heterogeneous clusters requires inter-accelerator communication between nodes. This means that maintaining multiple copies of memory is required; this results in increased latency and severely degraded application performance, particularly when short messages are involved. Moreover, while the GPU has the above beneficial characteristics, it is not effective as an accelerator in applications that employ complicated algorithms using exceptions, non-single instruction multiple data streams (SIMD), and partially poor parallelism.

To address the above problems, field-programmable gate arrays (FPGAs) have garnered significant interest in high-performance computing research because their computation and communication capabilities have drastically improved in recent years due to advances in semiconductor integration technologies that rely on Moore’s Law. In addition to improved FPGA performance, FPGA vendors have developed and offered OpenCL-based toolchains for FPGA development that reduce the programming effort required. These improvements open the possibility of on-the-fly offloading to FPGAs of computations at which CPUs/GPUs perform poorly, together with low-latency data movement. We think that this concept is key to improving the performance of heterogeneous supercomputers using accelerators such as the GPU.

One reason such GPU–FPGA coupling is needed is to accelerate multiphysics applications. Multiphysics is defined as coupled processes or systems involving more than one simultaneously occurring physical field, and the study of and knowledge about these processes and systems. Multiphysics applications therefore simulate multiple interacting physical properties, and a single simulation contains various kinds of computation. As a result, accelerating a simulation with GPUs alone is quite difficult, which is why we combine the GPU and the FPGA and let the FPGA cover the computations that do not suit the GPU.

In this paper, we focus on radiative transfer simulation code that is based on two types of radiation transfer: the radiation transfer from spot light and the radiation transfer from spatially distributed light. We make GPUs and FPGAs work together, and perform the former radiation transfer on the GPU and the latter radiation transfer on the FPGA. As a result, we realized GPU–FPGA-accelerated simulation and its performance was up to 10.4x better than GPU-based implementation.

OpenCL-enabled GPU-FPGA Accelerated Computing with Inter-FPGA Communication International conference

Kobayashi, Ryohei, Fujita, Norihisa, Yamaguchi, Yoshiki, Nakamichi, Ayumi, Boku, Taisuke

IXPUG Workshop at HPC Asia 2020 2020.1

Event date： 2020.1

Language：English Presentation type：Oral presentation (general)

Venue：Fukuoka, JAPAN

Field-programmable gate arrays (FPGAs) have garnered significant interest in high-performance computing research; their computational and communication capabilities have drastically improved in recent years owing to advances in semiconductor integration technologies. In addition to improving FPGA performance, toolchains for the development of FPGAs in OpenCL that reduce the amount of programming effort required have been developed and offered by FPGA vendors. These improvements reveal the possibility of implementing a concept that enables on-the-fly offloading of computational loads at which CPUs/GPUs perform poorly compared to FPGAs while moving data with low latency. We think that this concept is key to improving the performance of heterogeneous supercomputers that use accelerators such as the GPU. In this paper, we propose an approach for GPU-FPGA accelerated computing with the OpenCL programming framework that is based on the OpenCL-enabled GPU-FPGA DMA method and the FPGA-to-FPGA communication method. The experimental results demonstrate that our proposed method can enable GPUs and FPGAs to work together over different nodes.

Enabling OpenACC Programming on Multi-hybrid Accelerated with GPU and FPGA International conference

Ryuta Tsunashima, Ryohei Kobayashi, Norihisa Fujita, Ayumi Nakamichi, Taisuke Boku, Seyong Lee, Jeffrey Vetter, Hitoshi Murai, Mitsuhisa Sato

HPC Asia 2020 – International Conference on High Performance Computing in Asia‐Pacific Region

Event date： 2020.1

Although the GPU is the main player for accelerated computation in HPC, some categories of applications are not suitable for it. For example, partially poor parallelism, irregular computation (warp divergence), or frequent inter-node communication strongly degrades performance in parallel GPU computing. Meanwhile, FPGAs have been emerging in HPC. An FPGA lets us program the logic device in a true co-design manner. In April 2019, CCS at the University of Tsukuba introduced a new GPU+FPGA hybrid accelerated cluster named Cygnus[1]. However, users currently have to write programs in two languages, CUDA for the GPU and OpenCL for the FPGA, to use both devices effectively, which demands heavy effort. It would be much better to provide a uniform framework for programming both devices in a single code. We are therefore implementing a meta-compiler that applies OpenACC[2] to both devices, built on backend compilers for the GPU and the FPGA.
We assume two backend compilers: the PGI OpenACC compiler for the GPU and the OpenARC[3] compiler for the FPGA. As shown in Figure 1, the meta-compiler splits the OpenACC-directed parts of the original code into two parts, one for the GPU and one for the FPGA. These parts are then compiled by the corresponding backend compilers. Finally, the two object files are linked into a single executable by the PGI compiler. We use the Omni compiler[4], developed by RIKEN R-CCS and CCS of the University of Tsukuba, to implement the meta-compiler. OpenARC, developed at ORNL, is a compiler that enables OpenACC for FPGA programming. It translates OpenACC code in C to OpenCL with C++, then compiles the OpenCL code with the backend compiler, Intel FPGA SDK for OpenCL.
Since the meta-compiler is under development, we hand-compiled a single OpenACC code in the manner we assume for the meta-compiler, then compiled the results with the PGI compiler and OpenARC. To evaluate our method, we compared its performance and source code size (lines and characters) with the currently available programming method using CUDA (for the GPU) and OpenCL (for the FPGA). We examined a synthetic code (not a real application) in which the GPU performs a matrix-matrix multiplication, the result is transferred to the FPGA, and the FPGA then runs a CG method on the resulting matrix. Figure 2 compares our OpenACC-only approach with CUDA+OpenCL in terms of code size, (a) and (b), and execution time, (c). Here, "Others" in (a) includes miscellaneous parts such as initialization and validation functions. Our approach reduces the number of characters and lines in the source code to approximately 50% and 30%, respectively. However, performance on both devices is degraded (GPU: 3.4x worse, FPGA: 1.67x worse). We need more performance tuning of both the code and the compilers.
As future work, we will complete the meta-compiler, improve performance, especially for FPGA programming with OpenACC, and apply our method to real applications.

GPU-FPGA協調プログラミングを実現するコンパイラの開発

綱島, 隆太, 小林, 諒平, 藤田, 典久, 中道, 安祐未, 朴, 泰祐, Lee, Seyong, Vetter, Jeffrey, 村井, 均, 佐藤, 三久

第172回ハイパフォーマンスコンピューティング研究発表会 2019.12

Event date： 2019.12

Language：Japanese Presentation type：Oral presentation (general)

Venue：Japan Okinawa Industry Support Center, Large Conference Room

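In recent years, many of the top-level machines in high-performance computing (HPC) have been large-scale clusters equipped with accelerators. Graphics Processing Units (GPUs), which offer high computational performance and memory bandwidth, are the most commonly used accelerators, but there remain computations that GPUs handle poorly, such as processing with frequent conditional branches or processing with too little parallelism to exploit the many compute cores, and these hinder performance improvement. To address this problem, we are pursuing an approach that implements circuits for the processing GPUs handle poorly on a Field Programmable Gate Array (FPGA), an integrated circuit that can be programmed with arbitrary logic circuits, and offloads that processing to the FPGA as appropriate to improve overall application performance. However, compute kernels for GPUs and FPGAs must be developed in different programming languages, CUDA and OpenCL respectively, and such multilingual programming places a heavy burden on users. In this study, we therefore examine a programming environment based on OpenACC that enables unified control of both accelerators on a computer system equipped with a GPU and an FPGA. In this report, we developed a source-to-source compiler that splits a single program written with OpenACC into files for the GPU compiler and the FPGA compiler, and verified whether cooperation between the two accelerators can be realized by the single executable finally produced by linking them. As a result, we confirmed that the developed compiler generates, from a single program written with a unified application programming interface (API), a single executable in which the CPU, GPU, and FPGA compute cooperatively, and that cooperation between the two accelerators can be realized.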

OpenCL対応GPU・FPGAデバイス間連携機構による宇宙輻射輸送コードの演算加速

小林, 諒平, 藤田, 典久, 中道, 安祐未, 山口, 佳樹, 朴, 泰祐, 吉川, 耕司, 安部, 牧人, 梅村, 雅之

第172回ハイパフォーマンスコンピューティング研究発表会 2019.12

Event date： 2019.12

Language：Japanese Presentation type：Oral presentation (general)

Venue：Japan Okinawa Industry Support Center, Large Conference Room

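We are studying GPU-FPGA hybrid systems in which a GPU (Graphics Processing Unit), which offers high computational performance and memory bandwidth, works together with an FPGA (Field Programmable Gate Array), which excels at computation and communication, so that the two complement each other. GPU-FPGA combined acceleration is needed because we expect it to be effective for multiphysics applications, that is, simulations that involve multiple physical models or multiple simultaneously occurring physical phenomena. In multiphysics, computations with various characteristics appear within a single simulation, so acceleration with GPUs alone can be difficult. We therefore aim to improve overall application performance by using FPGAs to accelerate the computations that GPUs cannot handle well. This paper targets ARGOT, a cosmic radiative transfer simulation code, as an example of multiphysics. ARGOT includes two kinds of radiative transfer problems: one for point light sources and one for spatially distributed light sources. For the ARGOT method, we use the GPU kernels already implemented in the ARGOT program, and we optimize the ARGOT code by distributing its main computations between the GPU and the FPGA according to what each device does best. For data transfer between the GPU and the FPGA, we use our previously proposed GPU-FPGA DMA transfer, which can be controlled from OpenCL. In our evaluation, the ARGOT code with this distribution of work across the GPU and the FPGA achieved up to 3 times higher performance than the ARGOT code without it.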

再構成可能なハードウェアを用いた演算と通信を融合する手法の提案と性能評価

藤田, 典久, 小林, 諒平, 山口, 佳樹, 朴, 泰祐

第171回ハイパフォーマンスコンピューティング研究発表会 2019.9

Event date： 2019.9

Language：Japanese Presentation type：Oral presentation (general)

Venue：Japan National Institute of Informatics

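In recent years, the Field Programmable Gate Array (FPGA), a type of reconfigurable hardware, has been attracting attention in high-performance computing as a next-generation accelerator. The barrier to using FPGAs in high-performance computing has been the difficulty of development, but this problem is being resolved as high-level synthesis techniques advance. The latest FPGAs offer communication performance of up to 100 Gbps × 4, and we focus on this powerful communication capability. Although the absolute performance of an FPGA is lower than that of other accelerators, we believe that FPGAs can be applied to a wider range of problems by combining their computation and communication capabilities. The purpose of this study is to realize a parallel processing system by operating the communication mechanism from FPGA applications described with high-level synthesis. We evaluate not only communication throughput and latency but also whether a pipeline unifying communication and computation is built inside the FPGA, and show that parallel computation is possible with FPGA applications described with high-level synthesis. We are developing a system called CoE as an environment for direct communication between FPGAs; it achieved a maximum bandwidth of 90.7 Gbps and a minimum latency of 429.2 ns. The pipeline evaluation also gave good results, confirming that a pipeline unifying communication and computation was built.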

Cygnus: GPU + FPGA accelerated supercomputing platform Invited International conference

Kobayashi, Ryohei, Fujita, Norihisa, Yamaguchi, Yoshiki, Boku, Taisuke, Yoshikawa, Kohji, Umemura, Masayuki

1st International Workshop on Reconfigurable High Performance Computing (ReHPC'2019) 2019.9

Event date： 2019.9

Language：English Presentation type：Oral presentation (general)

Venue：Spain Barcelona

Graphics processing units (GPUs) have been widely used in high-performance computing (HPC) systems as accelerators because they offer good peak performance and high memory bandwidth. However, the GPU is not a universal accelerator: it is not effective in applications that employ complicated algorithms with exceptions, non-single-instruction-multiple-data (SIMD) execution, partially poor parallelism, and so on. To address these problems, field-programmable gate arrays (FPGAs) have gained attention in HPC research because their computation and communication capabilities have dramatically improved in recent years as a result of improvements to semiconductor integration technologies that depend on Moore's Law. This talk shows how to use FPGAs for HPC in a way that enables on-the-fly offloading to FPGAs of computations at which CPUs/GPUs perform poorly, together with low-latency intra/inter-node communication; how to build a programming framework that comprehensively controls these functionalities from the CPU; and how effective our proposed approach is when applied to computational science applications.

GPU-FPGA Heterogeneous Computing with Unified Programming Framework Invited International conference

Tsunashima, Ryuuta, Kobayashi, Ryohei, Fujita, Norihisa, Nakamichi, Ayumi, Boku, Taisuke

OpenACC Annual Meeting 2019 2019.9

Event date： 2019.9

Language：English Presentation type：Oral presentation (general)

Venue：RIKEN Center for Computational Science (R-CCS)

This talk shows how to use FPGA for HPC which enables on-the-fly offloading computation where CPUs/GPUs perform poorly to FPGAs while performing low-latency intra/inter-node communication and demonstrates the effectiveness of our proposed approach by applying it to computational science applications. OpenACC is a promising interface to realize these objectives and our research group is going to introduce it as a unified programming method for two devices.

GPU・FPGA複合演算加速による輻射流体シミュレーションコードARGOTの実装

中道, 安祐未, 藤田, 典久, 小林, 諒平, 朴, 泰祐, 吉川, 耕司, 梅村, 雅之

第170回ハイパフォーマンスコンピューティング研究発表会（SWoPP2019） 2019.7 情報処理学会

Event date： 2019.7

Language：Japanese Presentation type：Oral presentation (general)

Venue：Japan Kitami Civic Hall

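In recent years, large-scale clusters equipped with accelerators have become one of the mainstream platforms in high-performance computing (HPC). Graphics Processing Units (GPUs) are the main accelerators in use, but Field Programmable Gate Arrays (FPGAs) are attracting growing attention in HPC for their processing flexibility and high power efficiency. We therefore aim to further improve the performance of real applications with a combined GPU+FPGA system in which the FPGA performs computations the GPU handles poorly. In our previous presentation, we discussed a development method and environment for programs that realize hybrid GPU+FPGA acceleration on a computer equipped with both a GPU and an FPGA. Having established a method for making the two devices cooperate, in this study we use it to implement the radiation hydrodynamics simulation code ARGOT. The code has previously been accelerated with CPUs and GPUs; based on the characteristics of the algorithms, in this study we incorporated source code implemented in OpenCL for the algorithm that can be accelerated further on an FPGA. Although the implementation is not yet complete, we discuss it.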

OpenCL対応FPGA間通信機能によるGPU・FPGA複合型演算加速

小林, 諒平, 藤田, 典久, 山口, 佳樹, 中道, 安祐未, 朴, 泰祐

第170回ハイパフォーマンスコンピューティング研究発表会（SWoPP2019） 2019.7 情報処理学会

Event date： 2019.7

Language：Japanese Presentation type：Oral presentation (general)

Venue：Japan Kitami Civic Hall

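We are studying GPU-FPGA hybrid systems in which a GPU (Graphics Processing Unit), which offers high computational performance and memory bandwidth, works together with an FPGA (Field Programmable Gate Array), which excels at computation and communication, so that the two complement each other. On a system with different kinds of hardware such as GPUs and FPGAs, an important issue is how to program the computation executed on each device and how to make all devices work together. This paper therefore proposes a method for GPU-FPGA cooperation across multiple nodes that combines inter-FPGA communication and GPU-FPGA DMA transfer, both controllable from OpenCL code. GPU-FPGA DMA transfer is realized by mapping the global memory of the GPU device into the PCIe address space, creating a descriptor within an OpenCL kernel based on the address mapping result, and finally writing it to the PCIe DMA controller in the FPGA. Inter-FPGA communication is realized by a system in which hardware that performs Ethernet communication, implemented in Verilog HDL, is connected through an I/O Channel to its control module (an OpenCL kernel). Using the proposed method, we implemented a pingpong benchmark between GPUs on different nodes and confirmed that it operates correctly.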

Parallel Processing on FPGA Combining Computation and Communication in OpenCL Programming International conference

Fujita, Norihisa, Kobayashi, Ryohei, Yamaguchi, Yoshiki, Boku, Taisuke

The Ninth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES) 2019.5

Event date： 2019.5

Language：English Presentation type：Oral presentation (general)

Venue：Brazil Rio de Janeiro

In recent years, the Field Programmable Gate Array (FPGA) has been a topic of interest in High Performance Computing (HPC) research. Although the biggest problem in utilizing FPGAs for HPC applications is the difficulty of FPGA development, this problem is being solved by High Level Synthesis (HLS). We focus on the very high-performance inter-FPGA communication capabilities. The absolute floating-point performance of an FPGA is lower than that of other common accelerators such as GPUs. However, we consider that we can apply FPGAs to a wide variety of HPC applications if we can combine computations and communications on an FPGA. The purpose of this paper is to implement a parallel processing system running applications implemented by HLS that combine computations and communications in FPGAs. We propose the Channel over Ethernet (CoE) system, which connects multiple FPGAs directly for OpenCL parallel programming. "Channel" is one of the new extensions provided by the Intel OpenCL environment. Channels are ordinarily used for intra-kernel communication inside an FPGA, but we extend them to external communication through the CoE system. In this paper, we introduce two benchmarks as a demonstration of the CoE system. We achieved 29.77 Gbps in throughput (approximately 75% of the theoretical peak of 40Gbps) and 950 ns in latency on our system using the pingpong benchmark, which was implemented on an Intel Arria10 FPGA. In addition, we evaluated the Himeno benchmark, a kind of 3D Computational Fluid Dynamics (CFD) code, on the system, and we achieved 23689MFLOPS with 4 FPGAs on a problem of size M. We also observed strong scalability, with a 3.93 times speedup compared to a single FPGA run on the same problem size.
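
The two ratios quoted above can be checked with a few lines of arithmetic. The sketch below is illustrative only; the function names are hypothetical, and "parallel efficiency" is taken in its usual sense of speedup divided by device count, which the abstract does not spell out.

```python
def link_efficiency(measured_gbps, peak_gbps):
    """Fraction of the theoretical link peak that was actually achieved."""
    return measured_gbps / peak_gbps


def parallel_efficiency(speedup, n_devices):
    """Strong-scaling efficiency: speedup relative to the ideal n_devices-fold speedup."""
    return speedup / n_devices


if __name__ == "__main__":
    # Figures taken from the abstract above.
    print(f"throughput vs. 40 Gbps peak: {link_efficiency(29.77, 40.0):.3f}")
    print(f"efficiency on 4 FPGAs:       {parallel_efficiency(3.93, 4):.4f}")
```

The first ratio is 29.77 / 40 = 0.74425, consistent with the "approximately 75%" in the abstract, and the second is 3.93 / 4 = 0.9825, which is why the scaling result is described as strong.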

GPU-FPGA Heterogeneous Computing with OpenCL-Enabled Direct Memory Access International conference

Kobayashi, Ryohei, Fujita, Norihisa, Yamaguchi, Yoshiki, Nakamichi, Ayumi, Boku, Taisuke

The Ninth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES) 2019.5

Event date： 2019.5

Language：English Presentation type：Oral presentation (general)

Venue：Brazil Rio de Janeiro

Field-programmable gate arrays (FPGAs) have garnered significant interest in research on high-performance computing because their computation and communication capabilities have drastically improved in recent years due to advances in semiconductor integration technologies that rely on Moore's Law. In addition to improving FPGA performance, toolchains for the development of FPGAs in OpenCL have been developed and offered by FPGA vendors that reduce the programming effort required. These improvements reveal the possibility of implementing a concept to enable on-the-fly offloading computation at which CPUs/GPUs perform poorly to FPGAs while performing low-latency data movement. We think that this concept is key to improving the performance of heterogeneous supercomputers using accelerators such as the GPU. In this paper, we propose an OpenCL-enabled data movement method to directly access the global memory of the GPU and show how to implement cooperative GPU-FPGA computation using it. The results of experiments show that our proposed method can achieve a latency of 0.59 μs and a data transfer rate as high as 7.0 GB/s between the GPU and the FPGA, thus confirming that it is effective at realizing high-performance cooperative GPU-FPGA computation.
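
To see what the reported latency and transfer rate imply for different message sizes, one can plug them into a simple latency-bandwidth model. This is a sketch under assumptions, not the paper's evaluation: the model (time = latency + size / bandwidth) is a standard approximation swapped in here, the 7.0 GB/s figure is treated as the asymptotic rate, and all names are hypothetical.

```python
LATENCY_S = 0.59e-6       # reported GPU-FPGA latency (0.59 microseconds)
BANDWIDTH_BPS = 7.0e9     # reported peak transfer rate (7.0 GB/s), assumed asymptotic


def transfer_time_s(nbytes):
    """Estimated time to move nbytes under the latency-bandwidth model."""
    return LATENCY_S + nbytes / BANDWIDTH_BPS


def effective_bandwidth_bps(nbytes):
    """Achieved bytes per second for a message of nbytes, including latency."""
    return nbytes / transfer_time_s(nbytes)


def half_bandwidth_size_bytes():
    """Message size at which the model reaches half of the peak rate."""
    return LATENCY_S * BANDWIDTH_BPS


if __name__ == "__main__":
    for size in (64, 4096, 1 << 20):
        print(size, f"{effective_bandwidth_bps(size) / 1e9:.2f} GB/s")
```

In this model, messages of roughly 4 KB (0.59 μs × 7.0 GB/s = 4130 bytes) reach half of the peak rate, which illustrates why low latency matters for the short transfers typical of fine-grained GPU-FPGA cooperation. The measured curve in the paper may differ from this idealized one.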

GPU-FPGA協調計算を記述するためのプログラミング環境に関する研究

綱島, 隆太, 小林, 諒平, 藤田, 典久, 中道, 安祐未, 朴, 泰祐

第169回ハイパフォーマンスコンピューティング研究会 2019.5

Event date： 2019.5

Language：Japanese Presentation type：Oral presentation (general)

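Venue：Japan Japan Agency for Marine-Earth Science and Technology (JAMSTEC) Yokohama Institute for Earth Sciences, Miyoshi Memorial Auditorium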

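In recent years, many of the top-level machines in high-performance computing (HPC) have been large-scale clusters equipped with accelerators. Graphics Processing Units (GPUs), which offer high computational performance and memory bandwidth, are the most commonly used accelerators, but there remain computations that GPUs handle poorly, such as processing with frequent conditional branches or processing with too little parallelism to use the many compute cores, and these hinder performance improvement. To address this problem, we are pursuing an approach that implements circuits for the processing GPUs handle poorly on a Field Programmable Gate Array (FPGA), an integrated circuit that can be programmed with arbitrary logic circuits, and offloads that processing to the FPGA as appropriate to improve overall application performance. However, compute kernels for GPUs and FPGAs must be developed in different programming languages, CUDA and OpenCL respectively, and such multilingual programming places a heavy burden on users. In this study, we therefore examine a programming environment based on OpenACC that enables unified control of both accelerators on a computer system equipped with a GPU and an FPGA. In this report, we verified whether the two accelerators can cooperate by linking, at compile time, separate GPU and FPGA files written in OpenACC. As a result, we confirmed that GPU-FPGA cooperative computation can be realized with OpenACC descriptions alone.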

高位設計と低位設計の違いとFPGA演算性能の関係について

横野, 智也, 山口, 佳樹, 藤田, 典久, 小林, 諒平, 朴, 泰祐, 吉川, 耕司, 安部, 牧人, 梅村, 雅之

情報処理学会第81回全国大会 2019.3 情報処理学会

Event date： 2019.3

Language：Japanese Presentation type：Oral presentation (general)

Venue：Japan Fukuoka

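Now that the circuit scale of a single FPGA chip exceeds one million system gates, it is becoming difficult to understand all of its behavior and achieve full optimization through RTL (Register Transfer Level) design. HLS (High Level Synthesis) design using high-level description languages is therefore attracting attention. HLS design and development environments, such as Intel's Intel SDK for OpenCL and Xilinx's Vivado HLS and SDAccel, are becoming mature. In environments such as data centers, where many users work and multiple FPGAs operate in parallel, continuing to treat RTL design as the only option is not realistic from a usability standpoint. On the other hand, from the viewpoint of high-performance computing, relying on HLS design alone is considered premature at present. In this paper, we therefore evaluate and discuss the current state of HDL design and HLS design from an impartial standpoint, and examine next-generation heterogeneous high-performance computing and the potential place of FPGAs in it.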

GPU・FPGA混載ノードにおけるヘテロ演算加速プログラム環境に関する研究

中道, 安祐未, 小林, 諒平, 藤田, 典久, 朴, 泰祐

第168回ハイパフォーマンスコンピューティング研究発表会 2019.3

Event date： 2019.3

Language：Japanese Presentation type：Oral presentation (general)

Venue：Japan Yamashiro Onsen Rurikoh, Conference Room, 19-58-1 Yamashiro Onsen, Kaga, Ishikawa 922-0295

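In recent years, large-scale clusters equipped with accelerators have become one of the mainstream platforms in high-performance computing (HPC). Graphics Processing Units (GPUs) are the main accelerators in use, but Field Programmable Gate Arrays (FPGAs) are attracting growing attention in HPC for their processing flexibility and high power efficiency. We therefore aim to further improve the performance of real applications with a combined GPU + FPGA system in which the FPGA performs computations the GPU handles poorly. In this study, we discuss a development method and environment for programs that realize hybrid GPU + FPGA acceleration on a computer equipped with both a GPU and an FPGA.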

異デバイス間でのPCIe通信を実現するOpenCL対応FPGAモジュールの提案と検証

小林, 諒平, 藤田, 典久, 山口, 佳樹, 朴, 泰祐

2019-01-SLDM-RECONF-VLD-CPSY-ARC 2019.1

Event date： 2019.1

Language：Japanese Presentation type：Oral presentation (general)

Venue：Japan Keio University Hiyoshi Campus, Raiosha

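We are studying GPU-FPGA hybrid systems in which a GPU (Graphics Processing Unit), which offers high computational performance and memory bandwidth, works together with an FPGA (Field Programmable Gate Array), which excels at computation and communication, so that the two complement each other. On a system with different kinds of hardware such as GPUs and FPGAs, an important issue is how to program the computation executed on each device and how to make all devices work together. This paper therefore proposes inter-device data transfer that can be controlled from OpenCL code. A descriptor created from the PCIe address mapping of the GPU device memory is sent to the FPGA and written to the PCIe DMA controller inside the FPGA, which realizes data transfer between the global memory of the GPU device and the external memory of the FPGA device without going through the CPU. Evaluating the proposed method in terms of communication latency and bandwidth, we confirmed a performance difference of up to 33.3 times in latency and up to 2.0 times in bandwidth compared with the conventional method.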

OpenCL-enabled high performance direct memory access for GPU-FPGA cooperative computation International conference

Kobayashi, Ryohei, Fujita, Norihisa, Yamaguchi, Yoshiki, Boku, Taisuke

IXPUG Workshop at HPC Asia 2019 2019.1

Event date： 2019.1

Language：English Presentation type：Oral presentation (general)

Venue：China Guangzhou

Field programmable gate arrays (FPGAs) have gained attention in high-performance computing (HPC) research because their computation and communication capabilities have dramatically improved in recent years as a result of improvements to semiconductor integration technologies that depend on Moore's Law. In addition to FPGA performance improvements, FPGA vendors have developed and offered OpenCL-based FPGA development toolchains, which reduce the programming effort required compared with the past. These improvements open the possibility of on-the-fly offloading to FPGAs of computations at which CPUs/GPUs perform poorly, together with low-latency data movement. We think that this concept is one of the keys to further improving the performance of modern heterogeneous supercomputers using accelerators like GPUs. In this paper, we propose high-performance GPU-FPGA data communication using mixed OpenCL and Verilog HDL programming so that both devices work together smoothly. OpenCL is used to program application algorithms and data movement control, while Verilog HDL is used to implement low-level components for memory copies between the two devices. Experimental results using toy programs showed that our proposed method achieves a latency of 0.6 μs and as much as 6.9 GB/s between the GPU and the FPGA, thus confirming that the proposed method is effective at realizing high-performance GPU-FPGA cooperative computation.

OpenCLによるFPGA上の演算と通信を融合した並列処理システムの実装及び性能評価

藤田, 典久, 小林, 諒平, 山口, 佳樹, 朴, 泰祐

第167回ハイパフォーマンスコンピューティング研究発表会 2018.12

Event date： 2018.12

Language：Japanese Presentation type：Oral presentation (general)

Venue：Japan Okinawa Industry Support Center, Large Conference Room

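In recent years, the Field Programmable Gate Array (FPGA), a type of reconfigurable hardware, has been attracting attention in high-performance computing as a next-generation accelerator. The barrier to using FPGAs in high-performance computing has been the difficulty of development, but this problem is being resolved as high-level synthesis techniques advance. The latest FPGAs offer communication performance of up to 100 Gbps × 4, and we focus on this powerful communication capability. Although the absolute performance of an FPGA is lower than that of other accelerators, we believe that FPGAs can be applied to a wider range of problems by combining their computation and communication capabilities. The purpose of this study is to realize a parallel processing system by operating the communication mechanism from FPGA applications described with high-level synthesis. We evaluate not only communication throughput and latency but also performance on the Himeno benchmark, and show that parallel computation is possible with FPGA applications described with high-level synthesis. We are developing a system called Channel over Ethernet (CoE) as an environment for direct communication between FPGAs; it achieved a maximum bandwidth of 7.13 Gbps and a latency of 980 ns for 4-byte communication. On the Himeno benchmark, running problem size M on 4 FPGAs yielded 22659 MFLOPS, and 4 FPGAs gave a good strong-scaling result of 3.61 times the performance of 1 FPGA.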

OpenCLとVerilog HDLの混合記述によるGPU-FPGAデバイス間連携

小林, 諒平, 藤田, 典久, 山口, 佳樹, 朴, 泰祐

第167回ハイパフォーマンスコンピューティング研究発表会 2018.12

Event date： 2018.12

Language：Japanese Presentation type：Oral presentation (general)

Venue：Japan Okinawa Industry Support Center, Large Conference Room

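We are studying GPU-FPGA hybrid systems in which a GPU (Graphics Processing Unit), which offers high computational performance and memory bandwidth, works together with an FPGA (Field Programmable Gate Array), which excels at computation and communication, so that the two complement each other. On a system with different kinds of hardware such as GPUs and FPGAs, an important issue is how to program the computation executed on each device and how to make all devices work together. This paper therefore proposes inter-device data transfer for efficiently linking GPU programming and FPGA programming. A descriptor created from the PCIe address mapping of the GPU device memory is sent to the FPGA and written to the PCIe DMA controller inside the FPGA, which realizes data transfer between the global memory of the GPU device and the external memory of the FPGA device without going through the CPU. Evaluating the proposed method in terms of communication latency and bandwidth, we confirmed a performance difference of up to 83 times in latency and up to 2.4 times in bandwidth compared with the conventional method.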

FPGAによる宇宙輻射輸送シミュレーションの演算加速

横野, 智也, 藤田, 典久, 山口, 佳樹, 大畠, 佑真, 小林, 諒平, 朴, 泰祐, 吉川, 耕司, 安部, 牧人, 梅村, 雅之

リコンフィギャラブルシステム研究会（RECONF） 2018.9

Event date： 2018.9

Language：Japanese Presentation type：Oral presentation (general)

Venue：Japan LINE Fukuoka Cafe Space

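We have proposed an architecture called TCA (Tightly Coupled Accelerators), which tightly couples accelerators for low-latency communication, and have developed PEACH2 (PCI Express Adaptive Communication Hub Ver.2) as a TCA implementation using an FPGA (Field Programmable Gate Array). Building on this work, we are now studying a concept called AiS (Accelerators in Switch), an architecture that takes the TCA concept further. AiS is a next-generation parallel acceleration mechanism that embeds application-specific compute units in the communication mechanism, realizing tighter cooperation between computation and communication inside the FPGA. In this paper, as an evaluation toward realizing AiS, we implement the ART (Authentic Radiation Transfer) method used in ARGOT (Accelerated Radiative transfer on Grids using Oct-Tree), a cosmic radiative transfer simulation, on different FPGAs (Xilinx/Intel) and evaluate it. We chose this simulation because it contains, in roughly equal measure, parts that are accelerated by accelerators such as GPUs and parts that are not, and therefore calls for cooperative computation with an architecture different from the GPU. When the ART method was implemented on the FPGAs, both devices achieved a speedup over the CPU.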

並列FPGAシステムにおけるOpenCLを用いた宇宙輻射輸送コードの演算加速

藤田, 典久, 小林, 諒平, 山口, 佳樹, 朴, 泰祐, 吉川, 耕司, 安部, 牧人, 梅村, 雅之

Summer United Workshops on Parallel, Distributed and Cooperative Processing 2018 2018.7

Event date： 2018.7 - 2018.8

Language：Japanese Presentation type：Oral presentation (general)

Venue：Japan Kumamoto City International Center

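One of the challenges in High Performance Computing (HPC) that has drawn attention in recent years is how to use Field Programmable Gate Array (FPGA) technology to achieve high performance and low power consumption in next-generation supercomputer systems. It has been difficult for software developers to develop FPGA circuits with a Hardware Description Language (HDL), but with recent advances in FPGA development environments, high-level synthesis is becoming common and FPGA development without writing HDL is becoming possible. In this study, we describe in OpenCL and optimize for FPGAs the Authentic Radiation Transfer (ART) method, an algorithm used in Accelerated Radiative transfer on Grids using Oct-Tree (ARGOT), a program that solves the radiative transfer important for research on the early universe; as future work, we also describe how to parallelize the ART computation across multiple FPGAs. In our previous work, only problems small enough to fit in the Block RAM (BRAM) inside the FPGA could be solved, which did not cover the problem sizes ARGOT actually needs, but by also using large-capacity DDR memory, practical problem sizes can now be solved on the FPGA. We compared performance among the CPU, GPU, and FPGA: the FPGA achieved a 6.9-fold speedup over the CPU and performance comparable to the GPU. Although the FPGA implementation performs about the same as the GPU, the FPGA can operate its communication mechanism by itself and so has lower communication overhead than the GPU; we therefore expect it to exceed GPU performance in parallel computation, and we plan to implement parallel FPGA computation.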

GPU-FPGA複合システムにおけるデバイス間連携機構

小林, 諒平, 阿部, 昂之, 藤田, 典久, 山口, 佳樹, 朴, 泰祐

Summer United Workshops on Parallel, Distributed and Cooperative Processing 2018 2018.7

Event date： 2018.7 - 2018.8

Language：Japanese Presentation type：Oral presentation (general)

Venue：Japan Kumamoto City International Center

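We are studying GPU-FPGA hybrid systems that couple a GPU (Graphics Processing Unit), which offers high computational performance and memory bandwidth, with an FPGA (Field Programmable Gate Array), which excels in computation and communication performance, and use the two complementarily. On a system equipped with different kinds of hardware such as GPUs and FPGAs, how to program the computation executed on each device and make all devices operate cooperatively is an important issue. In this paper, we therefore propose an inter-device data transfer method for efficiently coupling GPU programming and FPGA programming. A descriptor created from the result of mapping GPU device memory into the PCIe address space is sent to the FPGA and written to the PCIe DMA controller inside the FPGA, which realizes data transfer between the global memory of the GPU device and the on-chip memory of the FPGA device without going through the CPU. We evaluated the proposed method in terms of communication latency and communication bandwidth and, compared with the conventional method, observed a performance difference of up to 8.4 times in latency and up to 3.7 times in bandwidth.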

Accelerating HPC applications on FPGAs using OpenCL and FPGA Network International conference

Norihisa Fujita, Ryohei Kobayashi, Yoshiki Yamaguchi, Makito Abe, Kohji Yoshikawa, Masayuki Umemura

ISC 2018: International Supercomputing Conference

Event date： 2018.6

This poster presents two topics. The first is OpenCL-ready high-speed 40 Gbit Ethernet networking between FPGAs. The second is the optimization of a space radiative transfer code using OpenCL. We add networking functionality to the board support package (BSP) so that it can be used from OpenCL. The BSP is a hardware component that abstracts differences between boards. Since the BSP provided by the board vendor does not support all peripherals on the board, we have to add Ethernet controllers to the BSP. The resulting network achieves 1 μs latency and 4.97 GB/s bandwidth (99.4% of the theoretical peak).
Accelerated Radiative transfer on Grids using Oct-Tree (ARGOT) is a program that solves space radiative transfer problems and has been developed at the Center for Computational Sciences (CCS), University of Tsukuba. Authentic Radiation Transfer (ART) is one of the algorithms used in ARGOT and is the dominant part of the ARGOT program. We optimize the ART algorithm for FPGAs using OpenCL. Our implementation uses channels and block RAMs in the FPGA chip to improve performance. In a performance comparison among FPGA, CPU, and GPU, the FPGA is 4.9 times faster than the CPU and performs almost equally to the GPU.
As future work, we will combine the network and the application to realize the Accelerator in Switch (AiS) concept, which couples communication and computation tightly. We believe FPGAs can realize AiS because they can act as both accelerators and switches.
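
The 99.4% figure can be checked against the link rate. The sketch below is only an illustration: it assumes 1 GB = 10^9 bytes and takes the raw 40 Gbit/s line rate as the theoretical peak, ignoring Ethernet framing overhead. Neither assumption is stated in the abstract, and the function name is hypothetical.

```python
def link_utilization(measured_gb_per_s: float, link_gbit_per_s: float) -> float:
    """Return measured bandwidth as a fraction of the raw link rate."""
    # Assumption: 8 bits per byte and no protocol overhead, so 40 Gbit/s = 5 GB/s.
    peak_gb_per_s = link_gbit_per_s / 8.0
    return measured_gb_per_s / peak_gb_per_s


# Values quoted in the abstract: 4.97 GB/s over 40 Gbit Ethernet.
utilization = link_utilization(4.97, 40.0)
print(f"{utilization:.1%}")
```

Under these assumptions the ratio is 4.97 / 5.0, which agrees with the quoted 99.4%.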

Accelerating Space Radiative Transfer on FPGA using OpenCL International conference

Fujita, Norihisa, Kobayashi, Ryohei, Yamaguchi, Yoshiki, Oobata, Yuma, Boku, Taisuke, Abe, Makito, Yoshikawa, Kohji, Umemura, Masayuki

HEART2018 (9th International Symposium on Highly-Efficient Accelerators and Reconfigurable Technologies) 2018.6

Event date： 2018.6

Language：English Presentation type：Oral presentation (general)

One of the recent challenges faced by High-Performance Computing (HPC) is how to apply Field-Programmable Gate Array (FPGA) technology to accelerate a next-generation supercomputer as an efficient method of achieving high performance and low power consumption. The Graphics Processing Unit (GPU) is the most commonly used accelerator for HPC; it relies on regularly executed, highly parallel operations, which causes performance bottlenecks in some cases. On the other hand, there are great opportunities to flexibly and efficiently utilize FPGAs in logic circuits to fit various computing situations. However, it is not easy for application developers to implement FPGA logic circuits for their applications and algorithms, which generally require complicated hardware logic descriptions. Because of the progress made in the FPGA development environment in recent years, the High-Level Synthesis (HLS) development environment using the OpenCL language has become popular. Based on our experience describing kernels using OpenCL, we found that a more aggressive programming strategy is necessary to realize true high performance based on a "codesign" concept to implement the necessary features and operations to fit the target application in an FPGA design. In this paper, we optimize the Authentic Radiation Transfer (ART) method on an FPGA using OpenCL. We also discuss a method to parallelize its computation in an FPGA and a method to optimize the OpenCL code on FPGAs. Using a codesigned method for the optimized programming of a specific application with OpenCL for an FPGA, we achieved a performance that is 6.9 times faster than that of a CPU implementation using OpenMP, and almost the same performance as a GPU implementation using CUDA. The ART code should work on a larger configuration with multiple FPGAs requiring interconnections between them.
Considering the current advanced FPGAs with interconnection features, we believe that their parallelized implementation with multiple FPGAs will achieve a higher performance than GPU.

複数のFPGAによる分散ソーティングの実現に向けた予備評価

小林, 諒平, 藤田, 典久, 大畠, 佑真, 山口, 佳樹, 朴, 泰祐

リコンフィギャラブルシステム研究会 (RECONF) 2018.5

Event date： 2018.5

Language：Japanese Presentation type：Oral presentation (general)

Scalable Inter-FPGA Direct Communication for Parallel FPGA Applications Invited International conference

Kobayashi,Ryohei

18th SIAM Conference on Parallel Processing for Scientific Computing 2018.3

Event date： 2018.3

Language：English Presentation type：Oral presentation (general)

宇宙輻射輸送計算におけるHDL設計とOpenCL設計の比較

横野, 智也, 藤田, 典久, 山口, 佳樹, 大畠, 佑真, 小林, 諒平, 朴, 泰祐, 吉川, 耕司, 安部, 牧人, 梅村, 雅之

第163回ハイパフォーマンスコンピューティング研究発表会 2018.2

Event date： 2018.2 - 2018.3

Language：Japanese Presentation type：Oral presentation (general)

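Higher semiconductor integration has made FPGAs larger, more capable, and cheaper, and their adoption is now being considered not only for embedded systems but also for high-performance computing. However, FPGA development is mainly done with hardware description languages (HDLs), and the usability of FPGAs is greatly limited by the difficulty of development. For high-performance computing applications of FPGAs, design with high-level descriptions such as C and OpenCL is an option; however, although there are qualitative discussions of aspects such as development efficiency, few reports compare computational performance quantitatively. In this paper, we therefore use space radiative transfer computation as a benchmark to compare a high-level design (HLS design in OpenCL) with a low-level design (RTL design in Verilog HDL), and discuss the usability and computational performance of FPGAs from the viewpoint of high-performance computing applications. Specifically, we implemented on FPGAs the ART (Authentic Radiation Transfer) method, which solves the radiative transfer of recombination photons in simulations of primordial galaxy formation, and compared computational performance. Setting aside fine tuning of the arithmetic circuits and system-level design including external interfaces, and notwithstanding the difference in devices used (Xilinx and Intel), we confirmed that comparable performance can be obtained regardless of the description method.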

OpenCL-ready High Speed FPGA Network for Reconfigurable High Performance Computing International conference

Kobayashi, Ryohei, Oobata, Yuma, Fujita, Norihisa, Yamaguchi, Yoshiki, Boku, Taisuke

HPC Asia 2018: International Conference on High Performance Computing in Asia-Pacific Region 2018.1

Event date： 2018.1

Language：English Presentation type：Oral presentation (general)

Venue：Japan Tokyo

Field programmable gate arrays (FPGAs) have gained attention in high-performance computing (HPC) research because their computation and communication capabilities have dramatically improved in recent years as a result of improvements to semiconductor integration technologies that depend on Moore's Law. In addition to FPGA performance improvements, OpenCL-based FPGA development toolchains have been developed and offered by FPGA vendors, which reduces the programming effort required compared to the past. These improvements reveal the possibility of realizing a concept that offloads, on the fly, computation at which CPUs/GPUs perform poorly to FPGAs while performing low-latency data movement. We think that this concept is one of the keys to further improving the performance of modern heterogeneous supercomputers using accelerators like GPUs. In this paper, we propose high-performance inter-FPGA Ethernet communication using OpenCL and Verilog HDL mixed programming in order to demonstrate the feasibility of realizing this concept. OpenCL is used to program application algorithms and data movement control, while Verilog HDL is used to implement low-level components for Ethernet communication. Experimental results using ping-pong programs showed that our proposed approach achieves a latency of 0.99 μs and a bandwidth of up to 4.97 GB/s between FPGAs on different nodes, thus confirming that the proposed method is effective at realizing this concept.

OpenCLを用いたFPGAによる宇宙輻射輸送シミュレーションの演算加速

藤田, 典久, 小林, 諒平, 山口, 佳樹, 大畠, 佑真, 朴, 泰祐, 吉川, 耕司, 安部, 牧人, 梅村, 雅之

第161回ハイパフォーマンスコンピューティング研究発表会 2017.9

Event date： 2017.9

Language：Japanese Presentation type：Oral presentation (general)

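We have previously proposed an architecture called TCA (Tightly Coupled Accelerators), which tightly couples accelerators and lets them communicate with low latency, and have developed PEACH2 (PCI Express Adaptive Communication Hub Ver. 2) as an FPGA (Field Programmable Gate Array) implementation of TCA. Building on this work, we are now studying a concept called AiS (Accelerators in Switch), an architecture that takes the TCA concept further. AiS is a next-generation parallel computing acceleration mechanism that embeds application-specific compute units in the communication mechanism, achieving tighter coupling between computation and communication inside the FPGA. Although research on embedding compute units in PEACH2 has been carried out before, PEACH2 is written entirely in Verilog HDL (Hardware Description Language), so the compute part of AiS also had to be written in Verilog HDL; the development cost was high, and only FPGA experts could do the development. Thanks to recent advances in FPGA development environments, AiS can now be realized in a more general environment; high-speed communication mechanisms of 40 Gbps and 100 Gbps are available, and high-level synthesis, a technique that synthesizes circuits from languages used in software, has become widespread. For Intel FPGAs there is a high-level synthesis toolchain based on OpenCL, which allows not only circuit generation from the OpenCL language but also control of the FPGA through the OpenCL API. However, OpenCL code written and optimized for CPUs or GPUs is known not to perform well when used as is, so how to optimize for FPGAs is the challenge. In this paper, we use the Intel FPGA SDK for OpenCL, a high-level synthesis development environment for Intel FPGAs, to optimize for FPGAs the ART method used in the space radiative transfer simulation code ARGOT. We describe how to perform parallel computation inside the FPGA and what FPGA-specific optimizations to apply when implementing the ART method. In a performance evaluation on an Intel Arria 10 FPGA, we obtained a 14.6-times speedup over the CPU implementation; the implementation used 63% of the circuit resources and ran at an operating frequency of 236.11 MHz.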

OpenCLとVerilog HDLの混合記述によるFPGA間Ethernet接続

大畠佑真, 小林諒平, 藤田典久, 山口佳樹, 朴泰祐

第160回ハイパフォーマンスコンピューティング研究発表会（SWoPP2017） 2017.7

Event date： 2017.7

Presentation type：Oral presentation (general)

高位合成によるFPGAの高性能計算へ適用

2017年ハイパフォーマンスコンピューティングと計算科学シンポジウム 2017.6

Event date： 2017.6

Presentation type：Poster presentation

OpenCLとVerilog HDLの混合記述によるFPGAプログラミング

藤田典久, 大畠佑真, 小林諒平, 山口佳樹, 朴泰祐

第158回ハイパフォーマンスコンピューティング研究発表会 2017.3

Event date： 2017.3

Presentation type：Oral presentation (general)

A survey of how to efficiently implement application-specific hardware on an FPGA Invited International conference

Kobayashi,Ryohei

2nd International Workshop on FPGA for HPC (IWFH) 2016.10

Event date： 2016.10

Language：English Presentation type：Oral presentation (general)

A High-speed Verilog HDL Simulation Method using a Lightweight Translator International conference

Kobayashi, Ryohei, Misono, Tomohiro, Kise, Kenji

International Symposium on Highly-Efficient Accelerators and Reconfigurable Technologies (HEART 2016) 2016.7

Event date： 2016.7

Language：English Presentation type：Oral presentation (general)

Venue：Hong Kong

Designing with Hardware Description Languages (HDLs) is still the de facto standard way to develop FPGA-based custom computing systems, and RTL simulation is an important step in ensuring that the designed hardware behavior meets the design specification. In this paper, we propose a new high-speed Verilog HDL simulation method. It is based on two previously proposed techniques: ArchHDL and Pyverilog. ArchHDL is used as the simulation engine because the RTL simulation it provides can be parallelized with OpenMP. We use Pyverilog to develop a code translator that converts Verilog HDL source code into ArchHDL code, which makes the translator easy to realize and keeps its implementation lightweight. We compare the proposed method with Synopsys VCS, and the experimental results show that the RTL simulation behavior is the same as that of Synopsys VCS and that the simulation speed is up to 5.8x higher.

世界最速のFPGAソーティングアクセラレータの初期検討

臼井, 琢真, 眞下, 達, 松田, 裕貴, 小林, 諒平, 吉瀬, 謙二

情報処理学会第78回全国大会 2016.3

Event date： 2016.3

Language：Japanese Presentation type：Oral presentation (general)

Venue：Japan Keio University

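Sorting is a very important computation kernel used in a variety of applications such as databases, image processing, and data compression. Various acceleration methods have therefore been proposed, some of which use FPGAs. Because an FPGA is an LSI whose internal configuration users can design freely, implementing application-specific arithmetic circuits and data supply mechanisms makes it possible to build accelerators with higher computational performance than CPUs and GPUs. In this paper, we examine approaches toward realizing the world's fastest FPGA-based sorting accelerator.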

Frix: Feasible and Reconfigurable IBM PC Compatible SoC

Matsuda, Yuki, Ogawa, Eri, Misono, Tomohiro, Kobayashi, Ryohei, Kise, Kenji

情報処理学会第78回全国大会 2016.3

Event date： 2016.3

Language：English Presentation type：Oral presentation (general)

Venue：Japan Keio University

In order to develop high performance computer systems effectively, environments to evaluate architectural ideas are required. For this purpose, software-based simulators are often used, but they have the disadvantage of slow simulation speed. In order to achieve fast simulation speed, hardware environments are desired. We propose Frix (Feasible and Reconfigurable IBM PC Compatible SoC), which is an FPGA-based evaluation environment with an x86 soft processor. Frix can boot the general-purpose operating systems FreeDOS and TinyCore. The source code of Frix is written in Verilog HDL and released as open source. In this paper, we detail the design of Frix and show how to use Frix for research and education.

Effective Parallel Simulation of ArchHDL under Manycore Environment International conference

Misono, Tomohiro, Kobayashi, Ryohei, Kise, Kenji

2015 Third International Symposium on Computing and Networking (CANDAR) 2015.12

Event date： 2015.12

Language：English Presentation type：Oral presentation (general)

For development of hardware such as System on a Chip (SoC), RTL simulation is very important to verify the design. Since RTL simulation has to be repeated many times during the development period, the simulation speed must be fast. However, as the design becomes larger and more complex, the simulation time dramatically increases and developers may not complete the simulation in a reasonable time. Therefore we have proposed a new hardware description language named ArchHDL which enables fast RTL simulation. Designers can write RTL design and test bench in a Verilog HDL-like style. Designers can compile design files in ArchHDL with standard C++ compiler and simulate them by executing the binary. The ArchHDL simulation is cycle accurate and can be parallelized using OpenMP without decreasing the accuracy. In this paper, we show the effectiveness of ArchHDL under a manycore environment. We use Intel Xeon Phi 31S1P Coprocessor in native execution mode to run parallel ArchHDL simulation. For performance evaluation, we use a NoC and a MIPS based manycore processor. As a result, the ArchHDL simulation on 57 cores of the Xeon Phi running at 1.1 GHz achieves up to 48x speedup compared to 1-core execution. Moreover, ArchHDL on 57 cores of the Xeon Phi is up to 9.7x faster than Synopsys VCS running on a single thread of Intel Xeon CPU E5-2687W operating at 3.1GHz and up to 1.7x faster than ArchHDL on 8 cores of the Xeon CPU E5-2687W.

SSDの並列性を引き出すI/Oスケジューラ

奥村, 開里, 小林, 諒平, 吉瀬, 謙二

第135回OS・第39回EMB合同研究発表会 2015.11

Event date： 2015.11

Language：Japanese Presentation type：Oral presentation (general)

Venue：Japan Ochanomizu University

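In recent years, Solid State Drives (SSDs) have come into use in a wide range of settings, not only personal computers but also cloud storage and data centers. To improve performance, SSDs process I/O in parallel across multiple channels and across the multiple chips on each channel, but no SSD scheduler that takes this into account is built into the OS. In this paper, we therefore propose the Alleviate Conflict (AC) scheduler, which aims to reduce latency and improve throughput by exploiting the parallelism of SSDs. We implemented the proposed scheduler in Linux and evaluated SSD bandwidth and latency using various I/O request patterns. For an I/O access pattern close to that of a web server, compared with the Noop, Deadline, and CFQ schedulers that are standard in the Linux kernel, the proposed I/O scheduler improved bandwidth by 4% and reduced latency by 15% relative to Noop, improved bandwidth by 7% and reduced latency by 7% relative to Deadline, and improved bandwidth by 34% and reduced latency by 40% relative to CFQ.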

Reconfigurable IBM PC Compatible SoC for Computer Architecture Education and Research International conference

Ogawa, Eri, Matsuda, Yuki, Misono, Tomohiro, Kobayashi, Ryohei, Kise, Kenji

2015 IEEE 9th International Symposium on Embedded Multicore/Many-core Systems-on-Chip 2015.9

Event date： 2015.9

Language：English Presentation type：Oral presentation (general)

Venue：Italy Turin

In order to develop high performance computer systems efficiently, environments to evaluate architectural ideas are required. Software environments such as simulators are very flexible, and thus often used. On the other hand, if the target hardware is complex and large, it is very hard to finish the simulation in practical time because of software's slow simulation speed. Thus, we develop a hardware environment for efficient evaluation of computer systems. We propose and develop an IBM PC Compatible SoC on an FPGA where hardware developers can evaluate their custom architectures. The SoC has an x86 soft core processor which can run general purpose operating systems. By making the proposed system run on FPGAs of two major vendors, i.e. Xilinx and Altera, we believe that it can be widely adopted. Besides, the SoC can be used for learning computer systems, because of its open-source policy. In this paper, we detail the design and implementation of the proposed SoC, and verify that it accurately runs some applications. As a case study to demonstrate usability of the SoC for computer research, we implement two types of L2 caches in Verilog HDL and evaluate their performance by running the SPEC CPU2000 INT benchmark suite. Additionally, we discuss how the SoC can be used for computer education.

FACE: Fast and Customizable Sorting Accelerator for Heterogeneous Many-core Systems International conference

Kobayashi, Ryohei, Kise, Kenji

2015 IEEE 9th International Symposium on Embedded Multicore/Many-core Systems-on-Chip 2015.9

Event date： 2015.9

Language：English Presentation type：Oral presentation (general)

Venue：Italy Turin

Performance improvements of single-core processors relying on high clock rates have reached their limit. Instead of single-core processors, multi-core and many-core processors have become mainstream for accelerating applications by parallel processing. Year by year, the number of cores integrated in a single chip has increased due to improvements in semiconductor integration technologies depending on Moore's Law. On the other hand, Moore's Law will end in the near future. This means that approaches relying on increasing the number of cores will no longer be viable, and we have to consider other effective ways. One of them is to design application-specific hardware. Several research organizations have explored it and reported remarkable findings. We focus on such an acceleration approach with dedicated hardware. As a case study with dedicated hardware, we present a sorting acceleration method. Sorting is an extremely important computation kernel that should be accelerated in many fields, such as databases, image processing, and data compression. We propose a sorting accelerator combining a Sorting Network and a Merge Sorter Tree, and detail its design and implementation. Our proposed sorting accelerator is customizable, so designers can implement a sorting accelerator composed of the required hardware resources by tuning design parameters. Our experiments show that the proposed hardware achieves up to 10.06x the sorting performance of an Intel Core i7-4770 operating at 3.4GHz when sorting 256M 32-bit integer elements. In order to allow every designer to easily and freely use this accelerator, the RTL source code is released as open-source hardware.
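
The combination of a sorting network and a merge sorter tree can be illustrated in software. The sketch below is not the FACE RTL design. It only shows the general idea under assumed parameters: a fixed 4-input comparator network produces short sorted runs, and a binary tree of 2-way merges combines them. All function names and the run length of 4 are hypothetical choices for illustration.

```python
def sorting_network_4(a, b, c, d):
    """Sort four values with a fixed sequence of five compare-and-swap steps."""
    def cas(x, y):
        # Compare-and-swap: return the pair in ascending order.
        return (x, y) if x <= y else (y, x)

    a, b = cas(a, b)
    c, d = cas(c, d)
    a, c = cas(a, c)
    b, d = cas(b, d)
    b, c = cas(b, c)
    return [a, b, c, d]


def merge_two(left, right):
    """Merge two sorted lists into one sorted list (one node of the merge tree)."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i])
            i += 1
        else:
            out.append(right[j])
            j += 1
    out.extend(left[i:])
    out.extend(right[j:])
    return out


def merge_tree(runs):
    """Merge sorted runs pairwise, level by level, until one run remains."""
    if not runs:
        return []
    while len(runs) > 1:
        merged = [merge_two(runs[i], runs[i + 1]) for i in range(0, len(runs) - 1, 2)]
        if len(runs) % 2:
            merged.append(runs[-1])  # odd run passes through to the next level
        runs = merged
    return runs[0]


def sort_with_network_and_merge_tree(values):
    """Sort integers: sorting network for 4-element runs, then a merge tree."""
    pad = (-len(values)) % 4
    padded = list(values) + [float("inf")] * pad  # padding sorts to the end
    runs = [sorting_network_4(*padded[i:i + 4]) for i in range(0, len(padded), 4)]
    return merge_tree(runs)[:len(values)]


print(sort_with_network_and_merge_tree([5, 3, 9, 1, 7, 2, 8]))
```

In hardware, the comparators of the network and the nodes of the merge tree would operate concurrently as a pipeline; this sequential sketch only mirrors the data flow.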

FPGAを用いた世界最速のソーティングハードウェアの実現に向けた試み

小林, 諒平, 吉瀬, 謙二

リコンフィギャラブルシステム研究会（RECONF） 2015.6

Event date： 2015.6

Language：Japanese Presentation type：Oral presentation (general)

Venue：Japan Kyoto University

Sorting is an extremely important computation kernel that has been accelerated by using FPGAs in a lot of fields, such as databases, image processing, data compression, etc. FPGA-based accelerators can achieve higher computation performance than CPUs and GPUs, because designers can implement circuits that realize application-specific pipelined hardware and data supply system. In this paper, we introduce several approaches to realize the fastest FPGA-based sorting hardware in the world, and discuss our present system compared with the prior work. According to these approaches and the performance model, we figure out how to design the sorting hardware, whose performance is equal to that of the prior work, with the half of the hardware resources.

FPGAベースのソーティングアクセラレータの設計と実装

小林, 諒平, 吉瀬, 謙二

コンピュータシステム研究会 (CPSY) 2015.4

Event date： 2015.4

Language：Japanese Presentation type：Oral presentation (general)

Venue：Japan Meiji University Nakano Campus

Sorting is an extremely important computation kernel that researchers have tried to accelerate in many fields, such as databases, image processing, and data compression. We propose an FPGA-based accelerator that executes sorting at high speed. FPGA-based accelerators can achieve higher computation performance than CPUs and GPUs, because designers can implement circuits that realize application-specific pipelined hardware and data supply systems. Our proposed FPGA accelerator uses two approaches: “Sorting Network” and “Merge Sorter Tree”. In this paper, we detail the design and implementation of the proposed sorting accelerator and evaluate its performance. As a result, the sorting speed of the proposed hardware is up to 10.06x that of an Intel Core i7-4770 operating at 3.4GHz.

A Challenge of Portable and High-Speed FPGA Accelerator International conference

Usui, Takuma, Kobayashi, Ryohei, Kise, Kenji

11th International Symposium on Applied Reconfigurable Computing 2015 2015.4

Event date： 2015.4

Language：English Presentation type：Poster presentation

Venue：Germany Bochum

FPGA accelerators can achieve higher computation performance and better power efficiency than CPUs and GPUs, because designers can implement circuits that realize application-specific pipelined hardware and data supply systems. In this paper, we propose a portable and high-speed FPGA accelerator employing USB3.0, which is a data transfer interface with high versatility and high speed. We choose sorting as a practical application for the FPGA accelerator, and then design and implement the FPGA accelerator that executes sorting at high speed. To demonstrate the high portability, we evaluate the FPGA accelerator with several desktop PCs and laptop PCs. The evaluation result shows the sorting speed of the proposed FPGA accelerator is 1.26x and 2.60x higher than Intel Core i7-3770K operating at 3.5GHz and Intel Core i3-4010U operating at 1.83GHz, respectively. From this evaluation, we also show that the proposed FPGA accelerator has high portability.

Ultra High-speed FPGA Accelerator for Sorting Application

Kobayashi, Ryohei, Kise, Kenji

情報処理学会第77回全国大会 2015.3

Event date： 2015.3

Language：Japanese Presentation type：Oral presentation (general)

Venue：Japan Kyoto University

FPGA accelerators can obtain higher computation performance and better power efficiency than CPUs and GPUs, because designers can implement circuits that realize application-specific pipelined hardware and data supply systems. In this paper, we propose an approach to sorting acceleration using a large FPGA. Sorting is an extremely important computation kernel that researchers have tried to accelerate in many fields. We design and implement the proposed FPGA accelerator, and then evaluate its performance by comparing it with a modern desktop computer. From this evaluation, we show how sorting is accelerated.

USB3.0接続の手軽で高速なFPGAアクセラレータの設計と実装

臼井, 琢真, 小林, 諒平, 吉瀬, 謙二

リコンフィギャラブルシステム研究会（RECONF） 2015.1

Event date： 2015.1

Language：Japanese Presentation type：Oral presentation (general)

Venue：Japan Keio University Hiyoshi Campus

FPGA accelerators can obtain higher computation performance and better power efficiency than CPUs and GPUs, because designers can implement circuits that realize application-specific pipelined hardware and data supply systems. In this paper, we propose a portable and high-speed FPGA accelerator employing USB3.0. USB3.0 is a data transfer interface with high versatility and high speed. We chose sorting as a practical application to be accelerated, and designed and implemented the FPGA accelerator for it. To show the high portability of the proposed FPGA accelerator, we evaluated it with several computers, such as desktop PCs and laptop PCs. As a result, the effective sorting performance of the proposed FPGA accelerator is 1.28 and 2.60 times higher than that of an Intel Core i7-3770K operating at 3.5GHz and an Intel i3-4010U operating at 1.83GHz, respectively. From this evaluation, we show that the proposed FPGA accelerator has high portability.

3bOS: A flexible and lightweight embedded OS operated using only 3 buttons

ImmanuelV, Encarnacion, Kobayashi, Ryohei, Kise, Kenji

組込みシステムシンポジウム2014 2014.10

Event date： 2014.10

Language：Japanese Presentation type：Oral presentation (general)

An embedded system we developed, the MieruEMB system, is used as an educational kit for learning implementation skills and knowledge regarding embedded systems. In this paper we present 3bOS, a simple and easily customizable embedded OS, running on the MieruEMB system. 3bOS comes with a three-button interface and a built-in file explorer for FAT file systems. 3bOS is capable of running ELF executables, providing approximately 400 KB of memory for an application. It can also support basic graphics functions. This embedded OS is written in C, and just consists of around 800 lines of the code. Because of its simplicity, users can easily understand how this embedded OS runs on the MieruEMB system, and can easily modify this embedded OS if they want. We show the design, the implementation, and the features of 3bOS, and conclude that 3bOS is usable for educational purposes.

FPGAの消費電力を削減するHDLコーディング手法の検討

Kobayashi, Ryohei, Kise, Kenji

情報処理学会第76回全国大会 2014.3

Event date： 2014.3

Language：Japanese Presentation type：Oral presentation (general)

Venue：Japan Tokyo Denki University

The advantages of using FPGAs (Field Programmable Gate Arrays) are easy design changes, low respin costs, and shorter development time. However, in exchange for these benefits, FPGAs have disadvantages compared with ASICs: higher power consumption, larger silicon area, and lower operating speed. In particular, higher power consumption not only raises packaging costs, shortens chip lifetimes, and requires expensive cooling systems, but also decreases system reliability. Therefore, it is truly important to reduce the power consumption of FPGAs. In this paper, we compare HDL (Hardware Description Language) coding styles that have already been proposed to reduce FPGA power consumption, and seek a more effective approach.

Development of Scalable Stencil-Computation Accelerator Based on Multiple Small FPGAs

小林, 諒平, 高前田(山崎), 伸也, 吉瀬, 謙二

Symposium on Advanced Computing Systems and Infrastructures 2013.5

Event date： 2013.5

Language：Japanese Presentation type：Oral presentation (general)

Venue：Japan Sendai International Center

Stencil computation is one of the typical scientific computing kernels. It is applied in diverse areas such as earthquake simulation, digital signal processing, and fluid calculation. We have proposed a high-performance architecture for 2D stencil computation and implemented it using multiple small FPGAs. We developed the system in stages. First, we implemented a software simulator in C++ that emulates stencil computation on multiple FPGA nodes with cycle-level accuracy. Second, we implemented the circuits in Verilog HDL based on the software simulator. We then implemented the circuits on an FPGA array and verified the array. We evaluated the performance, scalability, and power consumption of the developed FPGA array. As a result, we established the validity of the proposed architecture, since the FPGA array operated successfully. The FPGA array with 100 FPGAs achieved about 0.6 GFlop/sW. This performance/W value is about 3.8 times better than that of a typical CPU card.
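
For readers unfamiliar with the kernel, a 2D stencil computation updates each grid point from a fixed pattern of neighbors. The sketch below assumes a 4-point averaging stencil with fixed boundary values; the abstract does not specify the stencil used, so the pattern, the coefficient, and the function name are illustrative assumptions only.

```python
def stencil_step(grid):
    """Apply one sweep of a 4-point averaging stencil to the interior of a 2D grid."""
    rows, cols = len(grid), len(grid[0])
    # Copy the grid so boundary values carry over unchanged.
    new_grid = [row[:] for row in grid]
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            # Each interior point becomes the mean of its four neighbors.
            new_grid[i][j] = 0.25 * (
                grid[i - 1][j] + grid[i + 1][j] + grid[i][j - 1] + grid[i][j + 1]
            )
    return new_grid


# A 4x4 grid with a hot top edge; interior values start at zero.
grid = [
    [1.0, 1.0, 1.0, 1.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
for row in stencil_step(grid):
    print(row)
```

Because every point depends only on its immediate neighbors, the grid can be partitioned across nodes that exchange just their edge values, which is what makes a mesh-connected array a natural fit.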

Design of Synchronization Mechanism to Conquer the Clock Oscillator Variation for High Performance Stencil Computation Accelerator

Kobayashi, Ryohei, Takamaeda-Yamazaki, Shinya, Kise, Kenji

情報処理学会第75回全国大会 2013.3

Event date： 2013.3

Language：Japanese Presentation type：Oral presentation (general)

Venue：Japan Tohoku University

Stencil computation is one of the typical scientific computing kernels. It is applied in diverse areas such as earthquake simulation and seismic imaging for the oil and gas exploration industry. We have proposed an effective stencil computation method and architecture employing multiple small FPGAs in a 2D-mesh topology. However, when we implemented the stencil computation accelerator, we found that it does not operate stably because of clock oscillator variation. This variation occurs because each FPGA node that composes the accelerator has its own clock domain. In this paper, we evaluate the clock oscillator variation quantitatively and describe the design of a synchronization mechanism that overcomes the variation so that the accelerator operates successfully.

メッシュ接続FPGAアレーを用いた高性能ステンシル計算機の設計と実装

小林, 諒平, 高前田(山崎), 伸也, 吉瀬, 謙二

リコンフィギャラブルシステム研究会（RECONF） 2013.1

Event date： 2013.1

Language：Japanese Presentation type：Oral presentation (general)

Venue：Japan Keio University Hiyoshi Campus

We develop an effective stencil computation accelerator using a 2D-mesh architecture that connects multiple small FPGAs. During development, we encountered a problem: the system produces incorrect computation results when multiple FPGA nodes are used. The cause is clock period variation. This paper describes a quantitative evaluation of the clock variation of every FPGA node, and the design and implementation of a mechanism that lets the system operate successfully.

Towards a Low-Power Accelerator of Many FPGAs for Stencil Computations International conference

Kobayashi, Ryohei, Takamaeda-Yamazaki, Shinya, Kise, Kenji

2012 Third International Conference on Networking and Computing 2012.12

Event date： 2012.12

Language：English Presentation type：Oral presentation (general)

Venue：Japan Okinawa

We have proposed an effective stencil computation method and architecture employing multiple small FPGAs in a 2D-mesh topology. In this paper, we show that our proposed architecture works correctly on a real 2D-mesh connected FPGA array. We developed a software simulator in C++, which emulates our proposed architecture, and implemented two prototype systems in Verilog HDL. One prototype system is for logic verification with communication modules and the other is for estimation of power consumption without communication modules. We ran the former prototype system for 2M cycles and checked its behavior against the software simulator. Our architecture is developed towards a low-power accelerator of many FPGAs. The evaluation result with the second prototype shows that a single FPGA node with eight floating-point adders and eight floating-point multipliers achieves 2.24 GFlop/s at 0.16 GHz with 2.37 W power consumption. This performance/W value is about six times better than that of an NVIDIA GTX280 GPU card.
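
The single-node figures can be related by simple arithmetic. The sketch below assumes that each of the 16 floating-point units completes one operation per cycle, which gives a peak to compare against; that assumption and the variable names are illustrative and not taken from the paper.

```python
# Figures quoted in the abstract.
ADDERS = 8
MULTIPLIERS = 8
CLOCK_GHZ = 0.16
MEASURED_GFLOPS = 2.24
POWER_W = 2.37

# Assumed peak: every adder and multiplier issues one operation per cycle.
peak_gflops = (ADDERS + MULTIPLIERS) * CLOCK_GHZ

# Fraction of the assumed peak that the measured rate represents.
utilization = MEASURED_GFLOPS / peak_gflops

# Energy efficiency of one node in GFlop/s per watt.
gflops_per_watt = MEASURED_GFLOPS / POWER_W

print(f"peak: {peak_gflops:.2f} GFlop/s")
print(f"utilization: {utilization:.1%}")
print(f"efficiency: {gflops_per_watt:.3f} GFlop/sW")
```

Under this assumption the node reaches 2.24 of a possible 2.56 GFlop/s, and its efficiency is a little under 1 GFlop/s per watt.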

メッシュ接続FPGAアレーにおける高性能ステンシル計算

小林, 諒平, 佐野, 伸太郎, 高前田(山崎), 伸也, 吉瀬, 謙二

SACSIS2012 - 先進的計算基盤システムシンポジウム 2012.5

Event date： 2012.5

Language：Japanese Presentation type：Oral presentation (general)

Venue：Japan Kobe International Conference Center

The FPGA (Field Programmable Gate Array) is a remarkable device for easily developing custom hardware accelerators with higher performance. In this paper, we propose a scalable stencil computation mechanism employing multiple small FPGAs. Stencil computation is one of the most important kernels in scientific computations. This paper describes the architecture of our multi-FPGA-based stencil computation system with a 2D-mesh topology and the details of the primary implementation. In order to eliminate the handshaking overhead among neighboring FPGAs, the computation order is customized for each FPGA to increase the overlap rate of computation and communication. The evaluation result shows that a single FPGA node achieves 2.24 GFlop/s at 0.16 GHz with 2.37 W power consumption. We estimated the system performance of 256 FPGAs. The 256-FPGA system achieves 537 GFlop/s with 0.94 GFlop/sW efficiency.

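A Study of Stencil Computation on a Mesh-Connected FPGA Array

Kobayashi, Ryohei, Sano, Shintaro, Takamaeda (Yamazaki), Shinya, Kise, Kenji

The 74th National Convention of IPSJ 2012.3

Event date： 2012.3

Language：Japanese Presentation type：Oral presentation (general)

Venue：Japan Nagoya Institute of Technology

In recent years, because FPGAs allow dedicated hardware to be configured flexibly, research on using FPGAs as accelerators for scientific computing has been active. In this study, we examined stencil computation, one of the important computational kernels in scientific and technical computing, using a mesh-connected FPGA array.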

OpenCL-Enabled GPU–FPGA Accelerated Computing with Inter-FPGA Communication International conference

Boku, Taisuke, Kobayashi, Ryohei, Fujita, Norihisa, Yamaguchi, Yoshiki, Nakamichi, Ayumi

IXPUG Workshop HPC Asia 2020 2020.1

Language：English Presentation type：Oral presentation (general)

Venue：Japan Fukuoka

OpenCL-enabled Parallel Raytracing for Astrophysical Application on Multiple FPGAs with Optical Links International conference

Fujita, Norihisa, Kobayashi, Ryohei, Yamaguchi, Yoshiki, Boku, Taisuke, Yoshikawa, Kohji, Abe, Makito, Umemura, Masayuki

Sixth International Workshop on Heterogeneous High-performance Reconfigurable Computing (H2RC'20) 2020.11

Language：English Presentation type：Oral presentation (general)

In an earlier study, we optimized the Authentic Radiative Transfer (ART) method, which solves space radiative transfer problems in early-universe astrophysical simulations, for an Intel Arria 10 Field Programmable Gate Array (FPGA). In this paper, we optimize this method for the latest FPGA, an Intel Stratix 10, and evaluate its performance by comparing it with a GPU implementation on multiple nodes. For the multi-FPGA computing and communication framework, we apply our original system, called Communication Integrated Reconfigurable CompUting System (CIRCUS), to realize OpenCL-based programming and to use multiple optical links on an FPGA for parallel FPGA processing; this study is the first implementation of a real application using CIRCUS. The FPGA implementation is 4.54, 8.41, and 10.64 times faster than the GPU implementation on one, two, and four nodes, respectively, where the multi-GPU cases use an InfiniBand HDR100 network. It also achieves 94.2% parallel efficiency running on four FPGAs. We believe this efficiency comes from the low-latency and high-efficiency pipelined communication of CIRCUS, which provides easy programming on multi-FPGAs using OpenCL for

Awards

HPC in Asia Poster Award

2018.6

Norihisa Fujita, Ryohei Kobayashi, Yoshiki Yamaguchi, Makito Abe, Kohji Yoshikawa, Masayuki Umemura

Award type：Award from international society, conference, symposium, etc.

IEICE CPSY Young Presentation Award

2015.4 IEICE Technical Committee on Computer Systems (CPSY)

Ryohei Kobayashi

Award type：Award from Japanese society, conference, symposium, etc. Country：Japan

Design and Implementation of an FPGA-Based Sorting Accelerator
Kobayashi, Ryohei, Kise, Kenji
IEICE-CPSY2015-5 115(7) 25 - 30 2015.4

The 2nd ARC/CPSY/RECONF High-Performance Computer System Design Contest, Winner of the Computer System Design Division

2014.9 Executive Committee of the 2nd ARC/CPSY/RECONF High-Performance Computer System Design Contest

Ryohei Kobayashi

Award type：Award from Japanese society, conference, symposium, etc. Country：Japan

http://www.is.utsunomiya-u.ac.jp/pearlab/contest/

Research Projects

Investigating the disk structure around the black hole with two temperature GRRMHD simulations

Grant number：24K00678 2024.4 - 2028.3

Grant type：Competitive

Grant amount：¥18590000 （ Direct Cost: ¥14300000 、 Indirect Cost：¥4290000 ）

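A Programming Environment for Next-Generation High-Performance Systems Supporting Multiple Types of Accelerators

Grant number：24K14967 2024.4 - 2027.3

Japan Society for the Promotion of Science Grants-in-Aid for Scientific Research Grant-in-Aid for Scientific Research (C)

Fujita Norihisa

Grant type：Competitive

Grant amount：¥4680000 （ Direct Cost: ¥3600000 、 Indirect Cost：¥1080000 ）

Accelerators are widely used to address the growing power consumption of supercomputers, and Graphics Processing Units (GPUs) in particular are widely used. However, GPUs are known to handle some computations poorly, and those parts become bottlenecks. The applicants believe that combining different accelerators in a tightly coupled manner can solve this problem and achieve further acceleration. This research asks whether diverse computing devices can be handled in a unified, cross-cutting manner, and aims to realize a programming environment that handles accelerators uniformly.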

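An HPC Application Analysis Method Toward Multi-Factor Cooperative Approximate Computing

Grant number：23K11056 2023.4 - 2026.3

Japan Society for the Promotion of Science Grants-in-Aid for Scientific Research Grant-in-Aid for Scientific Research (C)

Wada Yasutaka

Grant type：Competitive

Grant amount：¥4810000 （ Direct Cost: ¥3700000 、 Indirect Cost：¥1110000 ）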

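Creation of a GPU-FPGA Hybrid Platform for Graph-Structured Data Analysis

Grant number：22K17895 2022.4 - 2025.3

Japan Society for the Promotion of Science Grants-in-Aid for Scientific Research Grant-in-Aid for Early-Career Scientists

Kobayashi Ryohei

Authorship：Principal investigator Grant type：Competitive

Grant amount：¥4550000 （ Direct Cost: ¥3500000 、 Indirect Cost：¥1050000 ）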

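Next-Generation Supercomputing with Multi-Hybrid Acceleration Mechanisms

Grant number：21H04869 2021.4 - 2025.3

Japan Society for the Promotion of Science Grants-in-Aid for Scientific Research Grant-in-Aid for Scientific Research (A)

Boku Taisuke

Grant type：Competitive

Grant amount：¥41730000 （ Direct Cost: ¥32100000 、 Indirect Cost：¥9630000 ）

We develop the foundational technology for a multi-hybrid accelerated supercomputer, a next-generation acceleration system technology that substantially improves performance in large-scale high-performance parallel systems by introducing FPGAs in addition to conventional GPU acceleration. We make both accelerator devices programmable in a unified manner and build an environment that enables this programming on a PGAS system, a distributed-memory parallel model.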

An FPGA-based ultra-fast hardware sorting algorithm

Grant number：19K20276 2019.4 - 2022.3

Japan Society for the Promotion of Science Grants-in-Aid for Scientific Research Grant-in-Aid for Early-Career Scientists

Kobayashi Ryohei

Authorship：Principal investigator Grant type：Competitive

Grant amount：¥3380000 （ Direct Cost: ¥2600000 、 Indirect Cost：¥780000 ）

This research took full advantage of FPGA features, such as the ability to implement application-specific computation pipelines and data supply mechanisms, to develop hardware algorithms that sort at high speed. Specifically, we proposed a new architecture that applies a virtual merge sorter tree using on-chip memory to a sub-tree of the high-throughput merge sorter tree from existing research, and combined the tree with a sorting network. We developed an OpenCL library that calls this sorting engine as a function and evaluated its sorting performance. The evaluation showed that the sorting performance of the proposed method was three orders of magnitude better than that of merge sort written in OpenCL.
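
The two-stage idea in this summary, small sorted runs produced first and then combined by merging, can be illustrated with a software analogy in Python. This is only a sketch under stated assumptions: `chunked_merge_sort` and the chunk size of 4 are hypothetical, `sorted` stands in for the sorting network, and `heapq.merge` stands in for the merge sorter tree. It says nothing about the actual hardware design or its performance.

```python
import heapq

def chunked_merge_sort(data, chunk=4):
    """Sort fixed-size chunks, then merge all sorted runs into one sequence."""
    # Stage 1: sort each small chunk (software stand-in for a sorting network).
    runs = [sorted(data[i:i + chunk]) for i in range(0, len(data), chunk)]
    # Stage 2: k-way merge of the sorted runs (stand-in for a merge tree).
    return list(heapq.merge(*runs))

values = [5, 3, 8, 1, 9, 2, 7, 4, 6]
sorted_values = chunked_merge_sort(values)
```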

Social Activities

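2nd TSUBAME Mini Camp

Role(s)： Lecturer

Institute of Science Tokyo (Center for Information Infrastructure; Supercomputing Research Center, Institute of Integrated Research) 2025.9

Audience： College students, Graduate students, Teachers, Researchers

Type：Seminar, workshop

This mini camp offers users who have applications they want to run fast on the TSUBAME4.0 supercomputer at Institute of Science Tokyo a hands-on opportunity to optimize them on TSUBAME4. Participation is free, and participants are issued a TSUBAME4 account for the duration of the event.

In this mini camp, participants bring their own code and datasets and work on GPU-related problems with advice from mentors. In addition to faculty of the Center for Information Infrastructure, specialists in GPU-based high-performance computing take part as mentors, so participants can consult them at any time while porting their programs to GPUs or improving GPU utilization and performance at their own pace.

This mini camp focuses in particular on optimizing applications in the high-performance computing field to make use of GPUs.

The event is held in a hybrid format, on site at Institute of Science Tokyo (the first day at Suzukakedai Campus, including a TSUBAME4 tour, and the final day at Ookayama Campus) and online. On-site and online participants alike launch Slack and Zoom and join from an environment in which they can connect to TSUBAME4.0 and work. Power outlets are provided at the venue, but no terminals are available, so please bring your own laptop or similar device.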

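SupercomputingContest2025

Role(s)： Lecturer, Advisor, Planner, Organizing member

Institute of Science Tokyo (Center for Information Infrastructure; Supercomputing Research Center, Institute of Integrated Research) / Osaka University (D3 Center) / RIKEN (Center for Computational Science) 2025.8

Audience： High school students

Type：Science festival

The 31st Summer Dennou Koshien: held online in August 2025 using the TSUBAME4.0 supercomputer at Institute of Science Tokyo.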

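249th Parallel Programming Seminar with Trial Accounts: 13th GPU Mini Camp

Role(s)： Lecturer

Joint Center for Advanced High Performance Computing (JCAHPC) (Center for Computational Sciences, University of Tsukuba; Information Technology Center, The University of Tokyo); Information Initiative Center, Hokkaido University; Cyberscience Center, Tohoku University; Center for Information Infrastructure, Institute of Science Tokyo; Academic Center for Computing and Media Studies, Kyoto University; D3 Center, Osaka University; Research Institute for Information Technology and Data Analysis Support Division, Data-Driven Innovation Initiative, Kyushu University; NVIDIA Japan; PC Cluster Consortium (HPC Open Source Software Promotion Working Group) 2025.7

Audience： College students, Graduate students, Teachers, Researchers

Type：Seminar, workshop

We hold a GPU mini camp offering hands-on practice on the Miyabi supercomputer for those who want to port existing CPU simulation code to GPUs or extend existing single-GPU code to multiple GPUs. Participation is free.
In the GPU mini camp, participants bring their own code and datasets and work on GPU-related problems with advice from mentors. In addition to faculty of the Information Technology Center, GPU specialists take part as mentors, so participants can consult them at any time while porting code to GPUs or improving GPU utilization at their own pace.
This mini camp focuses in particular on porting existing CPU simulation code to GPUs with OpenACC (directives), GPU libraries, or CUDA (a GPU-specific language), and on extending existing single-GPU code to multiple GPUs with MPI.
The event is held online. Participants launch Slack and Zoom and join from an environment in which they can connect to Miyabi and work. Zoom and Slack connection details are sent only to registered applicants.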

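239th Parallel Programming Seminar with Trial Accounts: JCAHPC Open Hackathon

Role(s)： Lecturer

Joint Center for Advanced High Performance Computing (JCAHPC) (Center for Computational Sciences, University of Tsukuba; Information Technology Center, The University of Tokyo) 2025.1 - 2025.2

Audience： College students, Graduate students, Teachers, Researchers

Type：Seminar, workshop

We hold a hackathon offering hands-on practice on the Miyabi supercomputer for those who want to port existing CPU simulation code to GPUs or extend existing single-GPU code to multiple GPUs. Participation is free.
In the JCAHPC Open Hackathon, participants bring their own code and datasets and work on GPU-related problems with advice from mentors. In addition to faculty of the Information Technology Center, GPU specialists take part as mentors, so participants can consult them at any time while porting code to GPUs or improving GPU utilization at their own pace.
This hackathon focuses in particular on porting existing CPU simulation code to GPUs with OpenACC (directives), GPU libraries, or CUDA (a GPU-specific language), and on extending existing single-GPU code to multiple GPUs with MPI.
The hackathon is held in a hybrid format, on site at Research Complex 2 on the Kashiwa Campus of The University of Tokyo and online. A tour of Miyabi for participants is also planned. On-site and online participants alike launch Slack and Zoom and join from an environment in which they can connect to Miyabi and work.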

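GPU Mini Camp (Machine Learning) Using the Type II Subsystem of the Supercomputer "Flow"

Role(s)： Lecturer

Information Technology Center, Nagoya University 2024.12

Audience： College students, Graduate students, Teachers, Researchers

Type：Seminar, workshop

The Information Technology Center, Nagoya University holds seminars and events using its supercomputer "Flow" with the aims of contributing to society and promoting large-scale parallel processing.

Because techniques for exploiting the installed GPUs are important for promoting use of the "Flow" Type II subsystem and for raising the skill level of its users, the center runs (or plans) seminars of its own for beginners. At the same time, users with intermediate or higher skills, and (potential) users who are familiar with machine learning but not with supercomputers, also need a venue for more practical exercises and information exchange. We therefore planned, in cooperation with NVIDIA Japan, the developer of the GPUs, a bring-your-own-problem GPU programming exercise and consultation session (mini camp). This time the main target is users interested in machine learning on GPUs.

Participation in this event is free. Participants do not need to be users of the center (a free temporary account is issued). Engineers from companies considering industrial use may also participate.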

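233rd Parallel Programming Seminar with Trial Accounts: 12th GPU Mini Camp

Role(s)： Lecturer

Joint Center for Advanced High Performance Computing (JCAHPC) (Center for Computational Sciences, University of Tsukuba; Information Technology Center, The University of Tokyo); Information Initiative Center, Hokkaido University; Center for Information Infrastructure and Supercomputing Research Center, Institute of Science Tokyo; Research Institute for Information Technology and Data Analysis Support Division, Data-Driven Innovation Initiative, Kyushu University; NVIDIA Japan; PC Cluster Consortium (HPC Open Source Software Promotion Working Group) 2024.10

Audience： College students, Graduate students, Teachers, Researchers

Type：Seminar, workshop

This mini camp offers hands-on practice on the Wisteria/BDEC-01 supercomputer installed at the Information Technology Center (hereafter, the Center) for those who want to port existing CPU simulation code to GPUs or extend existing single-GPU code to multiple GPUs. It is the seventh GPU mini camp aimed at porting to Miyabi (OFP-II), a GPU-equipped system scheduled to begin operation in January 2025. Participation is free.
In the GPU mini camp, participants bring their own code and datasets and work on GPU-related problems with advice from mentors. In addition to faculty of the Information Technology Center, GPU specialists take part as mentors, so participants can consult them at any time while porting code to GPUs or improving GPU utilization at their own pace.
This mini camp focuses in particular on porting existing CPU simulation code to GPUs with OpenACC (directives), GPU libraries, or CUDA (a GPU-specific language), and on extending existing single-GPU code to multiple GPUs with MPI.
The event is held online. Participants launch Slack and Zoom and join from an environment in which they can connect to Wisteria/BDEC-01 and work. Zoom and Slack connection details are sent only to registered applicants.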

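227th Parallel Programming Seminar with Trial Accounts: 11th GPU Mini Camp

Role(s)： Lecturer

Joint Center for Advanced High Performance Computing (JCAHPC) (Center for Computational Sciences, University of Tsukuba; Information Technology Center, The University of Tokyo); Global Scientific Information and Computing Center, Tokyo Institute of Technology; Information Technology Center, Nagoya University; Research Institute for Information Technology and Data Analysis Support Division, Data-Driven Innovation Initiative, Kyushu University; NVIDIA Japan; PC Cluster Consortium (Practical Application Working Group) 2024.6

Audience： College students, Graduate students, Teachers, Researchers

Type：Seminar, workshop

This mini camp offers hands-on practice on the Wisteria/BDEC-01 supercomputer installed at the Information Technology Center (hereafter, the Center) for those who want to port existing CPU simulation code to GPUs or extend existing single-GPU code to multiple GPUs. It is the sixth GPU mini camp aimed at porting to Miyabi (OFP-II), a GPU-equipped system scheduled to begin operation in January 2025, and is held in a hybrid format. Participation is free.
In the GPU mini camp, participants bring their own code and datasets and work on GPU-related problems with advice from mentors. In addition to faculty of the Information Technology Center, GPU specialists take part as mentors, so participants can consult them at any time while porting code to GPUs or improving GPU utilization at their own pace.
This mini camp focuses in particular on porting existing CPU simulation code to GPUs with OpenACC (directives), GPU libraries, or CUDA (a GPU-specific language), and on extending existing single-GPU code to multiple GPUs with MPI.
The event is held in a hybrid format, on site at the Information Technology Center on the Asano Campus of The University of Tokyo and online. On-site and online participants alike launch Slack and Zoom and join from an environment in which they can connect to Wisteria/BDEC-01 and work.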

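TSUBAME Deep Learning Mini Camp

Role(s)： Lecturer

Global Scientific Information and Computing Center, Tokyo Institute of Technology 2024.6

Audience： College students, Graduate students, Teachers, Researchers

Type：Seminar, workshop

This mini camp offers users who have applications, such as deep learning, that they want to run fast on the TSUBAME4.0 supercomputer at Tokyo Institute of Technology a hands-on opportunity on TSUBAME4. Participation is free, and participants are issued a TSUBAME4 account for the duration of the event.

In this mini camp, participants bring their own code and datasets and work on GPU-related problems with advice from mentors. In addition to faculty of the Global Scientific Information and Computing Center, specialists in GPU-based deep learning take part as mentors, so participants can consult them at any time while porting their programs to GPUs or improving GPU utilization and performance at their own pace.

This mini camp focuses in particular on optimizing deep learning applications to make use of GPUs. Applications using frameworks such as PyTorch or TensorFlow are expected, but other code, including self-written code, is also welcome.

The event is held in a hybrid format, on site at Tokyo Institute of Technology (the first day at Ookayama Campus and the final day at Suzukakedai Campus) and online. On-site and online participants alike launch Slack and Zoom and join from an environment in which they can connect to TSUBAME4.0 and work. Power outlets are provided at the venue, but no terminals are available, so please bring your own laptop or similar device.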

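223rd Parallel Programming Seminar with Trial Accounts: 10th GPU Mini Camp

Role(s)： Lecturer

Information Technology Center, The University of Tokyo; NVIDIA Japan; PC Cluster Consortium (Practical Application Working Group) 2024.2

Audience： College students, Graduate students, Teachers, Researchers

Type：Seminar, workshop

This mini camp offers hands-on practice on the Wisteria/BDEC-01 supercomputer installed at the Information Technology Center (hereafter, the Center) for those who want to port existing CPU simulation code to GPUs or extend existing single-GPU code to multiple GPUs. It is the fifth GPU mini camp aimed at porting to OFP-II, a GPU-equipped system scheduled to begin operation in January 2025, and is held in a hybrid format. Participation is free.
In the GPU mini camp, participants bring their own code and datasets and work on GPU-related problems with advice from mentors. In addition to faculty of the Information Technology Center, GPU specialists take part as mentors, so participants can consult them at any time while porting code to GPUs or improving GPU utilization at their own pace.
This mini camp focuses in particular on porting existing CPU simulation code to GPUs with OpenACC (directives), GPU libraries, or CUDA (a GPU-specific language), and on extending existing single-GPU code to multiple GPUs with MPI.
The event is held in a hybrid format, on site at the Information Technology Center on the Asano Campus of The University of Tokyo and online. On-site and online participants alike launch Slack and Zoom and join from an environment in which they can connect to Wisteria/BDEC-01 and work.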

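215th Parallel Programming Seminar with Trial Accounts: 9th GPU Mini Camp

Role(s)： Lecturer

Information Technology Center, The University of Tokyo; NVIDIA Japan; PC Cluster Consortium (Practical Application Working Group) 2023.10

Audience： College students, Graduate students, Teachers, Researchers

Type：Seminar, workshop

This mini camp offers hands-on practice on the Wisteria/BDEC-01 supercomputer installed at the Information Technology Center (hereafter, the Center) for those who want to port existing CPU simulation code to GPUs or extend existing single-GPU code to multiple GPUs. It is the fourth GPU mini camp aimed at porting to OFP-II, a GPU-equipped system scheduled to begin operation in January 2025. Participation is free.
In the GPU mini camp, participants bring their own code and datasets and work on GPU-related problems with advice from mentors. In addition to faculty of the Information Technology Center, GPU specialists take part as mentors, so participants can consult them at any time while porting code to GPUs or improving GPU utilization at their own pace.
This mini camp focuses in particular on porting existing CPU simulation code to GPUs with OpenACC (directives), GPU libraries, or CUDA (a GPU-specific language), and on extending existing single-GPU code to multiple GPUs with MPI.
The event is held online. Participants launch Slack and Zoom and join from an environment in which they can connect to Wisteria/BDEC-01 and work.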

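210th Parallel Programming Seminar with Trial Accounts: 8th GPU Mini Camp

Role(s)： Lecturer

Information Technology Center, The University of Tokyo; NVIDIA Japan; PC Cluster Consortium (Practical Application Working Group) 2023.7

Audience： College students, Graduate students, Teachers, Researchers

Type：Seminar, workshop

This mini camp offers hands-on practice on the Wisteria/BDEC-01 supercomputer installed at the Information Technology Center (hereafter, the Center) for those who want to port existing CPU simulation code to GPUs or extend existing single-GPU code to multiple GPUs. It is the third GPU mini camp aimed at porting to OFP-II, a GPU-equipped system scheduled to begin operation in January 2025, and the first held in a hybrid format. Participation is free.
In the GPU mini camp, participants bring their own code and datasets and work on GPU-related problems with advice from mentors. In addition to faculty of the Information Technology Center, GPU specialists take part as mentors, so participants can consult them at any time while porting code to GPUs or improving GPU utilization at their own pace.
This mini camp focuses in particular on porting existing CPU simulation code to GPUs with OpenACC (directives), GPU libraries, or CUDA (a GPU-specific language), and on extending existing single-GPU code to multiple GPUs with MPI.
The event is held in a hybrid format, on site at the Information Technology Center on the Kashiwa II Campus of The University of Tokyo and online. On-site and online participants alike launch Slack and Zoom and join from an environment in which they can connect to Wisteria/BDEC-01 and work.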

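200th Parallel Programming Seminar with Trial Accounts: 7th GPU Mini Camp

Role(s)： Lecturer

Information Technology Center, The University of Tokyo; NVIDIA Japan; PC Cluster Consortium (Practical Application Working Group) 2023.3

Audience： College students, Graduate students, Teachers, Researchers

Type：Seminar, workshop

This mini camp offers hands-on practice on the Wisteria/BDEC-01 supercomputer installed at the Information Technology Center (hereafter, the Center) for those who want to port existing CPU simulation code to GPUs or extend existing single-GPU code to multiple GPUs. It is the second GPU mini camp aimed at porting to OFP-II, a GPU-equipped system scheduled to begin operation in April 2024. Participation is free.
In the GPU mini camp, participants bring their own code and datasets and work on GPU-related problems with advice from mentors. In addition to faculty of the Information Technology Center, GPU specialists take part as mentors, so participants can consult them at any time while porting code to GPUs or improving GPU utilization at their own pace.
This mini camp focuses in particular on porting existing CPU simulation code to GPUs with OpenACC (directives), GPU libraries, or CUDA (a GPU-specific language), and on extending existing single-GPU code to multiple GPUs with MPI.
Because of the spread of COVID-19, the event is held online. Participants launch Slack and Zoom and join from an environment in which they can connect to Wisteria/BDEC-01 and work.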

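197th Parallel Programming Seminar with Trial Accounts: 6th GPU Mini Camp

Role(s)： Lecturer

Information Technology Center, The University of Tokyo; NVIDIA Japan; PC Cluster Consortium (Practical Application Working Group) 2022.12

Audience： College students, Graduate students, Teachers, Researchers

Type：Seminar, workshop

This mini camp offers hands-on practice on the Wisteria/BDEC-01 supercomputer installed at the Information Technology Center (hereafter, the Center) for those who want to port existing CPU simulation code to GPUs or extend existing single-GPU code to multiple GPUs. It is the first GPU mini camp aimed at porting to OFP-II, a GPU-equipped system scheduled to begin operation in April 2024. Participation is free.
In the GPU mini camp, participants bring their own code and datasets and work on GPU-related problems with advice from mentors. In addition to faculty of the Information Technology Center, GPU specialists take part as mentors, so participants can consult them at any time while porting code to GPUs or improving GPU utilization at their own pace.
This mini camp focuses in particular on porting existing CPU simulation code to GPUs with OpenACC (directives), GPU libraries, or CUDA (a GPU-specific language), and on extending existing single-GPU code to multiple GPUs with MPI.
Because of the spread of COVID-19, the event is held online. Participants launch Slack and Zoom and join from an environment in which they can connect to Wisteria/BDEC-01 and work.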

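FPGA for HPC Workshop: Results Report Meeting for the KAKENHI Project "Hybrid High-Performance Computing Platform Based on Reconfigurable Systems and GPUs"

Role(s)： Organizing member

Grant-in-Aid for Scientific Research (B) "Hybrid High-Performance Computing Platform Based on Reconfigurable Systems and GPUs" 2021.2

Audience： College students, Graduate students, Teachers, Researchers

Type：Seminar, workshop

The application of FPGAs to high-performance computing (HPC) has attracted great attention over the past several years, and research is advancing actively at many universities and research institutes. As the title indicates, this workshop is a results report meeting for KAKENHI-funded research on a high-performance computing platform that combines FPGAs and GPUs. It is also packed with topics on cutting-edge research and development by prominent FPGA researchers in Japan, with rich content covering HPC use of FPGAs, system construction, FPGA utilization techniques, and more. We hope to take this opportunity to share knowledge and discuss the advanced use of FPGAs.

The workshop is held fully online via Zoom and is open to everyone. All talks are given in Japanese. Participation is free, but registration is required in order to share the online connection details. See the website mentioned above for details. We look forward to many participants.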

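GPU Online Camp

Role(s)： Lecturer

Center for Computational Sciences, University of Tsukuba; NVIDIA Japan 2020.9

Audience： College students, Graduate students, Teachers, Researchers

Type：Seminar, workshop

The Center for Computational Sciences, University of Tsukuba holds the "GPU Online Camp", a workshop event on GPU computing. Anyone, from GPU beginners to those with some experience, is free to participate. Participation is free.

In particular, we welcome the active participation of users of Cygnus, the GPU- and FPGA-equipped supercomputer operated by the Center for Computational Sciences, University of Tsukuba, and of those who intend to start using the system.

The GPU Online Camp is an event in which scientists, researchers, code developers, and other participants spend three intensive days solving GPU computing problems together with mentors and other participants. Ideally we would gather in a room at the Center for Computational Sciences for close collaboration and discussion, but in view of the current situation the event is held online using Slack, Zoom, and similar tools. Those living far away are also encouraged to take advantage of this opportunity.

In this event, each participant has exclusive use of one NVIDIA V100 32GB GPU. GPU use is not limited to scientific computing or deep learning; participants from a wide range of fields are welcome. Each participant sets a GPU computing problem to work on during the event, such as porting CPU code to GPUs, speeding up GPU code, or optimizing for the V100 GPU.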

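High-Performance Computer Systems and Us

Role(s)： Lecturer

Ibaraki Prefectural Hitachi First High School 2018.10

Audience： High school students

Type：Visiting lecture

By delivering a university lecture to high school students, the aim is to broaden their general education and to help them choose their future paths.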
