TornadoVM, a Java framework for parallel programming on hardware accelerators, can sometimes outperform hand-written OpenCL code on GPUs, even though OpenCL sits closer to the hardware. This is possible because TornadoVM automatically applies a set of compiler and runtime optimizations.
In this video we explore and analyse the optimizations that TornadoVM applies, using Matrix Multiplication as the running example. We then build an OpenCL C++ application from scratch to replicate, step by step, every compiler and runtime optimization that TornadoVM performs automatically.
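For reference, below is a minimal sketch of the kind of TornadoVM matrix-multiplication kernel discussed in the video. It assumes the TaskGraph / TornadoExecutionPlan API of recent TornadoVM releases with plain float[] buffers (newer releases favour off-heap types such as FloatArray, and older ones used TaskSchedule); the class name, task names, and matrix size are illustrative and not taken from the video.

import uk.ac.manchester.tornado.api.ImmutableTaskGraph;
import uk.ac.manchester.tornado.api.TaskGraph;
import uk.ac.manchester.tornado.api.TornadoExecutionPlan;
import uk.ac.manchester.tornado.api.annotations.Parallel;
import uk.ac.manchester.tornado.api.enums.DataTransferMode;

public class MxM {

    // Kernel: @Parallel marks the loops that TornadoVM maps to the GPU's parallel dimensions.
    public static void multiply(float[] a, float[] b, float[] c, int size) {
        for (@Parallel int i = 0; i < size; i++) {
            for (@Parallel int j = 0; j < size; j++) {
                float sum = 0.0f;
                for (int k = 0; k < size; k++) {
                    sum += a[i * size + k] * b[k * size + j];
                }
                c[i * size + j] = sum;
            }
        }
    }

    public static void main(String[] args) {
        final int size = 1024;   // illustrative size, not the one used in the video
        float[] a = new float[size * size];
        float[] b = new float[size * size];
        float[] c = new float[size * size];
        java.util.Arrays.fill(a, 1.0f);
        java.util.Arrays.fill(b, 2.0f);

        // Build the task graph: copy inputs to the device, run the kernel, copy the result back.
        TaskGraph taskGraph = new TaskGraph("s0")
                .transferToDevice(DataTransferMode.FIRST_EXECUTION, a, b)
                .task("t0", MxM::multiply, a, b, c, size)
                .transferToHost(DataTransferMode.EVERY_EXECUTION, c);

        ImmutableTaskGraph immutableTaskGraph = taskGraph.snapshot();
        TornadoExecutionPlan plan = new TornadoExecutionPlan(immutableTaskGraph);
        plan.execute();   // JIT-compiles the kernel and runs it on the accelerator
    }
}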
🔗 Blog article about this video: [ Link ]
Chapters:
00:00 Introduction
01:57 Java Baseline
03:57 Machine Specifications
06:28 Java Performance Analysis
08:35 Parallel Java Implementations
14:25 TornadoVM Implementation
17:00 TornadoVM Performance on 4090
17:50 OpenCL C++
25:00 Enabling Loop Interchange
29:30 Enabling FMA & Graal IGV
34:30 Enabling Loop Unroll
37:47 OpenCL Compiler Flags
39:30 Runtime Scheduling
45:30 Discussion
48:20 FLOPS Discussion
54:40 JMH Report
55:00 Outro
links {
🔗 TornadoVM on GitHub: [ Link ]
🔗 TornadoVM Documentation: [ Link ]
🔗 TornadoVM-Examples: [ Link ]
🔗 OpenCL C++ examples: [ Link ]
}
followMe {
🔗 [ Link ]
🔗 [ Link ]
}
website {
🔗 [ Link ]
}
support {
🔗 [ Link ]
}