Heterogeneous runtime systems, which distribute different types of computation across multiple devices, are increasingly common in scientific research and engineering because they provide efficient solutions to complex problems. For some data-parallel applications, however, communication among heterogeneous devices within a compute node can easily become a performance bottleneck that hinders collaborative execution across devices. This is particularly true when the devices differ in type or capability and may require different communication protocols or approaches to achieve optimal performance. In this paper, we make three contributions. First, for the data copy APIs currently used in the runtime system, we propose optimizing small data copies between devices using the multi-vector access instructions supported by the x86 architecture. Second, our analysis of the transfer rates in the two phases of the pipelined copy through the page-locked memory buffer shows that the host-to-buffer transfer accounts for a significant share of the total time; we therefore use multi-vector access instructions to reduce the transfer time between the host and the page-locked memory buffer. Third, we analyze the transfer latency of the pipelined copy for various buffer sizes and buffer counts and determine the optimal buffer configuration for different transfer sizes, which significantly reduces transfer latency. In experiments on current heterogeneous computing nodes, the proposed optimizations reduce transfer latency to 40% of the original value for small data segments (4 KB to 4 MB) and to 90% of the original value for medium-sized data segments (4 MB to 100 MB), yielding higher transmission bandwidth between devices.
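To make the two-phase copy through staging buffers concrete, the following is a minimal sketch in C, not the implementation evaluated in this paper. It makes several assumptions: the names `staged_copy` and `staged_chunk_count` are hypothetical, ordinary memory stands in for the page-locked buffer, and plain `memcpy` stands in for both the vectorized host-to-buffer copy and the buffer-to-device transfer. The two phases run sequentially here; a real runtime would overlap phase 2 of one chunk with phase 1 of the next. The sketch only shows how the buffer size and the number of buffers determine the chunking.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: number of chunks needed to move `total` bytes
 * through staging buffers of `buf_size` bytes each (rounded up). */
size_t staged_chunk_count(size_t total, size_t buf_size)
{
    return buf_size == 0 ? 0 : (total + buf_size - 1) / buf_size;
}

/* Hypothetical sketch: copy `total` bytes from `src` to `dst` through a
 * ring of `nbuf` staging buffers of `buf_size` bytes each.
 * `staging` must provide at least nbuf * buf_size bytes.
 * Returns the number of chunks moved (0 for an invalid configuration). */
size_t staged_copy(void *dst, const void *src, size_t total,
                   unsigned char *staging, size_t buf_size, size_t nbuf)
{
    if (buf_size == 0 || nbuf == 0)
        return 0;
    assert(dst != NULL && src != NULL && staging != NULL);

    const unsigned char *s = (const unsigned char *)src;
    unsigned char *d = (unsigned char *)dst;
    size_t chunks = 0;

    for (size_t off = 0; off < total; off += buf_size, ++chunks) {
        /* The last chunk may be shorter than a full buffer. */
        size_t len = (total - off < buf_size) ? (total - off) : buf_size;
        /* Pick the next slot of the ring in round-robin order. */
        unsigned char *slot = staging + (chunks % nbuf) * buf_size;

        memcpy(slot, s + off, len);  /* phase 1: host -> staging buffer */
        memcpy(d + off, slot, len);  /* phase 2: staging buffer -> destination */
    }
    return chunks;
}
```

With this structure, a smaller `buf_size` produces more chunks and more per-chunk overhead, while a larger one delays the start of phase 2. That trade-off is the kind of effect a buffer-size and buffer-count analysis would examine.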