HIP: Heterogeneous-computing Interface for Portability
hipHostMalloc allocates pinned host memory which is mapped into the address space of all GPUs in the system. There are two use cases for this host memory:
hipHostMalloc always sets the hipHostMallocPortable and hipHostMallocMapped flags. Both usage models described above use the same allocation flags, and the difference is in how the surrounding code uses the host memory. See the hipHostMalloc API for more information.
ROCm defines two coherency options for host memory:
HIP provides the developer with controls to select which type of memory is used via allocation flags passed to hipHostMalloc and the HIP_HOST_COHERENT environment variable:
Coherent host memory is automatically visible at synchronization points.
Visibility of non-coherent host memory depends on the synchronization API used:
HIP API | Synchronization Effect | Fence | Coherent Host Memory Visibility | Non-Coherent Host Memory Visibility |
---|---|---|---|---|
hipStreamSynchronize | host waits for all commands in the specified stream to complete | system-scope release | yes | yes |
hipDeviceSynchronize | host waits for all commands in all streams on the specified device to complete | system-scope release | yes | yes |
hipEventSynchronize | host waits for the specified event to complete | device-scope release | yes | depends - see below |
hipStreamWaitEvent | stream waits for the specified event to complete | none | yes | no |
Developers can control the release scope for hipEvents:
A stronger system-level fence can be specified when the event is created with hipEventCreateWithFlags:
This document lists ways to experiment with the HIP stack to improve performance. Results may vary from platform to platform.
On small-BAR systems, there are two ways to transfer data host-to-device (H2D) and device-to-host (D2H): staging buffers and PinInPlace.
On large-BAR systems, there are three ways to transfer data host-to-device (H2D): staging buffers, PinInPlace, and direct memcpy.
There are two ways to transfer data device-to-host (D2H): staging buffers and PinInPlace.
Some GPUs may not be able to directly access host memory. In these cases, H2D and D2H copies are staged through an optimized pinned staging buffer. The copy is broken into buffer-sized chunks to limit the size of the buffer and to improve performance by overlapping the CPU copies with the DMA copies.
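The chunking described above can be sketched in plain host-side C++. This is only an illustration under stated assumptions: `staged_copy` is a hypothetical helper, the destination buffer stands in for device memory, and the second `memcpy` stands in for the DMA transfer a real runtime would issue. The two steps run sequentially here, so the sketch does not model the CPU/DMA overlap.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical host-only model of a staged copy. `dst` stands in for device
// memory; `staging_bytes` is the (assumed) size of the staging buffer.
// Returns the number of chunks the copy was split into.
std::size_t staged_copy(void* dst, const void* src, std::size_t bytes,
                        std::size_t staging_bytes) {
    assert(staging_bytes > 0);
    // Stand-in for the pinned staging buffer owned by the runtime.
    std::vector<unsigned char> staging(staging_bytes);
    const auto* in = static_cast<const unsigned char*>(src);
    auto* out = static_cast<unsigned char*>(dst);

    std::size_t chunks = 0;
    for (std::size_t offset = 0; offset < bytes; offset += staging_bytes) {
        // The last chunk may be smaller than the staging buffer.
        const std::size_t n = std::min(staging_bytes, bytes - offset);
        std::memcpy(staging.data(), in + offset, n);   // CPU copy: source -> staging
        std::memcpy(out + offset, staging.data(), n);  // stand-in for DMA: staging -> "device"
        ++chunks;
    }
    return chunks;
}
```

Copying 10 bytes through a 4-byte staging buffer, for example, takes three chunks of 4, 4, and 2 bytes.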
PinInPlace is another algorithm which pins the host memory "in-place", and copies it with the DMA engine.
Unpinned memory transfer mode can be controlled using the environment variable HCC_UNPINNED_COPY_MODE.
By default, HCC_UNPINNED_COPY_MODE is set to 0, which uses default threshold values to decide which transfer method to use based on data size.
Setting HCC_UNPINNED_COPY_MODE=1 forces all unpinned transfers to use the PinInPlace logic.
Setting HCC_UNPINNED_COPY_MODE=2 forces all unpinned transfers to use staging buffers.
Setting HCC_UNPINNED_COPY_MODE=3 forces all unpinned transfers to use direct memcpy on large-BAR systems.
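The mapping from values to modes can be summarized in a small host-side sketch. The enum and both functions are hypothetical names introduced for illustration. How the runtime actually parses the variable is not specified here, and falling back to the default mode for unset or unrecognized values is an assumption.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical names for the four documented values of HCC_UNPINNED_COPY_MODE.
enum class UnpinnedCopyMode {
    Default = 0,      // choose a method from size thresholds
    PinInPlace = 1,   // pin the host memory in place, then DMA
    Staging = 2,      // copy through staging buffers
    DirectMemcpy = 3  // direct memcpy (large-BAR systems)
};

// Maps the textual value to a mode. Unset or unrecognized values fall back
// to Default; that fallback is an assumption of this sketch.
UnpinnedCopyMode parse_unpinned_copy_mode(const char* value) {
    if (value == nullptr) return UnpinnedCopyMode::Default;
    const std::string s(value);
    if (s == "1") return UnpinnedCopyMode::PinInPlace;
    if (s == "2") return UnpinnedCopyMode::Staging;
    if (s == "3") return UnpinnedCopyMode::DirectMemcpy;
    return UnpinnedCopyMode::Default;
}

// Reads the variable from the process environment.
UnpinnedCopyMode current_unpinned_copy_mode() {
    return parse_unpinned_copy_mode(std::getenv("HCC_UNPINNED_COPY_MODE"));
}
```

Forcing one mode at a time in this way is a simple method for comparing the transfer paths on a given platform.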
The following environment variables can be used to control the transfer thresholds:
hip-hcc and hip-clang support device-side malloc and free, so users can allocate memory dynamically in a kernel. The allocated memory is in the global address space; however, different threads get different memory allocations for the same call to malloc. The allocated memory can be accessed or freed by other threads or other kernels. It persists for the lifetime of the HIP program until it is freed.
The memory is allocated in pages. Users can define the macro __HIP_SIZE_OF_PAGE to control the page size in bytes and the macro __HIP_NUM_PAGES to control the total number of pages that can be allocated.
In HCC and HIP-Clang, the long double type is the 80-bit extended precision format on x86_64, which is not supported by AMDGPU. HCC and HIP-Clang treat long double as IEEE double on AMDGPU. Using long double in HIP source code will not cause issues as long as long double data is not transferred between host and device. However, long double should not be used as a kernel argument type.
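One way to respect this restriction is to narrow long double values to double on the host before the data is handed to any host-to-device copy. The helper below is a hypothetical sketch of that idea, not part of HIP; it shows only the host-side conversion, and the narrowing may lose precision.

```cpp
#include <vector>

// Hypothetical helper: convert host-side long double values to double so that
// the buffer transferred to the device has the same layout on both sides.
// Values are rounded to double precision in the process.
std::vector<double> to_device_format(const std::vector<long double>& host_values) {
    std::vector<double> out;
    out.reserve(host_values.size());
    for (long double v : host_values) {
        out.push_back(static_cast<double>(v));
    }
    return out;
}
```

The resulting `std::vector<double>` can then be copied to the device, and kernels can take `double` arguments instead of long double.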
By default, HIP-Clang assumes -ffp-contract=fast and HCC assumes -ffp-contract=off. For x86_64, FMA is off by default because the generic x86_64 target does not support FMA. To turn on FMA on x86_64, use either -mfma or -march=native on CPUs that support FMA.
When contractions are enabled and the CPU has not enabled FMA instructions, the GPU can produce different numerical results than the CPU for expressions that can be contracted. Use a tolerance for floating-point comparisons.
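The effect can be reproduced on the host alone. The sketch below contrasts a fused multiply-add (one rounding, as in a contracted expression) with a separate multiply and add (two roundings), and compares the results with a tolerance. The function names and the default tolerances are illustrative assumptions; suitable tolerances depend on the computation.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Combined relative/absolute tolerance check. The default tolerances are
// illustrative assumptions, not recommended values.
bool nearly_equal(double a, double b, double rel_tol = 1e-9, double abs_tol = 1e-12) {
    assert(rel_tol >= 0.0 && abs_tol >= 0.0);
    const double diff = std::fabs(a - b);
    const double scale = std::max(std::fabs(a), std::fabs(b));
    return diff <= std::max(abs_tol, rel_tol * scale);
}

// Contracted form: a * b + c evaluated with a single rounding.
double mul_add_fused(double a, double b, double c) {
    return std::fma(a, b, c);
}

// Uncontracted form: the product is rounded to double before the addition.
// `volatile` keeps the compiler from fusing the two operations.
double mul_add_separate(double a, double b, double c) {
    volatile double product = a * b;
    return product + c;
}
```

For a = 1 + 2^-27, b = 1 - 2^-27, and c = -1, the fused form returns -2^-54, while the separate form returns 0 because the product rounds to 1. An exact `==` comparison between the two fails, whereas `nearly_equal` accepts them.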