author Tony Tye <Tony.Tye@amd.com> 2017-10-18 22:16:55 +0000
committer Tony Tye <Tony.Tye@amd.com> 2017-10-18 22:16:55 +0000
commit 0a09220c32b6468f4d33ff1ab9eb8916312194af
tree d89f0d0cbc2fe5b12792a3e07b3a0b304e7c1c15 /docs
parent b5cb868aaa03172b5f83ce48b574bdcad6377b7c
[AMDGPU] Corrections to memory model description.

- Add description on nontemporal support.
- Correct OpenCL sequentially consistent and fence code sequences.
- Minor test cleanup.

Differential Revision: https://reviews.llvm.org/D39073

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@316131 91177308-0d34-0410-b5e6-96231b3b80d8

Diffstat (limited to 'docs')
-rw-r--r-- docs/AMDGPUUsage.rst 550
1 files changed, 376 insertions, 174 deletions
diff --git a/docs/AMDGPUUsage.rst b/docs/AMDGPUUsage.rst
index 268abc2ff86..a8b06f27323 100644
--- a/docs/AMDGPUUsage.rst
+++ b/docs/AMDGPUUsage.rst
@@ -1240,7 +1240,7 @@ non-AMD key names should be prefixed by "*vendor-name*.".
=================================== ============== ========= ==============
.. TODO
- Plan to remove the debug properties metadata.
+ Plan to remove the debug properties metadata.
Kernel Dispatch
~~~~~~~~~~~~~~~
@@ -1431,9 +1431,9 @@ CP microcode requires the Kernel descritor to be allocated on 64 byte alignment.
.. table:: Kernel Descriptor for GFX6-GFX9
:name: amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table
- ======= ======= =============================== ===========================
+ ======= ======= =============================== ============================
Bits Size Field Name Description
- ======= ======= =============================== ===========================
+ ======= ======= =============================== ============================
31:0 4 bytes GroupSegmentFixedSize The amount of fixed local
address space memory
required for a work-group
@@ -1461,7 +1461,7 @@ CP microcode requires the Kernel descritor to be allocated on 64 byte alignment.
97 1 bit IsXNACKEnabled Indicates if the generated
machine code is capable of
suppoting XNACK.
- 127:98 30 bits Reserved. Must be 0.
+ 127:98 30 bits Reserved, must be 0.
191:128 8 bytes KernelCodeEntryByteOffset Byte offset (possibly
negative) from base
address of kernel
@@ -1469,7 +1469,7 @@ CP microcode requires the Kernel descritor to be allocated on 64 byte alignment.
entry point instruction
which must be 256 byte
aligned.
- 383:192 24 Reserved. Must be 0.
+ 383:192 24 Reserved, must be 0.
bytes
415:384 4 bytes ComputePgmRsrc1 Compute Shader (CS)
program settings used by
@@ -1477,7 +1477,7 @@ CP microcode requires the Kernel descritor to be allocated on 64 byte alignment.
``COMPUTE_PGM_RSRC1``
configuration
register. See
- :ref:`amdgpu-amdhsa-compute_pgm_rsrc1_t-gfx6-gfx9-table`.
+ :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
447:416 4 bytes ComputePgmRsrc2 Compute Shader (CS)
program settings used by
CP to set up
@@ -1509,16 +1509,16 @@ CP microcode requires the Kernel descritor to be allocated on 64 byte alignment.
should always be 0.
457 1 bit EnableSGPRGridWorkgroupCountZ Not implemented in CP and
should always be 0.
- 463:458 6 bits Reserved. Must be 0.
- 511:464 6 Reserved. Must be 0.
+ 463:458 6 bits Reserved, must be 0.
+ 511:464 6 Reserved, must be 0.
bytes
512 **Total size 64 bytes.**
- ======= ===================================================================
+ ======= ====================================================================
..
.. table:: compute_pgm_rsrc1 for GFX6-GFX9
- :name: amdgpu-amdhsa-compute_pgm_rsrc1_t-gfx6-gfx9-table
+ :name: amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table
======= ======= =============================== ===========================================================================
Bits Size Field Name Description
@@ -1529,8 +1529,9 @@ CP microcode requires the Kernel descritor to be allocated on 64 byte alignment.
specific:
GFX6-9
- roundup((max-vgpg + 1)
- / 4) - 1
+ - max_vgpr 1..256
+ - roundup((max_vgpg + 1)
+ / 4) - 1
Used by CP to set up
``COMPUTE_PGM_RSRC1.VGPRS``.
@@ -1540,11 +1541,13 @@ CP microcode requires the Kernel descritor to be allocated on 64 byte alignment.
specific:
GFX6-8
- roundup((max-sgpg + 1)
- / 8) - 1
+ - max_sgpr 1..112
+ - roundup((max_sgpg + 1)
+ / 8) - 1
GFX9
- roundup((max-sgpg + 1)
- / 16) - 1
+ - max_sgpr 1..112
+ - roundup((max_sgpg + 1)
+ / 16) - 1
Includes the special SGPRs
for VCC, Flat Scratch (for
@@ -1628,7 +1631,7 @@ CP microcode requires the Kernel descritor to be allocated on 64 byte alignment.
21 1 bit ENABLE_DX10_CLAMP Wavefront starts execution
with DX10 clamp mode
enabled. Used by the vector
- ALU to force DX-10 style
+ ALU to force DX10 style
treatment of NaN's (when
set, clamp NaN to zero,
otherwise pass NaN
@@ -1676,29 +1679,25 @@ CP microcode requires the Kernel descritor to be allocated on 64 byte alignment.
CP is responsible for
filling in
``COMPUTE_PGM_RSRC1.CDBG_USER``.
- 26 1 bit FP16_OVFL GFX6-8:
- Reserved. Must be 0.
- GFX9:
- Wavefront starts
- execution with specified
- fp16 overflow mode.
-
- - If 0, then fp16
- overflow generates
+ 26 1 bit FP16_OVFL GFX6-8
+ Reserved, must be 0.
+ GFX9
+ Wavefront starts execution
+ with specified fp16 overflow
+ mode.
+
+ - If 0, fp16 overflow generates
+/-INF values.
- - If 1, then fp16
- overflow that is the
- result of an +/-INF
- input value or divide
- by 0 generates a
- +/-INF, otherwise
- clamps computed
- overflow to +/-MAX_FP16
- as appropriate.
+ - If 1, fp16 overflow that is the
+ result of an +/-INF input value
+ or divide by 0 produces a +/-INF,
+ otherwise clamps computed
+ overflow to +/-MAX_FP16 as
+ appropriate.
Used by CP to set up
``COMPUTE_PGM_RSRC1.FP16_OVFL``.
- 31:27 5 bits Reserved. Must be 0.
+ 31:27 5 bits Reserved, must be 0.
32 **Total size 4 bytes**
======= ===================================================================================================================
@@ -1855,7 +1854,7 @@ CP microcode requires the Kernel descritor to be allocated on 64 byte alignment.
30 1 bit ENABLE_EXCEPTION_INT_DIVIDE_BY Integer Division by Zero
_ZERO (rcp_iflag_f32 instruction
only)
- 31 1 bit Reserved. Must be 0.
+ 31 1 bit Reserved, must be 0.
32 **Total size 4 bytes.**
======= ===================================================================================================================
@@ -2245,9 +2244,6 @@ This section describes the mapping of LLVM memory model onto AMDGPU machine code
.. TODO
Update when implementation complete.
- Support more relaxed OpenCL memory model to be controlled by environment
- component of target triple.
-
The AMDGPU backend supports the memory synchronization scopes specified in
:ref:`amdgpu-memory-scopes`.
@@ -2264,19 +2260,23 @@ additional ``s_waitcnt`` instructions are required to ensure registers are
defined before being used. These may be able to be combined with the memory
model ``s_waitcnt`` instructions as described above.
-The AMDGPU memory model supports both the HSA [HSA]_ memory model, and the
-OpenCL [OpenCL]_ memory model. The HSA memory model uses a single happens-before
-relation for all address spaces (see :ref:`amdgpu-address-spaces`). The OpenCL
-memory model which has separate happens-before relations for the global and
-local address spaces, and only a fence specifying both global and local address
-space joins the relationships. Since the LLVM ``memfence`` instruction does not
-allow an address space to be specified the OpenCL fence has to convervatively
-assume both local and global address space was specified. However, optimizations
-can often be done to eliminate the additional ``s_waitcnt``instructions when
-there are no intervening corresponding ``ds/flat_load/store/atomic`` memory
-instructions. The code sequences in the table indicate what can be omitted for
-the OpenCL memory. The target triple environment is used to determine if the
-source language is OpenCL (see :ref:`amdgpu-opencl`).
+The AMDGPU backend supports the following memory models:
+
+ HSA Memory Model [HSA]_
+ The HSA memory model uses a single happens-before relation for all address
+ spaces (see :ref:`amdgpu-address-spaces`).
+ OpenCL Memory Model [OpenCL]_
+ The OpenCL memory model which has separate happens-before relations for the
+ global and local address spaces. Only a fence specifying both global and
+ local address space, and seq_cst instructions join the relationships. Since
+ the LLVM ``memfence`` instruction does not allow an address space to be
+ specified the OpenCL fence has to convervatively assume both local and
+ global address space was specified. However, optimizations can often be
+ done to eliminate the additional ``s_waitcnt`` instructions when there are
+ no intervening memory instructions which access the corresponding address
+ space. The code sequences in the table indicate what can be omitted for the
+ OpenCL memory. The target triple environment is used to determine if the
+ source language is OpenCL (see :ref:`amdgpu-opencl`).
``ds/flat_load/store/atomic`` instructions to local memory are termed LDS
operations.
@@ -2308,11 +2308,11 @@ For GFX6-GFX9:
that for GFX7-9 ``flat_load/store/atomic`` instructions can report out of
vector memory order if they access LDS memory, and out of LDS operation order
if they access global memory.
-* The vector memory operations access a vector L1 cache shared by all wavefronts
- on a CU. Therefore, no special action is required for coherence between
- wavefronts in the same work-group. A ``buffer_wbinvl1_vol`` is required for
- coherence between waves executing in different work-groups as they may be
- executing on different CUs.
+* The vector memory operations access a single vector L1 cache shared by all
+ SIMDs a CU. Therefore, no special action is required for coherence between the
+ lanes of a single wavefront, or for coherence between wavefronts in the same
+ work-group. A ``buffer_wbinvl1_vol`` is required for coherence between waves
+ executing in different work-groups as they may be executing on different CUs.
* The scalar memory operations access a scalar L1 cache shared by all wavefronts
on a group of CUs. The scalar and vector L1 caches are not coherent. However,
scalar operations are used in a restricted way so do not impact the memory
@@ -2376,45 +2376,62 @@ future wave that uses the same scratch area, or a function call that creates a
frame at the same address, respectively. There is no need for a ``s_dcache_inv``
as all scalar writes are write-before-read in the same thread.
-Scratch backing memory (which is used for the private address space) is accessed
-with MTYPE NC_NV (non-coherenent non-volatile). Since the private address space
-is only accessed by a single thread, and is always write-before-read,
-there is never a need to invalidate these entries from the L1 cache. Hence all
-cache invalidates are done as ``*_vol`` to only invalidate the volatile cache
-lines.
+Scratch backing memory (which is used for the private address space)
+is accessed with MTYPE NC_NV (non-coherenent non-volatile). Since the private
+address space is only accessed by a single thread, and is always
+write-before-read, there is never a need to invalidate these entries from the L1
+cache. Hence all cache invalidates are done as ``*_vol`` to only invalidate the
+volatile cache lines.
On dGPU the kernarg backing memory is accessed as UC (uncached) to avoid needing
-to invalidate the L2 cache. This also causes it to be treated as non-volatile
-and so is not invalidated by ``*_vol``. On APU it is accessed as CC (cache
-coherent) and so the L2 cache will coherent with the CPU and other agents.
+to invalidate the L2 cache. This also causes it to be treated as
+non-volatile and so is not invalidated by ``*_vol``. On APU it is accessed as CC
+(cache coherent) and so the L2 cache will coherent with the CPU and other
+agents.
.. table:: AMDHSA Memory Model Code Sequences GFX6-GFX9
:name: amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table
- ============ ============ ============== ========== =======================
+ ============ ============ ============== ========== ===============================
LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code
Ordering Sync Scope Address
Space
- ============ ============ ============== ========== =======================
+ ============ ============ ============== ========== ===============================
**Non-Atomic**
- ---------------------------------------------------------------------------
- load *none* *none* - global non-volatile
- - generic 1. buffer/global/flat_load
- volatile
+ -----------------------------------------------------------------------------------
+ load *none* *none* - global - !volatile & !nontemporal
+ - generic
+ - private 1. buffer/global/flat_load
+ - constant
+ - volatile & !nontemporal
+
1. buffer/global/flat_load
glc=1
+
+ - nontemporal
+
+ 1. buffer/global/flat_load
+ glc=1 slc=1
+
load *none* *none* - local 1. ds_load
- store *none* *none* - global 1. buffer/global/flat_store
+ store *none* *none* - global - !nontemporal
- generic
+ - private 1. buffer/global/flat_store
+ - constant
+ - nontemporal
+
+ 1. buffer/global/flat_stote
+ glc=1 slc=1
+
store *none* *none* - local 1. ds_store
**Unordered Atomic**
- ---------------------------------------------------------------------------
+ -----------------------------------------------------------------------------------
load atomic unordered *any* *any* *Same as non-atomic*.
store atomic unordered *any* *any* *Same as non-atomic*.
atomicrmw unordered *any* *any* *Same as monotonic
atomic*.
**Monotonic Atomic**
- ---------------------------------------------------------------------------
+ -----------------------------------------------------------------------------------
load atomic monotonic - singlethread - global 1. buffer/global/flat_load
- wavefront - generic
- workgroup
@@ -2440,16 +2457,15 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
- wavefront
- workgroup
**Acquire Atomic**
- ---------------------------------------------------------------------------
+ -----------------------------------------------------------------------------------
load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load
- wavefront - local
- generic
- load atomic acquire - workgroup - global 1. buffer/global_load
- load atomic acquire - workgroup - local 1. ds/flat_load
- - generic 2. s_waitcnt lgkmcnt(0)
+ load atomic acquire - workgroup - global 1. buffer/global/flat_load
+ load atomic acquire - workgroup - local 1. ds_load
+ 2. s_waitcnt lgkmcnt(0)
- - If OpenCL, omit
- waitcnt.
+ - If OpenCL, omit.
- Must happen before
any following
global/generic
@@ -2462,8 +2478,23 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
older than the load
atomic value being
acquired.
+ load atomic acquire - workgroup - generic 1. flat_load
+ 2. s_waitcnt lgkmcnt(0)
- load atomic acquire - agent - global 1. buffer/global_load
+ - If OpenCL, omit.
+ - Must happen before
+ any following
+ global/generic
+ load/load
+ atomic/store/store
+ atomic/atomicrmw.
+ - Ensures any
+ following global
+ data read is no
+ older than the load
+ atomic value being
+ acquired.
+ load atomic acquire - agent - global 1. buffer/global/flat_load
- system glc=1
2. s_waitcnt vmcnt(0)
@@ -2516,12 +2547,11 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
atomicrmw acquire - singlethread - global 1. buffer/global/ds/flat_atomic
- wavefront - local
- generic
- atomicrmw acquire - workgroup - global 1. buffer/global_atomic
- atomicrmw acquire - workgroup - local 1. ds/flat_atomic
- - generic 2. waitcnt lgkmcnt(0)
+ atomicrmw acquire - workgroup - global 1. buffer/global/flat_atomic
+ atomicrmw acquire - workgroup - local 1. ds_atomic
+ 2. waitcnt lgkmcnt(0)
- - If OpenCL, omit
- waitcnt.
+ - If OpenCL, omit.
- Must happen before
any following
global/generic
@@ -2535,7 +2565,24 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
atomicrmw value
being acquired.
- atomicrmw acquire - agent - global 1. buffer/global_atomic
+ atomicrmw acquire - workgroup - generic 1. flat_atomic
+ 2. waitcnt lgkmcnt(0)
+
+ - If OpenCL, omit.
+ - Must happen before
+ any following
+ global/generic
+ load/load
+ atomic/store/store
+ atomic/atomicrmw.
+ - Ensures any
+ following global
+ data read is no
+ older than the
+ atomicrmw value
+ being acquired.
+
+ atomicrmw acquire - agent - global 1. buffer/global/flat_atomic
- system 2. s_waitcnt vmcnt(0)
- Must happen before
@@ -2592,9 +2639,8 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
- If OpenCL and
address space is
- not generic, omit
- waitcnt. However,
- since LLVM
+ not generic, omit.
+ - However, since LLVM
currently has no
address space on
the fence need to
@@ -2633,14 +2679,14 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
value read by the
fence-paired-atomic.
- fence acquire - agent *none* 1. s_waitcnt vmcnt(0) &
- - system lgkmcnt(0)
+ fence acquire - agent *none* 1. s_waitcnt lgkmcnt(0) &
+ - system vmcnt(0)
- If OpenCL and
address space is
not generic, omit
lgkmcnt(0).
- However, since LLVM
+ - However, since LLVM
currently has no
address space on
the fence need to
@@ -2672,7 +2718,7 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
- s_waitcnt lgkmcnt(0)
must happen after
any preceding
- group/generic load
+ local/generic load
atomic/atomicrmw
with an equal or
wider sync scope
@@ -2699,8 +2745,8 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
2. buffer_wbinvl1_vol
- - Must happen before
- any following global/generic
+ - Must happen before any
+ following global/generic
load/load
atomic/store/store
atomic/atomicrmw.
@@ -2710,14 +2756,13 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
global data.
**Release Atomic**
- ---------------------------------------------------------------------------
+ -----------------------------------------------------------------------------------
store atomic release - singlethread - global 1. buffer/global/ds/flat_store
- wavefront - local
- generic
store atomic release - workgroup - global 1. s_waitcnt lgkmcnt(0)
- - generic
- - If OpenCL, omit
- waitcnt.
+
+ - If OpenCL, omit.
- Must happen after
any preceding
local/generic
@@ -2737,8 +2782,29 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
2. buffer/global/flat_store
store atomic release - workgroup - local 1. ds_store
- store atomic release - agent - global 1. s_waitcnt vmcnt(0) &
- - system - generic lgkmcnt(0)
+ store atomic release - workgroup - generic 1. s_waitcnt lgkmcnt(0)
+
+ - If OpenCL, omit.
+ - Must happen after
+ any preceding
+ local/generic
+ load/store/load
+ atomic/store
+ atomic/atomicrmw.
+ - Must happen before
+ the following
+ store.
+ - Ensures that all
+ memory operations
+ to local have
+ completed before
+ performing the
+ store that is being
+ released.
+
+ 2. flat_store
+ store atomic release - agent - global 1. s_waitcnt lgkmcnt(0) &
+ - system - generic vmcnt(0)
- If OpenCL, omit
lgkmcnt(0).
@@ -2770,7 +2836,7 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
store.
- Ensures that all
memory operations
- to global have
+ to memory have
completed before
performing the
store that is being
@@ -2781,9 +2847,8 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
- wavefront - local
- generic
atomicrmw release - workgroup - global 1. s_waitcnt lgkmcnt(0)
- - generic
- - If OpenCL, omit
- waitcnt.
+
+ - If OpenCL, omit.
- Must happen after
any preceding
local/generic
@@ -2803,8 +2868,29 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
2. buffer/global/flat_atomic
atomicrmw release - workgroup - local 1. ds_atomic
- atomicrmw release - agent - global 1. s_waitcnt vmcnt(0) &
- - system - generic lgkmcnt(0)
+ atomicrmw release - workgroup - generic 1. s_waitcnt lgkmcnt(0)
+
+ - If OpenCL, omit.
+ - Must happen after
+ any preceding
+ local/generic
+ load/store/load
+ atomic/store
+ atomic/atomicrmw.
+ - Must happen before
+ the following
+ atomicrmw.
+ - Ensures that all
+ memory operations
+ to local have
+ completed before
+ performing the
+ atomicrmw that is
+ being released.
+
+ 2. flat_atomic
+ atomicrmw release - agent - global 1. s_waitcnt lgkmcnt(0) &
+ - system - generic vmcnt(0)
- If OpenCL, omit
lgkmcnt(0).
@@ -2842,23 +2928,29 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
the atomicrmw that
is being released.
- 2. buffer/global/ds/flat_atomic*
+ 2. buffer/global/ds/flat_atomic
fence release - singlethread *none* *none*
- wavefront
fence release - workgroup *none* 1. s_waitcnt lgkmcnt(0)
- If OpenCL and
address space is
- not generic, omit
- waitcnt. However,
- since LLVM
+ not generic, omit.
+ - However, since LLVM
currently has no
address space on
the fence need to
conservatively
- always generate
- (see comment for
- previous fence).
+ always generate. If
+ fence had an
+ address space then
+ set to address
+ space of OpenCL
+ fence flag, or to
+ generic if both
+ local and global
+ flags are
+ specified.
- Must happen after
any preceding
local/generic
@@ -2883,21 +2975,32 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
following
fence-paired-atomic.
- fence release - agent *none* 1. s_waitcnt vmcnt(0) &
- - system lgkmcnt(0)
+ fence release - agent *none* 1. s_waitcnt lgkmcnt(0) &
+ - system vmcnt(0)
- If OpenCL and
address space is
not generic, omit
lgkmcnt(0).
- However, since LLVM
+ - If OpenCL and
+ address space is
+ local, omit
+ vmcnt(0).
+ - However, since LLVM
currently has no
address space on
the fence need to
conservatively
- always generate
- (see comment for
- previous fence).
+ always generate. If
+ fence had an
+ address space then
+ set to address
+ space of OpenCL
+ fence flag, or to
+ generic if both
+ local and global
+ flags are
+ specified.
- Could be split into
separate s_waitcnt
vmcnt(0) and
@@ -2933,21 +3036,20 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
fence-paired-atomic).
- Ensures that all
memory operations
- to global have
+ have
completed before
performing the
following
fence-paired-atomic.
**Acquire-Release Atomic**
- ---------------------------------------------------------------------------
+ -----------------------------------------------------------------------------------
atomicrmw acq_rel - singlethread - global 1. buffer/global/ds/flat_atomic
- wavefront - local
- generic
atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkmcnt(0)
- - If OpenCL, omit
- waitcnt.
+ - If OpenCL, omit.
- Must happen after
any preceding
local/generic
@@ -2965,12 +3067,11 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
atomicrmw that is
being released.
- 2. buffer/global_atomic
+ 2. buffer/global/flat_atomic
atomicrmw acq_rel - workgroup - local 1. ds_atomic
2. s_waitcnt lgkmcnt(0)
- - If OpenCL, omit
- waitcnt.
+ - If OpenCL, omit.
- Must happen before
any following
global/generic
@@ -2986,8 +3087,7 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkmcnt(0)
- - If OpenCL, omit
- waitcnt.
+ - If OpenCL, omit.
- Must happen after
any preceding
local/generic
@@ -3008,8 +3108,7 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
2. flat_atomic
3. s_waitcnt lgkmcnt(0)
- - If OpenCL, omit
- waitcnt.
+ - If OpenCL, omit.
- Must happen before
any following
global/generic
@@ -3022,8 +3121,9 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
older than the load
atomic value being
acquired.
- atomicrmw acq_rel - agent - global 1. s_waitcnt vmcnt(0) &
- - system lgkmcnt(0)
+
+ atomicrmw acq_rel - agent - global 1. s_waitcnt lgkmcnt(0) &
+ - system vmcnt(0)
- If OpenCL, omit
lgkmcnt(0).
@@ -3061,7 +3161,7 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
atomicrmw that is
being released.
- 2. buffer/global_atomic
+ 2. buffer/global/flat_atomic
3. s_waitcnt vmcnt(0)
- Must happen before
@@ -3085,8 +3185,8 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
will not see stale
global data.
- atomicrmw acq_rel - agent - generic 1. s_waitcnt vmcnt(0) &
- - system lgkmcnt(0)
+ atomicrmw acq_rel - agent - generic 1. s_waitcnt lgkmcnt(0) &
+ - system vmcnt(0)
- If OpenCL, omit
lgkmcnt(0).
@@ -3157,8 +3257,8 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
- If OpenCL and
address space is
- not generic, omit
- waitcnt. However,
+ not generic, omit.
+ - However,
since LLVM
currently has no
address space on
@@ -3196,8 +3296,8 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
stronger than
unordered (this is
termed the
- fence-paired-atomic)
- has completed
+ acquire-fence-paired-atomic
+ ) has completed
before following
global memory
operations. This
@@ -3217,19 +3317,19 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
stronger than
unordered (this is
termed the
- fence-paired-atomic).
- This satisfies the
+ release-fence-paired-atomic
+ ). This satisfies the
requirements of
release.
- fence acq_rel - agent *none* 1. s_waitcnt vmcnt(0) &
- - system lgkmcnt(0)
+ fence acq_rel - agent *none* 1. s_waitcnt lgkmcnt(0) &
+ - system vmcnt(0)
- If OpenCL and
address space is
not generic, omit
lgkmcnt(0).
- However, since LLVM
+ - However, since LLVM
currently has no
address space on
the fence need to
@@ -3274,8 +3374,8 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
stronger than
unordered (this is
termed the
- fence-paired-atomic)
- has completed
+ acquire-fence-paired-atomic
+ ) has completed
before invalidating
the cache. This
satisfies the
@@ -3295,8 +3395,8 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
stronger than
unordered (this is
termed the
- fence-paired-atomic).
- This satisfies the
+ release-fence-paired-atomic
+ ). This satisfies the
requirements of
release.
@@ -3317,13 +3417,103 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
acquire.
**Sequential Consistent Atomic**
- ---------------------------------------------------------------------------
+ -----------------------------------------------------------------------------------
load atomic seq_cst - singlethread - global *Same as corresponding
- - wavefront - local load atomic acquire*.
- - workgroup - generic
- load atomic seq_cst - agent - global 1. s_waitcnt vmcnt(0)
- - system - local
- - generic - Must happen after
+ - wavefront - local load atomic acquire,
+ - generic except must generated
+ all instructions even
+ for OpenCL.*
+ load atomic seq_cst - workgroup - global 1. s_waitcnt lgkmcnt(0)
+ - generic
+ - Must
+ happen after
+ preceding
+ global/generic load
+ atomic/store
+ atomic/atomicrmw
+ with memory
+ ordering of seq_cst
+ and with equal or
+ wider sync scope.
+ (Note that seq_cst
+ fences have their
+ own s_waitcnt
+ lgkmcnt(0) and so do
+ not need to be
+ considered.)
+ - Ensures any
+ preceding
+ sequential
+ consistent local
+ memory instructions
+ have completed
+ before executing
+ this sequentially
+ consistent
+ instruction. This
+ prevents reordering
+ a seq_cst store
+ followed by a
+ seq_cst load. (Note
+ that seq_cst is
+ stronger than
+ acquire/release as
+ the reordering of
+ load acquire
+ followed by a store
+ release is
+ prevented by the
+ waitcnt of
+ the release, but
+ there is nothing
+ preventing a store
+ release followed by
+ load acquire from
+ competing out of
+ order.)
+
+ 2. *Following
+ instructions same as
+ corresponding load
+ atomic acquire,
+ except must generated
+ all instructions even
+ for OpenCL.*
+ load atomic seq_cst - workgroup - local *Same as corresponding
+ load atomic acquire,
+ except must generated
+ all instructions even
+ for OpenCL.*
+ load atomic seq_cst - agent - global 1. s_waitcnt lgkmcnt(0) &
+ - system - generic vmcnt(0)
+
+ - Could be split into
+ separate s_waitcnt
+ vmcnt(0)
+ and s_waitcnt
+ lgkmcnt(0) to allow
+ them to be
+ independently moved
+ according to the
+ following rules.
+ - waitcnt lgkmcnt(0)
+ must happen after
+ preceding
+ global/generic load
+ atomic/store
+ atomic/atomicrmw
+ with memory
+ ordering of seq_cst
+ and with equal or
+ wider sync scope.
+ (Note that seq_cst
+ fences have their
+ own s_waitcnt
+ lgkmcnt(0) and so do
+ not need to be
+ considered.)
+ - waitcnt vmcnt(0)
+ must happen after
preceding
global/generic load
atomic/store
@@ -3351,7 +3541,7 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
prevents reordering
a seq_cst store
followed by a
- seq_cst load (Note
+ seq_cst load. (Note
that seq_cst is
stronger than
acquire/release as
@@ -3360,7 +3550,7 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
followed by a store
release is
prevented by the
- waitcnt vmcnt(0) of
+ waitcnt of
the release, but
there is nothing
preventing a store
@@ -3372,24 +3562,36 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
2. *Following
instructions same as
corresponding load
- atomic acquire*.
-
+ atomic acquire,
+ except must generated
+ all instructions even
+ for OpenCL.*
store atomic seq_cst - singlethread - global *Same as corresponding
- - wavefront - local store atomic release*.
- - workgroup - generic
+ - wavefront - local store atomic release,
+ - workgroup - generic except must generated
+ all instructions even
+ for OpenCL.*
store atomic seq_cst - agent - global *Same as corresponding
- - system - generic store atomic release*.
+ - system - generic store atomic release,
+ except must generated
+ all instructions even
+ for OpenCL.*
atomicrmw seq_cst - singlethread - global *Same as corresponding
- - wavefront - local atomicrmw acq_rel*.
- - workgroup - generic
+ - wavefront - local atomicrmw acq_rel,
+ - workgroup - generic except must generated
+ all instructions even
+ for OpenCL.*
atomicrmw seq_cst - agent - global *Same as corresponding
- - system - generic atomicrmw acq_rel*.
+ - system - generic atomicrmw acq_rel,
+ except must generated
+ all instructions even
+ for OpenCL.*
fence seq_cst - singlethread *none* *Same as corresponding
- - wavefront fence acq_rel*.
- - workgroup
- - agent
- - system
- ============ ============ ============== ========== =======================
+ - wavefront fence acq_rel,
+ - workgroup except must generated
+ - agent all instructions even
+ - system for OpenCL.*
+ ============ ============ ============== ========== ===============================
The memory order also adds the single thread optimization constrains defined in
table
@@ -3799,7 +4001,7 @@ used. The default value for all keys is 0, with the following exceptions:
- *kernel_code_entry_byte_offset* defaults to 256.
- *wavefront_size* defaults to 6.
- *kernarg_segment_alignment*, *group_segment_alignment*, and
- *private_segment_alignment* default to 4. Note that alignments are specified
+ *private_segment_alignment* default to 4. Note that alignments are specified
as a power of two, so a value of **n** means an alignment of 2^ **n**.
The *.amd_kernel_code_t* directive must be placed immediately after the