12_Global_Memory_Maximizing_CUDA

2021. 9. 7. 00:06

How CUDA accesses memory

Global memory is accessed through the L2 cache.
- Data is read 32 bytes at a time.
- When data moves from the L2 cache to the L1 cache, it is read 128 bytes at a time.

Aligned memory access: an access is aligned when it starts reading at one of the cache's boundary addresses.
- L2 cache: addresses that are multiples of 32 bytes
- L1 cache: addresses that are multiples of 128 bytes

Coalesced memory access
- The 32 threads in a warp access a contiguous region of memory.
- As in the picture below, the threads access consecutive memory addresses.

Suppose the example picture has 32 threads and each thread accesses 4 bytes.

Then a single warp requests 32 × 4 = 128 bytes of data.

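Aligned and Coalesced Access
- Read through the L2 cache: the 128-byte request is split into four 32-byte transactions.
- Read through the L1 cache: the whole request is served in a single 128-byte transaction.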

Not Aligned and Coalesced Access
- The data is not contiguous and the access sizes are irregular.
- More transactions are needed, so the number of memory accesses increases.

Whenever possible, write global memory accesses so that the threads within a warp access memory in an aligned and coalesced way.

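In the picture above, rather than

row = threadIdx.x

col = threadIdx.y

using

row = threadIdx.y

col = threadIdx.x

is more coalesced and therefore more efficient: assuming the matrix is stored in row-major order, threads with consecutive threadIdx.x values then read adjacent elements of the same row.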

Source:

https://www.youtube.com/watch?v=ualcIR5pmsg&list=PLBrGAFAIyf5pp3QNigbh2hRU5EUD0crgI&index=24
