Chapter 4: Data parallelism

SIMD vs MIMD
  fig 4.1, px 292, p 263

Vector architecture

Vector architectures grab sets of data elements scattered about memory, place them into large, sequential register files, operate on data in those register files, and then disperse the results back into memory. A single instruction operates on vectors of data, which results in dozens of register–register operations on independent data elements.

Example: VMIPS
  fig 4.2 VMIPS, px 294, p 265
  vector registers, vector functional units
  fig 4.3 vector instructions, px 295, p 266

How does it work?
  Y = a*X + Y, called SAXPY (single precision) or DAXPY (double precision)
  a is a scalar, X and Y are vectors

Assembly program, px 297, p 268

        ld    f0,a          ; load scalar a
        dadd  r4,rx,#512    ; last address to load
  loop: ld    f2,0(rx)      ; load X[i]
        dmul  f2,f2,f0      ; a * X[i]
        ld    f4,0(ry)      ; load Y[i]
        dadd  f4,f4,f2      ; a * X[i] + Y[i]
        st    f4,0(ry)      ; store Y[i]
        dadd  rx,rx,#8      ; next i for x
        dadd  ry,ry,#8      ; next i for y
        dsub  r20,r4,rx     ; check bound
        bnez  r20,loop

VMIPS assembly

        ld    f0,a          ; load scalar a
        lv    v1,rx         ; load vector X
        vmul  v2,v1,f0      ; vector-scalar multiply
        lv    v3,ry         ; load vector Y
        vadd  v4,v2,v3      ; vector add
        sv    v4,ry         ; store the result

  The vector version executes 6 instructions; the scalar loop executes 578 (64 iterations of 9 instructions, plus 2 for setup).

Multiple lanes: more than one element per clock cycle
  fig 4.4, px 301, p 272
  fig 4.5, px 302, p 273

Vector length, strip mining

        for (i = 0; i < n; i++) Y[i] = a*X[i] + Y[i]

  n may be larger than the maximum vector length, so the loop is split into strips that each fit in a vector register

Memory banks: supply bandwidth to the vector load/store units
  Example: Cray T90, px 306, p 277

Real performance
  fig 4.7, px 310, p 281

---------- break -------------

GPU
  fig 4.14, px 323, p 294
  difference between CPU and GPU

Explain and demo NPU
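
Sketch for the strip-mining note above (n may be larger than the vector length). This is a minimal Python model, not VMIPS code: it assumes a maximum vector length of 64 elements, matching the 512-byte bound in the scalar loop, and the function name `daxpy_strip_mined` and parameter `mvl` are made up for illustration. The first strip handles the odd-sized remainder (n mod MVL), and every later strip uses the full vector length.

```python
def daxpy_strip_mined(a, x, y, mvl=64):
    """Compute y = a*x + y in place, in strips of at most mvl elements.

    mvl stands in for the maximum vector length (assumed 64 here).
    Returns the list of strip lengths that were processed.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    n = len(x)
    strip_lengths = []
    low = 0
    vl = n % mvl  # the first strip takes the odd-sized remainder
    for _ in range(n // mvl + 1):
        # The inner loop models one vector sequence (lv, vmul, lv, vadd, sv)
        # executed with the vector length set to vl.
        for i in range(low, low + vl):
            y[i] = a * x[i] + y[i]
        if vl > 0:
            strip_lengths.append(vl)
        low += vl
        vl = mvl  # every later strip uses the full vector length
    return strip_lengths
```

With n = 200 and mvl = 64, the strips are 8, 64, 64, 64: one short strip followed by three full ones, so the six-instruction vector sequence runs four times rather than once.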