Running C+=AB^T benchmark
  num_threads: 3
  QoS: User Interactive
  num_reps: 20000000
  M: 32
  N: 32
  K: 32
  Max absolute error: 0
  Max relative error: 0
  Accelerate Duration:    2.95451 s
  Accelerate Performance: 1330.9 GFLOPS
  Kernel Duration:        2.88261 s
  Kernel Performance:     1364.1 GFLOPS
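
The performance figures above can be cross-checked from the printed parameters. The sketch below assumes (the log does not state this) that C+=AB^T costs 2*M*N*K floating-point operations per repetition and that each of the num_threads threads runs num_reps repetitions. Under those assumptions the arithmetic agrees with both reported GFLOPS values to one decimal place. The function name `gflops` is hypothetical and not part of the benchmark.

```python
def gflops(m, n, k, num_reps, num_threads, seconds):
    """Estimate GFLOPS for repeated C += A * B^T updates.

    Assumption: every thread performs num_reps independent updates,
    and each update costs one multiply and one add per (i, j, k) term.
    """
    flops_per_rep = 2 * m * n * k
    total_flops = flops_per_rep * num_reps * num_threads
    return total_flops / seconds / 1e9


# Parameters and durations copied from the log above.
accelerate = gflops(32, 32, 32, 20_000_000, 3, 2.95451)
kernel = gflops(32, 32, 32, 20_000_000, 3, 2.88261)

print(f"Accelerate: {accelerate:.1f} GFLOPS")
print(f"Kernel:     {kernel:.1f} GFLOPS")
print(f"Kernel speedup over Accelerate: {kernel / accelerate:.3f}x")
```

The speedup line is simply the ratio of the two durations, so it does not depend on the FLOP-count assumption.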