Running C+=AB^T benchmark
  num_threads: 4
  QoS: User Interactive
  num_reps: 20000000
  M: 32
  N: 32
  K: 32
  Max absolute error: 0
  Max relative error: 0
  Accelerate Duration:    4.02868 s
  Accelerate Performance: 1301.39 GFLOPS
  Kernel Duration:        3.97264 s
  Kernel Performance:     1319.75 GFLOPS
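
The GFLOPS figures above can be reproduced from the logged parameters. The sketch below assumes that one C+=AB^T update costs 2*M*N*K floating-point operations and that each of the num_threads threads performs num_reps updates. Neither assumption is stated in the log, but together they match both reported numbers. The function name gflops is hypothetical and is not taken from the benchmark's source.

```python
def gflops(m, n, k, num_reps, num_threads, seconds):
    """Return throughput in GFLOPS for repeated C += A * B^T updates.

    Assumption (not stated in the log): each update costs 2*m*n*k
    floating-point operations (one multiply and one add per inner-product
    term), and every thread runs num_reps updates.
    """
    total_flops = 2 * m * n * k * num_reps * num_threads
    return total_flops / seconds / 1e9


# Parameters and durations copied from the log above.
accelerate = gflops(32, 32, 32, 20_000_000, 4, 4.02868)
kernel = gflops(32, 32, 32, 20_000_000, 4, 3.97264)

print(f"Accelerate: {accelerate:.2f} GFLOPS")
print(f"Kernel:     {kernel:.2f} GFLOPS")
print(f"Kernel / Accelerate: {kernel / accelerate:.4f}")
```

Because the logged durations are rounded to six significant digits, the recomputed values can differ from the logged ones in the last printed digit.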