pidanself/pidan_cpu_gemm_opt

pidan_cpu_gemm_opt

An attempt to outperform OpenBLAS GEMM using a single CPU core.

TODO:

  • Micro kernel: reach 90% of peak CPU performance when the data is in L1 cache
  • Pack matrix A
  • Pack matrix B
  • Loop NR : NC
  • Loop MR : MC
  • Loop NC : N
  • Loop KC : K
  • Loop MC : M
  • Elegant Makefile
  • Test tools
    • validity
    • performance
  • Tune so that size 1152 reaches about 90% of peak CPU performance
  • Why are the errors relative to OpenBLAS so large? Is that normal?
  • How useful is prefetch in sgemm_6x16? Prefetch in sgemm_6x16 gains about 10 GFLOPS.
    • Add prefetch
    • Why is prefetch useful? It reduces otherwise unavoidable (compulsory) cache misses.
  • How useful is prefetch in A_pack and B_pack? Prefetch in B_pack is useless. Prefetch in A_pack gains about 3 GFLOPS.
    • Add prefetch
    • Why is prefetch useful? It reduces otherwise unavoidable (compulsory) cache misses.
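
The loop items in the TODO list follow the usual blocked GEMM layout: three outer loops over N, K and M in steps of NC, KC and MC, packing of the B and A blocks, then two inner loops in steps of NR and MR around the micro kernel. The sketch below shows that structure in plain C. It is not the repository's code. It assumes row-major matrices, computes C += A * B, uses a scalar 6x16 micro kernel instead of SIMD, and the names `sgemm_blocked`, `pack_A`, `pack_B` and `micro_kernel` as well as the block sizes are placeholders chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative block sizes (assumptions, not the repository's tuned values).
   MC must be a multiple of MR and NC a multiple of NR. */
enum { MR = 6, NR = 16, MC = 72, KC = 256, NC = 1024 };

static float Apack[MC * KC]; /* MC x KC block of A, stored as MR-row panels */
static float Bpack[KC * NC]; /* KC x NC block of B, stored as NR-column panels */

/* Copy an mc x kc block of row-major A into MR-row panels, zero-padding the edge. */
static void pack_A(int mc, int kc, const float *A, int lda, float *dst) {
    for (int i0 = 0; i0 < mc; i0 += MR)
        for (int p = 0; p < kc; ++p)
            for (int i = 0; i < MR; ++i)
                *dst++ = (i0 + i < mc) ? A[(size_t)(i0 + i) * lda + p] : 0.0f;
}

/* Copy a kc x nc block of row-major B into NR-column panels, zero-padding the edge. */
static void pack_B(int kc, int nc, const float *B, int ldb, float *dst) {
    for (int j0 = 0; j0 < nc; j0 += NR)
        for (int p = 0; p < kc; ++p)
            for (int j = 0; j < NR; ++j)
                *dst++ = (j0 + j < nc) ? B[(size_t)p * ldb + j0 + j] : 0.0f;
}

/* Scalar stand-in for a SIMD 6x16 kernel: C[0..mr, 0..nr] += a_panel * b_panel. */
static void micro_kernel(int kc, const float *a, const float *b,
                         float *C, int ldc, int mr, int nr) {
    float acc[MR][NR] = {{0}};
    for (int p = 0; p < kc; ++p)
        for (int i = 0; i < MR; ++i)
            for (int j = 0; j < NR; ++j)
                acc[i][j] += a[p * MR + i] * b[p * NR + j];
    for (int i = 0; i < mr; ++i)          /* write back only the valid part */
        for (int j = 0; j < nr; ++j)
            C[(size_t)i * ldc + j] += acc[i][j];
}

/* C (M x N) += A (M x K) * B (K x N), all row-major and densely stored. */
void sgemm_blocked(int M, int N, int K, const float *A, const float *B, float *C) {
    assert(M >= 0 && N >= 0 && K >= 0);
    for (int jc = 0; jc < N; jc += NC) {                 /* Loop NC : N */
        int nc = (N - jc < NC) ? N - jc : NC;
        for (int pc = 0; pc < K; pc += KC) {             /* Loop KC : K */
            int kc = (K - pc < KC) ? K - pc : KC;
            pack_B(kc, nc, B + (size_t)pc * N + jc, N, Bpack);
            for (int ic = 0; ic < M; ic += MC) {         /* Loop MC : M */
                int mc = (M - ic < MC) ? M - ic : MC;
                pack_A(mc, kc, A + (size_t)ic * K + pc, K, Apack);
                for (int jr = 0; jr < nc; jr += NR) {    /* Loop NR : NC */
                    int nr = (nc - jr < NR) ? nc - jr : NR;
                    for (int ir = 0; ir < mc; ir += MR) { /* Loop MR : MC */
                        int mr = (mc - ir < MR) ? mc - ir : MR;
                        micro_kernel(kc, Apack + (size_t)ir * kc,
                                     Bpack + (size_t)jr * kc,
                                     C + (size_t)(ic + ir) * N + jc + jr,
                                     N, mr, nr);
                    }
                }
            }
        }
    }
}
```

Packing makes the micro kernel read both operands with unit stride, and the zero padding lets the same kernel handle edge tiles whose size is not a multiple of MR or NR.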

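The prefetch items above can be illustrated with a small packing routine that asks the hardware to load the next cache line of each source row before it is needed. This is only a sketch under stated assumptions: it relies on the GCC/Clang builtin `__builtin_prefetch`, assumes 64-byte cache lines (16 floats) and a row-major source, and the function name `pack_panel_prefetch` and the prefetch distance are hypothetical. The best distance, and whether prefetch helps at all, has to be measured on the target CPU.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed values: 6 rows per panel, prefetch 16 floats (one 64-byte line) ahead. */
enum { PANEL_ROWS = 6, PREFETCH_DIST = 16 };

/* Pack a PANEL_ROWS x kc block of row-major A (leading dimension lda) so that
   the PANEL_ROWS values of each column p are contiguous in dst. */
void pack_panel_prefetch(int kc, const float *A, int lda, float *dst) {
    assert(kc >= 0 && lda >= kc);
    for (int p = 0; p < kc; ++p) {
        /* Once per cache line, hint the next line of every row (read, keep in cache). */
        if (p % PREFETCH_DIST == 0 && p + PREFETCH_DIST < kc) {
            for (int i = 0; i < PANEL_ROWS; ++i)
                __builtin_prefetch(&A[(size_t)i * lda + p + PREFETCH_DIST], 0, 3);
        }
        for (int i = 0; i < PANEL_ROWS; ++i)
            dst[p * PANEL_ROWS + i] = A[(size_t)i * lda + p];
    }
}
```

A prefetch is only a hint and never changes the result, so the packed output can be checked against a version without it, and the two can be timed side by side.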