Try to outperform the OpenBLAS GEMM using a single CPU core
TODO:
- Micro kernel: reach 90% of peak CPU performance when the data is in L1 cache
- Pack matrix A
- Pack matrix B
- Loop NR : NC
- Loop MR : MC
- Loop NC : N
- Loop KC : K
- Loop MC : M
- Elegant makefile
- Test tools
- validity
- performance
- tune so that size 1152 reaches about 90% of peak CPU performance
- why are the errors relative to OpenBLAS so large? Is that normal?
- how useful is prefetch in sgemm_6x16? prefetch in sgemm_6x16 improves performance by about 10 GFLOPS
- add prefetch
- why is prefetch useful? it reduces otherwise unavoidable (compulsory) cache misses
- how useful is prefetch in A_pack and B_pack? prefetch in B_pack gives no benefit; prefetch in A_pack improves performance by about 3 GFLOPS
- add prefetch
- why is prefetch useful? it reduces otherwise unavoidable (compulsory) cache misses
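
The loop order, packing steps and micro kernel listed above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the block sizes (MC, KC, NC), the function names (pack_a, pack_b, micro_kernel, sgemm_sketch) and the row-major layout are all assumptions, and a plain scalar loop stands in for the real sgemm_6x16 kernel. The prefetch of the C tile uses __builtin_prefetch, which is a GCC/Clang builtin; where the real kernel prefetches is not stated here, so that placement is also an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical block sizes for illustration only; they are not tuned. */
enum { MR = 6, NR = 16, MC = 72, KC = 256, NC = 1024 };

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* Pack an mc x kc block of row-major A into panels of MR rows.
 * Inside a panel, element (i, p) is stored at panel[p * MR + i].
 * Rows past mc are zero-padded so the kernel always sees full panels. */
static void pack_a(const float *A, size_t lda, size_t mc, size_t kc, float *Ap)
{
    for (size_t i0 = 0; i0 < mc; i0 += MR)
        for (size_t p = 0; p < kc; ++p)
            for (size_t i = 0; i < MR; ++i)
                *Ap++ = (i0 + i < mc) ? A[(i0 + i) * lda + p] : 0.0f;
}

/* Pack a kc x nc block of row-major B into panels of NR columns.
 * Inside a panel, element (p, j) is stored at panel[p * NR + j]. */
static void pack_b(const float *B, size_t ldb, size_t kc, size_t nc, float *Bp)
{
    for (size_t j0 = 0; j0 < nc; j0 += NR)
        for (size_t p = 0; p < kc; ++p)
            for (size_t j = 0; j < NR; ++j)
                *Bp++ = (j0 + j < nc) ? B[p * ldb + j0 + j] : 0.0f;
}

/* Scalar stand-in for the MR x NR micro kernel: C[0:mr, 0:nr] += Ap * Bp. */
static void micro_kernel(size_t kc, const float *Ap, const float *Bp,
                         float *C, size_t ldc, size_t mr, size_t nr)
{
    float acc[MR][NR] = {{0}};

    /* Assumed prefetch placement: request the C rows early so they are
     * likely in cache by the time the accumulators are written back. */
    for (size_t i = 0; i < mr; ++i)
        __builtin_prefetch(C + i * ldc, 1, 3);

    for (size_t p = 0; p < kc; ++p)
        for (size_t i = 0; i < MR; ++i)
            for (size_t j = 0; j < NR; ++j)
                acc[i][j] += Ap[p * MR + i] * Bp[p * NR + j];

    /* Write back only the valid part of an edge tile. */
    for (size_t i = 0; i < mr; ++i)
        for (size_t j = 0; j < nr; ++j)
            C[i * ldc + j] += acc[i][j];
}

/* C (M x N) += A (M x K) * B (K x N), all row-major and contiguous.
 * Static pack buffers keep the sketch short; it is single-threaded only. */
void sgemm_sketch(size_t M, size_t N, size_t K,
                  const float *A, const float *B, float *C)
{
    static float Ap[MC * KC];
    static float Bp[KC * NC];
    assert(MC % MR == 0 && NC % NR == 0); /* padded panels must fit */

    for (size_t jc = 0; jc < N; jc += NC) {            /* Loop NC : N */
        size_t nc = min_sz(NC, N - jc);
        for (size_t pc = 0; pc < K; pc += KC) {        /* Loop KC : K */
            size_t kc = min_sz(KC, K - pc);
            pack_b(B + pc * N + jc, N, kc, nc, Bp);
            for (size_t ic = 0; ic < M; ic += MC) {    /* Loop MC : M */
                size_t mc = min_sz(MC, M - ic);
                pack_a(A + ic * K + pc, K, mc, kc, Ap);
                for (size_t jr = 0; jr < nc; jr += NR)       /* Loop NR : NC */
                    for (size_t ir = 0; ir < mc; ir += MR)   /* Loop MR : MC */
                        micro_kernel(kc, Ap + ir * kc, Bp + jr * kc,
                                     C + (ic + ir) * N + jc + jr, N,
                                     min_sz(MR, mc - ir), min_sz(NR, nc - jr));
            }
        }
    }
}
```

Packing B once per (NC, KC) block and A once per (MC, KC) block lets the micro kernel read both operands with unit stride. A real kernel would replace the scalar triple loop with SIMD code that keeps the 6x16 accumulator tile in registers.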