pidan_cpu_gemm_opt

An attempt to outperform OpenBLAS GEMM using a single CPU core.

TODO:

Micro kernel: 90% of peak CPU performance when the data is in L1 cache
Pack A matrices
Pack B matrices
Loop NR : NC
Loop MR : MC
Loop NC : N
Loop KC : K
Loop MC : M
Elegant makefile
Test tools
- validity
- performance
Tune so that size 1152 reaches about 90% of peak CPU performance
Why are the errors relative to OpenBLAS so large? Is that normal?
How useful is prefetch in sgemm_6x16? Prefetch in sgemm_6x16 improves performance by about 10 GFLOPS
- add prefetch
- why is prefetch useful? It reduces cache misses
How useful is prefetch in A_pack and B_pack? Prefetch in B_pack is useless; prefetch in A_pack improves performance by about 3 GFLOPS
- add prefetch
- why is prefetch useful? It reduces cache misses
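
The loop and packing items above can be sketched in plain C as follows. This is only an illustration of the usual blocked GEMM structure, not the code in src: it assumes row-major matrices, and the block sizes MC, KC and NC, as well as the names pack_a, pack_b, micro_kernel, sgemm_blocked, sgemm_ref and max_abs_diff, are hypothetical. MR = 6 and NR = 16 are taken from the kernel name sgemm_6x16. SIMD and prefetch are left out, so it shows the data flow rather than the performance.

```c
#include <stddef.h>
#include <stdlib.h>

/* MR x NR follows the sgemm_6x16 name; MC, KC, NC are assumed values. */
enum { MR = 6, NR = 16, MC = 72, KC = 256, NC = 1024 };

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* Pack an mc x kc block of A into panels of MR rows, zero-padding the edge. */
static void pack_a(const float *A, size_t lda, size_t mc, size_t kc, float *Ap) {
    for (size_t i = 0; i < mc; i += MR)
        for (size_t p = 0; p < kc; ++p)
            for (size_t ii = 0; ii < MR; ++ii)
                *Ap++ = (i + ii < mc) ? A[(i + ii) * lda + p] : 0.0f;
}

/* Pack a kc x nc block of B into panels of NR columns, zero-padding the edge. */
static void pack_b(const float *B, size_t ldb, size_t kc, size_t nc, float *Bp) {
    for (size_t j = 0; j < nc; j += NR)
        for (size_t p = 0; p < kc; ++p)
            for (size_t jj = 0; jj < NR; ++jj)
                *Bp++ = (j + jj < nc) ? B[p * ldb + j + jj] : 0.0f;
}

/* Scalar stand-in for the micro kernel: C[mr x nr] += Ap panel * Bp panel. */
static void micro_kernel(size_t kc, const float *Ap, const float *Bp,
                         float *C, size_t ldc, size_t mr, size_t nr) {
    float acc[MR][NR] = {{0}};
    for (size_t p = 0; p < kc; ++p)
        for (size_t i = 0; i < MR; ++i)
            for (size_t j = 0; j < NR; ++j)
                acc[i][j] += Ap[p * MR + i] * Bp[p * NR + j];
    for (size_t i = 0; i < mr; ++i)        /* write back only the valid part */
        for (size_t j = 0; j < nr; ++j)
            C[i * ldc + j] += acc[i][j];
}

/* C (M x N) += A (M x K) * B (K x N), all row-major. Returns 0 on success. */
int sgemm_blocked(size_t M, size_t N, size_t K,
                  const float *A, const float *B, float *C) {
    float *Ap = malloc(sizeof(float) * (size_t)MC * KC);
    float *Bp = malloc(sizeof(float) * (size_t)KC * NC);
    if (!Ap || !Bp) { free(Ap); free(Bp); return -1; }
    for (size_t jc = 0; jc < N; jc += NC) {                 /* Loop NC : N */
        size_t nc = min_sz(NC, N - jc);
        for (size_t pc = 0; pc < K; pc += KC) {             /* Loop KC : K */
            size_t kc = min_sz(KC, K - pc);
            pack_b(B + pc * N + jc, N, kc, nc, Bp);
            for (size_t ic = 0; ic < M; ic += MC) {         /* Loop MC : M */
                size_t mc = min_sz(MC, M - ic);
                pack_a(A + ic * K + pc, K, mc, kc, Ap);
                for (size_t jr = 0; jr < nc; jr += NR)      /* Loop NR : NC */
                    for (size_t ir = 0; ir < mc; ir += MR)  /* Loop MR : MC */
                        micro_kernel(kc, Ap + ir * kc, Bp + jr * kc,
                                     C + (ic + ir) * N + jc + jr, N,
                                     min_sz(MR, mc - ir), min_sz(NR, nc - jr));
            }
        }
    }
    free(Ap);
    free(Bp);
    return 0;
}

/* Naive reference for the validity test. */
void sgemm_ref(size_t M, size_t N, size_t K,
               const float *A, const float *B, float *C) {
    for (size_t i = 0; i < M; ++i)
        for (size_t p = 0; p < K; ++p)
            for (size_t j = 0; j < N; ++j)
                C[i * N + j] += A[i * K + p] * B[p * N + j];
}

/* Largest absolute element-wise difference between two buffers. */
float max_abs_diff(const float *x, const float *y, size_t n) {
    float m = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        float d = x[i] - y[i];
        if (d < 0.0f) d = -d;
        if (d > m) m = d;
    }
    return m;
}
```

A validity check in this style would run sgemm_blocked and sgemm_ref on the same inputs and compare the results with max_abs_diff. With general float inputs the two results can differ slightly, because the blocked version adds the products in a different order.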
