High-efficiency floating-point neural network inference operators for mobile, server, and Web
Updated Dec 10, 2025 - C
🔍 Analyze CUDA matrix multiplication performance and power consumption on NVIDIA Jetson Orin Nano across multiple implementations and settings.
🔍 Explore GEMM: a C/C++ library for efficient matrix multiplication using OpenMP, designed for parallel computing learners and practitioners.
🔍 Simulate quantum entanglement using C++ with grids and random particles, offering an engaging way to explore this fundamental concept.
🚀 Explore GPU capabilities on Mac with hands-on comparisons of CPU and Metal GPU performance for AI training using PyTorch and TensorFlow.
Data Structure & Algorithm: This journey is not just about coding but also about developing problem-solving thinking, optimizing solutions, and building a strong foundation for coding interviews and real-world programming. So far I am loving it.
💥 Fast matrix-multiplication as a self-contained Python library – no system dependencies!
DBCSR: Distributed Block Compressed Sparse Row matrix library
Seminar on parallel programming with OpenMP
Benchmark suite comparing LabVIEW GPU toolkits (CuLab, G2CPU, Graiphic Accelerator). Includes methods, sources, results, and reproducible test pipelines.
A powerful library extending VBA with over 100 functions for math, stats, finance, and data manipulation. It supports matrix operations and user-defined functions, enhancing automation and analysis within Microsoft Office and LibreOffice environments for data management, financial calculations, and more.
M4RI is a library for fast arithmetic with dense matrices over GF(2)
High-performance GPU-accelerated linear algebra library for scientific computing. Custom kernels outperform cuBLAS+cuSPARSE by 2.4x in iterative solvers. Built for circuit simulation workloads.
A lightweight Triton-based General Matrix Multiplication (GEMM) library.
High-performance GPU matrix multiplication achieving 6,436 GFLOPS (69% of peak) on Tesla P100 through progressive CUDA optimization
Distributed Communication-Optimal Matrix-Matrix Multiplication Algorithm
[DEPRECATED] Moved to ROCm/rocm-libraries repo
generalized (square) matrix multiplication w/ C++26 experimental::simd