
SpelRight

Yes, it is intentional.

A simple Spell Checker written in Rust. Includes CLI and lib.

Also available on crates.io!

Supports any UTF-8 input (kind of, WIP), as long as the input file is in the right format (see Dataset Fixer or load_words_dict).

Primarily written for the MangaHub project's Novel ecosystem. And to learn Rust :D

[!WARNING]

For now, only byte-level processing is supported (WIP).

Some benchmarks

On my i5-12450H laptop with VS Code open.

English.

Loads and parses a 4 MB file with 370,105 words in under ~2 ms.

Spell checking runs at ~50,000,000 words/s when all words are correct (worst-case scenario, batch_par_check).

Sorted suggestions for 1,000 incorrect words in ~63 ms (~15,800 words/s, worst-case scenario, batch_par_suggest).

Memory usage is minimal: a few big strings of all words without delimiters, plus a small Vec of metadata. That totals the dictionary size + ~200 bytes (depending on the longest word's length) + the additional cost of some operations.

CLI

With spell.exe in %PATH% and words.txt in the same folder:

> spell funny wrd sjdkfhsdjfh
✅ funny
❓ wrd => wro wry word wad rd wird ord urd ward wd
❌ Wrong word 'sjdkfhsdjfh', no suggestions

Breakthroughs that led to this

Storing blobs of words and their metadata

Words of each length are stored in (optionally immutable) blobs, sorted by bytes.

Info about those blobs is stored alongside: their len and/or word count.

Pros:

  • Incredibly easy to iterate over
  • SIMD compatible
  • Highly parallelizable
  • Great cache locality (a shit ton of cache hits)
  • Words can be found with binary search in O(log n)
  • Working with bytes instead of chars
    • Supports any language
  • Others that I forgor

Cons:

  • Needs precise dataset
  • Adding words is pretty difficult without moving the whole Vec

Pros totally outweigh the Cons!
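
A minimal sketch of this layout, with illustrative names (LenGroup and its fields are not the crate's actual types):

```rust
/// One group of same-length words. Names are illustrative,
/// not SpelRight's actual types.
struct LenGroup {
    /// All words of length `len`, concatenated with no delimiters,
    /// sorted bytewise so the fixed-size chunks are in order.
    blob: Vec<u8>,
    /// Byte length of every word in this group.
    len: usize,
}

impl LenGroup {
    /// Binary search over fixed-size slices of the blob: O(log n),
    /// no pointer chasing, great cache locality.
    fn contains(&self, word: &[u8]) -> bool {
        debug_assert_eq!(word.len(), self.len);
        let (mut lo, mut hi) = (0, self.blob.len() / self.len);
        while lo < hi {
            let mid = (lo + hi) / 2;
            let candidate = &self.blob[mid * self.len..(mid + 1) * self.len];
            match candidate.cmp(word) {
                std::cmp::Ordering::Less => lo = mid + 1,
                std::cmp::Ordering::Greater => hi = mid,
                std::cmp::Ordering::Equal => return true,
            }
        }
        false
    }
}
```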

Specialized matching algorithm

When iterating over each LenGroup, the max difference lets us calculate the maximum number of deletions, insertions, and substitutions allowed for that group.

As an example:

Checking nothng (length group 6) against group 7 with a max difference of 2: the budget is 1 insertion and 1 (optional) substitution.

With one insertion, nothng reaches group 7's length, and the optional substitution lets it also match words that differ in one more position.

max_delete + max_insert + max_substitution will always sum to exactly max_dif.

This is multiple times faster than other distance-finding algorithms.
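
A minimal sketch of that budget calculation; the function name and signature are assumptions, not the crate's API:

```rust
/// Edit budget for checking a word of length `word_len` against a group
/// of words of length `group_len`, given a maximum distance `max_dif`.
/// Returns None when the group is out of reach and can be skipped.
/// (Hypothetical helper, not SpelRight's actual API.)
fn budget(word_len: usize, group_len: usize, max_dif: usize) -> Option<(usize, usize, usize)> {
    // The length gap must be covered entirely by insertions or deletions.
    let (max_insert, max_delete) = if group_len >= word_len {
        (group_len - word_len, 0)
    } else {
        (0, word_len - group_len)
    };
    let gap = max_insert + max_delete;
    if gap > max_dif {
        return None; // this LenGroup cannot match at all
    }
    // Whatever budget remains can be spent on substitutions,
    // so the three maxima always sum to exactly max_dif.
    Some((max_delete, max_insert, max_dif - gap))
}

fn main() {
    // "nothng" (len 6) vs group 7 with max_dif = 2:
    // exactly 1 insertion plus 1 optional substitution.
    assert_eq!(budget(6, 7, 2), Some((0, 1, 1)));
}
```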

Goals

  • Checking word correctness

  • Suggesting similar words

  • Adding new words

  • Support different languages

  • Full language support

    • Full ASCII support
    • Full UTF-8 support
      • Normalize some languages
      • Divide languages into those with pure-ASCII words, those needing possible normalization, and those containing UTF-8
    • Plugin
      • For everything
        • Default plugins
      • For especially complex languages
  • Make good CLI

    • Long-running server
    • Config
  • Make it fast

    Suggestions (12,500 words/s)

    • 100 words/s
    • 250 words/s
    • 1,000 words/s
    • 2,500 words/s
    • 10,000 words/s
    • 25,000 words/s
    • 100,000 words/s

    Loading (2.2 ms)

    • <200 ms
    • <100 ms
    • <50 ms
    • <20 ms
    • <10 ms
    • <5 ms
    • <3 ms
    • <2 ms (read_to_string alone takes more than 2 ms, not sure if even possible (nvm, after restarting my PC, it's less than 2 ms))
    • <1 ms (No idea how the fuck this could be possible, but hey, goals!)

Possible Optimizations

Hardware

  • Cache locality (dense blob of words)
  • SIMDeez nuts
    • Distance matching
    • Binary search (might be optimized by the compiler)
  • Parallelism
    • Rayon (a minimal sketch follows this list)
      • Test with and without
      • Auto-deciding between parallel and sequential
    • Manual
  • GPU Acceleration
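
A minimal sketch of the Rayon route; batch_par_check exists in the crate (see the benchmarks above), but this signature and the Dictionary type are assumptions, not the documented API:

```rust
use rayon::prelude::*;

/// Stand-in for the real dictionary type (hypothetical).
struct Dictionary;

impl Dictionary {
    fn contains(&self, _word: &str) -> bool {
        true // placeholder; the real check is the blob binary search above
    }
}

/// Check a batch of words in parallel; Rayon splits the slice
/// across its worker threads.
fn batch_par_check<'a>(dict: &Dictionary, words: &[&'a str]) -> Vec<(&'a str, bool)> {
    words
        .par_iter()
        .map(|&w| (w, dict.contains(w)))
        .collect()
}
```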

Memory usage

  • Blobs of words with no other symbols (i.e. no \n)
  • Storing minimal metadata about each word length
  • Storing first-letter offsets; size depends on the language, but is minimal overall

Total memory usage is pretty much minimal.

Reduce the number of words checked

  • Word length groups (depends on dataset)
  • For lengths at the max distance from the word's (no char changes allowed, only deletions)
    • Tracking first-letter offsets: check only the words whose first letter matches (a sketch follows this list)
  • For lengths that are the same as the word's (no char deletion or insertion, only substitution)
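
A hypothetical sketch of such a first-letter index for one length group, assuming a lowercase-ASCII alphabet; the LetterIndex name and layout are not the crate's types:

```rust
/// Hypothetical first-letter index for one length group (lowercase ASCII).
/// offsets[i] is the index of the first word starting with b'a' + i;
/// offsets[26] is the total word count (a sentinel).
struct LetterIndex {
    offsets: [u32; 27],
}

impl LetterIndex {
    /// The range of word indices sharing `first_byte`, so a search
    /// only has to look at that sub-range of the blob.
    fn range_for(&self, first_byte: u8) -> std::ops::Range<usize> {
        let i = (first_byte - b'a') as usize;
        self.offsets[i] as usize..self.offsets[i + 1] as usize
    }
}
```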

Caching

  • Frequently made mistakes

Loading

[!NOTE]

read_to_string of 370,000 words (~4 MB) takes about 2 ms on my machine.
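
A quick way to reproduce that measurement (the words.txt file name is an assumption):

```rust
use std::{fs, time::Instant};

fn main() -> std::io::Result<()> {
    let start = Instant::now();
    let text = fs::read_to_string("words.txt")?; // the dictionary file
    println!("read {} bytes in {:?}", text.len(), start.elapsed());
    Ok(())
}
```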

  • Reduce parsing by pre-parsing the dataset (see Better dataset)

Better dataset

  • Reduce the number of words; most words never appear in an average text
  • Store offsets, no unnecessary \n
  • Store first letters offsets

[!NOTE]

This makes it harder to work with the dataset manually.

Better algorithms

  • Custom
    • See Breakthroughs that led to this
