
SpelRight

Yes, it is intentional.

A simple Spell Checker written in Rust. Includes CLI and lib.

Also available on crates.io!

Supports any UTF-8 input (kind of, WIP), as long as the input file is in the right format (see Dataset Fixer or load_words_dict).

Primarily written for the MangaHub project's Novel ecosystem. And to learn Rust :D

[!WARNING]

For now, only byte-level processing is supported (WIP).

Some benchmarks

On my i5-12450H laptop with VS Code open.

English.

Loads and parses a 4 MB file with 370,105 words in under ~2 ms.

Spell checking runs at ~50,000,000 words/s when all words are correct (worst-case scenario, batch_par_check).

Sorted suggestions for 1,000 incorrect words in ~63 ms (~15,800 words/s, worst-case scenario, batch_par_suggest).

Memory usage is minimal: a few big strings of all words without delimiters, plus a small Vec of metadata. That totals the dictionary size + ~200 bytes (depending on the longest word's length) + the additional cost of some operations.

CLI

With spell.exe in %PATH% and words.txt in the same folder:

> spell funny wrd sjdkfhsdjfh
✅ funny
❓ wrd => wro wry word wad rd wird ord urd ward wd
❌ Wrong word 'sjdkfhsdjfh', no suggestions

Breakthroughs that led to this

Storing blobs of words and their metadata

Words of each length are stored in (optionally immutable) blobs, sorted by bytes.

Info about those blobs is stored alongside: their len and/or word count.

Pros:

  • Incredibly easy to iterate over
  • SIMD compatible
  • Highly parallelizable
  • Great cache locality (a shit ton of cache hits)
  • Words can be found with binary search in O(log n)
  • Working with bytes instead of chars
    • Supports any language
  • Others that I forgor

Cons:

  • Needs precise dataset
  • Adding words is pretty difficult without moving the whole Vec

Pros totally outweigh the Cons!
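
A minimal sketch of this layout, with illustrative names (LenGroup and its fields are not the crate's actual types):

```rust
/// One group of same-length words. Names are illustrative,
/// not SpelRight's actual types.
struct LenGroup {
    /// All words of length `len`, concatenated with no delimiters,
    /// sorted bytewise so the fixed-size chunks are in order.
    blob: Vec<u8>,
    /// Byte length of every word in this group.
    len: usize,
}

impl LenGroup {
    /// Binary search over fixed-size slices of the blob: O(log n),
    /// no pointer chasing, great cache locality.
    fn contains(&self, word: &[u8]) -> bool {
        debug_assert_eq!(word.len(), self.len);
        let (mut lo, mut hi) = (0, self.blob.len() / self.len);
        while lo < hi {
            let mid = (lo + hi) / 2;
            let candidate = &self.blob[mid * self.len..(mid + 1) * self.len];
            match candidate.cmp(word) {
                std::cmp::Ordering::Less => lo = mid + 1,
                std::cmp::Ordering::Greater => hi = mid,
                std::cmp::Ordering::Equal => return true,
            }
        }
        false
    }
}
```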

Specialized matching algorithm

When iterating over each LenGroup, the max difference lets us calculate the maximum number of deletions, insertions, and substitutions allowed for that group.

As an example:

Checking nothng (length group 6) against group 7 with a max difference of 2: the budget is 1 insertion and 1 (optional) substitution.

With one insertion, nothng reaches group 7's length, and the optional substitution lets it also match words that differ in one more position.

max_delete + max_insert + max_substitution will always sum to exactly max_dif.

This is multiple times faster than other distance-finding algorithms.
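
A minimal sketch of that budget calculation; the function name and signature are assumptions, not the crate's API:

```rust
/// Edit budget for checking a word of length `word_len` against a group
/// of words of length `group_len`, given a maximum distance `max_dif`.
/// Returns None when the group is out of reach and can be skipped.
/// (Hypothetical helper, not SpelRight's actual API.)
fn budget(word_len: usize, group_len: usize, max_dif: usize) -> Option<(usize, usize, usize)> {
    // The length gap must be covered entirely by insertions or deletions.
    let (max_insert, max_delete) = if group_len >= word_len {
        (group_len - word_len, 0)
    } else {
        (0, word_len - group_len)
    };
    let gap = max_insert + max_delete;
    if gap > max_dif {
        return None; // this LenGroup cannot match at all
    }
    // Whatever budget remains can be spent on substitutions,
    // so the three maxima always sum to exactly max_dif.
    Some((max_delete, max_insert, max_dif - gap))
}

fn main() {
    // "nothng" (len 6) vs group 7 with max_dif = 2:
    // exactly 1 insertion plus 1 optional substitution.
    assert_eq!(budget(6, 7, 2), Some((0, 1, 1)));
}
```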

Goals

  • Checking word correctness

  • Suggesting similar words

  • Adding new words

  • Support different languages

  • Full language support

    • Full ASCII support
    • Full UTF-8 support
      • Normalize some languages
      • Divide languages into those with pure-ASCII words, those needing possible normalization, and those containing UTF-8
    • Plugin
      • For everything
        • Default plugins
      • For especially complex languages
  • Make good CLI

    • Long-running server
    • Config
  • Make it fast

    Suggestions (12,500 words/s)

    • 100 words/s
    • 250 words/s
    • 1,000 words/s
    • 2,500 words/s
    • 10,000 words/s
    • 25,000 words/s
    • 100,000 words/s

    Loading (2.2 ms)

    • <200 ms
    • <100 ms
    • <50 ms
    • <20 ms
    • <10 ms
    • <5 ms
    • <3 ms
    • <2 ms (read_to_string alone takes more than 2 ms, not sure if even possible (nvm, after restarting my PC, it's less than 2 ms))
    • <1 ms (No idea how the fuck this could be possible, but hey, goals!)

Possible Optimizations

Hardware

  • Cache locality (dense blob of words)
  • SIMDeez nuts
    • Distance matching
    • Binary search (might be optimized by the compiler)
  • Parallelism
    • Rayon (a minimal sketch follows this list)
      • Test with and without
      • Auto-deciding between parallel and sequential
    • Manual
  • GPU Acceleration
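
A minimal sketch of the Rayon route; batch_par_check exists in the crate (see the benchmarks above), but this signature and the Dictionary type are assumptions, not the documented API:

```rust
use rayon::prelude::*;

/// Stand-in for the real dictionary type (hypothetical).
struct Dictionary;

impl Dictionary {
    fn contains(&self, _word: &str) -> bool {
        true // placeholder; the real check is the blob binary search above
    }
}

/// Check a batch of words in parallel; Rayon splits the slice
/// across its worker threads.
fn batch_par_check<'a>(dict: &Dictionary, words: &[&'a str]) -> Vec<(&'a str, bool)> {
    words
        .par_iter()
        .map(|&w| (w, dict.contains(w)))
        .collect()
}
```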

Memory usage

  • Blobs of words with no other symbols (i.e. no \n)
  • Storing minimal metadata about each word length
  • Storing first-letter offsets; size depends on the language, but is minimal overall

Total memory usage is pretty much minimal.

Reduce the number of words checked

  • Word length groups (depends on dataset)
  • For lengths at the max distance from the word's (no char changes allowed, only deletions)
    • Tracking first-letter offsets: check only the words whose first letter matches (a sketch follows this list)
  • For lengths that are the same as the word's (no char deletion or insertion, only substitution)
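
A hypothetical sketch of such a first-letter index for one length group, assuming a lowercase-ASCII alphabet; the LetterIndex name and layout are not the crate's types:

```rust
/// Hypothetical first-letter index for one length group (lowercase ASCII).
/// offsets[i] is the index of the first word starting with b'a' + i;
/// offsets[26] is the total word count (a sentinel).
struct LetterIndex {
    offsets: [u32; 27],
}

impl LetterIndex {
    /// The range of word indices sharing `first_byte`, so a search
    /// only has to look at that sub-range of the blob.
    fn range_for(&self, first_byte: u8) -> std::ops::Range<usize> {
        let i = (first_byte - b'a') as usize;
        self.offsets[i] as usize..self.offsets[i + 1] as usize
    }
}
```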

Caching

  • Frequently made mistakes

Loading

[!NOTE]

read_to_string of 370,000 words (~4 MB) takes about 2 ms on my machine.
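
A quick way to reproduce that measurement (the words.txt file name is an assumption):

```rust
use std::{fs, time::Instant};

fn main() -> std::io::Result<()> {
    let start = Instant::now();
    let text = fs::read_to_string("words.txt")?; // the dictionary file
    println!("read {} bytes in {:?}", text.len(), start.elapsed());
    Ok(())
}
```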

  • Reduce parsing by pre-parsing the dataset (see Better dataset)

Better dataset

  • Reduce the number of words; most words never appear in an average text
  • Store offsets, no unnecessary \n
  • Store first letters offsets

[!NOTE]

This makes it harder to work with the dataset manually.

Better algorithms

  • Custom
    • See Breakthroughs that led to this
