This library enables the identification, parsing, and generation of hashtags within text. It defines a syntax that supports character escaping, delimiters, and Unicode processing.
Two distinct hashtag formats are recognized: unwrapped and wrapped.
The unwrapped format begins with a number sign (#) followed
immediately by properly encoded Unicode text (including emojis and
surrogate pairs), terminating at spaces or specific punctuation
characters.
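One way to read "properly encoded" is that every high surrogate is immediately followed by a low surrogate and no low surrogate stands alone. The following is a minimal sketch of such a check under that assumption; `hasLoneSurrogate` is a hypothetical helper for illustration and is not part of the library's API.

```typescript
// Hypothetical helper (not part of the library): reports whether a string
// contains a surrogate code unit that is not part of a valid pair.
function hasLoneSurrogate(input: string): boolean {
  for (let i = 0; i < input.length; i++) {
    const unit = input.charCodeAt(i);
    if (unit >= 0xd800 && unit <= 0xdbff) {
      // High surrogate: it must be followed by a low surrogate.
      const next = i + 1 < input.length ? input.charCodeAt(i + 1) : 0;
      if (next < 0xdc00 || next > 0xdfff) return true;
      i++; // skip the low half of a valid pair
    } else if (unit >= 0xdc00 && unit <= 0xdfff) {
      // Low surrogate with no preceding high surrogate.
      return true;
    }
  }
  return false;
}
```

A well-formed emoji such as the one in `"#\u{1F600}"` passes the check, while a string ending in a bare `"\ud83d"` does not.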
Punctuation behavior depends on a spacing strategy. The scanner distinguishes between trailing punctuation (typically followed by a space) and none punctuation (commonly used without a trailing space). Some systems also use "surrounding" punctuation (space before and after), but it does not apply here because whitespace always breaks unwrapped hashtags (unless escaped).
With trailing punctuation, the rule is as follows. If the scanner
reaches one of these punctuation characters (and it is not escaped), and
the next character is a strong-terminator, another punctuation-char, or
end of input (EOI), the punctuation is treated as closing punctuation:
the hashtag ends before it, and the punctuation is not part of the tag.
If any other character follows, the punctuation may remain inside the
hashtag as a continuation (e.g., #v1.0).
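The trailing rule can be sketched as a one-character lookahead. This is an illustration only, not the library's implementation: `closesHashtag` and `isStrongTerminator` are hypothetical names, and the punctuation set is reduced to an ASCII subset for brevity.

```typescript
// Illustrative ASCII subset of the trailing-punctuation set.
const TRAILING_SUBSET = new Set([".", ",", "!", "?", ";", ":"]);

// Strong terminators: ASCII controls, space, and C1 controls.
function isStrongTerminator(cp: number): boolean {
  return cp <= 0x20 || cp === 0x7f || (cp >= 0x80 && cp <= 0x9f);
}

// Hypothetical helper: decides whether the punctuation at `index` closes an
// unwrapped hashtag. Assumes input[index] is an unescaped, single-code-unit
// punctuation character from the subset above.
function closesHashtag(input: string, index: number): boolean {
  const next = input.codePointAt(index + 1);
  if (next === undefined) return true; // end of input
  if (isStrongTerminator(next)) return true;
  // Another punctuation character also closes (only the subset is known here).
  return TRAILING_SUBSET.has(String.fromCodePoint(next));
}
```

Under this sketch, the full stop in `#v1.0` is a continuation because it is followed by `0`, while the comma in `#tag, next` closes the tag because it is followed by a space.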
With none punctuation, the idea is: you do not use "is the next character whitespace?" as a signal, because in those writing systems the punctuation is commonly written without a following space. So the scanner treats the punctuation as closing under the same continuation rule but without assuming a trailing space.
A reversed solidus (\) allows the inclusion of spaces, punctuation
characters, a literal less-than sign (<), and itself within an
unwrapped hashtag. For example, #this\ is\ example yields
this is example. Since a less-than sign following the number sign
(#<) initiates a wrapped hashtag, a sequence like #<example is not a
valid hashtag if the closing bracket is missing (it produces no match);
to include a literal less-than sign at the start, it must be escaped
(#\<example). In the unwrapped format, the resulting hashtag text must
contain at least one valid character, so a number sign followed by a
space (# ) or by a punctuation character and a space (e.g., #. ) is not
a valid unwrapped hashtag.
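Because each backslash applies only to the character immediately after it, removing escapes from a raw payload can be sketched with a single replacement. `stripEscapes` below is a hypothetical helper written for illustration; it is not the library's `unescapeHashtagText`, and leaving a trailing lone backslash untouched is an assumption of this sketch.

```typescript
// Hypothetical helper: drops each escaping backslash and keeps the code unit
// that follows it. A trailing lone backslash is left as-is (assumption).
function stripEscapes(rawText: string): string {
  return rawText.replace(/\\([\s\S])/g, "$1");
}

const spaced = stripEscapes("this\\ is\\ example"); // "this is example"
const leadingLt = stripEscapes("\\<example");       // "<example"
const literalSlash = stripEscapes("a\\\\b");        // "a\\b" (one backslash)
```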
The wrapped format encloses hashtag text between a less-than sign (<)
and a greater-than sign (>). This format allows spaces and special
characters within the hashtag. The less-than sign is valid without
escaping inside the brackets. The greater-than sign and the reversed
solidus must be preceded by a reversed solidus to be interpreted as
text. The less-than sign may be escaped (\<) and is treated as a
literal <; createHashtag() will escape < when producing wrapped
hashtags. Therefore, #<<example> is valid and equivalent to
#<\<example>, both producing the text <example. As with the unwrapped
format, the resulting hashtag text must contain at least one valid
character, so #<> is not a valid hashtag.
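The escaping rules for the wrapped format can be sketched as follows. `wrapHashtag` is a hypothetical helper, not the library's `createHashtag()`: it always emits the wrapped form, escapes `\`, `>` and `<`, and performs no validation (for example, it does not reject empty text).

```typescript
// Hypothetical helper: produces a wrapped hashtag by escaping the reversed
// solidus, the greater-than sign, and (optionally valid) the less-than sign.
function wrapHashtag(text: string): string {
  const escaped = text.replace(/[\\<>]/g, "\\$&");
  return `#<${escaped}>`;
}

const withLt = wrapHashtag("<example"); // "#<\\<example>"
const withGt = wrapHashtag("a>b");      // "#<a\\>b>"
```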
Wrapped hashtags may span multiple lines. Line breaks (\n, \r, or
\r\n) in wrapped hashtags are normalized to a single space character,
and any horizontal whitespace (HTAB or SP) immediately following the
line break is ignored.
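The normalization described above can be sketched with one regular expression. `normalizeWrappedLineBreaks` is a hypothetical name used for illustration; the sketch operates on an already extracted payload and does not consider escapes.

```typescript
// Hypothetical helper: replaces each line break (\r\n, \r, or \n) together
// with any HTAB/SP that immediately follows it by a single space.
function normalizeWrappedLineBreaks(payload: string): string {
  return payload.replace(/(?:\r\n|\r|\n)[\t ]*/g, " ");
}

const indented = normalizeWrappedLineBreaks("hello\n   world"); // "hello world"
const crlfTab = normalizeWrappedLineBreaks("a\r\n\tb");         // "a b"
```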
This ABNF grammar is structural and normative and MUST be implemented together with the normative semantic rules (see APPENDIX B).
hashtag = wrapped-hashtag
/ unwrapped-hashtag
unwrapped-hashtag = unescaped-hash unwrapped-text
; see APPENDIX B.4
unwrapped-text = 1*unwrapped-char
unwrapped-char = escape-pair
/ punctuation-continuation
/ unwrapped-regular-char
escape-pair = backslash non-linebreak
; a backslash followed by a line break terminates the hashtag.
punctuation-continuation = punctuation-char non-terminator
; see APPENDIX B.3
unwrapped-regular-char = non-terminator
non-terminator = scalar
- strong-terminator
- hash-sign
- backslash
- punctuation-char
wrapped-hashtag = unescaped-hash lt-sign wrapped-text gt-sign
; see APPENDIX B.2 and B.4
wrapped-text = 1*wrapped-char
wrapped-char = escape-any
/ wrapped-regular-char
escape-any = backslash scalar
; see APPENDIX B.2
wrapped-regular-char = scalar
- gt-sign
- backslash
punctuation-char = punctuation-trailing ; see APPENDIX A.2
/ punctuation-none ; ditto
See APPENDIX B.
The parser is a deterministic linear-time scanner. It traverses the input in a single pass, using a finite state machine (FSM) to handle delimiter detection and Unicode surrogate pairs. Scanning takes O(n) time and O(1) auxiliary space; returned values allocate in proportion to the number and size of matches.
The syntax is awkward to express with standard regular expressions.
Determining if a delimiter is escaped requires tracking the parity (even
or odd count) of preceding backslashes, which is cumbersome to encode in
a single portable pattern. This parity check is used to determine
whether a # is escaped; individual escape sequences apply only to the
immediately following character (they do not "span" beyond that
character). Furthermore, the grammar requires conditional lookahead to
validate punctuation characters within unwrapped tags.
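The parity check can be sketched by counting the backslashes immediately before a position. `isEscapedAt` is a hypothetical helper for illustration, not a function exported by the library.

```typescript
// Hypothetical helper: a character is escaped when it is preceded by an odd
// number of immediately adjacent backslashes.
function isEscapedAt(input: string, index: number): boolean {
  let count = 0;
  for (let i = index - 1; i >= 0 && input[i] === "\\"; i--) {
    count++;
  }
  return count % 2 === 1;
}

const plainHash = isEscapedAt("#tag", 0);       // false: zero backslashes
const escapedHash = isEscapedAt("\\#tag", 1);   // true: one backslash
const doubleSlash = isEscapedAt("\\\\#tag", 2); // false: two backslashes
```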
Unwrapped hashtags treat certain punctuation characters as closing only
when the punctuation is encountered (and is not escaped) and the next
character is a strong-terminator, a punctuation-char, or EOI; this
matches common spacing rules for Latin, Cyrillic, Greek, Hebrew, Indic
scripts, Arabic, Persian, Urdu, Armenian, Ethiopic, and Georgian (e.g.,
#tag, yields tag, but #v1.0 keeps the . because it is followed
by 0). For scripts where punctuation is commonly written without a
trailing space (Chinese, Japanese, Korean and Tibetan), the parser must
not rely on trailing whitespace; those punctuation characters are
treated as closing under the same continuation rule without assuming a
trailing space.
type PunctuationStrategyCode = 0 | 1;
Controls how a punctuation code point behaves in unwrapped hashtags:
0 = trailing
1 = none
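As an illustration of how such codes might be looked up, the sketch below builds a small table keyed by code point. That the keys are code points is an assumption based on the description above; `strategyByCodePoint` and `strategyName` are hypothetical names, and only a few entries are shown.

```typescript
type StrategyCode = 0 | 1; // 0 = trailing, 1 = none

// Hypothetical table keyed by Unicode code point (assumption).
const strategyByCodePoint: Record<number, StrategyCode> = {
  0x002e: 0, // FULL STOP
  0x002c: 0, // COMMA
  0x3002: 1, // IDEOGRAPHIC FULL STOP
  0x3001: 1, // IDEOGRAPHIC COMMA
};

// Returns the strategy name, or undefined for non-punctuation code points.
function strategyName(cp: number): "trailing" | "none" | undefined {
  const code: StrategyCode | undefined = strategyByCodePoint[cp];
  if (code === undefined) return undefined;
  return code === 0 ? "trailing" : "none";
}
```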
type PunctuationStrategyCodeConfig = Record<number, PunctuationStrategyCode>;
const punctuationStrategyCode: PunctuationStrategyCodeConfig;
type HashtagType = 'unwrapped' | 'wrapped';
type HashtagMatch = {
type: HashtagType;
start: number;
end: number;
raw: string;
rawText: string;
text: string;
};
Represents a parsed hashtag found in the input string.
start and end are UTF-16 indices into the input; end is exclusive.
raw is the full matched token, including the leading # and, for
wrapped hashtags, the surrounding < and >.
rawText is the payload as it appears in the token, with escapes still
present and without any wrappers.
text is the unescaped payload. For wrapped hashtags, line breaks are
normalized to a single space and any horizontal whitespace immediately
following the line break is ignored.
Malformed surrogate code units are rejected inside hashtags.
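To make the field descriptions concrete, the following hand-constructed value shows what a match could look like for a wrapped hashtag that follows an emoji. It is derived from the field definitions above, not captured from the library; `sampleInput` and `sampleMatch` are illustrative names.

```typescript
type HashtagType = "unwrapped" | "wrapped";
type HashtagMatch = {
  type: HashtagType;
  start: number;
  end: number;
  raw: string;
  rawText: string;
  text: string;
};

// The emoji occupies two UTF-16 code units, so "#" sits at index 3.
const sampleInput = "\u{1F600} #<a\\>b> ok";

// Hand-built illustration of the fields (assumed, not library output).
const sampleMatch: HashtagMatch = {
  type: "wrapped",
  start: 3,         // UTF-16 index of "#"
  end: 10,          // exclusive
  raw: "#<a\\>b>",  // full token, including # and the wrappers
  rawText: "a\\>b", // payload with the escape still present
  text: "a>b",      // unescaped payload
};
```

With these values, `sampleInput.slice(sampleMatch.start, sampleMatch.end)` yields `raw`.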
type HashtagPatternOptions = {
type?: HashtagType | 'any';
global?: boolean;
sticky?: boolean;
capture?: 'rawText' | 'text';
};
type HashtagPattern = {
source: string;
flags: string;
lastIndex: number;
exec(input: string): RegExpExecArray | null;
test(input: string): boolean;
reset(): void;
execMatch(input: string): HashtagMatch | null;
matchAll(input: string): IterableIterator<RegExpExecArray>;
matchAllMatches(input: string): IterableIterator<HashtagMatch>;
};
function hashtagPattern(options?: HashtagPatternOptions): HashtagPattern;
Creates a RegExp-like matcher.
By default, global is false, so exec() finds at most one match,
like JavaScript RegExp.
The type option selects what to match. With 'wrapped' or
'unwrapped', only that form is matched. With 'any', both forms are
matched, and the match also reports which form was found.
The shape of the exec() result depends on type. For 'wrapped' and
'unwrapped', the result is [full, payload]. For 'any', the result
is [full, payload, type], where type is 'wrapped' | 'unwrapped'.
The captured payload is rawText by default; set capture: 'text' to
capture the unescaped payload instead.
If sticky is true, exec() only accepts a match that starts exactly
at lastIndex.
lastIndex is always coerced to a non-negative integer. If it is
greater than the input length, exec() returns null and, when
global or sticky is enabled, resets lastIndex to 0. When
global or sticky is enabled, any failed exec() also resets
lastIndex to 0.
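Since the matcher is described as RegExp-like, the native JavaScript `RegExp` gives a useful point of comparison for the `lastIndex` rules. The sketch below uses only native regular expressions (the pattern `/#\w+/` is a stand-in, not the library's matcher) to show the reset on a failed `exec()` and the effect of the sticky flag.

```typescript
const sampleText = "a #b";

// Global: exec() advances lastIndex on success and resets it on failure.
const globalRe = /#\w+/g;
const firstHit = globalRe.exec(sampleText);   // matches "#b"
const indexAfterHit = globalRe.lastIndex;     // 4
const secondTry = globalRe.exec(sampleText);  // null, nothing left to match
const indexAfterMiss = globalRe.lastIndex;    // 0

// Sticky: a match must start exactly at lastIndex.
const stickyRe = /#\w+/y;
stickyRe.lastIndex = 0;
const stickyMiss = stickyRe.exec(sampleText); // null, "a" is at index 0
stickyRe.lastIndex = 2;
const stickyHit = stickyRe.exec(sampleText);  // matches "#b" at index 2
```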
const hashtag: HashtagPattern;
const wrappedHashtag: HashtagPattern;
const unwrappedHashtag: HashtagPattern;
These are equivalent to:
hashtagPattern({ type: 'any' });
hashtagPattern({ type: 'wrapped' });
hashtagPattern({ type: 'unwrapped' });
type FindOptions = {
type?: HashtagType | 'any';
fromIndex?: number;
};
function findFirstHashtag(
input: string,
options?: FindOptions,
): HashtagMatch | null;
function findAllHashtags(
input: string,
options?: FindOptions,
): HashtagMatch[];
function iterateHashtags(
input: string,
options?: FindOptions,
): IterableIterator<HashtagMatch>;
These helpers operate as a thin layer on top of hashtagPattern with
global: true and use fromIndex to initialize the scan position.
fromIndex is coerced to a non-negative integer.
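The "thin layer" idea can be sketched generically: set the scan position, then call `exec()` until it fails. The code below is an assumption-laden illustration, not the library's source. `ExecLike`, `toIndex` and `iterateWith` are hypothetical names, and a native global `RegExp` stands in for the hashtag pattern.

```typescript
// Minimal structural type shared by RegExp and RegExp-like matchers.
interface ExecLike {
  lastIndex: number;
  exec(input: string): RegExpExecArray | null;
}

// Coerces a value to a non-negative integer (non-finite values become 0).
function toIndex(value: number | undefined): number {
  const n = Math.trunc(Number(value));
  return Number.isFinite(n) && n > 0 ? n : 0;
}

// Yields successive matches starting at fromIndex.
function* iterateWith(
  pattern: ExecLike,
  input: string,
  fromIndex?: number,
): IterableIterator<RegExpExecArray> {
  pattern.lastIndex = toIndex(fromIndex);
  let found: RegExpExecArray | null;
  while ((found = pattern.exec(input)) !== null) {
    yield found;
    // Guard against an endless loop on empty matches.
    if (found[0].length === 0) pattern.lastIndex++;
  }
}

// Stand-in pattern; skips the tag that starts at index 0.
const laterTags = Array.from(iterateWith(/#\w+/g, "#a #b #c", 1), (m) => m[0]);
```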
function createHashtag(text: string): string
Generates a hashtag string from the provided text, automatically selecting the wrapped or unwrapped format based on the content. If the input contains malformed surrogate code units, an empty string is returned.
createHashtag("hello world");
createHashtag("simple");
function unescapeHashtagText(text: string): string
Removes escape backslashes from a raw hashtag payload, returning the clean text content.
unescapeHashtagText("foo\\ bar");
hash-sign = "#"
backslash = "\"
lt-sign = "<"
gt-sign = ">"
unescaped-hash = hash-sign
linebreak = CR
/ LF
ascii-ctl = CTL
c1-ctl = %x80-9F
strong-terminator = ascii-ctl
/ SP
/ c1-ctl
non-linebreak = scalar - linebreak
scalar = %x00-D7FF
/ %xE000-10FFFF
; Unicode scalar values (surrogates excluded)
Punctuation treated as closing only when followed by
strong-terminator, punctuation-char, or EOI; otherwise it may
continue.
punctuation-trailing = "." ; FULL STOP
/ "," ; COMMA
/ "!" ; EXCLAMATION MARK
/ "?" ; QUESTION MARK
/ ";" ; SEMICOLON
/ ":" ; COLON
/ %x00B7 ; MIDDLE DOT
/ %x0964 ; DEVANAGARI DANDA
/ %x0965 ; DEVANAGARI DOUBLE DANDA
/ %x060C ; ARABIC COMMA
/ %x061B ; ARABIC SEMICOLON
/ %x061F ; ARABIC QUESTION MARK
/ %x06D4 ; ARABIC FULL STOP
/ %x0589 ; ARMENIAN FULL STOP
/ %x055B ; ARMENIAN EMPHASIS MARK
/ %x055C ; ARMENIAN EXCLAMATION MARK
/ %x055E ; ARMENIAN QUESTION MARK
/ %x1361 ; ETHIOPIC WORDSPACE
/ %x1362 ; ETHIOPIC FULL STOP
/ %x1363 ; ETHIOPIC COMMA
/ %x1364 ; ETHIOPIC SEMICOLON
/ %x1365 ; ETHIOPIC COLON
/ %x10FB ; GEORGIAN PARAGRAPH SEPARATOR
Punctuation treated as closing without relying on trailing whitespace.
punctuation-none = %x0F0D ; TIBETAN MARK SHAD
/ %x0F0E ; TIBETAN MARK NYIS SHAD
/ %x3002 ; IDEOGRAPHIC FULL STOP
/ %x3001 ; IDEOGRAPHIC COMMA
/ %xFF0C ; FULLWIDTH COMMA
/ %xFF1F ; FULLWIDTH QUESTION MARK
/ %xFF01 ; FULLWIDTH EXCLAMATION MARK
/ %xFF1B ; FULLWIDTH SEMICOLON
/ %xFF1A ; FULLWIDTH COLON
/ %x30FB ; KATAKANA MIDDLE DOT
/ %xFF0E ; FULLWIDTH FULL STOP
Normative semantic rules MUST be applied in addition to the normative grammar.
A # begins a hashtag only if it is preceded by an even number of \
code points immediately adjacent to it (including zero).
In wrapped hashtags, an unescaped > closes the wrapped text. An
escaped \> is literal payload.
Additionally, wrapped hashtags normalize line breaks in the payload to a
single space character, and any horizontal whitespace (HTAB or SP)
immediately following the line break is ignored.
For punctuation-trailing code points, the punctuation is treated as
closing iff the next code point is a strong-terminator or another
punctuation-char or EOI. Otherwise the punctuation may remain inside
the hashtag as a continuation.
For punctuation-none code points, the punctuation is always treated as
closing.
After an unescaped-hash, if the next code point is an unescaped
lt-sign, the hashtag MUST be parsed as a wrapped hashtag;
otherwise it MUST be parsed as an unwrapped hashtag.
A missing closing gt-sign makes the wrapped hashtag invalid (no
match).
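The wrapped-format rules above (dispatch on `#<`, close on an unescaped `>`, no match when the closing bracket is missing or the payload is empty) can be sketched as a small scanner. `scanWrapped` is a hypothetical function for illustration; it does not check whether the `#` itself is escaped and does not validate surrogates.

```typescript
// Hypothetical helper: returns the exclusive end index of a wrapped hashtag
// that starts at hashIndex, or -1 when there is no valid match.
function scanWrapped(input: string, hashIndex: number): number {
  if (input[hashIndex] !== "#" || input[hashIndex + 1] !== "<") return -1;
  let i = hashIndex + 2;
  let payloadUnits = 0; // payload size in code units, used only for "non-empty"
  while (i < input.length) {
    const ch = input[i];
    if (ch === "\\") {
      if (i + 1 >= input.length) return -1; // dangling backslash, never closed
      i += 2; // the escaped character is literal payload
      payloadUnits++;
      continue;
    }
    if (ch === ">") return payloadUnits > 0 ? i + 1 : -1;
    i++;
    payloadUnits++;
  }
  return -1; // missing closing >
}

const closedEnd = scanWrapped("#<a b>", 0);     // 6
const unclosedEnd = scanWrapped("#<example", 0); // -1
```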
MIT