Unicode Segmentation in JavaScript

Unicode defines a grapheme segmentation algorithm to find the boundaries between graphemes. It also defines algorithms for finding boundaries between words and sentences, which CLDR tailors per locale. These boundaries are useful, for example, when implementing a text editor with commands for jumping between or highlighting words and sentences. The Intl.Segmenter API lets us segment strings along these boundaries with ease. Let us look at a few examples!

let segmenter = new Intl.Segmenter("kn-IN", {granularity: "word"});
let input = "ಆವು ಈವಿನ ನಾವು ನೀವಿಗೆ ಆನು ತಾನದ ತನನನಾ";
let segments = segmenter.segment(input);

new Intl.Segmenter(locale, options)

Creates a new locale-dependent Segmenter. In the example above the locale is kn-IN, which is the Kannada language as used in India. If options is provided, it is treated as an object and its granularity property specifies the segmenter granularity ("grapheme", "word", or "sentence", defaulting to "grapheme").
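As a minimal sketch of the default granularity (the variable names here are illustrative and not taken from the proposal), omitting options falls back to "grapheme", which can be checked with resolvedOptions(). The same segmenter then splits a word into user-perceived characters rather than code units:

```javascript
// Sketch: no options object, so granularity falls back to "grapheme".
const graphemeSegmenter = new Intl.Segmenter("kn-IN");
const granularity = graphemeSegmenter.resolvedOptions().granularity;
console.log(granularity); // "grapheme"

// One word from the example sentence. Vowel signs attach to the
// preceding consonant, so there are fewer graphemes than code units.
const word = "ನೀವಿಗೆ";
const graphemes = Array.from(
  graphemeSegmenter.segment(word),
  (data) => data.segment
);
console.log(word.length, graphemes.length);
```

This assumes a runtime with full ICU data, which current browsers and Node.js builds normally ship.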

Intl.Segmenter.prototype.segment(string)

Creates a new iterable %Segments% instance for the input string using the Segmenter's locale and granularity.

Segment data

Segments are described by plain objects with the following data properties:

segment is the string segment.
index is the code unit index in the string at which the segment begins.
input is the string being segmented.
isWordLike is true when granularity is "word" and the segment is word-like (consisting of letters/numbers/ideographs/etc.), false when granularity is "word" and the segment is not word-like (consisting of spaces/punctuation/etc.), and undefined when granularity is not "word".

We can iterate through the segments as follows:

for (let {segment, index, isWordLike} of segments) {
  console.log("segment at code units [%d, %d): «%s»%s",
    index, index + segment.length,
    segment,
    isWordLike ? " (word-like)" : ""
  );
}

It would log:

segment at code units [0, 3): «ಆವು» (word-like)
segment at code units [3, 4): « »
segment at code units [4, 8): «ಈವಿನ» (word-like)
segment at code units [8, 9): « »
segment at code units [9, 13): «ನಾವು» (word-like)
segment at code units [13, 14): « »
segment at code units [14, 20): «ನೀವಿಗೆ» (word-like)
segment at code units [20, 21): « »
segment at code units [21, 24): «ಆನು» (word-like)
segment at code units [24, 25): « »
segment at code units [25, 29): «ತಾನದ» (word-like)
segment at code units [29, 30): « »
segment at code units [30, 35): «ತನನನಾ» (word-like)
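
Building on the loop above, the isWordLike flag makes it easy to drop the spaces and keep only the words. The following is a small sketch; the helper name wordsOf is hypothetical and not part of the API:

```javascript
// Hypothetical helper: returns the word-like segments of `text`
// for the given locale, skipping spaces and punctuation.
function wordsOf(text, locale) {
  const wordSegmenter = new Intl.Segmenter(locale, { granularity: "word" });
  const found = [];
  for (const { segment, isWordLike } of wordSegmenter.segment(text)) {
    if (isWordLike) found.push(segment);
  }
  return found;
}

const sentence = "ಆವು ಈವಿನ ನಾವು ನೀವಿಗೆ ಆನು ತಾನದ ತನನನಾ";
const words = wordsOf(sentence, "kn-IN");
console.log(words.length); // 7, one per word-like segment in the log above
```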

The %Segments% prototype has a containing method and a [Symbol.iterator] method, and the segment iterator that the latter returns has a next method on its prototype.

segments.containing(5); 

/*
Would output something like:
{
    segment: 'ಈವಿನ', 
    index: 4, 
    input: 'ಆವು ಈವಿನ ನಾವು ನೀವಿಗೆ ಆನು ತಾನದ ತನನನಾ', 
    isWordLike: true
}
*/
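
To round this off, here is a short sketch of both methods on a shortened version of the sample sentence (the variable names are illustrative). It shows that containing returns undefined for an index outside the string, and that the iterator can be driven by hand instead of with for...of:

```javascript
const wordSegmenter = new Intl.Segmenter("kn-IN", { granularity: "word" });
const text = "ಆವು ಈವಿನ ನಾವು";
const segs = wordSegmenter.segment(text);

// containing(i) returns the segment data covering code unit i,
// or undefined when i falls outside the string.
const hit = segs.containing(5);
const miss = segs.containing(text.length);
console.log(hit.segment, hit.index); // the second word, which starts at index 4
console.log(miss); // undefined

// Calling next() directly yields { value, done } pairs.
const iterator = segs[Symbol.iterator]();
const first = iterator.next();
console.log(first.done, first.value.segment); // false, then the first word
```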

P.S. Don't forget to read the FAQs in the proposal document.

Feel free to share this article. You can also ping me on Twitter.

Published 25 Oct 2021