Unicode Segmentation in JavaScript

Unicode defines a grapheme segmentation algorithm to find the boundaries between graphemes (user-perceived characters). Unicode also defines algorithms for finding boundaries between words and sentences, which CLDR tailors by locale. These boundaries are useful, for example, when implementing a text editor that has commands for jumping between or highlighting words and sentences. The Intl.Segmenter API exposes these algorithms so we can segment strings with ease!

Example

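// Word-level segmenter for Kannada as used in India.
let segmenter = new Intl.Segmenter("kn-IN", {granularity: "word"});
let input = "ಆವು ಈವಿನ ನಾವು ನೀವಿಗೆ ಆನು ತಾನದ ತನನನಾ";
// segment() does not return an array; it returns an iterable Segments object.
let segments = segmenter.segment(input);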

`new Intl.Segmenter(locale, options)`

Creates a new locale-dependent Segmenter. In this case the locale is kn-IN, which is Kannada as used in India. If options is provided, it is treated as an object and its granularity property specifies the segmenter granularity (“grapheme”, “word”, or “sentence”, defaulting to “grapheme”).
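
To see what the default granularity means in practice, here is a minimal sketch that counts graphemes instead of code units. The helper name countGraphemes is made up for this example and is not part of the API.

```javascript
// Hypothetical helper (not part of Intl.Segmenter): counts grapheme clusters.
function countGraphemes(text, locale = "en") {
  // No options object is passed, so granularity defaults to "grapheme".
  const graphemeSegmenter = new Intl.Segmenter(locale);
  let count = 0;
  for (const _ of graphemeSegmenter.segment(text)) count++;
  return count;
}

const accented = "e\u0301"; // "e" followed by a combining acute accent
console.log(accented.length);          // 2 code units
console.log(countGraphemes(accented)); // 1 grapheme

// resolvedOptions() reports the granularity actually in use.
console.log(new Intl.Segmenter("kn-IN").resolvedOptions().granularity); // "grapheme"
```

The string's length property counts UTF-16 code units, which is why it disagrees with the grapheme count as soon as combining marks are involved.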

`Intl.Segmenter.prototype.segment(string)`

Creates a new Iterable %Segments% instance for the input string using the Segmenter’s locale and granularity.
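
Because the result is iterable, it can be spread into an array. The sketch below assumes an English sentence segmenter and a sample string made up for illustration; the exact boundaries come from the engine's locale data, so treat the split as indicative.

```javascript
const sentenceSegmenter = new Intl.Segmenter("en", { granularity: "sentence" });
const sampleText = "Hello world. How are you?"; // sample string made up for this sketch

// Spreading runs the iterator to completion and collects the segment data objects.
const sentences = [...sentenceSegmenter.segment(sampleText)].map((s) => s.segment);
console.log(sentences);

// Segments cover the input without gaps, so joining them rebuilds the original string.
console.log(sentences.join("") === sampleText); // true
```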

Segment Data

Segments are described by plain objects with the following data properties:

segment is the string segment.
index is the code unit index in the string at which the segment begins.
input is the string being segmented.
isWordLike is true when granularity is “word” and the segment is word-like (consisting of letters/numbers/ideographs/etc.), false when granularity is “word” and the segment is not word-like (consisting of spaces/punctuation/etc.), and undefined when granularity is not “word”.
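
A common use of isWordLike is dropping spaces and punctuation. This is a minimal sketch; extractWords is a hypothetical helper name, not something the API provides.

```javascript
// Hypothetical helper: returns only the word-like segments of a string.
function extractWords(text, locale) {
  const wordSegmenter = new Intl.Segmenter(locale, { granularity: "word" });
  return [...wordSegmenter.segment(text)]
    .filter((s) => s.isWordLike)
    .map((s) => s.segment);
}

console.log(extractWords("Hello, world!", "en")); // [ 'Hello', 'world' ]

// With any other granularity, isWordLike is undefined.
const [firstGrapheme] = new Intl.Segmenter("en").segment("Hi");
console.log(firstGrapheme.isWordLike); // undefined
```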

Iterating Through Segments

for (let {segment, index, isWordLike} of segments) {
  console.log("segment at code units [%d, %d): «%s»%s",
    index, index + segment.length,
    segment,
    isWordLike ? " (word-like)" : ""
  );
}

Output:

segment at code units [0, 3): «ಆವು» (word-like)
segment at code units [3, 4): « »
segment at code units [4, 8): «ಈವಿನ» (word-like)
segment at code units [8, 9): « »
segment at code units [9, 13): «ನಾವು» (word-like)
segment at code units [13, 14): « »
segment at code units [14, 20): «ನೀವಿಗೆ» (word-like)
segment at code units [20, 21): « »
segment at code units [21, 24): «ಆನು» (word-like)
segment at code units [24, 25): « »
segment at code units [25, 29): «ತಾನದ» (word-like)
segment at code units [29, 30): « »
segment at code units [30, 35): «ತನನನಾ» (word-like)

Segments Prototype Methods

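The %Segments% object returned by segment() has two methods on its prototype: containing(index), which returns the segment data for the segment that contains the code unit at the given index (or undefined if the index is out of bounds), and [Symbol.iterator](), which returns a segment iterator. That iterator, in turn, has a next() method on its prototype.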

segments.containing(5);
/*
Would output something like:
{
  segment: 'ಈವಿನ',
  index: 4,
  input: 'ಆವು ಈವಿನ ನಾವು ನೀವಿಗೆ ಆನು ತಾನದ ತನನನಾ',
  isWordLike: true
}
*/
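
For the text editor use case mentioned earlier, containing() is enough to find the word under a cursor. The following is only a sketch: wordRangeAt is a hypothetical helper, and the English sample string is an assumption chosen to keep the indexes easy to follow.

```javascript
// Hypothetical editor helper: the [start, end) code unit range of the word under a cursor.
function wordRangeAt(text, cursorIndex, locale = "en") {
  const editorSegmenter = new Intl.Segmenter(locale, { granularity: "word" });
  const hit = editorSegmenter.segment(text).containing(cursorIndex);
  // containing() returns undefined when the index is outside the string.
  if (hit === undefined || !hit.isWordLike) return null;
  return { start: hit.index, end: hit.index + hit.segment.length };
}

console.log(wordRangeAt("Hello, world!", 8));  // { start: 7, end: 12 }
console.log(wordRangeAt("Hello, world!", 5));  // null (the comma is not word-like)
console.log(wordRangeAt("Hello, world!", 99)); // null (out of range)

// The iterator can also be driven by hand with next().
const manualIterator = new Intl.Segmenter("en", { granularity: "word" })
  .segment("Hi there")[Symbol.iterator]();
const firstResult = manualIterator.next();
console.log(firstResult.done, firstResult.value.segment); // false 'Hi'
```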

P.S.: Don’t forget to read the FAQ in the proposal document.
