Unicode defines a grapheme segmentation algorithm to find the boundaries between graphemes. Unicode also defines algorithms for finding the boundaries between words and between sentences, which CLDR tailors based on locale settings. These boundaries are useful, for example, when implementing a text editor that has commands for jumping between or highlighting words and sentences. The Intl.Segmenter API exposes these algorithms so that we can segment strings with ease!
Example
let segmenter = new Intl.Segmenter("kn-IN", {granularity: "word"});
let input = "ಆವು ಈವಿನ ನಾವು ನೀವಿಗೆ ಆನು ತಾನದ ತನನನಾ";
let segments = segmenter.segment(input);
new Intl.Segmenter(locale, options)
Creates a new locale-dependent Segmenter. In this case the locale is kn-IN which is for the language Kannada. If options is provided, it is treated as an object and its granularity property specifies the segmenter granularity (“grapheme”, “word”, or “sentence”, defaulting to “grapheme”).
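As a minimal sketch of how the options behave, the snippet below builds two segmenters and reads back their settings with resolvedOptions(). The locale "en" and the variable names are assumptions chosen for illustration; they are not part of the example above.

```javascript
// Illustrative only: "en" and these variable names are assumptions.
const defaultSegmenter = new Intl.Segmenter("en");
const sentenceSegmenter = new Intl.Segmenter("en", { granularity: "sentence" });

// resolvedOptions() reports the settings the segmenter actually uses.
const defaultGranularity = defaultSegmenter.resolvedOptions().granularity;
const sentenceGranularity = sentenceSegmenter.resolvedOptions().granularity;
console.log(defaultGranularity);  // "grapheme"
console.log(sentenceGranularity); // "sentence"

// A granularity outside the three allowed values is rejected with a RangeError.
let rejectedInvalidGranularity = false;
try {
  new Intl.Segmenter("en", { granularity: "paragraph" });
} catch (error) {
  rejectedInvalidGranularity = error instanceof RangeError;
}
console.log(rejectedInvalidGranularity); // true
```

Omitting options is therefore the same as asking for grapheme segmentation.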
Intl.Segmenter.prototype.segment(string)
Creates a new Iterable %Segments% instance for the input string using the Segmenter’s locale and granularity.
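Because the returned object is iterable, it can be passed to Array.from or spread into an array. The following sketch, which assumes an "en" locale and a made-up input string, shows why grapheme segmentation differs from counting code units:

```javascript
// Illustrative only: the locale and the sample text are assumptions.
const graphemeSegmenter = new Intl.Segmenter("en", { granularity: "grapheme" });

// "e" followed by U+0301 (combining acute accent) renders as one character,
// then a plain "a".
const text = "e\u0301a";
const graphemeSegments = graphemeSegmenter.segment(text);

// Collect just the segment strings.
const graphemes = Array.from(graphemeSegments, (data) => data.segment);

console.log(text.length);      // 3 code units
console.log(graphemes.length); // 2 graphemes

// Each iteration asks for a fresh iterator, so the same object can be reused.
const secondPass = [...graphemeSegments].length;
console.log(secondPass);       // 2
```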
Segment Data
Segments are described by plain objects with the following data properties:
segment is the string segment.
index is the code unit index in the string at which the segment begins.
input is the string being segmented.
isWordLike is true when granularity is “word” and the segment is word-like (consisting of letters/numbers/ideographs/etc.), false when granularity is “word” and the segment is not word-like (consisting of spaces/punctuation/etc.), and undefined when granularity is not “word”.
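A common use of isWordLike is to keep only the words and drop spaces and punctuation. Here is a small sketch of that idea; the "en" locale, the sample sentence, and the variable names are assumptions for illustration.

```javascript
// Illustrative only: the locale, sample text, and names are assumptions.
const wordSegmenter = new Intl.Segmenter("en", { granularity: "word" });
const sample = "Hello, world!";

// Keep only segments flagged as word-like; punctuation and spaces are skipped.
const words = [];
for (const { segment, isWordLike } of wordSegmenter.segment(sample)) {
  if (isWordLike) {
    words.push(segment);
  }
}
console.log(words); // [ 'Hello', 'world' ]

// With any other granularity, isWordLike is undefined.
const graphemeData = new Intl.Segmenter("en").segment(sample).containing(0);
console.log(graphemeData.isWordLike); // undefined
```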
Iterating Through Segments
for (let {segment, index, isWordLike} of segments) {
  console.log("segment at code units [%d, %d): «%s»%s",
    index, index + segment.length,
    segment,
    isWordLike ? " (word-like)" : ""
  );
}
Output:
segment at code units [0, 3): «ಆವು» (word-like)
segment at code units [3, 4): « »
segment at code units [4, 8): «ಈವಿನ» (word-like)
segment at code units [8, 9): « »
segment at code units [9, 13): «ನಾವು» (word-like)
segment at code units [13, 14): « »
segment at code units [14, 20): «ನೀವಿಗೆ» (word-like)
segment at code units [20, 21): « »
segment at code units [21, 24): «ಆನು» (word-like)
segment at code units [24, 25): « »
segment at code units [25, 29): «ತಾನದ» (word-like)
segment at code units [29, 30): « »
segment at code units [30, 35): «ತನನನಾ» (word-like)
Segments Prototype Methods
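The %Segments% prototype has a containing(index) method, which returns the segment data for the segment that includes the given code unit index, and a [Symbol.iterator] method. The iterator that [Symbol.iterator] returns has a next method on its prototype.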
segments.containing(5);
/*
Would return something like:
{
  segment: 'ಈವಿನ',
  index: 4,
  input: 'ಆವು ಈವಿನ ನಾವು ನೀವಿಗೆ ಆನು ತಾನದ ತನನನಾ',
  isWordLike: true
}
*/
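For the text editor use case mentioned at the start, containing can locate the word under a cursor. The sketch below assumes an "en" locale and a hypothetical helper named wordRangeAt, which is not part of the API. It also steps through the segments manually with next.

```javascript
// Illustrative only: the locale, sample text, and wordRangeAt are assumptions.
const editorSegmenter = new Intl.Segmenter("en", { granularity: "word" });
const editorSegments = editorSegmenter.segment("jump over words");

// Hypothetical helper: returns the [start, end) code unit range of the
// segment under the cursor, or undefined when the cursor is out of bounds.
function wordRangeAt(segments, cursor) {
  const data = segments.containing(cursor);
  if (data === undefined) {
    return undefined;
  }
  return {
    text: data.segment,
    start: data.index,
    end: data.index + data.segment.length,
  };
}

const underCursor = wordRangeAt(editorSegments, 6);
console.log(underCursor); // { text: 'over', start: 5, end: 9 }

// An index past the end of the string yields undefined.
const outOfRange = wordRangeAt(editorSegments, 100);
console.log(outOfRange); // undefined

// Manual iteration: each next() call returns { value, done }.
const iterator = editorSegments[Symbol.iterator]();
const firstStep = iterator.next();
console.log(firstStep.done, firstStep.value.segment); // false 'jump'
```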
P.S. Don’t miss the FAQs in the proposal document.
About Hemanth HM
Hemanth HM is a Sr. Machine Learning Manager at PayPal, Google Developer Expert, TC39 delegate, FOSS advocate, and community leader with a passion for programming, AI, and open-source contributions.