Unicode defines a grapheme segmentation algorithm to find the boundaries between graphemes. Unicode also defines algorithms for finding boundaries between words and sentences, which CLDR adjusts based on locale settings. These boundaries are useful, for example, when implementing a text editor that has commands for jumping between or highlighting words and sentences. The Intl.Segmenter API lets us segment a string along these boundaries with ease. Let us look at a few examples!
let segmenter = new Intl.Segmenter("kn-IN", {granularity: "word"});
let input = "ಆವು ಈವಿನ ನಾವು ನೀವಿಗೆ ಆನು ತಾನದ ತನನನಾ";
let segments = segmenter.segment(input);
new Intl.Segmenter(locale, options)
Creates a new locale-dependent Segmenter. In the example above the locale is kn-IN, which is Kannada as used in India. If options is provided, it is treated as an object and its granularity property specifies the segmenter granularity ("grapheme", "word", or "sentence", defaulting to "grapheme").
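To see how the granularity option changes the result, here is a minimal sketch that counts segments at each level. The helper name countSegments and the sample strings are made up for illustration, and it assumes a runtime that ships Intl.Segmenter with full ICU data.

```javascript
// Hypothetical helper: counts the segments produced for `text`
// at the given granularity ("grapheme", "word", or "sentence").
function countSegments(text, granularity, locale = "en") {
  const segmenter = new Intl.Segmenter(locale, { granularity });
  // The object returned by segment() is iterable, so it can be spread.
  return [...segmenter.segment(text)].length;
}

const word = "ನಾವು"; // 4 code units, but 2 user-perceived characters
console.log(word.length);                                            // code units
console.log(countSegments(word, "grapheme", "kn-IN"));               // graphemes
console.log(countSegments("Hello world", "word"));                   // two words plus the space
console.log(countSegments("Hello world. How are you?", "sentence")); // sentences
```

Counting graphemes rather than code units is what you usually want for things like character limits or cursor movement.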
Intl.Segmenter.prototype.segment(string)
Creates a new iterable %Segments% instance for the input string using the Segmenter's locale and granularity.
Segment data
Segments are described by plain objects with the following data properties:
segment is the string segment.
index is the code unit index in the string at which the segment begins.
input is the string being segmented.
isWordLike is true when granularity is "word" and the segment is word-like (consisting of letters/numbers/ideographs/etc.), false when granularity is "word" and the segment is not word-like (consisting of spaces/punctuation/etc.), and undefined when granularity is not "word".
We can iterate through the segments as:
for (let {segment, index, isWordLike} of segments) {
  console.log("segment at code units [%d, %d): «%s»%s",
    index, index + segment.length,
    segment,
    isWordLike ? " (word-like)" : ""
  );
}
It would log:
segment at code units [0, 3): «ಆವು» (word-like)
segment at code units [3, 4): « »
segment at code units [4, 8): «ಈವಿನ» (word-like)
segment at code units [8, 9): « »
segment at code units [9, 13): «ನಾವು» (word-like)
segment at code units [13, 14): « »
segment at code units [14, 20): «ನೀವಿಗೆ» (word-like)
segment at code units [20, 21): « »
segment at code units [21, 24): «ಆನು» (word-like)
segment at code units [24, 25): « »
segment at code units [25, 29): «ತಾನದ» (word-like)
segment at code units [29, 30): « »
segment at code units [30, 35): «ತನನನಾ» (word-like)
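The log above also lists the spaces between the words. When only the words matter, isWordLike can be used to filter them out. A small sketch, where the function name wordsOf is hypothetical:

```javascript
// Hypothetical helper: returns only the word-like segments of `text`.
function wordsOf(text, locale) {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  return Array.from(segmenter.segment(text))
    .filter((data) => data.isWordLike) // drop spaces and punctuation
    .map((data) => data.segment);
}

console.log(wordsOf("ಆವು ಈವಿನ ನಾವು ನೀವಿಗೆ ಆನು ತಾನದ ತನನನಾ", "kn-IN"));
console.log(wordsOf("Hello, world!", "en")); // the two words, without the comma, space, or "!"
```

Splitting on whitespace would give a similar result for this input, but the segmenter also handles punctuation and scripts that do not separate words with spaces.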
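The segments object's prototype has a containing method and a [Symbol.iterator] method, and the SegmentIterator has a next method on its prototype. containing(index) returns the segment data for the segment that includes the given code unit index, or undefined if the index is out of range: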
segments.containing(5);
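containing fits the text editor use case mentioned at the start: given a cursor position, find the word under it. The sketch below uses a hypothetical helper named wordAt and an English sample string; it relies only on containing returning undefined for an out-of-range index.

```javascript
// Hypothetical helper: returns the [start, end) code unit range of the
// segment under `cursor`, or null when the cursor is out of range.
function wordAt(text, cursor, locale = "en") {
  const segments = new Intl.Segmenter(locale, { granularity: "word" }).segment(text);
  const data = segments.containing(cursor);
  if (data === undefined) return null;
  return {
    start: data.index,
    end: data.index + data.segment.length,
    text: data.segment,
  };
}

console.log(wordAt("Hello world", 7));  // start 6, end 11, text "world"
console.log(wordAt("Hello world", 42)); // null
```

An editor could pass the returned range straight to its selection API to highlight the word.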
P.S.: Don't miss the FAQs in the proposal document.
Feel free to share this article. You can also ping me on Twitter.