|
2 | 2 |
|
3 | 3 | Adds functions supporting some string algorithms in the Unicode standard. For example: |
4 | 4 |
|
5 | | -* `Unicode.String.fold/1,2` that applies the [Unicode Case Folding algorithm](https://www.unicode.org/versions/Unicode14.0.0/ch03.pdf) |
| 5 | +* The [Unicode Case Folding](https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf) algorithm to provide case-independent equality checking irrespective of language or script with `Unicode.String.fold/2` and `Unicode.String.equals_ignoring_case?/2` |
6 | 6 |
|
7 | | -* `Unicode.String.equals_ignoring_case?/2` that compares two strings for equality after applying `Unicode.String.fold/2` to the arguments. |
| 7 | +* The [Unicode Segmentation](https://unicode.org/reports/tr29/) algorithm to detect, break, split or stream strings into grapheme clusters, words, sentences and line break points. |
8 | 8 |
|
9 | | -## Examples |
| 9 | +* The [Unicode Line Breaking](https://www.unicode.org/reports/tr14/) algorithm to determine line breaks (as in breaks where word-wrapping would be acceptable). |
10 | 10 |
|
11 | | - iex> Unicode.String.equals_ignoring_case? "ABC", "abc" |
12 | | - true |
| 11 | +## Casing |
13 | 12 |
|
14 | | - iex> Unicode.String.equals_ignoring_case? "beißen", "beissen" |
15 | | - true |
| 13 | +The [Unicode Case Folding](https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf) algorithm defines how to perform case folding, which allows strings to be compared in a case-insensitive fashion. It does not define a means of comparing strings while ignoring diacritical marks (accents). Some examples follow; for details see:
16 | 14 |
|
17 | | - iex> Unicode.String.equals_ignoring_case? "grüßen", "grussen" |
18 | | - false |
| 15 | +* `Unicode.String.fold/2` |
| 16 | +* `Unicode.String.equals_ignoring_case?/3` |
| 17 | + |
| 18 | +```elixir |
| 19 | +iex> Unicode.String.equals_ignoring_case? "ABC", "abc" |
| 20 | +true |
| 21 | + |
| 22 | +iex> Unicode.String.equals_ignoring_case? "beißen", "beissen" |
| 23 | +true |
| 24 | + |
| 25 | +iex> Unicode.String.equals_ignoring_case? "grüßen", "grussen" |
| 26 | +false |
| 27 | +``` |
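
The comparison above can be sketched by hand, assuming (as the earlier README text states) that `equals_ignoring_case?` compares its arguments after applying `Unicode.String.fold`. The module `FoldSketch` and its function names are hypothetical and not part of the library, and `Mix.install/1` assumes a standalone script with network access. The second helper goes beyond case folding: it strips combining marks with the Elixir standard library before folding, which is only one possible way to ignore accents.

```elixir
# Assumption: run as a standalone script; Mix.install/1 fetches the package.
Mix.install([:unicode_string])

defmodule FoldSketch do
  # Hypothetical helper: case-insensitive equality via explicit folding.
  def same_ignoring_case?(a, b) do
    Unicode.String.fold(a) == Unicode.String.fold(b)
  end

  # Hypothetical helper, NOT defined by the case folding algorithm:
  # remove diacritical marks first, then compare the folded strings.
  def same_ignoring_case_and_marks?(a, b) do
    same_ignoring_case?(strip_marks(a), strip_marks(b))
  end

  # Decompose to NFD so accents become separate code points,
  # then drop every nonspacing mark (\p{Mn}).
  defp strip_marks(string) do
    string
    |> String.normalize(:nfd)
    |> String.replace(~r/\p{Mn}/u, "")
  end
end

IO.inspect(FoldSketch.same_ignoring_case?("beißen", "beissen"))
IO.inspect(FoldSketch.same_ignoring_case_and_marks?("grüßen", "grussen"))
```

Stripping marks this way is lossy and language-blind, so it suits loose matching (search, deduplication) rather than anything that must respect locale-specific rules.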
| 28 | +## Segmentation |
| 29 | + |
| 30 | +The [Unicode Segmentation](https://unicode.org/reports/tr29/) annex details the algorithm to be applied when segmenting text (Elixir strings) into words, sentences, graphemes and line breaks. Some examples follow; for details see:
| 31 | + |
| 32 | +* `Unicode.String.split/2` |
| 33 | +* `Unicode.String.break?/2` |
| 34 | +* `Unicode.String.break/2` |
| 35 | +* `Unicode.String.splitter/2` |
| 36 | +* `Unicode.String.next/2` |
| 37 | +* `Unicode.String.stream/2` |
| 38 | + |
| 39 | +```elixir |
| 40 | +# Split text at a word boundary. |
| 41 | +iex> Unicode.String.split "This is a sentence. And another.", break: :word |
| 42 | +["This", " ", "is", " ", "a", " ", "sentence", ".", " ", "And", " ", "another", "."] |
| 43 | + |
| 44 | +# Split text at a word boundary but omit any whitespace.
| 45 | +iex> Unicode.String.split "This is a sentence. And another.", break: :word, trim: true |
| 46 | +["This", "is", "a", "sentence", ".", "And", "another", "."] |
| 47 | + |
| 48 | +# Split text at a sentence boundary. |
| 49 | +iex> Unicode.String.split "This is a sentence. And another.", break: :sentence |
| 50 | +["This is a sentence. ", "And another."] |
| 51 | + |
| 52 | +# By default, common abbreviations are suppressed (i.e.
| 53 | +# they do not cause a break).
| 54 | +iex> Unicode.String.split "No, I don't have a Ph.D. but I don't think it matters.", break: :word, trim: true |
| 55 | +["No", ",", "I", "don't", "have", "a", "Ph.D", ".", "but", "I", "don't", |
| 56 | + "think", "it", "matters", "."] |
| 57 | + |
| 58 | +iex> Unicode.String.split "No, I don't have a Ph.D. but I don't think it matters.", break: :sentence, trim: true |
| 59 | +["No, I don't have a Ph.D. but I don't think it matters."] |
| 60 | + |
| 61 | +# Sentence Break suppressions are locale sensitive. |
| 62 | +iex> Unicode.String.Segment.known_locales |
| 63 | +["de", "el", "en", "en-US", "en-US-POSIX", "es", "fi", "fr", "it", "ja", "pt", |
| 64 | + "root", "ru", "sv", "zh", "zh-Hant"] |
| 65 | + |
| 66 | +iex> Unicode.String.split "Non, c'est M. Dubois.", break: :sentence, trim: true, locale: "fr" |
| 67 | +["Non, c'est M. Dubois."] |
| 68 | + |
| 69 | +# Note that break: :line does NOT mean split the string |
| 70 | +# at newlines. It splits the string where a line break would be |
| 71 | +# acceptable. This is very useful for calculating where |
| 72 | +# to perform word-wrap on some text. |
| 73 | +iex> Unicode.String.split "This is a sentence. And another.", break: :line |
| 74 | +["This ", "is ", "a ", "sentence. ", "And ", "another."] |
| 75 | +``` |
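
To make the word-wrap remark concrete, here is a minimal sketch of a greedy wrapper built on `break: :line`. `WrapSketch.wrap/2` is a hypothetical helper, not part of the library, and it measures width with `String.length/1` (grapheme count), which is only an approximation of display width for wide or combining characters.

```elixir
# Assumption: run as a standalone script; Mix.install/1 fetches the package.
Mix.install([:unicode_string])

defmodule WrapSketch do
  # Hypothetical helper, not part of unicode_string.
  # Greedily packs line-break segments into lines of at most `width` graphemes.
  def wrap(text, width) do
    text
    |> Unicode.String.split(break: :line)
    |> Enum.reduce([""], fn segment, [current | done] ->
      # Trailing whitespace does not count against the width.
      candidate = String.trim_trailing(current <> segment)

      if current != "" and String.length(candidate) > width do
        # The segment does not fit: start a new line with it.
        [segment, current | done]
      else
        [current <> segment | done]
      end
    end)
    |> Enum.reverse()
    |> Enum.map(&String.trim_trailing/1)
  end
end

IO.inspect(WrapSketch.wrap("This is a sentence. And another.", 12))
# Given the segments shown above, this should print
# ["This is a", "sentence.", "And another."]
```

A segment longer than `width` is kept whole on its own line, since breaking inside it would ignore the permitted break points.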
| 76 | + |
| 77 | +## Segment Streaming |
| 78 | + |
| 79 | +Segmentation can also be streamed using `Unicode.String.stream/2`. For large strings this may improve memory usage since the intermediate segments will be garbage collected when they fall out of scope. |
| 80 | + |
| 81 | +```elixir |
| 82 | +iex> Enum.to_list Unicode.String.stream("this is a set of words", trim: true)
| | +["this", "is", "a", "set", "of", "words"]
| 83 | + |
| 84 | +iex> Enum.map Unicode.String.stream("this is a set of words", trim: true), |
| 85 | +...> fn word -> %{word: word, length: String.length(word)} end |
| 86 | +[ |
| 87 | + %{length: 4, word: "this"}, |
| 88 | + %{length: 2, word: "is"}, |
| 89 | + %{length: 1, word: "a"}, |
| 90 | + %{length: 3, word: "set"}, |
| 91 | + %{length: 2, word: "of"}, |
| 92 | + %{length: 5, word: "words"} |
| 93 | +] |
| 94 | +``` |
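
Because the result is enumerable, it can also be composed with `Stream` functions and consumed only partially. The following is a small sketch under that assumption; the variable name `first_two` is illustrative only.

```elixir
# Assumption: run as a standalone script; Mix.install/1 fetches the package.
Mix.install([:unicode_string])

first_two =
  "this is a set of words"
  |> Unicode.String.stream(trim: true)
  # Stream.map/2 is lazy: nothing is segmented until the stream is consumed.
  |> Stream.map(&String.upcase/1)
  # Enum.take/2 stops pulling segments once it has two of them.
  |> Enum.take(2)

IO.inspect(first_two)
# Given the segments shown above, this should print ["THIS", "IS"]
```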
19 | 95 |
|
20 | 96 | ## Installation |
21 | 97 |
|
22 | | -The package can be installed by adding `unicode_string` to your list of dependencies in `mix.exs`: |
| 98 | +The package can be installed by adding `:unicode_string` to your list of dependencies in `mix.exs`: |
23 | 99 |
|
24 | 100 | ```elixir |
25 | 101 | def deps do |
|