Commit 45bf3b0

Merge upstream
2 parents 3cab952 + 419ca88 commit 45bf3b0

13 files changed

Lines changed: 7929 additions & 70 deletions

.github/workflows/ci.yml

Lines changed: 4 additions & 10 deletions
@@ -8,22 +8,19 @@ on:

 jobs:
   test:
-    runs-on: ubuntu-18.04
+    runs-on: ubuntu-20.04
     env:
       MIX_ENV: test
     strategy:
       fail-fast: false
       matrix:
         include:
-          - pair:
-              elixir: 1.8.2
-              otp: 20.3.8.26
           - pair:
               elixir: 1.14.0
               otp: 25.1
             lint: lint
     steps:
-      - uses: actions/checkout@v2
+      - uses: actions/checkout@v3

       - uses: erlef/setup-beam@v1
         with:
@@ -42,12 +39,9 @@ jobs:
       - run: mix format --check-formatted
         if: ${{ matrix.lint }}

-      - run: mix deps.unlock --check-unused
-        if: ${{ matrix.lint }}
-
-      - run: mix credo --strict check
+      - run: mix deps --check-unused
         if: ${{ matrix.lint }}
-
+
       - run: mix deps.compile

       - run: mix compile --warnings-as-errors

CHANGELOG.md

Lines changed: 11 additions & 3 deletions
@@ -1,13 +1,21 @@
 # Changelog

-## Unicode String v1.1.1
+## Unicode String v1.2.1

-This is the changelog for Unicode String v1.1.1 released on June 2nd, 2023. For older changelogs please consult the release tag on [GitHub](https://github.com/elixir-unicode/unicode_string/tags)
+This is the changelog for Unicode String v1.2.1 released on June 2nd, 2023. For older changelogs please consult the release tag on [GitHub](https://github.com/elixir-unicode/unicode_string/tags)

 ### Bug Fixes

 * Resolve segments dir at runtime, not compile time. Thanks to @crkent for the report. Closes #4.

+## Unicode String v1.2.0
+
+This is the changelog for Unicode String v1.2.0 released on March 14th, 2023. For older changelogs please consult the release tag on [GitHub](https://github.com/elixir-unicode/unicode_string/tags)
+
+### Enhancements
+
+* Adds `Unicode.String.stream/2` to support streaming graphemes, words, sentences and line breaks.
+
 ## Unicode String v1.1.0

 This is the changelog for Unicode String v1.1.0 released on September 21st, 2022. For older changelogs please consult the release tag on [GitHub](https://github.com/elixir-unicode/unicode_string/tags)
@@ -46,7 +54,7 @@ This is the changelog for Unicode String v0.2.0 released on July 12th, 2020. Fo

 ### Enhancements

-This release implements the Unicode break rules for graphemes, words, lines and sentences.
+This release implements the Unicode break rules for graphemes, words, lines (word-wrapping) and sentences.

 * Adds `Unicode.String.split/2`
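Note on the v1.2.1 changelog entry ("Resolve segments dir at runtime, not compile time"): the sketch below illustrates the general difference between the two in Elixir. It is not the library's code. The module name `SegmentsDirSketch` and the `priv/segments` path are hypothetical, and `File.cwd!/0` stands in for whatever lookup the library actually performs (a real package would more likely use something like `Application.app_dir/2`).

```elixir
defmodule SegmentsDirSketch do
  # Hypothetical module for illustration only.

  # A module attribute is evaluated once, while the module is compiled.
  # The resulting string is stored in the compiled module, so it keeps
  # pointing at the build-time location even if the code later runs elsewhere.
  @compile_time_dir Path.join(File.cwd!(), "priv/segments")

  def compile_time_dir, do: @compile_time_dir

  # A function body is evaluated on every call, so the path reflects
  # the environment the code is running in right now.
  def runtime_dir, do: Path.join(File.cwd!(), "priv/segments")
end
```

After changing the working directory, `runtime_dir/0` follows the change while `compile_time_dir/0` keeps returning the path captured at compile time. A path that is correct on the build machine but wrong on the deployment machine is the kind of problem a runtime lookup avoids.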

README.md

Lines changed: 86 additions & 10 deletions
@@ -2,24 +2,100 @@

 Adds functions supporting some string algorithms in the Unicode standard. For example:

-* `Unicode.String.fold/1,2` that applies the [Unicode Case Folding algorithm](https://www.unicode.org/versions/Unicode14.0.0/ch03.pdf)
+* The [Unicode Case Folding](https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf) algorithm to provide case-independent equality checking irrespective of language or script with `Unicode.String.fold/2` and `Unicode.String.equals_ignoring_case?/2`

-* `Unicode.String.equals_ignoring_case?/2` that compares two strings for equality after applying `Unicode.String.fold/2` to the arguments.
+* The [Unicode Segmentation](https://unicode.org/reports/tr29/) algorithm to detect, break, split or stream strings into grapheme clusters, words, sentences and line break points.

-## Examples
+* The [Unicode Line Breaking](https://www.unicode.org/reports/tr14/) algorithm to determine line breaks (as in breaks where word-wrapping would be acceptable).

-iex> Unicode.String.equals_ignoring_case? "ABC", "abc"
-true
+## Casing

-iex> Unicode.String.equals_ignoring_case? "beißen", "beissen"
-true
+The [Unicode Case Folding](https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf) algorithm defines how to perform case folding. This allows comparison of strings in a case-insensitive fashion. It does not define the means to compare ignoring diacritical marks (accents). Some examples follow, for details see:

-iex> Unicode.String.equals_ignoring_case? "grüßen", "grussen"
-false
+* `Unicode.String.fold/2`
+* `Unicode.String.equals_ignoring_case?/3`
+
+```elixir
+iex> Unicode.String.equals_ignoring_case? "ABC", "abc"
+true
+
+iex> Unicode.String.equals_ignoring_case? "beißen", "beissen"
+true
+
+iex> Unicode.String.equals_ignoring_case? "grüßen", "grussen"
+false
+```
+## Segmentation
+
+The [Unicode Segmentation](https://unicode.org/reports/tr29/) annex details the algorithm to be applied with segmenting text (Elixir strings) into words, sentences, graphemes and line breaks. Some examples follow, for details see:
+
+* `Unicode.String.split/2`
+* `Unicode.String.break?/2`
+* `Unicode.String.break/2`
+* `Unicode.String.splitter/2`
+* `Unicode.String.next/2`
+* `Unicode.String.stream/2`
+
+```elixir
+# Split text at a word boundary.
+iex> Unicode.String.split "This is a sentence. And another.", break: :word
+["This", " ", "is", " ", "a", " ", "sentence", ".", " ", "And", " ", "another", "."]
+
+# Split text at a word boundary but omit any whitespace
+iex> Unicode.String.split "This is a sentence. And another.", break: :word, trim: true
+["This", "is", "a", "sentence", ".", "And", "another", "."]
+
+# Split text at a sentence boundary.
+iex> Unicode.String.split "This is a sentence. And another.", break: :sentence
+["This is a sentence. ", "And another."]
+
+# By default, common abbreviations are suppressed (ie
+# the do not cause a break)
+iex> Unicode.String.split "No, I don't have a Ph.D. but I don't think it matters.", break: :word, trim: true
+["No", ",", "I", "don't", "have", "a", "Ph.D", ".", "but", "I", "don't",
+ "think", "it", "matters", "."]
+
+iex> Unicode.String.split "No, I don't have a Ph.D. but I don't think it matters.", break: :sentence, trim: true
+["No, I don't have a Ph.D. but I don't think it matters."]
+
+# Sentence Break suppressions are locale sensitive.
+iex> Unicode.String.Segment.known_locales
+["de", "el", "en", "en-US", "en-US-POSIX", "es", "fi", "fr", "it", "ja", "pt",
+ "root", "ru", "sv", "zh", "zh-Hant"]
+
+iex> Unicode.String.split "Non, c'est M. Dubois.", break: :sentence, trim: true, locale: "fr"
+["Non, c'est M. Dubois."]
+
+# Note that break: :line does NOT mean split the string
+# at newlines. It splits the string where a line break would be
+# acceptable. This is very useful for calculating where
+# to perform word-wrap on some text.
+iex> Unicode.String.split "This is a sentence. And another.", break: :line
+["This ", "is ", "a ", "sentence. ", "And ", "another."]
+```
+
+## Segment Streaming
+
+Segmentation can also be streamed using `Unicode.String.stream/2`. For large strings this may improve memory usage since the intermediate segments will be garbage collected when they fall out of scope.
+
+```elixir
+iex> Enum.to_list Unicode.String.stream("this is a set of words", trim: true) ["this", "is", "a", "set", "of", "words"]
+
+iex> Enum.map Unicode.String.stream("this is a set of words", trim: true),
+...> fn word -> %{word: word, length: String.length(word)} end
+[
+  %{length: 4, word: "this"},
+  %{length: 2, word: "is"},
+  %{length: 1, word: "a"},
+  %{length: 3, word: "set"},
+  %{length: 2, word: "of"},
+  %{length: 5, word: "words"}
+]
+```

 ## Installation

-The package can be installed by adding `unicode_string` to your list of dependencies in `mix.exs`:
+The package can be installed by adding `:unicode_string` to your list of dependencies in `mix.exs`:

 ```elixir
 def deps do
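
Note on the "Segment Streaming" section added to the README above: the memory argument rests on segments being produced lazily, one at a time, rather than collected into a full list first. The sketch below shows that idea using only the Elixir standard library. It is an illustration, not how `Unicode.String.stream/2` is implemented. The module `LazyGraphemes` is hypothetical, and it relies on the built-in `String.next_grapheme/1` rather than the library's segmentation rules.

```elixir
defmodule LazyGraphemes do
  # Hypothetical helper for illustration only.
  # Returns a lazy stream of graphemes; nothing is computed until the
  # stream is consumed.
  def stream(string) when is_binary(string) do
    # String.next_grapheme/1 returns {grapheme, rest}, or nil when the
    # string is exhausted, which is the shape Stream.unfold/2 expects.
    Stream.unfold(string, &String.next_grapheme/1)
  end
end

# Only the first three graphemes are ever extracted from the string.
LazyGraphemes.stream("this is a set of words") |> Enum.take(3)
# => ["t", "h", "i"]
```

Because each element is produced on demand, a consumer such as `Enum.take/2` or `Enum.reduce/3` can stop early or discard segments as it goes, without a list of every segment existing at once.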

lib/unicode/break.ex

Lines changed: 1 addition & 0 deletions
@@ -189,6 +189,7 @@ defmodule Unicode.String.Break do
   # "S.A.", "Up.", "Job.", "Num.", "M.I.T.", "Ok.", "Org.", "Ex.", "Cont.", "U.",
   # "Mart.", "Fn.", "Abs.", "Lt.", "OK.", "Z.", "E.", "Kb.", "Est.", "A.M.",
   # "L.A.", ...]
+
   defp suppressions_rule(locale, segment_type)

   for locale <- Segment.known_locales() do

lib/unicode/segment.ex

Lines changed: 1 addition & 1 deletion
@@ -152,7 +152,7 @@ defmodule Unicode.String.Segment do
   end

   @doc """
-  Evaludates a list of rules against a given
+  Evaluates a list of rules against a given
   string.

   """
