Experimental combined / omnibus date parser #112
Merged
Changes from 5 commits
Commits
7 commits
aefc8b8 Preliminary combined date parser (rlskoeser)
2fc55f7 Move parser grammar files to common location; simplify combined parser (rlskoeser)
25137cb Add, document, & test omnibus converter (rlskoeser)
7a99c5c Add test case for unsupported serialization (rlskoeser)
e655936 Merge branch 'develop' into feature/combined-parser (rlskoeser)
0691b12 Add tests for error cases (rlskoeser)
ce5baaa Add brief overview docstring for converter module (rlskoeser)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1 +1,3 @@ | ||
| -from undate.converters.base import BaseDateConverter as BaseDateConverter | ||
| +from undate.converters.base import BaseDateConverter, GRAMMAR_FILE_PATH | ||
| + | ||
| +__all__ = ["BaseDateConverter", "GRAMMAR_FILE_PATH"] | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,85 @@ | ||
| """ | ||
| **Experimental** combined parser. Supports EDTF, Hebrew, and Hijri | ||
| where dates are unambiguous. (Year-only dates are parsed as EDTF in | ||
| Gregorian calendar.) | ||
| """ | ||
|
|
||
| from typing import Union | ||
|
|
||
| from lark import Lark | ||
| from lark.exceptions import UnexpectedCharacters | ||
| from lark.visitors import Transformer, merge_transformers | ||
|
|
||
| from undate import Undate, UndateInterval | ||
| from undate.converters import BaseDateConverter, GRAMMAR_FILE_PATH | ||
| from undate.converters.edtf.transformer import EDTFTransformer | ||
| from undate.converters.calendars.hebrew.transformer import HebrewDateTransformer | ||
| from undate.converters.calendars.islamic.transformer import IslamicDateTransformer | ||
|
|
||
|
|
||
| class CombinedDateTransformer(Transformer): | ||
| def start(self, children): | ||
| # trigger the transformer for the appropriate part of the grammar | ||
| return children | ||
|
|
||
|
|
||
| # NOTE: currently year-only dates in combined parser are interpreted as | ||
| # EDTF and use Gregorian calendar. | ||
| # In future, we could refine by adding calendar names & abbreviations | ||
| # to the parser in order to recognize years from other calendars. | ||
|
|
||
| combined_transformer = merge_transformers( | ||
| CombinedDateTransformer(), | ||
| edtf=EDTFTransformer(), | ||
| hebrew=HebrewDateTransformer(), | ||
| islamic=IslamicDateTransformer(), | ||
| ) | ||
|
|
||
|
|
||
| # open based on filename so we can specify relative import path based on grammar file | ||
| parser = Lark.open( | ||
| str(GRAMMAR_FILE_PATH / "combined.lark"), rel_to=__file__, strict=True | ||
| ) | ||
|
|
||
|
|
||
| class OmnibusDateConverter(BaseDateConverter): | ||
| """ | ||
| Combination parser that aggregates existing parser grammars. | ||
| Currently supports EDTF, Hebrew, and Hijri where dates are unambiguous. | ||
| (Year-only dates are parsed as EDTF in Gregorian calendar.) | ||
|
|
||
| Does not support serialization. | ||
|
|
||
| Example usage:: | ||
|
|
||
| Undate.parse("Tammuz 4816", "omnibus") | ||
|
|
||
| """ | ||
|
|
||
| #: converter name: omnibus | ||
| name: str = "omnibus" | ||
|
|
||
| def __init__(self): | ||
| self.transformer = combined_transformer | ||
|
|
||
| def parse(self, value: str) -> Union[Undate, UndateInterval]: | ||
| """ | ||
| Parse a string in a supported format and return an :class:`~undate.undate.Undate` | ||
| or :class:`~undate.undate.UndateInterval`. | ||
| """ | ||
| if not value: | ||
| raise ValueError("Parsing empty/unset string is not supported") | ||
|
|
||
| # parse the input string, then transform to undate object | ||
| try: | ||
| parsetree = parser.parse(value) | ||
| # transform returns a list; we want the first item in the list | ||
| return self.transformer.transform(parsetree)[0] | ||
| except UnexpectedCharacters: | ||
| raise ValueError( | ||
| "Parsing failed: '%s' is not in a recognized date format" % value | ||
| ) | ||
|
|
||
| def to_string(self, undate: Union[Undate, UndateInterval]) -> str: | ||
| "Not supported by this converter. Will raise :class:`ValueError`" | ||
| raise ValueError("Omnibus converter does not support serialization") |
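The dispatch rule described in the converter docstring (a month name identifies the calendar, while a bare year falls back to EDTF in the Gregorian calendar) can be illustrated without lark. The following is a minimal standard-library sketch, not the converter's implementation: `guess_calendar`, the abbreviated month tables, and the tuple return value are all hypothetical, and only the month numbers that appear in this PR's test cases are included.

```python
import re

# Hypothetical, heavily abbreviated month tables (values taken from the PR's tests).
# The real grammars recognize many more month names and spellings.
HEBREW_MONTHS = {"Tammuz": 4}
ISLAMIC_MONTHS = {"Jumādā I": 5, "Rabīʿ I": 3}


def guess_calendar(value: str) -> tuple:
    """Return (calendar, year, month, day) for a supported date string."""
    value = value.strip()
    if not value:
        raise ValueError("Parsing empty/unset string is not supported")

    # A year on its own is ambiguous between calendars,
    # so this sketch treats it as Gregorian.
    if re.fullmatch(r"\d{4}", value):
        return ("Gregorian", int(value), None, None)

    # Otherwise the month name decides which calendar applies.
    calendars = (("Hebrew", HEBREW_MONTHS), ("Islamic", ISLAMIC_MONTHS))
    for calendar, months in calendars:
        for name, number in months.items():
            # optional day, month name, year
            pattern = rf"(?:(\d{{1,2}}) )?{re.escape(name)} (\d+)"
            match = re.fullmatch(pattern, value)
            if match:
                day = int(match.group(1)) if match.group(1) else None
                return (calendar, int(match.group(2)), number, day)

    raise ValueError(f"Parsing failed: '{value}' is not in a recognized date format")


print(guess_calendar("Tammuz 4816"))
print(guess_calendar("7 Jumādā I 1243"))
print(guess_calendar("1984"))
```

The actual converter reaches the same outcome through grammar overrides rather than regular expressions: the combined grammar removes the year-only alternatives from the Hebrew and Islamic date rules, so a bare year can only match the EDTF branch.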
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,8 +1,8 @@ | ||
| import pathlib | ||
|
|
||
| from lark import Lark | ||
|
|
||
| grammar_path = pathlib.Path(__file__).parent / "edtf.lark" | ||
| from undate.converters import GRAMMAR_FILE_PATH | ||
|
|
||
| grammar_path = GRAMMAR_FILE_PATH / "edtf.lark" | ||
|
|
||
| with open(grammar_path) as grammar: | ||
| edtf_parser = Lark(grammar.read(), start="edtf") |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,32 @@ | ||
| %import common.WS | ||
| %ignore WS | ||
|
|
||
| start: (edtf__start | hebrew__hebrew_date | islamic__islamic_date ) | ||
|
|
||
| // Renaming of the import variables is required, as they receive the namespace of this file. | ||
| // See: https://github.com/lark-parser/lark/pull/973#issuecomment-907287565 | ||
|
|
||
| // All grammars are in the same file, so we can use relative imports | ||
|
|
||
| // relative import from edtf.lark | ||
| %import .edtf.edtf -> edtf__start | ||
|
|
||
| // relative import from hebrew.lark | ||
| %import .hebrew.hebrew_date -> hebrew__hebrew_date | ||
| %import .hebrew.day -> hebrew__day | ||
| %import .hebrew.month -> hebrew__month | ||
| %import .hebrew.year -> hebrew__year | ||
|
|
||
| // relative import from islamic.lark | ||
| %import .islamic.islamic_date -> islamic__islamic_date | ||
| %import .islamic.day -> islamic__day | ||
| %import .islamic.month -> islamic__month | ||
| %import .islamic.year -> islamic__year | ||
|
|
||
|
|
||
| // override hebrew date to omit year-only, since year without calendar is ambiguous | ||
| // NOTE: potentially support year with calendar label | ||
| %override hebrew__hebrew_date: hebrew__day hebrew__month hebrew__year | hebrew__month hebrew__year | ||
|
|
||
| // same for islamic date, year alone is ambiguous | ||
| %override islamic__islamic_date: islamic__day islamic__month islamic__year | islamic__month islamic__year |
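The `namespace__rule` renaming in this grammar pairs with the `merge_transformers` call in the converter module: as far as I understand lark's behavior, the merged transformer exposes each sub-transformer's callbacks under the same `namespace__` prefix, so a rule named `hebrew__year` is handled by the Hebrew transformer's `year` method. A toy standard-library sketch of that prefixing idea follows; `merge_handlers` and the handler classes are hypothetical stand-ins and not lark code.

```python
class HebrewHandler:
    """Hypothetical stand-in for a calendar-specific transformer."""

    def year(self, children):
        return ("Hebrew", children[0])


class IslamicHandler:
    """Hypothetical stand-in for a second calendar-specific transformer."""

    def year(self, children):
        return ("Islamic", children[0])


class CombinedHandler:
    """Empty base object that receives the prefixed callbacks."""


def merge_handlers(base, **handlers):
    """Copy each handler's public methods onto base as '<namespace>__<method>'."""
    for namespace, handler in handlers.items():
        for method_name in dir(handler):
            if method_name.startswith("_"):
                continue  # skip private and dunder attributes
            method = getattr(handler, method_name)
            if callable(method):
                setattr(base, f"{namespace}__{method_name}", method)
    return base


merged = merge_handlers(
    CombinedHandler(), hebrew=HebrewHandler(), islamic=IslamicHandler()
)

# Both handlers define `year`, but the prefix keeps them distinct.
print(merged.hebrew__year(["4816"]))
print(merged.islamic__year(["1243"]))
```

Because both calendars define rules with the same short names (`day`, `month`, `year`), the prefix is what keeps their callbacks from colliding once the grammars are combined.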
File renamed without changes.
File renamed without changes.
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,45 @@ | ||
| import pytest | ||
|
|
||
| from undate.converters.combined import parser, combined_transformer | ||
|
|
||
| from undate import Undate, UndateInterval | ||
|
|
||
| # test that valid dates can be parsed | ||
|
|
||
| testcases = [ | ||
| # EDTF | ||
| ("1984", Undate(1984)), | ||
| ("201X", Undate("201X")), | ||
| ("20XX", Undate("20XX")), | ||
| ("2004-XX", Undate(2004, "XX")), | ||
| ("1000/2000", UndateInterval(Undate(1000), Undate(2000))), | ||
| # Hebrew / Anno Mundi calendar | ||
| ("Tammuz 4816", Undate(4816, 4, calendar="Hebrew")), | ||
| # Islamic / Hijri calendar | ||
| ("Jumādā I 1243", Undate(1243, 5, calendar="Islamic")), | ||
| ("7 Jumādā I 1243", Undate(1243, 5, 7, calendar="Islamic")), | ||
| ("14 Rabīʿ I 901", Undate(901, 3, 14, calendar="Islamic")), | ||
| ] | ||
|
|
||
|
|
||
| @pytest.mark.parametrize("date_string,expected", testcases) | ||
| def test_transform(date_string, expected): | ||
| # test the transformer directly | ||
| transformer = combined_transformer | ||
| # parse the input string, then transform to undate object | ||
| parsetree = parser.parse(date_string) | ||
| # since the same unknown date is not considered strictly equal, | ||
| # compare object representations | ||
| transformed_date = transformer.transform(parsetree) | ||
| assert repr(transformed_date[0]) == repr(expected) | ||
|
|
||
|
|
||
| @pytest.mark.parametrize("date_string,expected", testcases) | ||
| def test_converter(date_string, expected): | ||
| # should work the same way when called through the converter class | ||
| assert repr(Undate.parse(date_string, "omnibus")) == repr(expected) | ||
|
|
||
|
|
||
| def test_no_serialize(): | ||
| with pytest.raises(ValueError, match="does not support"): | ||
| Undate("2022").format("omnibus") |
RST directive syntax is malformed.
The note directive should use `.. note::` (two dots, space, directive name, double colon). The current syntax will not render as a note block in Sphinx documentation.