|
| 1 | +# Geocoder NLP database format |
| 2 | + |
| 3 | +The geocoder database consists of several files, which are expected to be in the |
| 4 | +same directory. Each location is described by a single coordinate to keep the |
| 5 | +files as small as possible. |
| 6 | + |
| 7 | +The files composing a database are: |
| 8 | + |
| 9 | +1. geonlp-primary.sqlite: SQLite database with location descriptions and coordinates |
| 10 | +2. geonlp-normalized.trie: MARISA database with normalized strings |
| 11 | +3. geonlp-normalized-id.kch: Kyoto Cabinet database for linking MARISA and primary IDs |
| 12 | + |
| 13 | +## geonlp-primary.sqlite |
| 14 | + |
| 15 | +The SQLite database contains location descriptions and their organization into a |
| 16 | +hierarchy of objects. |
| 17 | + |
| 18 | +Table `object_primary` keeps the location descriptions. In this table, objects are |
| 19 | +stored sequentially (in terms of their `id`) according to their position in the |
| 20 | +object hierarchy, with children stored after their parents. Table `hierarchy` has a |
| 21 | +record for each object that has children (`id` from `object_primary`), consisting |
| 22 | +of the parent ID (`prim_id`) and the ID of the last child (`last_subobject`). |
| 23 | + |
| 24 | +Object types are stored separately in the `type` table, with the type ID referenced from |
| 25 | +`object_primary`. |
| 26 | + |
| 27 | +Spatial queries are indexed using an R-Tree, with `box_id` used as a reference in |
| 28 | +`object_primary`. Since all objects are stored as points, objects next to each |
| 29 | +other are assigned the same `box_id` for storage efficiency and are |
| 30 | +found through the `-rtree` tables. |
| 31 | + |
| 32 | +Table `meta` keeps the database format version and is used to check version |
| 33 | +compatibility. |
| 34 | + |
| 35 | +## geonlp-normalized.trie |
| 36 | + |
| 37 | +All normalized strings are stored in a MARISA database |
| 38 | +(https://github.com/s-yata/marisa-trie). Normalized strings are formed from |
| 39 | +`name` and other similar fields of the `object_primary` table in |
| 40 | +`geonlp-primary.sqlite`. All strings are pushed into the MARISA database, which |
| 41 | +assigns an internal ID to each string. |
| 42 | + |
| 43 | +## geonlp-normalized-id.kch |
| 44 | + |
| 45 | +Kyoto Cabinet (https://dbmx.net/kyotocabinet/) database for linking MARISA and |
| 46 | +primary IDs. The hash database variant is used, where `key` is an ID provided by |
| 47 | +MARISA for a search string and the value is an array of bytes consisting of |
| 48 | +`object_primary` IDs stored as `uint32_t` one after another. The array is stored |
| 49 | +using `std::string`. |
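
The value layout described above can be illustrated with Python's standard `struct` module. This is a minimal sketch, not the geocoder's own code: the helper names `encode_ids` and `decode_ids` are hypothetical, and little-endian byte order is an assumption, since the format description does not state the byte order.

```python
import struct


def encode_ids(ids):
    """Pack object_primary IDs back to back as unsigned 32-bit integers."""
    # "<I" = little-endian uint32; the byte order is an assumption.
    return struct.pack("<%dI" % len(ids), *ids)


def decode_ids(value):
    """Unpack a raw value (bytes) into a list of object_primary IDs."""
    if len(value) % 4 != 0:
        raise ValueError("value length is not a multiple of 4 bytes")
    return [obj_id for (obj_id,) in struct.iter_unpack("<I", value)]


raw = encode_ids([7, 42, 4000000000])
ids = decode_ids(raw)
```

Each ID takes exactly four bytes, so the number of objects linked to a normalized string is the value length divided by four.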