Commit 9c1544d

Describe database format, fixes #47
1 parent 5e77167 commit 9c1544d

Database.md

Lines changed: 49 additions & 0 deletions
@@ -0,0 +1,49 @@
# Geocoder NLP database format

The geocoder database consists of several files, which are expected to be in the
same directory. Each location is described by a single coordinate to keep the
files as small as possible.

The files composing a database are:

1. geonlp-primary.sqlite: SQLite database with location descriptions and coordinates
2. geonlp-normalized.trie: MARISA database with normalized strings
3. geonlp-normalized-id.kch: Kyoto Cabinet database for linking MARISA and primary IDs

## geonlp-primary.sqlite

The SQLite database contains location descriptions and their organization into a
hierarchy of objects.

Table `object_primary` keeps the location descriptions. In this table, objects are
stored sequentially (in terms of their `id`) according to their position in the
object hierarchy, with children stored after their parents. Table `hierarchy` has a
record for each item that has children, consisting of the parent ID (`prim_id`,
an `id` from `object_primary`) and the ID of its last child (`last_subobject`).

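Because children are stored after their parents, everything below an object should occupy one contiguous range of IDs. The following is a minimal sketch of that lookup, assuming `last_subobject` is the largest `id` in the parent's subtree. Only `id`, `prim_id` and `last_subobject` come from the description above; the `name` column, the sample rows and the `descendants` helper are hypothetical.

```python
import sqlite3

# Hypothetical, stripped-down schema: only `id`, `prim_id` and
# `last_subobject` come from the format description; `name` is illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE object_primary (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE hierarchy (prim_id INTEGER PRIMARY KEY, last_subobject INTEGER);
INSERT INTO object_primary VALUES
  (1, 'Country'), (2, 'City A'), (3, 'Street A1'), (4, 'Street A2'), (5, 'City B');
INSERT INTO hierarchy VALUES (1, 5), (2, 4);
""")


def descendants(conn, parent_id):
    """Return the names of all objects stored below parent_id."""
    row = conn.execute(
        "SELECT last_subobject FROM hierarchy WHERE prim_id = ?", (parent_id,)
    ).fetchone()
    if row is None:
        # No hierarchy record: the object has no children.
        return []
    # Assumption: the subtree is the contiguous ID range (parent_id, last_subobject].
    cur = conn.execute(
        "SELECT name FROM object_primary WHERE id > ? AND id <= ? ORDER BY id",
        (parent_id, row[0]),
    )
    return [name for (name,) in cur]
```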
Object types are stored separately in the `type` table, with the type ID used in
`object_primary`.

Spatial queries are indexed using an R-Tree, with `box_id` used as a reference in
`object_primary`. Because all objects are stored as points, nearby objects are
assigned the same `box_id` for storage efficiency and are found through the
`-rtree` tables.

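One way to picture the shared `box_id` is as a join between an R-Tree of boxes and the objects assigned to each box. The sketch below uses SQLite's R-Tree module from Python and assumes that module is compiled into the local SQLite build. The table name `box_rtree`, its column names, the `name` column and the sample coordinates are all hypothetical; only `object_primary` and `box_id` come from the description above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical names: only `object_primary` and `box_id` come from the text.
CREATE VIRTUAL TABLE box_rtree USING rtree(id, min_lat, max_lat, min_lon, max_lon);
CREATE TABLE object_primary (id INTEGER PRIMARY KEY, name TEXT, box_id INTEGER);
INSERT INTO box_rtree VALUES (1, 59.40, 59.45, 24.70, 24.80);
INSERT INTO box_rtree VALUES (2, 60.15, 60.20, 24.90, 25.00);
INSERT INTO object_primary VALUES (1, 'Cafe', 1), (2, 'Library', 1), (3, 'Harbour', 2);
""")


def objects_in_area(conn, min_lat, max_lat, min_lon, max_lon):
    """Return names of objects whose box overlaps the query rectangle.

    This yields candidates only: an object is reported when its box
    overlaps the rectangle, even if its own point lies just outside.
    """
    cur = conn.execute(
        """SELECT o.name FROM box_rtree AS b
           JOIN object_primary AS o ON o.box_id = b.id
           WHERE b.max_lat >= ? AND b.min_lat <= ?
             AND b.max_lon >= ? AND b.min_lon <= ?
           ORDER BY o.id""",
        (min_lat, max_lat, min_lon, max_lon),
    )
    return [name for (name,) in cur]
```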
Table `meta` keeps the database format version and is used to check version
compatibility.

## geonlp-normalized.trie

All normalized strings are stored in a MARISA database
(https://github.com/s-yata/marisa-trie). Normalized strings are formed from
`name` and other similar fields of the `object_primary` table in
`geonlp-primary.sqlite`. All strings are pushed into the MARISA database, which
assigns an internal ID to each string.

## geonlp-normalized-id.kch

Kyoto Cabinet (https://dbmx.net/kyotocabinet/) database for linking MARISA and
primary IDs. The hash database variant is used, where the key is an ID provided by
MARISA for a search string and the value is an array of bytes consisting of
`object_primary` IDs stored as `uint32_t` one after another. The array is stored
using `std::string`.
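
The value layout described above can be illustrated with a short sketch that packs and unpacks a run of 32-bit unsigned integers. The byte order is not stated in this document; little-endian is assumed here, since that is the host order on common x86 and ARM systems. The helper names `pack_ids` and `unpack_ids` are hypothetical, and the sketch does not touch Kyoto Cabinet itself.

```python
import struct


def pack_ids(ids):
    """Serialize object_primary IDs as consecutive uint32 values.

    Assumption: little-endian byte order (not specified by the format text).
    """
    return struct.pack(f"<{len(ids)}I", *ids)


def unpack_ids(value):
    """Decode a value read from the hash database into a list of IDs."""
    if len(value) % 4:
        raise ValueError("value length is not a multiple of 4 bytes")
    return list(struct.unpack(f"<{len(value) // 4}I", value))
```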
