A simple C++ Markov chain and HTTP server, designed to be used in tandem.
This is the main method of interacting with chains locally, and can be used to load, create and run chains.
Chains are implemented via C++ maps between a phrase and a word. When a chain is queried, the query includes a soft and a hard limit. Chains split on spaces, so punctuation is included in words (e.g. 'complete.'). The soft limit is the minimum word count the chain must produce before a full stop can end the output (this does not cover cases where the chain runs out of data to spit out). For example, with 'I am a fish.', a soft limit of 2 would lead to the output ending at 'fish.'. Similarly, the hard limit is an absolute cut-off point: once the hard limit is reached the output immediately ends (e.g. 'I am a...' with a hard limit of 3).
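As a rough illustration of how the two limits interact, here is a minimal sketch. The function name `apply_limits` and the idea of passing the generated words in as a vector are assumptions for the example, not the project's actual code. It also assumes the soft limit is inclusive (a full stop may end the output once the word count is at least the soft limit) and does not append the '...' shown in the example above.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: `generated` stands in for the words the chain would
// emit one at a time; the vector ending models the chain running out of data.
std::string apply_limits(const std::vector<std::string>& generated,
                         std::size_t soft_limit, std::size_t hard_limit) {
    std::string output;
    std::size_t count = 0;
    for (const std::string& word : generated) {
        if (count >= hard_limit) break;  // hard limit: absolute cut-off
        if (!output.empty()) output += ' ';
        output += word;
        ++count;
        // Soft limit: a full stop only ends the output once enough words exist
        // (assumed inclusive here).
        if (count >= soft_limit && !word.empty() && word.back() == '.') break;
    }
    return output;
}
```

With the words 'I am a fish. I am', a soft limit of 2 and a hard limit of 10 give 'I am a fish.', while a hard limit of 3 gives 'I am a'.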
The number of words a chain has in its context is referred to as its 'length'. A chain with a length of 1 just knows what word follows another; a chain with a length of 2 knows that 'a' follows 'I am' and 'fish.' follows 'am a', in the sentence 'I am a fish.'. Furthermore, the words are case-sensitive, and include any punctuation (i.e. 'fish' is not a word, but 'fish.' is).
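To make the idea of 'length' concrete, the following is a minimal sketch of building such a map. The names `ChainMap`, `split_words` and `build_chain` are hypothetical, and storing every observed follower in a vector is an assumption made so that one phrase can have several possible next words; the real chain may store its data differently.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical type: maps a context phrase (words joined by single spaces)
// to every word observed directly after it.
using ChainMap = std::map<std::string, std::vector<std::string>>;

// Split on whitespace only, so punctuation stays attached ("fish." is one word).
std::vector<std::string> split_words(const std::string& text) {
    std::istringstream in(text);
    std::vector<std::string> words;
    std::string word;
    while (in >> word) words.push_back(word);
    return words;
}

// For each run of `length` consecutive words, record the word that follows it.
ChainMap build_chain(const std::string& text, std::size_t length) {
    assert(length >= 1);
    ChainMap chain;
    const std::vector<std::string> words = split_words(text);
    for (std::size_t i = 0; i + length < words.size(); ++i) {
        std::string phrase = words[i];
        for (std::size_t j = 1; j < length; ++j) phrase += " " + words[i + j];
        chain[phrase].push_back(words[i + length]);
    }
    return chain;
}
```

Calling `build_chain("I am a fish.", 2)` yields two entries: 'I am' followed by 'a', and 'am a' followed by 'fish.'.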
Directories are separated from the command character by a single space.
- 'n' creates a new chain
- 'l' loads a chain from a given directory*
- 't' trains a chain on a given directory
- 's' saves a chain to a given directory*
- 'c' changes a chain option (e.g. default soft/hard limit)
  - 's' for soft limit
  - 'h' for hard limit
  - 'd' sets the debug mode. Enter a 1 to enable it, or 0 to disable it
- 'd' displays information about the chain
- '>' regurgitates the rest of the input
- 'q' quits the program
Otherwise, the text is assumed to be input to the chain for regurgitation.
* Chains should be saved with a .jkc extension. If the chain has already been saved or loaded, it remembers that location, so you won't need to specify it again later.
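The command format above (one command character, a single space, then the argument) could be parsed roughly as follows. This is only a sketch: the `Command` struct and `parse_line` function are hypothetical, and the rule used to tell a command apart from plain chain input (a known command character that is either alone or followed by a space) is an assumption. Under this rule, input text that happens to start with a lone command letter, such as 's am', would be read as a command.

```cpp
#include <string>

// Hypothetical result type for one line of user input.
struct Command {
    char name;             // command character, or '\0' for plain chain input
    std::string argument;  // text after the single separating space (may be empty)
};

Command parse_line(const std::string& line) {
    const std::string commands = "nltscd>q";
    // Assumed rule: a command is a known character that is either the whole
    // line or is followed by exactly one separating space.
    const bool has_command = !line.empty()
        && commands.find(line[0]) != std::string::npos
        && (line.size() == 1 || line[1] == ' ');
    if (!has_command) return {'\0', line};
    return {line[0], line.size() > 2 ? line.substr(2) : std::string()};
}
```

For example, 'l models/a.jkc' parses as the command 'l' with the argument 'models/a.jkc', while 'dog' is treated as plain input for the chain.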
To start the server, run ./run-server (-c [server config path]) (d). The server config is optional and stores settings about the server. If a config cannot be found, values will default to:
port = 6678
model-directory = ../models/
'd' will run the server in debug mode, with a lot of info being printed to the terminal.
The container uses a two-stage build process (hence why .dockerignore is a bit bland) to create only the necessary files in the image (the server executable, config file, and models directory).
The HTTP request takes up to 4 parameters: the model, the prompt, the hard limit, and the soft limit. An example request would be 'http://localhost:6678/?model=mdl1.jkc&prompt=My%20name%20is&soft_limit=3'. Note that 'hard_limit' is not included, so it will default to whatever value is in the server's config file. For this request, the model queried would be (server.model-directory + '/mdl1.jkc').
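Since the prompt travels in the query string, spaces and other special characters need percent-encoding (a space becomes %20). The sketch below shows one way a client might build such a URL. `percent_encode` and `build_query_url` are hypothetical helpers, not part of the server, and the '/?model=' form follows the hosted example URL further down.

```cpp
#include <cctype>
#include <string>

// Percent-encode everything except RFC 3986 unreserved characters
// (letters, digits, '-', '_', '.', '~').
std::string percent_encode(const std::string& text) {
    static const char hex[] = "0123456789ABCDEF";
    std::string out;
    for (unsigned char c : text) {
        if (std::isalnum(c) || c == '-' || c == '_' || c == '.' || c == '~') {
            out += static_cast<char>(c);
        } else {
            out += '%';
            out += hex[c >> 4];
            out += hex[c & 0x0F];
        }
    }
    return out;
}

// Hypothetical helper: builds a request that sets soft_limit and leaves
// hard_limit to the server's configured default.
std::string build_query_url(const std::string& base, const std::string& model,
                            const std::string& prompt, int soft_limit) {
    return base + "/?model=" + percent_encode(model)
         + "&prompt=" + percent_encode(prompt)
         + "&soft_limit=" + std::to_string(soft_limit);
}
```

Calling `build_query_url("http://localhost:6678", "mdl1.jkc", "My name is", 3)` produces a URL whose prompt is encoded as 'My%20name%20is'.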
The chain test suite first trains a model on a very small input and tests it locally, including checking for random chance (one input phrase having multiple output words). Next, similar tests are run against the API using Python and the requests package.
Sometimes the code will throw errno 111 (connection refused on Linux), meaning the server didn't get the socket it was supposed to, so when the tests try to access the socket, they can't. Just give it a bit and try again later. This can also happen when running ./run-server, though no error message will appear, so just stop the process, wait a bit, and try again.
I was originally inspired to make something like this after reading about Nepenthes, an anti-web-scraper tar pit designed to trap crawlers (particularly gen-AI ones) that violate a website's robots.txt, using Markov babble to poison/entertain them while doing so.
This was also a project I aimed to do with minimal AI assistance (the occasional pointer towards C++ syntax), and while using git in the CLI. This is why an embarrassing number of the commits are along the lines of 'forgot x'. Setting up GitHub Actions also took some time, with the reason for most of the errors simply being that some of the files it was trying to transfer had been .gitignore'd, so weren't there. Hopefully, lesson learned.
The github actions file is originally courtesy of the University of Warwick Computing Society, and I am also using their services to host an example container - https://examplejarkov.containers.uwcs.co.uk/?model=example.jkc, trained on this file (except this last bit, unless I decide to retrain it (unlikely)).
*A combination of my name + markov, it was not intended to sound explicit.
There are a few more features I could add, but I am happy with its current state, being actually somewhat usable. One example would be finding an Accept header value that actually displays the result as JSON in my browser, though the response is already in the right format to be parsed.