diff --git a/assets/indices/contentIndex.json b/assets/indices/contentIndex.json index b371e44..2a78bb6 100644 --- a/assets/indices/contentIndex.json +++ b/assets/indices/contentIndex.json @@ -8,7 +8,7 @@ "/blog/a-c-programmer-s-guide-to-c": { "title": "A C Programmer's Guide to C++", "content": "Let's quickly go over some useful `stdlib` provided data structures and algorithms that come in handy in Competitive Programming.\n# Input / Output Streams (`cin`, `cout`)\nC++ introduces the concept of \"streams\" to supersede the older `printf` and `scanf` implementations in C. In short, streams are an abstraction of a construct used to refer to input / output sources of an unknown number of bytes. You can think of each source as a \"river\", that carries water (data) from some glacier (source, ex: keyboard, file, etc.) to a dam (buffer), from where you can consume water (parse and read input) at your convenience. Think of a user typing at a keyboard or data received over a socket connection: we do not know how many bytes we will receive or when the input will end, yet they're all streams we can listen to. In more \"data-structure\" terms, a stream is simply a queue. Input arrives at the end of the queue, and we read from the front of it. \n\nIt's important to note that the above explanation of a stream is just a concept. In C++ terms, streams are implemented as classes. `std::cin` and `std::cout` are just global objects of type `std::istream` and `std::ostream` which are guaranteed to be initialized and tied to the input and output streams respectively. So how exactly is the below code parsed and executed?\n\n```cpp\nint x = 10;\nstd::cout \u003c\u003c x \u003c\u003c '\\n';\n```\n\nThe `\u003c\u003c` operator, which you might recognize as the binary left shift operator, has actually been overloaded to accept objects of type `std::ostream` and `int` (and also `char*`, among others). Essentially, it translates to:\n\n```cpp\n((std::cout \u003c\u003c x) \u003c\u003c '\\n');\n```\n\nThis is the signature of what an overload for a generic type `T` would look like. \n\n```cpp\nstd::ostream\u0026 operator \u003c\u003c (std::ostream \u0026os, T x){\n\tos \u003c\u003c x; // Obviously, do something else here with x\n\treturn os;\n}\n```\n\nSo in the above example, we first evaluate the inner expression `(std::cout \u003c\u003c x)`. `\u003c\u003c` acts on `cout` and `x`, printing the value of `x` to `stdout`, and then returns the output stream `std::cout` again. The expression now becomes `std::cout \u003c\u003c '\\n'`, which is evaluated by printing `'\\n'` to `stdout`. \n\nSimilarly, for reading input, we use `std::cin \u003e\u003e x`.\n## Why not stick with `scanf` / `printf`?\nSo why the new fancy streams? Why not stick to C-style `scanf` and `printf`? There are several reasons. We'll go over them below.\n### Type Safety\nBoth `printf` and `scanf` rely on a \"format string\" to provide the matching type for the provided variadic arguments. This decouples the type information from the arguments themselves, and it is very easy for the function to break due to incorrect format strings being provided. With `cin` and `cout`, the `\u003c\u003c` operator is overloaded by each type that is being passed to it and hence it automatically picks the right overload for the datatype it's instructed to print. If we pass a datatype that is neither primitive nor has this overload defined, we get a safe compile error. 
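\nAs a quick sketch of the difference (the `Point` struct here is hypothetical, purely for illustration):\n\n```cpp\n#include \u003ccstdio\u003e\n#include \u003ciostream\u003e\n\nstruct Point { int x, y; };\n\nint main(void){\n\tprintf(\"%d\\n\", 3.14); // Wrong format string: still compiles (a warning at best), undefined behavior at runtime\n\tstd::cout \u003c\u003c 3.14 \u003c\u003c '\\n'; // The right overload is picked automatically\n\t// std::cout \u003c\u003c Point{1, 2} \u003c\u003c '\\n'; // Safe compile error: no operator\u003c\u003c defined for Point\n}\n```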
\n### More \"Extensible\"\nI'll use the word 'extensible' instead of 'object oriented' here, but essentially, since `\u003c\u003c` and `\u003e\u003e` are just operators that accepts stream objects and the datatype to print, it is easy for any other larger `struct` or `class` object in C++ to overload this operator for printing purposes. The alternative in C would be to define a `print_type_a` function for each such type. Note that this is also not inheritable. Whereas in C++, as long as the class had it defined, we can just do `cout \u003c\u003c obj;`. Better for implementation hiding, etc. \n### Better Error Handling\nImagine you had this C code,\n\n```c\n#include\u003cstdio.h\u003e\nint main(void){\n\tint x; \n\twhile(scanf(\"%d\", \u0026x) \u003c 1);\n\tprintf(\"x = %d\", x);\n}\n```\n\nSupposedly, this is supposed to be safe. But if the user enters the string \"abc\", then a \"10\", you'll notice you're stuck in an infinite while loop. This is because `scanf` never reads any input that it fails to parse. In this case, since `stdin` has the string \"abc\" at the beginning, it fails to parse it using `%d` conversion to an integer and hence \"abc\" stays at the beginning of `stdin` for all further iterations of this loop. You might think the correct solution is to clear the buffer before calling `scanf` again, but this is horribly wrong. Because according to the C standard, `fflush` is meant to only be called on an output stream. Essentially, `fflush(stdin)` is undefined behavior ([But it works on my machine! - Beginners guide to Undefined Behavior in C++](/blog/but-it-works-on-my-machine-beginners-guide-to-undefined-behavior-in-c)). You'd need to read wrong input into a char buffer instead. But then you run into issues where the input string might exceed the length of your buffer, etc. \n\nThe summary is that `scanf` is very easy to use incorrectly. It's a powerful function that is very easy to get wrong. This is how you'd **correctly** use `scanf` for the above example:\n```c\n#include\u003cstdio.h\u003e\nint main(void){\n\tint x, ret; \n\t/** Since we want exactly one integer read, success for us is when ret == 1.\n\t * If ret is 0, we had an early matching failure and need to clear the rest of stdin.\n\t * To do this, we instruct scanf to **parse** everything left in stdin using \"%*[^\\n]\", \n\t * - %* tells scanf to do the reading and not store it anywhere. This is to avoid buffer overflow.\n\t * - [^\\n] tells scanf to parse everything until it encounters the newline character\n\t*/\n\twhile((ret = scanf(\"%d\", \u0026x)) == 0) scanf(\"%*[^\\n]\")\n\t// Now, we still might have to deal with errors\n\tif(ret == EOF){\n\t\tif(feof(stdin)) puts(\"End of file reached\");\n\t\telse perror(\"Read error\");\n\t}\n\telse printf(\"x = %d\", x);\n}\n```\n\nClearly, it's pretty easy to mess this up unless you spent a good amount of time carefully reading the `scanf` man page. And this is a very simple example, things can get more complicated when we have to parse more complicated input. 
With `cin`, things are easier (a bit).\n```cpp\n#include\u003ciostream\u003e\n#include\u003climits\u003e\nusing namespace std;\nint main(void){\n\tint x;\n\t/** On receiving bad input, cin automatically sets the cin.fail() flag to true.\n\t * Immediately all future calls to cin fail until this is cleared using cin.clear()\n\t * We then clear up the stream as before.\n\t * \n\t*/\n\twhile(!(cin \u003e\u003e x)){ \n\t\tcin.clear(); // Clear the error bit\n\t\tcin.ignore(numeric_limits\u003cstreamsize\u003e::max(),'\\n'); // Clear up stdin\n\t}\n\tcout \u003c\u003c \"x = \" \u003c\u003c x \u003c\u003c \"\\n\";\n}\n```\n\nThings are harder to get wrong. `cin` automatically fails future reads, and there are safe ways to clear the remnants of `stdin`. Further, thanks to overloading and inheritance, it is very easy to abstract away and hide these methods in implementation details when required. \n## What is `fast io`?\nWhen dealing with problems where your C++ program needs to read and write a lot of data, it is often suggested that you speed up IO by adding these two lines to the start of the program (before any IO calls). What do they do, and why do they magically \"speed up\" IO? Why isn't this on by default? Let's dissect each statement one at a time.\n```cpp\nios::sync_with_stdio(0);\ncin.tie(0);\n```\n### `ios::sync_with_stdio(0)`\nSince `cin` and `cout` were introduced as newer, more 'modern' ways to parse and print information, it was imperative that the developers allow users to migrate code bases from `stdio` based `scanf` \u0026 `printf` legacy code to more modern `cin` / `cout` code. This meant that there might exist stages in a migration where a code base uses both `scanf` and `cin` interchangeably to read input from `stdin`. These scenarios should be handled as expected and not result in nasty surprises for the developers of these code bases. This was an important point to consider when developing C++. This meant that `iostream` would have to spend extra operations synchronizing itself with the `stdio` buffer. \n\n\u003e In practice, this means that the synchronized C++ streams are unbuffered, and each I/O operation on a C++ stream is immediately applied to the corresponding C stream's buffer. This makes it possible to freely mix C++ and C I/O.\n\u003e \n\u003e In addition, synchronized C++ streams are guaranteed to be thread-safe (individual characters output from multiple threads may interleave, but no data races occur).\n\u003e \n\u003e If the synchronization is turned off, the C++ standard streams are allowed to buffer their I/O independently, which may be considerably faster in some cases.\n\u003e \n\u003e By default, all eight standard C++ streams are synchronized with their respective C streams. \n\u003e - [std::ios_base::sync_with_stdio - cppref](https://en.cppreference.com/w/cpp/io/ios_base/sync_with_stdio) \n\n### `cin.tie(0)`\n\nFor this part, there's an excellent comment by user [-is-this-fft-](https://codeforces.com/profile/-is-this-fft-) on Codeforces that explains this line much better than I can. I'll just quote his explanation here:\n\n\u003eI think it would be good to reiterate what `tie` does because I see a lot of people copying these lines with only a vague understanding of what they do.\n\u003e\n\u003eEvery stream in C++ is tied to an output stream, which can be null.\n\u003e\n\u003eWhat does this mean? First of all, it's important to understand that when you write `std::cout \u003c\u003c \"asdf\"`, it is not necessarily immediately printed on the screen. 
It turns out that it is much better (in terms of performance) to collect it into a _buffer_ and then, at some point, _flush_ the buffer — i.e. empty its contents to the screen (or file, or any other \"device\"), all at once.\n\u003e\n\u003eBut now, consider the following. You are developing a console application and write something like:\n\u003e\n\u003e```\n\u003e1 std::cout \u003c\u003c \"Please enter your age: \";\n\u003e2 int age;\n\u003e3 std::cin \u003e\u003e age;\n\u003e```\n\u003e\n\u003eWhat would happen if `std::cout` _didn't_ get flushed before line 3? The program would expect a reply from the user, but the user hasn't even had a chance to read the question. In an application like this, it would be a good idea if `std::cout` was somehow automatically flushed when we try to read from `std::cin`.\n\u003e\n\u003eAnd this is exactly the problem `tie` solves. If you have two streams `fin` and `fout`, and `fin` is tied to `fout`, then `fout` is automatically flushed when you try to read from `fin`. So what are the defaults? What is tied to what? Quoting the C++ reference:\n\u003e\n\u003e \u003eBy default, the standard narrow streams cin and cerr are tied to cout, and their wide character counterparts (wcin and wcerr) to wcout. Library implementations may also tie clog and wclog. \n\u003e\n\u003eNow it becomes clear why people recommend using `cin.tie(0)` in competitive programming. Suppose you are solving a query problem like this.\n\u003e\n\u003e```\n\u003e1 MyAwesomeDataStructure ds;\n\u003e2 for (int i = 0; i \u003c queryc; i++) {\n\u003e3 Query q;\n\u003e4 cin \u003e\u003e q;\n\u003e5 cout \u003c\u003c ds.solve(q) \u003c\u003c '\\n';\n\u003e6 }\n\u003e```\n\u003e\n\u003eIf you didn't use `cin.tie(0)`, we would flush `cout` every time we hit line 4; this is essentially as bad as writing `endl` on line 5. Concerns about the user not seeing the question are also irrelevant now because we are given the entire input at once.\n\u003e\n\u003eThe reference doesn't explicitly say that `cout` isn't tied to anything, but on every C++ compiler I tried, `cout \u003c\u003c cout.tie() \u003c\u003c endl;` outputs 0. Also, even if you are using some strange compiler that ties `cout` to some other stream `sout`, you would only notice the performance hit if you wrote a lot to `sout` which doesn't really happen in competitive programming.\n\u003e- [-is-this-fft-'s comment on the CF Blog \"Ok, lets talk about cout.tie once and forever\"](https://codeforces.com/blog/entry/90775)
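\n\nPutting the two lines together, a typical competitive programming `main` then looks something like this (a minimal sketch; `solve()` is just a placeholder, not a standard function):\n\n```cpp\n#include \u003ciostream\u003e\n\nvoid solve(){\n\t// Read with cin, write with cout (prefer '\\n' over endl to avoid flushing)\n}\n\nint main(void){\n\tstd::ios::sync_with_stdio(false);\n\tstd::cin.tie(nullptr);\n\tsolve();\n}\n```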
\n\n# C++ Standard Template Library (STL)\nProbably the single most amazing feature C++ has to offer for competitive programmers. When solving problems, you'll often encounter situations where you will need to use data structures like balanced binary search trees, hash tables and priority queues, or algorithms like $O(nlog(n))$ sorting, binary search, etc. for coming up with efficient solutions. \n## Containers\nSTL provides you with the [Containers Library](https://en.cppreference.com/w/cpp/container) to help in these situations. Thanks to [Templates in C++](/blog/templates-in-c), they make implementing such data structures for any kind of data quick and easy. And unlike certain other languages, C++ guarantees complexity requirements for all standard library implementations to follow, so you have portable code efficiency guarantees. Further, containers provide a great layer of abstraction for us to work with types more generically. For example, all containers (bar minor exceptions) provide constant time access to their `size` via the `.size()` member function. This returns the number of elements in the container. They also provide access to `.begin()` and `.end()` which provide constant time accesses to the first and one-after-the-last elements of the container via iterators ([What are Iterators in C++?](/blog/what-are-iterators-in-c)). In C++, iterators serve as a bridge between containers and algorithms in the STL. They abstract away container-specific details, allowing algorithms to work with any container type. By focusing only on the iterator type, STL algorithms ensure the right operations are performed efficiently (choosing the right complexity overloads using [SFINAE](/blog/sfinae)), regardless of the container's underlying data structure \u0026 implementation details.\nYou can check out all the other details regarding container requirements on either [cppreference](https://en.cppreference.com/w/) or from [N3797](https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3797.pdf) (from pg 741 in the PDF). \n\nWe'll start by discussing the most popular ones in short. \n### `std::array`\nAn `array` in C++ is essentially a simple template wrapper around C-style arrays. As a consequence, `std::array` can only be used to create **fixed size** arrays. Further, it is also **stack allocated**. A common trap here is to create C-style arrays or `std::array` in deep recursive functions. The recursion depth itself is usually fine as long as each function frame is small, but when each frame contains a large stack allocated array, the frame size becomes large enough for the recursion to break the stack limit. You can fix this by increasing the stack limit (`ulimit -s unlimited`) or by using heap allocated containers.\n\n\u003eThis container is an aggregate type with the same semantics as a struct holding a [C-style array](https://en.cppreference.com/w/cpp/language/array \"cpp/language/array\") T[N] as its only non-static data member. Unlike a C-style array, it doesn't decay to T* automatically. - [array - cppref](https://en.cppreference.com/w/cpp/container/array)\n\n```cpp\nstd::array\u003cint, 4\u003e A;\nA[0] = 1;\nA[3] = 2;\nauto [a, b, c, d] = A;\ncout \u003c\u003c a \u003c\u003c ' ' \u003c\u003c d \u003c\u003c '\\n';\n// Output: 1 2\n```\n\nNote that in the above code block, $A$ is **default-initialized**. For primitive types like `int`, this means that the values are **indeterminate** / **undefined** until set. If we want to **value-initialize** them, we can declare the array using `std::array\u003cint, 4\u003e A{}` instead. For integers, this would initialize all elements to 0. \n### `std::vector`\nA `vector` in C++ is a **sequence container**.\n\n\u003e Sequence containers implement data structures which can be accessed sequentially. - [Container - cppref](https://en.cppreference.com/w/cpp/container)\n\n`std::vector` is very similar to `std::array` in that they are both made to model contiguous homogeneous array data, with one key difference. While `std::array` is fixed size and stack allocated, `std::vector` can dynamically expand and shrink as required, at the expense of slightly increased memory and time usage. Note that the time complexity for all functions is still the same as that of `std::array` (**amortized**), but with a slightly higher constant factor. 
This is because unlike `std::array`, `std::vector` is heap allocated, which means creating a `std::vector` involves making system calls to the underlying memory management system (`brk` / `mmap`). Further, to allow dynamic expansion / shrinking (from the end) while allowing for amortized constant time complexity, it needs to allocate some extra space and carry out a few copy operations at chosen intervals. You can watch this wonderful video by [MIT OCW - Erik Demaine - Data Structures and Dynamic Arrays](https://youtu.be/CHhwJjR0mZA?si=5rMS24sndCDTPbsR\u0026t=2034) to understand how this amortized constant time complexity is achieved. Finally, though it is implied, `std::vector` supports constant time random access.\n#### Useful functions\n1. **`push_back`**: Inserts an element `x` at the end of the `vector`, $O(1)$ amortized.\n2. `pop_back`: Deletes the element at the end of the `vector`, $O(1)$.\n\n```cpp\nstd::vector\u003cint\u003e A(4); // Automatically initializes to 0. Equivalent to A(4, 0).\nA[0] = 1, A[3] = 2;\n// auto [a, b, c, d] = A \u003c- Does not work since size of A is not a compile time constant\nA.push_back(5);\nA.push_back(5);\nA.pop_back();\nfor(auto \u0026x : A) cout \u003c\u003c x \u003c\u003c ' ';\n// Output: 1 0 0 2 5\n```\n### `std::deque`\n\n\u003e`std::deque` (double-ended queue) is an indexed sequence container that allows fast insertion and deletion at both its beginning and its end. In addition, insertion and deletion at either end of a deque never invalidates pointers or references to the rest of the elements. - [deque - cppref](https://en.cppreference.com/w/cpp/container/deque)\n\nWith `std::vector`, we can insert / delete from the end of the dynamic array in constant time, but insertion / deletion from the front is linear. `std::deque` tries to solve this problem by allowing fast insertion and deletion at both ends. Further, unlike with `std::vector`, the complexity is not amortized. Both insertion and deletion at either end are constant $O(1)$. \n\nSo what's the catch? It's \"constant\", in quotes. In practice, most `std::deque` implementations store sequences of individually allocated fixed-size arrays, combining a hashtable-esque bucket structure with fixed size arrays. This means that there is a significant amount of extra bookkeeping and pointer dereferences to perform. You can read this wonderful answer by [Konrad Rudolph](https://stackoverflow.com/users/1968/konrad-rudolph) on [StackOverflow](https://stackoverflow.com/) to get a better idea of how it's implemented.\n\n\u003e A deque is somewhat recursively defined: internally it maintains a double-ended queue of _chunks_ of fixed size. Each chunk is a vector, and the queue (“map” in the graphic below) of chunks itself is also a vector.\n\u003e\n\u003e![pasted-image-20240607063753](/images/pasted-image-20240607063753.webp)\n\u003eThere’s a great analysis of the performance characteristics and how it compares to the `vector` over at [CodeProject](https://www.codeproject.com/Articles/5425/An-In-Depth-Study-of-the-STL-Deque-Container).\n\u003e\n\u003eThe GCC standard library implementation internally uses a `T**` to represent the map. Each data block is a `T*` which is allocated with some fixed size `__deque_buf_size` (which depends on `sizeof(T)`).\n\u003e- [What really is a deque in STL? 
- StackOverflow](https://stackoverflow.com/questions/6292332/what-really-is-a-deque-in-stl)\n\nYou can take a look at this report by [Nitron](https://www.codeproject.com/Members/Nitron) at the above quoted link, [An In-Depth Study of the STL Deque Container - Code Project](https://www.codeproject.com/Articles/5425/An-In-Depth-Study-of-the-STL-Deque-Container), to get a better idea of just how large the constant factor is. In practice, there have been instances where an iterative segment tree solution (with its extra $log(n)$ factor) was able to get a faster execution time than a linear time `std::deque` solution. This could've been poor tests, but it also serves as a reminder to only use this data structure when absolutely required. \n#### Useful functions\n1. **`push_back`**: Inserts an element `x` at the end of the `std::deque`, constant $O(1)$.\n2. `pop_back`: Deletes the element at the end of the `std::deque`, constant $O(1)$.\n3. `push_front`: Inserts an element $x$ at the beginning of the `std::deque`, constant $O(1)$.\n4. `pop_front`: Deletes an element at the beginning of the `std::deque`, constant $O(1)$. \n5. `front`: Returns a reference to the first element in the container, constant $O(1)$.\n6. `back`: Returns a reference to the last element in the container, constant $O(1)$.\n```cpp\nstd::deque\u003cint\u003e dq;\ndq.push_back(2);\ndq.push_front(3);\ndq.push_front(5);\ndq.pop_front();\nfor(auto \u0026x : dq) cout \u003c\u003c x \u003c\u003c ' ';\n// Output: 3 2\n```\n#### `std::queue`\n`std::queue` is, well, a queue. It uses `std::deque` as its underlying container and just acts as a wrapper that gets rid of the functions which allow insertion at the beginning and deletion at the end. 
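\n\nA minimal sketch of typical usage (this is the shape most BFS loops take):\n```cpp\nstd::queue\u003cint\u003e q;\nq.push(1); // Insert at the back, O(1)\nq.push(2);\ncout \u003c\u003c q.front() \u003c\u003c '\\n'; // Output: 1\nq.pop(); // Delete from the front, O(1)\ncout \u003c\u003c q.front() \u003c\u003c '\\n'; // Output: 2\n```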
\n\n### `std::set`\nAlright, we're done with all the important sequence containers now. We'll now be dealing with **associative containers**. \n\n\u003eAssociative containers implement sorted data structures that can be quickly searched (O(log n) complexity).\n\u003e- [Container - cppref](https://en.cppreference.com/w/cpp/container)\n\n`std::set` is an associative container that somewhat models the mathematical definition of a set. Therefore, at all points, a `std::set` will only contain unique elements. `std::set` is usually implemented as some variant of a balanced binary search tree (red-black trees are a popular choice). However, note that all implementations are very heavy since they often involve multiple pointer dereferences and balancing operations in the background. This means that you should avoid this data structure if possible. But regardless of the underlying data structure, `std::set` provides you with the following functions and efficiency guarantees:\n#### Useful Functions\n1. `insert`: Allows insertion into the sorted set in $O(log(n))$ time. (Amortized $O(1)$ if you provide the iterator to the element just less than $x$ in the set)\n2. `erase`: Allows deletion from the sorted set in $O(log(n))$ time. ($O(1)$ if you provide the iterator to the element).\n3. `find`: Allows searching the set for a matching element $x$ in $O(log(n))$ time.\n4. `count`: Returns the number of elements matching $x$. Since a set contains only unique elements, this is $O(log(n))$ and the result is always 0 or 1. \n5. `begin`: Returns an iterator to the first element of the `std::set`. Constant $O(1)$ time. \n6. `end`: Returns an iterator to the element following the last element of the `set`. Constant $O(1)$ time. \n7. `lower_bound`: Returns an iterator pointing to the first element that is _not less_ than (i.e. greater or equal to) the given element $x$. $O(log(n))$ time.\n8. `upper_bound`: Returns an iterator pointing to the first element that is _greater_ than the given element $x$. $O(log(n))$ time.\n```cpp\nstd::set\u003cint\u003e s = {10, 4, 3, 6};\ns.insert(6);\ns.erase(10);\nfor(auto \u0026x : s) cout \u003c\u003c x \u003c\u003c ' ';\n// Output: 3 4 6\ncout \u003c\u003c bool(s.find(5) != s.end()) \u003c\u003c '\\n';\n// Output: 0 (not found)\ncout \u003c\u003c (*s.begin()) \u003c\u003c '\\n';\n// Output: 3\ncout \u003c\u003c (*prev(s.end())) \u003c\u003c '\\n';\n// Output: 6\ncout \u003c\u003c (*s.lower_bound(2)) \u003c\u003c '\\n';\n// Output: 3\n```\n#### `std::multiset`\nA multi-set is a variant of `std::set` which lets you store multiple copies of equal elements. There is primarily only one important change to note:\n`count`: Returns the count of an element in the multi-set. The time complexity is now logarithmic in the size of the container **plus linear in the number of elements found**. That is, $O(log(n) + f(x))$ where $f(x)$ is the frequency of element $x$ in the multi-set. \n### `std::map`\nThis is also a sorted associative container similar to `std::set`. However, instead of storing single element keys, it stores *key-value* pairs. The structure stores the keys in sorted order. Hence all the same useful functions of `std::set` are applicable for a `std::map`, just that accessing the element with $key = k_1$ returns the value $v_1$ associated with key $k_1$. \n#### Useful Functions\n1. `insert`: Allows insertion into the sorted map in $O(log(n))$ time. (Amortized $O(1)$ if you provide the iterator to the element just less than $x$ in the map)\n2. `erase`: Allows deletion from the sorted map in $O(log(n))$ time. ($O(1)$ if you provide the iterator to the element).\n3. `find`: Allows searching the map for a matching key $x$ in $O(log(n))$ time.\n4. `count`: Returns the count of the elements in the map. Since a map contains only unique keys, this is $O(log(n))$ and the result is always 0 or 1. \n5. `operator[key]`: Returns a reference to the value that is mapped to the given $key$, performing an insertion if such a key does not already exist. $O(log(n))$ time.\n```cpp\nstd::map\u003cint, std::string\u003e m;\nm.insert({42, \"Hello\"});\nm[20] = \"World\";\nfor(auto \u0026[k, v] : m) cout \u003c\u003c k \u003c\u003c \" \" \u003c\u003c v \u003c\u003c \"\\n\";\n// Output: \n// 20 World\n// 42 Hello\ncout \u003c\u003c bool(m.find(5) != m.end()) \u003c\u003c '\\n';\n// Output: 0 (not found)\nauto it = m.lower_bound(36);\ncout \u003c\u003c it-\u003efirst \u003c\u003c ' ' \u003c\u003c it-\u003esecond;\n// Output: 42 Hello\n```\n### `std::unordered_map`\nThis is the C++ equivalent of a hash table. The syntax is extremely similar to that of `std::map`; however, unlike `std::map`, `std::unordered_map` does **not** maintain sorted order. It uses a default hash function to hash objects to buckets like a hash table. This means it has the exact same syntax (barring declaration and functions like `lower_bound` and `upper_bound`) as `std::map`, but it performs insertion, deletion and search in **average** constant time complexity. Here, the cost of these operations is the cost of hashing the key and inserting, deleting or searching the bucket it was hashed into. Note that for a hash function that is not dependent on runtime randomness, you can always reverse engineer the hash function to create tests that blow up the time complexity to $O(n)$ per insert / delete / search query. You can read more about how to blow up the standard `std::unordered_map` hash function in this blog by [neal](https://codeforces.com/profile/neal) on CF: [Blowing up unordered_map, and how to stop getting hacked on it](https://codeforces.com/blog/entry/62393).
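\n\nThe usual fix, adapted from the blog linked above, is to plug in your own hash function seeded with runtime randomness. A sketch:\n```cpp\n#include \u003cchrono\u003e\n#include \u003ccstdint\u003e\n#include \u003cunordered_map\u003e\n\nstruct custom_hash {\n\tsize_t operator()(uint64_t x) const {\n\t\t// splitmix64-style mixer, seeded with the clock so tests can't be precomputed against it\n\t\tstatic const uint64_t FIXED_RANDOM = std::chrono::steady_clock::now().time_since_epoch().count();\n\t\tx += FIXED_RANDOM + 0x9e3779b97f4a7c15;\n\t\tx = (x ^ (x \u003e\u003e 30)) * 0xbf58476d1ce4e5b9;\n\t\tx = (x ^ (x \u003e\u003e 27)) * 0x94d049bb133111eb;\n\t\treturn x ^ (x \u003e\u003e 31);\n\t}\n};\n\nstd::unordered_map\u003clong long, int, custom_hash\u003e safe_map;\n```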
\n### `std::priority_queue`\nThis is not a new container per se, but it still deserves a mention here. It is a [container adapter](https://en.cppreference.com/w/cpp/container#Container_adaptors) that transforms the underlying container (by default, `std::vector`) to perform heap operations. This means it uses the underlying `std::vector` container to implement a heap that supports constant $O(1)$ time look up of the largest element and $O(log(n))$ insertion / deletion (**only from the top**). Compared to `std::set`, it does not support $O(log(n))$ search or $O(log(n))$ deletion of a random element in the priority queue. Its functions are strictly a subset of what `std::set` can do. So why bother using it? Because it's much, much faster than `std::set` in practice. Since the underlying container is a `std::vector` by default, it presumably uses $2\\cdot i$ \u0026 $2\\cdot i+1$ to access node $i$'s children and adjusts structure by iterating over parents using $p = \\lfloor \\frac{i}{2} \\rfloor$. This makes its constant factor much smaller than that of `std::set`. If you only need the subset of operations provided by `std::priority_queue`, use `std::priority_queue`.\n```cpp\npriority_queue\u003cint\u003e pq; // Use priority_queue\u003cint, vector\u003cint\u003e, greater\u003c\u003e\u003e for a min-heap\npq.push(2);\npq.push(10);\ncout \u003c\u003c pq.top() \u003c\u003c '\\n'; // Output =\u003e 10\npq.pop();\npq.push(1);\ncout \u003c\u003c pq.top() \u003c\u003c '\\n'; // Output =\u003e 2\n```\n## Algorithms\nYou can find the comprehensive list here, [Algorithms Library - cppref](https://en.cppreference.com/w/cpp/algorithm). We'll go over only a few useful (in CP) ones. \n### `std::sort`\nEasily one of the most used. `sort(a.begin(), a.end())` sorts the elements in the range $[begin, end)$ in $O(nlog(n))$ comparisons. Implementations vary between standard libraries (GCC uses introsort, a hybrid of quicksort, heapsort and insertion sort), but the constant factor is definitely very low. It accepts a custom comparator as an additional lambda argument.\n```cpp\nvector\u003cT\u003e a; \n// populate a...\nsort(a.begin(), a.end(), [\u0026](T \u0026x, T \u0026y){\n\t// some custom sorting logic. I'll just fill it in with the default comparator for this example.\n\treturn x \u003c y;\n});\n```\n**VERY IMPORTANT:** The comparator **MUST** return `false` when the two elements are equal. For a more detailed explanation, read this blog by [ouuan](https://codeforces.com/profile/ouuan) on CF: [One thing you should know about comparators — Strict Weak Ordering](https://codeforces.com/blog/entry/72525). \n### `std::reverse`\n`reverse(a.begin(), a.end())` reverses the elements in the range $[begin, end)$ in $O(n)$ time. \n### `std::unique`\n`unique(a.begin(), a.end())` removes all except the first element from every consecutive group of equivalent elements from the range $[first, last)$ and returns a past-the-end iterator for the new end of the range. 
A popular use is as an alternative to using `std::set` for counting the unique elements in a vector.\n```cpp\nvector\u003cint\u003e v = {1, 2, 2, 3, 4};\nset\u003cint\u003e s(v.begin(), v.end());\ncout \u003c\u003c s.size() \u003c\u003c '\\n'; // Output =\u003e 4\nsort(v.begin(), v.end());\nv.resize(unique(v.begin(), v.end()) - v.begin());\ncout \u003c\u003c v.size() \u003c\u003c '\\n'; // Output =\u003e 4\n```\nThe two approaches are equivalent. (**Note: You need to call `std::sort` before using `std::unique` for this use case, since `std::unique` only removes _consecutive_ duplicates.**)\n### `std::rotate`\nPerforms a left rotation on a range of elements. A useful trick to avoid wasting time figuring out indices \u0026 code. Linear time. \n# Lambdas\nYou can read more about lambdas [here](https://stackoverflow.com/questions/7627098/what-is-a-lambda-expression-and-when-should-i-use-one) and [here](https://en.cppreference.com/w/cpp/language/lambda) until I get the time to fill this section up sometime in the future. I'll leave below my absolute favorite use of lambdas in competitive programming.\n```cpp\nint main(void){\n\t// Reading input ----------------------------\n\tint n, m; cin \u003e\u003e n \u003e\u003e m;\n\tvector\u003cvector\u003cint\u003e\u003e adj(n); \n\tfor(int i=0; i \u003c m; i++){\n\t\tint u, v; cin \u003e\u003e u \u003e\u003e v; u--, v--;\n\t\tadj[u].push_back(v); adj[v].push_back(u);\n\t}\n\t// A simple dfs -----------------------------\n\tvector\u003cint\u003e vis(n);\n\tfunction\u003cvoid(int)\u003e dfs = [\u0026](int v){\n\t\tif(vis[v]) return;\n\t\tvis[v] = true;\n\t\tfor(auto \u0026to : adj[v]) dfs(to);\n\t};\n\tdfs(0);\n\t// - Very simple and concise. \n\t// - No need to declare global variables (risk of not clearing, ugly, etc.)\n\t// - Only need to pass the changing state to the lambda. All other constant \"metadata\" are captured automatically.\n}\n```", - "lastmodified": "2026-01-11T05:05:40.772904451+05:30", + "lastmodified": "2026-01-11T10:58:59.291540764+05:30", "tags": [ "domain-cs-languages-cpp" ] }, "/blog/a-deep-dive-into-the-knapsack-problem": { "title": "A Deep Dive Into the Knapsack Problem", "content": "After discussing [DP as DAGs, Shortest path on DAGs \u0026 LIS in O(nlogn)](/blog/dp-as-dags-shortest-path-on-dags-lis-in-o-nlogn), [Levenshtein Edit Distance](/blog/levenshtein-edit-distance) \u0026 [Chain Matrix Multiplication](/blog/chain-matrix-multiplication) we are finally here.\n# The Knapsack Problem\nThe Knapsack is probably one of the most famous problems used to introduce Dynamic Programming to new learners. It asks the following question, _\"Given a list of $n$ elements, each of which has some value $v_i$ and weight $w_i$ associated with it, what is the maximum value of elements I can fit into my Knapsack given that my Knapsack can only hold a maximum weight of $W$?\"_\n\nThere are two variations of the above problem as well. The simpler one assumes that we have an infinite quantity of each element. That is, we can pick an element as many times as we wish. The harder version does not assume this. Each element can only be picked once.\n\n## A toy example\n\nFor the sake of illustration, we'll assume we are attempting to solve the Knapsack for the given inputs\n\n![knapsack-1](/images/knapsack-1.webp)\n\nWe have 4 items with their respective $v_i$ and $w_i$ values. Our Knapsack has a maximum capacity of $W = 10$.\n\n### With repetition\n\nIf repetition is allowed, we can solve the problem using a very simple approach. 
All we need to observe is that to compute the maximum value for a bag of capacity $W$, we can simply brute force over all elements with a simple recurrence.\n\nLet $F(W)$ be the maximum value obtainable for a bag of capacity $W$. Then,\n\n$$ \n\\begin{aligned}\nF(W) = max(v_1+F(W-w_1), \\dots, v_n+F(W-w_n)) \\\\ \\text{In our example, this corresponds to the following computation } \\implies \\\\ F(10) = max(30+F(10-6), 14+F(10-3), 16+F(10-4), 9+F(10-2)) \\\\ \\implies F(10) = max(30 + F(4), 14+F(7), 16+F(6), 9+F(8)) \n\\end{aligned}\n$$\n\nThe idea behind this recurrence is as follows. At any capacity $W$, we are simply picking every possible element and asking what is the maximum value we can achieve **after** picking that element. It's more or less just a brute force that considers picking every element at each capacity $W$.\n\nIt is easy to see that we are computing the answer for $W$ such sub-problems, from $W_i = 1 \\to W$. And at each sub-problem, we are iterating over $n$ elements.\n\nIt is also important to note that we do not consider including an element in our brute force when doing so would reach a state where $W-w_i \\lt 0$. This is an impossible/unreachable state. The base case is when we no longer have any elements which we can fit into the bag.\n\n1. Hence we have $W$ sub-problems.\n \n2. We are doing $O(n)$ computation at every node.\n \n3. The recurrence is as described above.\n \n4. The DAG structure is also easy to reason about. It's simply just a linear chain from state\n \n $W_i = 1 \\to 2 \\to \\dots \\to W$\n \n5. Therefore, our final algorithm will have $O(nW)$ complexity.\n \n\nFurther, since there are only $O(W)$ subproblems, we only need $O(W)$ space to store the DP table.
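\n\nTo make this concrete, here is a short bottom-up sketch (the function name and signature are mine, not from any particular source):\n\n```cpp\n#include \u003calgorithm\u003e\n#include \u003cvector\u003e\nusing namespace std;\n\n// Knapsack with repetition: O(nW) time, O(W) space\nlong long knapsackWithRepetition(int W, const vector\u003cint\u003e \u0026w, const vector\u003clong long\u003e \u0026v){\n\tvector\u003clong long\u003e F(W + 1, 0); // F[c] = best value for a bag of capacity c\n\tfor(int c = 1; c \u003c= W; c++)\n\t\tfor(size_t i = 0; i \u003c w.size(); i++)\n\t\t\tif(w[i] \u003c= c) F[c] = max(F[c], v[i] + F[c - w[i]]);\n\treturn F[W];\n}\n```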
\n\n### Without repetition\n\nNotice that our previous solution will not work here, because we cannot choose elements multiple times. However, the order of choosing the elements does not matter either. Because of this condition, notice that it is not enough to simply consider subproblems defined by just one characteristic.\n\nThat is, a subproblem in the previous case was simply identified by $W$, the size of the Knapsack. Here, this is no longer the case. A \"state\" or \"subproblem\" has at **least** two changing variables: both the number of elements we are considering for inclusion in the Knapsack **and** the weight of the Knapsack.\n\n#### The new DP state\n\nThat is, we must change the definition of our DP to a 2-d DP where $DP[i][j]$ represents the state where we are considering the **first** $i$ elements among the list of available elements and our Knapsack is of size $j$.\n\n1. **Number of subproblems**\n \n Since we have $n$ possible prefixes which we will consider and $W$ possible values for the weight, we have on the order of $O(nW)$ subproblems to compute\n \n2. **Finding the solution at some node**\n \n Since we changed the definition of our DP to storing the best possible answer to the problem given that our Knapsack has size $W$ and we are only considering the first $i$ elements, when computing $DP[i][j]$ we are only deciding whether to **include** the $i^{th}$ element wherever it maximizes our answer.\n \n This has the important implication that we do not need to brute force over $n$ elements at some state $[i, j]$. We only need to check the states $[i-1, W]$ and $[i-1, W-w_i]$. This is $O(1)$ computation at every node.\n \n3. **Coming up with the recurrence**\n \n We are essentially trying to answer the question\n \n _\"At some capacity $W$, when considering the $i^{th}$ element, does including it in the Knapsack help increase the previously obtained score at capacity $W$ when considering only $i-1$ elements?\"_\n \n Writing this recurrence formally,\n \n $F(i, W) = max \\{ F(i-1, W), v_i + F(i-1, W-w_i) \\}$\n \n The first term in the max represents the previously obtained score at capacity $W$. The second term is the value we would get if we tried including element $i$ when considering a bag of size $W$.\n \n4. The **DAG structure** for this problem is very similar to the structure obtained when solving the Edit distance problem. It is simply a graph where each state $[i, W]$ depends on the states $[i-1, W]$ and $[i-1, W-w_i]$.\n \n5. We have an algorithm that requires us to perform $O(1)$ computation for each of the $O(nW)$ subproblems. Hence the total running time will be $O(nW)$. However, since there are $nW$ subproblems, we will also require $O(nW)$ space.\n \n\n#### Can we do better?\n\nThis time, we actually can! Notice that just like how we did in the Edit distance problem, the DP state at any $[i, W]$ is **ONLY** dependent on the DP states exactly one level below it. That is, every DP value in row $i$ is only dependent on the DP values in row $i-1$.\n\nThis means that again, we can do the exact same thing and use **Single Row Optimization** to reduce the space complexity of our DP from $O(nW)$ to just $O(W)$. For small values of $W$, we might even consider this linear!
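\n\nA sketch of this space-optimized version (again, names are mine; the reverse iteration over capacities is what keeps each element used at most once):\n\n```cpp\n#include \u003calgorithm\u003e\n#include \u003cvector\u003e\nusing namespace std;\n\n// Knapsack without repetition, single row optimization: O(nW) time, O(W) space\nlong long knapsackWithoutRepetition(int W, const vector\u003cint\u003e \u0026w, const vector\u003clong long\u003e \u0026v){\n\tvector\u003clong long\u003e F(W + 1, 0);\n\tfor(size_t i = 0; i \u003c w.size(); i++)\n\t\tfor(int c = W; c \u003e= w[i]; c--) // Reverse order: F[c - w[i]] still holds row i-1\n\t\t\tF[c] = max(F[c], v[i] + F[c - w[i]]);\n\treturn F[W];\n}\n```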
\n\n# Pseudo-polynomial-time algorithms\n\nAt first glance, it is very easy to write off the Knapsack problem as belonging to the $P$ complexity class ([Introduction to Complexity Theory](/blog/introduction-to-complexity-theory)). After all, it seems to just be quadratic, right?\n\nBut this is not true. We define the complexity of algorithms based on input size $n$.\n\nTo be more precise: _Time complexity measures the time that an algorithm takes as a function of the **length in bits** of its input._\n\nHowever, notice that in this case, the complexity of our algorithm relies on both $n$ and $W$. $W$ is the **value** of an input. If we consider $W$ in binary, we would require $log_2(W)$ bits to represent $W$. If the input is in binary, the algorithm becomes **exponential** in the length of that input.\n\nWhy?\n\nWe will try to explain this by means of a nice example.\n\n1. Let's say we are trying to solve the problem for $n = 3$ and $W = 8$. Keep in mind that $W = 1000$ in binary. That is, $W$ is **4 bits** long.\n \n Hence total complexity = $O(nW) \\implies O(3 \\times 8) = O(24)$\n \n2. Now let's increase $n$ to $n = 6$. We have linearly multiplied it by $2$. Notice that this still gives us\n \n Time complexity: $O(nW) \\implies O(6 \\times 8) = O(48)$. It is the expected increase by a factor of 2.\n \n3. Now let us increase the length of $W$ by a factor of 2. **Notice that this means we double the length of $W$ in bits. Not the value of $W$ itself.** This means $W$ will now be represented by $8$ bits. This means $W$ is now equal to $10000000$ in binary.\n \n This gives us a complexity of $O(nW) \\implies O(3 \\times 2^8) = O(768)$. That is, there is an exponential increase in complexity for a linear increase in the length of $W$.\n \n\n## Knapsack is NP-Complete\n\nThe Knapsack problem is, in fact, an **NP-Complete** problem. There exists no known polynomial-time algorithm for this problem. However, it is nice to know that it is often classed as _\"weakly NP-complete.\"_\n\nThat is, for small values of $W$ we can indeed solve the optimization problem in polynomial time. If the input $W$ is given in the form of small integers, it is weakly NP-Complete. But if the value $W$ is given as rational numbers, this is no longer the case.\n\n# Alternate Version of the Knapsack problem\n\nWhile we solved the Knapsack problem in the standard manner by defining $DP[i][j]$ as the maximum value achievable when considering the first $i$ elements and a bag of capacity $j$, what do we do if the value of $W$ is large, but the value of $\\sum_{i}^{n}v_i$ is small?\n\nConsider the following two problems from the [AtCoder Educational DP contest.](https://atcoder.jp/contests/dp/tasks)\n\n## Knapsack - 1\n\nThe [first problem](https://atcoder.jp/contests/dp/tasks/dp_d) is simply the standard Knapsack problem.\n\nThe constraints for it were as follows,\n\n$$\n\\begin{aligned}\n1 \\leq N \\leq 100 \\\\ 1 \\leq W \\leq 10^5 \\\\ 1 \\leq w_i \\leq W \\\\ 1 \\leq v_i \\leq 10^9 \n\\end{aligned}\n$$\n\nAn $O(nW)$ solution would take around $1e7$ operations, which should pass comfortably.\n\nHere's a link to my submission: [Submission Link](https://atcoder.jp/contests/dp/submissions/19493344)\n\n## Knapsack - 2\n\nThe [second problem](https://atcoder.jp/contests/dp/tasks/dp_e) is a little different. It asks the same question, but for different constraints.\n\n$$ \n\\begin{aligned}\n1 \\leq N \\leq 100 \\\\ 1 \\leq W \\leq 10^9 \\\\ 1 \\leq w_i \\leq W \\\\ 1 \\leq v_i \\leq 10^3 \n\\end{aligned}\n$$\n\nNotice that $W$ is now $10^9$. $O(nW)$ would now take $1e11$ operations. This would practically have a very slow running time in comparison to our previous $\\approx1e7$ operation solution.\n\nWe will have to think of something different.\n\nNotice that for this problem, the values $v_i$ are much smaller. In fact, considering $n=100$ elements, the maximum value obtainable is just $max(v_i)\\times n = 10^5$.\n\nNow, we can exploit this by doing the same Knapsack DP, but this time, instead of storing the maximum value achievable with capacity $j$ when considering the first $i$ elements, we redefine the DP as follows.\n\n$DP[i][j]$ will now store the minimum weight required to achieve value $j$ when considering just the first $i$ elements. We can now simply pick the maximum $j$ in row $i=n$ which satisfies the condition $DP[i][j] \\leq W$.\n\nThis solution runs in $O(n \\times \\sum_{i}^{n}v_i)$ which gives us $\\approx1e7$ operations. This is much faster than the standard approach.\n\nHere's a link to my submission: [Submission Link](https://atcoder.jp/contests/dp/submissions/19494460)
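\n\nA sketch of this flipped DP (a hedged illustration, not my submitted code; `INF` is just a large sentinel weight):\n\n```cpp\n#include \u003calgorithm\u003e\n#include \u003cvector\u003e\nusing namespace std;\n\n// F[j] = minimum weight needed to achieve exactly value j. O(n * sum(v)) time.\nlong long knapsackByValue(long long W, const vector\u003clong long\u003e \u0026w, const vector\u003cint\u003e \u0026v){\n\tint V = 0;\n\tfor(int x : v) V += x; // Upper bound on the total achievable value\n\tconst long long INF = 4e18;\n\tvector\u003clong long\u003e F(V + 1, INF);\n\tF[0] = 0;\n\tfor(size_t i = 0; i \u003c v.size(); i++)\n\t\tfor(int j = V; j \u003e= v[i]; j--) // Reverse order keeps each element used at most once\n\t\t\tif(F[j - v[i]] != INF) F[j] = min(F[j], F[j - v[i]] + w[i]);\n\tfor(int j = V; j \u003e= 0; j--) if(F[j] \u003c= W) return j; // Largest value within capacity\n\treturn 0;\n}\n```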
\n# References\nThese notes are old and I did not rigorously hoard references back then. If some part of this content is yours or you know where it's from then do reach out to me and I'll update it. \n1. Professor [Kannan Srinathan's](https://www.iiit.ac.in/people/faculty/srinathan/) course on Algorithm Analysis \u0026 Design in IIIT-H", - "lastmodified": "2026-01-11T05:05:40.791503111+05:30", + "lastmodified": "2026-01-11T10:59:55.63031142+05:30", "tags": [ "domain-cs-algorithms-analysis", "domain-cs-algorithms-complexity" ] }, "/blog/activity-selection-huffman-encoding": { "title": "Activity Selection \u0026 Huffman Encoding", - "content": "# Greedy Algorithms\n\nAs discussed previously, greedy algorithms are an amazing choice when we can prove that they do indeed give the optimal solution to some given problem. This signifies the importance of being able to prove the optimality of greedy solutions. In general, if the following two conditions hold, we can certainly say that a greedy strategy will give us the globally optimal answer.\n\n## Locally optimum choice\n\nGiven some problem, can we focus on its local state and solve for a solution that produces the most locally optimal solution at that state? In other words, we should be able to take _one step_ towards the optimum solution.\n\n## Optimum substructure property\n\nOnce this step is taken, even after the change in state after taking that step, are we able to restate the problem such that the new problem is the same as the original problem, albeit for a smaller input?\n\nNotice that if the answer to the above two questions is **yes**, then it is possible to prove that repeatedly taking the locally optimal choice will indeed give us the optimal solution. This is easily proven via induction.\n\nTake the optimal step at any given step $i$, now restate the problem as a smaller version of the original problem, and again take the locally optimal step at $i+1$. We can repeat this inductively until we reach the final state, where we again take the optimal choice. Since we can solve each subproblem independently simply by taking the best choice at each step, the solution **must** be optimal.\n\n# Activity Selection\n\nConsider the famous activity selection problem. The problem is as follows,\n\n_Given a set of activities $S=\\{a_1, a_2,\\dots,a_n\\}$, where activity $a_i$ takes time $[s_i, f_i)$ to finish. Find the maximum number of activities that can be picked such that there are zero overlaps, i.e., pick the subset of the maximum size where all activities are disjoint._\n\nThe naïve solution would be to brute force over all $n!$ different orderings of the activities, checking each in linear time, to find the optimal answer. This is obviously far too slow to be of much use to us. So, how can we do better? Would a **greedy** solution work?\n\n## Greedy #1\n\n**Sort the intervals by duration $|f_i-s_i|$ and greedily pick the shortest ones**\n\nDoes this satisfy our two properties? The answer is... no. Notice that by picking the shortest interval activity, we cannot restate the problem for a smaller input the same way. We do not have optimum substructure. Consider the below case.\n\n![greedy-1](/images/greedy-1.webp)\n\nGreedily we would pick the middle activity, but this removes two activities for the next step. This problem has no optimum substructure. The optimal solution would be to pick both the large intervals.\n\n## Greedy #2\n\n**Greedily pick the activities that start the earliest**\n\nThat approach follows neither property. Consider this case,\n\n![greedy-2](/images/greedy-2.webp)\n\nWe are neither picking a locally optimum choice nor maintaining an optimum substructure. The greedy solution gives 1 whereas the answer is clearly, 3.\n\n## Greedy #3\n\n**Greedily pick the activities that end the earliest**\n\nDoes this approach satisfy both criteria? The answer is... yes.\n\nLet us pick the activity that ends the earliest. If this is not part of the optimal solution and the activity it overlaps with is part of the optimal solution, notice that because the activity we picked ends earlier, our activity cannot have any other overlap. Both contribute +1 to the solution and hence our activity is locally optimal. Further, since we have picked the earliest ending activity (which is optimal) we can cross off overlaps and restate the problem for a smaller input. 
This approach maintains both properties! It **must** be right.\n\n### A more formal proof\n\nLet us suppose that we know the answer. Let the answer be $A$. Let us sort $A$ by finish time such that $\\forall a_{i\u003cn}\\in A$, $f_i \\lt f_{i+1}$\n\nNow, let our optimal choice activity be $x$. By virtue of our greedy choice, we know that\n\n$f_{x} \\leq f_{a_i} \\forall a_i \\in A$\n\nConsider $f_{a_0}$. If $x = a_0$, we are done. But if $x \\neq a_0$, notice that $f_x \\leq f_{a_0}$. This means that $x$ cannot overlap with any more activities in the set $A$ than $a_0$ does. And the set $A$ is disjoint by definition. Our solution can be written as\n\n$$ B = A-\\{a_0\\}\\cup \\{x\\} $$\n\nNotice that $x$ cannot overlap with any other element in $B$. This is because $x$ is the first to finish, so there is no overlap on the left. And $f_x \\leq f_{a_0}$ implies there is no overlap on the right, and both provide a $+1$ to the final answer. Hence $x$ **must** be an optimal choice.\n\nThis solution is **much better** than our $O(n!)$ solution and can find the optimal answer in just $O(nlogn)$. The $nlogn$ comes from the sorting requirement.
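\n\nA minimal sketch of this greedy (representing each activity as an $(f_i, s_i)$ pair so the default sort orders by finish time; the names here are mine):\n\n```cpp\n#include \u003calgorithm\u003e\n#include \u003cclimits\u003e\n#include \u003cvector\u003e\nusing namespace std;\n\n// Maximum number of disjoint activities, O(nlogn)\nint maxDisjointActivities(vector\u003cpair\u003cint,int\u003e\u003e acts){ // (f_i, s_i)\n\tsort(acts.begin(), acts.end()); // Sort by finish time\n\tint count = 0, lastFinish = INT_MIN;\n\tfor(auto \u0026[f, s] : acts){\n\t\tif(s \u003e= lastFinish){ count++; lastFinish = f; } // Intervals are [s, f), so touching is fine\n\t}\n\treturn count;\n}\n```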
\n\n# Huffman Encoding\n\n## The compression problem\n\nLet's think about how computers store text. A lot of the text on machines is stored in ASCII. ASCII is a character encoding used by our computers to represent the alphabet, punctuation, numbers, escape sequence characters, etc. Each and every ASCII character takes up _exactly_ one byte or 8 bits. The encoding chart can be found [here](https://asciichart.com/)\n\nOftentimes, computers need to communicate with one another, and sending large volumes of text is not an uncommon occurrence. Communication over networks, however, has its own cost and speed disadvantages that make sending smaller chunks of data a _very_ favorable option. This is one of the times when ranking an algorithm by **space** is preferred over ranking algorithms by **time**. As our metric for comparison between algorithms changes, so does our problem statement.\n\n_\"What is the most optimal way to losslessly compress data such that it takes up minimum space?\"_\n\nNotice that unlike video or audio compression, ASCII text compression must be **lossless**. If we lose _any_ data, we have also lost the character. This means we can no longer figure out what the original ASCII character was. These requirements give us a few basic requirements that our algorithm **must** meet.\n\n### Prefix-free property\n\nThe idea of compression is to reduce the size of the data being compressed. But ASCII spends 8 bits on every character. This means that we must try to encode data in fewer than 8 bits per character based on the frequency of occurrence. This will allow us to dedicate fewer bits to more commonly occurring characters and more bits to characters that occur almost never, thus helping us compress our data. However, this implies that we need some form of **variable-length** encoding for our characters. One variable-length encoding that might work is the binary system.\n\nHowever, notice that the following assignment will fail.\n\n$$ Space \\to 0 \\\\ e \\to 1 \\\\ t \\to 00 \\\\ \\dots $$\n\nWhen we encounter the encoding $00$ in the compressed data, we no longer know whether it is \"two spaces\" or one \"t\" character. We have lost information in our attempt to compress data. This implies that our algorithm **must** fulfill the prefix-free property. That is, while reading the compressed data, based on the prefix, we must be able to **uniquely** identify the character that it is representing. If this is not possible then we will not have an injective mapping and data will be lost.\n\n## A little detour to information theory\n\nBack in the world of information theory, Shannon laid out the 4 key axioms regarding information.\n\n1. **Information $I(x)$ and probability $P(x)$ are inversely related to each other**\n \n Consider the following thought experiment.\n \n 1. The kings of the Middle East are rich\n 2. The man washing cars here is a rich man\n \n The second sentence conveys a lot more information than the first. The first statement is highly probable and hence does not convey as much information as the second.\n \n2. **$I(x) \\geq 0$**\n \n Observing an event never causes a loss in information\n \n3. **$P(x)=1\\implies I(x) = 0$**\n \n If an event is 100% certain to occur then there is no information to be gained from it\n \n4. **$P(x\\cap y)=P(x).P(y) \\implies I(x\\cap y)=I(x)+I(y)$**\n \n Two independent events, if observed separately, give information equal to the sum of observing each one individually\n \n\nIt can be proven that the only set of functions that satisfy the above criteria are\n\n$$ I(x) = log_b(\\frac{1}{P(x)})=-log_bP(x) $$\n\nHe then went on to define a term called Information Entropy. It is a quantity that aims to model how \"unpredictable\" a distribution is. It is defined as the weighted average of the self-information of each event.\n\n$$ H(x) = \\sum_{i=1}^{n}P(x_i).I(x_i) = \\sum_{i=1}^{n}-P(x_i).log_2P(x_i) $$\n\nAn intuitive way to think of it is as follows. If an event that has a high self-information value also has a high frequency, then this will increase the entropy. This makes sense as we are essentially saying that there is some event that is hard to predict which occurs frequently. Vice versa, if low self-information (something predictable) has a high frequency then the entropy of the distribution is lower.\n\n\u003e An interesting fact to note behind the coining of the term \"Entropy\" in information theory. Shannon initially planned on calling it \"uncertainty.\" But after an encounter with John von Neumann who told him \"No one knows what entropy really is, so in a debate, you'll always have the advantage.\" he changed the term to \"Entropy\"\n## Back to algorithms!\n\nLet's say we have some encoding $E$ for our data $D$. We can measure the compression of our data by the \"Average expected length per symbol.\" This quantity is essentially just the weighted average of the lengths of each symbol in $D$ in our encoding $E$. Let's call the average length per symbol $L$.\n\nShannon discovered that the fundamental lower bound on $L$ is given as $L \\geq H(x)$. No matter what we do, we cannot compress the data to an average length lower than the information entropy of each data point occurrence.\n\nConsider the case where the letters `A`, `B`, `C`, `D` occur in our data with a frequency of $0.25$ each. We can divide the decoding process into a simple decision tree as follows,\n\n![huffman-1](/images/huffman-1.webp)\n\n### Representing the encoding as binary trees\n\nIn the above image, if we replace every **left** branch with 1 and every **right** branch with 0, we get a very interesting encoding. We get a **prefix-free** encoding that maps every character to a unique encoding. Given some bit string, all we have to do is start at the root node and follow the bit string along the tree till we reach a leaf node. 
Every path to a leaf node in a tree is unique and hence our encoding is unique. Further, since it is on a tree and we stop only after reaching the leaf node, there can be **no ambiguity**. This means that the encoding is prefix-free!\n\nIn fact, for the above data, we can do no better than the encoding above. However, when we get to work with varying probabilities, things change. Shannon and Fano came up with an encoding that used the same concept of representing the encoding on binary trees to ensure they maintain the uniqueness and prefix-free requirements.\n\nTheir algorithm began by sorting the symbols by frequency and then splitting them into two halves such that the total probability on either side of the division was as close as possible. This had the intended effect of relegating lesser used symbols to the bottom (greater depth and hence longer encoding) and more frequently used symbols to shorter encodings. This was a big achievement and was known as the Shannon-Fano encoding for a long period of time. It was a good heuristic and performed well but it was **not** optimal.\n\nNotice that with this greedy strategy, we **cannot** prove that it is taking the most optimal choice at the local level. This algorithm is **not** optimal.\n\nAt the same time, the Shannon-Fano encoding achieved both a unique representation of our data and more importantly, a prefix-free encoding that performed really well. Perhaps we can build upon their idea to obtain a prefix-free encoding with optimal compression.\n\n### Enter Huffman\n\nContrasting the top-down approach used by Shannon and Fano, Huffman viewed the problem with a _slight_ change in perspective. Instead of trying to start at the root, he claimed that if we picked the least two probable events, then they **must** be at the bottom of the tree.\n\n### Locally optimal choice\n\nWe want lesser used symbols to have longer encodings. If the above were not true, then that would imply that there is a symbol with a higher frequency of occurrence that is now given a longer encoding. This increases the size of the compression and is hence not an optimal choice. We now know for a fact that the least used symbols must belong to the bottom of the tree.\n\n### Optimum Substructure\n\nWe can write $L = \\sum_{i=1}^{n} p_i.d_i$ where $d_i$ is the depth of the $i^{th}$ node in the tree. Note that this quantity $L$ is actually the same as the sum of the probabilities of every node except the root node in our tree. Consider the following example, notice that in the expanded view, the probability of each symbol gets included as many times as its depth in the tree.\n\n![huffman-2](/images/huffman-2.webp)\n\nRemember that our goal is to minimize $L$. Let our symbols have probabilities/frequencies $p_1, p_2, \\dots, p_k$ and let us assume $p_1\\leq p_2\\leq\\dots \\leq p_k$. Using our optimal greedy choice, we can merge the two bottommost nodes into one of probability $p_1+p_2$ and then restate the equation as follows.\n\n$$ L(p_1,\\dots,p_k) = p_1+p_2+L((p_1+p_2), p_3,\\dots, p_k) $$\n\nThat is, we have managed to express the next problem as a smaller version of the original problem for which we realize that again, the greedy choice holds. We have managed to obtain the optimum substructure in our problem.\n\nThis implies that our greedy algorithm is indeed correct. **This** is the Huffman encoding.\n\nGiven some text data in the form of $(data, frequency/probability)$ tuples, we can build the Huffman tree by using the greedy logic described above. Always greedily pick the two smallest probabilities and merge them into a new node, then repeat. This is guaranteed to give us the optimal solution.
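\n\nA sketch of the merging loop (this computes just the optimal cost $L$, i.e. the sum of all internal node weights; hanging actual tree nodes off each merge would give the codes themselves):\n\n```cpp\n#include \u003cqueue\u003e\n#include \u003cvector\u003e\nusing namespace std;\n\n// Cost of an optimal prefix-free code for the given symbol frequencies\nlong long huffmanCost(const vector\u003clong long\u003e \u0026freq){\n\tpriority_queue\u003clong long, vector\u003clong long\u003e, greater\u003clong long\u003e\u003e pq(freq.begin(), freq.end());\n\tlong long cost = 0;\n\twhile(pq.size() \u003e 1){\n\t\tlong long a = pq.top(); pq.pop();\n\t\tlong long b = pq.top(); pq.pop();\n\t\tcost += a + b; // Each merge adds one internal node of weight a + b\n\t\tpq.push(a + b);\n\t}\n\treturn cost;\n}\n```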
Always greedily pick the two nodes with the smallest probabilities, merge them into a new node whose probability is their sum, and repeat. This is guaranteed to give us the optimal solution.\n\nIt is interesting to note its similarity to Shannon-Fano encoding; sometimes, all you need is the slightest shift in perspective to solve some of the world's unsolved problems :)\n\nHuffman was able to **prove** that his encoding gives us the optimal solution for encoding any set of $(data, probability)$ pairs as given. But... _can we do even better?_ Theoretically, no symbol-by-symbol encoding can do better, but there are algorithms that can reduce the size of our data even more. The primary idea used by these algorithms is to chunk the data into multi-byte symbols and then apply Huffman encoding over those chunks. Note that while we mostly referred to ASCII text, Huffman encoding can be used to losslessly compress any form of binary data.\n\nThe following video was referenced while making this diary and is the source of some of the illustrations above; I highly recommend watching [this video](https://www.youtube.com/watch?v=B3y0RsVCyrw).\n# References\nThese notes are old and I did not rigorously hoard references back then. If some part of this content is yours or you know where it's from, then do reach out to me and I'll update it. \n1. Professor [Kannan Srinathan's](https://www.iiit.ac.in/people/faculty/srinathan/) course on Algorithm Analysis \u0026 Design in IIIT-H\n2. [Huffman Codes: An Information Theory Perspective - Reducible](https://youtu.be/B3y0RsVCyrw?si=5kFqRPa_XsGxpqBr)\n\n---\n\n## Related Reading\n\n- [More Greedy Algorithms! Kruskal's \u0026 Disjoint Set Union](/blog/more-greedy-algorithms-kruskal-s-disjoint-set-union)\n- [Set Cover \u0026 Approximation Algorithms](/blog/set-cover-approximation-algorithms)\n- [DP as DAGs, Shortest path on DAGs \u0026 LIS in O(nlogn)](/blog/dp-as-dags-shortest-path-on-dags-lis-in-o-nlogn)\n", - "lastmodified": "2026-01-11T05:05:41.427889446+05:30", + "content": "# Greedy Algorithms\n\nAs discussed previously, greedy algorithms are an amazing choice when we can prove that they do indeed give the optimal solution to some given problem. This signifies the importance of being able to prove the optimality of greedy solutions. In general, if the following two conditions hold, we can certainly say that a greedy strategy will give us the globally optimal answer.\n\n## Locally optimum choice\n\nGiven some problem, can we focus on its local state and make the choice that is most optimal at that state? In other words, we should be able to take _one step_ towards the optimum solution.\n\n## Optimum substructure property\n\nOnce this step is taken, and the state has changed accordingly, are we able to restate the problem such that the new problem is the same as the original problem, albeit for a smaller input?\n\nNotice that if the answer to the above two questions is **yes**, then it is possible to prove that repeatedly taking the locally optimal choice will indeed give us the optimal solution. This is easily proven via induction.\n\nTake the optimal step at any given step $i$, restate the problem as a smaller version of the original problem, and again take the locally optimal step at step $i+1$. We can repeat this inductively until we reach the final state, where we again take the optimal choice. Since we can solve each subproblem independently simply by taking the best choice at each step, the solution **must** be optimal.\n\n# Activity Selection\n\nConsider the famous activity selection problem. 
The problem is as follows,\n\n_Given a set of activities $S=\{a_1, a_2,\dots,a_n\}$, where activity $a_i$ occupies the time interval $[s_i, f_i)$, find the maximum number of activities that can be picked such that there are zero overlaps, i.e., pick the subset of maximum size where all activities are pairwise disjoint._\n\nThe naïve solution would be to brute force over all $n!$ different orderings of the activities, checking each in linear time, to find the optimal answer. This is obviously far too slow to be of much use to us. So, how can we do better? Would a **greedy** solution work?\n\n## Greedy #1\n\n**Sort the intervals by duration $|f_i-s_i|$ and greedily pick the shortest ones**\n\nDoes this satisfy our two properties? The answer is... no. Notice that by picking the shortest interval activity, we cannot restate the problem for a smaller input the same way. We do not have optimum substructure. Consider the below case.\n\n![greedy-1](/images/greedy-1.webp)\n\nGreedily, we would pick the middle activity, but this removes two activities for the next step. This problem has no optimum substructure. The optimal solution would be to pick both the large intervals.\n\n## Greedy #2\n\n**Greedily pick the activities that start the earliest**\n\nThis approach satisfies neither property. Consider this case,\n\n![greedy-2](/images/greedy-2.webp)\n\nWe are neither picking a locally optimum choice nor maintaining an optimum substructure. The greedy solution gives 1, whereas the answer is clearly 3.\n\n## Greedy #3\n\n**Greedily pick the activities that end the earliest**\n\nDoes this approach satisfy both criteria? The answer is... yes.\n\nLet us pick the activity that ends the earliest. Suppose it is not part of the optimal solution, and some activity that overlaps with it is. Notice that because the activity we picked ends earlier, swapping it in for the overlapping activity cannot create any new overlaps. Both contribute +1 to the solution, and hence our activity is a locally optimal choice. Further, since we have picked the earliest-ending activity (which is optimal), we can cross off everything it overlaps with and restate the problem for a smaller input. This approach maintains both properties! It **must** be right.\n\n### A more formal proof\n\nLet us suppose that we know the answer. Let the answer be $A$. Let us sort $A$ by finish time such that $\forall a_{i\u003cn}\in A$, $f_i \lt f_{i+1}$.\n\nNow, let our optimal choice activity be $x$. By virtue of our greedy choice, we know that\n\n$f_{x} \leq f_{a_i} \ \forall \ a_i \in A$\n\nConsider $f_{a_0}$. If $x = a_0$, we are done. But if $x \neq a_0$, notice that $f_x \leq f_{a_0}$. This means that $x$ cannot overlap with any more activities in the set $A$ than $a_0$ does. And the set $A$ is disjoint by definition. So we can swap $a_0$ out for $x$ and write an equally large solution as\n\n$$ B = (A-\{a_0\})\cup \{x\} $$\n\nNotice that $x$ cannot overlap with any element in $B$. Since $x$ is the earliest-finishing activity, there is no overlap on its left, and $f_x \leq f_{a_0}$ implies there is no overlap on the right. Both provide a $+1$ to the final answer. Hence $x$ **must** be an optimal choice.\n\nThis solution is **much better** than our $O(n!)$ solution and can find the optimal answer in just $O(n\log n)$. The $n\log n$ factor comes from the sorting step. 
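\n\nTo make this concrete, here is a minimal sketch of the earliest-finish-time greedy in C++ (the interval data is made up for illustration):\n\n```cpp\n#include\u003cbits/stdc++.h\u003e\nusing namespace std;\n\nint main(void){\n\t// (start, finish) pairs; activity i occupies [s_i, f_i)\n\tvector\u003cpair\u003cint,int\u003e\u003e acts = {{1, 4}, {3, 5}, {0, 6}, {5, 7}, {5, 9}, {8, 9}};\n\t// Sort by finish time - this is the greedy choice\n\tsort(acts.begin(), acts.end(), [](auto \u0026a, auto \u0026b){ return a.second \u003c b.second; });\n\tint count = 0, last_finish = INT_MIN;\n\tfor(auto \u0026[s, f] : acts){\n\t\tif(s \u003e= last_finish){ // No overlap with the last picked activity\n\t\t\tcount++;\n\t\t\tlast_finish = f;\n\t\t}\n\t}\n\tcout \u003c\u003c count \u003c\u003c '\\n'; // Prints 3: picks [1,4), [5,7), [8,9)\n}\n```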
\n\n# Huffman Encoding\n\n## The compression problem\n\nLet's think about how computers store text. A lot of the text on machines is stored in ASCII. ASCII is a character encoding used by our computers to represent the alphabet, punctuation, numbers, control characters, etc. Each and every ASCII character takes up _exactly_ one byte or 8 bits. The encoding chart can be found [here](https://asciichart.com/).\n\nOftentimes, computers need to communicate with one another, and sending large volumes of text is not an uncommon occurrence. Communication over networks, however, has its own cost and speed disadvantages that make sending smaller chunks of data a _very_ favorable option. This is one of the times when ranking algorithms by **space** is preferred over ranking them by **time**. As our metric for comparison between algorithms changes, so does our problem statement.\n\n_\"What is the most optimal way to losslessly compress data such that it takes up minimum space?\"_\n\nNotice that unlike video or audio compression, ASCII text compression must be **lossless**. If we lose _any_ bits, we can no longer figure out what the original ASCII characters were. These constraints give us a few basic requirements that our algorithm **must** meet.\n\n### Prefix-free property\n\nThe idea of compression is to reduce the size of the data being compressed. But ASCII dedicates exactly 8 bits to every character. This means that we must try to encode characters in fewer than 8 bits based on their frequency of occurrence. This will allow us to dedicate fewer bits to more commonly occurring characters and more bits to characters that occur almost never, thus helping us compress our data. However, this implies that we need some form of **variable-length** encoding for our characters. One variable-length encoding that might work is simply counting in binary.\n\nHowever, notice that the following assignment will fail.\n\n$$ Space \to 0 \\ e \to 1 \\ t \to 00 \\ \dots $$\n\nWhen we encounter the encoding $00$ in the compressed data, we no longer know whether it is \"two spaces\" or one \"t\" character. We have lost information in our attempt to compress data. This implies that our algorithm **must** fulfill the prefix-free property. That is, while reading the compressed data, based on the prefix, we must be able to **uniquely** identify the character that it is representing. If this is not possible, then we will not have an injective mapping and data will be lost.\n\n## A little detour to information theory\n\nBack in the world of information theory, Shannon laid out the 4 key axioms regarding information.\n\n1. **Information $I(x)$ and probability $P(x)$ are inversely related to each other**\n \n Consider the following thought experiment.\n \n 1. The kings of the Middle East are rich\n 2. The man washing cars here is a rich man\n \n The second sentence conveys a lot more information than the first. The first statement is highly probable and hence does not convey as much information as the second.\n \n2. **$I(x) \geq 0$**\n \n Observing an event never causes a loss in information\n \n3. $P(x)=1\implies I(x) = 0$\n \n If an event is 100% certain to occur then there is no information to be gained from it\n \n4. $P(x\cap y)=P(x).P(y) \implies I(x\cap y)=I(x)+I(y)$\n \n Two independent events, if observed separately, give information equal to the sum of observing each one individually\n \n\nIt can be proven that the only family of functions that satisfies the above criteria is\n\n$$ I(x) = \log_b(\frac{1}{P(x)})=-\log_bP(x) $$\n\nHe then went on to define a term called Information Entropy. It is a quantity that aims to model how \"unpredictable\" a distribution is. 
It is defined as the weighted average of the self-information of each event.\n\n$$ H(x) = \sum_{i=1}^{n}P(x_i).I(x_i) = \sum_{i=1}^{n}-P(x_i).\log_2P(x_i) $$\n\nAn intuitive way to think of it is as follows. If an event that has a high self-information value also has a high frequency, then this will increase the entropy. This makes sense, as we are essentially saying that there is some event that is hard to predict which occurs frequently. Conversely, if a low self-information event (something predictable) has a high frequency, then the entropy of the distribution is lower.\n\n\u003e An interesting fact about the coining of the term \"Entropy\" in information theory: Shannon initially planned on calling it \"uncertainty,\" but after an encounter with John von Neumann, who told him, \"No one knows what entropy really is, so in a debate, you'll always have the advantage,\" he changed the term to \"Entropy.\"\n## Back to algorithms!\n\nLet's say we have some encoding $E$ for our data $D$. We can measure the compression of our data by the \"Average expected length per symbol.\" This quantity is essentially just the weighted average of the lengths of each symbol in $D$ in our encoding $E$. Let's call the average length per symbol $L$.\n\nShannon discovered that the fundamental lower bound on $L$ is given as $L \geq H(x)$. No matter what we do, we cannot compress the data to an average length lower than the information entropy of each data point occurrence.\n\nConsider the case where the letters `A`, `B`, `C`, `D` occur in our data with a frequency of $0.25$ each. We can divide the decoding process into a simple decision tree as follows,\n\n![huffman-1](/images/huffman-1.webp)\n\n### Representing the encoding as binary trees\n\nIn the above image, if we replace every **left** branch with 1 and every **right** branch with 0, we get a very interesting encoding. We get a **prefix-free** encoding that maps every character to a unique encoding. Given some bit string, all we have to do is start at the root node and follow the bit string along the tree till we reach a leaf node. Every path to a leaf node in a tree is unique and hence our encoding is unique. Further, since it is on a tree and we stop only after reaching the leaf node, there can be **no ambiguity**. This means that the encoding is prefix-free!\n\nIn fact, for the above data, we can do no better than the encoding above. However, when we get to work with varying probabilities, things change. Shannon and Fano came up with an encoding that used the same concept of representing the encoding on binary trees to maintain the uniqueness and prefix-free requirements.\n\nTheir algorithm began by sorting every event by frequency and then splitting the tree into two halves such that the sums of probabilities on either side of the division were as close to each other as possible. This had the intended effect of relegating lesser used symbols to the bottom (greater depth and hence longer encoding) and more frequently used symbols to shorter encodings. This was a big achievement and was known as the Shannon-Fano encoding for a long period of time. It was a good heuristic and performed well, but it was **not** optimal.\n\nNotice that with this top-down greedy strategy, we **cannot** prove that the split made at each level is a locally optimal choice. This algorithm is **not** optimal.\n\nAt the same time, the Shannon-Fano encoding achieved both a unique representation of our data and, more importantly, a prefix-free encoding that performed really well. 
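\n\nBefore moving on, the $L \geq H(x)$ bound above is easy to sanity-check numerically. A tiny sketch, with a made-up dyadic distribution where the bound happens to be met exactly:\n\n```cpp\n#include\u003cbits/stdc++.h\u003e\nusing namespace std;\n\nint main(void){\n\t// Toy distribution over 4 symbols, and the code lengths of some\n\t// prefix-free encoding for them (e.g. 0, 10, 110, 111)\n\tvector\u003cdouble\u003e p = {0.5, 0.25, 0.125, 0.125};\n\tvector\u003cint\u003e len = {1, 2, 3, 3};\n\tdouble H = 0, L = 0;\n\tfor(size_t i = 0; i \u003c p.size(); i++){\n\t\tH += -p[i] * log2(p[i]); // Entropy: weighted average self-information\n\t\tL += p[i] * len[i];      // Average expected length per symbol\n\t}\n\tcout \u003c\u003c \"H = \" \u003c\u003c H \u003c\u003c \", L = \" \u003c\u003c L \u003c\u003c '\\n'; // Both come out to 1.75\n}\n```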
\n\nPerhaps we can build upon their idea to obtain a prefix-free encoding with optimal compression.\n\n### Enter Huffman\n\nIn contrast to the top-down approach used by Shannon and Fano, Huffman viewed the problem with a _slight_ change in perspective. Instead of trying to start at the root, he claimed that if we picked the two least probable events, then they **must** be at the bottom of the tree.\n\n### Locally optimal choice\n\nWe want lesser used symbols to have longer encodings. If this were not true, then that would imply that there is a symbol with a higher frequency of occurrence that is given a longer encoding. This increases the size of the compression and is hence not an optimal choice. We now know for a fact that the least used symbols must belong to the bottom of the tree.\n\n### Optimum Substructure\n\nWe can write $L = \sum_{i=1}^{n} p_i.d_i$ where $d_i$ is the depth of the $i^{th}$ symbol's leaf in the tree. Note that this quantity $L$ is actually the same as the sum of the probabilities of every node except the root node in our tree. Consider the following example; notice that in the expanded view, the probability of each symbol gets included as many times as its depth in the tree.\n\n![huffman-2](/images/huffman-2.webp)\n\nRemember that our goal is to minimize $L$. Let our symbols have probabilities/frequencies $p_1, p_2, \dots, p_k$ and let us assume $p_1\leq p_2\leq\dots \leq p_k$. Using our optimal greedy choice, we can merge the two bottommost nodes into one of probability $p_1+p_2$ and then restate the equation as follows.\n\n$$ L(p_1,\dots,p_k) = p_1+p_2+L((p_1+p_2), p_3,\dots, p_k) $$\n\nThat is, we have managed to express the next problem as a smaller version of the original problem, for which we realize that the greedy choice again holds. We have managed to obtain the optimum substructure in our problem.\n\nThis implies that our greedy algorithm is indeed correct. **This** is the Huffman encoding.\n\nGiven some text data in the form of $(data, frequency/probability)$ tuples, we can build the Huffman tree by using the greedy logic described above, as sketched below. Always greedily pick the two nodes with the smallest probabilities, merge them into a new node whose probability is their sum, and repeat. This is guaranteed to give us the optimal solution. 
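\n\nA minimal sketch of this construction in C++, using a min-heap of probabilities (the toy input is my own). Rather than building explicit tree nodes, it computes $L$ directly using the fact noted above that $L$ equals the sum of the merged nodes' probabilities:\n\n```cpp\n#include\u003cbits/stdc++.h\u003e\nusing namespace std;\n\nint main(void){\n\t// Toy symbol probabilities\n\tpriority_queue\u003cdouble, vector\u003cdouble\u003e, greater\u003cdouble\u003e\u003e pq;\n\tfor(double p : {0.45, 0.25, 0.15, 0.10, 0.05}) pq.push(p);\n\tdouble L = 0; // Average expected length per symbol\n\twhile(pq.size() \u003e 1){\n\t\t// Greedy choice: merge the two least probable nodes\n\t\tdouble a = pq.top(); pq.pop();\n\t\tdouble b = pq.top(); pq.pop();\n\t\tL += a + b; // Each merged node's probability contributes once to L\n\t\tpq.push(a + b);\n\t}\n\tcout \u003c\u003c \"Average length per symbol: \" \u003c\u003c L \u003c\u003c '\\n'; // Prints 2\n}\n```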
\n\nIt is interesting to note its similarity to Shannon-Fano encoding; sometimes, all you need is the slightest shift in perspective to solve some of the world's unsolved problems :)\n\nHuffman was able to **prove** that his encoding gives us the optimal solution for encoding any set of $(data, probability)$ pairs as given. But... _can we do even better?_ Theoretically, no symbol-by-symbol encoding can do better, but there are algorithms that can reduce the size of our data even more. The primary idea used by these algorithms is to chunk the data into multi-byte symbols and then apply Huffman encoding over those chunks. Note that while we mostly referred to ASCII text, Huffman encoding can be used to losslessly compress any form of binary data.\n\nThe following video was referenced while making this diary and is the source of some of the illustrations above; I highly recommend watching [this video](https://www.youtube.com/watch?v=B3y0RsVCyrw).\n# References\nThese notes are old and I did not rigorously hoard references back then. If some part of this content is yours or you know where it's from, then do reach out to me and I'll update it. \n1. Professor [Kannan Srinathan's](https://www.iiit.ac.in/people/faculty/srinathan/) course on Algorithm Analysis \u0026 Design in IIIT-H\n2. [Huffman Codes: An Information Theory Perspective - Reducible](https://youtu.be/B3y0RsVCyrw?si=5kFqRPa_XsGxpqBr)", + "lastmodified": "2026-01-11T10:59:55.610389109+05:30", "tags": [ "domain-cs-algorithms-analysis" ] @@ -33,7 +33,7 @@ "/blog/algorithms-for-approximate-string-matching-alignment": { "title": "Algorithms for Approximate String Matching - Alignment", "content": "# Preface \u0026 References\nI document topics I've discovered and my exploration of these topics while following the course, [Algorithms for DNA Sequencing, by Johns Hopkins University](https://www.coursera.org/learn/dna-sequencing) on [Coursera](https://www.coursera.org/). The course is taught by two instructors, [Ben Langmead](https://scholar.google.com/citations?user=2JMaTKsAAAAJ\u0026hl=en) and [Jacob Pritt](https://www.coursera.org/instructor/jacobpritt).\n\nWe will study the fundamental ideas, techniques, and data structures needed to analyze DNA sequencing data. In order to put these and related concepts into practice, we will combine what we learn with our programming expertise. Real genome sequences and real sequencing data will be used in our study. We will use Boyer-Moore to improve upon naïve exact matching. We then learn about indexing, preprocessing, grouping and ordering in indices, k-mers and k-mer indices, and how to solve the approximate matching problem. Finally, we will discuss solving the alignment problem and explore interesting topics such as de Bruijn graphs, Eulerian walks, and the shortest common superstring problem. \n\nAlong the way, I document content I've read about while exploring related topics such as suffix string structures and relations to my research work on the STAR aligner.\n# Algorithms for Approximate Matching\n\nAs we saw previously, sequencing introduces errors, and another genome of the same species might have a $99\%+$ but not perfect match with the genome we're reconstructing. The reads we are trying to piece together will therefore suffer severely if we simply attempt exact matching. Here we rely on techniques of approximate matching to tell us where these short reads might fit together in the final puzzle. \n\n[Levenshtein Edit Distance](/blog/levenshtein-edit-distance) is a string metric which is used to quantify how different two strings (such as words) are from one another. It is calculated as the smallest number of operations needed to change one string into the other. These operations map closely to changes that occur in real DNA. Substitutions could model errors in sequencing, while insertions and deletions, along with substitutions, could model gene-splicing and related operations.\n\n## Global Alignment\n\nThe [Levenshtein Edit Distance](/blog/levenshtein-edit-distance) is what we use to solve the **global alignment** problem in DNA sequencing. Global alignment is pretty much equivalent to the edit distance problem, except for a minor change in the scoring system which we'll discuss at the end of this section. If we define a function $F(i, j)$ to be the edit distance between the first $i$ characters of string $X$ and the first $j$ characters of string $Y$, we get the base cases $F(i, 0) = i$ and $F(0, j) = j$, and the recurrence $F(i, j) = \min(F(i-1, j)+1,\ F(i, j-1)+1,\ F(i-1, j-1)+\delta(x_i, y_j))$, where $\delta$ is $0$ for a match and the substitution penalty for a mismatch.\n\n## Local Alignment\n\nLocal alignment is similar, but instead of searching for the match score between two whole sequences, it is more suited to working with short reads in a bigger sequence. That is, it is good at finding positions in a bigger text where a smaller pattern could've occurred using approximate matching. 
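\n\nHere is the global-alignment (edit distance) recurrence from the previous section as a minimal C++ sketch, assuming unit penalties; local alignment changes only the base case, as described next:\n\n```cpp\n#include\u003cbits/stdc++.h\u003e\nusing namespace std;\n\nint edit_distance(const string \u0026x, const string \u0026y){\n\tint n = x.size(), m = y.size();\n\tvector\u003cvector\u003cint\u003e\u003e F(n + 1, vector\u003cint\u003e(m + 1));\n\tfor(int i = 0; i \u003c= n; i++) F[i][0] = i; // Base cases: aligning against an\n\tfor(int j = 0; j \u003c= m; j++) F[0][j] = j; // empty string costs i (or j) edits\n\tfor(int i = 1; i \u003c= n; i++){\n\t\tfor(int j = 1; j \u003c= m; j++){\n\t\t\tint sub = F[i-1][j-1] + (x[i-1] != y[j-1]); // Match / substitution\n\t\t\tF[i][j] = min({F[i-1][j] + 1, F[i][j-1] + 1, sub}); // Deletion, insertion\n\t\t}\n\t}\n\treturn F[n][m];\n}\n\nint main(void){\n\tcout \u003c\u003c edit_distance(\"GCGTATGC\", \"GCTATAC\") \u003c\u003c '\\n'; // Prints 2\n}\n```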
\n\nLocal alignment is pretty similar to our exact pattern-finding algorithms, except that it is more versatile in how it detects matches and assigns them scores instead of performing binary exact matching. The recurrence here is pretty simple; we use the same global alignment recurrence except we change one of the base cases to:\n\n$$F(0, j) = 0$$\n\n![local-alignment](/images/local-alignment.webp)\n\nThis lets us solve the local alignment problem in the same time complexity as global alignment.\n\n## The Scoring Matrix\n\nFor edit distance, the scoring is pretty much just $\pm1$ for all operations. For DNA sequences, however, take the example of the human genome:\n\n![human-genome-scoring](/images/human-genome-scoring.webp)\n\nSimply listing the possibilities reveals that there are twice as many different types of transversions as there are different types of transitions. We may thus assume that transversions will occur twice as frequently as transitions. However, when we look at the substitutions that separate the genomes of two unrelated individuals, it turns out that transitions are actually more common than transversions. So, contrary to what we may expect, it is the opposite way around. Therefore, we should penalize transversions more severely than transitions in our penalty system. Further, indels are less frequent than substitutions, so we might want to penalize indels more than substitutions. We then modify our scoring matrix to reflect these real-world statistics in practice.\n\n![penalty-matrix](/images/penalty-matrix.webp)\n\n# Combining Both Approximate and Exact Matching\n\nIt seems like approximate matching is the solution we've been waiting for and a useful tool that will give us good matches for placing our short reads and thus help us reconstruct the sequence. This is true, but the problem herein lies in the fact that the approximate matching algorithms, while versatile, are much slower than their exact matching counterparts. While most of the exact matching algorithms run in linear time, or linear with an extra logarithmic factor, the approximate matching algorithms run in quadratic time and are usually also hard to vectorize or speed up due to the dependency between their states. \n\nIf we simply ran local alignment between each of the short reads (of which we usually have a billion or so) and the human genome (which is also a billion characters in length), the computational task is infeasible for even today's most powerful compute nodes to solve quickly. Therefore we have to come up with a mix of both approximate and exact matching algorithms to solve the overall problem quicker. Exact matching ([Booyer-Moore \u0026 Knuth-Morris-Pratt for Exact Matching](/blog/booyer-moore-knuth-morris-pratt-for-exact-matching)) is useful for pinpointing a few specific locations that we can then run approximate matching algorithms on. Consider the following figure:\n\n![exact_approximate_matching](/images/exact_approximate_matching.webp)\n\nWe begin by querying the k-mer index table for a query, which allows us to **rapidly** home in on a small set of candidate needles, which are the only places in the entire sequence we really need to run our powerful but slower approximate matching algorithms on. \n\nThus, both concepts work well **together**, kind of making up for each other's shortcomings while still accomplishing their goals. On one side, the index is extremely quick and effective at reducing the number of locations to check, but it completely lacks a natural way to manage mismatches and gaps. 
However, dynamic programming does handle mismatches and gaps rather nicely. But it would be incredibly slow if we simply used dynamic programming everywhere.\n\n", - "lastmodified": "2026-01-11T05:05:42.328518357+05:30", + "lastmodified": "2026-01-11T10:59:00.192553156+05:30", "tags": [ "domain-cs-algorithms-analysis", "domain-science-bioinformatics" ] @@ -42,7 +42,7 @@ "/blog/amdahl-s-law-gustafson-s-law": { "title": "Amdahl's Law \u0026 Gustafson's Law", "content": "# Amdahl's law\n\nBefore attempting to parallelize a program, it’s a good idea to first measure the theoretical max speedup we can achieve by parallelizing our program. Further, note that the maximum speedup we can achieve depends on the amount of computing hardware available to run the parallel code. If I have 4 cores available, I can only speed up the parallel code by 4 times. Amdahl’s law provides us with a function $S(n)$ which returns the theoretical speedup achievable when the parallel portion of our program is sped up $n$ times (say, by running it on $n$ cores).\n\nLet our program consist of some code that executes serially and some code that executes in parallel. If we denote the serial part by $s$ and the parallel part by $p$, note that $s+p = 1 \implies s = 1 - p$.\n\nNow, speedup $(S(n))$ is basically how much faster the program becomes, so if we consider the original execution time $T$ as $1$ unit of time then we can write the execution time on a parallel machine with $n$ cores as $T' = s + \frac{p}{n}$. Then the speedup,\n\n$$ S(n) = \frac{T}{T'} = \frac{s+p}{s + \frac{p}{n}} = \frac{1}{s + \frac{p}{n}} \\ \implies S(n) = \frac{1}{(1-p) + \frac{p}{n}} $$\n\nNote that the speedup $S(n)$ is bounded by $S(n) \leq \frac{1}{1-p}$. For example, if $90\%$ of our program is parallelizable ($p = 0.9$), then with $n = 10$ we only get $S(10) = \frac{1}{0.1 + 0.09} \approx 5.26$, and even with infinite cores we can never exceed $S = \frac{1}{0.1} = 10$.\n\nPlotting Amdahl’s law for different values of $p$ gives us a graph that looks like this. Even with infinite computing power to instantly run all parallel code, our speedup will be bottlenecked by the serial portion of our program.\n\nHowever, this fails to capture the general tendency most programmers have to increase problem size when given access to more computing power. This shortcoming was addressed by Gustafson’s law.\n\n![amdeded-1](/images/amdeded-1.webp)\n\n# Gustafson’s Law\n\nGustafson's law instead proposes that programmers tend to increase the size of problems to fully exploit the computing power that becomes available as the resources improve. Hence the speedup doesn’t necessarily “cap out” as predicted by Amdahl’s law. Programmers increase the problem size to benefit more from the increased parallel compute power.\n\nIf we increase the problem size, the portion of our program executing parallel code generally increases and hence benefits more. The speedup does not just cap out at some maximum if we do **not** assume a fixed problem size.\n\nGustafson proposed that $s$ and $p$ (with $s+p=1$) be the fractions of time the program spends executing serial and parallel code respectively on a **parallel** machine. Then the same work would take $T = s+np$ on a serial machine. This gives us a speedup\n\n$$ S(n) = \frac{T}{T'} = \frac{s+np}{s+p} = \frac{s+np}{1} \\ \implies S(n) = 1 + (n-1)p $$\n\nWhen the problem and system scale, the serial part (statistically) does not scale with them. Hence we get a linear relation between processor count and speedup. 
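\n\nA tiny sketch that tabulates both laws side by side for an illustrative parallel fraction of $p = 0.9$:\n\n```cpp\n#include\u003cbits/stdc++.h\u003e\nusing namespace std;\n\nint main(void){\n\tdouble p = 0.9; // Fraction of the program that is parallelizable\n\tfor(int n : {1, 2, 4, 8, 16, 1024}){\n\t\tdouble amdahl = 1.0 / ((1.0 - p) + p / n); // Fixed problem size\n\t\tdouble gustafson = 1.0 + (n - 1) * p;      // Problem size scales with n\n\t\tprintf(\"n = %4d | Amdahl: %6.2f | Gustafson: %7.2f\\n\", n, amdahl, gustafson);\n\t}\n}\n```\n\nNotice how Amdahl's prediction flattens out near $\frac{1}{1-p} = 10$ while Gustafson's keeps growing linearly.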
\n\nQuoting the Wiki,\n\n\u003e The impact of Gustafson's law was to shift research goals to select or reformulate problems so that solving a larger problem in the same amount of time would be possible. In a way, the law redefines efficiency, due to the possibility that limitations imposed by the sequential part of a program may be countered by increasing the total amount of computation.\n\n![gstf-1](/images/gstf-1.webp)\n\nWe'll discuss some cooler ways to extend these ideas in the case of task parallelism in [Brent's Theorem \u0026 Task Level Parallelism](/blog/brent-s-theorem-task-level-parallelism). \n# References\nThese notes are quite old, and I wasn't rigorously collecting references back then. If any of the content used above belongs to you or someone you know, please let me know, and I'll attribute it accordingly.", - "lastmodified": "2026-01-11T05:05:42.621167271+05:30", + "lastmodified": "2026-01-11T10:59:00.490557254+05:30", "tags": [ "domain-cs-systems-high-performance-computing" ] @@ -50,7 +50,7 @@ "/blog/analyzing-fibonacci-karatsuba-multiplication": { "title": "Analyzing Fibonacci \u0026 Karatsuba Multiplication", "content": "# Fibonacci \nLet's try analyzing a few different algorithms for computing $F_n$, the $n^{th}$ Fibonacci number.\n\n**Note:** The number grows rapidly, at an exponential rate. In fact, $F_n \approx 2^{0.694n}$, which implies that we need around $0.694n$ bits to represent the $n^{th}$ Fibonacci number. This number can grow very large, very quickly. Now it is no longer accurate to consider the addition of two numbers as a constant-time operation. For large values of $n$, the number of bits required to represent $F_n$ grows larger than any machine's word size and addition becomes a *linear*-time operation.\n\n\u003e 👾 **How do we know that $F_n \approx 2^{0.694n}$?** We can solve the recurrence relation to get the following formula:\n\u003e$$ F_n = \frac{1}{\sqrt5}((\frac{1+\sqrt5}{2})^n-(\frac{1-\sqrt5}{2})^n) $$\n\u003e\n\u003e For large $n$, the second term tends to $0$, since $|\frac{1-\sqrt5}{2}| \lt 1$. So for large $n$, the equation simplifies to $F_n = \frac{\Phi^n}{\sqrt5}$. Taking the log here, we get $\log_2(F_n) = 0.694n - \log_2\sqrt5$. Again, for larger $n$, the constant $\log_2\sqrt5$ matters less and the dominant term is simply $0.694n$.\n\u003e\n\u003e From this, we can infer that the $n^{th}$ Fibonacci number, especially for larger and larger $n$, will require about $0.694n$ bits to represent in binary\n\u003e \n\u003e Bonus: running the below C++ program allows us to verify that the above relation even holds for smaller values of n. And with increasing n, the equation only grows more accurate and hence is a very good approximation of $F_n$\n\u003e```cpp\n\u003e#include\u003cbits/stdc++.h\u003e\n\u003eusing namespace std;\n\u003e\n\u003eint main(void){\n\u003e\tlong long n;\n\u003e\tcin\u003e\u003en;\n\u003e\n\u003e\tlong long p_2 = 0;\n\u003e\tlong long p_1 = 1;\n\u003e\tfor(int i=0; i\u003cn; i++){\n\u003e\t\tlong long fib = p_2 + p_1;\n\u003e\t\tcout\u003c\u003cfib\u003c\u003c\" \";\n\u003e\t\tswap(fib, p_2);\n\u003e\t\tswap(p_1, fib);\n\u003e\t}\n\u003e\tcout\u003c\u003cendl;\n\u003e\n\u003e\tlong double phi = 1.61803398875;\n\u003e\tfor(int i=0; i\u003cn; i++){\n\u003e\t\tlong long fib = round(pow(phi, i+1) / sqrt(5));\n\u003e\t\tcout\u003c\u003cfib\u003c\u003c\" \";\n\u003e\t}\n\u003e\tcout\u003c\u003cendl;\n\u003e}\n\u003e```\n\n## Algorithm 1 for computing $F_n$\n\n```bash\nif n = 0: return 0\nif n = 1: return 1\nreturn fib1(n-1) + fib1(n-2)\n```\n\nProving **correctness** for this particular algorithm is relatively straightforward, as this algorithm is pretty much the exact definition of the Fibonacci function. 
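\n\nFor reference, here is the same algorithm as runnable C++ (a direct transcription of the pseudocode above):\n\n```cpp\n#include\u003cbits/stdc++.h\u003e\nusing namespace std;\n\n// Direct transcription of the recursive definition\nlong long fib1(long long n){\n\tif(n == 0) return 0;\n\tif(n == 1) return 1;\n\treturn fib1(n - 1) + fib1(n - 2);\n}\n\nint main(void){\n\tcout \u003c\u003c fib1(40) \u003c\u003c '\\n'; // 102334155, and already noticeably slow\n}\n```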
\n\nThis self-evident correctness is one of the most powerful features that recursion has to offer.\n\nIt is, however, important to analyze space usage as well, as recursion stacks may grow very large and potentially overflow.\n\n\u003e🛠 This is one of the reasons functional programming is a powerful idea. Functional languages have the inherent property that all code is expressed in a functional manner. This allows the code to pretty much express its own correctness proof.\n\n### Recurrence relation\n\n$T(n) = T(n - 1) + T(n - 2) + A \ \text{for} \ n \u003e 2$ where $A$ is the complexity of adding two numbers. Therefore, the time complexity of computing $F_n$ via this algorithm is $O(2^nA)$. We can visualize the recursion as a tree, with every node branching into two child nodes at every step. And at every node, we perform $A$ operations for addition. For large $F_n$, since addition is linear in the number of bits and since $F_n \approx 2^{0.694n}$, our final time complexity evaluates to $O(n2^n)$.\n\n## Algorithm 2 for computing $F_n$\n\nThe key idea used here is converting the recursion to iteration. Just keeping track of $f_{i-1}$ and $f_{i-2}$ for computing $f_i$ is enough. This idea is a very basic application of the concept of **dynamic programming.**\n\nBelow is an algorithm that keeps track of the computed Fibonacci numbers for all $i \leq n$.\n\n```bash\nif n = 0: return 0\ncreate an array f[0..n]\nf[0] = 0, f[1] = 1\nfor i = 2...n:\n\t\tf[i] = f[i-1] + f[i-2]\nreturn f[n]\n```\n\n**Note** that this is, however, not a linear-time algorithm.\n\nWhile the loop itself is linear, $F_n$ is about $0.694n$ bits long, and each addition is an $O(n)$ operation when we are dealing with arbitrarily large numbers. Therefore the overall complexity is $O(n^2)$ in the size of the input.\n\n\u003e📈 We can also observe that the space complexity of the above algorithm evaluates to about $O(n^2)$. Arbitrarily large numbers can occupy $0.694n$ bits in memory, and we are storing all values of $F_i$ for $i = 0 \dots n$.\n\u003e\n\u003eHowever, a simple optimization will help us reduce the space complexity to simply $O(n)$. We only ever need the previous two values to compute $F_i$. That is, we only need to keep track of $F_{i-1}$ and $F_{i-2}$ to compute $F_i$. The rest of the values of $F_{j \lt i-2}$ are not required. Keeping track of just these 3 values reduces the space used to the number of bits in $F_n, F_{n-1}$ and $F_{n-2}$, which is linear in the input size. Space complexity: $O(n)$.\n\u003e\n\u003eThe **key** realization here is just observing that our algorithm calculates all values of $F_i$ for **all $0\leq i\leq n$.** This is a redundancy. We only need to calculate the $n^{th}$ Fibonacci number. This realization will help us reduce the time complexity even further, as we will see below.\n\n## Algorithm 3 for computing $F_n$\n\nMotivated by our realization to eliminate the redundancy, we can attempt to make our computation even faster.\n\nLet us assume that we know $F_{i-1}$ and $F_{i-2}$ and we are attempting to compute $F_i$. Notice that to compute $F_i$, our equation looks like $F_i=1\times F_{i-1} + 1\times F_{i-2}$. This gives us $F_i$. 
Now from $F_i$, to get $F_{i+1}$, we need the term $F_{i-1}$ as well.\n\nWe get the following equations\n\n$$ F_i = F_{i-1}+F_{i-2} \\ F_{i-1} = 0\times F_{i-2} + 1\times F_{i-1} $$\n\nNotice that this set of equations can be represented nicely in matrix form, which lets us write\n\n$$ \begin{pmatrix} F_{i-1} \\ F_{i} \\ \end{pmatrix} =\begin{pmatrix} 0 \u0026 1 \\ 1 \u0026 1 \end{pmatrix} \begin{pmatrix} F_{i-2} \\ F_{i-1} \end{pmatrix} $$\n\nNotice that by repeatedly left-multiplying the RHS with our constant matrix, we can calculate any $F_n$ that we desire. This allows us to come up with the following beautiful equation.\n\n$$ \begin{pmatrix} F_n \\ F_{n+1} \\ \end{pmatrix} =\begin{pmatrix} 0 \u0026 1 \\ 1 \u0026 1 \end{pmatrix}^n \begin{pmatrix} F_0 \\ F_1 \end{pmatrix} $$\n\nTo calculate any $F_n$, we only need to know the values of $F_0, F_1$, and a constant matrix exponentiated to some $n$. Exponentiation of a constant to some power $n$ can be solved via _binary exponentiation._ Therefore the time complexity of this algorithm comes out to be $O(M(n)\log(n))$ where $M(n)$ is the time complexity for multiplying two $n$-bit integers. 
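\n\nTo make this concrete, here is a sketch of the matrix exponentiation in C++. It uses fixed-width 64-bit integers rather than arbitrary-precision ones, so it is only valid while $F_n$ fits in 64 bits:\n\n```cpp\n#include\u003cbits/stdc++.h\u003e\nusing namespace std;\ntypedef array\u003carray\u003cunsigned long long, 2\u003e, 2\u003e Mat;\n\nMat mul(const Mat \u0026a, const Mat \u0026b){\n\tMat c = {{{0, 0}, {0, 0}}};\n\tfor(int i = 0; i \u003c 2; i++)\n\t\tfor(int j = 0; j \u003c 2; j++)\n\t\t\tfor(int k = 0; k \u003c 2; k++)\n\t\t\t\tc[i][j] += a[i][k] * b[k][j];\n\treturn c;\n}\n\nunsigned long long fib(int n){\n\tMat res = {{{1, 0}, {0, 1}}}; // Identity matrix\n\tMat base = {{{0, 1}, {1, 1}}}; // The constant matrix from above\n\t// Binary exponentiation: O(log n) matrix multiplications\n\tfor(; n \u003e 0; n \u003e\u003e= 1){\n\t\tif(n \u0026 1) res = mul(res, base);\n\t\tbase = mul(base, base);\n\t}\n\t// (F_n, F_{n+1})^T = M^n (F_0, F_1)^T with F_0 = 0, F_1 = 1\n\treturn res[0][1];\n}\n\nint main(void){\n\tcout \u003c\u003c fib(90) \u003c\u003c '\\n'; // 2880067194370816120\n}\n```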
\n\n## Algorithm 4 for computing $F_n$ (Direct formula)\n\n$F_n = \frac{1}{\sqrt{5}}(\frac{1 + \sqrt{5}}{2})^{n} - \frac{1}{\sqrt{5}}(\frac{1 - \sqrt{5}}{2})^{n}$\n\nWe can also attempt to compute $F_n$ using the direct formula we obtain by solving the recurrence relation. However, notice that there are irrational numbers and divisions involved. This might give us accuracy issues depending on machine type and whatnot. This makes it very difficult to prove the accuracy of the algorithm on real machines.\n\nFurther, we can note that we still need to compute some value to the power $n$. This requires $\log_2(n)$ multiplications for the exponentiation, each costing $M(n)$ operations. This algorithm is essentially equivalent to our previous algorithm in terms of time complexity.\n\nWe also see that the eigenvalues of the matrix we use in Algorithm #3 appear in the direct formula. Therefore, it's better if we just use Algorithm #3 as we don't have to deal with irrational numbers, hence no accuracy issues arise. They are essentially two forms of the same algorithm.\n\n\u003eIn fact, if we calculate the eigenvalues of the matrix obtained in our 3rd algorithm, we get $\lambda_1 = \frac{1+\sqrt5}{2} \\ \lambda_2 = \frac{1-\sqrt5}{2}$\n\u003e\n\u003eThis further solidifies our suspicion that algorithms 4 and 3 are essentially two different ways of expressing the same idea. One is a more mathematical method to compute $F_n$ and the other, a matrix-based technique that is easier to implement on computers. They both have equal time complexity. But the matrix method is preferred as we do not have to deal with accuracy issues.\n\nNote that in all the above algorithms, the derived complexity involved the function $M(n)$. This is the number of operations required for multiplying two $n$-bit numbers.\n\nNotice that the naive algorithm for implementing $n$-bit multiplication is of the order of $n^2$. This makes algorithms 3 and 4 worse than algorithm 2, as they become $n^2\log n$ in the order of the input size. However, if we are able to reduce the complexity of the multiplication operation, we will be able to do better than algorithms 1 and 2.\n\n# Karatsuba Multiplication\n\n_Can we do better than the order of $n^2$ operations per multiplication?_\n\nThis problem is an **open** question. We know an algorithm that can do better than $n^2$, but we have not been able to prove the optimality of this algorithm.\n\n**Intuition**:\n\nMultiplying two complex numbers.\n\nTo compute $(a+ib) \times (c+id) = (ac-bd)+i(ad+bc)$, we naively require 4 multiplications.\n\nIt is, however, possible to compute this with just 3.\n\n- Compute $a \times c$\n- Compute $b\times d$\n- Compute $(a+b)\times(c+d)$\n\nNotice that $(ad+bc) = (a+b)\times(c+d)-ac-bd$\n\nLet us try to realize this same concept while multiplying two $n$-bit integers.\n\nSay we have some $n$-bit integer $X$. This implies that there are $n$ bits in its binary representation. This also means that we can divide every $n$-bit integer into 2 sets of $\frac{n}{2}$ bits each ($\pm1$).\n\n$X = 01001011 \implies x_1 = 0100, x_0 = 1011$\n\nThat is, we can write $X = 2^{\frac{n}{2}}x_1+x_0$. Notice that multiplying by $2^x$ is the same as shifting the binary by $x$ steps to the left. Hence this multiplication is just a cheap shift and not a \"real\" multiplication.\n\n**Note** that this is true for any base. Multiplying by $k$ for any number in base $k$ is equivalent to shifting.\n\nThis is essentially all we need to know for coming up with the Karatsuba algorithm ourselves.\n\n## The algorithm\n\nTo multiply any two n-bit integers,\n\n1. Add two $\frac{1}{2}n$ bit integers\n2. Multiply three pairs of $\frac{1}{2}n$ bit integers\n3. Add, subtract, and shift $\frac{1}{2}n$ bit integers to obtain the answer\n\n$$ X = 2^{\frac{n}{2}}.x_1 + x_0 \\ Y = 2^{\frac{n}{2}}.y_1 + y_0 \\ X.Y = (2^{\frac{n}{2}}.x_1 + x_0)\times(2^{\frac{n}{2}}.y_1 + y_0) \\ = 2^n.x_1.y_1 + 2^{\frac{n}{2}}((x_0+x_1)(y_0+y_1)-x_1.y_1-x_0.y_0) + x_0.y_0 $$\n\nNotice that the last step of the expansion is essentially the same as the constructive change we put forward in the multiplication of complex numbers idea to reduce the multiplications required from 4 to **3**. This allows us to multiply two $n$-bit integers with an algorithm that recursively divides its input into $\frac{n}{2}$ bit chunks and requires only 3 multiplications of $\frac{n}{2}$ bit chunks.\n\nLet's suppose that our algorithm takes $T(n)$ steps to compute. At every step, we need to calculate the following terms.\n\n1. $x_1.y_1$ which can be done in $T(\frac{n}{2})$\n2. $x_0.y_0$ which can be done in $T(\frac{n}{2})$\n3. $(x_0+x_1)(y_0+y_1)$. Notice that the addition of two $\frac{n}{2}$ bit numbers _can_ be a $1+\frac{n}{2}$ bit number. Hence this will take us $T(\frac{n}{2}+1)$ steps.\n4. Finally, once the shifts are done, we have a few $O(n)$ additions to be done.\n\nThis gives us the following result,\n\n**[Karatsuba-Ofman, 1962]** Can multiply two $n$-digit integers in $O(n^{1.585})$ bit operations.\n\n$$ T(n) \leq T(\frac{n}{2})+T(\frac{n}{2})+T(\frac{n}{2}+1) +\Theta(n) \\ T(n) = O(n^{\log_2(3)})=O(n^{1.585}) $$
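\n\nHere is a toy sketch of this recursion in C++. It operates on built-in 64-bit integers purely for illustration (a real implementation would recurse on arbitrary-precision digit arrays):\n\n```cpp\n#include\u003cbits/stdc++.h\u003e\nusing namespace std;\n\n// Toy Karatsuba on 64-bit values, splitting on half the bit-width\nunsigned long long karatsuba(unsigned long long x, unsigned long long y, int bits = 32){\n\tif(bits \u003c= 8) return x * y; // Small enough: multiply directly\n\tint half = bits / 2;\n\tunsigned long long mask = (1ULL \u003c\u003c half) - 1;\n\tunsigned long long x1 = x \u003e\u003e half, x0 = x \u0026 mask;\n\tunsigned long long y1 = y \u003e\u003e half, y0 = y \u0026 mask;\n\tunsigned long long a = karatsuba(x1, y1, half);           // x1*y1\n\tunsigned long long b = karatsuba(x0, y0, half);           // x0*y0\n\tunsigned long long c = karatsuba(x0 + x1, y0 + y1, half); // (x0+x1)(y0+y1)\n\t// X*Y = 2^bits * x1*y1 + 2^half * ((x0+x1)(y0+y1) - x1*y1 - x0*y0) + x0*y0\n\treturn (a \u003c\u003c bits) + ((c - a - b) \u003c\u003c half) + b;\n}\n\nint main(void){\n\tcout \u003c\u003c karatsuba(123456789ULL, 987654321ULL) \u003c\u003c '\\n'; // 121932631112635269\n}\n```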
\n\n## Can we do _better_?\n\nWe shall cover the algorithm in detail later, but there does indeed exist an algorithm that can do it better.\n\n- [The Fast Fourier Transform (FFT)](/blog/the-fast-fourier-transform-fft)-based algorithms are able to compute this operation in $O(n\ \log(n)\ \log(\log(n)))$\n- In 2007, Fürer discovered a new method that computes it in $O(n\ \log n\ 2^{\log^*n})$\n- The **best** (with proof of optimality) algorithm is still... an **open** problem\n\nThe fastest known algorithm till now is of the order of $O(n\log n)$ [by Harvey and van der Hoeven, 2019]. This is the [relevant paper](https://hal.archives-ouvertes.fr/hal-02070778v2/document). It begins by introducing the previously known algorithms and then deep dives into the math behind proving its upper bound. We were able to prove a lower bound on sorting, as seen in [How to analyze algorithms? Proving a lower bound for comparison based sorting](/blog/how-to-analyze-algorithms-proving-a-lower-bound-for-comparison-based-sorting). Can we do the same for this problem?\n\nNo. At the time of writing this note, we do **not know** if this algorithm is the best. The theoretical lower bound we know of is $\Omega(n)$ as, at the very least, we must process every bit of the input. There may or may not exist an algorithm better than $n\log(n)$, but we do not know of any such algorithms.\n# References\nThese notes are old and I did not rigorously hoard references back then. If some part of this content is yours or you know where it's from, then do reach out to me and I'll update it. \n1. Professor [Kannan Srinathan's](https://www.iiit.ac.in/people/faculty/srinathan/) course on Algorithm Analysis \u0026 Design in IIIT-H", - "lastmodified": "2026-01-11T05:05:42.630040389+05:30", + "lastmodified": "2026-01-11T10:59:00.499557377+05:30", "tags": [ "domain-cs-algorithms-analysis" ] @@ -58,7 +58,7 @@ "/blog/are-there-computational-problems-that-computers-cannot-solve": { "title": "Are There Computational Problems That Computers Cannot Solve?", "content": "Are there computational problems that computers cannot solve? How do we find the answer to this question? Turns out there's a very simple way to answer this question, even without defining what an \"algorithm\" is ([Church-Turing Hypothesis](/blog/church-turing-hypothesis)).\n\nNotice that if we are able to prove that there are **uncountably many** computational problems and only **countably many** computer programs, then this would imply that there must exist uncountably many problems for which **no computational solution exists**.\n\n\u003e 🧮 **Countable sets:** An infinite set is countable if there is a bijection $f:N\to S$ from the natural numbers to $S$. **Uncountable sets:** An infinite set is uncountable if it is not possible to construct a bijection $f:N\to S$ from the natural numbers to $S$. A common proof method is Cantor's diagonalization, which first assumes that it is possible to construct such a bijection and then proves that for every such bijection, we can always create a new element in the set that was not mapped before, thus disproving that any such bijection can be created.\n\u003e \n# Proving that the set of all programs is countable\n\nNow, notice that every single program that we write must be encoded as some element of the set of all finite-length bit strings, i.e., as an element of $\{ 0, 1 \}^*$. We can draw an analogy here to how every compiled C program, for example, has its own unique binary file which can be used to represent it as a finite-length bit string.\n\nTheoretically, it is true that every possible program that we can write can be uniquely encoded as some finite-length binary string. We also know that any subset of a countable set must be countable. Therefore, it suffices to prove that the set $\{ 0, 1 \}^*$ is countably infinite for the first part of our proof.\n\nEvery finite length binary string is just a natural number encoded in binary. 
This gives us a bijection from the natural numbers to the set $\{0,1\}^*$ (to handle the empty string and leading zeros, we can enumerate strings in order of length, breaking ties lexicographically; the idea is the same).\n\n$0\to0, 1\to1, 10\to2, 11\to3, 100\to4 \ \dots$\n\nThis implies that the set of all finite-length binary strings $\{ 0, 1 \}^*$ must be countably infinite. Therefore, since the set of all finite-length programs is a subset of this set, it must also be countable.\n\n# Proving that the set of all computational problems is uncountably infinite\n\nLet us prove that $P( \{0, 1\}^*)$, i.e., the power set of all finite-length bit strings, is uncountable. Notice that every problem is modeled as a decision problem. And every decision problem is characterized by a set, or its \"language.\" Therefore, every possible subset of the set of all finite-length binary strings actually represents a problem. Each subset is a unique language, and each language characterizes a unique problem.\n\nTherefore, counting the total number of computational problems essentially reduces to calculating the cardinality of the power set $P(\{0, 1\}^*)$.\n\nConsider the following function $f:\{0, 1\}^*\to\{0, 1\}$, the characteristic function of a subset. Let us pick some subset $S \subset \{0, 1\}^*$. Then the function is defined as follows:\n\n$$ f(x)= \begin{cases} 1 \ \forall \ x \in S \\ 0 \ \forall \ x \notin S \end{cases} $$\n\nNow let us calculate $f(x)$ for every such language and write it in the form of a table.\n\n![cantor-diagonalization-table](/images/cantor-diagonalization-table.webp)\n\nLet us assume that we have enumerated an infinite number of such languages. Now we will use diagonalization to prove that there will always exist some language $L_x$ that does not belong to our set.\n\nWe construct $L_x$ as follows. We move along the diagonal and flip the value of $L_i$ at the $i^{th}$ element.\n\n$L_x(\epsilon) = 0, L_x(0) = 1, L_x(1) = 1, L_x(00) = 0, L_x(01) = 0 \ \dots$\n\nWe notice that such a language $L_x$ does not belong to the set, as it differs from each $L_i$ belonging to our bijection at the $i^{th}$ element. This means we have successfully proved the existence of a language that does not belong to our bijection. No matter how many times we repeat the process of finding such a language and adding it to the bijection, we will always be able to prove the existence of such a new language that does not belong to the bijection. Hence we have proved that the power set $P(\{0, 1\}^*)$ is indeed uncountably infinite.\n\n**This implies that the cardinality of the set of all computational problems is greater than that of the set of all possible computer programs. This in turn implies that there are uncountably many computational problems that we cannot find computational solutions for.**\n\nThat is sad. But we might still hope that most of these computational problems that we **cannot** solve are also problems that we are **not interested** in solving. This is, however, **not** true. Consider the following problem,\n## Program equivalence problem\n**Definition:** _Write a program that takes two programs as input and checks whether both the programs solve the same problem._\n\nWe will prove this in further lectures, but for the sake of intuition, notice that there are many, many different ways to program an algorithm to solve a particular computational problem. 
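\n\nFor a taste of why this is hard, here are two C++ functions that compute the exact same mathematical function while sharing no structure at all (a made-up illustration):\n\n```cpp\n#include\u003cbits/stdc++.h\u003e\nusing namespace std;\n\n// Both functions solve the same problem: the sum of 1..n\nlong long sum_loop(long long n){\n\tlong long total = 0;\n\tfor(long long i = 1; i \u003c= n; i++) total += i;\n\treturn total;\n}\n\nlong long sum_formula(long long n){\n\treturn n * (n + 1) / 2;\n}\n\nint main(void){\n\tcout \u003c\u003c sum_loop(100) \u003c\u003c \" \" \u003c\u003c sum_formula(100) \u003c\u003c '\\n'; // 5050 5050\n}\n```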
\n\nIntuitively, it does not seem possible to write a program that can take two finite-length bit strings and deterministically say whether they both solve the same problem.\n\nSuch a program would be useful, as it would let us check the correctness of programs easily. However, since this is not a problem we can solve, we have resorted to probabilistic solutions that test two programs by running them on a large collection of sample test cases and checking if their outputs are the same. However, note that this is a **probabilistic** solution and not a **deterministic** solution.\n# References\nThese notes are old and I did not rigorously hoard references back then. If some part of this content is yours or you know where it's from, then do reach out to me and I'll update it. \n1. Professor [Kannan Srinathan's](https://www.iiit.ac.in/people/faculty/srinathan/) course on Algorithm Analysis \u0026 Design in IIIT-H", - "lastmodified": "2026-01-11T05:05:42.903949577+05:30", + "lastmodified": "2026-01-11T10:59:00.7775612+05:30", "tags": [ "domain-cs-algorithms-analysis", "domain-cs-algorithms-complexity" ] @@ -67,7 +67,7 @@ "/blog/basics-of-supervised-learning-linear-regression": { "title": "Basics of Supervised Learning - Linear Regression", "content": "# Preface \nBefore I get started... most of what follows is inspired by, and adapted from, [notes I originally wrote back in high school (2019)](https://github.com/akcube/notes/tree/main/machine-learning-andrew-ng), now refreshed and digitized. These notes were first put together while following the now very famous [Machine Learning specialization](https://www.coursera.org/specializations/machine-learning-introduction) by Andrew Ng on Coursera, albeit a very old version. I would also highly recommend going over [3Blue1Brown](https://www.youtube.com/@3blue1brown)'s lecture series on [Neural Networks](https://www.youtube.com/playlist?list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi); it's a delight for visual learners trying to understand backpropagation better. I suppose I don't really need to pitch Grant's work much :') but it's genuinely amazing. \n\n# History\nWhen learning a new topic, I always like to start with some history to understand the premise and \"purpose\" which led to the creation of the given topic or field. In this case, \"Artificial Intelligence\" \u0026 \"Machine Learning\" have been fields of research since the 1950s. \n\n## The Turing Test\nThe world's first computer was built in 1946, the ENIAC. But theoreticians like Turing had already been theorizing about (in his 1936 paper, [On Computable Numbers, with an Application to the Entscheidungsproblem](https://londmathsoc.onlinelibrary.wiley.com/doi/abs/10.1112/plms/s2-42.1.230)) a general purpose \"universal machine\" that could solve \"computable\" problems. It was also Turing who might have \"kicked off\" this field when he published his most cited paper, [COMPUTING MACHINERY AND INTELLIGENCE](https://courses.cs.umbc.edu/471/papers/turing.pdf) in 1950, proposing the question *\"Can machines think?\"*\n\nIn this paper, he introduced the \"Imitation Game\" (now known as the Turing Test) as a practical way to assess if a machine could \"think.\" It remains a benchmark even today, while we continue to debate between \"AI\", \"AGI\", \"ASI\", etc. His ideas drove a lot of curiosity toward answering this question, \"can machines think?\"\n\n## The Dartmouth Conference\nThis is widely recognized as the birthplace of AI. 
In 1956, John McCarthy, Marvin Minsky, Nathaniel Rochester, and Claude Shannon organized a large workshop to bring together leading researchers and formally established the field of AI as a dedicated area of study. This meeting marked the \"official\" birth of AI as a research field. John McCarthy is credited with coining the term \"artificial intelligence.\"\n\n## The Perceptron\nFollowing this conference, in 1957, Frank Rosenblatt built the world's first perceptron. It was (in today's terms) a single-layer neural network, which performed **binary** classification based on weighted inputs and a threshold. You can think of the first perceptron as something as simple as the following function: \n\n![pasted-image-20250817021443](/images/pasted-image-20250817021443.webp)\n\nEach input above ($x_i$) is assigned a weight ($w_i$). The perceptron calculates the weighted sum $\sum x_i \cdot w_i$. The red circle there is an \"activation function.\" For example, a simple binary classification function can be the following:\n\n$$\nf(x) = \begin{cases} \n0 \u0026 \text{if } x \leq 0 \\\n1 \u0026 \text{if } x \gt 0\n\end{cases}\n$$\n\nThe perceptron can \"learn\" by adjusting its weights based on the \"error\" between its predicted output and the desired output. It was intended to be used in image recognition. This was a huge achievement at that time and sparked a lot of excitement about AI. However, people soon discovered that it could not learn more complex functions (for example, any non-linear function like the XOR), and the lack of breakthroughs in the years that followed led to a period in tech known now as the \"AI winter,\" when interest and funding for AI research declined and very little progress was made. \n\n## From AI Winter to Deep Learning\nFollowing this, several other breakthroughs were made in tech during the \"AI winter.\" Notably, the internet and the age of big data. Oh, also GPUs \u0026 Nvidia. Computing power increased at an exponential scale ([Moore's law](https://en.wikipedia.org/wiki/Moore%27s_law)), the world went online, and huge amounts of data became widely available. This re-sparked the AI revolution. People were able to build much larger multi-layer perceptron networks, and they now had the compute and data required to train them to compute much more complex functions. We got Go \u0026 Chess engines better than any human in the world, and now we have the age of LLMs. \n\n# Linear Regression\nNow that we know the history \u0026 motivation for \"what\" we're trying to compute, let's ground ourselves with a simple problem, one closely related to the perceptron's early attempts at learning patterns. \n\n*Can we predict the relationship between a dependent variable ($y$, what we're predicting) and one or more independent variables ($x_i$, the variables $y$ depends on), by fitting a linear equation to some observed data?*\n\n## A Toy Problem\nFor example, let's say we believe that housing prices are linearly dependent on the size of the plot. If we plot some data points of house sales, we may end up with a chart that looks as follows:\n\n![pasted-image-20250817032023](/images/pasted-image-20250817032023.webp)\n\nLooking at the above data, it's reasonable to assume that housing prices are expected to increase linearly with plot size. 
But if I wanted to know the best \"expected\" house price for a plot of size exactly $x_i$, how could I answer that question?\n\n![pasted-image-20250817032013](/images/pasted-image-20250817032013.webp)\n\nThere's no pre-existing data point with the exact value for plot size $x_i$, so I can't regurgitate a known data point. Even if there were, it might be an outlier. I could find the best nearby $x_i$ and try to make a prediction, but what if I asked for a *very large* plot size? One which I did not have \"nearby\" pre-existing data points for? Like we mentioned previously, we could observe that the price $y$ appears to be linearly increasing with plot size $x$. We could then try to compute the \"best-fit\" linear equation to model this relationship. Let's suppose we knew this \"best-fit\" line, given by some $f(x) = mx + c$. \n\n![pasted-image-20250817032005](/images/pasted-image-20250817032005.webp)\n\nWe could then easily compute the best expected price for any given $x_i$. Awesome, but how do we compute this best fit line from our data? What does \"best fit\" even mean anyway? \n\n## Formalizing The Ideas\nLet's formalize some ideas from our discussion on solving the above problem. In the toy problem, we said that we were trying to predict housing prices as a function of plot size. In this case, housing prices are the value we want to predict. \n\nHowever, note that this value $y$ could be modeled to depend on $n$ different input variables. Think plot size, distance from the city, reputation of the builder, etc. Each of these input data points is called a **feature**. A feature is an individually measurable property or characteristic of the data that is used by the model to make predictions. \n\nA single **training example**, $x^{(i)}$, is modeled as a vector of features. For example, $x^{(0)} = [1760 \text{ sqft.}, 11 \text{ km}, 4.6, \dots]$. Henceforth, we will use $x^{(i)}$ to refer to the **$i^{th}$ training data point** and $x_j$ to refer to the **$j^{th}$ feature of training data point $x$.** You can also use both notations together. So $x^{(i)}_j$ would refer to the **$i^{th}$ training data point's $j^{th}$ feature**. $x$ itself will be a **vector of features**. $X$ will refer to the **matrix of all $m$ training data points** (we'll use $m$ for the number of training examples, since $n$ already counts features), where each row of $X$ is $x^{(i)}$. \n\n### The Hypothesis Function\nWe can now define the **hypothesis function** $H_{\theta}(x)$ as the linear *model* we're trying to learn. For a given input $x$ with $n$ features, our prediction is a **linear** combination of those features, weighted by our learned **parameters** $\theta$. \n\n$$\nH_\theta(x) = \theta_0x_0 + \theta_1x_1 + \cdots + \theta_nx_n = \sum_{i=0}^n \theta_ix_i\n$$\n\nIf we imagined $x$ and $\theta$ to both be 1-D vectors, then the above equation is simply the dot product of both the vectors. So we can simplify the above equation to just:\n\n$$\nH_\theta(x) = \theta \cdot x\n$$\n\nIn the above example, we have $n = 1$. So our hypothesis function $H_\theta(x)$ is simply $\theta_0 x_0 + \theta_1 x_1$. For simplicity, we make the convention to always set $x_0 = 1$. This gives us the simplified hypothesis function, $H_\theta(x) = \theta_0 + \theta_1 x_1$. In 2 dimensions, $\theta_0$ is simply the y-intercept and $\theta_1$ is the slope of the line. \n\n### The Cost Function\nTo learn the optimal parameters $\theta$, we need a way to measure how well our model is performing. 
Going back to the last question we raised when discussing the toy problem, \"*What does \"best fit\" even mean anyway?*\" In our toy problem, if we draw a couple of random lines onto the graph of data points, \n\n![pasted-image-20250817031949](/images/pasted-image-20250817031949.webp)\n\nIt's easy to visually observe and claim that the orange line better \"fits\" the data than the green line. But how can we quantify this notion of \"fit\"? To solve this problem, we need to come up with a **cost function**. A cost function takes the training data points, and a predicted line of best fit as input, and outputs a quantifiable value for how \"close\" the line's predicted values are to the actual training data points. One simple idea is to compute the predicted value $H_\\theta(x^{(i)})$ for each training data point $x^{(i)}$ and measure the difference between it and the actual value $y^{(i)}$ (the sum of the absolute values of these differences, to be specific). \n\n![pasted-image-20250817032413](/images/pasted-image-20250817032413.webp)\n\nThe cost function $J(\\theta)$ would then be:\n\n$$\nJ(\\theta) = \\text{Cost}(H_\\theta(x)) = \\frac{1}{n} \\cdot \\sum_i^n(|H_\\theta(x^{(i)}) - y^{(i)}|)\n$$\n\nNote that we multiply the cost by $\\frac{1}{n}$ here to normalize the error with respect to the number of available training data points. This function is called the **Mean Absolute Error (MAE)** and is a perfectly valid cost function. However, it is analytically a not-so-nice function to use as the cost for linear regression. To \"learn\" the best fit parameters $\\theta$, we usually use an algorithm that involves computing the **derivative** of our cost function. The absolute value function $|x|$'s graph has a sharp corner at $x = 0$, which means its derivative is undefined at $x = 0$ and piecewise constant otherwise ($\\pm 1$). This makes mathematical reasoning about it, and gradient based optimization, difficult. (It's not possible to derive a simple closed-form solution for it, and gradient based optimization might become unstable / harder to converge due to the sharp corner at $x = 0$.) Further, MAE would penalize all errors linearly. However, in most practical applications, we usually prefer penalizing \"large\" errors more \"strongly.\"\n\nDue to these reasons, most popular implementations of linear regression use a slightly different cost function called the **Mean Squared Error (MSE)**. In principle, it's very similar to MAE. \n\n$$\nJ(\\theta) = \\text{Cost}(H_\\theta(x)) = \\frac{1}{2n} \\cdot \\sum_i^n(H_\\theta(x^{(i)}) - y^{(i)})^2\n$$\n\nWe simply swap the absolute value function $|x|$ for $x^2$. In contrast to MAE, this function's derivative is continuous and smooth. However, note that MSE will \"punish\" large errors more strongly than MAE. You'll also notice that the denominator of our normalization fraction is now $2n$ instead of $n$. There's no significant reason for this; it just makes the derivative slightly more convenient to compute later. \n\nWe now have a simple, mathematical method to **quantify** how \"good\" or \"bad\" a set of learned parameters $\\theta$ is for predicting some $y$ based on some training data $X$. 
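\n\nAs a quick sanity check, here's a minimal sketch of the MSE cost on a tiny made-up dataset (all values below are arbitrary):\n\n```python\nimport torch\n\ndef mse_cost(theta, X, y):\n    # J(theta) = (1 / 2n) * sum((H_theta(x_i) - y_i)^2), with X carrying the x_0 = 1 column\n    n = X.shape[0]\n    residuals = X @ theta - y\n    return (residuals ** 2).sum() / (2 * n)\n\nX = torch.tensor([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]) # x_0 = 1, then the feature\ny = torch.tensor([3.0, 5.0, 7.0])                      # generated by y = 2x + 1\n\nprint(mse_cost(torch.tensor([1.0, 2.0]), X, y).item()) # perfect fit: 0.0\nprint(mse_cost(torch.tensor([0.0, 1.0]), X, y).item()) # worse fit: ~4.83\n```\n\nThat's well and good, but we still need to solve the last part of this puzzle. 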
How do we \"learn\" the best fit line, or the best set of parameters $\\theta$ for minimizing the cost?\n\n### Gradient Descent\nFinding the best set of parameters $\\theta$ now just means finding the best values for $\\theta$ that minimizes the value of the the cost function $J(\\theta)$. So how do you find such a set of inputs $\\theta$? \n\nConsider the function $y = (x-3)^2 + 2$. We can find *a* minimum by differentiating it and setting $\\frac{dy}{dx} = 0$. This gives us $\\frac{dy}{dx} = \\frac{d(x^2 - 6x + 9 + 2)}{dx} = 2x - 6 = 0 \\implies x = 3$. At $x = 3$, $y = 2$. By double differentiating it, we get $\\frac{d^2y}{dx^2} = 2 \\gt 0$ which means it's a minimum. Since this curve is concave-up shaped, it has just one minimum and hence it is the *global* minimum. \n\n![pasted-image-20250818072726](/images/pasted-image-20250818072726.webp)\n\nHowever, this same approach isn't very feasible for more complicated functions. Sometimes solving for all possible values of $\\frac{dy}{dx} = 0$ is difficult (or impossible). Checking the double derivative for complex functions might often be inconclusive and we may need to check higher order functions or use numerical methods. When we're dealing with multiple variables and higher dimensional functions, the computations can get extremely complex and difficult to compute. So instead, people rely on iterative numerical optimization algorithms. \n\n**Gradient Descent** is one such iterative optimization algorithm used to find the minimum of a function. We start with some random initial values for $\\theta$ and repeatedly update them by taking small steps in the direction of the steepest descent of the cost function. Consider this more complex function below:\n\n![pasted-image-20250818072742](/images/pasted-image-20250818072742.webp)\n\nGiven any point $w_0$, we can find out the answer to *\"Which direction should I move in to reduce the value of the function?\"* by computing the derivative (slope) of the function at that point $w_0$. If the slope is positive, we should move left to reduce the value of the function. If it's negative, we should move right. If we do this repeatedly, we'll eventually approach \u0026 reach some **local minimum** of the function. The visualization that really helps sell this idea is that of a ball rolling down the 2D hills (curves generated by the function). If we generate *enough* random initial points (or balls) and perform this procedure, we should eventually hit a very good local minimum. \n\n\u003e ![pasted-image-20250817171326](/images/pasted-image-20250817171326.webp)\n\u003e \n\u003e ![pasted-image-20250817171344](/images/pasted-image-20250817171344.webp)\n\u003e \n\u003e - Pictures credit 3b1b: [Gradient descent, how neural networks learn | Deep Learning Chapter 2](https://www.youtube.com/@3blue1brown)\n\nAlso note that this is a general explanation of gradient descent for potentially non-convex functions. In the case of linear regression on the MSE function (which is **always** convex), we are pretty much always guaranteed to hit the global minimum since the function is intuitively shaped like a curve with one minimum area at the \"center.\"\n#### Gradient of A Function\nThis idea extends to $n$ dimensional spaces as well. 
\n\n\u003e ![pasted-image-20250817171511](/images/pasted-image-20250817171511.webp)\n\u003e - Pictures credit 3b1b: [Gradient descent, how neural networks learn | Deep Learning Chapter 2](https://www.youtube.com/@3blue1brown)\n\nLet's formalize how we compute this gradient descent step for a multi-variate scalar function. Here are some terms to know:\n\n1. **Scalar Function:** A function whose output is a single number, even if the input is multi-dimensional. For example:\n\t - $f(x) = x^2$ \n\t - $f(x, y) = x^2 - y^2$\n\t - $f(x, y, z) = \\sin(x) + e^{y-z}$ \n2. **The Gradient of a Scalar Function**: You can think of it as the $n$ dimensional (multi-variate) generalized case of the derivative of a 1D function. For a scalar function $f(x_1, \\dots, x_n)$, the gradient is a **vector** representing the direction of **steepest ascent** of the function. Note that for 1D functions, the gradient was equivalent to the slope. However, even for a 2D function, notice that the \"slope\" or \"gradient\" must be a vector, since it has to point in a **direction** and have some magnitude associated with it. We compute the gradient of a multi-variate function as follows:\n \n $$\n \\nabla f(x_1, x_2, \\dots, x_n) = \\Big( \\frac{\\partial f}{\\partial x_1}, \\frac{\\partial f}{\\partial x_2}, \\cdots, \\frac{\\partial f}{\\partial x_n} \\Big)\n $$\n \n In short, we take the partial derivative of $f$ with respect to each of the input dimensions. This final vector points in the direction of steepest ascent and its magnitude tells you the strength or *steepness* of that slope. \n\nNote that we'll cover the case of a non-scalar function later, when we cover [Basics of Neural Networks](/blog/basics-of-neural-networks). \n\n#### The Algorithm\nGoing back to the original problem, we have our cost function:\n\n$$\nJ(\\theta) = \\frac{1}{2n} \\cdot \\sum_i^n(H_\\theta(x^{(i)}) - y^{(i)})^2\n$$\n\nAnd we're trying to minimize it with gradient descent. The first step is to compute the gradient for this function, $\\nabla_\\theta J(\\theta)$. If the number of parameters were only 1, then this would just be $\\frac{dJ}{d\\theta}$. Since $\\theta$ is actually a vector of parameters, we need to compute its gradient, which is defined as the vector:\n\n$$\n\\nabla_\\theta J(\\theta) =\n\\begin{bmatrix}\n\\frac{\\partial J}{\\partial \\theta_0} \\\\\n\\frac{\\partial J}{\\partial \\theta_1} \\\\\n\\vdots \\\\\n\\frac{\\partial J}{\\partial \\theta_m}\n\\end{bmatrix}\n$$\n\nSo what we need to compute here is the term $\\frac{\\partial J}{\\partial \\theta_j}$. \n\n$$\n\\frac{\\partial J}{\\partial \\theta_j} = \\frac{\\partial \\Big(\\frac{1}{2n} \\cdot \\sum_i^n(H_\\theta(x^{(i)}) - y^{(i)})^2\\Big)}{\\partial \\theta_j} = \\frac{1}{2n} \\cdot \\sum_i^n \\frac{\\partial \\Big((H_\\theta(x^{(i)}) - y^{(i)})^2\\Big)}{\\partial \\theta_j}\n$$\n\nTo compute the partial derivative here, we will use the chain rule. As a refresher, the chain rule states that if we have two functions $f$ and $g$ which are composed like $y = f(g(x))$, then the derivative $\\frac{dy}{dx} = \\frac{dy}{dg} \\cdot \\frac{dg}{dx}$. In other words, the derivative is equal to the derivative of the outer function evaluated at the inner function, times the derivative of the inner function. 
Applying this here, we get:\n\n$$\n\\frac{\\partial J}{\\partial \\theta_j} = \\frac{1}{2n} \\cdot \\sum_i^n \\Big(2 \\cdot(H_\\theta(x^{(i)}) - y^{(i)})\\Big) \\cdot \\frac{\\partial\\Big(H_\\theta(x^{(i)}) - y^{(i)}\\Big)}{\\partial \\theta_j}\n$$\n\nTo compute the last remaining partial derivative, notice that $y^{(i)}$ does not depend on $\\theta_j$. Hence its derivative with respect to $\\theta_j$ is $0$. Furthermore, if we expand $H_\\theta(x^{(i)}) = \\theta_0x^{(i)}_0 + \\theta_1x^{(i)}_1 + \\cdots + \\theta_mx^{(i)}_m$, we notice that every term except the $\\theta_j$ term goes to 0. The partial derivative is therefore just $x^{(i)}_j$ ($\\frac{d(cx)}{dx} = c$). With this, our final equation simplifies down to:\n\n$$\n\\frac{\\partial J}{\\partial \\theta_j} = \\frac{1}{n} \\cdot \\sum_i^n \\Big(H_\\theta(x^{(i)}) - y^{(i)}\\Big) \\cdot x^{(i)}_j\n$$\n\nNow, all that remains is to define the \"update\" step that our algorithm will use to nudge the parameters $\\theta$ in some direction based on the gradient of the cost function. We'll define $\\alpha$ to be the **learning rate**. It will be used to control the *size* of each step in our gradient descent. We can then define the update step as simply $\\theta_j \\coloneqq \\theta_j - \\alpha \\cdot \\frac{\\partial J}{\\partial \\theta_j}$ which when expanded is:\n\n$$\n\\theta_j \\coloneqq \\theta_j - \\alpha \\cdot \\frac{1}{n} \\cdot \\sum_{i}^n\\Big(H_\\theta(x^{(i)}) - y^{(i)}\\Big)\\cdot x^{(i)}_j\n$$\n\nBy varying the size of $\\alpha$, we can control how \"large\" the steps are that we take when attempting to find the local minima. With very small $\\alpha$, we will make very incremental and slow progress towards the minimum. With very large $\\alpha$, we run the risk of overshooting the local optimum altogether. That said, sometimes overshooting a local minimum with a larger $\\alpha$ can be useful for finding a better one. In practice, we usually run several runs with different randomized initializations of $\\theta$, and vary the step size from initially large values to smaller ones towards the end of the gradient descent process. \n\n## A Vectorized Implementation\nRemember that $\\theta$ and each $x^{(i)}$ are vectors (and the individual outputs $y^{(i)}$ stack into a vector $y$). Computing each value by looping over each entry one by one is extremely slow: each scalar computation keeps only a few CPU registers busy, and we would likely need a **lot** of memory accesses. Instead, we have a lot of specialized hardware that is purpose built to compute operations such as these, where we want to apply the exact same operation to a large amount of data. These operations fit nicely under SIMD from [Flynn's Taxonomy](/blog/flynn-s-taxonomy). In particular, we have a lot of purpose built libraries which are written to make maximum utilization of such specialized hardware for computing matrix-vector operations. Some more context on this in [Mega-Project - kBLAS (Writing a Benchmark library in C \u0026 Optimizing L1, L2 Basic Linear Algebra Subprograms)](/blog/mega-project-kblas-writing-a-benchmark-library-in-c-optimizing-l1-l2-basic-linear-algebra-subprograms). I also intend to write a blog on [GPUs](/blog/gpus) soon. In short, we should try to write our operations as matrix-vector computations whenever possible. Let's do this for linear regression. \n\nFirst, let's understand the dimensionality of our input \u0026 parameter vectors. Each of our training data points is a vector $x^{(i)}$ of dimensions $1 \\times m$. 
We can encapsulate all of our training data inputs into a single matrix $X$ of dimensions $n \\times m$. Here, each row of our matrix $X$ represents one of the training inputs. Each training input consists of $m$ features. For each training input $x^{(i)}$, we also have the training example's correct output $y^{(i)}$; stacking these gives us a vector $y$ of dimensions $n \\times 1$. Note that whether it's a row or column vector is just a choice that'll help simplify the future expressions. Similarly, we can represent our vector of parameters $\\theta$ as an $m \\times 1$ column vector (Thanks to [Claude](https://claude.ai/) yet again for awesome visualizations...).\n\n![pasted-image-20250818054022](/images/pasted-image-20250818054022.webp)\n\n1. Our hypothesis function $H_\\theta$ can be simply written as $H_\\theta(x) = x \\times \\theta$. This would in essence, compute the $1 \\times 1$ predicted output for a single input vector $x$. We can similarly model the computation for the entire training data matrix $X$ in one operation as well. \n \n ![pasted-image-20250818050535](/images/pasted-image-20250818050535.webp)\n \n ![pasted-image-20250818054135](/images/pasted-image-20250818054135.webp)\n \n2. The cost function $J(\\theta)$ can be written as $J(\\theta) = \\frac{\\sum(H_\\theta(x) - y)^2}{2n}$. $H_\\theta(X)$ can be computed by multiplying the matrix $X$ with $\\theta$ to give us the $n \\times 1$ vector. After this, computing the residual is a simple vector subtraction operation. We can then compute the sum of squares as $x^T \\times x$. When $x$ is a column vector, this gives the sum of squares of the components of $x$. This is also known as the *squared Euclidean norm* of $x$. \n \n ![pasted-image-20250818052049](/images/pasted-image-20250818052049.webp)\n \n ![pasted-image-20250818054207](/images/pasted-image-20250818054207.webp)\n \n3. The gradient, $\\nabla J(\\theta)$ is then written as $\\nabla J(\\theta) = \\frac{1}{n} \\cdot (X^T \\times (X \\times \\theta - y))$. \n \n ![pasted-image-20250818052139](/images/pasted-image-20250818052139.webp)\n \n ![pasted-image-20250818054242](/images/pasted-image-20250818054242.webp)\n\n## A `PyTorch` Implementation\nI'm also attempting to learn `PyTorch` for the first time here, so I'll be leaving some snippets here which I used to test and verify these implementations using `PyTorch`. To start off, let's import the necessary libraries and set the `seed` for them to `42`, just to make sure all experiments / findings from here are completely reproducible. \n\n```python\nimport torch\nimport torch.nn as nn # We'll need PyTorch for running the above algorithms\nimport numpy as np # Numpy for helping with plotting\nimport matplotlib.pyplot as plt # Matplotlib to actually plot graphs\n\ntorch.manual_seed(42)\nnp.random.seed(42)\n```\n\nNext up, I'm going to create some sample training data by using a base function, say $y = 2.5x + 1$, and adding some random noise to it. If you want to know what the function `plot_xy` does, I made a bunch of helper functions to quickly visualize plots. They're mostly irrelevant. \n\n```python\n# Let's make our sample data\nNUM_SAMPLES = 100\nNUM_FEATURES = 1\nREAL_WEIGHT = 2.5\nREAL_BIAS = 1.0\n\nX = torch.randn(NUM_SAMPLES, NUM_FEATURES)\ny = REAL_WEIGHT * X + REAL_BIAS + torch.randn(NUM_SAMPLES, 1) * 0.5 # The randn * 0.5 is our noise term\n\nplot_xy(X, y, REAL_WEIGHT, REAL_BIAS)\n```\n\nThe plot generated is as follows:\n\n![pasted-image-20250818090357](/images/pasted-image-20250818090357.webp)\n\nSo far so good. 
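\n\nBefore reaching for `nn.Linear`, we can sanity-check the vectorized formulas above by running the update rule directly on this data. A minimal from-scratch sketch (reusing the `X` and `y` generated above; the learning rate and iteration count are arbitrary):\n\n```python\n# Vectorized gradient descent from scratch, using the formulas above.\nXb = torch.cat([torch.ones(NUM_SAMPLES, 1), X], dim=1) # prepend the x_0 = 1 column\ntheta = torch.zeros(NUM_FEATURES + 1, 1)               # [bias, weight] as an (m+1) x 1 vector\n\nalpha = 0.1\nfor _ in range(1000):\n    residuals = Xb @ theta - y            # H_theta(X) - y, an n x 1 vector\n    grad = Xb.T @ residuals / NUM_SAMPLES # (1/n) * X^T (X theta - y)\n    theta = theta - alpha * grad          # the update step\n\nprint(theta.flatten()) # should land near (1.0, 2.5), i.e. (bias, weight)\n```\n\n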
We now need to create our model parameters, the model itself and pick the loss function and optimizer we want to use to learn our parameters. To do this in `PyTorch` we do the following:\n\n```python\n# Initialize model, loss, and optimizer\nmodel = nn.Linear(NUM_FEATURES, 1)\nloss_function = nn.MSELoss()\noptimizer = torch.optim.SGD(model.parameters(), lr=0.01)\n```\n\nLet's train our model now for 100 iterations of gradient descent. \n\n```python\n# Let's train the model now\nloss_log = []\nNUM_EPOCHS = 100\nfor epoch in range(NUM_EPOCHS):\n    # Forward pass\n    y_pred = model(X)\n    loss = loss_function(y_pred, y)\n\n    # Backward pass\n    optimizer.zero_grad()\n    loss.backward()\n    optimizer.step()\n\n    # Log loss\n    loss_log.append(loss.item())\n\nlearned_weight = model.weight.item()\nlearned_bias = model.bias.item()\n\nprint(f'\\nReal parameters: weight={REAL_WEIGHT}, bias={REAL_BIAS}')\nprint(f'Learned parameters: weight={learned_weight:.4f}, bias={learned_bias:.4f}')\nplot_loss(loss_log)\n```\n\nThe output I get is:\n\n```\nReal parameters: weight=2.5, bias=1.0\nLearned parameters: weight=2.2193, bias=1.0283\n```\n\nNot bad, but if I train for 1000 epochs instead of just 100, I'll see a near-perfect set of learned parameters instead:\n\n```\nReal parameters: weight=2.5, bias=1.0\nLearned parameters: weight=2.5059, bias=1.0178\n```\n\nPlotting the loss (cost function) for our model over each epoch gives us the output we'd expect:\n\n![pasted-image-20250818100406](/images/pasted-image-20250818100406.webp)\n\nAs we can see, 100 epochs don't seem to be enough to minimize the loss; we seem to hit the minimum somewhere before the 200th epoch. \n\n### An Explanation Of The Code\nI believe it's also worthwhile to go over what some of these `PyTorch` functions do and how we're using them to do linear regression. There are a few things to note here:\n\n1. [torch.randn](https://docs.pytorch.org/docs/stable/generated/torch.randn.html)\n \n \u003e Returns a tensor filled with random numbers from a normal distribution with mean 0 and variance 1 (also called the standard normal distribution). That is, $\\text{out}_i \\sim \\mathcal{N}(0, 1)$\n \n In short, when we write `torch.randn(a, b)`, we're creating a tensor of size $a \\times b$ where each element of this tensor is sampled from the standard normal distribution with mean $0$ and variance $1$. \n\n2. [torch.nn.Linear](https://docs.pytorch.org/docs/stable/generated/torch.nn.Linear.html#torch.nn.Linear)\n \n \u003e Applies an affine linear transformation to the incoming data: $y=xA^T+b$ \n \n `Linear` accepts the dimensions of the `input` feature vector and the `output` feature vector. In our case, our input feature (say plot size) was just 1-dimensional and our output feature (say house price) was also 1-dimensional. By default, the argument `bias=True` is set. This means the `Linear` class automatically maintains a `bias` term (technically, this makes `Linear` an affine transformation rather than a purely linear one (see: [Basics of Neural Networks](/blog/basics-of-neural-networks)), but for our use case they're equivalent). The bias term is our `+ c` term. This is useful because it lets the model better fit data that is not normalized around the origin. The matrix $A$ is the parameter matrix (a vector for 1D output features).\n \n The model stores the parameters internally and makes them accessible via `model.weight` and `model.bias`. 
These are `torch.nn.Parameter` objects, which are special tensors that PyTorch knows to track for gradient calculations.\n\n3. [nn.MSELoss](https://docs.pytorch.org/docs/stable/generated/torch.nn.MSELoss.html)\n \n \u003eCreates a criterion that measures the mean squared error (squared L2 norm) between each element in the input $x$ and target $y$.\n \n This just defines / implements the same MSE error function we defined earlier. Note that `PyTorch` does not use our $\\frac{1}{2n}$ normalization term; with the default `reduction='mean'`, it uses $\\frac{1}{n}$ instead. You can also pass in `reduction='sum'` to have the cost function skip the division (which can be unstable / slow) and just compute the sum of the squared-error terms instead. However, with `sum`, note that increasing the training data / batch size increases the loss / gradient magnitude, so the learning rate $\\alpha$ would likely need to be tuned per batch size to work well. \n\n4. [torch.optim.SGD](https://docs.pytorch.org/docs/stable/generated/torch.optim.SGD.html)\n \n \u003eImplements stochastic gradient descent (optionally with momentum).\n \n There's not much to explain here. Given that we've defaulted `weight_decay` and all the other fancy modifiers to $0$, it's implementing exactly the same gradient descent algorithm we've described above. \n \n However, there are some important implementation details to note. `model.parameters()` returns an iterator over all learnable parameters in the model. The optimizer stores references to these parameters and will update them in-place during `optimizer.step()`. One more thing: you may have noticed that we have a line `optimizer.zero_grad()` before we perform the `step()`. This is because `PyTorch` accumulates gradients by default. There are reasons for this, which I hope to go over in [Basics of Neural Networks](/blog/basics-of-neural-networks). \n \n5. **The computation graph:** There's one part of the above snippets that might look weird / unrelated to new learners. \n \n ```python\n    # Forward pass\n    y_pred = model(X)\n    loss = loss_function(y_pred, y)\n\n    # Backward pass\n    optimizer.zero_grad()\n    loss.backward()\n    optimizer.step()\n\t```\n\t\n\t`loss` appears to be computed independently and seems to be a simple output tensor. We then call `loss.backward()`... which should do nothing? And then we call `optimizer.step()`? Will it still work if I just remove the `loss.backward()` line? How does `loss` even inject itself into the code flow / path of the `optimizer`? \n\t\n\tTurns out, every tensor in `PyTorch` can track how it was created. `y_pred.grad_fn` holds a reference to the operation that produced it. `PyTorch` essentially maintains a **dynamic computation graph** built on-the-fly as you do operations. We can actually visualize this graph using `torchviz`. 
Let's execute this piece of code:\n\t\n\t```python\n\tfrom torchviz import make_dot\n\timport torch\n\timport torch.nn as nn\n\t\n\tmodel = nn.Linear(1, 1)\n\tX = torch.tensor([2.0])\n\ty = torch.tensor([5.0])\n\t\n\ty_pred = model(X)\n\tloss = nn.MSELoss()(y_pred, y)\n\t\n\t# Create visualization\n\tdot = make_dot(loss, params=dict(model.named_parameters()))\n\tdot.render(\"graph\", format=\"png\") # creates graph.png\n\t```\n\t\n\tWe get the following `graph.png`: \n\t\n\t![pasted-image-20250818174715](/images/pasted-image-20250818174715.webp)\n\t\n\tSo when you call `loss.backward()`, it actually goes backward through this graph and computes the gradient of `loss` with respect to the weight and bias, storing these gradients in the `.grad` attribute of each parameter. These are then picked up by the optimizer when we run `optimizer.step()`. There's more details to this, but maybe in a future blog :) \n\n### Some More Experiments\nAs we saw above, tweaking the number of epochs we trained for gave us significantly better results. You'll notice there are some more arbitrary constants sprinkled into the code. What about the value of the learning rate? How do these values all affect our final set of learned parameters $\\theta$? To answer this, we can run a few experiments. Let's first modularize our training code:\n\n```python\n# Let's train the model now\ndef train_and_log_params(lr=0.01, num_epochs=100):\n    # Reinitialize model / optimizer \u0026 create log stores\n    model = nn.Linear(NUM_FEATURES, 1) \n    optimizer = torch.optim.SGD(model.parameters(), lr=lr)\n    loss_function = nn.MSELoss()\n    loss_log, weights_log, biases_log = [], [], []\n\n    for epoch in range(num_epochs):\n        # Forward pass\n        y_pred = model(X)\n        loss = loss_function(y_pred, y)\n\n        # Log values\n        loss_log.append(loss.item())\n        weights_log.append(model.weight.item())\n        biases_log.append(model.bias.item())\n\n        # Backward pass\n        optimizer.zero_grad()\n        loss.backward()\n        optimizer.step()\n\n    return loss_log, weights_log, biases_log\n```\n\nNow, we can vary a bunch of stuff and see how loss and our parameters vary with them. For example, let's try varying the learning rate $\\alpha$. \n\n```python\n# Let's try varying learning rates\nlearning_rates = [0.001, 0.01, 0.1]\ndata = {}\nfor lr in learning_rates:\n    _, weights_log, biases_log = train_and_log_params(lr)\n    data[lr] = (weights_log, biases_log)\n\nplot_param_evolution(data, REAL_WEIGHT, REAL_BIAS)\n```\n\nHere's what we get. With just 100 epochs, the smaller learning rates ($\\alpha = 0.001$ and $\\alpha = 0.01$) fail to converge to the \"real\" values for the weights and bias; we need the much higher rate of $0.1$ to converge quickly. \n\n![pasted-image-20250818100912](/images/pasted-image-20250818100912.webp)\n\nIf we increase the number of epochs to 1000, however,\n\n![pasted-image-20250818101238](/images/pasted-image-20250818101238.webp)\n\nyou'll notice that $\\alpha = 0.01$ is sufficient, but $\\alpha = 0.001$ still fails to converge! Looking at the evolution further,\n\n![pasted-image-20250818101615](/images/pasted-image-20250818101615.webp)\n\nwe see that the larger learning rate has settled on a slightly worse optimum, but it got there a lot faster than $\\alpha = 0.01$. That's the tradeoff we make here. 
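\n\nAs one last sanity check (an aside, not part of the original experiments), we can cross-check the learned parameters against a direct least-squares solve of the same system. Since MSE is convex, gradient descent (run long enough, with a sane $\\alpha$) and the direct solve should agree closely:\n\n```python\n# Solve min ||Xb @ theta - y||^2 directly and compare with gradient descent.\nXb = torch.cat([torch.ones(NUM_SAMPLES, 1), X], dim=1) # the x_0 = 1 column again\ntheta_direct = torch.linalg.lstsq(Xb, y).solution\nprint(theta_direct.flatten()) # (bias, weight), should be close to (1.0, 2.5)\n```\n\nAnd that's about it for linear regression. 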
Next up, in [Basics of Supervised Learning - Logistic Regression](/blog/basics-of-supervised-learning-logistic-regression), we'll be expanding the ideas we learnt in this blog to train **logistic** classifiers, where we'll see how a simple modification to our linear model - adding a non-linear activation function - transforms our regression problem into a powerful classification tool that will help us bridge the gap between linear models and the neural networks that eventually conquered the limitations of Rosenblatt's perceptron.", - "lastmodified": "2026-01-11T05:05:46.243835292+05:30", + "lastmodified": "2026-01-11T10:59:54.964002854+05:30", "tags": [ "domain-cs-ai-ml-machine-learning", "domain-math-probability-statistics" ] }, @@ -76,7 +76,7 @@ "/blog/booyer-moore-knuth-morris-pratt-for-exact-matching": { "title": "Boyer-Moore \u0026 Knuth-Morris-Pratt for Exact Matching", "content": "# Preface \u0026 References\nI document topics I've discovered and my exploration of these topics while following the course, [Algorithms for DNA Sequencing, by Johns Hopkins University](https://www.coursera.org/learn/dna-sequencing) on [Coursera](https://www.coursera.org/). The course is taught by two instructors, [Ben Langmead](https://scholar.google.com/citations?user=2JMaTKsAAAAJ\u0026hl=en) and [Jacob Pritt](https://www.coursera.org/instructor/jacobpritt).\n\nWe will study the fundamental ideas, techniques, and data structures needed to analyze DNA sequencing data. In order to put these and related concepts into practice, we will combine what we learn with our programming expertise. Real genome sequences and real sequencing data will be used in our study. We will use Boyer-Moore to enhance naïve exact matching. We then learn about indexing and preprocessing (grouping and ordering in indices), k-mers and k-mer indices, and how to solve the approximate matching problem. Finally, we will discuss solving the alignment problem and explore interesting topics such as De Bruijn graphs, Eulerian walks and the shortest common superstring problem. \n\nAlong the way, I document content I've read about while exploring related topics such as suffix string structures and relations to my research work on the STAR aligner.\n# Algorithms for Exact Matching\n\n## The Naive Algorithm\n\nThe naive algorithm is trivial: it simply scans the main text $S$ for the pattern $T$ in a quadratic manner, iterating over each character of the main text as a starting position and comparing the pattern text $T$ against it character by character. This is clearly extremely inefficient and has a worst case running time complexity of $O(nm)$. A simple example where such a run-time occurs is the pair of strings:\n\n$$T = aaaaa$$ $$S = aaaaaaaaaaaaaaa \\dots$$\n\n![pasted-image-20240530083916](/images/pasted-image-20240530083916.webp)\n\n## Boyer-Moore Pattern Matching\n\nThe Boyer-Moore algorithm starts matching from the last character of the pattern. It uses two main heuristics to solve the problem. \n\n- **Bad Character Rule:** If we do some character comparisons and find a mismatch, we will skip all alignment checks until one of two things happens. Either the mismatch becomes a match, or the pattern $T$ moves all the way past the mismatched text character.\n\n ![bad-character-rule](/images/bad-character-rule.webp)\n\n ***Explanation:*** *Our mismatching character is \\\"C\\\". We then search $T$ for the last occurrence of \\\"C\\\". 
Then we will shift $T$ by $3$ such that \\\"C\\\" is aligned between $S$ and $T$*\n\n- **Good Suffix Rule:** Let $t$ represent the longest common suffix matched by our pattern $T$ with the portion of $S$ we are checking for a match with. We can now skip all comparisons until either the characters of $S$ aligned against $t$ match again, or $T$ moves past $t$. This can be done relatively fast with some pre-processing.\n\n ![good-suffix-rule](/images/good-suffix-rule.webp)\n\t\n ***Explanation:*** *We have a sub-string $t$ of $T$ matched against the text $S$ (in green) before a mismatch. We then find another occurrence of $t$ within $T$. After finding this, we skip ahead to align that occurrence of $t$ in $T$ with the matched $t$ in $S$.*\n\nThe algorithm simply tries both heuristics and picks the maximum skip distance returned by either.\n\n### Boyer-Moore Performance\n\n- **Worst-case performance:** $\\Theta(m)$ pre-processing $+ O(mn)$ matching.\n\n- **Best-case performance:** $\\Theta(m)$ pre-processing $+ \\Omega(\\frac{n}{m})$ matching.\n\n- **Worst-case space complexity:** $\\Theta(k)$, where $k$ is the size of the alphabet.\n\nThe traditional Boyer-Moore technique has the drawback of not performing as well on small alphabets like DNA. Because sub-strings recur often, the skip distances stop growing as the pattern length increases. One can however acquire longer skips over the text at the cost of remembering more of what has already been matched. \n\n## Knuth-Morris-Pratt Pattern Matching (KMP)\n\nKMP is a string matching algorithm that reduces the worst case time complexity of the pattern finding problem in a given text to $O(n+m)$. The idea behind KMP is pretty simple. We discuss it in the following sections.\n\n### Prefix Function\n\nGiven a string $s$ such that $|s| = n$, we define the **prefix function** of $s$ as a function $\\pi$ where $\\pi(i)$ is the length of the longest proper prefix of the prefix sub-string $s[0:i]$ that is also a suffix of $s[0:i]$. Here $s[0:i]$ refers to the sub-string starting at (zero-indexed) index $0$ and ending at index $i$, both inclusive. A prefix that is distinct from the string itself is a proper prefix. We define $\\pi(0) = 0$. We usually compute the prefix function as an array $\\pi$ where $\\pi[i]$ stores the value of $\\pi(i)$. \n\nMore formally, we define the prefix function as:\n$$\\pi[i] = \\max_{k = 0 \\rightarrow i} \\{k : s[0 \\rightarrow k-1] = s[i-(k-1) \\rightarrow i] \\}$$\n\nFor example, the prefix function of the string *\"abcabcd\"* is $[0, 0, 0, 1, 2, 3, 0]$, and the prefix function of the string *\"aabaaab\"* is $[0, 1, 0, 1, 2, 2, 3]$.\n\nThe naive way to compute this array is to simply iterate over each position, try every candidate prefix length and then compare sub-strings. This gives us a worst case time complexity of $O(n^3)$ which is clearly pretty poor.\n\n**Optimizations**\n\n- Prefix function values can only increase by a maximum of one between consecutive indices.\n\n *Proof by contradiction:* If $\\pi[i + 1] \\gt \\pi[i] + 1$, we may take the suffix ending in position $i + 1$ with the length $\\pi[i + 1]$ and delete the final character from it. We then get a suffix that ends in position $i$ and has the length $\\pi[i + 1] - 1$, which is longer than $\\pi[i]$, a contradiction.\n\n The prefix function's value can therefore either increase by one, remain unchanged, or drop by a certain amount when going to the next index. The function can only increase by a total of $n$ steps and can only decrease by a total of $n$ steps. This means that we only really need to perform $O(n)$ string comparisons. 
This reduces our time complexity to $O(n^2)$.\n\n- We use dynamic programming to store the information computed in previous steps. Let's say we have computed all values of $\\pi$ till $i$ and now want to compute $\\pi[i+1]$. Now we know that the suffix at position $i$ of length $\\pi[i]$ is the same as the prefix of length $\\pi[i]$. We get two cases:\n\n 1. If $s[i+1] = s[\\pi[i]]$, this implies that $\\pi[i+1] = \\pi[i] + 1$.\n\n 2. If $s[i+1] \\neq s[\\pi[i]]$, we know we have to compare shorter strings. We want to move quickly to the longest length $j \\lt \\pi[i]$, such that the prefix property at position $i$ holds ($s[0 \\dots j-1] = s[i-j+1 \\dots i]$). The next candidate length turns out to be $\\pi[\\pi[i]-1]$, which was already calculated; we keep falling back this way until we find a match or the length reaches $0$.\n\nThe final algorithm looks something like this:\n\n![pasted-image-20240530091949](/images/pasted-image-20240530091949.webp)\n\n### Efficient Pattern Matching\n\nTo find all occurrences of the pattern $t$ in the main text $s$ quickly, we simply apply the prefix function we discussed above. Given the pattern $t$ and main text $s$, we generate the new string $t + \\# + s$ and compute the prefix function for this string. \n\nBy definition $\\pi[i]$ is the largest length of a sub-string that coincides with the prefix and ends in position $i$. Here this is just the largest block that coincides with a prefix of $t$ and ends at position $i$; this is a direct implication of our separator character $\\#$. Now, if for some index $i$, $\\pi[i] = n$ is true (where $n = |t|$), it implies that $t$ appears completely at this position, i.e. it ends at position $i$. \n\nIf at some position $i$ we have $\\pi[i] = n$, then the string $t$ appears in the string $s$ starting at position $i - (n+1) - n + 1 = i - 2n$. Therefore we just need to compute the prefix function for our generated string using the above mentioned algorithm, which solves the string matching problem in linear time.\n\nTime complexity: $O(|s|+|t|)$\n\n## Comparison of Both\n\nSource: [*What are the main differences between the Knuth-Morris-Pratt and Boyer-Moore search algorithms?* - **StackOverflow**](https://stackoverflow.com/questions/12656160/what-are-the-main-differences-between-the-knuth-morris-pratt-and-boyer-moore-sea)\n\n- **Boyer-Moore's** approach is to try to match the last character of the pattern instead of the first one, with the assumption that if there's no match at the end there's no need to try to match at the beginning. This allows for \\\"big jumps\\\", therefore BM works better when the pattern and the text you are searching resemble \\\"natural text\\\" (i.e. English). \n\t[*Boyer-Moore example run through*](https://www.cs.utexas.edu/~moore/best-ideas/string-searching/fstrpos-example.html)\n\n- **Knuth-Morris-Pratt** searches for occurrences of a \\\"word\\\" W within a main \\\"text string\\\" S by employing the observation that when a mismatch occurs, the word itself embodies sufficient information to determine where the next match could begin, thus bypassing re-examination of previously matched characters.\n [*KMP example run through*](https://www.cs.utexas.edu/~moore/best-ideas/string-searching/kpm-example.html)\n\nThis means KMP is better suited for small alphabets like DNA (ACTG).\n\n## Offline Algorithms: Indexing and k-mers\n\nA few more ideas we can use when working with pattern matching problems are grouping and indexing. In essence, we can enumerate all sub-strings of some constant length (say 5) from the main text and store them in a vector, for example. We call these sub-strings of some constant length $k$ **k-mers**. 
The stored index of the k-mer contains the indices at which it is found in the main string. \n\nNow, as for how we quickly search for a k-mer in this pre-processed data, we have a few options.\n\n- **Binary search.** We store the pre-processed data in sorted order. This can be done online during the pre-process stage, or the data can be sorted after collection. Once done, finding a query k-mer simply involves binary searching on this sorted array. Note that strings can always be sorted in lexicographic order; this gives us a monotonic sequence to search over, which means that our binary search is guaranteed to succeed in $O(\\log n)$ comparisons. Each comparison can in the worst case be $O(k)$. Hence we have a total time complexity of $O(k \\log n)$. Here $n$ is the number of k-mers we have pre-processed. \n\n- **Hashing.** Here, we hash each k-mer using some well known hashing function such as djb2, murmur, Fibonacci hashing, etc. to quickly query their existence in a hash-table. Various types of hash-tables can be used: Cuckoo, Robin-Hood, etc. This is pretty much just using the already existing plethora of literature on the subject of hashing to help speed up the algorithm. Worst case time complexity is $O(k \\cdot c)$ where $c$ is the constant factor involved with hashing. Note that this constant can have varying degrees of performance based on factors such as the number of k-mers stored, etc. \n\n- **Tries.** Another data structure that I believe would be useful here is a Trie. Tries (also known as prefix trees) are a type of tree data structure used for efficient data insertion, searching, and deletion; they store data in a way that allows for fast search operations based on prefixes of keys. A k-mer trie is a specific type of trie used to store and search for k-mers, the sub-strings of a fixed length $k$ in a given string. It allows for efficient searching of k-mers within a string, and can be used in applications such as sequence alignment and gene prediction. Time complexity is $O(k)$ for each query operation.\n\n## Further k-mer optimizations\n\nNotice that one of the main bottlenecks here is the number of k-mers we have to store in our pre-processed data structure. Since the k-mers have a lot of overlap, one idea might be to halve the number of k-mers we store by only keeping those which start at odd indices. A consequence of this however, is that now our match success rate is only $50\\%$. However, we can get back to a $100\\%$ success rate by realizing that all we have to do is query offsets around the query index $q_i$ that cover both residues $\\bmod \\ 2$. \n\n![kmer-index-variants](/images/kmer-index-variants.webp)\n\nFor example, if we store only $\\frac{1}{3}^{rd}$ the number of k-mers, one way to do it would be to store every k-mer which starts at a position $\\equiv 0 \\pmod 3$. Now we just query offsets around $q_i$ which cover all residues of $Z_3$ when taken $\\bmod \\ 3$. We then check the necessary prefixes and suffixes around each candidate hit to verify an occurrence of the actual query string; a minimal sketch of this subsampled-index idea follows shortly. \n\nHere we pay an extra penalty during the query phase in exchange for a smaller index. Further, pre-processing a larger number of k-mers usually requires more memory, and this consequently leads to a major penalty during the query phase due to terrible caching of the pre-processed data. 
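\n\nHere's the promised sketch: a toy subsampled k-mer index that stores only k-mers starting at positions $\\equiv 0 \\pmod 2$ and probes both residues at query time (a plain dictionary stands in for the sorted-array / hash-table / trie variants above, and all names are mine):\n\n```python\nfrom collections import defaultdict\n\ndef build_index(text, k, stride=2):\n    # Only index k-mers starting at positions 0 mod stride (half the k-mers for stride=2).\n    index = defaultdict(list)\n    for i in range(0, len(text) - k + 1, stride):\n        index[text[i:i + k]].append(i)\n    return index\n\ndef query(text, pattern, k, index, stride=2):\n    # Probe every offset 0..stride-1 so some k-mer of the pattern hits the subsampled\n    # index, then verify the full pattern around each candidate position.\n    hits = set()\n    for off in range(stride):\n        for i in index.get(pattern[off:off + k], []):\n            start = i - off\n            if start \u003e= 0 and text[start:start + len(pattern)] == pattern:\n                hits.add(start)\n    return sorted(hits)\n\ntext = 'ACGTACGTAC'\nidx = build_index(text, k=3)\nprint(query(text, 'GTAC', 3, idx)) # [2, 6]\n```\n\n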
So it is often worth the extra penalty paid during query operations to reduce the size of the pre-processed data, simply to benefit from cache optimizations.\n", - "lastmodified": "2026-01-11T05:05:46.575560543+05:30", + "lastmodified": "2026-01-11T10:59:01.109896042+05:30", "tags": [ "domain-cs-algorithms-analysis", "domain-science-bioinformatics" ] }, @@ -85,7 +85,7 @@ "/blog/brent-s-theorem-task-level-parallelism": { "title": "Brent's Theorem \u0026 Task Level Parallelism", "content": "Suggestion for digging deeper into HPC ideas: [https://ocw.mit.edu/courses/6-172-performance-engineering-of-software-systems-fall-2018/pages/syllabus/](https://ocw.mit.edu/courses/6-172-performance-engineering-of-software-systems-fall-2018/pages/syllabus/)\n# Modeling task-level parallelism as a DAG (Directed Acyclic Graph)\n\nOne major issue we had with the OMP parallel execution model was the use of an implicit barrier at the end of every parallel block. Essentially, there is a period of time where the CPU sits waiting for all the parallelly executing threads to `join`. This gives us stretches of time where cores sit idle, wasting precious compute time. One idea to solve this problem is to allow threads to keep executing without waiting for a `join`. Note that a thread $A$ only needs to wait on a thread $B$ when $B$ is still computing some data that $A$ needs to continue execution.\n\nHence $A$ can keep executing until it hits such a dependence barrier. Following this idea, we can always construct a dependency DAG of all the parallel workloads in a problem.\n\n![tlp-1](/images/tlp-1.webp)\n\nIt allows us to come up with the following theoretical formulation of parallelization and speedup.\n\nLet’s say each node in the graph represented one unit of work. In the above graph, I have $18$ units of work. If I executed the entire program serially, and assuming 1 unit of work takes 1 unit of time, the serial execution time is $T_1 = W = 18$. To compute speedup, I want to be able to compute $\\frac{T_1}{T_p}$, where $T_p$ is the execution time on $p$ processors.\n\nHere, we establish 2 lower bounds on the value of $T_p$. First, assuming uniform compute power across all processors, $T_p \\geq \\frac{T_1}{p}$. If each core took up an equal portion of the work, it would need at least this much time to compute.\n\nNext, we introduce the idea of the **critical path** in the graph. The critical path is the length of the longest dependency chain in the graph. Note that no matter what, dependencies must be processed in a topological ordering and we cannot skip serially executing this portion of our code. This allows us to establish another lower bound $T_p \\geq T_\\infty$.\n\nHere $T_\\infty$ is the **span** or **critical path length** of the graph. Essentially, this is the minimum time required if we had access to infinite parallel processing, which eliminates the cost of processing everything except the critical path.\n\nWe call these the **Work** law and the **Span** law.\n\n$$ T_p \\geq \\frac{T_1}{p} \\\\ T_p \\geq T_\\infty \\\\ T_p \\geq \\max(\\frac{T_1}{p}, T_\\infty) $$\n\nNow, we compute speedup as just $S = \\frac{T_1}{T_p}$. The maximum speedup is the case where we have infinite parallel computing power and here we are only limited by the span law. Hence the maximum possible speedup is\n\n$$ S_{max} = \\frac{T_1}{T_\\infty} $$\n\nIntroducing some notation, we will now refer to the work required to be done as $W(n)$, the span of the dependency DAG of this work as $D(n)$, and the time for completion as $T(n)$. 
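\n\nAs a quick illustration of these definitions, here's a minimal sketch that computes the work $W$ and span $D$ for an explicit task DAG (a made-up four-node diamond, not the 18-node graph from the figure):\n\n```python\nfrom functools import lru_cache\n\n# A made-up diamond DAG: each node is 1 unit of work, edges point to dependents.\nedges = {'a': ['b', 'c'], 'b': ['d'], 'c': ['d'], 'd': []}\n\nwork = len(edges) # W = total units of work\n\n@lru_cache(maxsize=None)\ndef span_from(node):\n    # Longest dependency chain (counted in nodes) starting at `node`.\n    return 1 + max((span_from(nxt) for nxt in edges[node]), default=0)\n\nspan = max(span_from(n) for n in edges) # D = critical path length\np = 2\nprint(work, span)          # 4 3\nprint(max(work / p, span)) # lower bound on T_p from the work \u0026 span laws\nprint(work / p + span)     # W/p + D, the upper bound we derive below\n```\n\n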
We now introduce the concept of average parallelism. Intuitively we can think of this as the amount of parallel work we can get done per critical path vertex.\n\n$$ \\text{Avg. Parallelism} = \\frac{W(n)}{D(n)} $$\n\nIt is possible to prove that there always exists an alignment of nodes in a DAG such that we can essentially partition the DAG into levels separated by the nodes on the critical path.\n\n![tlp-2](/images/tlp-2.webp)\n\nThis is what we’d like to define as “average parallelism.”\n\nIf we define $W_i$ as the work done in ‘each phase’ then it follows that $W(n) = \\sum_i^{D(n)} W_i$ and $T(n) = \\sum_i^{D(n)} \\lceil \\frac{W_i}{p} \\rceil$.\n\n# Brent’s Theorem\n\nThese two formulations essentially give us Brent’s theorem\n\n$$ W(n) = \\sum_i^{D(n)} W_i \\\\ T(n) = \\sum_i^{D(n)} \\lceil \\frac{W_i}{p} \\rceil \\\\ \\implies T(n) \\leq \\sum_i^{D(n)} (\\frac{W_i - 1}{p} + 1) = \\frac{W(n) - D(n)}{p} + D(n) \\\\ \\implies T(n) \\leq \\frac{W(n)}{p} + \\frac{D(n)(p-1)}{p} \\\\ \\implies T(n) \\lesssim \\frac{W(n)}{p} + D(n) $$\n\nFor the last step, we approximate the fraction $\\frac{p-1}{p}$ to be $1$. This is the **upper bound** given to us on $T(n)$ by Brent’s law. Combining both laws, we get\n\n$$ \\max(\\frac{T_1}{p}, T_\\infty) = \\max(\\frac{W(n)}{p}, D(n)) \\leq T(n) \\leq \\frac{W(n)}{p} + D(n) $$\n\nThis tells us that $T(n)$ must be within a factor of two of the lower bound $\\max(\\frac{W(n)}{p}, D(n))$.\n\nWe can also write Brent’s theorem as\n\n$$ T(n) \\leq \\frac{T_1(n)}{p} + T_\\infty $$\n\nIf our sped-up time $T(n)$ is greater than this upper bound, it implies that we are not doing a great job at parallelizing our workload. Whereas, if it is within the given bounds, our schedule is provably within a constant factor of optimal.\n\n## Speedup and Work Optimality\n\nWhen measuring speedup, it becomes very important that we take into consideration the **best** time of execution by a serial algorithm and compare it against parallel performance.\n\nEssentially, we will write speedup as $S = \\frac{T_*(n)}{T_p(n)} = \\frac{W_*(n)}{T_p(n)}$ where the subscript `*` denotes the best serial algorithm for the same task.\n\nThis lets us get the following derivation,\n\n$$\n\\begin{aligned}\nS_p(n) = \\frac{W_*(n)}{T_p(n)} \\\\ \\implies S(n) \\geq \\frac{W_*(n)}{\\frac{W_p(n)}{p} + D_p(n)} = \\frac{W_* \\times p}{W_p + D_p \\cdot p} \\\\ \\implies S(n) \\geq \\frac{p}{\\frac{W_p}{W_*} + \\frac{D_p \\cdot p}{W_*}}\n\\end{aligned}\n$$\n\nFor ideal speedup, we want the denominator to be as close to $1$ as possible. This means $W_p \\approx W_*$ is very good for us, and we want the second term $\\frac{D_p \\cdot p}{W_*}$ to stay small, i.e. the work per processor $\\frac{W_*}{p}$ must steadily grow relative to $D_p$.\n\nThe first implication is fairly intuitive: we shouldn’t have to do extra work as we parallelize more. That is, the parallel work $W_p$ shouldn’t grow beyond $W_*$, otherwise we’ll have to keep doing more work as we increase parallelization.\n\nThe second implication intuitively means that the ‘work per processor’ should grow proportional to span. 
That is, we shouldn’t be in a situation where $D_p$ (the span) increases but work per processor doesn’t, otherwise a lot of parallel compute power is wasted waiting for the span to finish execution.", - "lastmodified": "2026-01-11T05:05:46.866563436+05:30", + "lastmodified": "2026-01-11T10:59:01.402238086+05:30", "tags": [ "domain-cs-systems-high-performance-computing" ] }, @@ -93,7 +93,7 @@ "/blog/building-a-type-safe-tool-framework-for-llms-in-scala": { "title": "Building a Type-Safe Tool Framework for LLMs in Scala", "content": "# Tool Calling\nI came across a fun problem at work today where I wanted to define a clean, idiomatic way to define functions in Scala and auto-generate the function schema for these functions to pass to LLM APIs. For some context, LLMs are incredibly powerful at reasoning through and generating text, but to really have them interact with the environment, they use external tool calling. OpenAI calls this [Function Calling](https://platform.openai.com/docs/guides/function-calling?api-mode=chat) and Anthropic calls it [Tool Use](https://docs.anthropic.com/en/docs/agents-and-tools/tool-use/overview#next-steps). This image from OpenAI best summarizes the idea:\n\n![pasted-image-20250616173651](/images/pasted-image-20250616173651.webp)\n\nYou as a developer pass in function schemas to let the LLM know that it has access to call some `xyz` function. The schema may look like this:\n\n```json\n{\n \"type\": \"function\",\n \"function\": {\n \"name\": \"get_weather\",\n \"description\": \"Retrieves current weather for the given location.\",\n \"parameters\": {\n \"type\": \"object\",\n \"properties\": {\n \"location\": {\n \"type\": \"string\",\n \"description\": \"City and country e.g. Bogotá, Colombia\"\n },\n \"units\": {\n \"type\": \"string\",\n \"enum\": [\n \"celsius\",\n \"fahrenheit\"\n ],\n \"description\": \"Units the temperature will be returned in.\"\n }\n },\n \"required\": [\n \"location\"\n ],\n \"additionalProperties\": false\n },\n \"strict\": true\n }\n}\n```\n\nHere, you're telling the LLM that it has access to a `get_weather` function which retrieves the current weather for a given location. Further, you let it know that this function takes in two arguments:\n\n1. `location: string` =\u003e The argument `location`, of type `string`. This is the city and country the LLM is trying to get the current weather for.\n2. `units: \"celsius\" | \"fahrenheit\"` =\u003e The argument `units`, which is an enum that can either be `\"celsius\"` or `\"fahrenheit\"`.\n\nYou also define that `location` is a required argument, `units` is not. It's more or less equivalent to precise documentation of an API that a human would read to understand how to call it. In the past, you would prompt the LLM with a non-standard structured schema and ask it to output JSON making the tool call. You would then parse out this JSON block manually and make the tool call yourself. We've not evolved much from there, but modern LLM APIs have \"solved\" this elegantly by handling this prompt-engineering + fine-tuning + parsing logic on their end. It's a little more than glorified prompt engineering: there are [techniques](https://www.databricks.com/blog/introducing-structured-outputs-batch-and-agent-workflows) which can guarantee that a model's output will be a JSON object that adheres to a specified JSON schema as long as enough tokens to complete the JSON object are generated. 
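\n\nTo make the idea concrete, here's a toy sketch of constrained decoding over a hypothetical next-token distribution (the vocabulary, scores, and `allowed` set are all made up; real implementations constrain sampling against a full JSON-schema state machine):\n\n```python\nimport math\n\n# Toy next-token scores from a hypothetical model.\nlogits = {'{': 1.2, 'Hello': 3.5, '\"': 0.7, 'Sure,': 2.9}\n\n# Suppose the schema says the output must start a JSON object, i.e. with '{'.\nallowed = {'{'}\n\n# Mask disallowed tokens to -inf, renormalize, and pick the best remaining token.\nmasked = {t: (s if t in allowed else -math.inf) for t, s in logits.items()}\ntotal = sum(math.exp(s) for s in masked.values() if s != -math.inf)\nprobs = {t: (math.exp(s) / total if s != -math.inf else 0.0) for t, s in masked.items()}\nprint(max(probs, key=probs.get)) # '{': the only token the grammar permits here\n```\n\n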
This kind of constrained decoding, along with some fine-tuning hidden on the API provider side, gives you a clean API that you can use to allow the model to perform tool calls. \n\n\u003e Under the hood, constrained decoding powers structured outputs. Constrained decoding is a technique in which we limit the set of tokens that can be returned by a model at each step of token generation based on an expected structural format. For example, let’s consider the beginning of a  JSON object which always begins with a left curly bracket. Since only one initial character is possible, we constrain generation to only consider tokens that start with a left curly bracket when applying token sampling. Although this is a simple example, this example can be applied to other structural components of a JSON object such as required keys that the model knows to expect or the type of a specific key-value pair. At each position in the output, a set of tokens adherent to the schema are identified, and sampled accordingly. More technically, raw logits output by the LLM that do not correspond to the schema are masked at each time stamp before they are sampled.\n\u003e - [Introducing Structured Outputs for Batch and Agent Workflows](https://www.databricks.com/blog/introducing-structured-outputs-batch-and-agent-workflows)\n\nOkay, the LLM understands the schema and we can force it to output JSON objects constrained to a specific JSON schema. What's next? In the previous example, the LLM would see this schema and know it can call `get_weather` with a location and unit. When it decides to use the tool, it generates something like:\n\n```json\n[\n {\n \"id\": \"call_12345xyz\",\n \"type\": \"function\",\n \"function\": {\n \"name\": \"get_weather\",\n \"arguments\": \"{\\\"location\\\":\\\"Paris, France\\\"}\"\n }\n },\n ...\n]\n```\n\nThe API caller would then go through this list of tool calls the LLM has requested and execute them one-by-one (or in parallel). The LLM only speaks text, so the outputs of all these function / tool calls will be read by the LLM as strings. The outputs can themselves be structured or semi-structured; they'll be processed as-is by the LLM. The API caller would then add a message to the prompt similar to:\n\n```json\n{\n \"role\": \"user\",\n \"content\": [\n {\n \"type\": \"tool_result\",\n \"tool_use_id\": \"call_12345xyz\",\n \"content\": \"65 degrees\"\n }\n ]\n}\n```\n\nThe LLM can then use this information and reply back with something like \"The current weather in Paris is 65 degrees Fahrenheit. It's a cool day in the city of love!\"\n\nThis API is pretty neat and hides a lot of the complexity involved in constraining the LLM's outputs to a fixed schema. The next challenge, then, is to determine how best to expose this capability in a clean, idiomatic, and developer-friendly way within a given programming language (in my case, Scala), ideally by treating tool schemas and invocations as first-class constructs expressible through the language’s type system, implicits, and meta-programming capabilities. We want the usage to ensure type-safety but also allow semantically rich integration with the LLM API.\n# The Challenge: Making This Developer-Friendly\nManually writing these JSON schemas is obviously a bad idea. It's tedious, there's no type-safety, and we're likely to have errors. There are two sources of truth which have to constantly be kept in sync. This won't do. We need a clean way to be able to generate the function schema from a function definition. 
Write the tool once, modify it however many times you want, and the framework should automatically be capable of generating everything else. \n\n## Python\nThis is actually achieved pretty easily in Python, thanks to `Pydantic` and the `typing` module:\n\n```python\nfrom typing import Annotated, Literal, Optional, get_type_hints\nimport json\n\ndef param(description: str, required: bool = True):\n    return {\"description\": description, \"required\": required}\n\ndef get_weather(\n    location: Annotated[str, param(\"City and country e.g. Bogotá, Colombia\")],\n    units: Annotated[Optional[Literal[\"celsius\", \"fahrenheit\"]], \n        param(\"Units the temperature will be returned in.\", required=False)] = \"celsius\"\n) -\u003e str:\n    \"\"\"Retrieves current weather for the given location.\"\"\"\n    return f\"65 degrees {units} in {location}\"\n\ndef generate_function_schema(func):\n    hints = get_type_hints(func, include_extras=True)\n    properties = {}\n    required = []\n\n    for name, hint in hints.items():\n        if hasattr(hint, '__metadata__'):\n            meta = hint.__metadata__[0]\n            properties[name] = {\n                \"type\": \"string\",\n                \"description\": meta[\"description\"]\n            }\n            if meta.get(\"required\", True):\n                required.append(name)\n\n    return {\n        \"type\": \"function\",\n        \"function\": {\n            \"name\": func.__name__,\n            \"description\": func.__doc__,\n            \"parameters\": {\n                \"type\": \"object\",\n                \"properties\": properties,\n                \"required\": required\n            }\n        }\n    }\n\nschema = generate_function_schema(get_weather)\nprint(json.dumps(schema, indent=2))\n```\n\nThe output looks something like:\n\n```json\n{\n \"type\": \"function\",\n \"function\": {\n \"name\": \"get_weather\",\n \"description\": \"Retrieves current weather for the given location.\",\n \"parameters\": {\n \"type\": \"object\",\n \"properties\": {\n \"location\": {\n \"type\": \"string\",\n \"description\": \"City and country e.g. Bogot\\u00e1, Colombia\"\n },\n \"units\": {\n \"type\": \"string\",\n \"description\": \"Units the temperature will be returned in.\"\n }\n },\n \"required\": [\n \"location\"\n ]\n }\n }\n}\n```\n\nPython has support for powerful meta-programming features which make this possible. The `typing` module lets you inspect the type signatures of live objects at runtime, and the meta-programming friendly double-underscore attributes make fetching the function name, its docstring and annotations super easy. Adding to the above, if we use `Pydantic` and define the attributes in a class, we can call its `model_json_schema()` method to handle a lot of the tedious schema-construction logic. One of the many reasons why Python is super popular in the LLM / Agent framework community. \n## CPP\nI'd love to try this in C++ next; I'm sure you could craft some beautifully convoluted template and pre-processor magic to make it happen, but I'm skipping it in the interest of my sanity and time. JSON in C++ _shudders_. (Mental note to revisit this later, because who doesn’t enjoy a little masochism?)\n\n## Scala 2\nLet's start by defining some kind of target end-state we want to expose to developers using our framework. This would be perfect:\n\n```scala\n@Tool(\n    name = \"get_weather\",\n    description = \"Retrieves current weather for the given location.\"\n)\ndef getWeather(\n    @Param(\n        description = \"City and country e.g. 
Bogotá, Colombia\",\n required = true\n )\n location: String,\n @Param(\n description = \"Units the temperature will be returned in.\",\n enum = Array(\"celsius\", \"fahrenheit\"),\n required = false\n )\n units: String = \"celsius\"\n): String = ???\n```\n\nSpoiler, I'll fail to get this to work. We'll actually end up with something more like:\n\n```scala\nobject TemperatureUnit extends Enumeration {\n type TemperatureUnit = Value\n val CELSIUS, FAHRENHEIT = Value\n}\n\ncase class WeatherArgs(\n @Parameter(description = \"The city and state, e.g., Bangalore, IN\")\n location: String,\n @Parameter(description = \"The unit for the temperature\")\n unit: Option[TemperatureUnit.Value] = None\n)\n\n@Tool(name = \"get_current_weather\", description = \"Get the current weather in a given location\")\nclass WeatherTool extends ToolExecutor[WeatherArgs] {\n def execute(args: WeatherArgs): String = {\n s\"The weather in ${args.location} is 72 degrees ${args.unit.getOrElse(TemperatureUnit.FAHRENHEIT)}.\"\n }\n}\n```\n\nSlightly less \"nice\", but it's actually almost exactly the same interface `Pydantic` uses to automate function schema generation!\n\n---\nBut anyways, let's try to implement the initial interface we described. Annotations seem easy enough to tack-on later, let's tackle the main problem, how do we fetch the type signatures of the function arguments and generate the schema for it? You can either do this at compile time, or during runtime. Scala 2 makes this choice simple for me, it does not have first class support for compile time meta-programming. ChatGPT tells me Scala 3 has support, but I'm locked to Scala 2, so oh well. How do I get the type information of a variable at runtime then?\n### Problem 1: Type Erasure\nHere we'll hit our first (and probably biggest) obstacle. At runtime, the JVM does not hold onto any type information for generics. \n```scala\nclass Container[T](value: T)\nval stringContainer = new Container[String](\"hello\")\n```\n\nAt runtime, the JVM doesn't know that `stringContainer` holds a `String`. The type parameter `T` is erased during compilation (WHY?!). This leads to you observing fun behavior, like this:\n\n```scala\nval strings: List[String] = List(\"a\", \"b\", \"c\")\nval ints: List[Int] = List(1, 2, 3)\n\nprintln(strings.getClass) // class scala.collection.immutable.$colon$colon\nprintln(ints.getClass) // class scala.collection.immutable.$colon$colon\nprintln(strings.getClass == ints.getClass) // true\n```\n\n~~This is why statically typed languages are far superior...~~ However, even in comparison to dynamically typed languages like Python, type information is usually maintained as a piece of runtime data instead of just being erased. \n\n**Note:** The JVM only erases **generic type parameters**, not the actual types of objects.\n#### WHY?!\nSo the natural question: why would a dynamically typed language deliberately erase useful type information at runtime? As expected, there's a brilliant reason for it...\n\nJava trying to monkey patch a new feature (Generics) onto an existing mess of a language. Co-incidentally, *great* [video](https://www.youtube.com/watch?v=uT50sl5DbJk), should watch.\n\nThe short version is that, the designers needed old code (pre-generics) to run on the new JVMs without recompilation and for the new-generic code to be callable from the old non-generic code. This way, the bytecode format can be unified and will work everywhere. 
Here's an example:\n\n```java\n// Pre-Generics Java\nList oldList = new ArrayList();\noldList.add(\"string\");\noldList.add(42); // This works\n\n// Post-Generics Java\nList\u003cString\u003e newList = new ArrayList\u003cString\u003e();\n```\n\nThese two snippets should both compile to identical bytecode for the format to be unified and not require recompilation. The only way to do this is to erase the generic type information. \n#### How does this even work?\nA follow-up question you may have: how does the JVM still work if types are erased for generics? How does it do method dispatch, for example? Let's go back to the `stringContainer` example:\n\n```scala\nclass Container[T](value: T)\nval stringContainer = new Container[String](\"hello\")\n\n// --- After type erasure ---\n\nclass Container(value: Object) { // T becomes Object\n def getValue: Object = value\n def func(): Unit = ???\n}\n\n// The compiler can insert this cast implicitly, since the static type of the fetched value is known at compile time \n// thanks to generics allowing for stronger static typing guarantees... ironically.\nval stringContainer = new Container(\"hello\")\nval str: String = stringContainer.getValue.asInstanceOf[String] \n```\n\nSo in essence, because generics provide a lot of static type-safety guarantees at compile time, the compiler can (at compile time) insert these casts to correctly cast all generic field accesses. But this also means that at runtime, we as the developer can't easily ask \"what type was this generic parameter?\"\n#### Solution\nSo, we can't \"easily\" fetch the type of a generic at runtime, but how do we do it? Short answer: `Reflection`. Long answer... \n##### Reflection\n\n\u003e _Reflection_ is the ability of a program to inspect, and possibly even modify itself. It has a long history across object-oriented, functional, and logic programming paradigms. While some languages are built around reflection as a guiding principle, many languages progressively evolve their reflection abilities over time.\n\u003e \n\u003e Reflection involves the ability to **reify** (i.e. make explicit) otherwise-implicit elements of a program. These elements can be either static program elements like classes, methods, or expressions, or dynamic elements like the current continuation or execution events such as method invocations and field accesses.\n\u003e - [Heather Miller, Eugene Burmako, Philipp Haller - Scala Docs](https://docs.scala-lang.org/overviews/reflection/overview.html)\n\nReflection can be divided into two broad groups depending on what phase of the code development loop it runs in. \n\n- **Compile-time Reflection:** can be considered a superset of templates in C++. All the introspection, inspection and instantiation of code is done at compile time. These patterns are very powerful for developing generic containers and algorithms to perform transformations \u0026 reductions on these containers. \n- **Runtime Reflection:** gives a program access to inspect its own type metadata and even modify it during program execution. This is powerful, but also dangerous. It lets us bypass all compiler-placed safety nets: modifying `const` variables, accessing `private` members of classes, and generally creating more chances to throw errors during execution. 
This is sometimes useful to allow easier mocking of `private` methods of classes in unit-tests, easier dependency injection, and serialization / de-serialization problems where you may want to map Java/Scala types to a standard schema like JSON, Proto, database rows, etc. \n\nIn short, a language uses \"Reflection\" to provide you hooks to inspect, instantiate, modify or invoke the members of a running program. \n##### `TypeTag`\nScala's types being erased at runtime essentially means that information which is available at compile time is erased / lost at runtime. The classic Java solution to this problem seems to be tacking on another monkey patch to allow developers to \"persist\" this information, using `TypeTags`. `TypeTags` are generated by the compiler. Here's some example usage.\n\n```scala\nimport scala.reflect.runtime.universe._\n\n// The [T] is defining a generic method. It's a type parameter like in C++ templates and Scala will substitute it in. \n// We're currying the implicit argument here. It's the same as \n// def inspectTypeWithTag = (value: T) =\u003e (tag: TypeTag[T]) =\u003e Unit.\n// Scala just fills in the implicit argument automatically. \n// \n// So when this function is called with say `inspectTypeWithTag(List(\"a\"))`, Scala does 2 things:\n// 1. Infer T = List[String].\n// 2. Search for an implicit TypeTag[List[String]] in scope and pass it in.\ndef inspectTypeWithTag[T](value: T)(implicit tag: TypeTag[T]): Unit = {\n println(\"=== With TypeTag ===\")\n println(s\"Runtime class: ${value.getClass}\")\n println(s\"Static type: ${tag.tpe}\")\n println(s\"Type constructor: ${tag.tpe.typeConstructor}\")\n println(s\"Type arguments: ${tag.tpe.typeArgs}\")\n}\n\ndef inspectTypeNoTag[T](value: T): Unit = {\n println(\"=== Without TypeTag ===\")\n println(s\"Runtime class: ${value.getClass}\")\n}\n\nval list = List(\"a\", \"b\")\ninspectTypeNoTag(list)\ninspectTypeWithTag(list)\n/* === Without TypeTag ===\nRuntime class: class scala.collection.immutable.$colon$colon\n=== With TypeTag ===\nRuntime class: class scala.collection.immutable.$colon$colon\nStatic type: List[String]\nType constructor: List\nType arguments: List(String) */\n```\n\nThe `TypeTag` is like a persisted metadata blob containing all the type signature info of the object at compile time, before erasure ate it.\n##### Reflection (cont...)\n**What's a \"Universe\"?**\n\n\u003e The reflection environment differs based on whether the reflective task is to be done at run time or at compile time. The distinction between an environment to be used at run time or compile time is encapsulated in a so-called _universe_. Another important aspect of the reflective environment is the set of entities that we have reflective access to. This set of entities is determined by a so-called _mirror_.\n\u003e \n\u003e For example, the entities accessible through runtime reflection are made available by a `ClassloaderMirror`. This mirror provides only access to entities (packages, types, and members) loaded by a specific classloader.\n\u003e \n\u003e Mirrors not only determine the set of entities that can be accessed reflectively. They also provide reflective operations to be performed on those entities. For example, in runtime reflection an _invoker mirror_ can be used to invoke a method or constructor of a class.\n\u003e \n\u003e There are two principal types of universes– since there exists both runtime and compile-time reflection capabilities, one must use the universe that corresponds to whatever the task is at hand. 
Either:\n\u003e \n\u003e - `scala.reflect.runtime.universe` for **runtime reflection**, or\n\u003e - `scala.reflect.macros.Universe` for **compile-time reflection**.\n\u003e \n\u003e A universe provides an interface to all the principal concepts used in reflection, such as `Types`, `Trees`, and `Annotations`.\n\u003e \n\u003e - [Environments, Universes, Mirrors - ScalaDoc](https://docs.scala-lang.org/overviews/reflection/environment-universes-mirrors.html)\n\n**What's a \"Mirror\"?**\n\n\u003e All information provided by reflection is made accessible through mirrors. Depending on the type of information to be obtained, or the reflective action to be taken, different flavors of mirrors must be used. Classloader mirrors can be used to obtain representations of types and members. From a classloader mirror, it’s possible to obtain more specialized invoker mirrors (the most commonly-used mirrors), which implement reflective invocations, such as method or constructor calls and field accesses.\n\u003e \n\u003e - [Environments, Universes, Mirrors - ScalaDoc](https://docs.scala-lang.org/overviews/reflection/environment-universes-mirrors.html)\n\nIn short, the JVM exposes APIs to get access to the class information, accessible fields / methods, etc. Mirrors are an abstraction on top of these JVM APIs in Scala that knows how to call these underlying JVM APIs to perform reflection operations on Scala classes. Universes just group these mirrors into runtime / compile time reflection operations. Let's take this simple `Person` class for illustration:\n\n```scala\ncase class Person(name: String, age: Int) \nval alice = Person(\"Alice\", 30)\n```\n\nNow let's access the `name` field using both approaches:\n\n**1. Raw Java Reflection:**\n\n```scala\nval personClass = alice.getClass\nval nameField = personClass.getDeclaredField(\"name\")\nnameField.setAccessible(true) \nval value = nameField.get(alice)\nprintln(s\"Java reflection result: $value\") // \"Alice\"\n```\n\n**2. Scala Reflection:**\n\n```scala\nimport scala.reflect.runtime.universe._\nimport scala.reflect.runtime.currentMirror\n\n// val personType: reflect.runtime.universe.Type = Person\nval personType = typeOf[Person] \n\n// val nameSymbol: reflect.runtime.universe.TermSymbol = value name\n// It's basically an internal representation of the identifier \"name\". In Scala reflection, names of members (like fields or methods) can be either:\n// - TermName for values, variables, methods, and objects\n// - TypeName for type members, type aliases, and classes/traits\nval nameSymbol = personType.decl(TermName(\"name\")).asTerm\n\n// Create mirrors\nval instanceMirror = currentMirror.reflect(alice)\nval fieldMirror = instanceMirror.reflectField(nameSymbol)\n\n// Access the field\nval value = fieldMirror.get\nprintln(s\"Scala reflection result: $value\") // \"Alice\"\n```\n\nWhen you call `fieldMirror.get`, Scala internally calls the Java reflection APIs exposed by the JVM. The mirrors are just a more elegant, type-safe wrapper around `java.lang.reflect.*` APIs. \n### Problem 2: Limited Runtime Reflection\nWhile Scala provides runtime reflection through its reflection API, it's not as straightforward as Python's `__annotations__` or `sig`. We still need to implicitly layer in a `TypeTag` so that we can capture type information under-the-hood without exposing the developer to weird APIs and also come up with a substitute for things like `__doc__` in Python. We can use `Annotations` in Scala for this. 
In Python, decorators are just functions that transform other functions. In Scala, annotations are more limited: they're metadata that needs to be extracted through reflection. \n\n## A Type-Safe LLM Tool Definition Framework In Scala\nNow that we know what the problems are and what tools we have available to solve them, we can actually build a fairly elegant solution. Let's break it down:\n\n\u003cdiv style=\"width: 200px; max-width: 100%;\"\u003e\n \u003cdiv class=\"tenor-gif-embed\" data-postid=\"26375280\" data-share-method=\"host\" data-aspect-ratio=\"1\" data-width=\"100%\"\u003e\n \u003ca href=\"https://tenor.com/view/breakdance-chineese-gif-26375280\"\u003eBreakdance Chineese GIF\u003c/a\u003e \n from \u003ca href=\"https://tenor.com/search/breakdance-gifs\"\u003eBreakdance GIFs\u003c/a\u003e\n \u003c/div\u003e\n\u003c/div\u003e\n\u003cscript type=\"text/javascript\" async src=\"https://tenor.com/embed.js\"\u003e\u003c/script\u003e\n\n### Why will Idea-1 not work?\nLet's go back to what we wanted to implement:\n```scala\n@Tool(\n name = \"get_weather\",\n description = \"Retrieves current weather for the given location.\"\n)\ndef getWeather(\n @Param(\n description = \"City and country e.g. Bogotá, Colombia\",\n required = true\n )\n location: String,\n @Param(\n description = \"Units the temperature will be returned in.\",\n enum = Array(\"celsius\", \"fahrenheit\"),\n required = false\n )\n units: String = \"celsius\"\n): String = ???\n```\n\nBased on what we know now, if we tried to write a function that takes a function as a generic parameter and tries to generate the schema for it:\n\n```scala\nimport scala.reflect.runtime.universe._\n\n@Tool(\n name = \"get_weather\",\n description = \"Retrieves current weather for the given location.\"\n)\ndef getWeather(\n @Param(description = \"City and country e.g. Bogotá, Colombia\")\n location: String,\n @Param(description = \"Units the temperature will be returned in.\")\n units: Option[String]\n): String = ???\n\ndef generateSchema[T](func: T)(implicit tag: TypeTag[T]): Unit = {\n val tpe = tag.tpe\n println(s\"Function type: $tpe\") \n val args = tpe.typeArgs\n args.foreach(t =\u003e println(s\"- $t\"))\n}\n\ngenerateSchema(getWeather _)\n\n/*\nOutput:\nFunction type: (String, Option[String]) =\u003e String\n- String\n- Option[String]\n- String\n*/\n```\n\nOne immediate problem you may notice here is that the parameter **names** are dropped. So we don't actually have access to the names \"location\" or \"units\", which makes serialization / de-serialization a challenge. This is mainly due to the lack of compile-time meta-programming support in Scala 2. Case classes, on the other hand, preserve this information, and they're actually sufficient to get the complete setup working. 
Here's how you do it for a case class.\n\n```scala\ndef extractCaseClassInfo[T: TypeTag]: Unit = {\n  val tpe = typeOf[T]  \n  // Get primary constructor parameters\n  val constructorSymbol = tpe.decl(termNames.CONSTRUCTOR).asMethod\n  val constructorParams = constructorSymbol.paramLists.flatten\n  println(s\"Case class: ${tpe.typeSymbol.name}\")\n  println(\"Fields:\")\n  constructorParams.foreach { param =\u003e\n    val name = param.name.toString\n    val paramType = param.typeSignature\n    println(s\"  - $name:\")\n    println(s\"      Type     : $paramType\")\n  }\n}\n\nextractCaseClassInfo[WeatherArgs]\n/*\nCase class: WeatherArgs\nFields:\n - location:\n Type : String\n - units:\n Type : Option[String]\n*/\n```\n### Idea-2\n```scala\nobject TemperatureUnit extends Enumeration {\n type TemperatureUnit = Value\n val CELSIUS, FAHRENHEIT = Value\n}\n\ncase class WeatherArgs(\n @Parameter(description = \"The city and state, e.g., Bangalore, IN\")\n location: String,\n @Parameter(description = \"The unit for the temperature\")\n unit: Option[TemperatureUnit.Value] = None\n)\n\n@Tool(name = \"get_current_weather\", description = \"Get the current weather in a given location\")\nclass WeatherTool extends ToolExecutor[WeatherArgs] {\n def execute(args: WeatherArgs): String = {\n s\"The weather in ${args.location} is 72 degrees ${args.unit.getOrElse(TemperatureUnit.FAHRENHEIT)}.\"\n }\n}\n```\n\nOkay, this is pretty similar to the `Pydantic` example and also a pretty clean interface to provide to developers. Can we implement this successfully?\n#### Defining the Abstraction\nWe need a few base components to get the above abstraction to work:\n##### Annotations\n**Annotations** in Scala will allow us to mark tools and their parameters. We can create an annotation like so:\n\n```scala\nclass Tool(val name: String, val description: String) extends StaticAnnotation\nclass Parameter(val description: String) extends StaticAnnotation\n```\n\nNow, we can take our case class, annotate it and then read these descriptions during runtime via reflection like so:\n\n```scala\ndef readAnnotation[T: TypeTag]: Unit = {\n val tpe = typeOf[T]\n \n // Helper to extract string literal from an annotation argument\n // This will go through the `Tree` exposed by reflection, find the correct annotation\n // variable via a constant `index` and return the value (or `default` if none exists).\n def extractStringArg(args: List[Tree], index: Int, default: String): String =\n args.lift(index).collect { \n case Literal(Constant(s: String)) =\u003e s \n }.getOrElse(default)\n \n // Read @Tool annotation from class\n tpe.typeSymbol.annotations\n .find(_.tree.tpe =:= typeOf[Tool])\n .foreach { ann =\u003e\n val args = ann.tree.children.tail\n val name = extractStringArg(args, 0, \"Unknown\")\n val desc = extractStringArg(args, 1, \"No description\")\n println(s\"Tool: name = $name, description = $desc\")\n }\n\n // Read @Parameter annotations from constructor params\n val constructorParams = tpe.decls\n .collectFirst { case m: MethodSymbol if m.isPrimaryConstructor =\u003e m }\n .map(_.paramLists.flatten)\n .getOrElse(Nil)\n \n constructorParams.foreach { param =\u003e\n param.annotations\n .find(_.tree.tpe =:= typeOf[Parameter])\n .foreach { ann =\u003e\n val desc = extractStringArg(ann.tree.children.tail, 0, \"No description\")\n println(s\"Parameter: ${param.name} -\u003e $desc\")\n }\n }\n}\n\nreadAnnotation[WeatherTool]\n// Tool: name = get_current_weather, description = Get the current weather in a given 
location\n\nreadAnnotation[WeatherArgs]\n// Parameter: location -\u003e The city and state, e.g., Bangalore, IN\n// Parameter: unit -\u003e The unit for the temperature\n```\n\n##### A `ToolExecutor` Base Class\nThe next thing we need is a fixed format that all tool calls should follow. This is a very solved problem: we can just define an `abstract class` and ensure that all our tools extend this class. We can have an `execute` method that we require all our tools to define. This would be the function that's called when the LLM makes the tool call and wants to execute our tool. \n\nHowever, a class like this would need to grab the `type` information of the parameters passed to the tool via a generic. And this is information that's erased at runtime. So we also need a `TypeTag`, implicitly materialized in our class, that automatically persists this information from compile time to runtime. This is the field we'll be performing our reflection operations on to generate the function schema at runtime. Apart from the `TypeTag`, we'll also store the `Class[_]` object, which is kind-of the Java class spec for our type `T`, grabbed via the reflection API. This will come in handy when we want to instantiate the class during de-serialization. \n\n```scala\nabstract class ToolExecutor[T: TypeTag] {\n def execute(args: T): String\n \n // Capture the TypeTag for later use\n private[aiutils] val typeTag: TypeTag[T] = implicitly[TypeTag[T]]\n private[aiutils] lazy val argClass: Class[_] = {\n val mirror = typeTag.mirror\n mirror.runtimeClass(typeTag.tpe.erasure)\n }\n}\n```\n#### Generating the Function Schema\nNow, we can finally put all of this together and use Scala's reflection API to extract all this information and auto-magically generate the JSON schemas for the LLM tool calls at runtime. Let's walk through how we can implement this functionality. \n\n```scala\ndef extractProperties(tpe: Type): (Map[String, Any], List[String]) = {\n // These are the two main things the LLM tool call API needs. \n // Properties is a `Map` of argument -\u003e description that contains argument description \u0026 type information\n // Required is a `List` that contains all the arguments that the LLM is required to populate\n val properties = mutable.Map[String, Any]()\n val required = mutable.ListBuffer[String]()\n \n // Let's start by fetching the primary constructor and grabbing the `paramList`\n val constructor = tpe.decl(termNames.CONSTRUCTOR).asMethod\n val params = constructor.paramLists.head\n params.foreach { param =\u003e\n val paramName = param.name.toString\n val paramType = param.typeSignature\n \n // Check if it's an Option type. This is to populate the `required` list.\n val (actualType, isOptional) = \n if (paramType.typeConstructor =:= typeOf[Option[_]].typeConstructor)\n (paramType.typeArgs.head, true)\n else (paramType, false)\n if (!isOptional) required += paramName\n \n // Extract the @Parameter annotation\n val description = param.annotations\n .find(_.tree.tpe =:= typeOf[Parameter])\n .flatMap(extractDescriptionFromAnnotation)\n .getOrElse(s\"Parameter $paramName\")\n \n // Generate JSON schema for this parameter. 
We can later augment this to handle different LLM formats easily by just modifying the `typeToJsonSchema` function.\n val schema = typeToJsonSchema(actualType) + (\"description\" -\u003e description)\n properties(paramName) = schema\n }\n (properties.toMap, required.toList)\n}\n```\n##### Handling Complex Types\nOne of the \"pretty\" things about this approach is that it's functional in nature. Kind of. We can naturally handle nested structures with recursion.\n\n```scala\n// The exact schema format is sometimes LLM-specific, but we can handle that repetitive logic here, at the lowest layer.\nprivate def typeToJsonSchema(tpe: Type): Map[String, Any] = {\n if (tpe =:= typeOf[String]) {\n Map(\"type\" -\u003e \"string\")\n } else if (tpe =:= typeOf[Int] || tpe =:= typeOf[Long]) {\n Map(\"type\" -\u003e \"integer\")\n } else if (tpe \u003c:\u003c typeOf[List[_]] || tpe \u003c:\u003c typeOf[Seq[_]]) {\n // Go recursion!\n val elementType = tpe.typeArgs.headOption.getOrElse(typeOf[String])\n Map(\"type\" -\u003e \"array\", \"items\" -\u003e typeToJsonSchema(elementType)) \n } else if (tpe.typeSymbol.isClass \u0026\u0026 tpe.typeSymbol.asClass.isCaseClass) {\n // Go recursion! (2)\n val (nestedProps, nestedRequired) = extractProperties(tpe)\n Map(\"type\" -\u003e \"object\", \"properties\" -\u003e nestedProps, \"required\" -\u003e nestedRequired)\n } else if (tpe.toString.endsWith(\".Value\")) { // Scala Enumerations\n // Extract enum values through reflection. Go Claude! \n val enumPath = tpe.toString.stripSuffix(\".Value\")\n val moduleSymbol = cm.staticModule(enumPath)\n val moduleMirror = cm.reflectModule(moduleSymbol)\n val enumInstance = moduleMirror.instance.asInstanceOf[Enumeration]\n Map(\"type\" -\u003e \"string\", \"enum\" -\u003e enumInstance.values.map(_.toString).toSeq)\n } else throw new Exception(s\"Cannot convert type to JSON schema: $tpe\")\n}\n```\n\nThis handles everything from primitive types to complex nested structures with arrays of objects: exactly what you need for real-world tools. One shitty thing about this is that the `Exception` for when I can't auto-magically parse something is thrown **only at runtime**, even though all the information is available at compile-time. Sad. But hey, as long as you make sure all your tool definitions are unit-tested, it's not the biggest problem. In Scala 3, I believe you should be able to implement all of this logic using compile-time reflection. \n#### De-Serialization for Function Execution\nThe final piece is executing tool calls. We need to:\n1. De-serialize the JSON arguments from the LLM into the `Args` class.\n2. Instantiate the case class.\n3. Call the tool's `execute` method.\n\nWe can implement the `Registry` pattern here and do something like this:\n\n```scala\n// Add a `generate` function to our `ToolSchemaGenerator`, something like:\nobject ToolSchemaGenerator {\n def generate[T: TypeTag](executor: ToolExecutor[T]) = {\n val tpe = typeOf[T]\n // Extract tool annotation...\n // Extract properties and required fields...\n // Return a schema object\n SchemaObject(\n `type` = \"function\",\n function = FunctionSchema(...)\n )\n }\n // .. 
The rest of the code\n}\n// --- \nclass ToolRegistry {\n private val tools = mutable.Map[String, (ToolExecutor[_], LlmTool)]()\n def register[T: TypeTag](executor: ToolExecutor[T]): Unit = {\n val schema = ToolSchemaGenerator.generate(executor)\n tools(schema.function.name) = (executor, schema)\n }\n \n def execute(name: String, jsonArgs: String): Try[String] = {\n tools.get(name) match {\n case Some((executor, _)) =\u003e\n Try {\n // Use Jackson to deserialize JSON to the case class\n val args = mapper.readValue(jsonArgs, executor.argClass)\n // Safe cast because we know the types match\n executor.asInstanceOf[ToolExecutor[Any]].execute(args)\n }\n case None =\u003e \n Failure(new NoSuchElementException(s\"Tool '$name' not found\"))\n }\n }\n}\n```\n\nNote that this only works because we stored the `Class[_]` object in the `ToolExecutor`, which Jackson (our JSON library) can use to deserialize the JSON into the correct type.\n## The Final Result\nWith all these pieces in place, using the framework is fairly simple (I think?):\n\n```scala\n@Tool(name = \"get_current_weather\", description = \"Get the current weather in a given location\")\nclass WeatherTool extends ToolExecutor[WeatherArgs] {\n def execute(args: WeatherArgs): String = {\n s\"The weather in ${args.location} is 72 degrees ${args.unit.getOrElse(FAHRENHEIT)}.\"\n }\n}\n\n// Register it\nval registry = new ToolRegistry()\nregistry.register(new WeatherTool())\nval schemas = registry.getToolSchemas\nmakeLlmCall(messages, schemas) // \u003c- Make the LLM calls\n\n// Execute tool calls from the LLM\nval result = registry.execute(\"get_current_weather\", \n \"\"\"{\"location\": \"Bangalore, IN\", \"unit\": \"CELSIUS\"}\"\"\")\nmakeLlmCall(messages ++ result, schemas) // \u003c- Or whatever ...\n```\n\nOverall, it's a pretty nice framework. To the end-user, all the complexities of:\n\n1. Generating type-checked JSON schemas for their Scala functions\n2. Generating description information for their Scala parameters \u0026 functions for the LLM\n3. Deserializing the LLM tool-call responses to type-checked argument classes\n4. Executing these tools and sending it back to the LLM\n\nAre more or less completely abstracted out and kept \"under-the-hood.\" Pretty neat.\n\nThere's more we can do here for sure. We can probably add some type validation and tool composition logic here as well, but that's for when I'm not as lazy :)", - "lastmodified": "2026-01-11T05:05:48.083575542+05:30", + "lastmodified": "2026-01-11T10:59:02.598586235+05:30", "tags": [ "domain-cs-ai-ml-llms", "domain-cs-languages-theory" @@ -101,8 +101,8 @@ }, "/blog/chain-matrix-multiplication": { "title": "Chain Matrix Multiplication", - "content": "Previously, we discussed a dynamic programming solution to solve [Levenshtein Edit Distance](/blog/levenshtein-edit-distance). Today, we'll look at another interesting problem.\n# Chain Matrix Multiplication / Parenthesization\n\n## The problem\n\nThe problem of chain matrix multiplication is quite interesting. We previously saw that the best-known algorithms for multiplying two matrices are around the order $O(n^{2.81})$, This is not very ideal, especially for multiplying a chain of matrices. However, there is something we can do to severely save computing power!\n\nConsider the following problem.\n\n$$ A = 20 \\times 1 \\\\ B = 1 \\times 20 \\\\ C = 20 \\times 1 \\\\ Compute \\ ABC $$\n\nBecause the multiplication is associative, we can **choose** what multiplication we wish to perform. 
That means, we can do both of the following.\n\n$(A \\times B) \\times C$ and $A \\times (B\\times C)$. Notice that if we did the former, our first computation would give us a $20 \\times 20$ matrix which must be multiplied with a $20 \\times 1$ matrix. This will give Strassen's input of the order $O(n^2)$.\n\n**However**, if we picked the alternate route, after the first multiplication, we would have a $1\\times1$ matrix to be multiplied with a $20 \\times 1$ matrix. This is far more superior and will help reduce the input sizes of the matrices we perform multiplication on as this gives Strassen's input only of the order $O(n)$.\n\n**This** is the core idea of chain matrix multiplication. A more general term for this problem can be _\"Parenthesization.\"_ It simply asks the question, _\"For some associative computation where each computation takes some cost $c$ to compute, what is the minimum cost I can incur in total for my total computation by just reordering the computation by rules of associativity?\"_\n\n## How do we approach this?\n\nWe realize pretty quickly that greedy approaches will not work here. There is no notion of the locally optimal solution. Even if we pick the first pairing to be the one that gives the least cost it says nothing about how this pick affects the later picks. Hence we must try them all out.\n\n## What about DP?\n\nHow can we effectively exploit some substructure of this problem to write a recursive solution?\n\nLet's say we're given a sequence of $n$ matrices to multiply $a_0 \\times a_1 \\times \\dots \\times a_{n-1}$.\n\nNotice that at any given point, we can use the following idea to divide the problem into sub-problems. For any given **continuous** sub-segment, I must divide it into a multiplication of exactly **two** segments. For the above sequence, let the optimal pairing be $[a_0 \\dots a_i]\\times[a_{i+1} \\dots a_{n-1}]$. Then this is the split that I must perform at this state.\n\nHow do I know what the optimal split is? I must simply try all possible positions for the split all the way from between $a_0$ , $a_1$ to $a_{n-2}$ , $a_{n-1}$. To \"try\" each of these possible positions, I must know beforehand the cost of calculating each subpart.\n\nSo far we've seen examples of prefix and suffix dp. In the LIS problem, we calculated the LIS for every prefix. For edit distance, we could've done it either using a prefix or suffix dp. However, we quickly realize that this problem does not have that kind of structure. It is a lot more difficult to draw the DAG structure for this problem as this problem does not have a very \"linear\" way of solving it. Notice that our solution essentially requires us to compute the minimum cost for each and every \"sub-segment\" in our array of matrices.\n\n### Arriving at the DP solution\n\nLet's try to answer the following questions as we try to arrive at our DP solution.\n\n1. **What is the number of subproblems?**\n \n As stated previously, we need to compute the optimal cost of multiplying every \"subarray\" of matrices. For some given array of length $N$ we can have $\\frac{N \\times (N+1)}{2}$ such sub-segments. (We will have 1 segment of length $N$, 2 of length $N-1$, etc. Which gives us a total of $\\sum_{i=1}^{n}i$)\n \n Hence our sub-problems are of the order of $O(n^2)$. Our DP will likely be at least of $n^2$ complexity.\n \n2. 
**Find the solution at some state and count the number of possibilities we have to brute force over**\n \n At some given state, notice that we are trying to compute the minimum cost required to multiply an ordered list of matrices from $[a_i\\dots a_j]$. To do so, we must brute force over all possible splits of this sub-array. The following pseudo-code will paint a better picture.\n \n ```cpp\n for k in [i, j-1]:\n \t\tDP[i][j] = min(DP[i][j], DP[i][k] + DP[k+1][j] + cost(M[i][k], M[k+1][j])\n ```\n \n Here, $DP[i][j]$ stores the minimum cost incurred in optimally multiplying the segment from $i \\to j$ and `cost` simply calculates the cost of multiplying the resultant two matrices $[a_i \\dots a_k]\\times[a_{k+1}\\dots a_j]$.\n \n Notice that for any given $i, j$ there are a linear number of problems we must brute force over. Hence this step of our algorithm will have $O(n)$ time complexity.\n \n3. **Finding the recurrence**\n \n We already derived the recurrence to explain the previous point better. The recurrence is the same as the one given in the pseudo-code. Each of the $DP[i][k]$ and $DP[k+1][j]$ states there represents the solution to one of its sub-problems.\n \n4. **Figuring out the DAG structure and making sure we don't have any cycles**\n \n This turns to be a lot messier and harder to work with for substring/subarray dp as compared to prefix/suffix dp. This is intuitively understood from the fact that we lose linear structure. Hence we will visit this topic at a later point in time.\n \n5. **Completing the solution**\n \n Notice that we have $O(n^2)$ sub-problems and each sub-problem requires $O(n)$ time to compute. This gives our algorithm an overall running time of $O(n^3)$ time complexity. And since we have $O(n^2)$ sub-problems we would require that much space to store the solutions to all our sub-problems.\n \n \u003eNote that this is fairly high complexity for an algorithm that simply just determines the best and most optimal order in which to multiply an ordered list of matrices. It does not make sense to spend time planning, coding, and integrating such an algorithm in the workflow pipeline if the matrix computations we are doing are fairly small.\n \u003e\n \u003eHowever, if we are working with matrices of huge sizes and the number of matrices is relatively smaller than the size of the matrices, precomputing the best order of multiplication before multiplying the matrices themselves could provide us with a **huge** boost in performance. Think about the example given at the beginning but several orders of magnitudes higher!\n\nAnother nice thing to notice is that this solution is not only applicable to chain matrix multiplication. We could've really changed the `cost` function in our algorithm to any cost function of our choice. In fact, the problem we have solved can be generalized to picking the optimal order of performing some operation on an ordered list of elements where the operation follows the **associativity** property alone.\n\n## Realizing the DAG structure\n\nAs mentioned before, it is not quite simple to understand the DAG structure for this problem. To get a good idea of what's going on, lets begin by simply drawing the recursion diagram for a small case. Let's say $[1, 4]$.\n\n![chain-matrix-mult-1](/images/chain-matrix-mult-1.webp)\n\nNotice that the leaves of our tree are all the sub-segments of length 1. Imagine visually pruning all the leaves from our tree. We will now have a new set of leaves.\n\nThese are the new states/sub-problems to calculate. 
Notice that after performing such an operation, we have a mix of segments of different lengths. But which ones can be computed completely after having just computed the previous leaf states?\n\nNotice that these are just the segments of length 2. $[1, 2], [2, 3], [3, 4]$. We can perform this operation again, and again, and so on till we reach $[1, 4]$. In general, this construction can be extended to any general $[1, n]$.\n\nFrom this, it is easy to realize that we are computing DP states in order of increasing the length of sub-segments. Our DAG would look as follows.\n\n![chain-matrix-mult-2](/images/chain-matrix-mult-2.webp)\n\nHere, I've attempted to paint the arrows showing the transition from a state of length just 1 below in green, 2 below in yellow, and 3 below in blue.\n\nThere are no cycles and we have $O(n^2)$ nodes.\n\n### DP ≠ Shortest Path on DAG\n\nWhile the shortest / longest path in a DAG example was quite useful to visualize DP previously, we must realize that this is not always the case. Why?\n\nThis is because the state at some node $[i, j]$ is not **just** dependent on the previous state. Remember that there is a cost associated with every multiplication that is dependent on the state it is being compared with.\n\nFor example, when computing the solution at node $[1, 3]$, it is not enough to just consider the cost from $[1, 1]$. The cost at $[1, 3]$ only has meaning when we sum up the total effect from both $[1, 1]$ **AND** $[2, 3]$.\n\nIn this DP solution, we cannot simply construct a DAG structure and find the longest/shortest path as the solution for that node is reliant on the values of **multiple** nodes. It was a great way to visualize and be introduced to DP, but it is **not** always the case :)\n\n### Can we do better?\n\nLast time, we were able to reduce the space complexity of our DP by realizing that the DP only relied on the states of the DP solution exactly **one** level below the current level. However, here we realize that this is sadly not the case. The solution at some node $[i, j]$ is very much reliant on every level below it. 1D row optimization etc does not seem to be of much use here. There is also no monotonicity that can be exploited to make the linear computation at some node logarithmic similar to how we did with LIS. Hence I do not think there is a better way to solve this problem.\n# References\nThese notes are old and I did not rigorously horde references back then. If some part of this content is your's or you know where it's from then do reach out to me and I'll update it. \n1. Professor [Kannan Srinathan's](https://www.iiit.ac.in/people/faculty/srinathan/) course on Algorithm Analysis \u0026 Design in IIIT-H\n\n---\n\n## Related Reading\n\n- [DP as DAGs, Shortest path on DAGs \u0026 LIS in O(nlogn)](/blog/dp-as-dags-shortest-path-on-dags-lis-in-o-nlogn)\n- [A Deep Dive into the Knapsack Problem](/blog/a-deep-dive-into-the-knapsack-problem)\n", - "lastmodified": "2026-01-11T05:05:48.332921011+05:30", + "content": "Previously, we discussed a dynamic programming solution to solve [Levenshtein Edit Distance](/blog/levenshtein-edit-distance). Today, we'll look at another interesting problem.\n# Chain Matrix Multiplication / Parenthesization\n\n## The problem\n\nThe problem of chain matrix multiplication is quite interesting. We previously saw that the best-known algorithms for multiplying two matrices are around the order $O(n^{2.81})$, This is not very ideal, especially for multiplying a chain of matrices. 
However, there is something we can do to drastically reduce the computing power required!\n\nConsider the following problem.\n\n$$ A = 20 \\times 1 \\\\ B = 1 \\times 20 \\\\ C = 20 \\times 1 \\\\ \\text{Compute } ABC $$\n\nBecause the multiplication is associative, we can **choose** what multiplication we wish to perform. That means we can do either of the following.\n\n$(A \\times B) \\times C$ and $A \\times (B\\times C)$. Notice that if we did the former, our first computation would give us a $20 \\times 20$ matrix which must be multiplied with a $20 \\times 1$ matrix. This will give Strassen's an input of the order $O(n^2)$.\n\n**However**, if we picked the alternate route, after the first multiplication, we would have a $1\\times1$ matrix to be multiplied with a $20 \\times 1$ matrix. This is far superior and helps reduce the input sizes of the matrices we multiply, as this gives Strassen's an input only of the order $O(n)$.\n\n**This** is the core idea of chain matrix multiplication. A more general term for this problem is _\"Parenthesization.\"_ It simply asks the question, _\"For some associative computation where each computation takes some cost $c$ to compute, what is the minimum cost I can incur in total for my total computation by just reordering the computation by rules of associativity?\"_\n\n## How do we approach this?\n\nWe realize pretty quickly that greedy approaches will not work here. There is no notion of a locally optimal solution. Even if we pick the first pairing to be the one that gives the least cost, it says nothing about how this pick affects the later picks. Hence we must try them all out.\n\n## What about DP?\n\nHow can we effectively exploit some substructure of this problem to write a recursive solution?\n\nLet's say we're given a sequence of $n$ matrices to multiply $a_0 \\times a_1 \\times \\dots \\times a_{n-1}$.\n\nNotice that at any given point, we can use the following idea to divide the problem into sub-problems. For any given **contiguous** sub-segment, I must divide it into a multiplication of exactly **two** segments. For the above sequence, let the optimal pairing be $[a_0 \\dots a_i]\\times[a_{i+1} \\dots a_{n-1}]$. Then this is the split that I must perform at this state.\n\nHow do I know what the optimal split is? I must simply try all possible positions for the split, all the way from between $a_0$ and $a_1$ to between $a_{n-2}$ and $a_{n-1}$. To \"try\" each of these possible positions, I must know beforehand the cost of calculating each subpart.\n\nSo far we've seen examples of prefix and suffix dp. In the LIS problem, we calculated the LIS for every prefix. For edit distance, we could've done it either using a prefix or suffix dp. However, we quickly realize that this problem does not have that kind of structure. It is a lot more difficult to draw the DAG structure for this problem as this problem does not have a very \"linear\" way of solving it. Notice that our solution essentially requires us to compute the minimum cost for each and every \"sub-segment\" in our array of matrices.\n\n### Arriving at the DP solution\n\nLet's try to answer the following questions as we try to arrive at our DP solution.\n\n1. **What is the number of subproblems?**\n \n As stated previously, we need to compute the optimal cost of multiplying every \"subarray\" of matrices. For some given array of length $N$ we can have $\\frac{N \\times (N+1)}{2}$ such sub-segments. (We will have 1 segment of length $N$, 2 of length $N-1$, etc. 
Which gives us a total of $\\sum_{i=1}^{n}i$)\n \n Hence our sub-problems are of the order of $O(n^2)$. Our DP will likely have at least $O(n^2)$ complexity.\n \n2. **Find the solution at some state and count the number of possibilities we have to brute force over**\n \n At some given state, notice that we are trying to compute the minimum cost required to multiply an ordered list of matrices from $[a_i\\dots a_j]$. To do so, we must brute force over all possible splits of this sub-array. The following pseudo-code will paint a better picture.\n \n ```cpp\n for k in [i, j-1]:\n \t\tDP[i][j] = min(DP[i][j], DP[i][k] + DP[k+1][j] + cost(M[i][k], M[k+1][j]))\n ```\n \n Here, $DP[i][j]$ stores the minimum cost incurred in optimally multiplying the segment from $i \\to j$ and `cost` simply calculates the cost of multiplying the resultant two matrices $[a_i \\dots a_k]\\times[a_{k+1}\\dots a_j]$.\n \n Notice that for any given $i, j$ there are a linear number of problems we must brute force over. Hence this step of our algorithm will have $O(n)$ time complexity.\n \n3. **Finding the recurrence**\n \n We already derived the recurrence to explain the previous point better. The recurrence is the same as the one given in the pseudo-code. Each of the $DP[i][k]$ and $DP[k+1][j]$ states there represents the solution to one of its sub-problems.\n \n4. **Figuring out the DAG structure and making sure we don't have any cycles**\n \n This turns out to be a lot messier and harder to work with for substring/subarray dp as compared to prefix/suffix dp. This is intuitively understood from the fact that we lose linear structure. Hence we will visit this topic at a later point in time.\n \n5. **Completing the solution**\n \n Notice that we have $O(n^2)$ sub-problems and each sub-problem requires $O(n)$ time to compute. This gives our algorithm an overall running time of $O(n^3)$ time complexity. And since we have $O(n^2)$ sub-problems we would require that much space to store the solutions to all our sub-problems.\n \n \u003eNote that this is fairly high complexity for an algorithm that just determines the optimal order in which to multiply an ordered list of matrices. It does not make sense to spend time planning, coding, and integrating such an algorithm in the workflow pipeline if the matrix computations we are doing are fairly small.\n \u003e\n \u003eHowever, if we are working with matrices of huge sizes and the number of matrices is relatively small compared to the sizes of the matrices, precomputing the best order of multiplication before multiplying the matrices themselves could provide us with a **huge** boost in performance. Think about the example given at the beginning but several orders of magnitude higher!\n\nAnother nice thing to notice is that this solution is not only applicable to chain matrix multiplication. We could just as well change the `cost` function in our algorithm to any cost function of our choice. In fact, the problem we have solved can be generalized to picking the optimal order of performing some operation on an ordered list of elements where the operation follows the **associativity** property alone.\n\n## Realizing the DAG structure\n\nAs mentioned before, it is not quite simple to understand the DAG structure for this problem. To get a good idea of what's going on, let's begin by simply drawing the recursion diagram for a small case. 
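Before drawing it, here's roughly what the full bottom-up solution looks like in code. This is a sketch rather than the pseudo-code above: the `dims` encoding (matrix $i$ has dimensions `dims(i) x dims(i+1)`) and the standard scalar-multiplication cost model are my own choices for illustration, and the cost term can be swapped out as discussed above.\n\n```scala\n// Bottom-up chain matrix multiplication: O(n^3) time, O(n^2) space.\n// dims has length n+1: matrix i has dimensions dims(i) x dims(i+1).\ndef chainMatrixCost(dims: Array[Long]): Long = {\n  val n = dims.length - 1\n  // dp(i)(j) = minimum cost to multiply the segment a_i ... a_j\n  val dp = Array.fill(n, n)(0L)\n  for (len \u003c- 2 to n; i \u003c- 0 to n - len) {\n    val j = i + len - 1\n    dp(i)(j) = Long.MaxValue\n    for (k \u003c- i until j) {\n      // cost of the final multiply: (dims(i) x dims(k+1)) * (dims(k+1) x dims(j+1))\n      val cost = dp(i)(k) + dp(k + 1)(j) + dims(i) * dims(k + 1) * dims(j + 1)\n      dp(i)(j) = math.min(dp(i)(j), cost)\n    }\n  }\n  dp(0)(n - 1)\n}\n\n// The ABC example from the beginning: A = 20x1, B = 1x20, C = 20x1\nprintln(chainMatrixCost(Array(20L, 1L, 20L, 1L))) // 40 for A(BC), versus 800 for (AB)C\n```\n\nNow, back to the diagram. 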
Let's say $[1, 4]$.\n\n![chain-matrix-mult-1](/images/chain-matrix-mult-1.webp)\n\nNotice that the leaves of our tree are all the sub-segments of length 1. Imagine visually pruning all the leaves from our tree. We will now have a new set of leaves.\n\nThese are the new states/sub-problems to calculate. Notice that after performing such an operation, we have a mix of segments of different lengths. But which ones can be computed completely after having just computed the previous leaf states?\n\nNotice that these are just the segments of length 2. $[1, 2], [2, 3], [3, 4]$. We can perform this operation again, and again, and so on till we reach $[1, 4]$. This construction can be extended to any general $[1, n]$.\n\nFrom this, it is easy to realize that we are computing DP states in order of increasing sub-segment length. Our DAG would look as follows.\n\n![chain-matrix-mult-2](/images/chain-matrix-mult-2.webp)\n\nHere, I've attempted to paint the arrows showing the transition from a state of length just 1 below in green, 2 below in yellow, and 3 below in blue.\n\nThere are no cycles and we have $O(n^2)$ nodes.\n\n### DP ≠ Shortest Path on DAG\n\nWhile the shortest / longest path in a DAG example was quite useful to visualize DP previously, we must realize that this is not always the case. Why?\n\nThis is because the state at some node $[i, j]$ is not **just** dependent on the previous state. Remember that there is a cost associated with every multiplication that is dependent on the state it is being combined with.\n\nFor example, when computing the solution at node $[1, 3]$, it is not enough to just consider the cost from $[1, 1]$. The cost at $[1, 3]$ only has meaning when we sum up the total effect from both $[1, 1]$ **AND** $[2, 3]$.\n\nIn this DP solution, we cannot simply construct a DAG structure and find the longest/shortest path, as the solution for a node is reliant on the values of **multiple** nodes. It was a great way to visualize and be introduced to DP, but it is **not** always the case :)\n\n### Can we do better?\n\nLast time, we were able to reduce the space complexity of our DP by realizing that the DP only relied on the states of the DP solution exactly **one** level below the current level. However, here we realize that this is sadly not the case. The solution at some node $[i, j]$ is very much reliant on every level below it. 1D row optimization, etc., does not seem to be of much use here. There is also no monotonicity that can be exploited to make the linear computation at some node logarithmic, similar to how we did with LIS. Hence I do not think there is a better way to solve this problem.\n# References\nThese notes are old and I did not rigorously hoard references back then. If some part of this content is yours or you know where it's from then do reach out to me and I'll update it. \n1. Professor [Kannan Srinathan's](https://www.iiit.ac.in/people/faculty/srinathan/) course on Algorithm Analysis \u0026 Design in IIIT-H",
+    "lastmodified": "2026-01-11T10:59:51.094314781+05:30",
     "tags": [
       "domain-cs-algorithms-analysis"
     ]
   },
   "/blog/church-turing-hypothesis": {
     "title": "Church-Turing Hypothesis",
     "content": "# Church-Turing Hypothesis\n\nThe Church-Turing hypothesis is essentially a hypothesis that answers the question, _\"What are algorithms?\"_ The hypothesis states that algorithms are nothing but Turing machines. Any mechanical computation can be performed by a Turing machine. 
If there is no Turing machine that can decide a problem P, then there is no algorithm that can solve P.\n\nNote that the Church-Turing Hypothesis is just that: a hypothesis, **not** a proof. However, it is widely accepted as it is founded on very logical arguments that are based on the \"axioms\" of computing we stated while defining what constitutes a computational solution.\n\n## The Turing Machine\n\nWe have an infinite length tape that has been split into an infinite number of finite length cells. We have a \"read/write\" head that can move along the tape (to the left or right cells). This read-write head operates on the current cell it is on top of. Note that this is in line with the fact that it is not possible to read/write to an infinite amount of memory. Hence it must operate on finite-sized cells.\n\nThe Turing machine is defined as a 7-tuple $(Q, \\Sigma, \\Gamma, \\delta, q_0, q_{accept}, q_{reject})$.\n\n1. $Q$ - **The finite set of states**\n \n The TM must at all times be in one of the finitely many _control_ states defined by the set $Q$. This is similar to the state diagram of circuits. When in a particular state, the TM decides how to respond to the symbol it reads based on the control state it is in.\n \n2. $\\Sigma$ - **The finite alphabet set (not containing blank)**\n \n $\\Sigma$ is the finite input alphabet. This is the _alphabet_ we encode our input in. Since input can be of any size, we also require a _blank_ symbol to represent the end of some particular input. This is analogous to the C implementation of strings. Strings can be thought of as the input with ASCII constituting its input alphabet. The blank character here would be the `\\\\0` null terminator which signifies the end of the string. Note that the blank symbol is **not** a part of this set.\n \n3. $\\Gamma$ - **The finite tape alphabet (includes blank)**\n \n $\\Gamma$ is the finite _working_ alphabet + the _input_ alphabet. It is hence a superset of the finite alphabet set $\\Sigma$. The tape alphabet must contain at least _one_ more symbol than the input alphabet, namely the blank symbol. However, apart from the blank symbol, we can have many more _work_ symbols which signify something of meaning to the TM read-write head. This is analogous to the \"instruction encoding\" of the ISAs used by modern computers.\n \n4. $\\delta:(Q\\times \\Gamma) \\to(Q\\times\\Gamma\\times\\{L, R\\})$ - **The transition function**\n \n When we are in some control state $q \\in Q$ and we read some tape symbol $\\gamma \\in \\Gamma$, we can move to some other state $q'$, overwrite the contents of the current cell with some tape symbol $\\gamma'$, and move either to the left or to the right. Note that $q'$ and $\\gamma'$ may be the same as $q$ and $\\gamma$. The transition function essentially just tells the Turing machine what to do when it reads some tape symbol while in some control state.\n \n Note that it is also possible to effectively remain on the same cell: we can encode a move to the right, and have the transition in the right cell move us back to the left. This would take 2 steps but the result is the same. The goal here is to define an abstract construct in a **simple** manner that allows us to represent any algorithm, _NOT_ to define the most efficient such construct.\n \n5. $q_0$ - **The start state**\n \n Whenever the machine begins to work on a decision problem it must begin in some pre-defined control state. This is the _start_ state of the machine.\n \n6. 
$q_{accept}$ - **The accept state**\n \n If the machine ever reaches this state, then the machine can decide that the input does indeed belong to the language, output `1`, and stop.\n \n7. $q_{reject}$ - **The reject state**\n \n If the machine ever reaches this state, then it can decide that the input does not belong to the language, output `0`, and stop.\n# References\nThese notes are old and I did not rigorously hoard references back then. If some part of this content is yours or you know where it's from then do reach out to me and I'll update it. \n1. Professor [Kannan Srinathan's](https://www.iiit.ac.in/people/faculty/srinathan/) course on Algorithm Analysis \u0026 Design in IIIT-H",
-    "lastmodified": "2026-01-11T05:05:48.340879575+05:30",
+    "lastmodified": "2026-01-11T10:59:02.608227841+05:30",
     "tags": [
      "domain-cs-algorithms-analysis",
      "domain-cs-algorithms-complexity"
    ]
  },
  "/blog/complexity-theory-reductions": {
    "title": "Complexity Theory - Reductions",
    "content": "# Reductions\n\n\u003eIn computability theory and computational complexity theory, a reduction is an algorithm for transforming one problem into another problem. A sufficiently efficient reduction from one problem to another may be used to show that the second problem is at least as difficult as the first.\n\nIntuitively, what does this mean?\n\nLet's say we have two problems $f$ and $g$. Let's suppose that problem $g$ has a known solution. Then the following is a reduction from $f \\to g$.\n\n![reductions-1](/images/reductions-1.webp)\n\nThe **\"Reduction\"** is basically finding those two blue boxes: one converts an input of problem $f$ into an equivalent input for problem $g$, and the other converts the output of $g$ back into the output required by $f$. Now we can simply compute the solution for problem $g$ and then use the reverse of our reduction algorithm to transform the output to that required by $f$.\n\nIf we can find two such blue boxes which can transform the input \u0026 output in such a way, then we can effectively say that problem $f$ has been reduced to solving problem $g$. This is because solving $g$ implies being able to solve $f$.\n\nWhat's more interesting is if these blue boxes are **polynomial-time** algorithms. If we can find poly-time algorithms which can perform this transformation of the input and output, then we have an _efficient_ reduction.\n\nWe have effectively managed to solve $f$ using the solution of $g$, along with some (hopefully efficient) pre- and post-processing.\n# References\nThese notes are old and I did not rigorously hoard references back then. If some part of this content is yours or you know where it's from then do reach out to me and I'll update it. \n1. A discussion with [Anurudh Peduri](https://anurudhp.github.io/) on the Theory Group Discord. ",
-    "lastmodified": "2026-01-11T05:05:48.414080416+05:30",
+    "lastmodified": "2026-01-11T10:59:02.683587403+05:30",
    "tags": [
      "domain-cs-algorithms-complexity"
    ]
  },
  "/blog/covered-calls-strategy": {
    "title": "Covered Calls Strategy",
-    "content": "# Introduction\nThe covered call strategy is an options trading strategy ([Derivatives - Options](/blog/derivatives-options)) in which an investor holds a long position in an underlying asset and simultaneously writes (sells) call options on that same asset. 
This strategy is employed by investors seeking to generate income from the premium received by selling the call options, while also benefiting from potential stock price appreciation up to the strike price of the call option.\n\n# How It Works\n![pasted-image-20240507114807](/images/pasted-image-20240507114807.webp)\nA call option is a contract that gives the buyer the right, but not the obligation, to purchase a specified quantity of an underlying asset at a predetermined strike price within a specific period. In the covered call strategy, the investor owns the underlying asset (e.g., shares of a stock) and writes (sells) call options on those shares. \n\n**Example:** For instance, consider an investor holding 100 shares of a stock currently trading at ₹400 per share (market value). The investor can sell a call option with a strike price of ₹420 and an option premium of ₹4 per share (₹400 total premium). The call option expires in 30 days.\n\nIf the stock price remains below ₹420 at expiration, the call option will not be exercised, and the investor retains the ₹400 premium as income. However, if the stock price rises above ₹420, the call option holder may exercise the option, requiring the investor to sell the shares at the strike price of ₹420, regardless of the higher market price. The payoff profile of the covered call strategy is characterized by a capped upside potential but limited downside risk. The maximum profit is equal to the call option premium received, plus the difference between the strike price and the purchase price of the underlying asset (if the option is exercised). The maximum loss is limited to the initial cost of purchasing the underlying asset, minus the premium received. The covered call strategy generates income from the call option premium, providing a cushion against potential downside risk in the underlying asset. However, the strategy also caps the upside potential, as the investor is obligated to sell the underlying asset at the strike price if the option is exercised. Additionally, the investor faces the risk of having the underlying asset called away, potentially missing out on further appreciation.\n\n# Advantages\n## Dividend Income\nIf the underlying asset is a dividend-paying stock, the investor continues to receive dividends while the covered call position is held.\n## Stock Price Appreciation\nBy selling out-of-the-money call options, the investor can benefit from potential stock price appreciation up to the strike price of the call option.\n## Option Premium Income\nThe premium received from selling the call option generates additional income for the investor. If the option expires unexercised, the investor can sell a new call option on the same underlying position, generating recurring premium income.\n## Downside Hedge\nThe covered call strategy is sometimes regarded as a downside hedge for an investor's portfolio. If the portfolio remains flat or declines in value, the income generated from selling call options can partially offset the losses, providing a cushion against downside risk compared to simply holding the underlying assets.\n# Disadvantages\n## Capped Upside Potential\nWhile the covered call strategy provides three potential sources of return (dividend income, stock price appreciation up to the strike price, and option premium income), the third source of return is achieved by significantly restricting the second source – stock price appreciation. 
As illustrated by the payoff graph, the covered call strategy caps the upside potential, as any appreciation beyond the strike price is transferred to the call option buyer. Consequently, the strategy effectively hedges potential upside gains.\n## Higher Tax\nThe premiums received from selling call options are generally considered short-term capital gains, which are taxed at a higher rate than long-term capital gains. Additionally, if the options are exercised, requiring the sale and repurchase of the underlying asset, the investor may need to report and pay taxes on the capital gains, potentially reducing the tax-deferral benefits of holding the position.\n## Share Holding Power Reduction\nIf the stock price fluctuates significantly, even if it remains flat over the long term, the covered call strategy may result in a gradual reduction in the number of shares held due to the exercise of call options. This can lead to realizing losses, despite the overall position being flat.\n## Risk-Adjusted Return Considerations\nWhile some people argue that covered call strategies may offer superior risk-adjusted returns, meaning higher returns per unit of risk taken, this claim is subject to debate and may depend on the specific market conditions and underlying assets involved.\n\n---\n\n## Related Reading\n\n- [Efficient-Market-Hypothesis](/blog/efficient-market-hypothesis)\n- [What is the Stock Market?](/blog/what-is-the-stock-market)\n",
- "lastmodified": "2026-01-11T05:05:48.554248898+05:30",
+ "content": "# Introduction\nThe covered call strategy is an options trading strategy ([Derivatives - Options](/blog/derivatives-options)) in which an investor holds a long position in an underlying asset and simultaneously writes (sells) call options on that same asset. This strategy is employed by investors seeking to generate income from the premium received by selling the call options, while also benefiting from potential stock price appreciation up to the strike price of the call option.\n\n# How It Works\n![pasted-image-20240507114807](/images/pasted-image-20240507114807.webp)\nA call option is a contract that gives the buyer the right, but not the obligation, to purchase a specified quantity of an underlying asset at a predetermined strike price within a specific period. In the covered call strategy, the investor owns the underlying asset (e.g., shares of a stock) and writes (sells) call options on those shares. \n\n**Example:** For instance, consider an investor holding 100 shares of a stock currently trading at ₹400 per share (market value). The investor can sell a call option with a strike price of ₹420 and an option premium of ₹4 per share (₹400 total premium). The call option expires in 30 days.\n\nIf the stock price remains below ₹420 at expiration, the call option will not be exercised, and the investor retains the ₹400 premium as income. However, if the stock price rises above ₹420, the call option holder may exercise the option, requiring the investor to sell the shares at the strike price of ₹420, regardless of the higher market price.\n\nThe payoff profile of the covered call strategy is characterized by a capped upside potential but limited downside risk. The maximum profit is equal to the call option premium received, plus the difference between the strike price and the purchase price of the underlying asset (if the option is exercised). The maximum loss is limited to the initial cost of purchasing the underlying asset, minus the premium received.\n\nThe covered call strategy generates income from the call option premium, providing a cushion against potential downside risk in the underlying asset. However, the strategy also caps the upside potential, as the investor is obligated to sell the underlying asset at the strike price if the option is exercised. Additionally, the investor faces the risk of having the underlying asset called away, potentially missing out on further appreciation.
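\n\nTo tie the example's numbers together, here is a small sketch of the per-share payoff at expiry (the helper name is my own):\n\n```cpp\n#include\u003calgorithm\u003e\n#include\u003ccstdio\u003e\n\n// Covered call P/L at expiry, per share: stock P/L plus the premium kept,\n// minus whatever the call buyer claims above the strike.\ndouble covered_call_pnl(double spot, double buy_price, double strike, double premium){\n\tdouble stock_pnl = spot - buy_price;\n\tdouble call_payoff = std::max(spot - strike, 0.0); // owed to the option buyer\n\treturn stock_pnl + premium - call_payoff;\n}\n\nint main(void){\n\t// Example above: bought at 400, strike 420, premium 4\n\tconst double spots[] = {380, 400, 420, 450};\n\tfor(double spot : spots)\n\t\tstd::printf(\"spot %.0f: P/L %.0f\\n\", spot, covered_call_pnl(spot, 400, 420, 4));\n\t// The upside caps at (420 - 400) + 4 = 24; any fall is cushioned by the premium\n}\n```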
\n\n# Advantages\n## Dividend Income\nIf the underlying asset is a dividend-paying stock, the investor continues to receive dividends while the covered call position is held.\n## Stock Price Appreciation\nBy selling out-of-the-money call options, the investor can benefit from potential stock price appreciation up to the strike price of the call option.\n## Option Premium Income\nThe premium received from selling the call option generates additional income for the investor. If the option expires unexercised, the investor can sell a new call option on the same underlying position, generating recurring premium income.\n## Downside Hedge\nThe covered call strategy is sometimes regarded as a downside hedge for an investor's portfolio. If the portfolio remains flat or declines in value, the income generated from selling call options can partially offset the losses, providing a cushion against downside risk compared to simply holding the underlying assets.\n# Disadvantages\n## Capped Upside Potential\nWhile the covered call strategy provides three potential sources of return (dividend income, stock price appreciation up to the strike price, and option premium income), the third source of return is achieved by significantly restricting the second source – stock price appreciation. As illustrated by the payoff graph, the covered call strategy caps the upside potential, as any appreciation beyond the strike price is transferred to the call option buyer. Consequently, the strategy effectively trades away potential upside gains.\n## Higher Tax\nThe premiums received from selling call options are generally considered short-term capital gains, which are taxed at a higher rate than long-term capital gains. Additionally, if the options are exercised, requiring the sale and repurchase of the underlying asset, the investor may need to report and pay taxes on the capital gains, potentially reducing the tax-deferral benefits of holding the position.\n## Share Holding Power Reduction\nIf the stock price fluctuates significantly, even if it remains flat over the long term, the covered call strategy may result in a gradual reduction in the number of shares held due to the exercise of call options. This can lead to realizing losses, despite the overall position being flat.\n## Risk-Adjusted Return Considerations\nWhile some people argue that covered call strategies may offer superior risk-adjusted returns, meaning higher returns per unit of risk taken, this claim is subject to debate and may depend on the specific market conditions and underlying assets involved.",
+ "lastmodified": "2026-01-11T10:59:49.916345087+05:30",
"tags": [
"domain-finance-quantitative"
]
},
@@ -135,7 +135,7 @@
"/blog/de-novo-assembly-overlap-graphs": {
"title": "De-Novo Assembly \u0026 Overlap Graphs",
"content": "# Preface \u0026 References\nI document topics I've discovered and my exploration of these topics while following the course, [Algorithms for DNA Sequencing, by Johns Hopkins University](https://www.coursera.org/learn/dna-sequencing) on [Coursera](https://www.coursera.org/). 
The course is taken by two instructors, [Ben Langmead](https://scholar.google.com/citations?user=2JMaTKsAAAAJ\u0026hl=en) and [Jacob Pritt](https://www.coursera.org/instructor/jacobpritt).\n\nWe will study the fundamental ideas, techniques, and data structures needed to analyze DNA sequencing data. In order to put these and related concepts into practice, we will combine what we learn with our programming expertise. Real genome sequences and real sequencing data will be used in our study. We will use Boyer-Moore to improve on naïve exact matching. We then learn about indexing, preprocessing, grouping and ordering in indexing, k-mers and k-mer indices, and how to solve the approximate matching problem. Finally, we will discuss solving the alignment problem and explore interesting topics such as De Bruijn Graphs, Eulerian walks and the shortest common superstring problem. \n\nAlong the way, I document content I've read about while exploring related topics such as suffix string structures and relations to my research work on the STAR aligner.\n# De-Novo Assembly\n\nNow that we've covered the section where we worked on the genome reconstruction problem ([DNA Sequencing](/blog/dna-sequencing)) assuming the existence of another genome from the same species, what do we do when there exists no such previously reconstructed genome? Such a situation can occur when we're studying the genome of a new exotic species or simply lack access to said genome. This was the problem that the original scientists who worked on the Human Genome Project had to deal with, and the problem is indeed far more computationally intensive than when we already have a snapshot to work with. \n\n## Core Ideas\n\nTo slowly build up to the solution, let us first understand the key ideas involved in the problem. We essentially have many, many short reads of DNA sequences from the main genome and need to somehow piece them back together to reconstruct the original genome. To reiterate, we are given these short reads in no particular order and have no picture of where to match these short reads in order to reconstruct the main sequence.\n\nTo solve this problem, let us begin by working back from the final solution. \n\n![coverage](/images/coverage.webp)\n\nLet's suppose we did know the positions of the short reads in the original sequence. We then define the term **coverage** as the number of overlapping reads for each character $c$ of the main genome. We can then simply define a term **average coverage** as the coverage we can expect for each character of the sequence given the length of the sequence, length of each read and the total number of short reads we have of the sequence. \n\n$$\\text{Avg. Coverage} = \\frac{\\text{Length of read } \\times \\text{ Number of reads}}{\\text{Length of genome}}$$\n\nFor the above example, the value comes out to be around $5$ (simply round to the nearest integer). We hence call this a **5-fold** coverage of the genome. Now notice that if we have two overlapping reads, the *suffix* of one read is **very similar** to the *prefix* of the next read. This follows from the fact that they are overlapping, consecutive reads. From this, we get the laws of assembly.\n\n1. If a suffix of read $A$ is similar to a prefix of read $B$, then $A$ and $B$ *might* overlap.\n\n2. More coverage leads to more and longer overlaps.\n\n3. **Repeats are bad.** (Will be discussed later.)\n\nNote that in the first law we again use the term *similar*, because there can be errors. These mainly stem from DNA sequencing errors and from *ploidy*: species can carry multiple copies of each chromosome (humans, for example, carry two), and these copies can differ slightly.\n\n
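To make the first law concrete, here is a small sketch of computing the longest suffix-prefix overlap between two reads (the function name and the exact-match simplification are mine; real assemblers tolerate mismatches):\n\n```cpp\n#include\u003calgorithm\u003e\n#include\u003ccstdio\u003e\n#include\u003cstring\u003e\n\n// Length of the longest suffix of a that exactly matches a prefix of b.\n// Overlaps shorter than min_len are ignored; they carry too little evidence.\nint overlap_length(const std::string \u0026a, const std::string \u0026b, int min_len){\n\tint upper = (int)std::min(a.size(), b.size());\n\tfor(int len = upper; len \u003e= min_len; --len)\n\t\tif(a.compare(a.size() - len, len, b, 0, len) == 0)\n\t\t\treturn len;\n\treturn 0; // no sufficiently long overlap, so no edge in the overlap graph\n}\n\nint main(void){\n\t// The suffix \"CGGA\" of the first read matches a prefix of the second\n\tstd::printf(\"%d\\n\", overlap_length(\"TTACGGA\", \"CGGATTA\", 3)); // prints 4\n}\n```\n\n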
## Overlap Graphs\n\nWe define overlap graphs for a particular set of reads as follows.\n\nLet the nodes of the graph represent the reads we have obtained of the genome. Now, there exists an edge $e$ between an **ordered** pair of nodes $(u, v)$ when a suffix of $u$ overlaps with a prefix of $v$.\n\nNow, not all overlaps are equally important. For example, an overlap of size $1$ occurs very frequently just by chance, and so provides little evidence that the two reads are actually consecutive overlapping reads in the genome. Hence, we can build overlap graphs where an edge $e$ exists between an ordered pair of nodes $(u, v)$ only when the overlap between them exceeds some constant value. Consider the following overlap graph for overlaps of size $\\geq 4$.\n\n![overlap-graph](/images/overlap-graph.webp)\n\n",
- "lastmodified": "2026-01-11T05:05:51.0200153+05:30",
+ "lastmodified": "2026-01-11T10:59:04.626614105+05:30",
"tags": [
"domain-science-bioinformatics"
]
},
@@ -143,7 +143,7 @@
"/blog/defining-computational-problems": {
"title": "Defining Computational Problems",
"content": "# Introduction\nTo study \u0026 analyze algorithms, we must have a solid foundational theory for what algorithms are. We often define algorithms as a series of steps that must be followed to solve a problem. This raises a few basic questions that we must answer concretely.\n\n- What are **computational** problems?\n- Once _a_ solution is obtained, how do we know that it is indeed correct?\n- Say we have 2 solutions `a` \u0026 `b`. How can we compare these 2 solutions? On what basis can we say one solution is better than the other?\n- Once we can do the above, how can we find the lower bound, i.e., the optimal solution to a problem? And more importantly, can we prove that it is indeed optimal?\n\n# What are computational problems?\n\nThere are many kinds of problems in the world. Math problems, world peace \u0026 hunger problems, problems arising from eating a bat, etc., to name a few. We wish to concern ourselves with problems of a particular class, **computational** problems. However, there are many difficulties associated with defining such a class. Consider the following challenges:\n\n- **There may be infinite ways of posing the same problem**\n \n Consider the following problems.\n \n 1. _What is the smallest prime number?_\n 2. _What is the smallest even positive integer?_\n 3. _What is the GCD of the set of even positive integers?_\n \n Notice that the output (solution) for all the above problems is two. From a computational standpoint, the above problems are all the same. But there are infinite ways to pose the same problem.\n \n- **How do we pose a question without solving it?**\n \n Consider the age-old problem of sorting. The problem is usually phrased as follows,\n \n \u003e _Given a sequence of `n` integers, among all possible permutations of this sequence, output such a sequence that it is in ascending order. That is, $a_i \\leq a_{i+1} \\ \\ \\forall \\ \\ 1 \\leq i \\lt n$_\n \n Notice that in the above statement, we provide the solution to the question itself. Phrased differently, the problem is essentially telling us to iterate over all possible $n!$ permutations and pick the permutation such that the sequence is in ascending order. While granted, this isn't a _good_ solution, it is a solution. 
We must come up with a way to pose problems that we do not have a solution to yet. Or maybe even problems for which there does _not_ exist any solution.\n \n- **What kind of tools can we allow to be used to solve the problem?**\n \n Notice that depending on the solutions that are allowed to a problem, the meaning of \"solving\" the problem changes. Consider the following _mythological_ sort algorithm.\n \n 1. Meditate until god appears\n 2. Provide input to god\n 3. Obtain sorted output from god\n \n This is not a _computational_ solution to the sorting problem. Hence it is necessary to enforce a _computational_ constraint on problems.\n \n ## Defining a computational problem\n \n When defining a computational problem, we make the following assumption about the world we live in.\n \n \u003e🧭 The input is digitized. We live in a noisy environment. Whenever there is noise, if 2 symbols are closer than some threshold, we say they are in the same equivalence class and are one and the same. This ensures that the total number of symbols we must deal with becomes finite in a finite space. This ensures that we're able to digitize the input and output.\n \n Further, assume that the output has multiple but a finite number of bits. We can then model each bit as a separate problem. This enables us to model all problems with finite output as decision problems. A decision problem is simply a problem with **1-bit** output. \"_Is this true or false?\"_\n \n **This allows us to pose problems as membership queries in _\"languages.\"_**\n \n ### Defining a \"Language\"\n \n We can reduce the question _\"X is my input and I am looking for the output Y\"_ to \"_Does my input X belong to the set of all inputs which give output one?\"_\n \n \u003e💬 **Languages** Each decision problem is characterized by a subset of the set of all possible inputs (i.e., a subset of, say, $\\{0, 1\\}^*$).\n \u003e\n \u003e $L = \\{x \\in \\{0,1\\}^* \\ | \\ \\text{the answer on } x \\text{ is } 1\\}$\n \u003e\n \u003e For example: Consider the problem of checking if a sequence is sorted or not.\n \u003e \n \u003e Let us encode our strings as numbers in binary separated by some terminator which splits each number in the sequence. Our question now reduces to, _\"Given a string encoded in this form, does it belong to the language $L_{sorted}$?\"_ The string that encodes the sequence {1, 2, 3} would belong to this set. But the string which encodes the sequence {3, 2, 5} would not. Our encoding must be able to represent all different inputs in a unique manner. Notice that this has allowed us to reduce the problem to a simple decision problem of querying membership in our language $L_{sorted}$\n \n
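 A tiny sketch of such a membership query (the binary encoding with a `#` terminator is one arbitrary, hypothetical choice):\n \n```cpp\n#include\u003ccstdio\u003e\n#include\u003csstream\u003e\n#include\u003cstring\u003e\n\n// Decides membership in L_sorted for inputs encoded as binary numbers\n// separated by a '#' terminator. Returns the 1-bit answer.\nint in_L_sorted(const std::string \u0026input){\n\tstd::stringstream ss(input);\n\tstd::string token;\n\tlong prev = -1; // encoded numbers are non-negative, so -1 is a safe floor\n\twhile(std::getline(ss, token, '#')){\n\t\tlong value = std::stol(token, nullptr, 2); // parse one binary number\n\t\tif(value \u003c prev) return 0; // out of order, not in the language\n\t\tprev = value;\n\t}\n\treturn 1;\n}\n\nint main(void){\n\tstd::printf(\"%d\\n\", in_L_sorted(\"1#10#11\"));   // {1, 2, 3} -\u003e 1\n\tstd::printf(\"%d\\n\", in_L_sorted(\"11#10#101\")); // {3, 2, 5} -\u003e 0\n}\n```\n \n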
 This is essentially how all problems in complexity theory are formalized.\n \n This formalization allows us to deal with most of the challenges aforementioned. Multiple ways to pose the same question no longer matter as the problems are characterized by the language. If the set of all possible inputs for which the output is 1 is the same for 2 different problems, then they are one and the same. Further, we can now pose problems without providing a solution as it is possible for us to define sets without having to enumerate them.\n \n We will discuss the problem of what tools the solver can be allowed to use when discussing how we define a _solution_ to a computational problem.\n \n # What are solutions in the world of computing?\n \n While there is no such thing as \"axioms of computation\" in standard literature, Prof. Kannan calls the following the base \"axioms\" or assumptions we make when defining what counts as a solution.\n \n - **It takes non-zero time to retrieve data from far-off locations**\n \n This essentially implies that the flow of information is not instantaneous. Consider memory so large that it spans the distance from Earth to Mars. If information transfer from the 2 ends of this memory was instantaneous then it would imply information traveling faster than light speed. This shouldn't be possible. Hence it is not feasible to allow our computational solutions to be able to have **random** access to huge chunks of memory.\n \n - **Only finite information can be stored/retrieved from finite volume**\n \n We cannot store infinite information in finite memory. Note that this assumption/axiom essentially invalidates solutions that allow time travel. If time travel were possible, we could go back in time to change the contents of memory/access infinite different states of some finite volume, and hence, allow infinite information access from finite memory. This is now ruled out.\n \n - **A finite-length code only exerts a finite amount of control**\n \n Any finite-length program cannot be omnipotent. That is, because the number of instructions is finite, there can only be a finite number of states the machine can exist in. Both the symbols making the instruction set and the instructions are finite, limiting the states the machine can be in to a finite number.\n \n \n \u003e ⚠️ Note that these assumptions are made because we are limited by the technology of our time. If we are able to construct technology that can, indeed, violate any of the above \"axioms\", then we will in fact be able to come up with a model of computation where we will be able to solve problems much harder than the ones we are able to do today.\n# How to compare computational solutions?\n\nNow that we have defined computational problems and solutions, we need a way to compare two different solutions to a problem and say deterministically which solution is _\"better\"._\n\nHowever, we again run into multiple challenges. It is difficult to come up with a deterministic answer to a somewhat subjective question, _\"Which solution is better?\"_\n\nIn the field of complexity theory, we usually focus on worst-case analysis/asymptotic analysis. We measure the performance of a solution in terms of its input size. However, note that this is not necessarily the best method to compare two solutions. Let's say some solution 'a' takes (a heavy) constant amount of time to run and another solution 'b' runs in logarithmic time. For larger inputs, we should see algorithm `a` perform better than `b`. But it may be true that our machine is never provided large inputs. In this case, it might be better to compare the best case.\n\n
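To make this concrete, suppose (with numbers made up purely for illustration) that\n\n$$T_a(n) = 10^6, \\qquad T_b(n) = 10^3 \\log_2 n$$\n\nThen $T_b(n) \\lt T_a(n)$ exactly when $\\log_2 n \\lt 10^3$, i.e., when $n \\lt 2^{1000}$. Even though `a` is asymptotically better, `b` is faster on every input that could realistically be provided, so the asymptotic winner need not be the practical winner.\n\n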
To judge which algorithm is \"better\", we can say that the solution which uses fewer resources to compute is better. However, we run into another challenge. Computing a solution does not usually just depend on _one_ resource. One very precious resource is time. But there are other resources that matter too. Space \u0026 power are two other important resources.\n\nIn general, we put **time** on a pedestal compared to all other resources. In general, every other resource can be reused or generated in some manner. We can reuse memory \u0026 generate power, but time once lost can never be gained back. Hence unless specified explicitly, when we compare 2 solutions, we often implicitly assume that we are comparing 2 solutions based on the most precious resource **time.**\n\n\u003e ☄️ Note that, _technically,_ according to relativity: space and time are the same constructs and we can effectively interchange them. It is possible to say that I can start a program that would take 50 million years to compute, leave it running on Earth, board a spaceship for a trip across the universe close to the speed of light, and collect the output when I return, having aged only a month myself. However, as we lack the resources to be able to do anything even remotely close to this in the near (or distant) future, we ignore these technicalities when deciding on the quantity we'd primarily wish to compare algorithms with.\n\nNow that we've defined what a computational problem is, let's try to construct useful arguments using this definition. [Are there computational problems that computers cannot solve?](/blog/are-there-computational-problems-that-computers-cannot-solve). How do we define a \"solution\" to a computational problem in an \"algorithmic\" sense? Note that this theory dates back to before computers were invented. How do we formalize a notion of a machine that can carry out these tasks? This is what the [Church-Turing Hypothesis](/blog/church-turing-hypothesis) aims to answer. \n# References\nThese notes are old and I did not rigorously hoard references back then. If some part of this content is yours or you know where it's from then do reach out to me and I'll update it. \n1. Professor [Kannan Srinathan's](https://www.iiit.ac.in/people/faculty/srinathan/) course on Algorithm Analysis \u0026 Design in IIIT-H",
- "lastmodified": "2026-01-11T05:05:51.029195778+05:30",
+ "lastmodified": "2026-01-11T10:59:04.636800764+05:30",
"tags": [
"domain-cs-algorithms-analysis",
"domain-cs-algorithms-complexity"
]
},
@@ -151,8 +151,8 @@
"/blog/derivatives-options": {
"title": "Derivatives - Options",
- "content": "In [What is the Stock Market?](/blog/what-is-the-stock-market), we learnt about what a stock market is, what stocks (or shares) of companies are and why people trade for them on the stock market. We use the blanket term **Equities** to refer to the company stocks traded on the stock market. \n\n\u003e Equity, typically referred to as shareholders' equity (or owners' equity for privately held companies), represents the amount of money that would be returned to a company's shareholders if all of the assets were liquidated and all of the company's debt was paid off in the case of liquidation. - [Equity Definition: What it is, How It Works and How to Calculate It - Investopedia](https://www.investopedia.com/terms/e/equity.asp)\n\nWhat we will discuss in this chapter, is a specific **derivative** of an financial instrument (here, a stock), called an **option**. \n\n\u003eA derivative is a security whose underlying asset dictates its pricing, risk, and basic term structure. Each derivative has an underlying asset that dictates its pricing, risk, and basic term structure. The perceived risk of the underlying asset influences the perceived risk of the derivative. - [Derivatives 101 - Investopedia](https://www.investopedia.com/articles/optioninvestor/10/derivatives-101.asp)\n\n# History \u0026 Origin\n\n\u003eThe earliest known options were bought around 600 BC by the Greek Philosopher Thales of Miletus. He believed that the coming summer would yield a bumper crop of olives. 
To make money of this idea, he could have purchased olive presses, which if you were right, would be in great demand, but he didn't have enough money to buy the machines. So instead he went to all the existing olive press owners and paid them a little bit of money to **secure the option to rent their presses in the summer for a specified price**. When the harvest came, Thales was right, there were so many olives that the price of renting a press skyrocketed. Thales paid the press owners their pre-agreed price, and then he rented out the machines at a higher rate and pocketed the difference. Thales had executed the first known call option.\n\u003e\n\u003e**CALL OPTION**\n\u003eA call option gives you the right, but not the obligation to buy something at a later date for a set price known as the strike price. Call options are useful if you expect the price to go up.\n\u003e\n\u003e**PUT OPTION**\n\u003eYou can also buy a put option, which gives you the right, but not the obligation to sell something at a later date for the strike price. Put options are useful if you expect the price to go down. \n\u003e\n\u003e- [The Trillion Dollar Equation - Veritasium](https://www.youtube.com/watch?v=A5w-dEgIU1M\u0026t=148s)\n\n# A Toy Example \nImagine you're bullish on Reliance Industries (RIL) and think its share price will rise. The current price of RIL is ₹1000, but you can buy a **call option** that gives you the **right, but not the obligation**, to buy RIL shares i**n one year** for **₹1000** (the **strike price**) by paying a **premium**, say ₹100.\n\n\u003e**Quick side note:** There are two main *styles* of options: American and European. American options allow you to exercise the option at any point before the expiry date. European options allow you to exercise the option on the expiry date. We'll focus on European options for now. In certain places, if the trader doesn’t specify exercising instructions, it goes for compulsory exercising by the regulatory authority and that day is termed as the exercise date for that option.\n\nSo, if after a year the price of RIL shoots up to ₹1300, you can use your option to buy shares at ₹1000 and immediately sell them at ₹1300. Here, after factoring in the ₹100 premium you paid, you've pocketed a profit of ₹200 (₹1300 selling price - ₹1000 strike price - ₹100 premium).\n\nHowever, if the share price tanks to ₹700 in a year, you simply let the option expire, losing only the ₹100 you paid for it.\n## PnL ANALYSIS\n![pasted-image-20240310192917](/images/pasted-image-20240310192917.webp)\n- **If the stock price falls below the strike price, you lose the option premium.** (In this case, you lose ₹100)\n- **But if the price climbs higher than the strike price, you earn the difference minus the option cost.** (Here, you make a profit of ₹200)\n\n| | | **PRICE INCREASES** | | **PRICE DECREASES** | |\n| -------------- | ------------------ | ------------------- | ---------- | ------------------- | ---------- |\n| **Instrument** | **Money Invested** | **Profit/Loss** | **Return** | **Profit/Loss** | **Return** |\n| Stock | ₹1000 | ₹300 | **30%** | -₹300 | **-30%** |\n| Option | ₹100 | ₹200 | **200%** | -₹100 | **-100%** |\nThe key thing to note here is the percentage difference in returns between the profit and loss scenarios. Options provide **massive leverage**. With the same ₹1000, I can instead choose to buy 10 options and possibly make ₹2000 in profit or stand to lose the entire amount invested (₹1000). 
\n### Strike Price\nThe predetermined price at which the holder of a stock option has the right (call option) or obligation (put option) to buy or sell the underlying stock / financial instrument.\n### In-the-Money (ITM) Option\nAn option is considered \"in the money\" if the current market price of the stock is already **favorable** for the option holder to exercise the option.\n- For a **call option**, the stock price should be **higher** than the strike price.\n- For a **put option**, the stock price should be **lower** than the strike price.\n### Out-(Of)-the-Money (OTM) Option\nAn option is considered \"out of the money\" if the current market price of the stock is **not favorable** for the option holder to exercise the option.\n- For a **call option**, the stock price should be **lower** than the strike price.\n- For a **put option**, the stock price should be **higher** than the strike price.\n# Advantages of Using Options\n## Limited Downside Risk\nCompared to buying the stock directly, options limit your potential losses. If you had bought RIL shares instead of the option and the price went down to ₹10, you'd lose ₹990. The downside risk with stocks is possibly infinite. With options, you only lose the premium, no matter how low the stock price goes. That said, most traders usually always place a stop-loss on the stocks they have in holding to artificially limit their downside. However, if the stock crashes in a single day, it might not be possible to trade at the stop loss and you might still stand to lose a lot more. With an option, you have a **fixed** downside. \n## Leverage\nOptions offer leverage, which means you can amplify your returns. If you had directly bought RIL at ₹1000 and the price went up to ₹1300, your investment would've grown by 30%. But with the option, you only paid ₹100 upfront. So your profit of ₹200 is actually a 200% return on your investment (₹200 profit / ₹100 option cost). However, remember that if the price falls, you lose your entire ₹100 premium, whereas owning the stock would only mean a loss equivalent to the fall in price. This is both useful and extremely risky if used as a gambling option. In practice, downside with stable stocks is not much compared to the 100% downside with options. \n## Hedging\nOptions can be a hedging tool to manage risk in your portfolio. They were originally created to mitigate risk, and can act like insurance for your stock holdings. To understand this better, let's walk through another toy example. \n### Toy Example\nImagine you're a big believer in HDFC Bank's long-term prospects, but you're worried about a potential market crash and want to hedge yourself against this risk. You currently hold 100 shares of HDFC Bank, currently priced at ₹2500 each (total investment: ₹2,50,000). To hedge against this risk, you **buy put options**. Think of a put option as an **insurance policy** for your stock. You can buy a put option that gives you the right, but not the obligation, to sell your HDFC Bank shares at a predetermined price (strike price) by a specific expiry date. For example, let's say you buy a put option with a strike price of ₹2500 and an expiry date of 3 months for a premium of ₹50 per share (total premium cost: ₹5000 for 100 shares). Now, let's do some PnL analysis. \n#### PnL ANALYSIS\n- **SCENARIO 1: Market Crash**\n\tThe worst happens. The market crashes, and HDFC Bank's share price drops to ₹2000. Without the options hedge, you would lose ₹$(2500 - 2000) \\times 100$ = ₹50,000. 
But, because you hedged yourself by buying put options, you can exercise your put option and sell your 100 HDFC Bank shares at the predetermined strike price of ₹2500 each (total sell value: ₹2,50,000). Here's the PnL breakdown:\n\t- Loss from stock price drop =\u003e ₹50,000\n\t- Profit from put option: ₹2500 (strike price) $\\times$ 100 shares - ₹2000 (cost of buying HDFC share now) $\\times$ 100 shares - ₹5000 (premium) = ₹45.000\n\tBy using the put option, you limited your loss to the cost of the premium (₹5000) instead of the entire ₹50,000 drop in stock price. \n\t\n- **SCENARIO 2: HDFC Stock Booms!**\n\tThankfully, the market remains stable, and HDFC Bank's share price even goes up to ₹2800. In this case, you wouldn't exercise the put option since you can sell your shares at a higher price in the open market. The put option would simply expire, and you would lose the initial premium of ₹5000. But that's a small price to pay for the security the put option provided during those nervous market moments.\n#### Key Takeaway\nOptions offer a flexible way to hedge your stock portfolio. While they won't completely eliminate risk, they can act as a safety net to minimize your losses in case the stock price takes a tumble. Think of it as setting a stop loss on your stock investments that you know you're guaranteed to bottom out at and you pay the insurance cost upfront. \n# Going Long vs Short on Options\n## Call Options\n### Long Call\nBuying a call option grants you the **right, but not the obligation**, to purchase a stock at a specific price (strike price) by a certain date (expiry). You're essentially betting the stock price will rise above the strike price by expiry. It's a **bullish** strategy.\n### Short Call\nSelling a call option obligates you to sell the underlying stock at the strike price by expiry if the buyer exercises the option. You collect a premium upfront for taking on this obligation. This strategy is used when you're **neutral** or **slightly bullish** on the stock price, believing it won't significantly rise above the strike price by expiry. It carries **unlimited potential loss** if the stock price soars.\n## Put Options\n### Long Put\nBuying a put option grants you the **right, but not the obligation**, to sell a stock at a specific price (strike price) by a certain date (expiry). You're essentially betting the stock price will fall below the strike price by expiry. It's a **bearish** strategy.\n### Short Put\nSelling a put option obligates you to buy the underlying stock at the strike price by expiry if the buyer exercises the option. You collect a premium upfront for taking on this obligation. This strategy is used when you're **bullish** on the stock's long-term prospects but believe it might dip in the short term. It offers **limited profit** but protects against a significant price decline (capped at the difference between the strike price and the purchase price).\n# Settlement Methods\nThis is an exchange specific problem, but different exchanges tackle the method of future / option contract settlement in different ways. The two ways of dealing with this implementation detail are **cash settlement** and **physical delivery**.\n## Cash Settlement\nCash settlement simplifies stock option contracts in India by eliminating the physical delivery of shares. 
\n\n\u003eA **cash settlement** is a settlement method used in certain futures and options contracts where, upon expiration or exercise, the seller of the financial instrument does not deliver the actual (physical) underlying asset but instead transfers the associated cash position. - [Cash Settlement - Investopedia](https://www.investopedia.com/terms/c/cashsettlement.asp)\n\nImagine you believe RIL's share price will fall and decide to go short on a call option contract for 100 shares. Traditionally, exercising this option would require you to purchase those 100 shares on contract expiry. With cash settlement, you only deal with the price difference at expiry.\n\n- **Profit Scenario:** Let's say you entered the contract when RIL was trading at ₹2,500 per share, with the contract quoting a predetermined purchase price of ₹2,800 per share. If the share price plummets to ₹2,000 at expiry, **the seller wouldn't have to arrange funds for the unwanted shares**. Instead, the long position (who bet on the price going up) would simply credit you with the difference – (₹2,800 - ₹2,000) $\\times$ 100 shares = ₹30,000.\n- **Loss Scenario:** Conversely, if RIL's share price skyrockets to ₹3,500, you wouldn't have to buy 100 shares at ₹3,500 either. You can just pay the difference: ₹70,000. \n\nIt eliminates the hassle of physical share delivery, focusing solely on the price differential at expiry. This translates to a more efficient and safer market for stock exchanges as the risk they have to take up is less. Since the one taking the loss side of the trade doesn't need to have assets to buy the entire underlying stock and just needs to pay the difference, which is often much cheaper in comparison.\n\n## Physical Delivery\nPhysical delivery in stock options is the traditional method where the underlying shares are physically exchanged upon expiry. While cash settlement simplifies things, physical delivery offers a different experience. Physical delivery adds an extra layer of complexity compared to cash settlement. It requires managing the logistics of share certificates and potential delivery costs.\n\nThe Indian National Stock Exchange (since July 2018 expiry contracts), uses physical delivery as the mode of settlement of futures contracts. \n\n\u003eAs stated in this [SEBI circular](https://www.sebi.gov.in/legal/circulars/apr-2018/review-of-framework-for-stocks-in-derivatives-segment_38629.html), starting from July 2018 expiry, F\u0026O positions are being settled moved from cash settlement mode to compulsory physical delivery settlement in a phased manner. Starting from October 2019 expiry, all stock F\u0026O contracts will be compulsorily physically settled. If you hold a position in any Stock F\u0026O contract, at expiry, you will be required to give/take delivery of stocks. \n\u003e\n\u003e- **In the money contracts (ITM)**\n\u003e\tAll ITM contracts which aren’t CTM will be mandatorily exercised by the exchange. This means that anyone holding an ITM option contract will receive/give delivery of stocks depending on whether one is holding call/put options. All the costs arising out of this delivery obligation will be applied to the client’s account.\n\u003e\n\u003e- **Out of the money contracts (OTM)**\n\u003e\tAll OTM options will expire worthless. There will be no delivery obligations arising out of this.\n\u003e\n\u003e**Spread and covered contracts**\n\u003e\tSpread contracts that result in both – take and give delivery obligation will be netted off for the client. 
For example, you have a bull call spread of Reliance of the same expiry, a lot of long call options of strike 1300 and a lot of short call options of strike 1320 and the spot expires at 1330, this will result in a net-off and there won’t be any delivery obligation.\n\u003e\t\n\u003e- [Policy on settlement of compulsory delivery derivative contracts — Update Oct 2019 - Zerodha](https://zerodha.com/z-connect/general/policy-on-settlement-of-compulsory-delivery-derivative-contracts-update-oct-2019)\n\u003e\n\u003ePhysical delivery of stock options can potentially lead to systemic risk in the capital markets and pose a risk to traders. \n\u003e\n\u003e**The physical delivery risk**\n\u003e\tLike I mentioned earlier, if you hold stock futures or any in the money stock option at the close of expiry, you are assigned to give or take delivery of the entire contract value worth of stocks. Since the risk goes up with respect to the client not having enough cash to take delivery or stock to give delivery, the margins required to hold a future or short option position goes up as we get closer to expiry. Margins required are a minimum of 40% of the contract value for futures on the last day of expiry. For in the money long or buy option positions, a delivery margin is assigned from 4 days before expiry. The margins for in the money long options [go up from 10% to 50% of contract value](https://support.zerodha.com/category/trading-and-markets/margin-leverage-and-product-and-order-types/articles/policy-on-physical-settlement)—50% on the last two days of expiry. If the customer doesn’t have sufficient funds or stocks to give or take delivery, the broker squares off the contract. If the customer shows an intent to hold after the higher margin is blocked, it shows an intent to give or take delivery. \n\u003e\t\n\u003e\tThe risk though comes from out of the money options that suddenly turn in the money on the last day of expiry. No additional margins are blocked for OTM options in the expiry week, and when it suddenly turns in the money, a customer with small amounts of premium and no margin can get assigned to give or take large delivery positions, causing significant risk to the trader and the brokerage firm.\n\u003e\t\n\u003e- [Physical delivery of stock F\u0026O \u0026 their risks - Zerodha](https://zerodha.com/z-connect/general/physical-delivery-of-stock-fo-their-risks)\n\n### A Case Study on the Risk Involved in Physical Delivery Settlement\n\u003eThis happened on Dec expiry, Thursday 30th Dec 2021. Shares of Hindalco closed at Rs 449.65 at expiry. This meant that the Hindalco 450 PE expired just in the money by 35 paise. This meant that everyone who had bought this 450 PE and held it at the expiry was required to deliver Hindalco stock—1075 shares for every 1 lot of Hindalco held. \n\u003e\n\u003eThis is what happened to Hindalco shares on 30th Dec:\n\u003e\n\u003e![pasted-image-20240312051304](/images/pasted-image-20240312051304.webp)\n\u003e\n\u003eThe stock was above Rs 450 for most of the expiry day and even a few days prior to it. Since it was out of money, no additional physical margins would have been charged, and everyone holding this strike would have assumed that it would expire out of the money. In all likelihood, everyone who held this put option would have written off the trade as a loss and assumed that the maximum loss would be limited to the premium paid. \n\u003e\n\u003eSo at 3 pm, when the Hindalco stock price went below 450, this was how the marketdepth looked like. 
Those who realized that this option would expire in the money trying to exit, but with no buyers to be able to do so even at Rs 0.05 when the intrinsic value of the strike was Rs 0.35.\n\u003e\n\u003eEveryone holding long puts would have been forced assigned to deliver Hindalco shares. 1 lot of Hindalco = 1075 shares = ~Rs 5lks contract value. Customers who had bought put options with a few thousand rupees were potentially required to deliver tens of lakhs of Hindalco stock. Failing to deliver would have meant short delivery. The [consequences of short delivery](https://support.zerodha.com/category/trading-and-markets/trading-faqs/general/articles/what-is-short-delivery-and-what-are-its-consequences) are losses in terms of auction penalty, apart from the market risk of Hindalco stock price going up from the close of expiry to the auction date. Hindalco stock was up 5% already on Friday, and the auction happens on T+3 days or on Tuesday, and assuming the stock price doesn’t go up further, that is still a whopping loss of Rs 25 (5% of Hindalco) for Rs 0.35 worth of premium at market close. \n\u003e\n\u003eIf this wasn’t puts but calls, there wouldn’t be a short delivery risk, but there would still be a market risk that the customer would be exposed to from the close of expiry to when the customer can sell the stock. But in case of buy delivery (Buy futures, buy calls, short puts), the stock can be sold the next day itself and hence there is no marked to market risk of 3 days. The risk is exponentially more in the case of F\u0026O positions that can lead to short delivery (Short futures, sell calls, buy puts). \n\u003e\n\u003eThe risk exists with futures, short options, and buy ITM options as well. But since there are sufficient margins that also go up closer to expiry, a customer who provides additional margin is willingly holding the position, or else the position is squared off. Because there are no additional physical delivery margins for OTM options and because most option buyers think that when they buy options the maximum they can lose is equal to the premium paid and take no action, the risks go up for the entire ecosystem.\n\u003e\n\u003eApart from the risk to the trader, this can be a systemic issue because if a customer account goes into debit, the liability falls on the broker. A large individual trader or group of customers of a broker could potentially go into a large enough debit to bankrupt the brokerage firm and, in turn, put the risk on other customers as well. Stocks can move drastically on expiry day, and out of the money, option contracts can suddenly move just in the money with no liquidity to exit, making it impossible for brokerage risk management teams to do anything. All option contracts are settled based on the last 30 min average price of the underlying stock and not the last traded price, making this even trickier without knowing if a CTM option strike will actually close in the money or not until post the market closing. And like I explained earlier, the risk is not just in terms of the auction and short delivery, but also marked to market risk for 3 days.\n\u003e\n\u003eForcing traders to give or take large delivery positions can potentially be misused by large traders or operators wanting to manipulate the price movement of stocks.\n\u003e- [Physical delivery of stock F\u0026O \u0026 their risks](https://zerodha.com/z-connect/general/physical-delivery-of-stock-fo-their-risks)\n\n# References\n1. 
[The Trillion Dollar Equation](https://www.youtube.com/@veritasium)\n2. [What is Zerodha's policy on the physical settlement of equity derivatives on expiry?](https://support.zerodha.com/category/trading-and-markets/margins/margin-leverage-and-product-and-order-types/articles/policy-on-physical-settlement)\n3. [Cash Settlement - Investopedia](https://www.investopedia.com/terms/c/cashsettlement.asp)\n4. [Physical Delivery - Investopedia](https://www.investopedia.com/terms/p/physicaldelivery.asp)\n5. [Policy on settlement of compulsory delivery derivative contracts — Update Oct 2019 - Zerodha](https://zerodha.com/z-connect/general/policy-on-settlement-of-compulsory-delivery-derivative-contracts-update-oct-2019)\n6. [Physical delivery of stock F\u0026O \u0026 their risks - Zerodha](https://zerodha.com/z-connect/general/physical-delivery-of-stock-fo-their-risks)\n\n\n---\n\n## Related Reading\n\n- [Covered Calls Strategy](/blog/covered-calls-strategy)\n- [Efficient-Market-Hypothesis](/blog/efficient-market-hypothesis)\n",
- "lastmodified": "2026-01-11T05:05:51.5240464+05:30",
+ "content": "In [What is the Stock Market?](/blog/what-is-the-stock-market), we learnt about what a stock market is, what stocks (or shares) of companies are and why people trade them on the stock market. We use the blanket term **Equities** to refer to the company stocks traded on the stock market. \n\n\u003e Equity, typically referred to as shareholders' equity (or owners' equity for privately held companies), represents the amount of money that would be returned to a company's shareholders if all of the assets were liquidated and all of the company's debt was paid off in the case of liquidation. - [Equity Definition: What it is, How It Works and How to Calculate It - Investopedia](https://www.investopedia.com/terms/e/equity.asp)\n\nWhat we will discuss in this chapter is a specific **derivative** of a financial instrument (here, a stock), called an **option**. \n\n\u003eA derivative is a security whose underlying asset dictates its pricing, risk, and basic term structure. Each derivative has an underlying asset that dictates its pricing, risk, and basic term structure. The perceived risk of the underlying asset influences the perceived risk of the derivative. - [Derivatives 101 - Investopedia](https://www.investopedia.com/articles/optioninvestor/10/derivatives-101.asp)\n\n# History \u0026 Origin\n\n\u003eThe earliest known options were bought around 600 BC by the Greek Philosopher Thales of Miletus. He believed that the coming summer would yield a bumper crop of olives. To make money off this idea, he could have purchased olive presses, which, if he was right, would be in great demand, but he didn't have enough money to buy the machines. So instead he went to all the existing olive press owners and paid them a little bit of money to **secure the option to rent their presses in the summer for a specified price**. When the harvest came, Thales was right, there were so many olives that the price of renting a press skyrocketed. Thales paid the press owners their pre-agreed price, and then he rented out the machines at a higher rate and pocketed the difference. Thales had executed the first known call option.\n\u003e\n\u003e**CALL OPTION**\n\u003eA call option gives you the right, but not the obligation, to buy something at a later date for a set price known as the strike price. 
Call options are useful if you expect the price to go up.\n\u003e\n\u003e**PUT OPTION**\n\u003eYou can also buy a put option, which gives you the right, but not the obligation, to sell something at a later date for the strike price. Put options are useful if you expect the price to go down. \n\u003e\n\u003e- [The Trillion Dollar Equation - Veritasium](https://www.youtube.com/watch?v=A5w-dEgIU1M\u0026t=148s)\n\n# A Toy Example \nImagine you're bullish on Reliance Industries (RIL) and think its share price will rise. The current price of RIL is ₹1000, but you can buy a **call option** that gives you the **right, but not the obligation**, to buy RIL shares **in one year** for **₹1000** (the **strike price**) by paying a **premium**, say ₹100.\n\n\u003e**Quick side note:** There are two main *styles* of options: American and European. American options allow you to exercise the option at any point before the expiry date. European options allow you to exercise the option only on the expiry date. We'll focus on European options for now. In certain places, if the trader doesn’t specify exercising instructions, the option goes for compulsory exercising by the regulatory authority, and that day is termed the exercise date for that option.\n\nSo, if after a year the price of RIL shoots up to ₹1300, you can use your option to buy shares at ₹1000 and immediately sell them at ₹1300. Here, after factoring in the ₹100 premium you paid, you've pocketed a profit of ₹200 (₹1300 selling price - ₹1000 strike price - ₹100 premium).\n\nHowever, if the share price tanks to ₹700 in a year, you simply let the option expire, losing only the ₹100 you paid for it.\n## PnL ANALYSIS\n![pasted-image-20240310192917](/images/pasted-image-20240310192917.webp)\n- **If the stock price falls below the strike price, you lose the option premium.** (In this case, you lose ₹100)\n- **But if the price climbs higher than the strike price, you earn the difference minus the option cost.** (Here, you make a profit of ₹200)\n\n| | | **PRICE INCREASES** | | **PRICE DECREASES** | |\n| -------------- | ------------------ | ------------------- | ---------- | ------------------- | ---------- |\n| **Instrument** | **Money Invested** | **Profit/Loss** | **Return** | **Profit/Loss** | **Return** |\n| Stock | ₹1000 | ₹300 | **30%** | -₹300 | **-30%** |\n| Option | ₹100 | ₹200 | **200%** | -₹100 | **-100%** |\nThe key thing to note here is the percentage difference in returns between the profit and loss scenarios. Options provide **massive leverage**. With the same ₹1000, I can instead choose to buy 10 options and possibly make ₹2000 in profit or stand to lose the entire amount invested (₹1000). 
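\nThe PnL bullets above are just the kinked payoff function of a long call. A minimal sketch reproducing the example's numbers (the helper name is my own):\n\n```cpp\n#include\u003calgorithm\u003e\n#include\u003ccstdio\u003e\n\n// P/L of a long European call at expiry, per share.\n// Exercising pays off only when the spot ends above the strike.\ndouble long_call_pnl(double spot, double strike, double premium){\n\treturn std::max(spot - strike, 0.0) - premium;\n}\n\nint main(void){\n\t// RIL example: strike 1000, premium 100\n\tstd::printf(\"%.0f\\n\", long_call_pnl(1300, 1000, 100)); // +200\n\tstd::printf(\"%.0f\\n\", long_call_pnl(700, 1000, 100));  // -100\n}\n```\n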
\n### Strike Price\nThe predetermined price at which the holder of a stock option has the right to buy (call option) or sell (put option) the underlying stock / financial instrument.\n### In-the-Money (ITM) Option\nAn option is considered \"in the money\" if the current market price of the stock is already **favorable** for the option holder to exercise the option.\n- For a **call option**, the stock price should be **higher** than the strike price.\n- For a **put option**, the stock price should be **lower** than the strike price.\n### Out-(Of)-the-Money (OTM) Option\nAn option is considered \"out of the money\" if the current market price of the stock is **not favorable** for the option holder to exercise the option.\n- For a **call option**, the stock price should be **lower** than the strike price.\n- For a **put option**, the stock price should be **higher** than the strike price.\n# Advantages of Using Options\n## Limited Downside Risk\nCompared to buying the stock directly, options limit your potential losses. If you had bought RIL shares instead of the option and the price went down to ₹10, you'd lose ₹990. The downside with stocks extends all the way to your entire investment. With options, you only lose the premium, no matter how low the stock price goes. That said, most traders place a stop-loss on the stocks they hold to artificially limit their downside. However, if the stock crashes in a single day, it might not be possible to trade at the stop loss and you might still stand to lose a lot more. With an option, you have a **fixed** downside. \n## Leverage\nOptions offer leverage, which means you can amplify your returns. If you had directly bought RIL at ₹1000 and the price went up to ₹1300, your investment would've grown by 30%. But with the option, you only paid ₹100 upfront. So your profit of ₹200 is actually a 200% return on your investment (₹200 profit / ₹100 option cost). However, remember that if the price falls, you lose your entire ₹100 premium, whereas owning the stock would only mean a loss equivalent to the fall in price. This is both useful and extremely risky if used as a gambling instrument. In practice, the downside with stable stocks is small compared to the 100% downside possible with options. \n## Hedging\nOptions can be a hedging tool to manage risk in your portfolio. They were originally created to mitigate risk, and can act like insurance for your stock holdings. To understand this better, let's walk through another toy example. \n### Toy Example\nImagine you're a big believer in HDFC Bank's long-term prospects, but you're worried about a potential market crash and want to hedge yourself against this risk. You currently hold 100 shares of HDFC Bank, currently priced at ₹2500 each (total investment: ₹2,50,000). To hedge against this risk, you **buy put options**. Think of a put option as an **insurance policy** for your stock. You can buy a put option that gives you the right, but not the obligation, to sell your HDFC Bank shares at a predetermined price (strike price) by a specific expiry date. For example, let's say you buy a put option with a strike price of ₹2500 and an expiry date of 3 months for a premium of ₹50 per share (total premium cost: ₹5000 for 100 shares). Now, let's do some PnL analysis. \n#### PnL ANALYSIS\n- **SCENARIO 1: Market Crash**\n\tThe worst happens. The market crashes, and HDFC Bank's share price drops to ₹2000. Without the options hedge, you would lose ₹$(2500 - 2000) \\times 100$ = ₹50,000. But, because you hedged yourself by buying put options, you can exercise your put option and sell your 100 HDFC Bank shares at the predetermined strike price of ₹2500 each (total sell value: ₹2,50,000). Here's the PnL breakdown:\n\t- Loss from stock price drop =\u003e ₹50,000\n\t- Profit from put option: ₹2500 (strike price) $\\times$ 100 shares - ₹2000 (cost of buying an HDFC share now) $\\times$ 100 shares - ₹5000 (premium) = ₹45,000\n\tBy using the put option, you limited your loss to the cost of the premium (₹5000) instead of the entire ₹50,000 drop in stock price. \n\t\n- **SCENARIO 2: HDFC Stock Booms!**\n\tThankfully, the market remains stable, and HDFC Bank's share price even goes up to ₹2800. In this case, you wouldn't exercise the put option since you can sell your shares at a higher price in the open market. The put option would simply expire, and you would lose the initial premium of ₹5000. But that's a small price to pay for the security the put option provided during those nervous market moments.\n#### Key Takeaway\nOptions offer a flexible way to hedge your stock portfolio. While they won't completely eliminate risk, they can act as a safety net to minimize your losses in case the stock price takes a tumble. Think of it as setting a stop loss on your stock investment that is guaranteed to execute, with the insurance cost paid upfront. \n
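Both scenarios are again just a payoff function, this time of stock plus a protective put. A small sketch with the HDFC numbers (the helper name is my own):\n\n```cpp\n#include\u003calgorithm\u003e\n#include\u003ccstdio\u003e\n\n// P/L of shares hedged one-for-one with protective puts, at expiry.\ndouble hedged_pnl(double spot, double buy_price, double strike, double premium, double shares){\n\tdouble stock_pnl = (spot - buy_price) * shares;\n\tdouble put_payoff = std::max(strike - spot, 0.0) * shares; // insurance below the strike\n\treturn stock_pnl + put_payoff - premium * shares;\n}\n\nint main(void){\n\t// Bought at 2500, strike 2500, premium 50 per share, 100 shares\n\tstd::printf(\"%.0f\\n\", hedged_pnl(2000, 2500, 2500, 50, 100)); // -5000, just the premium\n\tstd::printf(\"%.0f\\n\", hedged_pnl(2800, 2500, 2500, 50, 100)); // +25000\n}\n```\n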
But, because you hedged yourself by buying put options, you can exercise your put option and sell your 100 HDFC Bank shares at the predetermined strike price of ₹2500 each (total sell value: ₹2,50,000). Here's the PnL breakdown:\n\t- Loss from stock price drop =\u003e ₹50,000\n\t- Profit from put option: ₹2500 (strike price) $\\times$ 100 shares - ₹2000 (cost of buying HDFC share now) $\\times$ 100 shares - ₹5000 (premium) = ₹45,000\n\tBy using the put option, you limited your loss to the cost of the premium (₹5000) instead of the entire ₹50,000 drop in stock price. \n\t\n- **SCENARIO 2: HDFC Stock Booms!**\n\tThankfully, the market remains stable, and HDFC Bank's share price even goes up to ₹2800. In this case, you wouldn't exercise the put option since you can sell your shares at a higher price in the open market. The put option would simply expire, and you would lose the initial premium of ₹5000. But that's a small price to pay for the security the put option provided during those nervous market moments.\n#### Key Takeaway\nOptions offer a flexible way to hedge your stock portfolio. While they won't completely eliminate risk, they can act as a safety net to minimize your losses in case the stock price takes a tumble. Think of it as a stop loss on your stock investment that is guaranteed to execute at the strike price, with the insurance cost paid upfront. 
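\n\nHere's a similar sketch for the protective put above (again my own, with a hypothetical `hedgedPnl` helper), netting the stock PnL against the put payoff and the premium:\n\n```cpp\n#include \u003ciostream\u003e\n#include \u003calgorithm\u003e\n\n// Net expiry PnL of holding qty shares bought at buyPrice, hedged with\n// puts of the given strike bought for premiumPerShare upfront.\ndouble hedgedPnl(double spot, double buyPrice, double strike, double premiumPerShare, int qty) {\n\tdouble stockPnl = (spot - buyPrice) * qty;\n\tdouble putPayoff = std::max(strike - spot, 0.0) * qty;\n\treturn stockPnl + putPayoff - premiumPerShare * qty;\n}\n\nint main() {\n\t// HDFC example: 100 shares @ 2500, put strike 2500, premium 50/share\n\tstd::cout \u003c\u003c hedgedPnl(2000, 2500, 2500, 50, 100) \u003c\u003c '\\n'; // -5000 (crash: loss capped at the premium)\n\tstd::cout \u003c\u003c hedgedPnl(2800, 2500, 2500, 50, 100) \u003c\u003c '\\n'; // 25000 (boom: upside minus the premium)\n}\n```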
\n# Going Long vs Short on Options\n## Call Options\n### Long Call\nBuying a call option grants you the **right, but not the obligation**, to purchase a stock at a specific price (strike price) by a certain date (expiry). You're essentially betting the stock price will rise above the strike price by expiry. It's a **bullish** strategy.\n### Short Call\nSelling a call option obligates you to sell the underlying stock at the strike price by expiry if the buyer exercises the option. You collect a premium upfront for taking on this obligation. This strategy is used when you're **neutral** or **slightly bullish** on the stock price, believing it won't significantly rise above the strike price by expiry. It carries **unlimited potential loss** if the stock price soars.\n## Put Options\n### Long Put\nBuying a put option grants you the **right, but not the obligation**, to sell a stock at a specific price (strike price) by a certain date (expiry). You're essentially betting the stock price will fall below the strike price by expiry. It's a **bearish** strategy.\n### Short Put\nSelling a put option obligates you to buy the underlying stock at the strike price by expiry if the buyer exercises the option. You collect a premium upfront for taking on this obligation. This strategy is used when you're **bullish** on the stock's long-term prospects but believe it might dip in the short term. It offers **limited profit** (the premium collected), while the potential loss is large but capped at the strike price minus the premium, since the stock can at worst fall to zero.\n# Settlement Methods\nThis is an exchange-specific problem, but different exchanges tackle the method of futures / options contract settlement in different ways. The two ways of dealing with this implementation detail are **cash settlement** and **physical delivery**.\n## Cash Settlement\nCash settlement simplifies stock option contracts in India by eliminating the physical delivery of shares. \n\n\u003eA **cash settlement** is a settlement method used in certain futures and options contracts where, upon expiration or exercise, the seller of the financial instrument does not deliver the actual (physical) underlying asset but instead transfers the associated cash position. - [Cash Settlement - Investopedia](https://www.investopedia.com/terms/c/cashsettlement.asp)\n\nImagine you believe RIL's share price will fall and decide to go short on a call option contract for 100 shares. Traditionally, exercising this option would require you to purchase those 100 shares to deliver them on contract expiry. With cash settlement, you only deal with the price difference at expiry.\n\n- **Profit Scenario:** Let's say you entered the contract when RIL was trading at ₹2,500 per share, with the contract quoting a predetermined purchase price of ₹2,800 per share. If the share price plummets to ₹2,000 at expiry, **the seller wouldn't have to arrange funds for the unwanted shares**. Instead, the long position (who bet on the price going up) would simply credit you with the difference – (₹2,800 - ₹2,000) $\\times$ 100 shares = ₹80,000.\n- **Loss Scenario:** Conversely, if RIL's share price skyrockets to ₹3,500, you wouldn't have to buy 100 shares at ₹3,500 either. You can just pay the difference: (₹3,500 - ₹2,800) $\\times$ 100 shares = ₹70,000. \n\nCash settlement eliminates the hassle of physical share delivery, focusing solely on the price differential at expiry. This translates to a more efficient and safer market for stock exchanges, since the losing side of the trade doesn't need the funds to buy the entire underlying stock; it just pays the difference, which is usually far smaller. 
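\n\nAs a quick sanity check on the arithmetic, here's a sketch of the simplified description above (my own code, not exchange-accurate; `shortCallSettlement` is a made-up helper):\n\n```cpp\n#include \u003ciostream\u003e\n\n// Cash flow to the short side of the contract at expiry, per the simplified\n// description above (a negative value means the short side pays).\ndouble shortCallSettlement(double strike, double spotAtExpiry, int lotSize) {\n\treturn (strike - spotAtExpiry) * lotSize;\n}\n\nint main() {\n\tstd::cout \u003c\u003c shortCallSettlement(2800, 2000, 100) \u003c\u003c '\\n'; // 80000 (profit scenario)\n\tstd::cout \u003c\u003c shortCallSettlement(2800, 3500, 100) \u003c\u003c '\\n'; // -70000 (loss scenario)\n}\n```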
\n\n## Physical Delivery\nPhysical delivery is the traditional method where the underlying shares are actually exchanged upon expiry. While cash settlement simplifies things, physical delivery adds an extra layer of complexity: it requires managing the logistics of share delivery and the potential delivery costs.\n\nThe Indian National Stock Exchange has used physical delivery as the mode of settlement for stock F\u0026O contracts since the July 2018 expiry. \n\n\u003eAs stated in this [SEBI circular](https://www.sebi.gov.in/legal/circulars/apr-2018/review-of-framework-for-stocks-in-derivatives-segment_38629.html), starting from July 2018 expiry, F\u0026O positions are being moved from cash settlement mode to compulsory physical delivery settlement in a phased manner. Starting from October 2019 expiry, all stock F\u0026O contracts will be compulsorily physically settled. If you hold a position in any Stock F\u0026O contract, at expiry, you will be required to give/take delivery of stocks. \n\u003e\n\u003e- **In the money contracts (ITM)**\n\u003e\tAll ITM contracts which aren’t CTM [close to the money] will be mandatorily exercised by the exchange. This means that anyone holding an ITM option contract will receive/give delivery of stocks depending on whether one is holding call/put options. All the costs arising out of this delivery obligation will be applied to the client’s account.\n\u003e\n\u003e- **Out of the money contracts (OTM)**\n\u003e\tAll OTM options will expire worthless. There will be no delivery obligations arising out of this.\n\u003e\n\u003e**Spread and covered contracts**\n\u003e\tSpread contracts that result in both – take and give delivery obligation will be netted off for the client. For example, if you have a bull call spread on Reliance of the same expiry (a lot of long call options of strike 1300 and a lot of short call options of strike 1320) and the spot expires at 1330, this will result in a net-off and there won’t be any delivery obligation.\n\u003e\t\n\u003e- [Policy on settlement of compulsory delivery derivative contracts — Update Oct 2019 - Zerodha](https://zerodha.com/z-connect/general/policy-on-settlement-of-compulsory-delivery-derivative-contracts-update-oct-2019)\n\u003e\n\u003ePhysical delivery of stock options can potentially lead to systemic risk in the capital markets and pose a risk to traders. \n\u003e\n\u003e**The physical delivery risk**\n\u003e\tLike I mentioned earlier, if you hold stock futures or any in the money stock option at the close of expiry, you are assigned to give or take delivery of the entire contract value worth of stocks. Since the risk goes up with respect to the client not having enough cash to take delivery or stock to give delivery, the margins required to hold a futures or short option position go up as we get closer to expiry. Margins required are a minimum of 40% of the contract value for futures on the last day of expiry. For in the money long or buy option positions, a delivery margin is assigned from 4 days before expiry. The margins for in the money long options [go up from 10% to 50% of contract value](https://support.zerodha.com/category/trading-and-markets/margin-leverage-and-product-and-order-types/articles/policy-on-physical-settlement)—50% on the last two days of expiry. If the customer doesn’t have sufficient funds or stocks to give or take delivery, the broker squares off the contract. If the customer continues to hold after the higher margin is blocked, it shows an intent to give or take delivery. \n\u003e\t\n\u003e\tThe risk though comes from out of the money options that suddenly turn in the money on the last day of expiry. No additional margins are blocked for OTM options in the expiry week, and when such an option suddenly turns in the money, a customer with a small amount of premium and no margin can get assigned to give or take large delivery positions, causing significant risk to the trader and the brokerage firm.\n\u003e\t\n\u003e- [Physical delivery of stock F\u0026O \u0026 their risks - Zerodha](https://zerodha.com/z-connect/general/physical-delivery-of-stock-fo-their-risks)\n\n### A Case Study on the Risk Involved in Physical Delivery Settlement\n\u003eThis happened on Dec expiry, Thursday 30th Dec 2021. Shares of Hindalco closed at Rs 449.65 at expiry. This meant that the Hindalco 450 PE expired just in the money by 35 paise. This meant that everyone who had bought this 450 PE and held it at the expiry was required to deliver Hindalco stock—1075 shares for every 1 lot of Hindalco held. \n\u003e\n\u003eThis is what happened to Hindalco shares on 30th Dec:\n\u003e\n\u003e![pasted-image-20240312051304](/images/pasted-image-20240312051304.webp)\n\u003e\n\u003eThe stock was above Rs 450 for most of the expiry day and even a few days prior to it. Since it was out of the money, no additional physical margins would have been charged, and everyone holding this strike would have assumed that it would expire out of the money. In all likelihood, everyone who held this put option would have written off the trade as a loss and assumed that the maximum loss would be limited to the premium paid. \n\u003e\n\u003eSo at 3 pm, when the Hindalco stock price went below 450, this was how the market depth looked. 
Those who realized that this option would expire in the money were trying to exit, but with no buyers they were unable to do so even at Rs 0.05, when the intrinsic value of the strike was Rs 0.35.\n\u003e\n\u003eEveryone holding long puts would have been forcibly assigned to deliver Hindalco shares. 1 lot of Hindalco = 1075 shares = ~Rs 5 lakh contract value. Customers who had bought put options with a few thousand rupees were potentially required to deliver tens of lakhs worth of Hindalco stock. Failing to deliver would have meant short delivery. The [consequences of short delivery](https://support.zerodha.com/category/trading-and-markets/trading-faqs/general/articles/what-is-short-delivery-and-what-are-its-consequences) are losses in terms of auction penalty, apart from the market risk of Hindalco stock price going up from the close of expiry to the auction date. Hindalco stock was up 5% already on Friday, and the auction happens on T+3 days or on Tuesday, and assuming the stock price doesn’t go up further, that is still a whopping loss of Rs 25 (5% of Hindalco) for Rs 0.35 worth of premium at market close. \n\u003e\n\u003eIf this wasn’t puts but calls, there wouldn’t be a short delivery risk, but there would still be a market risk that the customer would be exposed to from the close of expiry to when the customer can sell the stock. But in case of buy delivery (buy futures, buy calls, short puts), the stock can be sold the next day itself and hence there is no marked-to-market risk of 3 days. The risk is exponentially more in the case of F\u0026O positions that can lead to short delivery (short futures, sell calls, buy puts). \n\u003e\n\u003eThe risk exists with futures, short options, and buy ITM options as well. But since there are sufficient margins that also go up closer to expiry, a customer who provides additional margin is willingly holding the position, or else the position is squared off. Because there are no additional physical delivery margins for OTM options and because most option buyers think that when they buy options the maximum they can lose is equal to the premium paid and take no action, the risks go up for the entire ecosystem.\n\u003e\n\u003eApart from the risk to the trader, this can be a systemic issue because if a customer account goes into debit, the liability falls on the broker. A large individual trader or group of customers of a broker could potentially go into a large enough debit to bankrupt the brokerage firm and, in turn, put the risk on other customers as well. Stocks can move drastically on expiry day, and out of the money option contracts can suddenly move just in the money with no liquidity to exit, making it impossible for brokerage risk management teams to do anything. All option contracts are settled based on the last 30 min average price of the underlying stock and not the last traded price, making this even trickier: we don't know whether a CTM option strike will actually close in the money until after market close. And like I explained earlier, the risk is not just in terms of the auction and short delivery, but also marked-to-market risk for 3 days.\n\u003e\n\u003eForcing traders to give or take large delivery positions can potentially be misused by large traders or operators wanting to manipulate the price movement of stocks.\n\u003e- [Physical delivery of stock F\u0026O \u0026 their risks](https://zerodha.com/z-connect/general/physical-delivery-of-stock-fo-their-risks)\n\n# References\n1. 
[The Trillion Dollar Equation - Veritasium](https://www.youtube.com/@veritasium)\n2. [What is Zerodha's policy on the physical settlement of equity derivatives on expiry? - Zerodha](https://support.zerodha.com/category/trading-and-markets/margins/margin-leverage-and-product-and-order-types/articles/policy-on-physical-settlement)\n3. [Cash Settlement - Investopedia](https://www.investopedia.com/terms/c/cashsettlement.asp)\n4. [Physical Delivery - Investopedia](https://www.investopedia.com/terms/p/physicaldelivery.asp)\n5. [Policy on settlement of compulsory delivery derivative contracts — Update Oct 2019 - Zerodha](https://zerodha.com/z-connect/general/policy-on-settlement-of-compulsory-delivery-derivative-contracts-update-oct-2019)\n6. [Physical delivery of stock F\u0026O \u0026 their risks - Zerodha](https://zerodha.com/z-connect/general/physical-delivery-of-stock-fo-their-risks)\n",
+    "lastmodified": "2026-01-11T10:59:51.592928664+05:30",
    "tags": [
      "domain-finance-economics",
      "domain-finance-quantitative"
    ]
  },
@@ -161,15 +161,15 @@
  "/blog/dna-sequencing": {
    "title": "DNA Sequencing",
    "content": "# Preface \u0026 References\nI document topics I've discovered and my exploration of these topics while following the course [Algorithms for DNA Sequencing, by Johns Hopkins University](https://www.coursera.org/learn/dna-sequencing) on [Coursera](https://www.coursera.org/). The course is taught by two instructors, [Ben Langmead](https://scholar.google.com/citations?user=2JMaTKsAAAAJ\u0026hl=en) and [Jacob Pritt](https://www.coursera.org/instructor/jacobpritt).\n\nWe will study the fundamental ideas, techniques, and data structures needed to analyze DNA sequencing data. In order to put these and related concepts into practice, we will combine what we learn with our programming expertise. Real genome sequences and real sequencing data will be used in our study. We will use Boyer-Moore to enhance naïve exact matching. We then learn about indexing and preprocessing, grouping and ordering in indexing, k-mers and k-mer indices, and how to solve the approximate matching problem. Finally, we will discuss solving the alignment problem and explore interesting topics such as De Bruijn graphs, Eulerian walks and the shortest common superstring problem. \n\nAlong the way, I document content I've read about while exploring related topics such as suffix string structures and relations to my research work on the STAR aligner.\n\n# DNA Sequencing\nDNA sequencing is a powerful tool used by scientists to study topics such as rare genetic diseases in children, tumors, microbes that live in us, etc., all of which have profound implications on our lives. Sequencing is used pretty much everywhere in life sciences and medicine today. The technology used for sequencing has come down in cost, and that has caused a big 'boom' in the development of this field, similar to how falling transistor prices kick-started the computing industry. \n\nAlgorithms play a key role in this field. Take, for example, the effort to sequence the human genome back in the late 90s. There were two popular schools of thought: one believed that the algorithmic crux of sequencing the human genome (called de novo assembly) was computationally infeasible in practice, while the other believed that with enough compute it was indeed possible. Finally, it was the second set of people who succeeded, by tackling the computational challenge head-on, which allowed them to progress much more quickly. 
It is important for us to know what's possible and what's practical to actually compute. Further, knowing what work has already been done is the first step to figuring out where the next contributions should be, and how. \n\n## DNA sequencing: Past and present\n\nFirst generation DNA sequencing was a method invented by Fred Sanger and was also known as \"chain termination\" sequencing. It was quite labour intensive, but it improved over the years and many tasks were automated. The HGP (Human Genome Project) used 100s of first generation DNA sequencers to sequence the human genome. However, what we're more interested in is what happened to the cost-per-genome ratio right after the end of the Human Genome Project towards the beginning of the 2000s. \n\n![cpg-1](/images/cpg-1.webp)\nSource: [Sequencing Human Genome Cost - NIH](https://genome.gov/sequencingcosts)\n\nAs we can see, something important happened around the year 2007. This is the year when a new kind of sequencing technology started to be used in life science labs around the world. This technology was called 'next' generation sequencing or 'second' generation sequencing. But the name that probably describes it best is 'massively-parallel' sequencing. Add to this improvements in technology, speed, etc., and there have been massive technological and algorithmic improvements in this field since then. \n\n## How DNA Gets Copied\n\n### DNA as Strings\n\nWe are pretty familiar with the double-helix structure of DNA. If we unravel this helix and just pick one of the two 'rails', this strand is simply a sequence over the four bases A, C, G, and T, one per nucleotide position. This means that we can represent DNA sequences in the form of a long string containing just the characters 'A', 'C', 'G' and 'T'. \n\n![dna-as-a-string](/images/dna-as-a-string.webp)\n\nA further implication is that any read of the DNA sequence simply translates to a substring of the original DNA string. This allows us to bring the massive literature and work in the field of string algorithms to bear on DNA sequencing. \n\n### The copying process\n\nDNA exists in base pairs A-T and C-G. Your genome is present in almost every cell in your body, therefore when one of these cells splits, it must transmit a copy of your genome to each of the progeny cells. To be copied, the double-stranded DNA is first divided into two single-stranded molecules, as though we split the ladder straight down the middle. We now have two distinct strands as a result of the separation of the complementary base pairs. The genome sequence is still recorded on each strand, and the two strands are complementary to one another despite their separation. Each strand acts as a sort of template and provides the instructions necessary for re-creating the original DNA sequence. The enzyme that puts the complementary bases in their place is called DNA polymerase. Given one of these single-stranded templates and a base (which might be floating around in the solution just waiting to be incorporated), DNA polymerase is a tiny biological device that can synthesize DNA. 
With these two elements, the polymerase will piecemeal construct the complementary strand to produce a double-stranded replica of the template strand.\n\n## Massively parallel DNA sequencers\n\nReads refer to random sub-strings picked from a DNA sequence. One human chromosome is on the order of 100 million bases long. Massively parallel sequencers produce reads that are only around 100-150 bases long, but produce a huge number of them. A sequencer 'eavesdrops' on the DNA copying process to sequence many templates simultaneously. This is how the process works in a nutshell. \n\n1. Convert input DNA into short single-stranded templates.\n\n2. Deposit on a slide (scattering the strands randomly across the surface)\n\n3. Add DNA polymerase to this slide\n\n4. Add bases (raw material) to this slide, which are 'terminated' by a special chemical piece which doesn't allow the polymerase to construct anything **on top** of the base it adds to the template.\n\n5. Take a 'top-down' snapshot of the entire slide. (Terminators are engineered to glow a certain color which allows easy identification of the base)\n\n6. Remove the terminators\n\n7. Repeat until all the templates are built fully\n\nThe following is a visual depiction of the same.\n\n![mpds-1](/images/mpds-1.webp)\n![mpds-2](/images/mpds-2.webp)\n![mpds-3](/images/mpds-3.webp)\n![mpds-4](/images/mpds-4.webp)\n\n## Sequencing Errors and Base Quality\n\nThe process described above is largely accurate, but a minor detail we glossed over is that before the sequencing begins, we amplify each template strand into a cluster of many copies. This allows the camera to more easily spot the glowing color of each cluster, as just one strand is not enough to accurately distinguish the color. However, there is a hidden problem here. Say during one of the build cycles one of the bases in the solution is unterminated. This would cause the polymerase to go ahead and place the next base as well on top of what should've been this cycle's base. Now, because this is a cluster, the majority color would still likely dominate. However, notice that once a base is out of cycle, it will always remain out of cycle. This means that with more and more cycles, the rate of error gets higher and higher. \n\n![mpds-e](/images/mpds-e.webp)\n\nTo counter this, we developed software called the 'base caller' which analyzes the images and attaches a confidence score to the base it calls for each cluster in each cycle. The value reported is called the 'base quality.'\n\n$$\\text{Base Quality } (Q) = -10 \\cdot \\log_{10} p$$\n\n$p$ is the probability that the base call is incorrect. This scale provides an easier interpretation of the probability value. For example, $Q = 10 \\to 1$ in $10$ chance that the call is incorrect, $Q = 20 \\to 1$ in $100$, and so on. The probability computation probably involves an ML model nowadays, but a reasonable measure one would imagine is simply computing \n\n$$p(\\text{not orange}) = \\frac{\\text{non-orange light}}{\\text{total light}}$$\n\nif it predicts orange as the base.
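\n\nTo make the scale concrete, here's a tiny sketch (mine, not from the course) converting between the error probability and the reported quality value:\n\n```cpp\n#include \u003ccstdio\u003e\n#include \u003ccmath\u003e\n\n// Phred-style base quality: Q = -10 * log10(p), where p is the\n// probability that the base call is incorrect.\ndouble baseQuality(double p) { return -10.0 * std::log10(p); }\ndouble errorProbability(double q) { return std::pow(10.0, -q / 10.0); }\n\nint main() {\n\tstd::printf(\"p = 0.1  -\u003e Q = %.0f\\n\", baseQuality(0.1));    // 10\n\tstd::printf(\"p = 0.01 -\u003e Q = %.0f\\n\", baseQuality(0.01));   // 20\n\tstd::printf(\"Q = 30   -\u003e p = %g\\n\", errorProbability(30));  // 0.001\n}\n```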
\n\n## Re-Constructing Genome Using the Sequencing Reads\n\nOnce we have the billions of tiny sequenced reads, they are analogous to a lot of tiny paper cutouts of a newspaper. Good for a collage, but not useful for reading the news. To make sense of these reads, we need to be able to stitch them back into one complete picture (back to the genome). To do this, we consider the following two cases.\n\n- When there already exists a genome reconstruction of the same species\n\n- When there exists no such reconstruction. (We are sequencing a new exotic species.)\n\n### Case I\n\nWe rely on the fact that the genomes of two different individuals of the same species are $\\gt 99 \\%$ similar. That is, if we already have a sequenced genome from the same species, we can be confident that the new genome will be **extremely** similar to the already reconstructed genome. If we imagine genome reconstruction as putting together a jigsaw puzzle, the already existing genome reconstruction is something like a photograph of the completed puzzle. We can then rely on this existing construction to guide us in putting together the jigsaw. \n\nIn our context, what we can do is match these short reads to the original sequence and see which places in the original sequence are very good matches for our read. We then use these markings as a guide to where the short read actually fits in the puzzle of the complete genome reconstruction. However, as we see in one of the practice labs, simply doing *exact string matching* is not sufficient. In practice, we find that trying to find exact matches of the short read in the original sequence yields **very few** matches. In the context of reconstructing our puzzle, this means we have very few clues to go off of for reconstruction. This happens for two main reasons:\n\n1. The DNA sequencing process can have errors as mentioned above. Perfect reads are not very likely.\n\n2. The 'snapshot' we are following will not be an exact match and will have some (albeit few) differences.\n\nBut the primary reason exact matching fails is the error(s) inherent in the DNA sequencing process. This gives us a fair idea why exact string matching won't be sufficient for solving our problem. We later explore approximate matching and alignment problems, which are primarily what we use to tackle this issue. ([Algorithms for Approximate String Matching - Alignment](/blog/algorithms-for-approximate-string-matching-alignment), [Boyer-Moore \u0026 Knuth-Morris-Pratt for Exact Matching](/blog/booyer-moore-knuth-morris-pratt-for-exact-matching)). \n\n### Case II\n\nIn the case where there exists no already existing snapshot to follow, we will have to tackle the same problem faced by the people working on the original Human Genome Project (HGP). We rely on techniques of de novo assembly to reconstruct the genome string. We will discuss this in more detail towards the end of the course. \n\n",
    "lastmodified": "2026-01-11T05:05:49.603590673+05:30",
+    "lastmodified": "2026-01-11T10:59:03.738005064+05:30",
    "tags": [
      "domain-science-bioinformatics"
    ]
  },
  "/blog/dp-as-dags-shortest-path-on-dags-lis-in-o-nlogn": {
    "title": "DP as DAGs, Shortest Path on DAGs \u0026 LIS in O(nlogn)",
    "content": "Over the past few notes, we learned about developing efficient strategies for solving computational problems using the greedy idea ([Set Cover \u0026 Approximation Algorithms](/blog/set-cover-approximation-algorithms), [More Greedy Algorithms! Kruskal's \u0026 Disjoint Set Union](/blog/more-greedy-algorithms-kruskal-s-disjoint-set-union), [Activity Selection \u0026 Huffman Encoding](/blog/activity-selection-huffman-encoding)). The greedy idea focuses on choosing the optimal solution at a local stage and reducing what's left to a subproblem with the same structure. 
This is great when problems have a locally optimal solution and optimal substructure properties. But what do we do when this is not the case? What do we do when greedy does not work?\n# Dynamic Programming\n\nDynamic programming is a technique used to efficiently solve problems that satisfy the **optimal substructure** criterion. If we are able to reduce the given problem to smaller subproblems with the same structure, then we can employ a technique similar to divide and conquer. The idea here is that we can model a problem as a **transition** of sorts from the solution to its subproblem. If this is true, then it is possible that we might have a **LOT** of overlapping subproblems. Notice that instead of repeatedly recomputing the solutions to these subproblems, we can store them in memory somewhere and simply look up the solution for a specific subproblem in $O(1)$ instead of recomputing it.\n\n## Visualizing DP as DAGs\n\nA **very** interesting way of visualizing DP was discussed. DP is usually presented as some form of DP table, transition states, and magic for-loops that compute the answer to some problem. I often find this **extremely** unintuitive and difficult to follow along with. DP by nature is nothing but the idea of recursively solving a problem by splitting it into smaller problems and applying memoization whenever possible to overlapping subproblems.\n\nA very cool way to visualize this is by modeling the recursion tree for solving a problem in terms of DAGs (directed acyclic graphs).\n\nWe mentioned that DP relies on a problem having a recursive solution. That is, it must be possible to model it as a transition from the solution to its subproblems.\n\nNote that if we attempted to visualize this recursive method of solving as a graph, with some solutions dependent on the solution of its subproblems, we can **never** have a cycle. The presence of a cycle would imply that a problem depends on its subproblem and the subproblem depends on its parent. Computing this would lead to an infinite cycle.\n\n![dp-1](/images/dp-1.webp)\n\nSay we wish to compute $a_1$. For the problem structure depicted on the left, it is **impossible** to compute it recursively as we would be in an infinite cycle. The problem on the right however can be solved by independently computing the solution for $a_2, a_3$ and then computing $a_1$.\n\nThis also means that we can think of every recursive problem in some kind of DAG-like structure.\n\n### Visualizing Fibonacci\n\nConsider the famous Fibonacci problem. We can recursively state $F_n = F_{n-1}+F_{n-2}, \\quad n \\gt 1$\n\nLet's try to visualize the recursion tree for $F_4$ (which is also a DAG)\n\n![dp-2](/images/dp-2.webp)\n\nNotice that we are computing $F_2$ multiple times. (Assume $F_0$ and $F_1$ are known constants).\n\nWe can eliminate this overlap by computing it **just** once. This allows us to model the DAG as follows,\n\n![dp-3](/images/dp-3.webp)\n\nBy using the once computed $F_2$ to compute $F_3$, notice that we managed to eliminate an entire subtree of recursion. **This** is the core idea behind DP. By saving the states of previous computations, we are effectively able to eliminate recomputation for **all** overlapping subproblems, thus considerably reducing the complexity of our solution.\n\nNote that DP is essentially a brute force. It can recursively try a greedy/brute force over all possible solutions for a smaller subproblem, then use this to again use the same strategy and solve a bigger problem. 
DP allows us to apply brute force to the problem by reducing it into smaller subproblems which we can attempt to solve using brute force / other techniques.\n\n## The shortest path on a DAG\n\nConsider the problem of the shortest path on a DAG. The problem simply asks, _\"Given a DAG with V vertices and E weighted edges, compute the shortest path from a source vertex $s$ to every other vertex on the graph.\"_\n\nOn general graphs without negative edge weights, Dijkstra's algorithm can compute the solution in $O(V + E \\log V)$ time. But given that our graph is directed, and has **no cycles**, _can we do better?_\n\nIn fact, yes we can. A very simple solution exists to this problem which is capable of computing the answer in just $O(V+E)$ time.\n\n### Toposort\n\nNotice that for **every** DAG, there exists at least one valid topological sort of its vertices. This is trivially inferred from the fact that by definition, a DAG does not contain any cycles. This implies that there must be at least one arrangement where we can list vertices in a topological ordering.\n\nA topological ordering essentially guarantees that when we reach vertex $v_i$ in the ordering, there is **no** path from $v_i$ to **ANY** vertex $v_j$ where $j \\lt i$. Further, there is **no** path from any vertex $v_k$ to $v_i$ where $k \\gt i$.\n\nThis means that the shortest path to $v_i$ will be a result of some transition from the shortest paths to all vertices $v_j$ such that $\\exists \\ (v_j, v_i) \\in E$. And since $j \\lt i$ must be true, we can simply process the vertices in **topological** order.\n\n### The algorithm\n\n$$ \\text{Toposort V in O(V+E)} \\\\ \\text{Initialize } dist[s] = 0 \\text{ and all other dist[ \\ ] values to } \\infty \\\\ \\text{for each } v \\in V-\\{s\\} \\text{ in topological order:} \\\\ dist(v) = \\min_{(u, v)\\in E}\\{dist(u)+d(u,v)\\} $$\n\nNotice that the _recursive_ step in this algorithm is that to compute $dist(v)$ we require the value of $dist(u)$. Now, $dist(u)$ can always be computed recursively, but notice that **because** we're going in topological order, it **MUST** be true that any such $u$ where $\\exists (u,v)\\in E$ **must** have already been processed. This implies that we must have already computed the value of $dist(u)$.\n\nSo instead of recursively recomputing it, we can just store the value of $dist(u)$ and access the computed value in $O(1)$.\n\nAnd that's it. We've managed to use Dynamic Programming to solve our problem in $O(V+E)$.\n\n## Longest path in a DAG?\n\nWhat about the problem of finding the longest path from some vertex $s$ to every other vertex on a DAG? How can we efficiently compute this? Unlike with shortest path problems, computing the longest path in a general graph is **NP-Complete**. That is, no polynomial-time algorithm is known that is capable of computing the solution (and none exists unless P = NP).\n\nWhy? A very common way to understand the longest path problem is as follows.\n\n\u003eThe longest path between two given vertices s and t in a weighted graph G is the same thing as the shortest path in a graph −G derived from G by changing every weight to its negation. Therefore, if shortest paths can be found in −G, then longest paths can also be found in G.\n\nThat is, by simply negating the weights, the longest path problem can be reduced to the shortest path problem. So... what's the issue? Why is one NP-Complete? Note that Dijkstra's algorithm for finding shortest paths relies on the fact that all edge weights are positive. This is to ensure that there exist no negative cycles in the graph. 
If a negative cycle exists, the shortest path is simply $-\\infty$. By negating the weights on our graph $G$, we might end up with a negative weight cycle.\n\n**However**, note that this does not affect DAGs. In DAGs, the longest path problem is the **same** as the shortest path problem, just with negative edge weights. Or another way to think of it is as the exact same recursion, but instead of defining $dist(v)$ as the minimum of $dist(u) + d(u,v)$ we simply define it as the maximum of the recursion. This simple change effectively changes the algorithm to the longest path solution.\n\n```cpp\n// Pseudocode\nfor t in toposort:\n    for each edge (t, node):\n        dp[node] = min(dp[t] + distance(t, node), dp[node]); // max for longest path\n```\n\n# The LIS problem\n\nThe LIS (Longest increasing subsequence) problem asks the following question, _\"Given an ordered list of n integers, what is the length of the longest increasing subsequence that belongs to this list?\"_\n\nLet's take an example.\n\nLet the list $arr$ be $[10, 22, 9, 33, 21, 50, 41, 60, 80]$. One possible solution to this list is as given below.\n\n![dp-4](/images/dp-4.webp)\n\nSo how do we solve this problem?\n\nBefore we attempt to solve this problem, let us take a short detour to learn about the idea of _reductions_ in the field of computational complexity theory. Linked here: [Complexity Theory - Reductions](/blog/complexity-theory-reductions).\n## Back to finding the LIS\n\nConsider the following idea: let's transform the given array $arr = [10, 22, 9, 33, 21, 50, 41, 60, 80]$ to a DAG by applying the following rules.\n\nThere exists a directed edge from the element at position $i$ to another element at position $j$ **if and only if**\n\n1. $i \\lt j$, and\n2. $arr[i] \\lt arr[j]$.\n\nLet's consider the implications of such a construction. What does it **mean** to find the LIS of some given array? Especially after this transformation.\n\n![dp-5](/images/dp-5.webp)\n\nNotice that there is **no difference** between the longest path on a DAG problem and finding the LIS of an array after we have performed this transformation to the array. In such a DAG, every \"path\" is a sequence of increasing numbers. We wish to find the longest such sequence. This, in turn, translates to simply finding the longest such path on the graph.\n\nOur graph enumerates all such increasing subsequences. The longest path is, therefore, also the longest increasing subsequence.\n\nWe have hence **successfully** found a reduction to the problem. We have shown that by applying the transformation of the array to a DAG constructed by following the above two rules, we have managed to reduce the problem of finding the LIS of an array to the problem of finding the longest path on a DAG.\n\nSadly, our reduction is not as efficient as the solution to $g$ itself. Notice that constructing the graph is of order $O(V^2)$. Let us define the construction of our graph (the reduction) as a function $R(x)$ which takes in input $x$ for problem $f$ (LIS) and converts it to input for problem $g$ (longest path on a DAG).\n\nOur overall complexity will be $O(R(x)) + O(g(x))$. Since the reduction step is $O(V^2)$, our final solution will be $O(n^2)$. 
We may have up to $n^2$ edges.\n\nHence we have a solution to the LIS problem which computes the answer in $O(n^2)$.\n\nSimply transform it to the increasing subsequence DAG and compute its longest path.\n\nHowever, the natural question to ask again is, _can we do better?_\n\n## Computing LIS in $O(nlogn)$\n\nThe reduction to convert the LIS problem to the longest path on DAGs was great and gave us an $O(n^2)$ solution. But how can we do better? Is there any redundancy in our computation? Is there some extra information unique to this problem that we haven't exploited yet?\n\nTurns out, there is.\n\nLet's define our DP state as follows.\n\n$\\text{Let } dp[i] \\text{ be the smallest element at which an increasing subsequence of length } i \\text{ terminates.}$\n\nIf we can compute $dp[i]$ for all $i$ from $1$ to $n$, the largest $i$ for which $dp[i]$ contains a valid value will be our answer. How do we compute this? Consider the following naïve algorithm; here $a$ is our input array and $dp$ is our dp table.\n\n```cpp\ndp[0 to n] = ∞\ndp[0] = -INF\nfor i from 0 to n-1\n    for j from 1 to n\n        if (dp[j-1] \u003c a[i] and a[i] \u003c dp[j])\n            dp[j] = a[i]\n```\n\nWhy is the above algorithm correct? Notice that the outer loop is essentially trying to decide where to include the value $a[i]$. Further, notice that when we are iterating over $i$, the inner loop will never assign a value to any $dp[j]$ where $j \\gt i+1$.\n\nIntuitively this makes sense because at this point in time we are only considering the first $[0, i]$ segment/subarray. Such a subarray only has $i+1$ elements and can hence not be part of any $dp[j]$ where $j \\gt i+1$. If we look at what the algorithm is doing, $a[i]$ can **only** replace $dp[0 \\ to \\ i+1]$. Notice that after $i+1$, $dp[j] \\geq a[i]$. This means the replacement can never happen.\n\nNotice that according to our algorithm, the condition $dp[j-1] \\lt a[i] \\text{ and } a[i] \\lt dp[j]$ implies that the LIS of length $j-1$ must terminate at an element less than $a[i]$, and $a[i]$ must be less than whatever the current computed smallest element is which terminates a LIS of length $j.$ The first part of the condition makes sure the LIS is increasing and the second part makes sure it is the smallest such element that fits the condition.\n\n**Key observation:** Note that we will update at most **one** value and the DP array will always be **sorted.**\n\nWhy? Note that $dp[i]$ is the **smallest** element at which an increasing subsequence of length $i$ terminates. The keyword here is **smallest**.\n\nThis implies that, if in the future, $dp[i]$ is replaced by some $a[j]$, then $a[j]$ is the smallest element which terminates an increasing sequence of length $i$. What is the implication of this sentence?\n\nIf $a[j]$ is the **smallest** element that terminates an increasing sequence of length $i$, then it **can never** be the **smallest** element in the array that terminates an increasing sequence of any length $\\gt i$. The fact that it is used at position $i$ means that **any** such terminating value for **any position** $\\gt i$ **must** be $\\gt a[i]$.\n\nIf this is understood, then we have inferred that the array is both sorted and we require at most one replacement in each iteration of the outer loop. We have managed to transform the inner loop's job into a simpler problem. 
The inner loop is actually trying to solve the following question, _\"Given a sorted array, what is the first number that is strictly greater than $a[i]$?\"_\n\nNote that the above question can be trivially solved using binary search. This means that our inner loop can be replaced with a simple binary search to achieve $O(nlogn)$ overall time complexity.\n\n### Code\n```cpp\n// Sample implementation; INF is a large sentinel (e.g. 1e9)\nint lis(int arr[], int n) {\n    vector\u003cint\u003e dp(n + 1, INF); // dp[i] = smallest element ending an increasing subsequence of length i\n    dp[0] = -INF;\n\n    for (int i = 0; i \u003c n; i++) {\n        // Index of the first element strictly greater than arr[i], found in O(log n)\n        int j = upper_bound(dp.begin(), dp.end(), arr[i]) - dp.begin();\n        if (dp[j-1] \u003c arr[i] \u0026\u0026 arr[i] \u003c dp[j])\n            dp[j] = arr[i];\n    }\n\n    for (int i = n; i \u003e= 0; i--)\n        if (dp[i] \u003c INF) return i;\n    return 0;\n}\n```\n# References\nThese notes are old and I did not rigorously hoard references back then. If some part of this content is yours or you know where it's from, then do reach out to me and I'll update it. \n1. Professor [Kannan Srinathan's](https://www.iiit.ac.in/people/faculty/srinathan/) course on Algorithm Analysis \u0026 Design in IIIT-H\n2. [Huffman Codes: An Information Theory Perspective - Reducible](https://youtu.be/B3y0RsVCyrw?si=5kFqRPa_XsGxpqBr)\n\n---\n\n## Related Reading\n\n- [Levenshtein Edit Distance](/blog/levenshtein-edit-distance)\n- [Chain Matrix Multiplication](/blog/chain-matrix-multiplication)\n- [A Deep Dive into the Knapsack Problem](/blog/a-deep-dive-into-the-knapsack-problem)\n",
    "lastmodified": "2026-01-11T05:05:50.122774001+05:30",
+    "content": "Over the past few notes, we learned about developing efficient strategies for solving computational problems using the greedy idea ([Set Cover \u0026 Approximation Algorithms](/blog/set-cover-approximation-algorithms), [More Greedy Algorithms! Kruskal's \u0026 Disjoint Set Union](/blog/more-greedy-algorithms-kruskal-s-disjoint-set-union), [Activity Selection \u0026 Huffman Encoding](/blog/activity-selection-huffman-encoding)). The greedy idea focuses on choosing the optimal solution at a local stage and reducing what's left to a subproblem with the same structure. This is great when problems have a locally optimal solution and optimal substructure properties. But what do we do when this is not the case? What do we do when greedy does not work?\n# Dynamic Programming\n\nDynamic programming is a technique used to efficiently solve problems that satisfy the **optimal substructure** criterion. If we are able to reduce the given problem to smaller subproblems with the same structure, then we can employ a technique similar to divide and conquer. The idea here is that we can model a problem as a **transition** of sorts from the solution to its subproblem. If this is true, then it is possible that we might have a **LOT** of overlapping subproblems. Notice that instead of repeatedly recomputing the solutions to these subproblems, we can store them in memory somewhere and simply look up the solution for a specific subproblem in $O(1)$ instead of recomputing it.\n\n## Visualizing DP as DAGs\n\nA **very** interesting way of visualizing DP was discussed. DP is usually presented as some form of DP table, transition states, and magic for-loops that compute the answer to some problem. I often find this **extremely** unintuitive and difficult to follow along with. 
DP by nature is nothing but the idea of recursively solving a problem by splitting it into smaller problems and applying memoization whenever possible to overlapping subproblems.\n\nA very cool way to visualize this is by modeling the recursion tree for solving a problem in terms of DAGs (directed acyclic graphs).\n\nWe mentioned that DP relies on a problem having a recursive solution. That is, it must be possible to model it as a transition from the solution to its subproblems.\n\nNote that if we attempted to visualize this recursive method of solving as a graph, with some solutions dependent on the solution of its subproblems, we can **never** have a cycle. The presence of a cycle would imply that a problem depends on its subproblem and the subproblem depends on its parent. Computing this would lead to an infinite cycle.\n\n![dp-1](/images/dp-1.webp)\n\nSay we wish to compute $a_1$. For the problem structure depicted on the left, it is **impossible** to compute it recursively as we would be in an infinite cycle. The problem on the right however can be solved by independently computing the solution for $a_2, a_3$ and then computing $a_1$.\n\nThis also means that we can think of every recursive problem in some kind of DAG-like structure.\n\n### Visualizing Fibonacci\n\nConsider the famous Fibonacci problem. We can recursively state $F_n = F_{n-1}+F_{n-2}, \\quad n \\gt 1$\n\nLet's try to visualize the recursion tree for $F_4$ (which is also a DAG)\n\n![dp-2](/images/dp-2.webp)\n\nNotice that we are computing $F_2$ multiple times. (Assume $F_0$ and $F_1$ are known constants).\n\nWe can eliminate this overlap by computing it **just** once. This allows us to model the DAG as follows,\n\n![dp-3](/images/dp-3.webp)\n\nBy using the once computed $F_2$ to compute $F_3$, notice that we managed to eliminate an entire subtree of recursion. **This** is the core idea behind DP. By saving the states of previous computations, we are effectively able to eliminate recomputation for **all** overlapping subproblems, thus considerably reducing the complexity of our solution.\n\nNote that DP is essentially a brute force. It can recursively try a greedy/brute force over all possible solutions for a smaller subproblem, then use this to again use the same strategy and solve a bigger problem. DP allows us to apply brute force to the problem by reducing it into smaller subproblems which we can attempt to solve using brute force / other techniques.
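\n\nTo ground this, here's a minimal memoized Fibonacci in C++ (my sketch, not from the original notes); the memo table is exactly the idea of saving the states of previous computations:\n\n```cpp\n#include \u003ccstdio\u003e\n#include \u003cvector\u003e\n\n// Plain recursion recomputes overlapping subproblems; the memo table\n// collapses the recursion tree into a DAG where each F(i) is computed once.\nlong long fib(int n, std::vector\u003clong long\u003e \u0026memo) {\n\tif (n \u003c= 1) return n;\n\tif (memo[n] != -1) return memo[n]; // already computed: O(1) lookup\n\treturn memo[n] = fib(n - 1, memo) + fib(n - 2, memo);\n}\n\nint main() {\n\tint n = 50;\n\tstd::vector\u003clong long\u003e memo(n + 1, -1);\n\tstd::printf(\"%lld\\n\", fib(n, memo)); // 12586269025\n}\n```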
\n\n## The shortest path on a DAG\n\nConsider the problem of the shortest path on a DAG. The problem simply asks, _\"Given a DAG with V vertices and E weighted edges, compute the shortest path from a source vertex $s$ to every other vertex on the graph.\"_\n\nOn general graphs without negative edge weights, Dijkstra's algorithm can compute the solution in $O(V + E \\log V)$ time. But given that our graph is directed, and has **no cycles**, _can we do better?_\n\nIn fact, yes we can. A very simple solution exists to this problem which is capable of computing the answer in just $O(V+E)$ time.\n\n### Toposort\n\nNotice that for **every** DAG, there exists at least one valid topological sort of its vertices. This is trivially inferred from the fact that by definition, a DAG does not contain any cycles. This implies that there must be at least one arrangement where we can list vertices in a topological ordering.\n\nA topological ordering essentially guarantees that when we reach vertex $v_i$ in the ordering, there is **no** path from $v_i$ to **ANY** vertex $v_j$ where $j \\lt i$. Further, there is **no** path from any vertex $v_k$ to $v_i$ where $k \\gt i$.\n\nThis means that the shortest path to $v_i$ will be a result of some transition from the shortest paths to all vertices $v_j$ such that $\\exists \\ (v_j, v_i) \\in E$. And since $j \\lt i$ must be true, we can simply process the vertices in **topological** order.\n\n### The algorithm\n\n$$ \\text{Toposort V in O(V+E)} \\\\ \\text{Initialize } dist[s] = 0 \\text{ and all other dist[ \\ ] values to } \\infty \\\\ \\text{for each } v \\in V-\\{s\\} \\text{ in topological order:} \\\\ dist(v) = \\min_{(u, v)\\in E}\\{dist(u)+d(u,v)\\} $$\n\nNotice that the _recursive_ step in this algorithm is that to compute $dist(v)$ we require the value of $dist(u)$. Now, $dist(u)$ can always be computed recursively, but notice that **because** we're going in topological order, it **MUST** be true that any such $u$ where $\\exists (u,v)\\in E$ **must** have already been processed. This implies that we must have already computed the value of $dist(u)$.\n\nSo instead of recursively recomputing it, we can just store the value of $dist(u)$ and access the computed value in $O(1)$.\n\nAnd that's it. We've managed to use Dynamic Programming to solve our problem in $O(V+E)$.\n\n## Longest path in a DAG?\n\nWhat about the problem of finding the longest path from some vertex $s$ to every other vertex on a DAG? How can we efficiently compute this? Unlike with shortest path problems, computing the longest path in a general graph is **NP-Complete**. That is, no polynomial-time algorithm is known that is capable of computing the solution (and none exists unless P = NP).\n\nWhy? A very common way to understand the longest path problem is as follows.\n\n\u003eThe longest path between two given vertices s and t in a weighted graph G is the same thing as the shortest path in a graph −G derived from G by changing every weight to its negation. Therefore, if shortest paths can be found in −G, then longest paths can also be found in G.\n\nThat is, by simply negating the weights, the longest path problem can be reduced to the shortest path problem. So... what's the issue? Why is one NP-Complete? Note that Dijkstra's algorithm for finding shortest paths relies on the fact that all edge weights are positive. This is to ensure that there exist no negative cycles in the graph. If a negative cycle exists, the shortest path is simply $-\\infty$. By negating the weights on our graph $G$, we might end up with a negative weight cycle.\n\n**However**, note that this does not affect DAGs. In DAGs, the longest path problem is the **same** as the shortest path problem, just with negative edge weights. Or another way to think of it is as the exact same recursion, but instead of defining $dist(v)$ as the minimum of $dist(u) + d(u,v)$ we simply define it as the maximum of the recursion. This simple change effectively changes the algorithm to the longest path solution.\n\n```cpp\n// Pseudocode\nfor t in toposort:\n    for each edge (t, node):\n        dp[node] = min(dp[t] + distance(t, node), dp[node]); // max for longest path\n```
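\n\nFor completeness, here's a runnable sketch of the whole pipeline (my own code, not from the original notes): a DFS toposort followed by relaxing edges in topological order. Swapping `min` for `max` (and using $-\\infty$ as the initial value) gives the longest-path variant.\n\n```cpp\n#include \u003cbits/stdc++.h\u003e\nusing namespace std;\n\nconst long long INF = 1e18;\n\n// Post-order DFS: a vertex is appended only after everything reachable from it.\nvoid dfs(int u, const vector\u003cvector\u003cpair\u003cint,int\u003e\u003e\u003e \u0026adj, vector\u003cbool\u003e \u0026seen, vector\u003cint\u003e \u0026order) {\n\tseen[u] = true;\n\tfor (auto \u0026[v, w] : adj[u]) if (!seen[v]) dfs(v, adj, seen, order);\n\torder.push_back(u);\n}\n\n// Shortest distances from src to every vertex of a DAG in O(V + E).\nvector\u003clong long\u003e dagShortestPaths(const vector\u003cvector\u003cpair\u003cint,int\u003e\u003e\u003e \u0026adj, int src) {\n\tint n = adj.size();\n\tvector\u003cbool\u003e seen(n, false);\n\tvector\u003cint\u003e order;\n\tfor (int u = 0; u \u003c n; u++) if (!seen[u]) dfs(u, adj, seen, order);\n\treverse(order.begin(), order.end()); // topological order\n\n\tvector\u003clong long\u003e dist(n, INF);\n\tdist[src] = 0;\n\tfor (int u : order)\n\t\tif (dist[u] \u003c INF) // only relax vertices reachable from src\n\t\t\tfor (auto \u0026[v, w] : adj[u])\n\t\t\t\tdist[v] = min(dist[v], dist[u] + w);\n\treturn dist;\n}\n```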
\n\n# The LIS problem\n\nThe LIS (Longest increasing subsequence) problem asks the following question, _\"Given an ordered list of n integers, what is the length of the longest increasing subsequence that belongs to this list?\"_\n\nLet's take an example.\n\nLet the list $arr$ be $[10, 22, 9, 33, 21, 50, 41, 60, 80]$. One possible solution to this list is as given below.\n\n![dp-4](/images/dp-4.webp)\n\nSo how do we solve this problem?\n\nBefore we attempt to solve this problem, let us take a short detour to learn about the idea of _reductions_ in the field of computational complexity theory. Linked here: [Complexity Theory - Reductions](/blog/complexity-theory-reductions).\n## Back to finding the LIS\n\nConsider the following idea: let's transform the given array $arr = [10, 22, 9, 33, 21, 50, 41, 60, 80]$ to a DAG by applying the following rules.\n\nThere exists a directed edge from the element at position $i$ to another element at position $j$ **if and only if**\n\n1. $i \\lt j$, and\n2. $arr[i] \\lt arr[j]$.\n\nLet's consider the implications of such a construction. What does it **mean** to find the LIS of some given array? Especially after this transformation.\n\n![dp-5](/images/dp-5.webp)\n\nNotice that there is **no difference** between the longest path on a DAG problem and finding the LIS of an array after we have performed this transformation to the array. In such a DAG, every \"path\" is a sequence of increasing numbers. We wish to find the longest such sequence. This, in turn, translates to simply finding the longest such path on the graph.\n\nOur graph enumerates all such increasing subsequences. The longest path is, therefore, also the longest increasing subsequence.\n\nWe have hence **successfully** found a reduction to the problem. We have shown that by applying the transformation of the array to a DAG constructed by following the above two rules, we have managed to reduce the problem of finding the LIS of an array to the problem of finding the longest path on a DAG.\n\nSadly, our reduction is not as efficient as the solution to $g$ itself. Notice that constructing the graph is of order $O(V^2)$. Let us define the construction of our graph (the reduction) as a function $R(x)$ which takes in input $x$ for problem $f$ (LIS) and converts it to input for problem $g$ (longest path on a DAG).\n\nOur overall complexity will be $O(R(x)) + O(g(x))$. Since the reduction step is $O(V^2)$, our final solution will be $O(n^2)$. We may have up to $n^2$ edges.\n\nHence we have a solution to the LIS problem which computes the answer in $O(n^2)$.\n\nSimply transform it to the increasing subsequence DAG and compute its longest path.\n\nHowever, the natural question to ask again is, _can we do better?_\n\n## Computing LIS in $O(nlogn)$\n\nThe reduction to convert the LIS problem to the longest path on DAGs was great and gave us an $O(n^2)$ solution. But how can we do better? Is there any redundancy in our computation? Is there some extra information unique to this problem that we haven't exploited yet?\n\nTurns out, there is.\n\nLet's define our DP state as follows.\n\n$\\text{Let } dp[i] \\text{ be the smallest element at which an increasing subsequence of length } i \\text{ terminates.}$\n\nIf we can compute $dp[i]$ for all $i$ from $1$ to $n$, the largest $i$ for which $dp[i]$ contains a valid value will be our answer. How do we compute this? Consider the following naïve algorithm; here $a$ is our input array and $dp$ is our dp table.\n\n```cpp\ndp[0 to n] = ∞\ndp[0] = -INF\nfor i from 0 to n-1\n    for j from 1 to n\n        if (dp[j-1] \u003c a[i] and a[i] \u003c dp[j])\n            dp[j] = a[i]\n```\n\nWhy is the above algorithm correct? Notice that the outer loop is essentially trying to decide where to include the value $a[i]$. 
Further, notice that when we are iterating over $i$, the inner loop will never assign a value to any $dp[j]$ where $j \\gt i+1$.\n\nIntuitively this makes sense because at this point in time we are only considering the first $[0, i]$ segment/subarray. Such a subarray only has $i+1$ elements and can hence not be part of any $dp[j]$ where $j \\gt i+1$. If we look at what the algorithm is doing, $a[i]$ can **only** replace $dp[0 \\ to \\ i+1]$. Notice that after $i+1$, $dp[j] \\geq a[i]$. This means the replacement can never happen.\n\nNotice that according to our algorithm, the condition $dp[j-1] \\lt a[i] \\text{ and } a[i] \\lt dp[j]$ implies that the LIS of length $j-1$ must terminate at an element less than $a[i]$, and $a[i]$ must be less than whatever the current computed smallest element is which terminates a LIS of length $j.$ The first part of the condition makes sure the LIS is increasing and the second part makes sure it is the smallest such element that fits the condition.\n\n**Key observation:** Note that we will update at most **one** value and the DP array will always be **sorted.**\n\nWhy? Note that $dp[i]$ is the **smallest** element at which an increasing subsequence of length $i$ terminates. The keyword here is **smallest**.\n\nThis implies that, if in the future, $dp[i]$ is replaced by some $a[j]$, then $a[j]$ is the smallest element which terminates an increasing sequence of length $i$. What is the implication of this sentence?\n\nIf $a[j]$ is the **smallest** element that terminates an increasing sequence of length $i$, then it **can never** be the **smallest** element in the array that terminates an increasing sequence of any length $\\gt i$. The fact that it is used at position $i$ means that **any** such terminating value for **any position** $\\gt i$ **must** be $\\gt a[i]$.\n\nIf this is understood, then we have inferred that the array is both sorted and we require at most one replacement in each iteration of the outer loop. We have managed to transform the inner loop's job into a simpler problem. The inner loop is actually trying to solve the following question, _\"Given a sorted array, what is the first number that is strictly greater than $a[i]$?\"_\n\nNote that the above question can be trivially solved using binary search. This means that our inner loop can be replaced with a simple binary search to achieve $O(nlogn)$ overall time complexity.\n\n### Code\n```cpp\n// Sample implementation; INF is a large sentinel (e.g. 1e9)\nint lis(int arr[], int n) {\n    vector\u003cint\u003e dp(n + 1, INF); // dp[i] = smallest element ending an increasing subsequence of length i\n    dp[0] = -INF;\n\n    for (int i = 0; i \u003c n; i++) {\n        // Index of the first element strictly greater than arr[i], found in O(log n)\n        int j = upper_bound(dp.begin(), dp.end(), arr[i]) - dp.begin();\n        if (dp[j-1] \u003c arr[i] \u0026\u0026 arr[i] \u003c dp[j])\n            dp[j] = arr[i];\n    }\n\n    for (int i = n; i \u003e= 0; i--)\n        if (dp[i] \u003c INF) return i;\n    return 0;\n}\n```\n# References\nThese notes are old and I did not rigorously hoard references back then. If some part of this content is yours or you know where it's from, then do reach out to me and I'll update it. \n1. Professor [Kannan Srinathan's](https://www.iiit.ac.in/people/faculty/srinathan/) course on Algorithm Analysis \u0026 Design in IIIT-H\n2. [Huffman Codes: An Information Theory Perspective - Reducible](https://youtu.be/B3y0RsVCyrw?si=5kFqRPa_XsGxpqBr)",
+    "lastmodified": "2026-01-11T10:59:49.630229736+05:30",
    "tags": [
      "domain-cs-algorithms-analysis"
    ]
  },
  "/blog/extended-euclidean-algorithm": {
    "title": "Extended Euclidean Algorithm",
    "content": "# An efficient algorithm to find GCD\n\nThe problem we're attempting to solve is pretty simple. 
Given two numbers $a \text{ and } b$, find their GCD.\n\nA naïve way to solve this problem is as follows: find all prime factors of the two numbers, and multiply all the common ones. However, even with fast prime finding algorithms like the sieve of Eratosthenes, this will work only for small numbers. It is not feasible to precompute the sieve for large numbers.\n\nHowever, by combining the idea behind iterating over all divisors up to the root of a number and the sieve logic, we can actually come up with an algorithm that obtains the prime factors of a number in $O(\sqrt n \log n)$. How? Consider this approach: let us iterate from $i = 2 \to \sqrt n$ and, **if $i$** divides $n$, keep dividing $n$ by $i$. Each of these divisions implies that $i$ is, each time, a prime factor of $n$.\n\nThe code looks something like this\n\n```cpp\nstd::set\u003cint\u003e primefactors;\nfor(int i = 2; i*i \u003c= n; i++){\n\twhile(n % i == 0){\n\t\tprimefactors.insert(i);\n\t\tn /= i;\n\t}\n}\n// whatever remains is either 1 or a prime factor greater than sqrt(n)\nif(n \u003e 1) primefactors.insert(n);\n```\n\nNow let us try to **prove** why this works.\n\n**Our claim:** _Every composite number has at least one prime factor less than or equal to the square root of itself._\n\n**Proof:** Since our number is composite, it must have at least one factor $a$. This implies that there exists some number $b$ such that $a \times b = n$. Now, we want to prove that either\n\n$a \leq \sqrt n \text{ or } b \leq \sqrt n$\n\nWe prove this by contradiction. Assume that both $a \gt \sqrt n$ and $b \gt \sqrt n$.\n\nThis implies that $a\times b \gt \sqrt n \times \sqrt n \implies ab \gt n$. This is a contradiction. Hence either $a \leq \sqrt n$ or $b \leq \sqrt n$.\n\nNow, W.L.O.G. assume that $a \leq \sqrt n$. Either $a$ is prime, or by the fundamental theorem of arithmetic, $a$ must have a prime divisor $\lt a$. In both cases, our claim is true.\n\nThe inner while loop which removes every instance of the prime factor is pretty similar to the marking-composites step in the sieve algorithm. Once they're removed, we can move on to the next candidate factor.\n\nHowever, notice that in the worst case, if $n$ itself happens to be prime, our algorithm would have a worst-case runtime of $O(\sqrt n)$. Since the prime factors are obtained in sorted order, we can always run a two-pointer scan in linear time to get the common factors. But the worst case is still $O(\sqrt n)$.\n\nThe question is, _can we do better?_\n\n## Euclid's Algorithm\n\nThe algorithm Euclid proposed to solve the GCD problem was extremely simple. According to him,\n\n$$ gcd(a, b) = \begin{cases} a, \u0026 \text{ if } b = 0 \\ gcd(b, a \ mod \ b), \u0026 \text{otherwise.} \end{cases} $$\n\n### Proof\n\nFirst, note that the second argument strictly decreases with each iteration of the Euclidean algorithm, implying that the method will always halt (because the arguments are never negative).\n\nNow, to prove correctness, we must prove $gcd(a, b) = gcd(b, a \ mod \ b) \ \forall \ a \geq 0, b \gt 0$\n\nFirst, notice that $a \ mod \ b = a - b \cdot \lfloor \frac{a}{b} \rfloor$\n\nWith this reduction, let us try to prove a simpler identity: $gcd(a, b) = gcd(a-b, b)$.\n\nLet $gcd(a, b) = d.$ This implies that $d|a \text{ and } d|b$. This also means that $d | (a-b)$ and $d|b$.\n\nThis is true for all common factors of $a$ and $b$ (and conversely, any common divisor of $a-b$ and $b$ also divides $a$). Therefore, $(a, b)$ and $(a-b, b)$ share the same set of common factors. Hence $gcd(a,b) = gcd(a-b, b)$.\n\nNow notice that $a \ mod \ b$ is simply performing this operation $\lfloor \frac{a}{b} \rfloor$ times. Hence $gcd(a, b) = gcd(a \bmod b, b)$. Hence proved.\n\n
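As a quick sanity check of the recurrence (my own example, not from the original notes), consider $a = 1071$ and $b = 462$:\n\n$$ gcd(1071, 462) = gcd(462, 1071 \ mod \ 462) = gcd(462, 147) = gcd(147, 21) = gcd(21, 0) = 21 $$\n\nEach step just replaces the pair $(a, b)$ with $(b, a \ mod \ b)$ until the second argument hits $0$.\n\n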
Now that we have managed to prove correctness, let us try to put an upper bound on the running time of this algorithm.\n\n### Time complexity\n\nNotice that in every step of the recursion, one of the arguments gets cut at least in half. Consider the operation $a \ mod \ b$.\n\n**If $b \leq \frac{a}{2}$:** Then by the property of mod, $a \ mod \ b \lt b \leq \frac{a}{2}$\n\n**If $b \gt \frac{a}{2}$:** Then the operation $a \ mod \ b = a-b \lt \frac{a}{2}$\n\nTherefore the number of recursive steps will be at most $O(log(min(a, b)))$. And this is indeed the complexity of our algorithm.\n\nFurther note that for an $n$-bit number, since the operands get halved at every other step, we are removing one bit of the numbers per every 2 recursions. Hence the number of calls is linear in the number of bits of the number. The modulo operation (with schoolbook arithmetic) is quadratic in the number of bits of the number. Hence the final bit complexity is $O(n^3)$.\n\n\u003e Lamé's theorem is used to estimate the method's running time, and it establishes an unexpected link between the Euclidean algorithm and the Fibonacci sequence: The Euclidean algorithm executes at most $n-2$ recursive calls if $a \gt b \geq 1$ and $b \lt F_n$ for some $n$. Furthermore, the upper bound of this theorem can be shown to be optimal. $gcd(a,b)$ will do exactly $n-2$ recursive calls when $a = F_n$ and $b = F_{n-1}$. In other words, the worst-case input for Euclid's algorithm is a pair of consecutive Fibonacci numbers.\n\n### Code\n\nBelow is the C++ implementation of the algorithm. Notice the conciseness that writing the algorithm recursively gives us.\n\n```cpp\nint gcd (int a, int b) {\n    if (b == 0) return a;\n    else return gcd (b, a % b);\n}\n```\n\nHowever, we can also write it iteratively for more efficiency as follows\n\n```cpp\nint gcd (int a, int b) {\n    while (b) {\n        a %= b;\n        swap(a, b);\n    }\n    return a;\n}\n```\n\n# The Extended Euclidean Algorithm\n\nWhile the Euclidean algorithm calculates only the greatest common divisor (GCD) of two integers $a$ and $b$, the extended version also finds a way to represent the GCD in terms of $a$ and $b$, i.e. coefficients $x$ and $y$ for which:\n\n$a \cdot x + b \cdot y = \gcd(a, b)$\n\n## The algorithm\n\nLet the GCD of $a$ and $b$ be $g$.\n\nWe can find this representation by simply extending the previously explained algorithm. Notice that the previous algorithm terminates when $b=0$ and $a = g$. At this step, we can easily find the coefficients: $(x, y) = (1, 0)$, since $g = g \cdot 1 + 0 \cdot 0$.\n\nFrom here, the main idea is to **backtrack** through our recursive calls. The only transition we need to describe is the transition of $\u003cx, y\u003e$ from $(b, a \ mod \ b) \to (a,b)$.\n\nLet's suppose that we have the coefficients $\u003cx_1, y_1\u003e$ for $(b, a\ mod \ b)$. This implies that the following equation always holds true.\n\n$$ b \cdot x_1 + (a \bmod b) \cdot y_1 = g $$\n\nNow, we want to find the transition of $\u003cx_1, y_1\u003e \to \u003cx_2, y_2\u003e$ for the pair $(a, b)$. That is,\n\n$$ a \cdot x_2 + b \cdot y_2 = g $$\n\nRecall that we can write $a \bmod b = a - \left\lfloor \frac{a}{b} \right\rfloor \cdot b$. Now, substituting this in the previous equation, we get,\n\n$$ g = b \cdot x_1 + (a \bmod b) \cdot y_1 = b \cdot x_1 + \left(a - \left\lfloor \frac{a}{b} \right\rfloor \cdot b \right) \cdot y_1 $$\n\nRearranging the terms, we get,\n\n$$ g = a \cdot y_1 + b \cdot \left( x_1 - y_1 \cdot \left\lfloor \frac{a}{b} \right\rfloor \right) $$\n\nAnd that's it! We have found our transition.\n\n$$ \begin{cases} x_2 = y_1 \\ y_2 = x_1 - y_1 \cdot \left\lfloor \frac{a}{b} \right\rfloor \end{cases} $$\n\n
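To see the backtracking in action, here is a small worked example (mine, not from the original notes) for $(a, b) = (10, 6)$, where $g = 2$. The forward calls are $(10, 6) \to (6, 4) \to (4, 2) \to (2, 0)$. Unwinding with the transition above:\n\n$$ (2, 0): (x, y) = (1, 0) $$\n$$ (4, 2): (x, y) = (0, \ 1 - 0 \cdot \lfloor 4/2 \rfloor) = (0, 1) $$\n$$ (6, 4): (x, y) = (1, \ 0 - 1 \cdot \lfloor 6/4 \rfloor) = (1, -1) $$\n$$ (10, 6): (x, y) = (-1, \ 1 - (-1) \cdot \lfloor 10/6 \rfloor) = (-1, 2) $$\n\nAnd indeed, $10 \cdot (-1) + 6 \cdot 2 = 2 = gcd(10, 6)$.\n\n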
## Proof\n\n**Lemma:** If $d$ divides both $a$ and $b$, and $d = ax + by$ for some integers $x$ and $y$, then necessarily $d = gcd(a, b)$.\n\n**Proof:**\n\n1. Since it is given that $d|a$ and $d|b$, $d$ is a common divisor of $a$ and $b$. This implies that $d \leq gcd(a, b)$ by definition of $gcd$.\n2. Since $gcd(a, b)|a$ and $gcd(a, b)|b$, it implies that $gcd(a,b)|(ax+by) \implies gcd(a,b)|d$. Therefore $gcd(a,b) \leq d$.\n\nWe have $gcd(a,b) \leq d$ and $d \leq gcd(a,b)$. Therefore it must be true that $d = gcd(a,b)$.\n\n## Code\n\nAs before, we can implement this both recursively and iteratively. The recursive version is quite concise. Below is the C++ implementation of the recursive code.\n\n```cpp\nint gcd(int a, int b, int \u0026x, int \u0026y) {\n    if (b == 0) {\n        x = 1;\n        y = 0;\n        return a;\n    }\n    int x1, y1;\n    int d = gcd(b, a % b, x1, y1);\n    x = y1;\n    y = x1 - y1 * (a / b);\n    return d;\n}\n```\n\n# Modular Division (Multiplicative Inverse)\n\nWhen doing operations in modular arithmetic, we perform operations between two numbers $a, b$ belonging to $Z_k$, like $(a+b) \ mod \ k$. Here $+$ is the binary operation and $Z_k$ is the modular ring (it is a field only when $k$ is prime).\n\nNotice that while this is fine for integer addition, subtraction, and multiplication, it is not so easy to define division.\n\nIn the world of modular arithmetic, we define the modular multiplicative inverse of an integer $a$ as an integer $x$ such that $a \cdot x \equiv 1 \mod k$. Further, in terms of defining notation, we write such an integer as $x = a^{-1}$.\n\nFurther, notice that the modular inverse of an element belonging to $Z_k$ does not always exist. For example, consider $3 \in Z_6$. $Z_6 = \{0, 1, 2, 3, 4, 5\}$. By trying all possible elements we can confirm that there exists no integer $\in Z_6$ such that $3 \cdot x \equiv 1 \ mod \ 6$. It can be proven that the modular inverse for an integer $a$ exists in $Z_k$ **if and only if $gcd(a, k) = 1$.** That is, $a$ and $k$ are relatively prime.\n\n## Finding the modular inverse using the Extended Euclidean Algorithm\n\nLet's take the following equation,\n\n$$ a \cdot x + k \cdot y = 1 $$\n\nRemember that the modular inverse of $a$ exists if and only if $gcd(a, k) = 1$. Further, notice that the above equation can be solved by the extended Euclidean algorithm.\n\nOnce the extended Euclidean algorithm gives us the values of $x$ and $y$, we can reduce the entire expression mod $k$ (the $k \cdot y$ term vanishes) to get\n\n$$ a \cdot x \equiv 1 \ mod \ k $$\n\nThen $x = a^{-1}$.\n\n### Code\n\nThe code for it is pretty simple. Below is the C++ implementation of the same, reusing the extended `gcd` defined above.\n\n```cpp\nint a, k, x, y;\nint g = gcd(a, k, x, y);\nif (g != 1) { /* no solution, a and k are not coprime */ }\nelse { int inv = (x % k + k) % k; /* inv is our solution */ }\n```\n\nNote that we do the `% k + k` step to make sure the result is positive, since the $x$ returned may be negative.\n\n
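Packaged as a helper (a minimal sketch of my own; the name `mod_inverse` and the `-1` failure convention are illustrative, not from the notes):\n\n```cpp\n// Returns a^{-1} mod k using the recursive extended gcd defined above,\n// or -1 when gcd(a, k) != 1 and no inverse exists.\nint mod_inverse(int a, int k) {\n    int x, y;\n    int g = gcd(a, k, x, y);\n    if (g != 1) return -1;   // inverse exists iff a and k are coprime\n    return (x % k + k) % k;  // normalize x into [0, k-1]\n}\n```\n\nFor example, `mod_inverse(3, 7)` returns `5`, and indeed $3 \cdot 5 = 15 \equiv 1 \ mod \ 7$.\n\n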
# References\nThese notes are old and I did not rigorously keep track of references back then. If some part of this content is yours or you know where it's from, then do reach out to me and I'll update it. \n1. Professor [Kannan Srinathan's](https://www.iiit.ac.in/people/faculty/srinathan/) course on Algorithm Analysis \u0026 Design in IIIT-H\n2. [Extended Euclidean Algorithm - cpalgorithms](https://cp-algorithms.com/algebra/extended-euclid-algorithm.html)\n",
-    "lastmodified": "2026-01-11T05:05:51.532609891+05:30",
+    "lastmodified": "2026-01-11T10:59:04.645172499+05:30",
    "tags": [
      "domain-cs-algorithms-analysis",
      "domain-math-number-theory"
    ]
  },
  "/blog/finite-and-infinite-sums": {
    "title": "Finite and Infinite Sums",
    "content": "# Finite and Infinite Sums\n\nTo understand the rearrangement theorem and its many implications, we must first fully understand what a series is and what it means to find the sum of an **infinite** series. Let us begin by addressing the questions we raised in the introduction segment.\n\n## Series\n\nA series can be described as the sum of the terms of a given numerical sequence. Recall that a numerical sequence is simply an ordered list or collection of numbers where repetition is allowed. Consider the finite sequence of natural numbers from 1 to 10: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10.\n\nThe series corresponding to the terms of this sequence is:\n\n$$\sum\limits_{k=1}^{10} k = 1 + 2 + 3 + \ldots + 10 = 55$$\n\nThis was an example of a finite sequence. Similarly, just like we can have finite and infinite sequences, we can have an infinite series. An infinite series is simply the sum of an infinite number of terms of a corresponding sequence. For example, if we extend the previous finite series to include the sum of all the natural numbers, we get:\n\n$$\sum\limits_{k=1}^{\infty} k = 1 + 2 + 3 + \ldots$$\n\nBut what is the sum of an infinite series equal to? To answer this, let us begin by defining two different types of series: **convergent** series and **divergent** series.\n\n### Convergent series\n\nWe define a convergent series as a series whose sequence of partial sums tends to a limit.\n\nConsider the following series,\n$$S = \sum\limits_{k=1}^{\infty} \frac{1}{2^k} = \frac{1}{2} + \frac{1}{4} + \frac{1}{8} + \frac{1}{16} + \ldots$$\nWe now define a new term, **partial sums.** The $k^{th}$ partial sum of a series is defined as the sum of the first *k* terms of that series. The set of all the partial sums of an infinite series forms an infinite sequence where the $n^{th}$ term of the sequence is equal to the $n^{th}$ partial sum of the series. Let $S_k$ denote the $k^{th}$ partial sum of our above defined series S. Then, the first few partial sums are as follows:\n$$S_1 = \sum\limits_{k=1}^{1} \frac{1}{2^k} = \frac{1}{2} = 1 - \frac{1}{2}$$\n$$S_2 = \sum\limits_{k=1}^{2} \frac{1}{2^k} = \frac{1}{2} + \frac{1}{4} = \frac{3}{4} = 1 - \frac{1}{4}$$\n$$S_3 = \sum\limits_{k=1}^{3} \frac{1}{2^k} = \frac{1}{2} + \frac{1}{4} + \frac{1}{8} = \frac{7}{8} = 1 - \frac{1}{8}$$\n$$S_4 = \sum\limits_{k=1}^{4} \frac{1}{2^k} = \frac{1}{2} + \frac{1}{4} + \frac{1}{8} + \frac{1}{16} = \frac{15}{16} = 1 - \frac{1}{16}$$\nWe can write down these partial sums as the beginning few terms of an infinite sequence,\n$$\left\{ \ \frac{1}{2}, \ \frac{3}{4}, \ \frac{7}{8}, \ \frac{15}{16}, \ \ldots \ \right\}$$\n\nRecall that the $n^{th}$ term of this sequence is the $n^{th}$ partial sum of the series. We can now define the sum of an infinite series as the limit of this sequence of partial sums as *n* tends to infinity. If such a limit does not exist, we say that the series does not have a sum. 
Let us attempt to find the general term of this sequence by multiplying the partial sum $S_k$ by 2, \n$$ 2S_{k}=\frac {2}{2}+\frac {2}{4}+\frac {2}{8}+\cdots +\frac {2}{2^{k}}=1+\left[\frac {1}{2}+\frac {1}{4}+\cdots +\frac {1}{2^{k-1}}\right]=1+\left[S_{k}-{\frac {1}{2^{k}}}\right]. $$\n$$ S_k = 1 - \frac{1}{2^k}$$\n\nWe can notice this pattern in our listing of the first 4 partial sums of this series. As $k$ tends to infinity, we have:\n\n$$\lim_{k \to \infty} 1 - \frac{1}{2^k} = 1$$\n\nSince the sequence of its partial sums tends to a limit, this is a convergent series whose sum is equal to that limit. Here,\n$$S = \sum\limits_{k=1}^{\infty} \frac{1}{2^k} = \frac{1}{2} + \frac{1}{4} + \frac{1}{8} + \frac{1}{16} + \ldots = 1$$\n\n### Divergent Series\n\nA divergent series can be simply defined as a series which does not converge, i.e., a series is divergent if the infinite sequence of its partial sums does not have a finite limit. This is true if either the limit does not exist or it is equal to $\pm\infty$.\n\nConsider the simple harmonic series,\n$$S = \sum\limits_{k=1}^{\infty} \frac{1}{k} = 1 + \textcolor{red}{\frac{1}{2} + \frac{1}{3}} + \textcolor{orange}{\frac{1}{4} + \frac{1}{5} + \frac{1}{6} + \frac{1}{7}} + \textcolor{purple}{\frac{1}{8} +\frac{1}{9} +\frac{1}{10} +\frac{1}{11} +\frac{1}{12} + \frac{1}{13} + \frac{1}{14} + \frac{1}{15}} + \textcolor{blue}{\frac{1}{16} \ldots}$$\n\nLet $S_k$ represent the sum of the terms with indices from $2^{k-1}$ (included) up to $2^k$ (excluded). We make the following observations:\n$$S_1 = 1 \u003e \frac{1}{2}$$\n$$S_2 = \textcolor{red}{\frac{1}{2} + \frac{1}{3}} \u003e \frac{1}{2}$$\n$$S_3 = \textcolor{orange}{\frac{1}{4} + \frac{1}{5} + \frac{1}{6} + \frac{1}{7}} \u003e \frac{4}{8} = \frac{1}{2}$$\n$$S_4 = \textcolor{purple}{\frac{1}{8} + \frac{1}{9} + \frac{1}{10} + \frac{1}{11} + \frac{1}{12} + \frac{1}{13} + \frac{1}{14} + \frac{1}{15}} \u003e \frac{8}{16} = \frac{1}{2}$$\n$$\vdots$$\n\nWe can write the series $S$ as the sum of these segments $S_1$, $S_2$, $S_3$\... etc. Now, since each segment individually sums to a number greater than $\frac{1}{2}$, we can make the following relation:\n\n$$S = S_1 + \textcolor{red}{S_2} + \textcolor{orange}{S_3} + \textcolor{purple}{S_4} + \textcolor{blue}{\ldots} \u003e \frac{1}{2} + \textcolor{red}{\frac{1}{2}} + \textcolor{orange}{\frac{1}{2}} + \textcolor{purple}{\frac{1}{2}} + \textcolor{blue}{\ldots}$$\n\nThe sequence of partial sums of the series on the right $\left( \frac{1}{2} + \frac{1}{2} + \frac{1}{2} + \ldots \right)$ tends to $\infty$. Hence, we can say that our simple harmonic series $S$ diverges to $\infty$.\n\n## Commutativity of Absolute Convergence\n\n**If $\sum a_n$ is a series of complex numbers which converges absolutely, then every rearrangement of $\sum a_n$ converges to the same sum.**\n\nSince $\sum a_n$ is absolutely convergent, given an $\epsilon \gt 0$ there exists an integer $N$ such that\n\n$$\displaystyle \sum_{i=n}^m |a_i|\leq \epsilon$$\n\nwhenever $m\geq n\geq N$.\n\nWe choose $p$ such that the integers $1,2,3,\ldots,N$ are all present in the set $k_1,k_2,k_3,\ldots,k_p$, the indices of the rearranged series. 
So if $n \gt p$, the numbers $a_1,a_2,a_3,\ldots,a_N$ cancel out in the difference $s_n-S_n$, where $S_n$ is the $n^{th}$ partial sum of the rearranged series. Thus we get that $|s_n-S_n| \leq \epsilon$. Hence $S_n$ converges to the same value as $s_n$.\n\n**Commutative law for addition: $a+b = b+a$**\n\nThat is, the commutative law says rearranging the summands in a finite sum does not change the total.\n\nBut that's not true for conditionally convergent series.\n\nExample (the alternating harmonic series):\n\n$$1-\frac{1}{2}+\frac{1}{3}-\frac{1}{4}+\frac{1}{5}-\frac{1}{6}+\frac{1}{7}-\frac{1}{8}+\frac{1}{9}-\frac{1}{10}+...= \ln 2$$\n\nConsider:\n\n$$1-\frac{1}{2}-\frac{1}{4}+\frac{1}{3}-\frac{1}{6}-\frac{1}{8}+\frac{1}{5}-\frac{1}{10}-\frac{1}{12}+...$$\n\n$$=(1-\frac{1}{2})-\frac{1}{4}+(\frac{1}{3}-\frac{1}{6})-\frac{1}{8}+(\frac{1}{5}-\frac{1}{10})-\frac{1}{12}+...$$\n\n$$=\frac{1}{2}-\frac{1}{4}+\frac{1}{6}-\frac{1}{8}+\frac{1}{10}-\frac{1}{12}+...$$\n\n$$=\frac{1}{2}(1-\frac{1}{2}+\frac{1}{3}-\frac{1}{4}+\frac{1}{5}-\frac{1}{6}+\frac{1}{7}-\frac{1}{8}+\frac{1}{9}-...)=\frac{1}{2}\ln 2$$\n\nWe explained earlier that the sum of a series is the limit of its partial sums as $n\to\infty$. So, rearranging the terms of a series changes the partial sums, and as a result this changes the limit of the partial sums.\n\n
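A quick numeric sanity check (my own illustration, not part of the original notes): partial sums of the alternating harmonic series approach $\ln 2$, while the rearranged series above (one positive term followed by two negative ones) approaches $\frac{\ln 2}{2}$.\n\n```cpp\n#include \u003ccstdio\u003e\n#include \u003ccmath\u003e\n\nint main() {\n    double original = 0.0, rearranged = 0.0;\n    for (int n = 1; n \u003c= 1000000; n++)\n        original += (n % 2 ? 1.0 : -1.0) / n;\n    // groups of the rearrangement: 1/(2k-1) - 1/(4k-2) - 1/(4k)\n    for (int k = 1; k \u003c= 1000000; k++)\n        rearranged += 1.0 / (2 * k - 1) - 1.0 / (4 * k - 2) - 1.0 / (4 * k);\n    printf(\"original   = %.6f (ln 2   = %.6f)\\n\", original, log(2.0));\n    printf(\"rearranged = %.6f (ln 2/2 = %.6f)\\n\", rearranged, log(2.0) / 2);\n}\n```\n\n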
### Conditionally Convergent Series\n\nA series $\sum\limits_{n=1}^\infty a_n$ is conditionally convergent if and only if the sequence of its partial sums, i.e., $\lim_{m \to \infty} \sum\limits_{n=1}^m a_n$ exists and evaluates to some finite number while $\lim_{m \to \infty} \sum\limits_{n=1}^m |a_n|$ diverges to $\infty$.\n\n## Rearrangement:\n\n### Definition: \nLet $\{k_n\}$, $n = 1,2,3,\ldots$, be a positive integer valued sequence in which every positive integer appears once and only once (that is, $k_n = k_{n'}$ if and only if $n=n'$). Given a series $\sum a_n$, put\n\n$$a'_n = a_{k_n} \ \ \ \ \ \ \ \ \ \ \ \ (n=1,2,3,\ldots)$$\n\nWe say that $\sum a'_n$ is a rearrangement of $\sum a_n$.\n\n# Sum of the Alternating Harmonic Series\n\nA power series (centered at 0) is a function of the form\n\n$\displaystyle\sum_{n=0}^{\infty}a_nx^n = a_0 + a_1x + a_2x^2 + a_3x^3 + \ldots$\n\nIf the series converges for some non-zero $x$, there is an $R\u003e0$ so that the series converges in the open interval $-R\u003cx\u003cR$. In this interval, the series can be differentiated and integrated term by term and the resulting series also converge in this open interval.\n\n## Abel's Theorem\n\nIf $\displaystyle\sum a_n$ converges, and if $f(x) = \sum a_nx^n$, then\n\n$$\sum a_n = \displaystyle\lim_{x\to1^-}f(x)$$\n\nAbel's theorem and results on integration and differentiation of series allow us to find sums of series like the alternating harmonic series (AHS).\n\nTo sum\n\n$$1-\frac{1}{2}+\frac{1}{3}-\frac{1}{4}+\frac{1}{5}-\frac{1}{6}+\frac{1}{7}-\frac{1}{8}+\frac{1}{9}-\frac{1}{10}+...$$\n\nlet\n\n$$f(x) = x-\frac{1}{2}x^2+\frac{1}{3}x^3-\frac{1}{4}x^4+\frac{1}{5}x^5-\frac{1}{6}x^6+\frac{1}{7}x^7-...$$\n\nThis power series converges in the open interval $-1\u003cx\u003c1$.\n\nLet $F(x)=f'(x)$ so that\n\n$$F(x)=f'(x) = 1-\frac{1}{2}2x+\frac{1}{3}3x^2-\frac{1}{4}4x^3+\frac{1}{5}5x^4-...$$\n\n$$=1-x+x^2-x^3+x^4-x^5+x^6-...$$\n\n$$=\frac{1}{1+x}$$\n\nSince $f'(x)=\frac{1}{1+x}$, we can see that $f(x) = \ln(1+x)$ (the constant of integration is $0$ because $f(0) = 0$).\n\nNow Abel's theorem says\n\n$$1-\frac{1}{2}+\frac{1}{3}-\frac{1}{4}+\frac{1}{5}-\frac{1}{6}+\frac{1}{7}-...=\displaystyle\lim_{x\to1^-} \ln(1+x)=\ln 2$$\n\nNext, we'll take a look at [Riemann Series Rearrangement](/blog/riemann-series-rearrangement).\n\n# References\n- [The classical theory of rearrangements, Agana, M. J.](https://scholarworks.boisestate.edu/cgi/viewcontent.cgi?article=2052)\n- [Riemann series theorem, Wikipedia](https://en.wikipedia.org/wiki/Riemann_series_theorem)\n- [Riemann's paradox: pi = infinity minus infinity, Mathologer](https://www.youtube.com/watch?v=-EtHF5ND3_s)\n- [Infinite Series - Numberphile](https://www.youtube.com/watch?v=Jwtn5_d2YCs)\n- [Riemann - Wikipedia](https://en.wikipedia.org/wiki/Bernhard_Riemann)\n- Principles of Mathematical Analysis, Walter Rudin\n- [On Conditionally Convergent Series, Werner Horn and Madjiguene Ndiaye](http://www.csun.edu/~hcmth017/riemann1/riemann1.pdf)\n- [Calculus Notes, Grinshpan](https://www.math.drexel.edu/~tolya/123_harmonic.pdf)\n- [Rearranging the Alternating Harmonic Series, Carl C. Cowen](https://www.math.iupui.edu/~ccowen/ButlerAHslides.pdf)\n",
-    "lastmodified": "2026-01-11T05:05:51.543153843+05:30",
+    "lastmodified": "2026-01-11T10:59:04.652614462+05:30",
    "tags": [
      "domain-math-analysis"
    ]
  },
  "/blog/flynn-s-taxonomy": {
    "title": "Flynn's Taxonomy",
    "content": "# Flynn’s Taxonomy\n\nIt’s basically a categorization of different forms of parallel computing architectures.\n\n1. At level 1 you basically have **single instruction single data stream.** There’s a single instruction which acts on exactly one data element.\n2. Next up you have **single instruction multiple data stream,** this time around we have a single instruction acting on multiple data elements at the same time. The idea is to have a large set of registers on which we perform the same operation using larger “ALUs” or SIMD units. GPUs are also a good example. \n3. I won’t call this an upgrade but another alternative is **multiple instruction single data stream**. An example is a pipeline architecture, although strictly speaking the data that passes through different stages of the pipeline does undergo changes.\n4. Last up is **multiple instruction multiple data stream**. Consider multiple cores acting asynchronously performing SIMD operations. That’s one such example. 
([Parallelism with OMP](/blog/parallelism-with-omp), [Brent's Theorem \u0026 Task Level Parallelism](/blog/brent-s-theorem-task-level-parallelism), etc.)\n\n## SIMD\n\n![simd-1](/images/simd-1.webp)\n\nThe idea here is to basically take multiple scalar instructions operating on adjacent pieces of memory and combine them into a single vector instruction which applies the scalar operation element by element in a single instruction. In general, SIMD operations have higher latency than their scalar counterparts, but in exchange we get insane throughput. More on this in [Vectorization \u0026 Analyzing Loop Dependencies](/blog/vectorization-analyzing-loop-dependencies).\n\n[Intel’s intrinsics guide](https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html) contains the detailed documentation for all the different SIMD architectures and their corresponding SIMD instructions. Instead of requiring us to write assembly, Intel exposes the intrinsics API via library functions, and these library functions generate the corresponding assembly. For the GPU side, see NVIDIA's [PTX ISA documentation](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html).\n\n
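As a small illustration (my own sketch, not from the original notes), here is what using these intrinsics looks like with SSE: one `_mm_add_ps` performs four float additions at once.\n\n```cpp\n#include \u003cimmintrin.h\u003e\n\n// Adds two float arrays element-wise, four lanes at a time.\nvoid add4(const float *a, const float *b, float *out, int n) {\n    int i = 0;\n    for (; i + 4 \u003c= n; i += 4) {\n        __m128 va = _mm_loadu_ps(a + i); // load 4 adjacent floats\n        __m128 vb = _mm_loadu_ps(b + i);\n        _mm_storeu_ps(out + i, _mm_add_ps(va, vb)); // 4 additions, 1 instruction\n    }\n    for (; i \u003c n; i++) out[i] = a[i] + b[i]; // scalar tail\n}\n```\n\n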
## SPMD\n\nSPMD → Single program multiple data. Distributed computing sort of works on this idea. Tasks are split up and run simultaneously on multiple processors with different input in order to obtain results faster. **Note that the two are not mutually exclusive.** [Map Reduce](https://en.wikipedia.org/wiki/MapReduce).\n# References\nThese notes are quite old, and I wasn't rigorously collecting references back then. If any of the content used above belongs to you or someone you know, please let me know, and I'll attribute it accordingly.",
-    "lastmodified": "2026-01-11T05:06:37.004014144+05:30",
+    "lastmodified": "2026-01-11T10:59:49.07874153+05:30",
    "tags": [
      "domain-cs-systems-high-performance-computing"
    ]
  },
  "/blog/getting-rid-of-short-form-content": {
    "title": "Getting Rid of Short-Form Content",
    "content": "Social media platforms are universally competing to capture all our time and attention by spamming us with brain-rot short form content... and it's working. And unlike other forms of addiction, it's not even true that we get dopamine hits from consuming short form content. More often than not, we don't even realize the amount of time that has passed consuming tidbits of random brain rot. YouTube, for example, randomly starts on the shorts page and users don't even realize they're scrolling through shorts until much later. Below we'll quickly outline a few software fixes we can implement at the moment to try to purge short form content from our life. Needless to say, if you're an iOS user, please chuck your phone. \n# Instagram Reels\nI use Instagram sometimes to view friends' stories and chat once in a while. There are a couple of possible fixes to getting rid of short form content from Instagram mobile. \n## Web + AdBlock\nUse only Instagram web. The UI is absolute trash and terrible to use. But you can use an AdBlocker tool to get rid of the reels elements and then it's just a modded Instagram with pretty bad UI. \n## Instander\nUse a modded version of Instagram like [Instander](https://thedise.me/instander/). There are a few other mods like AeroInsta, MyInsta, etc. but I personally find Instander to be the most trustworthy among them. They're all closed source but the developer of Instander is pretty well known and trusted in the community. People have also run network logging experiments to verify that Instander is not performing any extra data espionage or mining activities on the device. Alright, assuming you've uninstalled Instagram and installed Instander, what do we do now? \n1. After logging in, navigate to \"Instander Settings\", right above the usual Instagram settings button. Go to Developer Mode and enable it. Also on the same screen, click on \"Get MobileConfig\" and \"Update.\"\n2. Restart the app. \n3. Next, return to the home screen of Instander and long press the \"Home\" (house) icon on the bottom tab. This will enter you into developer options. Navigate to \"MetaConfig Settings \u0026 Overrides\".\n4. Search for \"panorama v2 variants\"; you'll find a setting called `panorama v2 variants: reels tab enabled`. Turn it off. \n5. Next, search for \"panavision nav\". At the bottom, you'll see a few settings for specific tabs. You're interested in `tab 1` and `tab 3`.\n6. Click on both `tab 1` and `tab 3` and set their value to `news`. Note that I want to turn off both explore and the reels tab. If you want to keep explore for some reason you can leave it as is.\n7. Restart the app.\nAnd we're done. There's still \"Suggested reels\" that sometimes pops into your primary feed when scrolling. It's annoying, but infinitely easier to notice and avoid compared to just clicking on something via explore. I do not know of any solution / fix to remove that from your feed. If anyone does know a solution to that then please let me know.\n### Disclaimer\nUsing modded Instagram can possibly get your account blocked by Instagram. But I know many people \u0026 entire Reddit communities who have used these apps for years without issues. Worst case you can probably appeal and get your account back. Worst-worst case you can get rid of Instagram. Win-win situation. \n# YouTube Shorts\n## Mobile\nThe easiest fix is through ReVanced Manager. \n1. Install [ReVanced](https://revanced.app/). \n2. Go to Patcher \u003e Select an App\n3. Search for \"YouTube\"\n4. Click on the inner `Suggested: vXX.XX.XX` button\n5. Install the `nodpi` version from a source like ApkMirror\n6. Launch patched YouTube. Install MicroG and sign in when it prompts you to.\n7. And we now have better YouTube. We need to get rid of the original YouTube now. There are two options.\n\t1. Disable it using system settings\n\t2. Use `adb` to uninstall it. \n## Desktop \nJust install one of the many YouTube feed / shorts blockers from the Chrome Web Store. I use [ShortsBlocker](https://chromewebstore.google.com/detail/shortsblocker-remove-shor/oahiolknhkbpcolgnpljehalnhblolkm?hl=en). \n\n# TikTok\nEasy fix. Uninstall. There's literally no point in having this app. ",
-    "lastmodified": "2026-01-11T05:06:37.009067958+05:30",
+    "lastmodified": "2026-01-11T10:59:49.086222326+05:30",
    "tags": []
  },
  "/blog/how-the-economic-machine-works-by-ray-dalio": {
    "title": "How the Economic Machine Works by Ray Dalio",
    "content": "This is going to be my written understanding of Ray Dalio's absolutely beautiful 30 minute animated video to answer the question, \"How does the economy really work?\" To any reader, watch the original video first. It's one of the best presented educational videos on the internet in my opinion. 
\n\n\u003ciframe title=\"How The Economic Machine Works by Ray Dalio\" src=\"https://www.youtube.com/embed/PHe0bXAIuk0?si=FUQfilqcRfebsYf7\" height=\"113\" width=\"200\" allowfullscreen=\"\" allow=\"fullscreen\" style=\"aspect-ratio: 1.76991 / 1; width: 100%; height: 100%;\"\u003e\u003c/iframe\u003e\n\n# The Machine Template\nRay describes the economy as a simple machine, a machine that many people don't understand or agree upon. He claims he has a **simple but practical** economic template that paints an accurate picture of the economy and markets in a country, which allowed him to sidestep the global financial crisis and has worked well for him for the last 30 years.\n\n\u003chr\u003e\n\n## Forces That Drive the Economy\nAccording to Ray, there are three primary forces that drive the economy: **productivity growth**, the **short-term debt cycle** and the **long-term debt cycle**.\n\n![pasted-image-20240307000515](/images/pasted-image-20240307000515.webp)\n\nHis template focuses on layering these three forces together and observing the effect it has on **transactions** and how that in turn affects the economy. \n\n![pasted-image-20240307000718](/images/pasted-image-20240307000718.webp)\n\n### Transactions\nTo start modelling the economy, we should first understand what we're talking about when we say the word 'economy.' \n\n\u003eAn economy is simply the sum of the transactions that make it up, and a transaction is a very simple thing. Everyone makes transactions all the time. Every time you \"buy\" something, you create a transaction. Each transaction consists of a buyer exchanging **money** or **credit**, for **goods**, **services**, or **financial assets**.\n\nSpending **credit** is the same as spending money. So simply summing up the amount of credit spent and money spent, we can calculate the total spending of a nation. And the **total amount of spending drives the economy**. Let the *total spending* in a country be $T$. If $T_m$ is the total money spent, and $T_c$ is the total credit spent, we have $T = T_m + T_c$. Now, let the total weighted-quantity sold be $Q$. Then the price level is just $P = \frac{T}{Q}$. Keep this in mind for later; it'll help us reason about why prices increase during **inflation** and decrease during **deflation**. So if a buyer pays price $P$ to buy some item / service $Q$, we have a transaction. \n\n**Transactions are the basic building block of the economic machine** described by Ray. All cycles and forces in an economy are driven by transactions. Understanding transactions will help us understand the entire structure and working of the economic machine. \n## Market \nA market consists of all the buyers and sellers making transactions for the **same** thing. For example, we have a wheat market, a stock market, a car market, etc. \n## Economy\nAn economy consists of all of the transactions in **all** of its markets. If you add up the total spending and the total quantity sold in all of the markets, you have everything you need to know to understand the economy. \n## The Participants \nThe participants involved in transactions are **people**, **businesses**, **banks** \u0026 **governments**. \n### The Government\nThe most important buyer and seller that we want to understand is the government. It is primarily the actions of the government that control inflation, deflation, and all the other large scale economic events that occur. The government consists of two parts:\n1. 
**CENTRAL GOVERNMENT**\n\tThe central government is the body responsible for collecting taxes and interacting with the people. This is important because they collect taxes from the people, usually proportional to wealth, and redistribute them back to the have-nots in difficult times via stimulus checks and other similar means of support. \n2. **CENTRAL BANK**\n\tIt is different from other buyers and sellers because it **controls the amount of money and credit in the economy**. The central bank is both capable of **influencing interest rates** and **printing new money**. The central bank is an important player, perhaps the most important player in the flow of credit in the economy. \n### People \u0026 Businesses\nBoth work roughly the same way. Both people and businesses use the capital and income available to them to buy assets and other items by partaking in transactions. In particular, they act as propagators in the economic machine. People and businesses often take up debt, giving them access to credit to increase their spending. This will be covered in detail soon.\n### Banks\nSimilar to the central bank described above. They cannot print money, but they can adjust the interest rate and thereby help control the flow of credit in the economy.\n## Credit\nIt is **THE** most important part of the economy and often the least understood. It is important because it's extremely volatile and because a huge proportion of transactions that happen are done with credit. In 2021, the national debt of India amounted to around $2.36$ trillion USD while the money in circulation in 2021 was only $0.341$ trillion USD. \n- [Statista - National Debt of India](https://www.statista.com/statistics/531619/national-debt-of-india/)\n- [Statista - Value of currency in circulation in India](https://www.statista.com/statistics/1028099/india-currency-in-circulation-value/)\n### How Is Credit Created?\nOut of thin air actually. Just like buyers and sellers make transactions in a market, so do **lenders** and **borrowers**. (Slight misconception involved here, will be cleared in the \"Common Misconceptions\" section).\n#### Lender and Borrower Dynamic\n##### Lender Incentive\nLenders want to make their money work for them and generate more profits. So they often give **loans** to borrowers who agree to pay back the loan with some **interest** over a period of time. Essentially, they give money to a borrower on an agreement that the borrower will return them the principal amount along with some extra interest within some period of time. This assures them profits as long as the borrower doesn't go bankrupt and **default** on the loan. \n##### Borrower Incentive\nThis depends. It is possible the borrower might want to buy something that they can't outright afford, like a new car or house. Or maybe they want it as an investment into funding a new business they might want to start. The credit (loan) allows them to stretch a big investment over a period of time (an EMI (Equated Monthly Installment), for example) or to fund a risky business idea that might generate profits, using which they can pay back the loan and still have made more profit than incurred losses in interest payback. You can also think of it this way: for an individual that takes a loan, he uses credit to smoothen his buying power over time. 
He knows he'll have more buying power in the future as his income will increase, so he wants to increase his buying power now itself, and not have it be a direct function of his exact current income.\n\nEssentially, this lending and borrowing of money agreement allows both borrowers and lenders to get what they want. **When such an agreement is made, credit is created**. However, there are some important factors that control credit creation.\n#### Factors That Influence This Dynamic\n##### Interest Rate\nAs mentioned previously, borrowers pay back the principal borrowed along with some extra interest. This is typically simple or compound interest. In either case, the extra amount repaid depends on a number called the **interest rate**. When **interest rates are high**, there is **less borrowing** because it's expensive. But when **interest rates are low**, **borrowing increases** because it's cheaper.\n##### Creditworthiness\nThe lender's risk comes from whether or not they believe the borrower will be able to pay back the principal in worst-case situations. For example, a lender would probably not reach this agreement with someone who wants to gamble all the borrowed money in a casino. Usually, borrowers have some private financial assets they can put down as **collateral** for the loan. In case they go bankrupt, the lender can rely on the borrower to sell the asset and repay the lender with that money. However, keep in mind that this is still **not** completely safe, as the price of financial assets can go up or down depending on the demand and supply for that asset. However, income and collateral are still widely used to judge the creditworthiness of individuals to decide how much credit can be entrusted to them. \n\nWhen any two people engage in this agreement, with a borrower promising to repay and a lender who believes in the borrower and gives him the money, **credit is created**. Yes, out of thin air. \n## Debt\nCredit goes by many names. Debt is like the twin brother of credit. As soon as credit is created, so is debt. Let's say a lender $A$ loans Rs. $x$ to a borrower $B$. Immediately, credit and debt are created. The lender receives a credit **asset** of value Rs. $x + I$, where $I$ is the interest to be paid, and the borrower receives a debt **liability** of value Rs. $x+I$.\n\n![pasted-image-20240307042628](/images/pasted-image-20240307042628.webp)\n### How Is Credit Destroyed?\nIn the future, when the borrower repays the loan, plus the interest amount, he completes the transaction and both the asset (credit) and liability (debt) associated with that loan disappear. \n## Common Misconceptions\nThe word credit is often used to describe many different things. In the context of this template, we will use the term credit to describe two things:\n1. **AN ASSET** - Here, we mean credit as essentially a bookkeeping entry for the lender that he expects to receive some $x + I$ amount by a future date. \n2. **EXTRA CAPITAL** - Credit can also refer to the 'extra' borrowed money that the borrower now has access to. This is mostly what we'll be referring to when we use the word credit in the future. This is important, because this is a major driving force of economic growth. Essentially, think of it this way: no new money is actually created, but the amount of money involved in transactions is now greater. Why? The money given by the lender is money that the lender would've **otherwise just let lie dormant as cash**. 
However, because he sees a net positive ROI on lending the money to the borrower, he gives away this money and then this money is **injected into the economy via the borrower who uses this extra capital to make transactions**. \n This extra capital is just money that the borrower borrows from his future self. However, by borrowing money and making transactions with it, he essentially allows this 'extra' money which would've otherwise not partaken in transactions to now help increase the expenditure in the economy at present. \n# The Operation of the Machine \nYou can probably figure out everything from scratch as long as you keep in mind this single mantra that pretty much dictates all the whys in the operation of this machine.\n\n\u003e **One person's spending is another person's income.** Every dollar you spend, someone else earns. And every dollar you earn, someone else has spent. So when you spend more, someone else earns more. When someone's income rises, it makes lenders more willing to lend him money because now he's more worthy of credit. ...\n\u003e So increased income allows increased borrowing, which allows increased spending. And since one person's spending is another person's income, this leads to more increased borrowing and so on. (Remember, the opposite also holds true). \n\n![pasted-image-20240307134703](/images/pasted-image-20240307134703.webp)\n## Productivity Growth\nRemember, in a transaction, we get income (money) for whatever goods / services / financial assets we offer in return. The price we can put on the sold quantity depends on the quality and usefulness of the quantity. Over a period of time, humans learn, innovate and evolve. In essence, an experienced professional is able to do quality work in a shorter period of time. Or a startup founder who has studied the market and worked on a solution for years is able to use that experience to develop new goods that solve difficult problems. We see further by standing on the shoulders of the giants of yesterday. Those that are hardworking and inventive increase their productivity and living standards faster over a long period of time. Productivity growth is stable, strong, predictable and grows slowly. Because it doesn't fluctuate much, it is **not a big driver of economic swings**.\n\n![pasted-image-20240307235115](/images/pasted-image-20240307235115.webp)\n## Short-Term Debt Cycle\nDebt, on the other hand, is far more short-term focused and occurs in cycles. When you take up debt, it allows you to **consume more than you produce**. And when you are paying back debt, you are forced to **consume less than you produce**. Debt swings, according to Dalio, occur in two big cycles. Short-term debt cycles repeat every $\approx 5-8$ years. These swings are primarily driven by how much credit is active in the economy. \n\n![pasted-image-20240308043010](/images/pasted-image-20240308043010.webp)\n\nLet's see how it works. These 'cycles', as described by Ray, occur because of our **human nature** to borrow money. Borrowing money is essentially just a way to manage buying power over a period of time. To spend more than you make today, you borrow from your future self. Essentially you create a time in the future when you need to spend less than you make in order to pay it back. One reason why you would do this is to smoothen your buying power over time. Remember that in the future, your income increases because you're more productive and you are likely to make more money. 
So it makes sense to you to borrow more from your future self as you know your future self will likely be making a lot more money. And getting access to more money early is a lot more profitable because if you're able to invest it well, you can generate a lot of money from it due to the simple [POWER OF COMPOUND INTEREST](/blog/power-of-compound-interest). \n\n![pasted-image-20240308043345](/images/pasted-image-20240308043345.webp)\n\nThe very act of borrowing creates a very **mechanical** and **predictable** series of events that will happen in the future. If you borrow now, you **must** pay it back later. This is just as true for an individual as it is for the economy as a whole. This series of events will always occur in a cycle. \n### Is Credit Always Good?\nNot really. If you borrowed credit just to spend more on liabilities and you do not increase your productivity or income growth in any significant way, it is bad. You now create a time in the future where you need to pay back this debt with the same income level you have today. It wasn't used to create any meaningful growth. \n\n\u003e\"_If you can't buy it twice_, _don't buy it_\"\n\n### Super-Charging the Economy\nWhen interest rates are low, a lot of people are encouraged to borrow, and this has the effect of supercharging incomes in an economy. This is how it works: let's say an individual $A$ earns $100,000\$$ a year and has no debt. He is creditworthy enough to borrow $10,000\$$. Now, let's say he spends all his money. His spending is now $110,000\$$ instead of just $100,000\$$. And applying the single most important principle used to describe this whole machine, _since one person's spending is another person's income_, somebody else makes $110,000\$$. This makes them creditworthy to borrow $11,000\$$. Then they spend $121,000\$$ (instead of just $100,000\$$), and again since their spending is somebody else's income... this process reinforces itself and causes living standards and income to rise for everyone!\n\n![pasted-image-20240308170223](/images/pasted-image-20240308170223.webp)\n\n
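To see the compounding effect of that toy example, here is a tiny simulation (my own sketch, not from Dalio's video); the 10% borrowing rate mirrors the numbers above.\n\n```cpp\n#include \u003ccstdio\u003e\n\n// Each person spends their income plus 10% borrowed on top, and that\n// spending becomes the next person's income.\nint main() {\n    double income = 100000.0;\n    for (int round = 1; round \u003c= 5; round++) {\n        double spending = income * 1.10; // income + 10% credit\n        printf(\"round %d: income %.0f -\u003e spending %.0f\\n\", round, income, spending);\n        income = spending; // one person's spending is another's income\n    }\n}\n```\n\n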
Since a lot of the money flowing in the economy is actually _credit_ (money that would've otherwise **not** been involved in transactions), people are encouraged to borrow more, since their increased income (itself a result of others spending borrowed money) makes them more creditworthy, and the pattern becomes self-reinforcing. This pattern, as discussed before, will be a cycle. And if the cycle goes up, it eventually must come down. \n### Phase 1 - Economic Expansion\nAs economic activity increases, there's an **economic expansion**. This is the **first phase** of the short-term debt cycle. People's incomes rise, and spending continues to increase. However, this rise in spending is not matched with an increase in the quantity and quality of goods produced. This is because the increase in quantity / quality of goods is associated with productivity growth, and that is a much slower and more stable force than the fast-paced credit cycle. Now because we have increased spending (demand) and the supply remains the same, prices of goods go up. This is called **inflation**. However, the central bank doesn't want too much inflation, because it causes problems. Seeing prices rise, it raises interest rates, launching us into the second phase of the short-term debt cycle. \n### Phase 2 - Economic Recession\nWith higher interest rates, borrowing becomes less appealing. Fewer people borrow money, and the cost of existing debts increases. This naturally causes humans to prioritize debt repayments more and borrow less. Put in this situation, the amount of money individuals have left, after paying debt repayments and **not** borrowing credit, is less than what it would've been in the previous phase. And *since one person's spending is another person's income*, people's incomes drop. This pattern is exactly the same as we had previously during the first phase, but in the opposite direction. Incomes drop, borrowing slows and the economy falls as the money spent in transactions drops sharply. And in contrast to before, the supply remains constant but the demand drops since people spend less. This causes the price of goods to go down; we call this **deflation**. Economic activity decreases and we have a **recession**.\n### Flipping the Switch\nRecession is clearly not good for the economy. Remember that the central bank increased interest rates on credit borrowing to help beat inflation. If they caused this to happen, they have the power to pull the economy out of a recession too. If the recession becomes too severe and inflation is no longer a problem, they can simply lower interest rates again and push the economy back into the first phase of the cycle. \n\nThe short-term debt cycle is primarily controlled by the central bank. When there is surplus credit flowing through the economy, there is expansion and people's incomes and living standards rise. When there is too little credit in the economy and spending slows, there's a recession. And it is the central bank which carefully varies interest rates on credit borrowing to keep this cycle going in a mechanical fashion. According to Ray, this cycle typically lasts 5-8 years and repeats itself over and over again for decades. \n## Long-Term Debt Cycle\nHowever, in practice, we will often see that the bottom and top of each cycle finish with more growth than the previous cycle **and more debt**. \n\n![pasted-image-20240308172045](/images/pasted-image-20240308172045.webp)\n\nThis occurs primarily due to **human nature / greed**. We as humans have an inclination to borrow and spend more instead of paying back debt. Humans push it too far. Paying back debt is boring and painful; spending more than you earn gives you more gratifying experiences. But this mountain of debt can't keep growing forever; we eventually reach a point where debts grow faster than incomes, and this creates the **long-term** debt cycle. \n\n### The Expansion Phase\nThe crazy thing is, despite borrowers' debt-burdens growing, lenders extend credit even more freely. This is because at any particular point along this curve, people focus only on what's been happening lately. Looking at a curve from too close up just shows you a straight line. When we are in the expansion phase of the long-term debt cycle, incomes are rising, financial asset prices are sky-rocketing, the stock market hits new peaks. Why? Because of all the credit flowing through the economy, a lot of credit is being used in transactions, causing all the aforementioned events to occur. It's an economic boom! When people do a lot of this, and most of it is actually using credit, we call this a **bubble**.\n\nEven though debts are growing, incomes have been growing nearly as fast to offset them. And more importantly, remember the factors that make someone creditworthy? Income levels and collateral. 
Because financial asset prices are sky-rocketing, people have more collateral to lay down and take debts on. Note that the value of these financial assets is not fixed, and depends on the supply and demand for said asset in the market. And even though the increased prices are mostly caused by the artificial demand induced by credit in the market, most people don't realize this as they're looking at just what's been happening lately. \n#### Debt Burden\nWe call the **ratio** of **debt** to **income** the **debt-burden**. $DB = \frac{debt}{income}$. As long as incomes rise, the debt-burden remains manageable. \n\nBecause people buy financial assets as investments, their prices rise even higher. People feel wealthy and remain creditworthy despite being under major debt. \n\n### The Recession Phase - The Deleveraging\nHowever, over decades, the debt-burden eventually begins to tip towards the unmanageable side. Incomes can't rise fast enough to match the debt-repayment amounts. Just like in the recession phase of the short-term debt cycle, people cut spending. This again becomes a self-reinforcing pattern like in the short-term debt cycle case, but with one **crucial difference**. \n\nIn a short-term debt cycle, there was a savior installed in place to flip the switch and get the economy up and growing again. The central bank could just flip the switch, lower interest rates and allow people to borrow credit again, increasing spending and creating a self-reinforcing pattern that boosts the economy. However, in a deleveraging, **interest rates are already at $0\%$**. The central bank cannot easily 'flip a switch' and fix the economy.\n\nDebt-burdens have simply become too big. In a deleveraging, people cut spending, the stock market crashes, prices of financial assets drop as demand for them subsides, social tensions rise and the whole imaginary bubble pops, leaving behind a total shit-show. Because people's income falls and debt repayments rise, borrowers are squeezed. They are no longer creditworthy, and borrowers **cannot borrow more to pay off debts**. This causes borrowers to go to their backup plan, the same backup plan the lenders trusted when giving borrowers their loans: collateral. Borrowers are forced to sell off their financial assets in order to complete their debt repayments. But this happens at a massive scale. Because the supply of financial assets suddenly soars at the same time as people cut their spending, the prices of these financial assets drop to much lower than they were previously, making borrowers even less creditworthy and still unable to pay off debts. \n\nPeople feel poor. Less spending -\u003e Less Income -\u003e Less Wealth -\u003e Less Credit -\u003e Less Borrowing -\u003e Less spending -\u003e ...\n\nLenders don't lend more money because they are still waiting for debt repayments that they're no longer sure the borrower can pay back, and they need all their money with them to survive the deleveraging. Borrowers can't borrow any more money and can't pay off their existing debts. The entire economy becomes uncreditworthy. \n\n#### How to Get Out of a Deleveraging?\nThere are 4 ways to reduce the debt burden. Some inflationary and some deflationary. They usually tend to happen in the following order:\n##### Cut Spending (↓)\nThis is usually the first and instinctive reaction of most people during a deleveraging. 
Because the news is filled with bad news about the economy and because they realize they have a lot of debts to pay and are unable to borrow, people cut spending in order to pay off their debts. However, *because one person's spending is another person's income*, this process is deflationary in nature. Demand drops, making asset prices and incomes drop. This has the opposite effect of helping because although debts might reduce, incomes also reduce. And often incomes fall faster than debts, causing the debt-burden to grow even worse.\n##### Debt Restructuring (↓)\nThe next call to action is usually debt restructuring. In this economic crisis, many borrowers find themselves **unable** to pay off their debts. And remember, a **borrower's debts are a lender's assets**. When borrowers are unable to repay the bank (lender) and default, people get nervous that the bank won't be able to repay them and rush to withdraw money from the bank. Banks get squeezed and banks default on their debts, often starting a chain of bank collapses because there is often a *lot* of money shared between banks. \n\n\u003eMuch of an economic depression is people discovering much of what they thought was their wealth isn't really there. \n\nIn such situations, lenders often agree to **debt restructuring**. This means lenders agree to restructure the terms of the original agreement on which the loan was taken. This could include changes where the interest is lowered, the amount to be paid back is reduced, or the time-frame over which the debt has to be paid back is increased. Lenders are forced to agree in these situations because the choice for them is essentially between \"all of nothing\" and \"half of something\". This is, however, still deflationary in nature because income and asset values fall as a result of debt restructuring. Money that could be injected into the system via transactions doesn't enter because the agreement no longer requires it to. \n##### Wealth Redistribution (↓)\nThe next option to help get out of a deleveraging is to redistribute wealth between the rich and the have-nots. Because people's income falls, the amount they pay in taxes falls, and this directly impacts the income of the central government. At the same time the government finds itself in a situation where it needs to increase its spending because unemployment has risen due to companies laying off workers, etc., leaving many people with inadequate savings to survive. This situation is called a **budget deficit**. A budget deficit occurs when expenditures surpass revenue, and it ends up impacting the financial health of a country. In such a situation, especially since wealth is typically concentrated in the hands of a small percentage of the population, governments raise taxes on the wealthy, which allows them to redistribute wealth in the economy from the rich to the have-nots. This naturally increases resentment between the rich and the poor and could lead to social disorder. This sort of situation led to Hitler coming to power, war in Europe, and depression in the United States.\n##### Print Money (↑)\nThe issue in the first place is that a lot of what people thought was their money didn't really exist. It was credit which was borrowed from their future self. While the central bank cannot lower interest rates further, like in the case of short-term debt cycles, it still has one final card up its sleeve. It can print money! Unlike the aforementioned methods, printing money is inflationary. 
This is because the act of printing money increases the amount of money in circulation (being used in transactions), thus increasing the income of people. And because _one person's spending is another person's income_, this causes incomes to rise and asset prices to increase. However, note that the central bank cannot do this alone. Because it can only buy financial assets (which still drives up asset prices) and bonds, it focuses on buying government bonds, essentially loaning money to the central government, which then uses this money and distributes it to the poor and helps combat the budget deficit. But note that this still **increases government debt**. \n\nFurther, note that printing money essentially lowers the worth / purchasing power of a single note. Policy makers need to carefully balance the inflationary ways with the deflationary ways to smoothen the debt cycle while keeping panic and social disorder at a minimum. Dalio terms a deleveraging where these forces are balanced perfectly a *beautiful* deleveraging. In a beautiful deleveraging, debts decline relative to income, real economic growth is positive and inflation is kept under control. \n\nThis isn't easy to balance. Printing money is an easy solution but it could lead to unacceptably high inflation by trashing the purchasing power of a rupee / dollar / *insert-currency-name-here* note. Because printing money is the only method that allows incomes to rise, it must be done. The only way to reduce the debt-burden is to increase incomes in the country relative to interest rates on debt repayments so that incomes are *just* high enough to outgrow debt payments. When this balance is achieved, it allows the country to grow out of the deleveraging phase of the long-term debt cycle in a less-dramatic and smooth fashion. \n\nEventually, incomes increase, people become creditworthy and we enter the expansion phase of the long-term debt cycle again. Dalio expects the deleveraging phase to take around a **decade**. \n\n# Rules for Handling Macro-Economics\n1. Don't have debts rise faster than income.\n\t-\u003e Fairly obvious, debt-burdens getting out of control is why we have depression phases of cycles. If we can handle it smoothly by increasing our future self's income (via increasing productivity), we would be dealing with a much smoother and less-dramatic entry back to the expansion phases. \n2. Don't have income rise faster than productivity.\n\t-\u003e If income rises faster than productivity, you become unmotivated and non-competitive, since you no longer have a reason to keep pushing yourself. \n3. Do all that you can to raise your productivity. \n\t-\u003e Fairly obvious again, raising productivity is what lets you increase income in the long run and lets you handle the debt repayments caused by your past self borrowing money from you on the trust that its future self would have much higher income and buying power. \n\nResources referred to:\n1. [How The Economic Machine Works by Ray Dalio](https://www.youtube.com/@principlesbyraydalio)\n2. [Investopedia](https://www.investopedia.com/)", - "lastmodified": "2026-01-11T05:05:58.568914823+05:30", + "lastmodified": "2026-01-11T10:59:11.712217985+05:30", "tags": [ "domain-finance-economics" ] @@ -216,7 +216,7 @@ "/blog/how-to-analyze-algorithms-proving-a-lower-bound-for-comparison-based-sorting": { "title": "How to Analyze Algorithms? 
Proving a Lower Bound for Comparison Based Sorting", "content": "# Analysis\nBefore we can attempt to solve problems, we must answer 3 important questions about the algorithm.\n## Is it correct?\nThis is of the highest importance. An algorithm is simply _wrong_ if it does not produce the correct output for all inputs. How do we prove correctness? We cannot simply rely on test cases, this isn't \"proof\" enough as discussed in [Are there computational problems that computers cannot solve?](/blog/are-there-computational-problems-that-computers-cannot-solve). There may be many edge cases it can fail on that remain undiscovered. An inductive proof is far superior to \"proof by AC.\" Only in the case of highly complicated algorithms can we rely on test cases (as it gets more complicated to prove correctness mathematically).\n\nThere are also probabilistic algorithms that give the correct output _most_ of the time. However, we will cover probabilistic algorithms at a later point in time and focus on deterministic algorithms for now.\n## How much time?\nHow much time (or any other resource) does it require to compute the solution? Time complexity is usually expressed as a function of input size, for a fixed underlying model of computation. Knowing this is important in knowing when and how to feasibly execute an algorithm. If we know an algorithm is $O(2^n)$ there's no point executing it for a very large $n$ as it would take more than our lifetime to compute. \nNote that for computing time complexity, it is very important to fix the underlying model of computation. On a single tape Turing machine, each bit in the number corresponds to a unit of computation. On the other hand, if we think of a modern machine with 64-bit registers and we restrict our integer inputs to $\\leq 2^{64}$ then we can model arithmetic as constant time operations. But for a single tape Turing machine, addition would require $O(n)$ time. \n## Can we do better?\nWe have to see if we can do better, and if not, we must prove the optimality of our current solution. But how do you prove the optimality of a solution? The number of possible \"programs\" as discussed in [Are there computational problems that computers cannot solve?](/blog/are-there-computational-problems-that-computers-cannot-solve) is countable, but it is still infinite. We cannot attempt to go through all of them individually. To solve this issue, we need to instead build an image of the capabilities of a given model of computation and enforce constraints that allow us to come up with a proof for the same. Remember, time complexity is defined for a fixed model of computation. So we use that as the base and draw implications from there. For example, we know that the lower bound for comparison based sorting is $\\Omega(nlog_2(n))$ in the number of comparisons made. How? Let's discuss how we prove this.\n### Proving a lower bound of $\\Omega(nlog_2(n))$ for comparison based sorting\n#### Decision Tree Model\nTo prove this bound, let's come up with a useful model to visualize / enumerate the comparisons made by an arbitrary sorting algorithm. Any comparison based sorting algorithm must perform some number of comparisons to end up with a result. Further, which comparison it makes on the $i^{th}$ iteration depends on the result of the $(i-1)^{th}$ iteration. In short, we can imagine the comparisons made to be a decision tree. Here's what the decision tree would look like for sorting an array of size three. 
\n\n```mermaid\ngraph TD\n B{A \u003c B?}\n B --\u003e|Yes| C{B \u003c C?}\n C --\u003e|Yes| D[A, B, C]\n C --\u003e|No| E{A \u003c C?}\n E --\u003e|Yes| F[A, C, B]\n E --\u003e|No| G[C, A, B]\n B --\u003e|No| H{A \u003c C?}\n H --\u003e|Yes| I[B, A, C]\n H --\u003e|No| L{B \u003c C?}\n L --\u003e|Yes| M[B, C, A]\n L --\u003e|No| N[C, B, A]\n```\nNote that every non-leaf node in this diagram represents a comparison, and every leaf-node represents one of the resultant permutations generated by the sort. Every **path** from the root to a leaf node represents a sequence of execution of the sort. No matter in what order the sorting algorithm does the comparisons, it needs to take one of these paths to end up at a leaf node. And since we have this nice model, we can say that the worst case number of comparisons performed will be the height of the tree. If we can prove a lower bound for the height of the tree, we have proved the lower bound for comparison based sorting algorithms. \n\nNote that for any array of size $n$ (containing unique numbers), there exist $n!$ different ways its elements can be ordered. Every permutation of $1 \\cdots n$ can be an input. This implies that there must be $n!$ leaf nodes, corresponding to every such input permutation. If there are fewer than $n!$ leaf nodes, we would not be able to solve each of the $n!$ inputs correctly since the solution for that input would not exist in the decision tree. Second, note that the decision tree is a **binary tree**, which implies that if the tree has height $h$, it can have at most $2^h$ leaf nodes. From this, we get\n$$\n\\begin{aligned}\n2^h \\geq n! \\\\\nh \\geq log_2(n!) = log_2(n\\cdot(n-1)\\cdot(n-2)\\cdots1) \\\\ \nh \\geq \\sum_{i=1}^nlog_2(i) = \\sum_{i=1}^{\\frac{n}{2}-1}log_2(i) + \\sum_{i=\\frac{n}{2}}^nlog_2(i) \\\\\nh \\geq 0 + \\sum_{i=\\frac{n}{2}}^nlog_2(i) \\geq \\frac{n}{2} \\cdot log_2(\\frac{n}{2}) \\\\\n\\implies h = \\Omega(nlog_2(n))\n\\end{aligned}\n$$\n### Alternate method\nFor any given sequence of $n$ elements, there exist $n!$ permutations in which they can be arranged. The sorting problem requires us to find one such permutation out of all $n!$ permutations such that $a_i \u003c a_{i+1} \\ \\forall \\ i \u003c n$ where $A = (a_1, \\cdots, a_n)$ is our sequence.\n\nLet's suppose that we can find the sorted array using just $k$ comparisons. This would imply that we can represent permutations $0$ to $(n! - 1)$ using just $k$ bits. To distinguish between $n!$ different values, we need $log_2(n!)$ bits, therefore $k \\geq log_2(n!)$ bits are required.\n\n\u003e **Why is this true?** We assumed that our algorithm will be able to find the answer to the sorting problem using just $k$ comparisons. If we consider the result of a single comparison, it can have **exactly** 2 values. True or false. If we consider $k$ comparisons, we can have $2^k$ possible results. In essence, each comparison is able to cut the set of permutations we need to consider in **half**. Each of those $2^k$ results can be used to uniquely identify some permutation. Now, the answer to the sorting problem must be one of the $n!$ possible permutations. For our algorithm to successfully find the right permutation for every given input, it must be able to _at_ _least_ uniquely identify each of the $n!$ possible permutations. This leads us to the conclusion that $2^k \\geq n! \\implies k \\geq log_2(n!)$ bits/operations are required.\n\n$$ \nlog(n!) 
= log(n)+log(n-1)+\\dots+ log(\\frac{n}{2})+\\dots+log(2)+log(1)\n$$\n\nLet's discard the bottom half of this sum. The sum we get after discarding the bottom half must be **at most** the original sum, as all the discarded values are **non-negative.** This gives the following,\n\n$$ log(n!)\\geq log(\\frac{n}{2})+log(\\frac{n}{2}+1)+\\dots+log(n-1)+log(n) $$\n\nBy reducing all the terms inside $log$ to just $\\frac{n}{2}$, we can only get a sum **smaller** than the above, as $log$ is an increasing function. This gives,\n\n$$ log(n!) \\geq log(\\frac{n}{2})+\\dots+log(\\frac{n}{2}) \\\\ log(n!) \\geq \\frac{n}{2}log(\\frac{n}{2}) = \\Omega(nlogn) $$\n\nIt is also not very difficult to find an **upper bound** to $log(n!)$. Similar to the calculation above,\n\n$$ log(n!) = log(n)+log(n-1)+\\dots+log(2)+log(1) $$\n\nReplacing each of the above terms with $log(n)$ will only give us a sum greater than the above one as $log$ is an increasing function. This gives,\n\n$$ log(n!) \\leq log(n)+\\dots+log(n) \\\\ log(n!) \\leq nlog(n) = O(nlogn) $$\n\nCombining the two bounds, $log(n!) = \\Theta(nlogn)$.\n# References\nThese notes are old and I did not rigorously hoard references back then. If some part of this content is yours or you know where it's from then do reach out to me and I'll update it. \n1. Professor [Kannan Srinathan's](https://www.iiit.ac.in/people/faculty/srinathan/) course on Algorithm Analysis \u0026 Design in IIIT-H\n2. [CLRS, Introduction to Algorithms](https://en.wikipedia.org/wiki/Introduction_to_Algorithms)", - "lastmodified": "2026-01-11T05:05:58.578904075+05:30", + "lastmodified": "2026-01-11T10:59:11.720711506+05:30", "tags": [ "domain-cs-algorithms-analysis" ] @@ -224,7 +224,7 @@ "/blog/in-search-of-an-understandable-consensus-algorithm-extended-version-raft": { "title": "\"In Search of an Understandable Consensus Algorithm (Extended Version)\" - RAFT", "content": "# Abstract\nThese notes are taken from my reading of the original paper, [In Search of an Understandable Consensus Algorithm (Extended Version)](https://raft.github.io/raft.pdf) by [Diego Ongaro](https://scholar.google.com/citations?user=oBe2P2EAAAAJ\u0026hl=en) and [John Ousterhout](https://scholar.google.com/citations?user=nDH-AqwAAAAJ\u0026hl=en), a video lecture by Diego on YouTube: [Designing for Understandability: The Raft Consensus Algorithm](https://www.youtube.com/watch?v=vYp4LYbnnW8) and another by [Core Dump](https://www.youtube.com/@core_dump): [Understand RAFT without breaking your brain](https://www.youtube.com/@core_dump). \n\n\u003e **Abstract**\n\u003e Raft is a consensus algorithm for managing a replicated log. It produces a result equivalent to (multi-)Paxos, and it is as efficient as Paxos, but its structure is different from Paxos; this makes Raft more understandable than Paxos and also provides a better foundation for building practical systems. In order to enhance understandability, Raft separates the key elements of consensus, such as leader election, log replication, and safety, and it enforces a stronger degree of coherency to reduce the number of states that must be considered. Results from a user study demonstrate that Raft is easier for students to learn than Paxos. Raft also includes a new mechanism for changing the cluster membership, which uses overlapping majorities to guarantee safety.\n\nSo if you've read that, you'll realize that Raft was preceded by Paxos, the first popular consensus algorithm. A natural question to ask is why I'm reading / covering RAFT before Paxos. 
I'm doing this primarily on the strength of this tiny \"study\" conducted by Diego and John. People were taught Paxos \u0026 Raft in differing orders and then made to take quizzes of \"comparably equal\" difficulty. You'll notice that Raft is comparably easier (according to the participants) to implement / explain \u0026 the test results do seem to show higher \"understanding\" of RAFT. Regardless, you'll also notice that people who were taught Paxos, and then Raft, did statistically significantly worse in both tests :) I personally am inclined to believe this could be because of an artifact / bias in the testing process but it is interesting. Anyways, on to Raft.\n\n![pasted-image-20240904182856](/images/pasted-image-20240904182856.webp)\n\n# Designing Algorithms for Understandability\nIn the field of algorithms research, the two most common and important criteria used for evaluation are correctness and efficiency. In fact, that's more or less what I wrote about a year back in [How to Analyze Algorithms? Proving a Lower Bound for Comparison Based Sorting](/blog/how-to-analyze-algorithms-proving-a-lower-bound-for-comparison-based-sorting) as well. However, Diego claims that a crucial yet frequently overlooked aspect of evaluation is the algorithm's *understandability*. Most people interested in / researching algorithms (including me) often tend to judge how \"intelligent\" or \"great\" an algorithm is based on how difficult \u0026 complex it is. Complex algorithms are harder to understand, and we attribute more 'respect' to them, but an algorithm that is a lot more 'understandable' is often considered \"inferior.\" \n\nBut the true 'value' or impact of an algorithm is often in its clarity and ease of understanding. When we want to actually move from theory to practice, the ability to implement and adapt an algorithm is crucial. Especially in the field of distributed systems, reasoning about correctness of algorithms is very difficult. Adding complexity here spawns various branches and adapted versions of the same algorithm \u0026 makes it difficult to implement, greatly impacting the ability of academia to come to an agreement on the 'best' version and also on how much impact the algorithm can have in the real world. \n\nRaft was a great example of the above. The paper was rejected 3 times at major conferences before it was finally published in USENIX ATC 2014. \n- \"Understandability\" was hard to evaluate\n- Reviewers at conferences were uncomfortable with understandability as a metric\n- Complexity impressed reviewers\nHowever, on the adoption side:\n- 25 implementations were already in the wild before the paper was even published\n- It was taught at MIT, Stanford, Harvard, etc. in graduate level OSN classes before the paper was published. Kind of ironic that the same people thought teaching it to students was a great idea but didn't think it was \"good enough\" to be accepted in conferences.\n- Today, RAFT is the go-to algorithm of sorts for most distributed consensus problems, forming the base for sharded and replicated database systems like [TiDB - A Raft-based HTAP Database](/blog/tidb-a-raft-based-htap-database).\n## Consensus Algorithms: An Overview\nConsensus algorithms are fundamental in distributed systems as they enable a collection of machines to operate cohesively despite individual failures. They ensure that a group of unreliable machines can function as a single reliable entity. 
The Paxos algorithm, developed by Leslie Lamport in the late 1980s, has long been the gold standard for consensus algorithms.\n\n**Paxos** works by agreeing on a single value through a two-phase process involving proposers and acceptors. It is proven to be theoretically robust, but the proof is extremely difficult to understand. This also means it is very difficult to implement in practical scenarios. The complexity increases when you extend Paxos to manage replicated logs, which is essential for ensuring that all machines maintain a consistent view of the system's state.\n### The Challenges with Paxos\n\u003e We hypothesize that Paxos’ opaqueness derives from its choice of the single-decree subset as its foundation. Single-decree Paxos is dense and subtle: it is divided into two stages that do not have simple intuitive explanations and cannot be understood independently. Because of this, it is difficult to develop intuitions about why the single-decree protocol works. The composition rules for multi-Paxos add significant additional complexity and subtlety. We believe that the overall problem of reaching consensus on multiple decisions (i.e., a log instead of a single entry) can be decomposed in other ways that are more direct and obvious.\n\u003e - [In Search of an Understandable Consensus Algorithm (Extended Version)](https://raft.github.io/raft.pdf)\n\n1. **Complexity in Understanding**: Paxos, while theoretically sound, is challenging to grasp due to its intricate processes and numerous edge cases. The basic algorithm involves selecting a proposal number, handling responses from acceptors, and ensuring consistency across distributed nodes. This complexity often leaves practitioners struggling to understand why and how Paxos works, which hampers its practical application.\n2. **Scalability and Practicality**: Paxos's basic form addresses only a single value agreement and does not inherently cover the full range of issues needed for building a replicated log. Extending Paxos to handle multiple values, log replication, and system failures introduces additional complexity and can lead to inefficiencies and inconsistencies.\n3. **Lack of Agreement on Solutions**: Various enhancements and adaptations of Paxos have been proposed (e.g., Paxos Made Simple, Paxos Made Practical), but there is no consensus on the best approach. This fragmented understanding contributes to the difficulty in implementing Paxos effectively.\n### Properties of Consensus Algorithms\n- **Safety:** They guarantee safety under non-Byzantine conditions (e.g., network delays, packet loss).\n- **Availability:** They remain functional as long as a majority of servers are operational. For example, in a five-server cluster, any two can fail. Servers fail by stopping but can recover and rejoin from stable storage.\n- Timing issues (faulty clocks, delays) may affect availability, but not consistency.\n- Generally, a command completes when a majority respond to one round of calls, so a minority of slow servers doesn’t impact the performance of the entire cluster.\n# The RAFT Algorithm\n## Designing For Understandability\nRAFT was built from the ground up keeping understandability as its north star. \"Understandability\" is obviously a subjective term and one that is difficult to quantify and analyze. However, the authors attempted this challenge by using two techniques that are \"generally\" acceptable.\n\n1. 
**Problem Decomposition** - Wherever possible, divide the problem into separate subproblems that can be solved independently. As a competitive programmer, I can definitely agree that this is a popular implicit technique most competitive programmers use for problem solving. Take a problem, break it into individual components that you can solve independently, and link the output of one sub-problem to the input of another. Instead of solving a single complex problem, we can now easily reason about the solution(s) of the smaller \u0026 simpler components and then understand how they link to one another to understand the complete solution.\n2. **Reduction of State Space** - Again, one of the things a problem solver tries to do is always reduce the number of cases to consider. When constructing a solution for a problem, you always want to try to apply operations to the problem which try to combine multiple output scenarios into a single scenario that you can handle uniformly. In short, if the state space is large, you will end up with too many individual components to solve \u0026 link back together, making the solution extremely difficult to understand. On the other hand, if you can merge multiple scenarios into one, the number of components to solve \u0026 link reduces, making the solution much easier to understand.\n\nThe authors use these two techniques extensively whenever faced with design decisions for their algorithm, and that is how they came up with the final RAFT algorithm.\n## Novel Features\nRAFT is similar to many existing consensus algorithms, but it also has several novel features.\n\n1. **Strong leader:** Raft uses a stronger form of leadership than other consensus algorithms. For example, log entries only flow from the leader to other servers. This simplifies the management of the replicated log and makes Raft easier to understand. This is similar to how [The Google File System](/blog/the-google-file-system) handles write operations.\n2. **Leader election:** Raft uses randomized timers to elect leaders. This adds only a small amount of mechanism to the heartbeats already required for any consensus algorithm, while resolving conflicts simply and rapidly. This is one of the situations in the paper where the idea of \"Reduction of State Space\" makes them choose an unconventional idea as the main tie-breaking mechanism for their election. This might be non-deterministic, but it is incredibly useful to reduce the state space \u0026 simplify the solution.\n3. **Membership changes:** Raft’s mechanism for changing the set of servers in the cluster uses a new joint consensus approach where the majorities of two different configurations overlap during transitions. This allows the cluster to continue operating normally during configuration changes.\n## The Algorithm\nAs mentioned previously, RAFT works by decomposing the log consensus algorithm into smaller components that are solved independently and then linked together. RAFT works by first electing a distinguished *leader* who is responsible for managing the replicated log. All writes go to the leader, who writes to its log and then replicates the same on all other servers. When the leader fails (or is disconnected from other servers), a new leader is elected. 
RAFT can be divided into the following phases:\n- **Leader election:** a new leader must be chosen when an existing leader fails\n- **Log replication:** the leader must accept log entries from clients and replicate them across the cluster, forcing the other logs to agree with its own\n- **Safety:** the key safety property for Raft is the State Machine Safety Property: if any server has applied a particular log entry to its state machine, then no other server may apply a different command for the same log index. The solution involves an additional restriction on the election mechanism.\n### Basics\n#### Server States\nServers in Raft operate in one of three states: \n- **Leader**: The active server managing log entries and communication with followers.\n- **Follower**: Passive servers that wait for instructions from the leader. If a client contacts a follower, the follower redirects it to the leader.\n- **Candidate**: A server that becomes active when it times out and tries to become a leader.\n#### Terms\n![pasted-image-20240923005022](/images/pasted-image-20240923005022.webp)\n\nTerms are time intervals of arbitrary length. A term begins with an *election*. Elections occur until a candidate is chosen as leader. There is at most one leader for a given term.\n\n\u003eDifferent servers may observe the transitions between terms at different times, and in some situations a server may not observe an election or even entire terms. Terms act as a logical clock in Raft, and they allow servers to detect obsolete information such as stale leaders. Each server stores a current term number, which increases monotonically over time. Current terms are exchanged whenever servers communicate; if one server’s current term is smaller than the other’s, then it updates its current term to the larger value. If a candidate or leader discovers that its term is out of date, it immediately reverts to follower state. If a server receives a request with a stale term number, it rejects the request.\n\u003e - [In Search of an Understandable Consensus Algorithm (Extended Version)](https://raft.github.io/raft.pdf)\n#### RPCs\nRAFT assumes that the servers communicate with one another through RPCs (remote procedure calls). RAFT only requires servers to support two unique RPCs to work. This is another example of how \"simple\" RAFT is. \n- **RequestVote RPCs** -\u003e These are initiated by candidates during elections to request votes\n- **AppendEntries RPCs** -\u003e These are sent from the leader to the follower servers to replicate log-entries. It also has an additional role of serving as a heartbeat message to the follower servers. (An AppendEntries RPC with an empty entry is considered a heartbeat message)\nFinally, if we want to also allow for transferring snapshots between servers, we need a third RPC (InstallSnapshot).\nServers will retry RPCs if they do not receive a response in a timely manner, and they issue RPCs in parallel for best performance.\n### Leader Election\nRAFT relies on a heartbeat mechanism to trigger leader elections. The key idea here is using randomization to introduce non-determinism into the election procedure. This technique allows RAFT to drastically reduce the size of the resultant state space, greatly simplifying the solution and making it easier to understand intuitively. On startup, all servers start out as followers. Followers will continuously listen for the `AppendEntries` RPC from the leader in a timely manner. What do I mean by timely manner? \n\nLet's say we have servers $s_1, s_2, \\cdots, s_n$. 
Each server is assigned a randomly picked timeout value $t_1, t_2, \\cdots, t_n$. This value $t_i$ is called the server $i$'s *election timeout*. If a follower $i$ receives no communication from the leader for an entire time period of $t_i$, it assumes that there is no viable leader and begins a new election. So in short, each server has a ticking timer that starts from $t_i$ and ticks to 0. Let's work out the possible states.\n\n**State Space:**\n1. **Success:** Server $s_i$ receives communication from leader within the time period $t_i$, i.e., before its ticking timer hits 0. $s_i$ immediately resets its timer back to $t_i$ and restarts the countdown. If the communication from the leader contained an action, it will respond accordingly. (Actions are described below under \"Log Replication.\")\n2. **Probable Failure:** Server $s_i$ does not receive any communication from the leader before its timer hits 0. This could've occurred due to several possibilities:\n\n\t- The leader crashed -\u003e No communication was sent to $s_i$. We need to start an election to get a new leader.\n\t- There was a network partition -\u003e If there is a network partition separating the leader from its followers, this is as good as not having a leader (leader crash). Server $s_i$ may be a part of the minority or majority of the network partition. In either case, it needs to start an election. If it's in the minority, a leader will never be chosen. If it's in the majority, a leader is chosen and the majority side of the partition continues to work as normal.\n\t- There was severe message delivery latency between leader \u0026 follower -\u003e This is likely a sign of an unhealthy leader or unhealthy network layer. If the leader's outgoing message latencies are high, the leader is in an unhealthy state and should be replaced, i.e., we need to start an election. On the other hand, if the network layer for messages delivered between all servers has degraded, we would either need to fix the network layer or increase the lower bound on the randomized timeout periods assigned to the servers. \n\t\n\tAs we can see, in all possible root-cause states, the follower timing out before a heartbeat was received implies that we must start a new election to vote for a new leader and solve the problem. The network layer degrading is not very common and signifies bigger problems for the entire deployment as a whole. But we can easily modify the algorithm to just increase the lower bound on the randomly assigned $t_i$ for each server if a spike is observed in the number of elections conducted to solve this issue.\n\nAlright, so from the initial state, one of the servers must time out before all the others (due to randomized $t_i$ values, the chance of two independent servers picking the same $t_i$ is relatively low). This server will immediately start an election by **incrementing its term** and sending the **Request Vote RPC** in parallel to all the other servers. A candidate continues in this state until one of three things happens: \n\n- It wins the election\n- Another server wins the election and establishes itself as leader\n- A period of time goes by with no winner.\n\nBefore we discuss these three outcome states, let us discuss how servers vote for other servers. Each server will vote for at most one candidate in a given term, on a first-come-first-served basis. That is, the servers which time out first will vote for themselves and send Request Vote RPCs to all the other servers. 
If a server receives a Request Vote RPC before timing out, it will immediately vote for the first candidate it receives a Request Vote RPC from in that term. Let's analyze the state space of an individual server and see how it handles the three possible situations.\n\n**State Space:**\n1. **One Candidate Wins:** This is the positive path. With high probability, one of the servers that time out the earliest will send RequestVote RPCs first to the majority of the other servers and collect votes from all of them. If a candidate receives a majority of the votes, it promotes itself to leader and begins issuing heartbeat Append Entries RPCs to the rest of the servers to establish itself as leader and prevent further elections.\n2. **Another Candidate Wins:** This server timed out, became a candidate \u0026 sent out Request Vote RPCs. However, another server also timed out in a close interval and due to variance in message communication latency, was able to receive more votes from the remaining servers. In this case, the server that won the election will promote itself to leader and send out heartbeat RPCs to this candidate server. To verify if it was beaten in the election, it only has to check if the heartbeat message received was from a server with a term greater than or equal to its own. If it is, then the leader is legitimate and the candidate immediately transitions into a follower state. If the term is smaller, the RPC is rejected. (Likely it is from a server that is far behind in sync due to performance issues or crash-recovery).\n3. **Split-Vote:** It is possible that multiple followers became candidates at roughly the same time, and were able to all win a significant portion of the votes due to message latency. In this case, no candidate server would've received sufficient votes to become a leader \u0026 thus all the candidate servers would be waiting in limbo indefinitely. To solve this issue, each candidate again has a randomized timeout (can be the same $t_i$) that is used to time out and start a new election. When this happens, the server increments its term again. **This is crucial.** \n\nIt is important to understand that in the split-vote case, the solution reduces back to the original problem of missing a leader and requiring a new election only because of the term increment step. If this is not done, a server could end up voting for multiple candidates in the same term, \u0026 it also wouldn't solve the no-majority issue. But incrementing the term allows the problem to be reduced back to the original no-leader case. If the split-vote case happened frequently, we would be stuck in an indefinite loop of starting elections and voting for leaders and not have any compute available to process action requests. Therefore, it is important to ensure that the split-vote case happens rarely. \n\n\u003eRaft uses randomized election timeouts to ensure that split votes are rare and that they are resolved quickly. To prevent split votes in the first place, election timeouts are chosen randomly from a fixed interval (e.g., 150–300ms). This spreads out the servers so that in most cases only a single server will time out; it wins the election and sends heartbeats before any other servers time out.\n\u003e- [In Search of an Understandable Consensus Algorithm (Extended Version)](https://raft.github.io/raft.pdf)\n\n**Note:** This is a *slightly* simpler version of the actual leader election implemented / described by the RAFT algorithm. 
In practice, this solution works for electing a leader correctly. However, when we try to use RAFT to maintain a consistent log across multiple servers we will need to add another condition to make it work. Again, staying true to the original idea of solving subproblems independently, the above algorithm is a complete solution for the leader election problem. We will address the problem of maintaining a consistent log under the following section on \"Log Replication\" as a minor tweak to leader election.\n#### Visual Demo\nThe visualization tool on the [Raft Website](https://raft.github.io/) does an incredible job of explaining the algorithm visually. Here are a few pictures demonstrating it at work, although I highly recommend actually playing around with it to get a better intuition for the state space \u0026 its transitions. \n\n1. Let's assume all the servers are on their second term and S3 is the leader. Normal operation would look like this:\n\t \n\t![pasted-image-20240923014316](/images/pasted-image-20240923014316.webp)\n\t\n\tThe circles around each server show what stage of the timeout they are at. The orange circles represent the parallel heartbeat RPC communication between leader S3 and the rest of the servers. Now let's say we crash S3. \n\t\n2. Without the heartbeat, the other servers eventually time out.\n\t \n\t![pasted-image-20240923014440](/images/pasted-image-20240923014440.webp)\n\t\n\tWe can clearly see that S2 is the closest to timing out here. Therefore, with high probability, S2 times out first and manages to increase its term and send the Request Vote RPC first to the majority of the other servers. These servers then see that S2 has a higher term and since it is the first Request Vote RPC they have received for this term, vote for S2. S2 receives the majority of the votes and becomes a leader. It then starts issuing heartbeats to the other servers and acts like a normal leader.\n\t\n\t![pasted-image-20240923014708](/images/pasted-image-20240923014708.webp)\n\n### Log Replication\nIn short, the leader receives client commands, appends them to its log, and replicates them to follower logs. Once a command is safely replicated, it is executed by all servers. The RAFT log can be used to simulate a consistent state across any set of finite state machines. You can think of each entry $e_{t_i}$ in the log as an input which triggers a transition from the server state $s_i$ to some state $s_j$ based on the log entry at time $t_i$. Inductively, as long as the servers all started with the same initial state and as long as all the log entries before time $t_i$ match between all the servers, all the servers at time $t_i$ would be on the exact same FSM state.\n\nLogs are organized as shown below:\n\n![pasted-image-20240923031234](/images/pasted-image-20240923031234.webp)\n- [In Search of an Understandable Consensus Algorithm (Extended Version)](https://raft.github.io/raft.pdf)\n#### Short Description of Log Replication\nNote that log entries have two values associated with them, an `index` and a `term`. These two metadata integers are sufficient for RAFT to replicate logs correctly. As mentioned previously, all client interactions are modeled as \"append a write / transition entry\" requests to the leader node. When the leader node gets a new log entry from a client, it performs something similar to a 2-phase commit. Upon receiving the entry, it immediately broadcasts an Append Entries RPC to all the other nodes in the cluster. 
These nodes respond to the leader once they have appended the entry to their logs. Once the leader observes that a majority of the nodes in the cluster have appended the log, it applies the log entry operation to its own FSM and issues a **commit** to all the other followers. On receiving a commit, the follower nodes also apply the log entry operation to their own FSM. \n\nHowever, due to leader failures, randomization in elections and variance in message communication latency, it is possible that different nodes can have differing logs. It is the leader's responsibility to ensure that all its followers eventually reach a log state that is consistent with its own log. To achieve this, when a leader sends a log entry to a follower, it also sends with it the index and term of the previous entry in its log. If the follower's previous log entry does not match the one described by the leader, then the follower deletes the last entry in its log and the RPC is rejected. Then the leader retries appending its previous entry to the follower until an RPC succeeds. After this, the leader can simply append the rest of the entries one by one as they are guaranteed to match. And obviously,\n\n\u003eIf desired, the protocol can be optimized to reduce the number of rejected AppendEntries RPCs. For example, when rejecting an AppendEntries request, the follower can include the term of the conflicting entry and the first index it stores for that term. With this information, the leader can decrement nextIndex to bypass all of the conflicting entries in that term; one AppendEntries RPC will be required for each term with conflicting entries, rather than one RPC per entry. In practice, we doubt this optimization is necessary, since failures happen infrequently and it is unlikely that there will be many inconsistent entries.\n\u003e- [In Search of an Understandable Consensus Algorithm (Extended Version)](https://raft.github.io/raft.pdf)\n\nNow, we also introduce the one modification we said we would place on leader election previously. A follower node votes for a candidate only if:\n- The candidate node's last log entry has a higher term than the last log entry of the follower node\n- OR the candidate node's last log entry is on the same term as the last log entry of the follower node and the candidate node's log is at least as long as that of the follower\nNote that we are talking about the log here, including committed and non-committed entries. \n\nNow as an exercise, I would recommend trying to prove that this construction ensures the safety \u0026 availability requirements we mentioned earlier.\n#### Intuitively Attempting to Prove the Construction Works\nLet's try to intuitively prove that the above construction works. Let's note a couple of useful features of RAFT. \n- **A term and leader are basically equivalent** -\u003e This directly follows from the leader election construction. Only one node can have the majority in an election \u0026 therefore if it wins, that term is associated with the winner node and only that node. In case of split-vote, a new election is held. Note that each election has its own term. \n- **If two log entries have the same term number \u0026 index, they must be equivalent** -\u003e Also directly follows from the above claim. Note that in any given term, there is exactly one leader who could have issued Append Entry RPC calls. And for a given leader, it cannot issue Append Entry for the same entry multiple times with distinct indices. 
Therefore any two log entries with the same term number were emitted by the same leader, and therefore, if they have the same index, the contents must also be the same.\n- **If two log entries have the same term number \u0026 index, then the prefixes of both the logs until that entry are equivalent** -\u003e This can be proven by a recursive argument. We are given two logs $L_1$ and $L_2$. Let's say the entry at index $i$ for both $L_1$ and $L_2$ has term $t_i$. From the previous statement, these entries are equivalent. Let us suppose that the leader for the term $t_i$ was $S$. Note that entry $i$ would've only been appended by $S$ to some log $L$ if the $(i-1)^{th}$ entry of $L$ was equivalent to the $(i-1)^{th}$ entry of $S$. Therefore, if the $i^{th}$ entry of $L_1$ and $L_2$ match, we know that their $(i-1)^{th}$ entries must also be equivalent. This follows inductively down to the base case, where both logs start from the empty state.\n- **The leader will never have a log that does not match the logs committed by the majority:** This is slightly harder to see. But if this is not true, the servers can all end up with matching logs but mismatched FSM state. What do I mean by this? Consider this example from the paper:\n\t\n\t![pasted-image-20240923103722](/images/pasted-image-20240923103722.webp)\n\t- [In Search of an Understandable Consensus Algorithm (Extended Version)](https://raft.github.io/raft.pdf)\n\t\n\tLet's suppose the leader crashes and $(f)$ becomes the leader again. Only this time, it is able to send messages to the rest of the servers. Also let us assume that the leaders for terms 4, 5 \u0026 6 were able to *commit* their entries since they had majority entry append success. In this scenario, the leader, $(a), (c) \\ \\\u0026 \\ (d)$ have committed the entries in terms 4, 5 \u0026 6. Therefore, they have also applied these operations to their FSM state. Now, if $(f)$ becomes the leader, it will begin to force the follower nodes to copy its own log. This means the leader, $(a), (c) \\ \\\u0026 \\ (d)$ will pop off committed logs and apply logs from terms 2 \u0026 3 on top of the operations from 4, 5 \u0026 6, thus causing the servers to have inconsistent state with each other. \n\t\n\tNote that this error occurs only because the leader that was elected did not have a log that matched the state committed by the majority of the servers in the cluster. We claim that such a situation can **never** happen. The proof / intuition for this comes from the special modification we made to the leader election algorithm. Let's suppose a log $L$ was committed in the majority of the servers in the pool, but node $S$ never received it. Now, node $S$ was voted the leader for a term, i.e., node $S$ acquired a majority of votes from the pool of servers. We can apply the pigeonhole principle here to prove that if both of these statements are true, then there **must** exist a server in the set of servers who voted for $S$ which also has the log $L$ that was committed in the majority of the servers in the pool. Furthermore, we choose $S$ such that it is the **first** leader (smallest term $T$) that satisfies this property. \n\t\n\tNow, this server $F$ would've only voted for $S$ if:\n\t- (a) The last log entries of both $S$ and $F$ share the same term, and the length of $S$'s log is greater than or equal to that of $F$'s log. \n\t\t-\u003e In this scenario, the last logs both $S$ and $F$ received came from the same leader. Consider the first log they received in this term. 
This must have matched, since it's from the same leader and an entry is appended only if the previous entry matches. This also means they had the same length at this point. Using the same argument(s), if $S$ has a longer log, it must include every entry in $F$. Hence $S$ would contain $L$.\n\t- (b) $S$'s last log entry has a higher term than the last log entry of $F$\n\t\t-\u003e Let's call the last log of $S$, $L_T$. Now, let's suppose that the missing committed log $L$ was issued by some leader in term $U$. We know that $U \\lt T$ since $F$ contains the log but $F$'s last log has a term less than $T$ (Remember that the terms in the log are always monotonically increasing). Now, the node that was the leader for term $T$ contained $L$, since we assumed that node $S$ is the first leader without this committed log. Now, in the leader for term $T$'s log, $L$ would appear before $L_T$, since it contained $L$ before it received $L_T$ $(U \\lt T)$ and it cannot append $L_T$ to any other node's log without the other node's log matching its log up till $L_T$. Therefore, since $S$ contains $L_T$, it must also contain $L$. \n\t\n\tIn both cases, we arrive at a contradiction. Therefore, the result of this tweak is that, if a log has been committed by a leader in a previous term, it **must** be present in the log of every future leader. This allows RAFT to ensure that the leader will **never** have to delete any entry from its log. This means there is a strict one-way flow of logs from leaders to followers only. This is one of the key features that makes RAFT much easier to understand than other consensus algorithms where the write-flow occurs in both directions based on far more conditions that make the state space much more complex and difficult to reason about.\n- **The FSMs are consistent:** This follows from the above, where we reasoned why the leader of any term never has to delete any entry from its log. This means that \"writes\" flow in only one direction, from leader to follower. Remember that an entry is safe to apply to a node's state machine only when it is committed. Commits also only flow from leader to follower. Since the leader never removes entries from its log \u0026 since it always contains all the logs committed by the majority in previous terms, it will never issue a log to a follower which causes a follower to have to \"remove\" a committed log. Therefore, for each node in the RAFT system, commits are append only and never have to be rolled back. This means that if an entry was committed by a node at any position (term, index) in its log, then every other server will also consequently commit the same operation at that position in the log. The FSMs will be consistent. \n \n This last property is why consensus algorithms are so powerful in distributed systems. It allows a set of machines to act as a single unit (resiliency).\n \nWhen you walk through all of these scenarios, it really highlights how the specific design choices – the strong leader, the term logic, and especially the voting restriction based on log completeness – combine to make the safety properties feel intuitive rather than magically asserted. While not a formal proof, building this kind of step-by-step intuition first makes approaching the formal guarantees (like Leader Completeness and State Machine Safety) much less daunting. **It's a direct payoff of their 'design for understandability' goal**. 
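\n\nTo make the voting restriction concrete, here is a minimal sketch of the up-to-date check a follower could run before granting its vote. This is my own illustration in Python with hypothetical names, not code from the paper:\n\n```python\ndef candidate_log_is_up_to_date(my_last_term, my_last_index, cand_last_term, cand_last_index):\n\t# The election restriction: grant a vote only if the candidate's log is\n\t# at least as complete as our own.\n\tif cand_last_term != my_last_term:\n\t\t# Logs end on different terms: the higher last term wins.\n\t\treturn cand_last_term \u003e my_last_term\n\t# Logs end on the same term: the longer (or equally long) log wins.\n\treturn cand_last_index \u003e= my_last_index\n```\n\nCombined with the one-vote-per-term rule, this small check is what guarantees that any elected leader already contains every entry committed in earlier terms.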
\n#### Formal Properties\nLet's restate the properties we proved above in the terms used by the paper.\n\n- **A term and leader are basically equivalent -\u003e Election Safety**\n- **The leader never has to delete or overwrite entries in its own log -\u003e Leader Append-Only** \n- **If two log entries have the same term number \u0026 index, they must be equivalent -\u003e Log Matching**\n- **The leader will never have a log that does not match the logs committed by the majority -\u003e Leader Completeness**\n- **The FSMs are consistent -\u003e State Machine Safety**\n \n![pasted-image-20240923095248](/images/pasted-image-20240923095248.webp)\n- [In Search of an Understandable Consensus Algorithm (Extended Version)](https://raft.github.io/raft.pdf)\n### Follower \u0026 Candidate Crashes\n\u003e Until this point we have focused on leader failures. Follower and candidate crashes are much simpler to handle than leader crashes, and they are both handled in the same way. If a follower or candidate crashes, then future RequestVote and AppendEntries RPCs sent to it will fail. Raft handles these failures by retrying indefinitely; if the crashed server restarts, then the RPC will complete successfully. If a server crashes after completing an RPC but before responding, then it will receive the same RPC again after it restarts. Raft RPCs are idempotent, so this causes no harm. For example, if a follower receives an AppendEntries request that includes log entries already present in its log, it ignores those entries in the new request. \n\u003e - [In Search of an Understandable Consensus Algorithm (Extended Version)](https://raft.github.io/raft.pdf)\n## Cluster Membership Changes\nTBD\n# Conclusion (Personal Thoughts)\nRAFT is a great \"academic\" example of how simplicity is often the most powerful solution. In general, I empirically notice that the more complex, convoluted and difficult to understand something is in academia, the more 'respect' it's given. In software engineering, the opposite is usually true. The biggest impact you can have as a SWE is often via the simplest solutions. This is also why there's a lot of disdain between the two roles, but RAFT is a great example of proving that sometimes (especially in complex fields like distributed systems), making an algorithm understandable is just as crucial as making it correct and efficient. Researchers can reason about its behavior more easily and build further research on top of it. You're enabling future researchers to build even higher. One great example of this is [TiDB - Architecture](/blog/tidb-architecture) being built using RAFT for consistency. \n\n", - "lastmodified": "2026-01-11T05:05:39.818905025+05:30", + "lastmodified": "2026-01-11T10:58:58.327527503+05:30", "tags": [ "domain-cs-systems-distributed-systems" ] @@ -232,7 +232,7 @@ "/blog/introduction-to-complexity-theory": { "title": "Introduction to Complexity Theory", "content": "# Introduction to Complexity Theory\n\nIn most algorithms courses, students are taught a plethora of algorithms that are capable of solving many interesting problems. It often implicitly suggests to the student that most problems have solutions. Solutions that are feasible to compute on their machines should they need to. 
On the contrary, most problems are unsolvable, and of those that are solvable, only a few are computable in any feasible amount of time.\n\nComputational complexity theory is a field of study where we attempt to classify \"computational\" problems according to their resource usage and relate these classes to one another. We begin by defining a few classes that group problems based on running time.\n\n1. $P$\n2. $EXP$\n3. $NP$\n4. $R$\n\n## P or PTIME\n\nThis is a fundamental complexity class in the field of complexity theory. We define $P$ as the set of **all** decision problems that can be solved by a **deterministic** Turing machine in polynomial time.\n\n\u003eA more formal definition is given below: A language $L$ is said to be in $P$ $\\iff$ there exists a **deterministic** Turing machine $M$ such that:\n\u003e\n\u003e1. $M$ runs for polynomial time on **all** inputs\n\u003e2. $\\forall l \\in L$, $M$ outputs 1\n\u003e3. $\\forall l \\notin L$, $M$ outputs 0\n\nWhen we talk about computational problems, we like problems in $P$. These problems are feasible for computers to compute in a reasonable amount of time.\n\n## EXP or EXPTIME\n\nThis is the class of **all** decision problems that can be solved by a **deterministic** Turing machine in exponential time. Similar to how we gave a formal definition for $P$, it is easy to see that we can modify the same formal definition to fit $EXP$ as well.\n\n## R or RTIME\n\nThe $R$ here stands for \"recursive.\" Back when complexity theory was being developed, there was a different idea of what the word 'recursive' meant. But in essence, $R$ is simply the set of all decision problems that can be solved by a deterministic Turing machine in some finite amount of time.\n\nIt might seem as though all problems are solvable by a Turing machine in some finite time, and hence that a dedicated class is unnecessary. But this is not true.\n\n## Undecidable problems\n\nAn undecidable problem is a decision problem for which it has been proven that it is impossible to develop an algorithm that always leads to a valid yes-or-no answer. To prove that there exist undecidable problems, it suffices to provide even just one example of an undecidable problem.\n\n## The Halting Problem\n\nOne of the most famous examples of undecidable problems is the halting problem, put forth by Alan Turing himself. Using this, Turing proved that there do indeed exist undecidable problems. But this isn't the only reason why the halting problem is \"special.\"\n\nThe halting problem poses the following question: _\"Given the description of an arbitrary program and a finite input, decide whether the program finishes running or will run forever.\"_\n\nIn fact, if the halting problem were decidable, we would be able to know a LOT more than what we do today. Proving conjectures would be a LOT easier and we might have made a lot of progress in many fields.\n\n### Solve Goldbach's conjecture?\n\nConsider [Goldbach's conjecture](https://en.wikipedia.org/wiki/Goldbach%27s_conjecture). It states that _every even whole number greater than 2 is the sum of two prime numbers._\n\nUsing computers, we have tested the conjecture for a large range of numbers. This conjecture is \"probably\" true, but to this day, we have **no** proof for this statement. Simply checking large ranges is not enough. 
Finding even just one counterexample, even if it is tens of digits long, is enough to prove the conjecture **false**.\n\nLet's say we constructed a Turing machine $M$ that executes the below algorithm (written here as a small runnable Python program).\n\n```python\ndef is_prime(n):\n\tif n \u003c 2:\n\t\treturn False\n\treturn all(n % d for d in range(2, int(n ** 0.5) + 1))\n\ni = 4\nwhile True:  # iterate over the even numbers forever\n\tif not any(is_prime(j) and is_prime(i - j) for j in range(2, i)):\n\t\tprint(i)  # i is not a sum of two primes\n\t\tbreak     # halt: we have disproved Goldbach's conjecture!\n\ti += 2\n```\n\nThis definition of our Turing machine is capable of disproving Goldbach's conjecture. But the question is, how long do we let it run for? If the number is not small, it might take years to find this number. Maybe millions of years. We do not know. And even worse, if the conjecture is indeed true, then this machine will **never** halt. It will keep running forever.\n\n**However, what if the halting problem were decidable?**\n\nWhat if we could construct another Turing machine $M_1$, this time one which solves the halting problem? We can feed it $M$ as input, and let $M_1$ decide whether $M$ halts.\n\nIf $M_1$ outputs \"halt\" then there **must** be some input for which Goldbach's conjecture fails. We have disproved it.\n\nIf $M_1$ outputs \"run forever\" then Goldbach's conjecture **must** be true. It is no longer a conjecture, we have managed to prove it!\n\nBeing able to solve the halting problem would help us solve so many such conjectures. Take the twin primes conjecture, for example: we would be able to settle it. We would be so much more powerful and armed in terms of the knowledge available to us. However, sadly, Alan Turing proved that the halting problem is undecidable. And the proof is quite fascinating to describe.\n\n### The proof\n\nWe will prove that the halting problem is undecidable using contradiction. Therefore, we begin by assuming that there exists some Turing machine $M$ that is capable of solving the Halting problem.\n\nMore formally, there exists some deterministic Turing machine $M$ which accepts some other Turing machine $A$ and $A$'s input $X$ as input and outputs 1 or \"Halt\" if $A$ will halt on that input and 0 or \"Run forever\" if $A$ will not halt on that input.\n\nNow, let's construct another Turing machine \"Vader\" which does something quite interesting. Given some Turing machine $A$ and its input $X$, Vader first runs $M$ on the input. If $M$ returns \"halt\", Vader will run forever. And if $M$ returns \"run forever\", Vader will halt.\n\nThis is still fine, but the masterstroke that Turing came up with was to give Vader itself as input!\n\nIn the above explanation, we make $A = Vader$ and $X = Vader$. For Vader to work, it will first run $M$ on this input. For simplicity, we will call the copy of Vader given as input iVader. There can only be two possible outputs:\n\n1. **$M$ returns \"Halt\"**\n \n This means that $M$ thinks that iVader will halt when run on itself. _However_, when $M$ returns \"halt\", Vader will run forever. Remember that Vader is given itself as input. The input iVader and the program Vader are identical. $M$ predicts that iVader will halt but we know that Vader will run forever. We have a contradiction.\n \n2. **$M$ returns \"Run forever\"**\n \n Again, we have ourselves a contradiction. Just like before, $M$ thinks that iVader will run forever, but we know that Vader will halt. 
The emphasis here is that iVader and Vader are the same Turing machine, run on the same input.\n \n\nTherefore, neither of the cases can be true. The contradiction arises from the fact that our assumption is wrong. There can exist no such Turing machine $M$ which can solve the Halting problem.\n\n## NP or NP-TIME\n\nThere are a couple of different definitions used to define the class $NP$.\n\nOne of these definitions says that $NP$ is the set of problems that can be solved in polynomial time by a **nondeterministic** Turing machine. Notice that the keyword here is **nondeterministic.** What this essentially means is that at every \"step\" in the computation, the machine _always_ picks the right path. Let's say a Turing machine had states similar to the below picture. A non-deterministic machine would accept any input string that has **at least one accepting run** in its model. It is \"lucky\" in the sense that it is always capable of picking the right choice and moving to the right state which guarantees ending at a **YES** result, as long as such a run exists in its model.\n\n![pnp-1](/images/pnp-1.webp)\n\nThe second definition for $NP$ calls it the set of decision problems for which the problem instances, where the answer is \"yes\", have proofs verifiable in polynomial time by a deterministic Turing machine. To understand this, we must understand verification vs decision.\n\n## Verification vs Decision\n\nWe covered what it means to solve a decision problem earlier ([Defining Computational Problems](/blog/defining-computational-problems), [Church-Turing Hypothesis](/blog/church-turing-hypothesis)); verification, on the other hand, is about checking a proof that can be sent along with a solution. In the most intuitive terms, let's say someone claims that they are very good at the game of Tetris and can win the game for some specified input. Here we consider a modified version of Tetris where all the next pieces are known in advance. How does this person **prove** to you that they can indeed win the game? By playing it out of course! It might be very difficult to figure out the strategy to win, but given the proof (the sequence of moves), implementing the rules of Tetris and playing it out to check if the person is correct can be done easily.\n\nEssentially, to be in $NP$, our machine can take an arbitrary amount of time to come up with a proof for its solution for all possible inputs, but this proof must be _verifiable_ in polynomial time.\n\nWe'll attempt to explain further by means of an example. Consider the clique problem.\n\n$$ \\text{CLIQUE} = \\{\\langle G, k\\rangle : G \\text{ is an undirected graph with a k-clique} \\} $$\n\nHow would a _verifier_ verify this answer? Let's say the input to the verifier is given in the form $\\langle \\langle G, k\\rangle, c\\rangle$ where $c$ is the answer to our problem defined by $G$ and $k$.\n\n1. First, check if the answer $c$ contains exactly $k$ unique nodes $\\in G$ or not. If not, the answer can be trivially rejected. This can be done in $O(V)$ time.\n2. Next, check if there exists an edge between **every** pair of nodes in $c$. This is done in $O(V+E)$ time. If not, reject the answer.\n3. If both the above checks passed, accept the answer!\n\nHence we can say that the clique problem is in $NP$ because we've demonstrated that it is indeed possible to write a verifier that can check the \"correctness\" of an answer. 
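To make this concrete, here's what such a verifier could look like as a program. This is a minimal sketch of my own (the graph encoding, the function names, and the $O(k^2)$ pairwise check are assumptions of this sketch, not anything prescribed by the course notes):\n\n```cpp\n#include \u003cbits/stdc++.h\u003e\nusing namespace std;\n\n// Polynomial-time verifier for CLIQUE: accept \u003c\u003cG, k\u003e, c\u003e iff the certificate\n// c names k distinct vertices of G that are pairwise adjacent.\nbool verify_clique(int n, const set\u003cpair\u003cint,int\u003e\u003e \u0026edges, int k, vector\u003cint\u003e c) {\n\t// Check 1: exactly k unique nodes, all of them vertices of G\n\tsort(c.begin(), c.end());\n\tc.erase(unique(c.begin(), c.end()), c.end());\n\tif ((int)c.size() != k) return false;\n\tfor (int v : c) if (v \u003c 0 || v \u003e= n) return false;\n\t// Check 2: an edge between every pair of nodes in c\n\tfor (int i = 0; i \u003c k; i++)\n\t\tfor (int j = i + 1; j \u003c k; j++)\n\t\t\tif (!edges.count({c[i], c[j]}) \u0026\u0026 !edges.count({c[j], c[i]}))\n\t\t\t\treturn false;\n\treturn true; // Both checks passed: accept\n}\n\nint main() {\n\t// A triangle 0-1-2 plus a pendant vertex 3\n\tset\u003cpair\u003cint,int\u003e\u003e edges = {{0,1}, {1,2}, {0,2}, {2,3}};\n\tcout \u003c\u003c verify_clique(4, edges, 3, {0, 1, 2}) \u003c\u003c '\\n'; // 1: a 3-clique\n\tcout \u003c\u003c verify_clique(4, edges, 3, {1, 2, 3}) \u003c\u003c '\\n'; // 0: edge 1-3 missing\n}\n```\n\nBoth checks are clearly polynomial in the input size, which is all that membership in $NP$ asks of the verifier; actually _finding_ a $k$-clique is the hard part.\n\n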
In the field of complexity theory, we call such 'solution paths' or 'proofs' or 'witnesses' a **certificate** of computation.\n\n## NP-Complete\n\nFor a problem $p$ to be in $\\text{NP-Complete}$, it must satisfy 2 criteria.\n\n1. $p$ must be $\\in NP$\n2. _Every_ problem $\\in NP$ must be _reducible_ to $p$\n\nWe cover reductions in depth later, but essentially: if we can come up with polynomial-time algorithm(s) to 'reduce' the inputs and outputs $\\langle I, O\\rangle$ given to some machine $s$ to new inputs/outputs $\\langle I', O' \\rangle$ for another machine $t$ such that $O' = O$, then we say that we have reduced the problem solved by $s$ to the problem solved by $t$.\n\nNow, onto [P vs NP](/blog/p-vs-np)\n# References\nThese notes are old and I did not rigorously hoard references back then. If some part of this content is yours or you know where it's from, then do reach out to me and I'll update it. \n1. Professor [Kannan Srinathan's](https://www.iiit.ac.in/people/faculty/srinathan/) course on Algorithm Analysis \u0026 Design in IIIT-H\n2. [P vs. NP - The Biggest Unsolved Problem in Computer Science - Up And Atom](https://youtu.be/EHp4FPyajKQ?si=YrgWuQpxfDbc0dmW) (Great Channel, recommend checking out)", - "lastmodified": "2026-01-11T05:05:58.645680908+05:30", + "lastmodified": "2026-01-11T10:59:11.791906277+05:30", "tags": [ "domain-cs-algorithms-complexity" ] @@ -240,15 +240,15 @@ "/blog/knapsack-using-branch-and-bounding": { "title": "Knapsack Using Branch and Bounding", "content": "# Branch \u0026 Bound\nWhen trying to solve problems in general (especially optimization problems), it's always a good idea to formulate a mathematical definition of the problem. To formulate it in mathematical terms, we assign each item in the knapsack a _decision variable $x_i \\in \\{0, 1\\}$._ Now, each item in the knapsack also has some weight $w_i$ and value $v_i$ associated with it. Let's say our knapsack capacity is denoted by $W$.\n\nThe decision variable $x_i$ simply indicates whether we include item $i$ in the knapsack or not. Using this, we can define the objective function that we wish to optimize (maximize) as:\n\n$$ V = \\sum_{i=1}^n v_ix_i $$\n\nunder the constraint that\n\n$$ K = \\sum_{i=1}^n w_ix_i \\leq W $$\n\nNow we have a formal definition of the function we want to maximize under some constraints.\n\n## Branching\n\nNow, one way to solve the knapsack problem would be to perform an \"exhaustive\" search on the decision variables. This would mean checking every possible combination of values that our decision variables could take. Visualized as a tree, it would look like this:\n\n![bnb-1](/images/bnb-1.webp)\n\nHere, we compute the answer by **branching** over all possible combinations. This is, of course, $O(2^n)$. Increasing the input size by one literally means doubling the amount of computation done.
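\n\nSketched as code, this exhaustive branch over the decision variables looks something like the below. This is my own C++ sketch; the item weights and values are assumptions chosen to be consistent with the figures (an all-items estimate of $\\$128$ and an optimum of $\\$80$):\n\n```cpp\n#include \u003cbits/stdc++.h\u003e\nusing namespace std;\n\nint W;            // knapsack capacity\nvector\u003cint\u003e w, v; // weight and value of each item\n\n// Branch on x_i in {0, 1} for every item: O(2^n) leaves.\nint exhaustive(int i, int room, int value) {\n\tif (room \u003c 0) return INT_MIN;         // weight constraint violated\n\tif (i == (int)w.size()) return value; // every decision variable is fixed\n\treturn max(exhaustive(i + 1, room, value),                // x_i = 0\n\t           exhaustive(i + 1, room - w[i], value + v[i])); // x_i = 1\n}\n\nint main() {\n\t// Assumed example values, picked to match the blog's figures\n\tW = 10; w = {5, 8, 3}; v = {45, 48, 35};\n\tcout \u003c\u003c exhaustive(0, W, 0) \u003c\u003c '\\n'; // 80: take the weight-5 and weight-3 items\n}\n```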
\n\n## Relaxation\n\nThis is where the idea of relaxation comes in. Branching was basically us splitting the problem into subproblems by locking or _constraining_ one variable at a time and then deciding on the remaining variables. Exhaustive search isn't ideal, hence we try to reduce the search space by implementing some **\"bound.\"** Essentially, at each step of the recursion, we place an optimistic bet or estimate on just how good the solution to our subproblem can be. If at any point in the recursion this estimate is lower than the best-found value so far, we can kill off the recursion.\n\nAs we've mentioned before, **relaxation** is the key to optimization. The original problem is **NP-Complete**. This means that unless someone proves $P=NP$, these problems have **NO** polynomial-time optimal solution. Relaxation is the art of how we deal with these problems.\n\n### Using relaxation on the exhaustive search\n\nSo, what constraints can we try to \"relax\" in the knapsack problem? The only constraint there is the weight capacity of the knapsack. So let's start by relaxing it to let us have an **infinite** knapsack. A picture is worth a thousand words, so let me just show you what the search would look like with this relaxation.\n\n![bnb-2](/images/bnb-2.webp)\n\nLet's try to see what we did here. First, we begin by letting the root have $W = 10$ space and $V = 0$ as it is completely unfilled. This is the $(0, 0, 0)$ state. The **estimate** here is our relaxation. Assuming infinite space, we see that the **most optimistic** value we can reach from here is $\\$128$ if we include all the items. Remember, the _relaxation_ is for calculating this _estimate_.\n\nOnce this is done, we're just performing an exhaustive search. But now, notice that the node on the left stops its recursion once the room in the knapsack has become negative. This is a simple base case of the original recursion.\n\nThe interesting part is the recursion that we've killed on the rightmost node. The leaves marked with crosses were explored to the end and then discarded in favor of the bottom-left leaf, which is our optimal score of 80. Now, the rightmost node was killed even **before** it reached the leaves. This is because we had already achieved a better score (80) than the best possible estimate from this node. Hence we know for a fact that following the recursion can never give us a _better_ score. Recall that our relaxation was an infinite knapsack. If we cannot do better with an infinite knapsack from that point, there is no point in searching further down that track. This is the key idea behind relaxation and how using bounds can help us optimize exhaustive search. However, this particular relaxation was not very effective and did not help much in optimizing our search. But maybe with a _better_ heuristic, we can optimize the search further.\n\n## Coming up with a better heuristic\n\nLet's think about how we would normally solve the knapsack problem **if we were allowed to take rational amounts of an item**. That is, it is no longer a 0-1 problem where we must either take or discard an item. We can take items in parts now. This problem has a fairly straightforward greedy solution. We simply sort items by $\\frac{v_i}{w_i}$. This is essentially their \"value per 1 unit of room.\" Simply pick the element that gives the best value per weight. So the strategy is now picking the element with the highest $\\frac{v_i}{w_i}$ ratio, and when we run out of space, picking the last element in a fractional amount such that it fills up the entire knapsack.\n\nThis is the **optimal** solution for this version of the knapsack problem.
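\n\nHere's how that greedy fractional estimate can plug into the branching search as a bound. This is again a sketch of my own, reusing the assumed item values from the previous snippet; note that the items must be pre-sorted by value density for the greedy estimate to be valid:\n\n```cpp\n#include \u003cbits/stdc++.h\u003e\nusing namespace std;\n\nstruct Item { int w, v; };\nvector\u003cItem\u003e items; // pre-sorted by value density v/w, best first\nint W, best = 0;\n\n// Optimistic estimate: greedily fill the remaining room, taking a fraction\n// of the first item that no longer fits. This is the relaxed (fractional) solution.\ndouble bound(int i, int room, int value) {\n\tdouble est = value;\n\tfor (; i \u003c (int)items.size() \u0026\u0026 room \u003e 0; i++) {\n\t\tint take = min(room, items[i].w);\n\t\test += (double)items[i].v * take / items[i].w;\n\t\troom -= take;\n\t}\n\treturn est;\n}\n\nvoid branch(int i, int room, int value) {\n\tif (room \u003c 0) return;                      // infeasible: overfilled the knapsack\n\tbest = max(best, value);\n\tif (i == (int)items.size()) return;\n\tif (bound(i, room, value) \u003c= best) return; // bound: cannot beat the incumbent\n\tbranch(i + 1, room - items[i].w, value + items[i].v); // x_i = 1\n\tbranch(i + 1, room, value);                           // x_i = 0\n}\n\nint main() {\n\t// Assumed example values (sorted by v/w), matching the blog's figures\n\tW = 10; items = {{3, 35}, {5, 45}, {8, 48}};\n\tbranch(0, W, 0);\n\tcout \u003c\u003c best \u003c\u003c '\\n'; // 80, found while pruning a branch whose bound is $77\n}\n```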
\n\nBut what about when we apply this relaxation to the original exhaustive search model instead of the infinite bag relaxation?\n\n![bnb-3](/images/bnb-3.webp)\n\nNotice how much better we've managed to optimize the exhaustive search. The right child of the parent node is cut off at $\\$77$ and does not search further, because our \"estimated\" cost is less than the highest value we have found so far ($\\$80$).\n\n### Optimality\n\nNotice that the fractional knapsack is the best-case version of the knapsack, as we can optimally fill every unit of space according to a greedy strategy. Therefore, if even this greedy estimate is below a previously found maximum, then no solution down that branch can be optimal. This means this branch and bound relaxation will still give us the **optimal** solution to the 0-1 knapsack problem.\n\n### Complexity\n\nAnalyzing the runtime complexity of branch and bound algorithms has proved notoriously difficult. The following blog gives some intuition as to why we find placing a bound on such techniques very difficult. [https://rjlipton.wpcomstaging.com/2012/12/19/branch-and-bound-why-does-it-work/](https://rjlipton.wpcomstaging.com/2012/12/19/branch-and-bound-why-does-it-work/)\n\nGeorge Nemhauser is one of the world's experts on all things having to do with large-scale optimization problems. He has received countless honors for his brilliant work, including membership in the National Academy of Engineering, the John Von Neumann Theory Prize, the Khachiyan Prize, and the Lanchester Prize. To quote him,\n\n\u003e _\u201cI have always wanted to prove a lower bound about the behavior of branch and bound, but I never could.\u201d -_ George Nemhauser\n\nPutting a good bound on branching and bounding is very difficult and is an open problem. One alternative measure that is used to better estimate the efficiency of branch and bound algorithms is the **effective branching factor (EBF).**\n\nWe define EBF as the number $b$ such that your search took the same time as searching a $b$-ary tree with no pruning. If you are searching a tree of depth $d$, this is well-defined as the $d$-th root of the total number of nodes you searched. For example, if a search of depth $6$ visited $10^6$ nodes in total, its EBF is $(10^6)^{1/6} = 10$.\n\nThis measure is computed in practice and is used a lot when solving optimization problems, as it is quite effective in practice even if it is difficult to put bounds on theoretically. The fact that the runtime can be altered significantly by simply changing the relaxation criteria also makes branch and bound a great option to try out when coming up with relaxation ideas.\n\n## Why not just stick with Dynamic programming?\n\nThis is a natural question. ([A Deep Dive into the Knapsack Problem](/blog/a-deep-dive-into-the-knapsack-problem)) DP seems to give us an approach where we can be comfortable sticking with an $O(nW)$ or $O(nV)$ complexity solution. However, what if we modify the question ever so slightly and allow items to have fractional or real weights? This seems like a problem that might surface in the real world a fair amount. The DP table approach is no longer feasible. In such a situation the branch and bound algorithm might come in clutch. As we explore such problems and minor variations of such problems, the need to expand our tool-belt and come up with more and more optimization algorithms becomes clear.\n\n# References\nThese notes are old and I did not rigorously hoard references back then. If some part of this content is yours or you know where it's from, then do reach out to me and I'll update it. \n1. 
[Discrete Optimization - University of Melbourne - Coursera](https://www.coursera.org/learn/discrete-optimization)", - "lastmodified": "2026-01-11T05:05:58.934004424+05:30", + "lastmodified": "2026-01-11T10:59:12.098716692+05:30", "tags": [ "domain-cs-algorithms-analysis" ] }, "/blog/levenshtein-edit-distance": { "title": "Levenshtein Edit Distance", - "content": "Previously, we looked at a few famous dynamic programming problems ([DP as DAGs, Shortest path on DAGs \u0026 LIS in O(nlogn)](/blog/dp-as-dags-shortest-path-on-dags-lis-in-o-nlogn)). Today we'll be looking at a pretty common problem that we have our computers solve for us, almost every single day. **Spellchecking**. Our computers are great at suggesting good auto-correct solutions for us whenever we misspell something. But to recommend one choice over the others, there must be some measure of ranking them. The problem is as follows:\n# The problem\n\n_Given two strings X \u0026 Y, what is the minimum number of edit operations that we must perform on X to transform it to Y? Here, an edit operation can be one of three things._\n\n1. _Insert character $c$ at any position $i$ in $X$_\n2. _Delete character $c$ at any position $i$ in $X$_\n3. _Substitute character $c$ at position $i$ in $X$ with any other character $c'$_\n\nThis computed quantity is also known as the **Levenshtein edit distance** between the two strings.\n\nIn essence, the Levenshtein distance is a very good heuristic to measure just how close two strings really are. It's a very good metric to rank words that the user might've wanted to type but accidentally misspelled due to one of the three possible edits. It is understandable why this is often used for spellchecking.\n\n## An alternate view\n\nAnother way to think about edit distance is as an alignment problem. Given two strings $X$ and $Y$, to what extent can they be matched up? An example should make this question more clear.\n\n$$ X = SNOWY, Y = SUNNY \\\\ S \\ \\_ \\ N \\ O \\ W \\ Y \\\\ S \\ U \\ N \\ N \\ \\_ \\ Y $$\n\nNotice that with this alignment,\n\n1. The `_` 's in $X$ represents an **insertion** edit\n2. The `_` 's in $Y$ represents a **delete** edit\n3. And a character mismatch represents a **replacement** edit.\n\nAt position 2, we have an insertion edit $(\\_, U)$. At position 4 we have a replacement edit $(O, N)$. At position 5 we have a delete edit $(W, \\_)$. This is in fact the optimal answer, and hence, the Levenshtein distance between the two strings.\n\nIn short, if we look at it as an alignment problem, the cost is the number of mismatched columns. The edit distance would then be the best possible alignment which minimizes mismatches.\n\n## Finding a recursive solution\n\nAt first glance, finding the solution to this question seems very difficult. There are a lot of different ways to convert say \"Dinosaur\" to \"Paragraph.\" It is not very clear how to solve this question without brute-forcing a lot of pairs. However, a key insight we can make here is that once we have optimally matched some prefix or suffix, we can discard away the matching prefix or suffix and recursively solve for the rest of the string.\n\nAn example will help illustrate this point. Consider the strings \"Dog\" \u0026 \"Dinosaur\". 
What the above point means is that the Levenshtein distance between **Dog** \u0026 **Dinosaur** will be the **same** as the Levenshtein distance between **\"og\"** \u0026 **\"inosaur\".** This key observation lets us write a nice recursive algorithm to calculate the Levenshtein distance for two strings.\n\n### The algorithm\n\n$$ Lev(X, Y) = \\begin{cases} |X| \u0026 \\text{if } |Y| = 0 \\\\ |Y| \u0026 \\text{if } |X| = 0 \\\\ Lev(tail(X), tail(Y)) \u0026 \\text{if } X[0] = Y[0] \\\\ 1 + min \\begin{cases} Lev(tail(X), Y) \\\\ Lev(X, tail(Y)) \\\\ Lev(tail(X), tail(Y)) \\end{cases} \u0026 \\text{otherwise} \\end{cases} \\\\ \\text{Here, } tail(X) \\text{ means the string X without the first symbol} $$\n\nThe top three cases are the base cases. If $Y$ is empty, we have to delete every character in $X$. If $X$ is empty, we have to insert every character in $Y$ in $X$. There is no other way to optimally transform $X$ to $Y$.\n\nThe third case is the key point discussed above. If the first characters match, we can simply discard it and compute the answer for the rest of the string.\n\nIf none of the above cases are true, we can do any of the three edit operations. Notice that there is sadly no way of _greedily_ picking what the best option would be here. Every operation influences the alignment of the rest of the substring and it is not possible to determine how a local choice affects the global structure we end up with. Hence the only possibility here is to recursively try out every possible combination and pick whichever gives us the minimum. Notice that each of the cases corresponds with an edit operation.\n\n1. $Lev(tail(X), Y) \\implies$**Insertion** operation. We are inserting a `_` in $X$ and computing the answer on the rest of the string.\n2. $Lev(X, tail(Y)) \\implies$**Delete** operation. We are inserting a `_` in $Y$ and computing the answer for the rest of the string.\n3. $Lev(tail(X), tail(Y)) \\implies$**Replacement** operation. We are substituting the character. This corresponds to letting the mismatch exist and align the rest of the string.\n\n#### Optimum substructure exists!\n\nThis algorithm has exponential complexity because in the worst case, it is trying out three different operations at every step. But the good thing about defining this problem recursively is that we have found an optimum substructure for this problem. If we brute force all possibilities at some position $i, j$ in both the strings, we can discard this character and recursively solve on the suffix. This hints us towards using DP to solve our problem more efficiently.\n\n## Coming up with a DP solution\n\nIn general, when we try to find a DP solution to some problem, the following is a good mental checklist to follow/answer.\n\n1. **Define the subproblem \u0026 count how many we'll have**\n2. **Guess some part of the solution \u0026 count how many possibilities we'll have to brute force over. This is the transition we want from the problem to its subproblem.**\n3. **Write the recurrence in terms of the guessed solution + the subproblem**\n4. **Figure out how to memoize/use a dp table for storing computed calculations. Notice that the recursive structure must follow a DAG structure as stated previously or we'll have an infinite recursion which implies our algorithm is wrong.**\n5. 
**We solve the problem**\n\nLet's go over them one by one.\n\nLooking at the recursive definition we have for edit distance, it becomes clear that we must be able to compute the edit distance between any two prefixes of string $X$ and $Y$. These are all the different subproblems encapsulated by the recursion.\n\nNote, from here on forth we denote prefixes of $X$ by $[\\ :i]$ and prefixes of $Y$ by $[ \\ :j]$. Here, we get the answer to the first point in our checklist.\n\n1. Computing edit distance for all possible pairings of prefixes between $X$ and $Y$. We will have of the order quadratic subproblems. For every value of $i$ we have $j$ possibilities to pair it with. Hence the number of problems is of the order $O(|X|.|Y|)$\n\nFor computing the answer at every point, we either have the base case or we have **three** possible operations to take.\n\n1. We can perform one of three operations. Substitute, insert, or delete. In essence, given two suffixes we have exactly three operations that we can use to transform the first character of $X$ to the first character of $Y$. Replace $X[i] \\to Y[j]$. Insert $Y[j]$. Delete $X[i]$.\n\n![edit-dist-1](/images/edit-dist-1.webp)\n\nSince we already have a recursive expression of the algorithm, we already know the recurrence.\n\n3. The recurrence is the same as stated previously\n\n$$ \n1 + min \\begin{cases} Lev(tail(X), Y) \\\\ Lev(X, tail(Y)) \\\\ Lev(tail(X), tail(Y)) \\end{cases}\n$$\n\n1. We already said we will have $O(|X|.|Y|)$ subproblems where we match every $i$ with every $j$. This should have hinted at a 2D dp table. In this table, **every** cell corresponds to the edit distance computed between two suffixes of strings $X$ and $Y$.\n\nFor example, the highlighted yellow cell represents the edit distance between `LITY` and `ARITY`. Further, notice that each of the three highlighted boxes around it corresponds to an **edit operation.** This observation is key to figuring out the topological ordering of our problems.\n\n![edit-dist-2](/images/edit-dist-2.webp)\n\n1. The `Substituion` box means we swap \"L\" with \"A\" and move to state $(i+1, j+1)$.\n2. The `Insertion` box means we insert \"A\" and move to state $(i, j+1)$\n3. The `Deletion` box means we delete \"L\" and move to state $(i+1, j)$\n\nHence for computing the answer at any cell, we only need the answers at cells $(i+1, j), (i, j+1) \\text{ and } (i+1, j+1)$. This is enough information to get the topological ordering. A simple nested for loop from $i :n \\to 0$ and $j:m\\to0$ should be sufficient.\n\nNotice that due to the nature of the problem I can go from $0\\to n$ and $0 \\to m$ as well and define the dp for the prefixes. However, the suffixes idea in my opinion makes the most sense and we'll be using the suffix definition for the dp.\n\nFurther, notice that in the real dp table we would have an extra row and column padding at the very ends to account for the base case where $|X| = 0$ or $|Y| = 0$.\n\n\u003eThus far, we have implicitly assumed that the **cost** associated with each operation is 0. However, this need not be true. Each operation can have any defined cost. In fact, we can even define the cost for conversion from one specific symbol to another and our algorithm would still work. The above DP table can simply be thought of as a DAG with $O(n^2)$ nodes and each edge $(u, v)$ can be **weighted** with the cost of the corresponding transformation from the symbol at position $u$ to the symbol at position $v$. 
Our final answer is in fact just the shortest path from position $(|Y|, |X|) \\to (0, 0)$\n\n### Visualization as a DAG\n\n**Note**: This is the image from the lecture slides and shows the path for the approach using prefixes. For the suffix-based state transformation used by me, simply reverse the direction of each edge in the graph and the problem remains the same.\n\n![edit-dist-3](/images/edit-dist-3.webp)\n\n1. Now to solve the problem :) Notice that the runtime of the algorithm is $O(|X|.|Y|)$\n\n### Single row optimization\n\nThe time complexity of our algorithm was $O(|X|.|Y|)$ and the space complexity was also $O(|X|.|Y|)$. This is considerably better than exponential, but _can we do better?_\n\nAre there any redundancies that we may be computing/storing? It turns out that in fact, there is.\n\nNotice that to compute the value of $dp[i][j]$ at any location, we **only** care about the values of $dp[i][j+1]$, $dp[i+1][j]$ and $dp[i+1][j+1]$. However, notice that we are storing the **ENTIRE** dp table from $dp[0][0] \\to dp[n][m]$. This is redundant and can have great practical limitations on our algorithm.\n\nFor example, computing the edit distance between two strings of length $10^4$ would require 100 MB of memory. This in turn would give a lot of cache misses and slow down the algorithm as well. Further, if we wanted to compute the distance between a string of length $10^5$ and $10^4$, it would only take a few seconds to a minute on most machines but it would **require 1 GB memory.**\n\nThat's a lot of memory wasted for storing redundant information. The single row optimization for DP is as follows.\n\nWe only ever store two rows in our DP table. When computing $dp[i][j]$, we only store the dp table at row $dp[i]$ which we are computing, and the row $dp[i+1]$, which contains the already computed values (as enforced by the topological ordering).\n\nNotice that with this simple optimization,\n\n1. To compute any $dp[i][j]$, notice that all the required states are always in memory. We are never losing/erasing dp values that we require for the computation of $dp[i][j]$ before computing $dp[i][j]$.\n2. We have reduced the space complexity of our algorithm from $O(|X|.|Y|)$ which is quadratic, to $O(2*|X|) = O(|X|)$. Our space complexity is now **linear!**\n\n## Applications\n\nWhile we only discussed how Levenshtein distance was a great heuristic for spell checkers, it is also extensively used in the field of biology for comparing DNA sequences. The more general version where each transformation is given some cost $c_{transform \\ type,\\ s1 \\to s2}$ is used here.\n\nFor example, the mutation $C \\to G$ is more common than $C \\to A$.\n\nNotice that we can now give $C \\to G$ a low cost and $C \\to A$ a high cost. This represents that the first mutation is more likely than the other. This gives us a measure of how similar two DNA sequences are. Mutations also have insertions/deletions. This makes Levenshtein distance a great tool to use here.\n\nIf we wish to not use insertions or deletions, notice that we can simply give them $\\infty$ cost. In computational terms, they're given a very high value like\n\n## Code\n\nWhile it is much easier to visualize the bottom-up dp as finding the solution to suffixes, it is much easier to code the prefix definition of the dp. Note that there really isn't any difference in which direction we pick, at least not conceptually. 
It is just easier to implement the prefix solution in code.\n\nThe single row optimized dp code for calculating the Levenshtein distance between two strings can be found here: [Levenshtein Edit Distance](https://github.com/akcube/algorithms-notebook/blob/main/code/strings/levenshtein-edit-distance.cpp)\n# References\nThese notes are old and I did not rigorously horde references back then. If some part of this content is your's or you know where it's from then do reach out to me and I'll update it. \n1. Professor [Kannan Srinathan's](https://www.iiit.ac.in/people/faculty/srinathan/) course on Algorithm Analysis \u0026 Design in IIIT-H\n2. [How do Spell Checkers work? Levenshtein Edit Distance - Creel](https://youtu.be/Cu7Tl7FGigQ?si=pFru3JaBAeKStvtz) (Excellent channel, do check him out. Has a lot of unique amazing content!)\n\n\n---\n\n## Related Reading\n\n- [Chain Matrix Multiplication](/blog/chain-matrix-multiplication)\n- [A Deep Dive into the Knapsack Problem](/blog/a-deep-dive-into-the-knapsack-problem)\n", - "lastmodified": "2026-01-11T05:05:59.685800944+05:30", + "content": "Previously, we looked at a few famous dynamic programming problems ([DP as DAGs, Shortest path on DAGs \u0026 LIS in O(nlogn)](/blog/dp-as-dags-shortest-path-on-dags-lis-in-o-nlogn)). Today we'll be looking at a pretty common problem that we have our computers solve for us, almost every single day. **Spellchecking**. Our computers are great at suggesting good auto-correct solutions for us whenever we misspell something. But to recommend one choice over the others, there must be some measure of ranking them. The problem is as follows:\n# The problem\n\n_Given two strings X \u0026 Y, what is the minimum number of edit operations that we must perform on X to transform it to Y? Here, an edit operation can be one of three things._\n\n1. _Insert character $c$ at any position $i$ in $X$_\n2. _Delete character $c$ at any position $i$ in $X$_\n3. _Substitute character $c$ at position $i$ in $X$ with any other character $c'$_\n\nThis computed quantity is also known as the **Levenshtein edit distance** between the two strings.\n\nIn essence, the Levenshtein distance is a very good heuristic to measure just how close two strings really are. It's a very good metric to rank words that the user might've wanted to type but accidentally misspelled due to one of the three possible edits. It is understandable why this is often used for spellchecking.\n\n## An alternate view\n\nAnother way to think about edit distance is as an alignment problem. Given two strings $X$ and $Y$, to what extent can they be matched up? An example should make this question more clear.\n\n$$ X = SNOWY, Y = SUNNY \\\\ S \\ \\_ \\ N \\ O \\ W \\ Y \\\\ S \\ U \\ N \\ N \\ \\_ \\ Y $$\n\nNotice that with this alignment,\n\n1. The `_` 's in $X$ represents an **insertion** edit\n2. The `_` 's in $Y$ represents a **delete** edit\n3. And a character mismatch represents a **replacement** edit.\n\nAt position 2, we have an insertion edit $(\\_, U)$. At position 4 we have a replacement edit $(O, N)$. At position 5 we have a delete edit $(W, \\_)$. This is in fact the optimal answer, and hence, the Levenshtein distance between the two strings.\n\nIn short, if we look at it as an alignment problem, the cost is the number of mismatched columns. The edit distance would then be the best possible alignment which minimizes mismatches.\n\n## Finding a recursive solution\n\nAt first glance, finding the solution to this question seems very difficult. 
There are a lot of different ways to convert, say, \"Dinosaur\" to \"Paragraph.\" It is not very clear how to solve this question without brute-forcing a lot of pairs. However, a key insight we can make here is that once we have optimally matched some prefix or suffix, we can discard the matching prefix or suffix and recursively solve for the rest of the string.\n\nAn example will help illustrate this point. Consider the strings \"Dog\" \u0026 \"Dinosaur\". What the above point means is that the Levenshtein distance between **Dog** \u0026 **Dinosaur** will be the **same** as the Levenshtein distance between **\"og\"** \u0026 **\"inosaur\".** This key observation lets us write a nice recursive algorithm to calculate the Levenshtein distance for two strings.\n\n### The algorithm\n\n$$ Lev(X, Y) = \\begin{cases} |X| \u0026 \\text{if } |Y| = 0 \\\\ |Y| \u0026 \\text{if } |X| = 0 \\\\ Lev(tail(X), tail(Y)) \u0026 \\text{if } X[0] = Y[0] \\\\ 1 + min \\begin{cases} Lev(tail(X), Y) \\\\ Lev(X, tail(Y)) \\\\ Lev(tail(X), tail(Y)) \\end{cases} \u0026 \\text{otherwise} \\end{cases} \\\\ \\text{Here, } tail(X) \\text{ means the string X without the first symbol} $$\n\nThe first two cases are the base cases. If $Y$ is empty, we have to delete every character in $X$. If $X$ is empty, we have to insert every character of $Y$ into $X$. There is no other way to optimally transform $X$ to $Y$.\n\nThe third case is the key point discussed above. If the first characters match, we can simply discard them and compute the answer for the rest of the string.\n\nIf none of the above cases apply, we can do any of the three edit operations. Notice that there is sadly no way of _greedily_ picking what the best option would be here. Every operation influences the alignment of the rest of the substring and it is not possible to determine how a local choice affects the global structure we end up with. Hence the only possibility here is to recursively try out every possible combination and pick whichever gives us the minimum. Notice that each of the cases corresponds with an edit operation.\n\n1. $Lev(tail(X), Y) \\implies$**Delete** operation. We are inserting a `_` in $Y$ (dropping the first symbol of $X$) and computing the answer on the rest of the string.\n2. $Lev(X, tail(Y)) \\implies$**Insertion** operation. We are inserting a `_` in $X$ and computing the answer for the rest of the string.\n3. $Lev(tail(X), tail(Y)) \\implies$**Replacement** operation. We are substituting the character. This corresponds to letting the mismatch exist and aligning the rest of the string.\n\n#### Optimum substructure exists!\n\nThis algorithm has exponential complexity because in the worst case, it is trying out three different operations at every step. But the good thing about defining this problem recursively is that we have found an optimum substructure for this problem. If we brute force all possibilities at some position $i, j$ in both the strings, we can discard this character and recursively solve on the suffix. This hints us towards using DP to solve our problem more efficiently.\n\n## Coming up with a DP solution\n\nIn general, when we try to find a DP solution to some problem, the following is a good mental checklist to follow/answer.\n\n1. **Define the subproblem \u0026 count how many we'll have**\n2. **Guess some part of the solution \u0026 count how many possibilities we'll have to brute force over. This is the transition we want from the problem to its subproblem.**\n3. **Write the recurrence in terms of the guessed solution + the subproblem**\n4. 
**Figure out how to memoize/use a dp table for storing computed calculations. Notice that the recursive structure must follow a DAG structure as stated previously or we'll have an infinite recursion, which implies our algorithm is wrong.**\n5. **We solve the problem**\n\nLet's go over them one by one.\n\nLooking at the recursive definition we have for edit distance, it becomes clear that we must be able to compute the edit distance between any two suffixes of strings $X$ and $Y$. These are all the different subproblems encapsulated by the recursion.\n\nNote: from here on, we denote suffixes of $X$ by $[i:\\ ]$ and suffixes of $Y$ by $[j:\\ ]$. Here, we get the answer to the first point in our checklist.\n\n1. Computing edit distance for all possible pairings of suffixes between $X$ and $Y$. We will have on the order of quadratically many subproblems. For every value of $i$ we have every value of $j$ to pair it with. Hence the number of problems is of the order $O(|X|.|Y|)$.\n\nFor computing the answer at every point, we either have the base case or we have **three** possible operations to take.\n\n2. We can perform one of three operations. Substitute, insert, or delete. In essence, given two suffixes we have exactly three operations that we can use to transform the first character of $X$ to the first character of $Y$. Replace $X[i] \\to Y[j]$. Insert $Y[j]$. Delete $X[i]$.\n\n![edit-dist-1](/images/edit-dist-1.webp)\n\nSince we already have a recursive expression of the algorithm, we already know the recurrence.\n\n3. The recurrence is the same as stated previously\n\n$$ \n1 + min \\begin{cases} Lev(tail(X), Y) \\\\ Lev(X, tail(Y)) \\\\ Lev(tail(X), tail(Y)) \\end{cases}\n$$\n\n4. We already said we will have $O(|X|.|Y|)$ subproblems where we match every $i$ with every $j$. This should have hinted at a 2D dp table. In this table, **every** cell corresponds to the edit distance computed between two suffixes of strings $X$ and $Y$.\n\nFor example, the highlighted yellow cell represents the edit distance between `LITY` and `ARITY`. Further, notice that each of the three highlighted boxes around it corresponds to an **edit operation.** This observation is key to figuring out the topological ordering of our problems.\n\n![edit-dist-2](/images/edit-dist-2.webp)\n\n1. The `Substitution` box means we swap \"L\" with \"A\" and move to state $(i+1, j+1)$.\n2. The `Insertion` box means we insert \"A\" and move to state $(i, j+1)$.\n3. The `Deletion` box means we delete \"L\" and move to state $(i+1, j)$.\n\nHence for computing the answer at any cell, we only need the answers at cells $(i+1, j), (i, j+1) \\text{ and } (i+1, j+1)$. This is enough information to get the topological ordering. A simple nested for loop from $i :n \\to 0$ and $j:m\\to0$ should be sufficient.\n\nNotice that due to the nature of the problem I can go from $0\\to n$ and $0 \\to m$ as well and define the dp for the prefixes. However, the suffixes idea in my opinion makes the most sense and we'll be using the suffix definition for the dp.\n\nFurther, notice that in the real dp table we would have an extra row and column of padding at the very ends to account for the base case where $|X| = 0$ or $|Y| = 0$.\n\n\u003eThus far, we have implicitly assumed that the **cost** associated with each operation is 1. However, this need not be true. Each operation can have any defined cost. In fact, we can even define the cost for conversion from one specific symbol to another and our algorithm would still work. 
The above DP table can simply be thought of as a DAG with $O(n^2)$ nodes where each edge $(u, v)$ can be **weighted** with the cost of the corresponding transformation from the symbol at position $u$ to the symbol at position $v$. Our final answer is in fact just the shortest path from position $(|Y|, |X|) \\to (0, 0)$.\n\n### Visualization as a DAG\n\n**Note**: This is the image from the lecture slides and shows the path for the approach using prefixes. For the suffix-based state transformation used by me, simply reverse the direction of each edge in the graph and the problem remains the same.\n\n![edit-dist-3](/images/edit-dist-3.webp)\n\n5. Now to solve the problem :) Notice that the runtime of the algorithm is $O(|X|.|Y|)$.\n\n### Single row optimization\n\nThe time complexity of our algorithm was $O(|X|.|Y|)$ and the space complexity was also $O(|X|.|Y|)$. This is considerably better than exponential, but _can we do better?_\n\nAre there any redundancies that we may be computing/storing? It turns out that, in fact, there are.\n\nNotice that to compute the value of $dp[i][j]$ at any location, we **only** care about the values of $dp[i][j+1]$, $dp[i+1][j]$ and $dp[i+1][j+1]$. However, notice that we are storing the **ENTIRE** dp table from $dp[0][0] \\to dp[n][m]$. This is redundant and places great practical limitations on our algorithm.\n\nFor example, computing the edit distance between two strings of length $10^4$ would require 100 MB of memory. This in turn would give a lot of cache misses and slow down the algorithm as well. Further, if we wanted to compute the distance between a string of length $10^5$ and $10^4$, it would only take a few seconds to a minute on most machines but it would **require 1 GB memory.**\n\nThat's a lot of memory wasted for storing redundant information. The single row optimization for DP is as follows.\n\nWe only ever store two rows of our DP table. When computing $dp[i][j]$, we only store the dp table at row $dp[i]$, which we are computing, and the row $dp[i+1]$, which contains the already computed values (as enforced by the topological ordering).\n\nNotice that with this simple optimization,\n\n1. To compute any $dp[i][j]$, all the required states are always in memory. We never lose/erase dp values that are still required for the computation of $dp[i][j]$ before computing $dp[i][j]$.\n2. We have reduced the space complexity of our algorithm from $O(|X|.|Y|)$, which is quadratic, to $O(2 \\cdot |Y|) = O(|Y|)$ (two rows, each of length $|Y|+1$). Our space complexity is now **linear!**\n\n## Applications\n\nWhile we only discussed how Levenshtein distance is a great heuristic for spell checkers, it is also extensively used in the field of biology for comparing DNA sequences. The more general version where each transformation is given some cost $c_{transform \\ type,\\ s1 \\to s2}$ is used here.\n\nFor example, the mutation $C \\to G$ is more common than $C \\to A$.\n\nNotice that we can now give $C \\to G$ a low cost and $C \\to A$ a high cost. This encodes the fact that the first mutation is more likely than the second. This gives us a measure of how similar two DNA sequences are. Mutations also have insertions/deletions. This makes Levenshtein distance a great tool to use here.\n\nIf we wish to not use insertions or deletions, notice that we can simply give them $\\infty$ cost. In computational terms, they're just given some very large sentinel value.
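\n\nTying this together, here is a minimal sketch of the single-row optimized dp (my own code: prefix formulation, unit costs; the author's full implementation is linked in the Code section below):\n\n```cpp\n#include \u003cbits/stdc++.h\u003e\nusing namespace std;\n\n// Levenshtein distance with the single-row optimization: O(|X| * |Y|) time,\n// O(|Y|) space. Prefix formulation: prev[j] = distance between X[:i] and Y[:j].\nint levenshtein(const string \u0026X, const string \u0026Y) {\n\tconst int m = Y.size();\n\tvector\u003cint\u003e prev(m + 1), cur(m + 1);\n\tiota(prev.begin(), prev.end(), 0);  // base case: X empty -\u003e j insertions\n\tfor (int i = 1; i \u003c= (int)X.size(); i++) {\n\t\tcur[0] = i;                     // base case: Y empty -\u003e i deletions\n\t\tfor (int j = 1; j \u003c= m; j++) {\n\t\t\tif (X[i - 1] == Y[j - 1]) cur[j] = prev[j - 1]; // characters match\n\t\t\telse cur[j] = 1 + min({prev[j],        // delete X[i-1]\n\t\t\t                       cur[j - 1],     // insert Y[j-1]\n\t\t\t                       prev[j - 1]});  // substitute\n\t\t}\n\t\tswap(prev, cur); // only two rows are ever alive\n\t}\n\treturn prev[m];\n}\n\nint main() {\n\tcout \u003c\u003c levenshtein(\"SNOWY\", \"SUNNY\") \u003c\u003c '\\n'; // 3, matching the alignment above\n}\n```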
\n\n## Code\n\nWhile the bottom-up dp is much easier to visualize as finding the solution to suffixes, the prefix definition of the dp is much easier to code. Note that there really isn't any difference in which direction we pick, at least not conceptually. It is just easier to implement the prefix solution in code.\n\nThe single row optimized dp code for calculating the Levenshtein distance between two strings can be found here: [Levenshtein Edit Distance](https://github.com/akcube/algorithms-notebook/blob/main/code/strings/levenshtein-edit-distance.cpp)\n# References\nThese notes are old and I did not rigorously hoard references back then. If some part of this content is yours or you know where it's from, then do reach out to me and I'll update it. \n1. Professor [Kannan Srinathan's](https://www.iiit.ac.in/people/faculty/srinathan/) course on Algorithm Analysis \u0026 Design in IIIT-H\n2. [How do Spell Checkers work? Levenshtein Edit Distance - Creel](https://youtu.be/Cu7Tl7FGigQ?si=pFru3JaBAeKStvtz) (Excellent channel, do check him out. Has a lot of unique amazing content!)\n", "lastmodified": "2026-01-11T10:59:50.847111066+05:30", "tags": [ "domain-cs-algorithms-analysis" ] }, "/blog/log-structured-merge-tree-lsm-trees": { "title": "Log-Structured Merge Tree (LSM Trees)", "content": "Just to preface this, this is not going to be a detailed paper deep dive. Why am I doing this one differently? Mainly because I'm bottlenecked on reading and writing time. I've not posted anything in recent months because of an overload of things to read about and not enough time to write blogs / notes in. The original [The Log-Structured Merge-Tree (LSM-Tree)](https://www.cs.umb.edu/~poneil/lsmtree.pdf) paper here by Patrick O'Neil, Edward Cheng, Dieter Gawlick \u0026 Elizabeth O'Neil is 32 pages long and I've not had the chance to more-than-skim it. I don't want to bottleneck my blogs, so I'll be starting with a high level set of notes / content I've amassed from watching CMU's [#04 - Database Storage: Log-Structured Merge Trees \u0026 Tuples (CMU Intro to Database Systems)](https://www.youtube.com/watch?v=IHtVWGhG0Xg) (Have I mentioned I'm a fan of Andy Pavlo? You should watch his courses now!), some blogs / talks about RocksDB given by folk from Facebook, similar content from folk at PingCap (about its use in TiKV) \u0026 some experience working with TiDB at Databricks.\n\n# What motivated LSM Trees? \nLong story short, in terms of real-world performance, writes are a lot more valuable to optimize for than reads. Especially in database applications. Why? Because writes in other databases (that provide ACID guarantees) often require updating several secondary data-structures like indexes, undo/redo logs and also have to possibly propagate through multiple layers of cache. On the other hand, the world was moving towards SSDs and back then, saving memory update cycles on SSDs was also a key metric to improve. This is, however, not so important today because SSDs can usually sustain way more program-erase cycles compared to then. Regardless, fewer write operations do mean increased longevity for SSDs, and probably for HDDs as well (less arm movement). 
Reads on the other hand usually just need to traverse data-structures to find the location to read from disk / buffer pool (cache).\n\nIn short, the state-of-the-art B+ tree solutions, which are theoretically best for the usual (fetch, insert, delete, range_scan) operations, may not be best in practice because of the skew / asymmetry between write-heavy \u0026 read-heavy workloads. We wanted to optimize for write speed \u0026 efficiency. And that eventually gave birth to the LSM tree. \n\n# What is a LSM Tree? How does it work?\nA picture is worth a thousand words, and a video is worth a thousand pictures... I guess? Regardless, would highly recommend watching [#04 - Database Storage: Log-Structured Merge Trees \u0026 Tuples (CMU Intro to Database Systems)](https://www.youtube.com/watch?v=IHtVWGhG0Xg), at least from 26:53 - 44:59. It's a great description of how LSM trees work. \n\nThere are two parts to an LSM tree. \n1. The **in-memory** section\n2. The **on-disk** section\n## In-Memory\nAn LSM tree primarily functions as a key-value store. So the main operations it's seeking to support are PUT / DELETE operations, but we can also do range scans. Let's start by demoing how `PUT` works on an LSM tree. The in-memory section is mainly what's called a mem-table.\n### Mem-table\nThe \"mem-table\" is an in-memory data structure that is a sort-of cache layer \u0026 the primary receiver of `PUT` operations. It can be any BBST (balanced binary search tree) or any other data structure that supports fast ($O(\\log n)$) insertions, searches \u0026 updates. A hash-map also works. We define a constant \"limit\" for the size of this data structure. Let's say the limit is $6$. Here's what the mem-table looks like after 3 insertions, \n\n![pasted-image-20250414050236](/images/pasted-image-20250414050236.webp)\n\nWe just insert the elements into the BBST as the `PUT` operations arrive. If the same key is updated, say I issued a `PUT(key974, val_69)`, the node containing `key974` is updated. No new node would be created. However, once the BBST hits a size of 6, the BBST is converted into an **immutable** SST (Sorted String Table) and stored to disk. \n\nHere's the mem-table after 6 insertion operations. \n\n![pasted-image-20250414050457](/images/pasted-image-20250414050457.webp)\n\nTo convert this mem-table to an SST, we do a simple linear-time traversal of this tree to obtain the sorted list of keys:\n\n```\nkey100 -\u003e val_13\nkey104 -\u003e val_84\nkey132 -\u003e val_1\nkey605 -\u003e val_62\nkey867 -\u003e val_26\nkey974 -\u003e val_68\n```\n\nThis structure is called an SST. We now declare this structure **immutable**. That means, we **will not** modify this data structure in the future. This SST is then \"flushed\" to disk to be stored in the \"level-0\" layer. More on this later.
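\n\nA toy sketch of this flow (my own simplification, using `std::map` as the BBST and an in-memory vector of key-value pairs standing in for the on-disk SST file):\n\n```cpp\n#include \u003cbits/stdc++.h\u003e\nusing namespace std;\n\nusing SST = vector\u003cpair\u003cstring, string\u003e\u003e; // immutable sorted run, lives on disk\n\nmap\u003cstring, string\u003e memtable; // the in-memory balanced BST\nvector\u003cSST\u003e level0;           // flushed runs, newest last\nconst size_t LIMIT = 6;       // mem-table size limit\n\nvoid put(const string \u0026key, const string \u0026val) {\n\tmemtable[key] = val; // insert or update in O(log n); same key = same node\n\tif (memtable.size() == LIMIT) {\n\t\t// An in-order traversal of the tree is already sorted by key:\n\t\tlevel0.emplace_back(memtable.begin(), memtable.end());\n\t\tmemtable.clear(); // the flushed SST is immutable from here on\n\t}\n}\n\nint main() {\n\tfor (int i = 0; i \u003c 12; i++) put(\"key\" + to_string(i), \"val_\" + to_string(i));\n\tcout \u003c\u003c \"level-0 SSTs: \" \u003c\u003c level0.size() \u003c\u003c '\\n'; // 2\n}\n```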
\n\n## On-Disk\n(Thanks to [Claude](https://claude.ai/) for helping me generate these images with minimal effort.)\n\n**Note:** The below is a description / run-through of a level-based compaction strategy. There are other compaction strategies (tiered, dynamic, etc.) as well. You can check out [RocksDB](/blog/rocksdb) for descriptions of other compaction strategies and how they compare. We'll go with leveled here because it's simple enough and is what the original LevelDB used.\n### SSTs\nThe sorted string tables are stored in what is known as \"levels\" in LSM-tree speak. What each level contains depends on the 'compaction' strategy that the LSM tree uses. For now, let's just focus on what it looks like on disk. The on-disk representation of our previously full mem-table looks as follows:\n\n![pasted-image-20250414051034](/images/pasted-image-20250414051034.webp)\n\nOnce I add 6 more records to our LSM tree, the next SST is constructed and flushed to this \"level-0\" disk storage. Now, we have SSTable 1 \u0026 SSTable 2 in our level-0 storage. \n\n![pasted-image-20250414051144](/images/pasted-image-20250414051144.webp)\n\nAnd that's it. Each time a `PUT` occurs, the new key is added to the mem-table, and the mem-table is periodically flushed to disk as SSTs. Writes are blazing fast because each one is just an insertion into a tiny, in-memory BBST. Pretty much constant. However, reads would suffer a lot because the best we can do is go through every single SSTable on disk and binary search on them. That would be pretty bad. \n\nThis is where the idea of 'compaction' / 'deferred writes' comes in and helps change the equation to benefit read performance by allowing asynchronous or deferred write operations. As you can see, the Level-0 SSTables are of size 5. Let's say we allow Level-1 SSTables to be as big as 8 in size. We can then asynchronously \"merge\" two SSTables (in linear time, using logic similar to the `merge` function in `merge_sort`) to \"compact\" 2 level-0 SSTables into a larger level-1 SSTable. For example, if I compacted the above two SSTables, we would get this:\n\n![pasted-image-20250414051622](/images/pasted-image-20250414051622.webp)\n\nI can insert 12 more new records to let us have 2 new SSTables in level-0 as follows:\n\n![pasted-image-20250414051729](/images/pasted-image-20250414051729.webp)\n\nCompacting them gives us:\n\n![pasted-image-20250414051825](/images/pasted-image-20250414051825.webp)\n\nPay attention to how the SSTables merged. Notice that in level-1, the SSTables are each responsible for **disjoint, non-overlapping** key ranges. That is, the 2 SSTables in level-0 did not just merge and get shoved into level-1 as new SSTable entries. The corresponding entries in level-1 were modified and re-written to disk (both of the original level-1 SSTables were deleted and re-created). This is how we choose to compact. Similarly, if we choose to compact the SSTables in level-1, we can combine and push them out to level-2 like so:\n\n![pasted-image-20250414052137](/images/pasted-image-20250414052137.webp)\n\n**Special note about level-0:** What I said about each SST being responsible for non-overlapping regions holds for every level except level-0. This is mostly an implementation detail, but I believe this is how [RocksDB](/blog/rocksdb) implements it. For level-0 alone, we just flush the mem-table to disk as is. Earlier entries to the left. When merging some SSTables from level-0, we can use the same merging logic the other layers use (level-0 SSTs do not need to govern disjoint key-spaces). I suggest thinking about it yourself, but the idea is just that the SSTables that need to be affected in the $(i+1)^{th}$ level are still constant regardless of which 2 level-0 SSTs I pick to merge. The only caveat is that level-0 SSTables cannot be binary searched on as a group. All of the search strategies I describe below implicitly assume that we do a full linear scan over each of the level-0 SSTs. (We can binary search **inside** an SST, but I still need to check every SST in level-0). This is mostly a non-issue since level-0 is the most recently written-to section of disk, which most likely means that the SSTs are in the cache / buffer-pool \u0026 very quick to linear search on. 
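\n\nThe merge step itself is just the two-pointer `merge` from `merge_sort`, with one twist: on a duplicate key, the version from the newer run must win (why that is correct is discussed under MVCC below). A sketch, again with toy in-memory structures:\n\n```cpp\n#include \u003cbits/stdc++.h\u003e\nusing namespace std;\n\nusing SST = vector\u003cpair\u003cstring, string\u003e\u003e; // sorted, immutable run\n\n// Merge two runs in linear time; on a duplicate key, `newer` shadows `older`.\nSST compact(const SST \u0026newer, const SST \u0026older) {\n\tSST out;\n\tsize_t i = 0, j = 0;\n\twhile (i \u003c newer.size() || j \u003c older.size()) {\n\t\tif (j == older.size() || (i \u003c newer.size() \u0026\u0026 newer[i].first \u003c older[j].first))\n\t\t\tout.push_back(newer[i++]);\n\t\telse if (i == newer.size() || older[j].first \u003c newer[i].first)\n\t\t\tout.push_back(older[j++]);\n\t\telse { // same key: keep only the newer version\n\t\t\tout.push_back(newer[i++]);\n\t\t\tj++;\n\t\t}\n\t}\n\treturn out;\n}\n\nint main() {\n\tSST a = {{\"key104\", \"val_85\"}, {\"key132\", \"val_1\"}};  // newer run\n\tSST b = {{\"key100\", \"val_13\"}, {\"key104\", \"val_84\"}}; // older run\n\tfor (auto \u0026[k, v] : compact(a, b)) cout \u003c\u003c k \u003c\u003c \" -\u003e \" \u003c\u003c v \u003c\u003c '\\n';\n\t// key100 -\u003e val_13, key104 -\u003e val_85, key132 -\u003e val_1\n}\n```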
\n#### MVCC\nLet's quickly understand the important implications of this kind of data structure. This data structure implicitly implements MVCC (multi-version concurrency control). Well, the concurrency-control part might not be relevant here, but the point is that multiple 'versions' of a key over different instances in time may exist in the data structure. That is, given our current data structure state, if I perform a `PUT(key104, val_85)` operation, there will be 2 instances of `key104` in the data structure. One in level-2, and one in the mem-table. This is fine, because when reading, the version that is found earliest (closest to the mem-table) is the dominant / correct value to read from the data structure.\n\nSimilarly, when merging, the version from the newer, upper level overwrites the value of the key in lower levels. We never have trouble working out same-level version problems because we established the invariant during merging that each SSTable in a level is responsible for non-overlapping key spaces. \n#### Point-`GET`\nHow do point queries work here? We need to work our way from the mem-table to each of the levels in increasing order. We cannot skip levels because earlier versions of a key are the correct (most up-to-date) ones. If an entry is found in the mem-table, we are done. Otherwise, we have to scan each level one-by-one. To scan any level $i \\ (i \\gt 0)$, we can exploit the property that each SST is responsible for non-overlapping sections of the key space. That is, we can binary search on the SSTables in each level to find the only SST (if one exists) that is capable of containing the key. \n\nTo facilitate this binary search, each \"level\" of an LSM tree keeps some metadata in a 'summary table' that records the minimum \u0026 maximum key each SSTable in the level is responsible for. These metadata values can be updated on-the-fly in $O(1)$ during the merge operations. \n\nRemember that in the worst case (the key does not exist in the LSM tree), we still need to go through every level. So in the worst case, for each level we perform one binary search to find the right SSTable to look at, and then perform a binary search **on** that SSTable to find the key (if it exists). However, we can do even better, but we'll get to this in a later section. 
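\n\nA sketch of that two-step lookup within a single level (toy structures of my own; the 'summary table' here is just a sorted array of (min, max) ranges):\n\n```cpp\n#include \u003cbits/stdc++.h\u003e\nusing namespace std;\n\nusing SST = vector\u003cpair\u003cstring, string\u003e\u003e; // sorted, immutable run\n\n// One level (\u003e 0): SSTs cover disjoint key ranges. The summary table stores\n// each SST's [min, max] key range, kept sorted by range.\nstruct Level {\n\tvector\u003cpair\u003cstring, string\u003e\u003e summary; // summary[i] = {min key, max key} of ssts[i]\n\tvector\u003cSST\u003e ssts;\n};\n\noptional\u003cstring\u003e point_get(const Level \u0026lvl, const string \u0026key) {\n\t// Binary search #1 on the summary table: first SST whose max key is \u003e= key.\n\tauto it = lower_bound(lvl.summary.begin(), lvl.summary.end(), key,\n\t                      [](const pair\u003cstring, string\u003e \u0026r, const string \u0026k) { return r.second \u003c k; });\n\tif (it == lvl.summary.end() || it-\u003efirst \u003e key) return nullopt; // no SST can hold it\n\tconst SST \u0026s = lvl.ssts[it - lvl.summary.begin()];\n\t// Binary search #2 inside that single candidate SST.\n\tauto e = lower_bound(s.begin(), s.end(), make_pair(key, string()));\n\tif (e != s.end() \u0026\u0026 e-\u003efirst == key) return e-\u003esecond;\n\treturn nullopt; // miss: the caller falls through to the next level\n}\n\nint main() {\n\tLevel l1 = {{{\"key100\", \"key132\"}, {\"key605\", \"key974\"}},\n\t            {{{\"key100\", \"val_13\"}, {\"key104\", \"val_84\"}, {\"key132\", \"val_1\"}},\n\t             {{\"key605\", \"val_62\"}, {\"key867\", \"val_26\"}, {\"key974\", \"val_68\"}}}};\n\tcout \u003c\u003c point_get(l1, \"key104\").value_or(\"miss\") \u003c\u003c '\\n'; // val_84\n\tcout \u003c\u003c point_get(l1, \"key500\").value_or(\"miss\") \u003c\u003c '\\n'; // miss\n}\n```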
\n#### Range-`GET`\nI said we can support range scans too. But how? Let's reason out how we can do range scans on an LSM tree. First things first, the hard constraints: we have to check every level. Just **one** of the keys in the range we're scanning might've been written a long time ago and might have been compacted down to the last layer. So we have to scan every layer. \n\nThe first obvious idea might be to:\n1. Binary search each level for the range's lower bound to identify the first valid key in that level in logarithmic time\n2. Aggregate all the valid keys for each level one by one\n3. Merge sort the final results\n4. Send them back to the client\n\nThis would work, but it's pretty inefficient since we need to aggregate and merge possibly lots and lots of keys. (What about a range scan over the whole table?)\n\nInstead, we can use a more online / streaming method. Consider the following picture as the current state of our LSM tree, and say we are trying to stream the results for a `RANGE_SCAN(key640, key694)`. I'm going to assume level-0 is sorted here for simplicity; in practice, level-0 would need some aggregation logic.\n\n![pasted-image-20250414130416](/images/pasted-image-20250414130416.webp)\n\nThe results of our binary search identify the three highlighted keys, one per level. `key640` from level-0, `key707` from level-1 \u0026 `key692` from level-2. Now, how can we avoid aggregating results for each run and merging them offline? We need to make our queries work online but without much space / time overhead. What properties can we use here? Let's assume each insertion was a unique key first.\n\n1. Note that in each level, advancing the 'iterator' past the currently highlighted cell will always give us a key that is **greater** than the currently highlighted key\n2. This means that the first valid key in our range scan will always be the minimum value among the highlighted cells (in this case, `key640`).\n3. Let's denote $key^{L}_{\\geq k(i)}$ as the **first** key record in level $L$ that is greater than or equal to the key value of the $i^{th}$ record. Then we can see that, in fact, after some record $i$ is streamed back to the client, the **next** record to stream back will always be the $\\min(key^{L}_{\\geq k(i)})_{\\forall L}$. \n4. Remember our iterators in the above diagram? We define that our iterators will always be present at position $key^{L}_{\\geq k(i)}$ after the streaming of record $i$. Is this easy to maintain?\n5. Yes, once `key640` is streamed, just move the iterator by 1 to `key778` in level-0. Remember that we always stream the $\\min(key^{L}_{\\geq k(i)})_{\\forall L}$ key. This key **must** belong to some level $L$. We can then always move the iterator by 1 to the next key in this level, and this key **must** be the $key^{L}_{\\geq k(i)}$ for that level, because the previous record streamed was $k(i)$ and by definition, this key is the next greater element in that level. \n\nSo back to our diagram, taking the min over the values again, we would see that the next key to be streamed is `key692`:\n\n![pasted-image-20250414131805](/images/pasted-image-20250414131805.webp)\n\nLet's stream it and move the iterator forward. Taking the min again, we stream `key694`. After that, all the keys are greater than the $R$ of our query range $[L, R]$. So we are done.\n\n![pasted-image-20250414131747](/images/pasted-image-20250414131747.webp)\n\nAlso see [FAST '21 - REMIX: Efficient Range Query for LSM-trees](https://www.youtube.com/watch?v=9F4AzqBp8Ng) for more ideas / follow-up reading (or watching).
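\n\nA compact sketch of this streaming merge (my own toy code: each level is just a sorted array here, with level-0 assumed sorted as above; the $(key, level)$ tie-break in the heap comparator and the duplicate-skipping are justified in the multi-versioning discussion below):\n\n```cpp\n#include \u003cbits/stdc++.h\u003e\nusing namespace std;\n\nusing Run = vector\u003cpair\u003cstring, string\u003e\u003e; // one sorted level\n\n// Stream RANGE_SCAN(lo, hi) across several sorted runs using a min-heap of\n// per-level iterators, ordered by (key, level) so the newest version of a\n// duplicated key wins.\nvoid range_scan(const vector\u003cRun\u003e \u0026levels, const string \u0026lo, const string \u0026hi) {\n\tusing Entry = tuple\u003cstring, size_t, size_t\u003e; // (key, level, index within run)\n\tpriority_queue\u003cEntry, vector\u003cEntry\u003e, greater\u003c\u003e\u003e heap;\n\tfor (size_t L = 0; L \u003c levels.size(); L++) {\n\t\t// Binary search each run for its first key \u003e= lo.\n\t\tauto it = lower_bound(levels[L].begin(), levels[L].end(), make_pair(lo, string()));\n\t\tif (it != levels[L].end()) heap.emplace(it-\u003efirst, L, it - levels[L].begin());\n\t}\n\tstring last; bool first = true;\n\twhile (!heap.empty()) {\n\t\tauto [key, L, idx] = heap.top(); heap.pop();\n\t\tif (key \u003e hi) break;        // past the right end of the range\n\t\tif (first || key != last) { // equal keys popped later are older versions\n\t\t\tcout \u003c\u003c key \u003c\u003c \" -\u003e \" \u003c\u003c levels[L][idx].second \u003c\u003c '\\n';\n\t\t\tlast = key; first = false;\n\t\t}\n\t\tif (idx + 1 \u003c levels[L].size()) // advance this level's iterator by one\n\t\t\theap.emplace(levels[L][idx + 1].first, L, idx + 1);\n\t}\n}\n\nint main() {\n\tvector\u003cRun\u003e levels = {\n\t\t{{\"key640\", \"v0\"}, {\"key778\", \"v0\"}},    // level-0 (newest)\n\t\t{{\"key694\", \"v1\"}, {\"key707\", \"v1\"}},    // level-1\n\t\t{{\"key640\", \"stale\"}, {\"key692\", \"v2\"}}, // level-2 (oldest)\n\t};\n\trange_scan(levels, \"key640\", \"key694\"); // key640 -\u003e v0, key692 -\u003e v2, key694 -\u003e v1\n}\n```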
\n#### DELETE\nLSM trees were designed to be very fast for insertion queries. But remember that every SSTable is, by definition, **immutable**. This means that we cannot (should not) modify any of the SSTable files for correctness reasons. Remember that SSTables are just lists, so deleting a record from an SSTable, even if we drop the immutability constraint, is NOT cheap (and if you suggest storing SSTables as BBSTs... we might as well be using B+ trees).\n\nGiven our immutability constraint, the only way to delete entries is by introducing *tombstone* keys: a record with a special bit turned on to signify that it's a tombstone. Then when scanning / reading keys, if the latest version of the key (the first one the read path encounters) is a tombstone, we just pretend the key doesn't exist.\n### WAL\nGreat. Things almost work for database-level applications. But there's one thing we cannot guarantee with the above construction: durability. If the server the mem-table is on crashes before the BBST in the mem-table is flushed to disk as an SST, we lose all the data in the mem-table, which is bad, for obvious reasons. So we just use the age-old trick and add a write-ahead-log (WAL) to the construction. All `PUT/DELETE` operations are first persisted to a WAL on disk (constant time add) before the operation returns as successful. Additionally, we also persist other useful metadata, like when the last SSTable was flushed. Then if the server ever crashes, we can just reconstruct the mem-table using the operations in the WAL after the last successful SSTable flush. And there, we have durability.
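\n\nThe durable write path then looks roughly like this (a simplified sketch; real engines batch WAL appends and use `fsync`-style primitives rather than a plain stream flush):\n\n```cpp\n#include \u003cfstream\u003e\n#include \u003cmap\u003e\n#include \u003cstring\u003e\n\nstruct KVStore {\n\tstd::ofstream wal; // append-only log on disk\n\tstd::map\u003cstd::string, std::string\u003e memtable; // stand-in for the mem-table BBST\n\n\tbool put(const std::string \u0026k, const std::string \u0026v) {\n\t\twal \u003c\u003c \"PUT\\t\" \u003c\u003c k \u003c\u003c '\\t' \u003c\u003c v \u003c\u003c '\\n';\n\t\twal.flush(); // persist the operation *before* acknowledging it\n\t\tmemtable[k] = v; // only then apply it in memory\n\t\treturn true;\n\t}\n};\n```\n\nRecovery is then just replaying every WAL entry newer than the last flushed SSTable into a fresh mem-table.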
\n\n# Okay... How does it compare to a B+ Tree? (LSM Trees vs B+ Trees)\n## Preface\n### RUM Conjecture\nThere's an open conjecture called the RUM conjecture (Read, Update, Memory) which suggests that there's an inherent three-way tradeoff between read efficiency, update efficiency and memory / space overhead. A variation of RUM is the RWS conjecture which suggests the same three-way tradeoff, but for: read, write \u0026 space amplification. [Designing Access Methods: The RUM Conjecture](https://www.eecs.harvard.edu/~kester/files/rum_conjecture.pdf). Need to read sometime, but I'll just accept it for now (sounds pretty reasonable). \n### How do you define \"Amplification\"?\n**Note:** For the following section, we're comparing against the B+ tree implementation used by MySQL engines like InnoDB, which is designed to handle large volumes of data on-disk. In particular, B+ tree \"nodes\" or \"pages\" are stored on disk and reads and writes happen in page units. \n\nTo compare LSM \u0026 B+ Trees, we need to define the metrics we're using to compare. However, read / write / space amplification is kind of ambiguous and can be measured using many different metrics. In theory, you can say read / write amplification are something like \"$x$ units of work / logical request\" \u0026 space amplification is how much space the database files take up with respect to the size of the keys inserted. But you can refer to \"work\", \"operation\" \u0026 \"space\" using many different measurable metrics. \n\nWe're going to define them as follows:\n- **Write Amplification:** The ratio of the amount of data *actually written* to the storage device versus the amount of data *logically written* to the database by the application. Example: Writing 10MB logically results in 30MB written to disk -\u003e Write Amplification = 3. \n- **Read Amplification:** The number of *disk reads* required per logical query. Example: Reading 5 distinct pages from disk for one query -\u003e Read Amplification = 5. This often differs for point lookups vs. range scans. Also note that caching can significantly impact this, but we'll just consider cold-cache cases. \n- **Space Amplification:** The ratio of the *physical space* used on the storage device versus the *logical size* of the data in the database. Example: 10MB of logical data takes 100MB on disk -\u003e Space Amplification = 10.\n### How do you define \"Attribution\"?\nAnother difference is that, in the case of LSM trees, assigning query-level attribution to amplification is much harder, because based on what you define as amplification, LSM trees either have **very low** write amplification, or **very high.** Why? Because of the asynchronous nature of \"deferred compaction\" in the LSM tree. In a B+ tree, the write IOPS introduced by a single write operation is easy to measure because it affects a few log files in place, and a few pages are written to cache buffers / disk before the operation atomically commits. This is something we can measure because these write IOPS are all **attributed** to a single write operation. This is not the case in an LSM tree. Certain **patterns** of writes can cause a lot more *amplification* during the compaction stage. The large amount of IOPS consumed during deferred compaction cannot be attributed to a single write operation. \n## Analysis\n### B+ Trees\nB-Trees (\u0026 B+ Trees) are great for fast insertion, search \u0026 delete operations (in theory). These are the exact same APIs we want our KV stores to support. In theory, you cannot achieve a better complexity than $\log_d(n)$, where $d$ is the branching factor, for these operations. Here's a sample B+ Tree:\n\n![pasted-image-20250122180411](/images/pasted-image-20250122180411.webp)\n- [B+ Tree - Wikipedia](https://en.wikipedia.org/wiki/B%2B_tree)\n\nThe leaf nodes of a B+ tree contain the data records. The other nodes in the tree are internal nodes and contain a variable (bounded by the branching factor) number of child nodes. The internal nodes only contain key values and are also sorted by key. They point to some pre-defined range in the key-space. Also, the leaf nodes are linked to one another to allow fast range scans. 
\n\nTo analyze the performance characteristics, particularly amplification effects, we establish the following parameters: \n* $N$ represents the total number of records stored in the database. Assume records are of approximately constant $O(1)$ size. \n* $B$ denotes the capacity of a leaf node block (page) in terms of the number of records it can store. Since we assume records are $\approx O(1)$ in size, each page stores $O(B)$ records. \n* $D$ represents the branching factor of the internal nodes, signifying the maximum number of child pointers an internal node can contain (including leaf nodes).\n#### Read Amplification\nIf the entire B+ tree fit in memory, it would indeed be great. However, in large data use-cases like in InnoDB, the minimum unit of interaction with the storage layer is a database page. A page in a database may be around 4KB or higher (configurable) and contains multiple row entries. The same is true for all indexes. This means that to read a single row, you need to read the entire page into memory. And similarly, to write a single row, you need to write the entire page back to disk. In the worst case scenario, let's say a single row is 4KB, your page size is 128KB, and every row read belongs to a unique page; your *read amplification* is $32 \times$. Not ideal. \n\nLet's assume that the block size is $O(B)$ \u0026 that the branching factor of each node is $O(D)$. That is, each node contains $O(D)$ children (including the leaf nodes). Let's also assume that the size of all records, pointers, etc. is constant. Then the total number of nodes my tree needs to maintain $O(N)$ records is $O(\frac{N}{B})$. Given a branching factor of $D$, the depth of my tree is $O(log_D(\frac{N}{B}))$. A point lookup on a B+ Tree requires traversing the tree from the root down to a leaf page. Since the height of the tree is approximately $O(log_D(\frac{N}{B}))$, a single query requires reading one page at each level of the tree. This gives a read amplification of $O(log_D(\frac{N}{B}))$ disk pages. For range scans, once the starting leaf page is located, the subsequent pages can be read sequentially using the sibling pointers, which is efficient. \n#### Write Amplification\nFor every write of a record, we would need to write the entire page back to disk. This means the write amplification is $O(B)$ (records are constant size). \nHowever, also note that technically, if we insert into a node that's already full, we'd trigger a split operation. This split operation can further cascade up the height of the tree, triggering more splits. Given that the height of the tree is $O(log_D(\frac{N}{B}))$, in the worst case, we'd trigger $O(log_D(\frac{N}{B}) \times B)$ writes. However, for a B+ tree, amortized over $O(N)$ insertions, the number of splits per insertion is $O(1)$, so this doesn't matter as much. (Remember that the tree only ever contains $O(\frac{N}{B})$ nodes after $O(N)$ insertions, and every split creates a node, which implies an amortized cost of $O(1)$ splits per insertion.) \n#### Space Amplification\nThe tree (after $N$ insertions) contains $O(\frac{N}{B})$ nodes (pages), which has a memory footprint of $\approx O(N)$. However, the constant factor is likely a fair bit higher than 1. To avoid costly page splits on every insert, pages are often left partially empty. I can't recall / find the source for this claim, but I remember that on average, B+ Tree pages are about 67% to 75% full... If my memory serves correctly :)
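\n\nTo put rough numbers on all three bounds (my own illustrative parameters, not measurements from any real system): say $N = 10^9$ records, $B = 100$ records per page and $D = 100$. Then,\n\n$$ \text{read amp} \approx log_D(\frac{N}{B}) = log_{100}(10^7) = 3.5, \quad \text{write amp} \approx B = 100, \quad \text{space amp} \approx \frac{1}{0.7} \approx 1.4 $$\n\nThat is, roughly 4 page reads per cold point lookup, a whole page's worth of bytes written per logically written record, and about $1.4\times$ the logical data size on disk at $\approx 70\%$ page occupancy.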
\n### LSM Tree\nWe're going to make some assumptions here to simplify the analysis, especially since the deferred nature of writes requires us to do amortized analysis to understand the amplification factors. We're modeling leveled compaction here. We assume each level to have exactly 1 'run'. The new assumptions we make for simplifying analysis are as follows:\n1. **When does compaction occur?** \n When the size of a level $L_i$ reaches some defined constant limiting size $S_i$. \n2. **How is $S_i$ modeled?** \n We define a constant 'scaling / fanout factor' $k$ and define $S_0 = c$ (where $c$ is some constant) and then define the relationship between the levels as $S_{i+1} = S_i \cdot k$. So for example, if the number of records we allow in $L_0$ is $6$, and we define $k = 2$, then $L_1$ would fit $12$ records, $L_2$ would fit 24, $L_3$ 48, and so on. \n#### Read Amplification\nA point query, in the worst-case scenario (the key doesn't exist), must check every level of the tree. The search path is:\n\n1. Check the active in-memory mem-table. \n2. Check Level 0 on disk. Since Level 0 SSTs can have overlapping key ranges, we may have to check every SST file in this level.\n3. For every subsequent level ($L_1, L_2, \dots, L_{max}$), we find the single SST that could contain the key and check it.\n\nThis means a single logical read can turn into many physical disk reads, making the worst-case read amplification proportional to the number of levels. To mitigate this, most LSM implementations use [Bloom Filters](/blog/bloom-filters). A Bloom filter is a probabilistic data structure that can quickly tell you if a key might be in an SST file, or if it is definitely not. By checking the Bloom filter for an SST (which is small and kept in memory), we can avoid most of the expensive disk reads for keys that don't exist in that file. \n\nLet's get into a little more detail here. Remember our assumptions: We're using a leveled compaction strategy with a fanout factor $k$, where each level $L_{i+1}$ can hold roughly $k$ times more data than $L_i$. The number of levels is approximately $O(\log_k (N / c))$, where $c$ is the mem-table size limit and $N$ is the total number of records. For point queries:\n\n1. You have to check the mem-table first. This is fast and in-memory: $O(\log(c))$ if it's implemented as a skip-list or BST. You could probably implement it as a hash table and get $O(1)$ as well. But regardless, it's some constant time operation and not very relevant, so the exact DS implementing it doesn't matter much. \n \n2. You then need to scan all the SSTs in level-0. This is again, some fixed value. All the SSTs in $L_0$ are pretty small in size and very likely to be in cache since they are the most likely to be read / written from (hot data). Note that most SST scans would be skipped thanks to the bloom filter. And the ones that are scanned are done in $\approx$ constant time. We can more or less consider this set of scans $O(c)$ as well. \n\n3. For higher levels, we need to identify the right SST. This is done by binary searching on the level's metadata (which contains the start and end keys of each SST in that level). We have to do one such search per level, and with $N$ records and a mem-table limit of $c$, there are $\approx \log_k(\frac{N}{c})$ levels. \n \n a. For the highest level $L_{max}$, the data size is $\approx O(N)$, spread over $\approx \frac{N}{c}$ SSTs. So binary searching on the metadata here is $\approx O(\log(\frac{N}{c}))$. \n b. 
For the previous level $L_{max-1}$, the data size is $\approx O(\frac{N}{k})$. So binary searching here is now $\approx O(\log(\frac{N}{kc}))$ \n c. And so on...\n \n Summing this up, we get the total number of disk reads required as: $O(\log(\frac{N}{c})) + O(\log(\frac{N}{kc})) + O(\log(\frac{N}{k^2c})) + \cdots + O(\log(\frac{N}{k^nc})) = O(\frac{\log^2(\frac{N}{c})}{\log k})$ \n \n But a couple of things come in handy here. One, you can keep the metadata in memory and do all your binary searching on it instead of opening each SST. This makes the binary searches essentially free in terms of disk reads, at the cost of some additional (constant) space / extra work during writes. Further, you'd assume scanning each SST is a disk read, but the bloom filters come in very handy here. Since they help determine (with high probability) if a key is present or not in an SST, they reduce a lot of unnecessary SST opens. Also, since hotter (more recent) data is in upper levels or the mem-table, caching helps a lot. So you could roughly say it's only $\approx O(\log_k(\frac{N}{c}))$. \n\nFor range scans, you may need to merge results from iterators across **all** levels. Think about it this way: you identify the range $[ST, EN]$ in each level that may contain the keys belonging to the result set (all of $L_0$). However, since we have MVCC, we can have multiple versions of keys, and we could have a situation where $ST \lt a \lt b \lt EN$ but $a \in L_i$ and $b \in L_{j \gt i}$. So we have to somehow read across all levels and merge results together. One common solution is to have essentially a \"merge operator.\" You have an iterator at the first element greater than or equal to $ST$ in each level (and less than or equal to $EN$). You put the elements each iterator points to in a \"merge\" priority-queue (with timestamp). The smallest element is popped and streamed to the result set. The iterator that was pointing to this element moves forward and we repeat until each iterator has crossed $EN$. This doesn't really change the complexity much. For $R$ records, you can expect $O(R \cdot \log_k(\frac{N}{c}))$ complexity (the priority queue would be of size = number of levels).\n#### Write Amplification\nThe complexity for a single write is constant: $O(1)$ to append to the WAL, plus the time complexity of the chosen DS for the mem-table. In any case, it's constant. However, most of the write amplification related to LSM trees comes from the compaction phase. To model amplification here, we need to try to understand how many disk writes a single `write(k, v)` operation triggers over the life span of the record $k$. Initially, a single disk-write happens when the mem-table is flushed as an SSTable to disk. After that, every time it's compacted, we have a disk write. Let's think about what happens during compaction from $L_i$ to $L_{i+1}$:\n\n1. Select the SST from $L_i$ for compaction. Let's assume the size is $S_i$ records.\n2. Identify overlapping key ranges in $L_{i+1}$. \n3. Read _all_ records from the selected $L_i$ SST and the overlapping $L_{i+1}$ SSTs into memory (SSTs are immutable).\n4. Merge them: Sort, resolve duplicates (keep the latest version for MVCC), drop tombstones if they cover older data.\n5. Write the merged result as _new_ SSTs back to $L_{i+1}$.\n6. Delete the old SSTs from both levels.\n\nThe bytes written can't really be made sense of on a *per-key* basis. 
However, during this compaction, you can say that the initial $S_i$ records contributed to a write of size $S_i$ + the records in the SSTables in level $L_{i+1}$ that it was merged with. You can compute amplification here for all of those $S_i$ records as $\frac{\text{total records written}}{S_i}$. Let's see how to compute this better. \n\nNote that all levels span the same *key space*. However, because $L_{i+1}$ holds $k$ times more data than $L_i$, the **data density** (number of records per unit of key space) is $k$ times higher in $L_{i+1}$. Let's assume the size of the key space is $P$. If we compute the data densities for $L_i$ and $L_{i+1}$:\n\n1. $L_i$: The density is $\frac{S_i}{P}$. \n2. $L_{i+1}$: Compared to $L_i$, the number of records in this level is now $\approx S_i \cdot k$. So the density is $\frac{S_i \cdot k}{P}$. \n\nLet's assume keys are uniformly random for the sake of analysis. When we select $S_i$ records from $L_i$ for compaction, these records span some key space $P' \subset P$. The width of this $P'$ depends on the spread of keys, but for the uniformly random case, we can expect the number of records in $L_i$ over $P'$ to be $S_i$. So in $L_{i+1}$, because the density is $k$ times that of $L_i$, we can assume that the same range $P'$ will contain $\approx k \cdot S_i$ records. \n\nSo, if we look at the original calculation:\n\n$$\n\frac{\text{total records written}}{S_i} = \frac{S_i + k\cdot S_i}{S_i} = \frac{S_i\cdot (k+1)}{S_i} = k+1 \approx k \quad \text{for large } k\n$$\n\nThis happens _at each level_ a record passes through. A record starts at $L_0$, gets compacted to $L_1$ (rewritten with amplification $\approx k$), then later, when that part of $L_1$ compacts to $L_2$, it is rewritten again with amplification $\approx k$, and so on, down $\log_k(\frac{N}{c})$ levels. This gives us a total write amplification $\approx k \cdot \log_k (\frac{N}{c})$. Note that this is average case since we assume uniformly random key distributions. However, in practice, the fact that compaction happens \"async\" and that SSTables are immutable and compressible gives **huge** practical write wins in comparison to a B+ tree. Merges can be done in parallel and in the \"background\", making LSMs much faster for writes. \n#### Space Amplification\nSpace amplification in an LSM Tree comes from data that is no longer \"live\" but has not yet been garbage collected by compaction. This includes old versions of updated rows and tombstone markers for deleted rows. Also note that each SSTable in a level in an LSM is responsible for some contiguous key range. During compaction, we ensure that only one version of the key (the latest or a tombstone) is preserved. This means that in the worst case, a single key can have stale versions copied over once per level. So worst case, we can expect the space amplification to be $O(\log_k(\frac{N}{c}))$. However, in this [benchmark](https://smalldatum.blogspot.com/2016/01/summary-of-advantages-of-lsm-vs-b-tree.html) by [Mark Callaghan](https://www.blogger.com/profile/09590445221922043181) on MyRocks (a MySQL engine based on RocksDB, which is based on LSM trees) vs InnoDB (the default MySQL engine):\n\n\u003e While an LSM can waste space from old versions of rows, with leveled compaction the overhead is ~10% of the database size compared to between 33% and 50% for a fragmented B-Tree and I have confirmed such fragmentation in production. 
MyRocks also uses less space for [per-row metadata than InnoDB](http://dev.mysql.com/doc/refman/5.7/en/innodb-multi-versioning.html). Finally, InnoDB disk pages have a fixed size and more space is lost from rounding up the compressed page output (maybe 5KB) to the fixed page size (maybe 8KB).\n\n## Immutability\nI figured this is worth spending an H2 heading on :) Most media marketing praises LSMs for their very fast write speeds and efficient usage of disk. But one very important feature that's not appreciated enough is their immutability. To be fair, I didn't really give it much thought either until I met [Sunny Bains](https://www.linkedin.com/in/bainssunny/) during a PingCap event in Bangalore and he brought this up. All the fast writes and amplification stuff is good, but one of the best selling points of the LSM was the design decision to make SSTables immutable. This has a bunch of very profound benefits:\n\n1. **Simplicity:** That one decision allows many things to become extremely simple. For example, concurrency is very simple to implement on top of SSTs since they're immutable. There are no locks to grab or any other kind of contention to deal with, so concurrent accesses are super simple to implement (in comparison to the shared + exclusive lock complications of a B+ tree). This parallelization allowed Meta (then Facebook) to significantly optimize the compaction stage and is one of the key reasons why LSMs perform so well against B+ trees today. \n \n2. **Crash Safety:** This also means that things like crashes are easy to recover from. A single WAL is enough to ensure durability. If a system crashes mid-compaction, it doesn't matter, since the original SSTables are still intact and valid. We can always resume from a WAL instead of doing pointer shenanigans or maintaining more complicated logs; a crash during a split operation or a lock release / grab propagating up a B+ tree is much more difficult to model. \n\n3. **Efficient Compression + Backup/Restore:** Immutable files are easy to compress / cache. The contents never change, so you don't need any complex cache invalidation logic. Every SSTable (not in cache) is heavily compressed when stored on disk. Only the SSTables which move to memory are uncompressed. This gives significant space savings and also means disk space can be used better (\u003c- this is huge). Further, implementing backup / restore type operations is super easy since you can just copy the current set of live SSTables as-is to S3 or something and you're mostly good for the backup. You don't need to pause writes (you can pause compactions for a short while instead). \n\nIn short, immutability makes so many things awesome and simple, and they all convert to pretty important wins for the LSM eventually. ", - "lastmodified": "2026-01-11T05:06:14.194277607+05:30", + "lastmodified": "2026-01-11T10:59:26.66383188+05:30", "tags": [ "domain-cs-algorithms-analysis", "domain-cs-systems-databases" @@ -265,7 +265,7 @@ "/blog/master-s-theorem-strassen-s-matrix-multiplication-kth-order-statistics": { "title": "Master's Theorem, Strassen's Matrix Multiplication \u0026 KTH-Order Statistics", "content": "# Master's Theorem\n\nSolving recurrence relations can prove to be a difficult task, especially when there are many terms and factors involved. The master's theorem is a very useful tool to know about when trying to prove operation bounds for _divide and conquer_ type algorithms. 
Most such algorithms have some constant factor by which they divide the initial input a certain number of times and recursively perform some operation.\n\nIn general, the master's theorem states that:\n\n$$ \text{If } T(n) = aT(\frac{n}{b})+O(n^d) \text{ for some constants } a \gt 0, \ b \gt 1, \text{ and } d \geq 0 $$\n\n$$ T(n) = \begin{cases} O(n^d) \ \text{if} \ d\gt log_ba \\ O(n^dlogn) \ \text{if} \ d = log_ba \\ O(n^{log_ba}) \ \text{if} \ d \u003c log_ba \end{cases} $$\n\n## A visual depiction of the proof\n\n![masters-theorem-tree](/images/masters-theorem-tree.webp)\n\n## Proof of Master's Theorem\n\nWe can see that the size is divided by $b$ on every level. Therefore, for the size $n$ to go to 1, it will take $log_{b} n$ divisions. Therefore, the depth of the tree is $log_{b} n$. Also, the number of nodes at level $k$ is $a^k$, therefore the number of leaf nodes is $a^{log_{b} n} = n ^{log_{b}a}$.\n\nAt the root level, you have 1 node. This node needs to do the following recursive operations:\n\n$$ T(n) = O(n^d) + aT(\frac{n}{b}) $$\n\nThe number of operations the algorithm performs for some input $n$ is essentially defined by this quantity $T(n)$. We notice that for the Master's theorem, my function is recursively defined. The $O(n^d)$ term is the number of operations I'm doing at the node my recursive algorithm is on. So if I visualize this as a tree, it will have $log_bn$ depth because that's how many times my original input **n** gets divided by $b$.\n\nThe 1 node at the root splits into $a$ nodes at the next level. At the next \"level\" of my recursion tree, each of these $a$ nodes will split into another set of $a$ nodes. So the number of nodes at the $k^{th}$ level is $a^k$. At the $k^{th}$ level, the work done is the work accumulated over all of my $a^k$ nodes, each of which does $O(n'^d)$ work for its own input size $n'$. The input size $n'$ at the $k^{th}$ level is $\frac{n}{b^k}$ in terms of the original input $n$ ($n$ has been divided by $b$ at each level). So the accumulation of work done at the $k^{th}$ level is essentially\n\n$$ a^k \times O(\frac{n}{b^k})^d = O(n^d)\times(\frac{a}{b^d})^k $$\n\nNow if we take the sum of this quantity over all $log_bn$ levels, we notice that this is just a geometric series with first term $a = O(n^d)$ and ratio $r = \frac{a}{b^d}$.\n\nCalculating the geometric series will give us the following three results for three different cases of our ratio $\frac{a}{b^d}$.\n\n### Cases:\n\n1. $\frac{a}{b^d} \lt 1 \implies a\lt b^d \implies log_ba \lt d$\n \n The series is decreasing and the dominant term is our first term. This gives us the result, $T(n) = O(n^d)$\n \n2. $\frac{a}{b^d}=1 \implies a = b^d \implies log_ba=d$\n \n In this case, there are exactly $O(log_bn)$ terms in the series (depth of the tree) and each term is equal to $O(n^d)$. This gives us a simple summation,\n \n $T(n) = O(n^dlog_bn)$\n \n3. 
$\frac{a}{b^d} \gt 1 \implies a \gt b^d \implies log_ba \gt d$\n \n The series is increasing and the dominant term will be the last term of the series.\n \n $$ n^d(\frac{a}{b^d})^{log_bn} = n^d(\frac{a^{log_bn}}{(b^{log_bn})^d}) = n^d(\frac{a^{log_bn}}{n^d}) \\ a^{log_bn} = a^{log_ba \cdot log_an} = (a^{log_an})^{log_ba} = n^{log_ba} $$\n \n This gives us the result,\n \n $T(n) = O(n^{log_ba})$\n# Matrix Multiplication\n\nNaïve Algorithm: $O(n^3)$\n\n**Strassen's: $O(n^{log_{2}7})$**\n\nWe imagine each of the two matrices we have to multiply as consisting of 4 $\frac{n}{2} \times \frac{n}{2}$ submatrices.\n\n$$ X = \begin{bmatrix} A \u0026 B \\ C \u0026 D \end{bmatrix}, Y = \begin{bmatrix} E \u0026 F\\ G \u0026 H \end{bmatrix} \\ XY = \begin{bmatrix} A \u0026 B \\ C \u0026 D \end{bmatrix} \begin{bmatrix} E \u0026 F\\ G \u0026 H \end{bmatrix} = \begin{bmatrix} AE+BG \u0026 AF+BH \\ CE+DG \u0026 CF+DH \end{bmatrix} $$\n\nNotice that this multiplication ends up requiring us to calculate **8** products of such submatrices and 4 additions. It is evident that multiplication is the bottleneck here. For such an algorithm, we have $T(n) = 8T(n/2) + O(n^2)$, which makes the time complexity $O(n^3)$ as per Master's theorem (as $log_ba \gt d$).\n\nHowever, using a technique similar to the one used by the **Karatsuba** multiplication algorithm ([Analyzing Fibonacci \u0026 Karatsuba Multiplication](/blog/analyzing-fibonacci-karatsuba-multiplication)), we can bring down the number of products to just 7.\n\n**Note:** This observation is _not_ trivial and does _not_ have a simple construction. But for the sake of documentation, it is shown below.\n\n## Strassen's Matrix Multiplication\n\nThe algorithm is as follows. Given\n\n$$ X = \begin{bmatrix} A \u0026 B \\ C \u0026 D \end{bmatrix}, Y = \begin{bmatrix} E \u0026 F\\ G \u0026 H \end{bmatrix} $$\n\nCompute the following terms,\n\n$$ P_1 = A(F-H) \quad P_5 = (A+D)(E+H) \\ P_2 = (A+B)H \quad P_6 = (B-D)(G+H)\\ P_3 = (C+D)E \quad P_7 = (A-C)(E+F) \\ P_4 = D(G-E) \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \ \ $$\n\nNotice that computing all 7 of these terms needs only 7 multiplication operations in total. Now, once computed, we can write the expression $XY$ as follows:\n\n$$ XY = \begin{bmatrix} P_5+P_4-P_2+P_6 \u0026 P_1+P_2 \\ P_3+P_4 \u0026 P_1+P_5-P_3-P_7 \end{bmatrix} $$\n\nAgain as mentioned above, this construction is **not** intuitive or easy to come up with. With some working out on pen and paper, it can be seen that the above construction does indeed yield us the correct result. While it is more complicated, notice that we now only have to perform **7** multiplication operations. The work done at each node is the $O(n^2)$ additions. This lets us write, $T(n) = 7T(n/2)+O(n^2)$. Applying Master's theorem to this result, we find $log_ba \gt d$, which implies that the time complexity of Strassen's Matrix multiplication is $O(n^{log_ba}) = O(n^{log_27}) \approx O(n^{2.81})$
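\n\nIf it helps to see the construction as code, here's a direct transcription into C++ (a sketch of mine, for square matrices whose size is a power of two; practical implementations recurse only down to some cutoff size and use the naive kernel below it):\n\n```cpp\n#include \u003cvector\u003e\nusing Mat = std::vector\u003cstd::vector\u003clong long\u003e\u003e;\n\n// R = X + sign * Y, element-wise.\nMat add(const Mat \u0026X, const Mat \u0026Y, long long sign = 1) {\n\tint n = X.size();\n\tMat R(n, std::vector\u003clong long\u003e(n));\n\tfor (int i = 0; i \u003c n; i++)\n\t\tfor (int j = 0; j \u003c n; j++) R[i][j] = X[i][j] + sign * Y[i][j];\n\treturn R;\n}\n\nMat strassen(const Mat \u0026X, const Mat \u0026Y) {\n\tint n = X.size();\n\tif (n == 1) return {{X[0][0] * Y[0][0]}};\n\tint h = n / 2;\n\t// Extract the h x h submatrix whose top-left corner is (r, c).\n\tauto sub = [h](const Mat \u0026M, int r, int c) {\n\t\tMat S(h, std::vector\u003clong long\u003e(h));\n\t\tfor (int i = 0; i \u003c h; i++)\n\t\t\tfor (int j = 0; j \u003c h; j++) S[i][j] = M[r + i][c + j];\n\t\treturn S;\n\t};\n\tMat A = sub(X, 0, 0), B = sub(X, 0, h), C = sub(X, h, 0), D = sub(X, h, h);\n\tMat E = sub(Y, 0, 0), F = sub(Y, 0, h), G = sub(Y, h, 0), H = sub(Y, h, h);\n\t// The 7 products from the construction above.\n\tMat P1 = strassen(A, add(F, H, -1)), P2 = strassen(add(A, B), H);\n\tMat P3 = strassen(add(C, D), E), P4 = strassen(D, add(G, E, -1));\n\tMat P5 = strassen(add(A, D), add(E, H));\n\tMat P6 = strassen(add(B, D, -1), add(G, H));\n\tMat P7 = strassen(add(A, C, -1), add(E, F));\n\t// Assemble XY quadrant by quadrant, as given above.\n\tMat TL = add(add(P5, P4), add(P6, P2, -1)); // P5 + P4 - P2 + P6\n\tMat TR = add(P1, P2), BL = add(P3, P4);\n\tMat BR = add(add(P1, P5), add(P3, P7), -1); // P1 + P5 - P3 - P7\n\tMat Z(n, std::vector\u003clong long\u003e(n));\n\tfor (int i = 0; i \u003c h; i++)\n\t\tfor (int j = 0; j \u003c h; j++) {\n\t\t\tZ[i][j] = TL[i][j];\n\t\t\tZ[i][j + h] = TR[i][j];\n\t\t\tZ[i + h][j] = BL[i][j];\n\t\t\tZ[i + h][j + h] = BR[i][j];\n\t\t}\n\treturn Z;\n}\n```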
\n# Finding Median in $O(n)$\n\nThe problem of Median finding is as follows. It simply asks, given a list of numbers $S$, find the $\lfloor\frac{n}{2}\rfloor^{th}$ smallest element in the list. In fact, we can generalize this problem to asking, _\"Given a list of numbers $S$, find the $k^{th}$ smallest element in the list.\"_\n\nThe naïve solution to this problem is simply sorting the list and then just picking the $k^{th}$ smallest element. The correctness of this approach is fairly easy to prove. In a sorted list, the condition $a_i \leq a_{i+1}$ holds $\forall a_{i\u003cn}\in S$. Hence picking the $k^{th}$ element of the sorted list is equivalent to picking the $k^{th}$ smallest element in S. However, the time complexity of this algorithm is $O(nlogn)$. The question is, _can we do better?_\n\nLet's try to apply the same concepts of divide and conquer that netted us promising results in our previous endeavors. Notice that for this particular problem, since our list is **unordered**, there is nothing to gain by simply splitting the list into $\frac{n}{2}$ halves and solving recursively. The notion of \"$k$\" does not exist in these halves as we do not have any information about them. Instead, let us consider the division that happens when we pick some arbitrary element $a\in S$.\n\nMore formally, if we have some **unordered** list S, and we pick some element $a \in S$, we can divide the set $S$ into **three** parts.\n\n$$ S \begin{cases} S_L = \{ x\in S \mid x \lt a \} \\ S_a = \{ x\in S \mid x = a \} \\ S_R = \{ x\in S \mid x \gt a \} \end{cases} $$\n\nOnce such a division is made, we can recursively call our $k^{th}$ order statistic finding algorithm as follows. Let's call our algorithm `selection(S, k)` where $S$ is the input list and $k$ is the $k^{th}$ order statistic we wish to find. Then,\n\n$selection(S, k) = \begin{cases} selection(S_L, k) \u0026 \text{if} \ k \leq |S_L| \\ a \u0026 \text{if} \ |S_L| \lt k \leq |S_L|+|S_a| \\ selection(S_R, k-|S_L|-|S_a|) \u0026 \text{if} \ k \gt |S_L|+|S_a| \end{cases}$\n\nLet's try to parse this recursion.\n\n1. If $k$ is smaller than or equal to $|S_L|$, it means that our _\"pivot\"_ $a$ was pictorially too far to the right in the sorted list $S$. This essentially means that we _overshot_ our guess for the $k^{th}$ order statistic. Hence we discard every element to the right (pictorially) and recurse on the left part of the list.\n2. If $|S_L|$ is less than $k$ and $k$ is smaller than or equal to the range encompassed by $S_L$ and $S_a$ together, then it must be true that our $k^{th}$ order statistic **is** our pivot. Visually, $k$ lies in the range of elements equal to our pivot in the sorted list.\n3. If $k$ is greater than the range described in #2, then we _undershot_ our guess for the $k^{th}$ order statistic. This implies that we can discard the left portion of the list and recurse on the right portion. However, we will need to adjust the value of $k$ that we pass in, as in the sketch below.
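\n\nHere is that recursion transcribed into C++ (a sketch of mine; $k$ is 1-indexed, and the pivot is naively chosen as the first element, which is exactly the choice the rest of this post is about improving):\n\n```cpp\n#include \u003cvector\u003e\n\nint selection(std::vector\u003cint\u003e S, int k) {\n\tint a = S[0]; // pivot: the whole game is choosing this well (see below)\n\tstd::vector\u003cint\u003e SL, SR;\n\tint equal = 0;\n\tfor (int x : S) {\n\t\tif (x \u003c a) SL.push_back(x);\n\t\telse if (x \u003e a) SR.push_back(x);\n\t\telse equal++;\n\t}\n\tif (k \u003c= (int)SL.size()) return selection(SL, k); // case 1: overshot\n\tif (k \u003c= (int)SL.size() + equal) return a; // case 2: pivot is the answer\n\treturn selection(SR, k - (int)SL.size() - equal); // case 3: undershot, shift k\n}\n```\n\nSwapping the first line for a better pivot choice is what the rest of this discussion is about.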
\n\nNotice that in the recursion, we have **no guarantee** about the size of $S_L$ or $S_R$. The approach of divide and conquer has allowed us to shrink the number of elements we're worrying about at some step $i$ from $S$ to $max\{|S_L|, |S_R|\}$. However, we have no bound on how large that can be.\n\nIf we analyze the worst case of this algorithm, notice that the worst case occurs when we pick our pivots in sorted order. In this case, the complexity looks something like $T(n) = T(n-1) + O(n)$. This gives us $T(n) = O(n)+O(n-1)+\dots+O(2)+O(1)$, which essentially gives us an $O(n^2)$ algorithm, which is **worse** than the sort and pick approach.\n\n_However_, if we analyze the \"best\" case for this algorithm, notice that if we pick the **median** as the pivot at each step, we can split so that $|S_L|=|S_R|=\frac{n}{2}$. This gives us $T(n) = T(\frac{n}{2}) + O(n)$. Evaluating this, we get $T(n) = O(n)+O(\frac{n}{2})+\dots+O(1)$. This would give us a time complexity of $O(2n) = O(n)$. So in our best case, our algorithm **outperforms** the sort and pick approach.\n\n\u003eNotice the similarity between this algorithm and quicksort. Both have a great best-case time complexity and a very poor worst-case time complexity. They are also very similar in the fact that both their running times are severely affected by the choice of the pivot.\n\u003e\n\u003eThis should clue us into the fact that perhaps trying a randomized approach would give us a desirable result. And we will soon see that a randomized approach does indeed give us a linear **expected** time complexity. However, it is also possible to provide a deterministic approach that can yield us the linear time complexity we desire. This approach is called the _Median of medians_ and we shall discuss this below. Moreover, notice that since this algorithm is _**linear**,_ it can be used as a **subroutine** in the quicksort algorithm to deterministically pick the median as the pivot. This would give us a _theoretically_ very fast quick sort as it can be proved to execute in $O(nlogn)$ for any input. However, practically speaking, the constant factor incurred from running the linear median finding algorithm at every step of quicksort makes it slower in real executions, which makes the randomized quicksort preferable. However, it is pretty cool to note that we can prove quicksort to have an upper bound of just $O(nlogn)$.\n\nNow, coming back to the original problem. If we can pick the median as our pivot, we would get good running time. However, this seems counter-intuitive: our algorithm essentially _needs_ the answer (the median) to perform its computation. This isn't possible. But perhaps it's possible to put some bound on the size of $S_L$ and $S_R$. Notice that, if at every step of the division, we can guarantee $max\{|S_L|, |S_R|\}$ to be some **ratio** of $|S|$, we can guarantee linear running time of our algorithm. Evaluating $T(n) = T(\frac{n}{r})+O(n)$ for some ratio $r \gt 1$, we get\n\n$$ T(n) = O(n)+O(\frac{n}{r})+O(\frac{n}{r^2})+\dots+O(1) = O(n(1 + \frac{1}{r}+\frac{1}{r^2}+\dots)) = O(n \cdot \frac{1}{1-\frac{1}{r}}) = O(cn) = O(n) $$\n\nThis observation has simplified our problem a little and paved the path for the success of the _\"Median-of-Medians\"_ approach.\n\n## Median-of-Medians\n\nThe idea behind the algorithm is as follows. Given some input list $S$ with $n$ elements, perform the following operations recursively.\n\n1. Divide the $n$ elements into groups of 5.\n2. Find the median of each of the $\frac{n}{5}$ groups.\n3. Find the median $x$ of the $\frac{n}{5}$ medians.\n\nNotice that the time complexity of this is pretty similar to the \"linear\" running time proof of finding $k^{th}$ order statistics when we are able to divide the input into some ratio at every step. We get $T(n) = 5T(\frac{n}{5})+O(1)$. Here we consider median finding among 5 elements, a constant time operation. We can solve this recurrence using the Master's theorem. Notice that $log_ba\u003ed$, which implies the time complexity is $O(n^{log_ba}) = O(n)$.\n\nThis means that we now have a **linear** time algorithm, which can obtain the _Median of Medians_ for some input $n$. Now the question is, how does this help us split $S_L$ and $S_R$ in such a way that we can bound them to some ratio of the original input $n$?\n\nThey say a picture is worth a thousand words, and I think you will find the below image quite insightful for the explanation of this proof.\n\n1. 
Let us divide our set $S$ into $\frac{n}{5}$ lists of 5 elements each and call them $\sigma_1, \sigma_2, \dots, \sigma_{\frac{n}{5}}$.\n2. For visualization's sake, let us picture each of these lists in sorted order vertically. For example, in $\sigma_1$, $a_1\leq a_2 \leq a_3 \leq a_4 \leq a_5$ holds. Notice that this implies that in every list, the third element must be the median.\n3. Now, let us sort the lists themselves horizontally by their median value. That is, in the picture below, $a_3\leq b_3 \leq c_3 \leq d_3 \leq e_3$ is true. Notice that this implies that in the below picture, the 3rd element in list $\sigma_{\frac{n}{10}}$ is the median-of-medians.\n\nNow that our elements are ordered both vertically and horizontally, let us try to place bounds on the division that picking the median of medians grants us.\n\n![median-of-medians-1](/images/median-of-medians-1.webp)\n\nNotice that in the above picture, because $c_3$ is the median of medians, it **must** be greater than $a_3$ and $b_3$. More formally, $x_3\in \sigma_{\frac{n}{10}} \geq x_3\in \sigma_{i\u003c\frac{n}{10}}$. Further, because $x_3\in\sigma_i$ is greater than or equal to all $x_1, x_2 \in \sigma_i$, our median-of-medians is greater than or equal to every $x_{j\leq3}\in\sigma_{i\leq\frac{n}{10}}$. Or to put it more simply, it must be greater than or equal to everything in this picture that is painted pink.\n\n![median-of-medians-2](/images/median-of-medians-2.webp)\n\nA similar statement can be made about everything it is less than or equal to. Everything painted blue in the above diagram must be greater than or equal to our median-of-medians.\n\nIf that makes sense, let's try to formalize and state our argument more _quantitatively_ now. Once we have chosen our pivot as the median-of-medians, the set of all elements less than or equal to the pivot is essentially just $S_L$. So... how do we enumerate $|S_L|$ or $|S_R|$?\n\nNotice that there are $\lceil \frac{n}{5} \rceil$ lists in total. Out of these, we can enumerate one-half of the lists (including the list containing the median-of-medians) as $M = \lceil \frac{1}{2} \lceil \frac{n}{5} \rceil \rceil$. Each of these $M$ lists contains 3 elements that are less than or equal to the pivot. This gives us $|S_L| \geq 3M.$ We can obviously remove the pivot itself while recursing and this would give us $|S_L| \geq 3M-1$. Similarly, for $|S_R|$, we might have to remove the last set if $n$ was not perfectly divisible by 5. This would give us the same bound $\pm c$. Since $c$ is pretty small we'll just choose to ignore it in our calculations.\n\nThis gives us\n\n$$ |S_L| \geq \frac{3n}{10}, \quad |S_R| \geq \frac{3n}{10} \\ \implies |S_L|\leq n-|S_R|, \quad |S_R|\leq n-|S_L| \\ \implies |S_L| \leq \frac{7n}{10}, \quad |S_R| \leq \frac{7n}{10} $$\n\n### Conclusion\nWe came up with an algorithm to find the median-of-medians in **linear time.** And we have managed to prove that picking the median-of-medians as the pivot lets us divide the original set into $S_L$ and $S_R$ such that their sizes are always bounded above by a constant ratio ($\frac{7}{10}$) of the input $n$. 
These 2 facts combined give us the linear time $k^{th}$ order statistics finding algorithm.\n\nTo state this more formally,\n\n- We can find the median of medians for some input $n$ in linear time.\n- Using the median-of-medians as pivot, we guarantee a division of $S$ into sets such that the next step of our $selection(S, k)$ algorithm will receive as input $S'$ which can be expressed as a constant fraction of the input $n$: $|S'| \leq \frac{7}{10}n$\n- This implies that the total runtime of our algorithm is **linear**\n# References\nThese notes are old and I did not rigorously hoard references back then. If some part of this content is yours or you know where it's from, then do reach out to me and I'll update it. \n1. Professor [Kannan Srinathan's](https://www.iiit.ac.in/people/faculty/srinathan/) course on Algorithm Analysis \u0026 Design in IIIT-H", - "lastmodified": "2026-01-11T05:06:14.762895311+05:30", + "lastmodified": "2026-01-11T10:59:27.238924097+05:30", "tags": [ "domain-cs-algorithms-analysis" ] }, @@ -273,15 +273,15 @@ "/blog/mega-project-kblas-writing-a-benchmark-library-in-c-optimizing-l1-l2-basic-linear-algebra-subprograms": { "title": "Mega-Project - kBLAS (Writing a Benchmark Library in C \u0026 Optimizing L1, L2 Basic Linear Algebra Subprograms)", "content": "# Preface\n\nI love optimizing small handy-dandy C/C++ programs in general and spent quite a bit of time working on optimizing a simple C hash table previously, but that’s about as far as I’ve gotten in this field. This will be my first time trying my hand at optimizing a fairly large Math library. Plus, it’s time to put all that stuff I've written about HPC thus far ([Brent's Theorem \u0026 Task Level Parallelism](/blog/brent-s-theorem-task-level-parallelism), [Vectorization \u0026 Analyzing Loop Dependencies](/blog/vectorization-analyzing-loop-dependencies), [Parallelism with OMP](/blog/parallelism-with-omp), [Profiling a Program](/blog/profiling-a-program), [Theoretical Metrics for Benchmarking](/blog/theoretical-metrics-for-benchmarking), etc.) to some good use. Preliminary Googling tells me there’s quite a few libraries which have implemented BLAS operations to varying levels of success (Blaze, Intel MKL, BLIS, Armadillo, etc.), so it’ll be fun to see just how far I can get against some of these libraries. If time permits I’ll hopefully be able to benchmark my code against them. Note that I'm doing this primarily for educational purposes and I will have an unfair advantage since I have the opportunity to dedicate a lot of time to fine-tuning operations specifically for my test hardware (personal laptop). \n\n# Getting Started - Installing BLIS \u0026 Tools\n\nBefore we can get our hands dirty writing code and profiling programs, I’ll need to have the library set up. I also want to get the damned Intel C/C++ compiler installed. The only issue is I’m currently running Manjaro Linux and Intel does **not** support this distribution officially, which makes installing Intel oneAPI tools and VTune much harder than it needs to be. Being in college, there’s going to be some days when I’ve got nothing to do and can afford to spend a lot of time working on this project, but today isn’t one of these days.\n\nMarch 25, 2022 - The current goal is to just get the BLIS library up and compiled on my system and also get the Intel C/C++ compiler working by tonight. I want to be able to run a `saxpy` program which basically just computes $S = \alpha X + Y$ (where $S, X \text{ and } Y$ are vectors and $\alpha$ is a scalar) without any optimizations. 
Just a simple `for` loop program compiled and running so I can make sure my setup works.\n\n## Installing Intel oneAPI and HPC Tool-kits\n\nOkay, admittedly I’ve gone through this pain before while installing it on my VM, but that doesn’t make it any better. The AUR package [`intel-compiler-base`](https://aur.archlinux.org/packages/intel-compiler-base) does not work and asks for some license. I’ll have to install it from [Intel’s page](https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit-download.html), but Intel’s installer does not recognize half the required packages on my system. Hopefully it’ll still work.\n\nLol. To save anyone else in the same situation, here’s what you’ll want to do. Use the _offline_ installer instead of the online one. And install `libcryptx-compat` on your system before launching the installer. Then install the [oneAPI HPC Toolkit](https://www.intel.com/content/www/us/en/developer/tools/oneapi/hpc-toolkit-download.html) next. Once everything is installed, `cd` to the installation folder. Default should be `/opt/intel/oneapi/`. Here you will find `setvars.sh`. Sourcing `setvars` will allow you to use the `icc` command for that shell session. You can add a line in `.zshrc` to source this file every time you enter a session: `source /opt/intel/oneapi/setvars.sh \u003e /dev/null`. There’s a noticeable slowdown when I launch a shell session, so I’ll probably find a workaround for this soon.\n\nIt looks like I’ve got `icc` working. I compiled a few files with different flags and everything seems to work as expected.\n\n## Installing the BLIS library\n\nInstalling and setting up BLIS was relatively easy. It was interesting reading through their [build system doc](https://github.com/flame/blis/blob/master/docs/BuildSystem.md) and I’d highly recommend reading their [doc on multi-threading](https://github.com/flame/blis/blob/master/docs/Multithreading.md) as well. I’ve installed BLIS using the **zen3** config for my system, with multi-threading enabled using `openMP`. There’s a section on why to use `openMP` vs `pthreads` (tl;dr BLIS not providing support for setting thread affinity via `pthreads`). The entire section on thread affinity is pretty interesting to read though.\n\nA little more effort was required to set up the dev environment I wanted. I recompiled BLIS using `--enable-cblas` to get CBLAS support, and we’ll have to add `/usr/lib/` to `LD_LIBRARY_PATH`, so add that to `~/.zshrc` as well. Then I just set up a simple Makefile and now I can just `#include\u003ccblas.h\u003e` and things will work as expected. Remember to link `-lpthread` as BLIS requires this. And that’s about it, my simple SAXPY program works. I’ve got the `cblas` and `blis` libraries set up to benchmark against as well. That’ll be it for tonight. The plan is to get some roof-line analysis done tomorrow and get to know my hardware better so I know what all I have at my disposal and how much performance I can reasonably expect to squeeze out of this machine.
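\n\nFor reference, the sanity-check is essentially just this (a sketch of mine, not the exact file; adjust the link line to however you built BLIS):\n\n```cpp\n// Computes y = alpha*x + y through the CBLAS interface.\n// Illustrative build: icc saxpy.cpp -lblis -lpthread -qopenmp\n#include \u003ccblas.h\u003e\n#include \u003ccstdio\u003e\n\nint main() {\n\tconst int n = 8;\n\tconst float alpha = 2.0f;\n\tfloat x[8], y[8];\n\tfor (int i = 0; i \u003c n; i++) { x[i] = (float)i; y[i] = 1.0f; }\n\tcblas_saxpy(n, alpha, x, 1, y, 1); // unit strides for both vectors\n\tfor (int i = 0; i \u003c n; i++) printf(\"%.1f \", y[i]); // expect 1 3 5 7 ...\n\tprintf(\"\\n\");\n\treturn 0;\n}\n```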
\n\n### Some other tools\n\nAlso installed `perf`, `kcachegrind` and `gptrace`. `gprof` and `valgrind` are already installed. However, as mentioned in the doc about profilers, I’m more interested in trying to get stack samples during program execution. I believe `gdb` and `gptrace` should help me out here.\n\n# Roof-line analysis and inspecting hardware\n\nMarch 26, 2022 - I more or less just plan on getting to know my hardware well so I can try to exploit as many of the hardware features I have at my disposal as possible, and set up a bench-marking environment. If I make any changes to some program in the name of optimization I want to be able to see the effect it has. Further, I should probably get roof-line analysis done so I know what theoretical peak I can hope to achieve.\n\n## Inspecting hardware\n\nI’m not dealing with any specialized hardware here, so consequently there won’t be (much) inspection to do either. Running `htop` and `cat /proc/cpuinfo` should provide plenty of information. The official AMD website + WikiChip should be enough to provide all the spec information. [Official AMD Website](https://www.amd.com/en/products/apu/amd-ryzen-7-5800h). [Wikichip Website.](https://en.wikichip.org/wiki/amd/ryzen_7/5800h)\n\nNote: Got the memory bandwidth from a third-party source ([https://nanoreview.net/en/cpu/amd-ryzen-7-5800h](https://nanoreview.net/en/cpu/amd-ryzen-7-5800h))\n\n|CPU Details||\n|---|---|\n|CPU|Ryzen 7 5800H|\n|Cores|8|\n|Threads|16|\n|Base Clock|3.2GHz|\n|Max. Boost Clock|4.4GHz|\n|Memory bandwidth|69.27 GB/s|\n\n|Cache Details|(64 bit alignment)||||\n|---|---|---|---|---|\n|L1|512KB|8-way set associative|-|Per-core|\n|L1I|256KB 8x32KB|8-way set associative|-|Per-core|\n|L1D|256KB 8x32KB|8-way set associative|Write-back|Per-core|\n|L2|4MB 8x512KB|8-way set associative|Write-back|Per-core|\n|L3|16MB 1x16MB|16-way set associative|Write-back|Shared|\n|TLB Size|2560 4K pages|-|-|-|\n\n### Supported x86 extensions\n\nMMX Extension, Extended MMX Extension, Streaming SIMD Extensions (SSE), SSE2, ..., AVX, AVX2, FMA3, ABM, BMI1, BMI2.\n\nThe important ones here are **AVX2** and **FMA3.** We can use intrinsics up to AVX2, which means we have access to $256$ bit vector registers, and FMA3 support means we can perform fused multiply-add operations in 1 instruction. There seems to be unofficial support for FMA4 as well according to Agner Fog, but it can reportedly give incorrect results. The only real difference here seems to be that FMA4 allows writing to any free register while FMA3 needs to overwrite an operand register. But this will rarely give FMA4 a lead as FMA3 lets us pick just about any vector register as operand. Hence we should be fine just sticking to the officially supported FMA3 instruction set.\n\n### RAM\n\nMy RAM is DDR4 and I have 16GB of dual channel memory.\n\n### Disk\n\nI have one $1TB$ SSD and the operating system I use (Linux) is allocated $512GB$ of disk space.\n\nMarch 28, 2022 - I wasn’t really able to get much of Saturday’s goals done, so I guess I’ll just be starting from there. Goal is to run some benchmark software to see theoretical GFLOPS/sec vs practical, and the same for bandwidth. After that, if time permits, I’ll try to set up a nice environment for bench-marking C programs.\n\n# System Benchmarks\n\n## Computing max GFLOPS/sec\n\nLet’s start by computing the theoretical GFLOPS/sec that my machine should be able to achieve. 
As a refresher, we compute max achievable GFLOPS/sec as\n\n$$ GFLOPS/sec = \text{Number of cores} \times \frac{Cycles}{Second} \times \frac{FLOPs}{Cycle} = 8 \times 3.2GHz \times 32 = 819.2 \ GFLOPS/sec $$\n\nIf we go through the [WikiChip](https://en.wikichip.org/wiki/flops) page for our architecture, Zen3 is capable of doing $16 \ FLOPs/Cycle$ for double precision numbers and $32 \ FLOPs/Cycle$ for single precision numbers. However, there is also a section about extension instruction sets like **FMA** which might allow us to get _even more_ GFLOPS/sec.\n\n32 seems to be the peak FLOPs/cycle we can hit if we allow all the fancy floating point SIMD vectorizations. However, as mentioned before, theoretical bounds like this are very _very_ inaccurate and do not measure realistic performance. There are several reasons for this: not all floating point operations take 1 cycle (division, for example, is notoriously slow), sustaining full SIMD vectorization is hard, and of course, there's the memory wall.\n\n### Whetstone benchmark\n\nThere’s a bunch of programs out there written to stress-test your machine and record the max `insert-quantity-you're-measuring` that your machine is capable of. One such tool is the [Whetstone Benchmark](https://en.wikipedia.org/wiki/Whetstone_%28benchmark%29). I found code for it on [Netlib](https://www.netlib.org/benchmark/whetstone.c). Although, looking at the code and results it produces, it looks like Netlib’s version measures MIPS, not FLOPS.\n\nThe peak MIPS I was able to achieve on my system was $125000$. This is however in no way related to FLOPS/sec. MIPS → Millions of (mixed) instructions per second. The benchmark relies **heavily** on the instruction pool, and results will vary a lot. The Netlib code attempts to provide a decent distribution of all types of instructions but again, this is a very unreliable measure.\n\nMarch 29, 2022 - I spent most of yesterday looking for benchmarking tools. Turns out there are very few available resources on the internet for measuring stuff like GFLOPS/sec because, apart from CPU vendors, practically everyone considers it to be a bad benchmark. Nevertheless, I came across `phoronix-test-suite` which is a pretty cool tool... when it works. Most of the tests I tried to run had dependency or build issues and debugging each one was painful. However, some of the recommended ones did work. The `stream` benchmark gave pretty poor results, so I decided to run it locally and figure out what’s wrong. I’ll probably spend most of today benchmarking bandwidth and GFLOPS/sec and see what’s the best I can do.\n\n## [Stream - Memory bandwidth benchmark](https://www.cs.virginia.edu/stream/)\n\nThese were the highest scores I was able to hit for each individual test while varying several compiler flags and pragmas. To achieve these results I modified the code to set the maximum number of threads OpenMP spawns to 4 (from the default 16). Why? I’m not sure, but I’m guessing on `O3` the vectorized instructions generated are insanely fast and the overhead from creating threads ends up slowing it down. 4 threads seemed to be the sweet-spot for gaining performance. 
For `add` and `triad` we got around 37GB/s each using 4 threads. But a single thread was able to crunch through the FMA (triad) kernel 2GB/s faster.\n\n**Update (March 31, 2022) -** I was wrong about this. Running the benchmark on two threads produced the fastest scores across the board. I believe the reason for this is the fact that I have only dual channel memory, and memory can only reach my CPU across two channels. If I have more than 2 threads trying to pull in memory they’re probably going to end up competing for cache and channel bandwidth and hence slow down performance overall. Updated table as well.\n\n|Operation|Highest MB/s|Compiler|Compiler Flags|\n|---|---|---|---|\n|Copy (a[i] = b[i])|39685.0|icc|-O3 -xCORE-AVX2 -qopenmp|\n|Scale (a[i] = scalar*b[i])|39513.6|icc|-O3 -xCORE-AVX2 -qopenmp|\n|Add (a[i] = b[i]+c[i])|40520.8|icc|-O3 -xCORE-AVX2|\n|Triad (a[i] = b[i]+scalar*c[i])|40706.4|icc|-O3 -xCORE-AVX2|\n\nConcluding, I can say the 5800H on my machine has a bandwidth of $\approx \ 40GB/sec$ according to the stream benchmark.\n\n## [Mystical / Flops](https://github.com/Mysticial/Flops)\n\nFrom the `README` file of that project,\n\n\u003e Modern x86 and x64 processors can theoretically reach a performance on the order of 10s - 100s of GFLOPs. However, this can only be achieved through the use of SIMD and very careful programming. Therefore very few (even numerical) programs can achieve even a small fraction of the theoretical compute power of a modern processor. This project shows how to achieve \u003e95% of that theoretical performance on some of the current processors of 2010 - 2014.\n\nThe theoretical peak we computed was $819.2 \ GFLOPS/sec$, let’s see how much of that theoretical bound this benchmark can hit.\n\n### Single-thread performance\n\n### 128-bit AVX+FMA\n\n|Datatype|Operations|GFLOPS|\n|---|---|---|\n|Single-Precision|Add / Sub|30.784|\n|Double-Precision|Add / Sub|15.408|\n|Single-Precision|Multiply|34.752|\n|Double-Precision|Multiply|17.376|\n|Single-Precision|Multiply + Add|46.8|\n|Double-Precision|Multiply + Add|23.424|\n|Single-Precision|FMA|65.856|\n|Double-Precision|FMA|33.024|\n\n### 256-bit AVX+FMA\n\n|Datatype|Operations|GFLOPS|\n|---|---|---|\n|Single-Precision|Add / Sub|61.632|\n|Double-Precision|Add / Sub|30.752|\n|Single-Precision|Multiply|69.408|\n|Double-Precision|Multiply|34.656|\n|Single-Precision|Multiply + Add|92.64|\n|Double-Precision|Multiply + Add|45.792|\n|Single-Precision|FMA|131.136|\n|Double-Precision|FMA|65.664|\n\nOur theoretical peak performance per core that we computed was $\frac{819.2}{8} = 102.4 \ GFLOPS/sec$.\n\nMost of the numbers on this sheet seem to correspond to performance under that metric, and we see almost an exact $2\times$ increase in performance when we move from 128-bit AVX to 256-bit AVX, which is exactly what we’d expect to see. However, notice that **Single-Precision 256-bit FMA** has recorded **131.136** GFLOPS/sec, which is almost **30** GFLOPS higher than we computed. Normally we’d expect to see practical numbers _lower_ than computed.\n\nInspecting further, I noticed that the single core that the benchmark ran on turbo-boosted to and maintained a clock-speed of $4.2GHz$ throughout the benchmark. 
Further, as the [WikiChip](https://en.wikichip.org/wiki/flops) page for flops mentions clearly,\n\n\u003e For example, a typical [fused multiply-accumulate](https://en.wikichip.org/w/index.php?title=fused_multiply-accumulate\u0026action=edit\u0026redlink=1) (FMAC) operation can perform two floating-point operations at once. For a single core, this can be expressed as $FLOPS_{core} = \\frac{instructions}{cycle} \\times \\frac{operations}{instruction} \\times \\frac{FLOPs}{operation} \\times \\frac{cycles}{second}$\n\nComputing this value is hard without knowing the exact architecture-level specifics of our processor. But even ignoring this and using the updated clock-speed of our processor, we get a theoretical bound of $4.2 \\times 32 = 134.4 \\ GFLOPS/sec$ which is in line with our benchmark results.\n\n### Multi-thread performance (8 cores/16-threads)\n\n### 128-bit AVX+FMA\n\n|Datatype|Operations|GFLOPS|\n|---|---|---|\n|Single-Precision|Add / Sub|252.576|\n|Double-Precision|Add / Sub|123.776|\n|Single-Precision|Multiply|243.792|\n|Double-Precision|Multiply|120.912|\n|Single-Precision|Multiply + Add|420.96|\n|Double-Precision|Multiply + Add|217.344|\n|Single-Precision|FMA|473.664|\n|Double-Precision|FMA|236.832|\n\n### 256-bit AVX+FMA\n\n|Datatype|Operations|GFLOPS|\n|---|---|---|\n|Single-Precision|Add / Sub|470.976|\n|Double-Precision|Add / Sub|232.224|\n|Single-Precision|Multiply|450.048|\n|Double-Precision|Multiply|223.968|\n|Single-Precision|Multiply + Add|787.968|\n|Double-Precision|Multiply + Add|394.608|\n|Single-Precision|FMA|883.392|\n|Double-Precision|FMA|439.296|\n\nThis time there isn’t an (almost) exact $8\\times$ increase in performance but the numbers are very close to it. The small discrepancies we see can mostly be attributed to the overhead of creating more threads and memory sharing across multiple cores. Similarly, we can again see **Single-Precision 256-bit FMA** beat the estimated theoretical peak. Monitoring CPU utilization again, most of my cores were running at 3.5GHz instead, and 2 were running at 3.8GHz.\n\nAccounting for this, we get a theoretical bound of $(6 \\times 3.5 + 2 \\times 3.8) \\times 32 = 915.2 \\ GFLOPS/sec$ which is again in line with our benchmark results. Although this isn’t an accurate bound when dealing with FMA instructions.\n\n# Writing my own benchmarks\n\nMarch 30, 2022 - One common pattern I noticed in both benchmarking programs is the use of `AVX2` and `omp` to parallelize the load over all 16 threads. If I want to post scores similar to or even better than the programs I found on GitHub, utilizing `AVX2` efficiently is critical. For parallelizing over multiple threads I’m still not sure what the best course of action is. There are multiple libraries such as [OpenMP](https://www.openmp.org/) and [pthreads](https://man7.org/linux/man-pages/man7/pthreads.7.html) which provide an API for parallelization over multiple threads. OpenMP seems to be the more optimized, easy-to-use and portable of the two options, allowing me to parallelize for loops using just a single pragma, `#pragma omp parallel for`. However, if time permits, I think I’ll try out both libraries and see how each of them performs. Otherwise, I’ll just stick to OpenMP.\n\nA few things to keep in mind:\n\n1. The computation must take in some pre-defined input (if I want to check accuracy) or must include non-determinism, or there’s a high chance the compiler just folds the whole computation away at compile time.\n2. 
The computation must _do_ something with the result, maybe print output or something, so that the compiler doesn’t just skip all the computation.\n\nI’ll have to be careful about this or risk inaccurate benchmarks.\n\n## Writing a benchmark library\n\nThis is also a good opportunity to write a basic benchmarking library. It’ll be very useful to have a re-usable and trustworthy benchmarking library to rely on when optimizing BLIS.\n\nMarch 31, 2022 - I spent most of my time yesterday and today working on and testing the benchmark library. There were many ways to write the library, and it gave results that I wasn’t sure were right until I confirmed them against another benchmarking library (written in C++). Plus it’s also very easy to make it run parallel tests over multiple threads. As of now it works on OpenMP but if I have time I’ll consider adding a `pthreads` implementation as well. The code for the library can be found here:\n\n[kBLIS/benchmarks/lib](https://github.com/akcube/kBLIS/tree/main/benchmarks/lib)\n\n## Tuning the [Stream](https://www.cs.virginia.edu/stream/) benchmark\n\nApril 4, 2022 - I’m planning on writing the memory benchmarks today.\n\nThe [Stream](https://www.cs.virginia.edu/stream/) benchmark utility provides really nice functionality to write custom _tuned_ benchmarks which are run instead of the normal stream benchmarks when compiled with `-DTUNED`. This means I won’t need to use the benchmarking library or write fancy printing code. I can simply work on the tuned code and let the benchmarking tool run it. Now, coming to writing tuned code, I prefer to do the vectorization manually. Going through the load/store instructions on [Intel’s intrinsics page](https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#techs=AVX,AVX2\u0026ig_expand=6939,6949,6946\u0026cats=Store), there are both streaming and non-streaming instructions available, both of which post the same latency/throughput numbers. The primary difference is that the non-streaming versions evict data already present in cache to make room for the newly loaded values. In all the operations, I never care about a previously loaded value after the initial load. So I’m thinking streaming instructions will help here. Also, quoting a comment from Stack Overflow:\n\n\u003e Based on last comment from an old topic on that matter: [software.intel.com/en-us/forums/intel-isa-extensions/topic/…](https://software.intel.com/en-us/forums/intel-isa-extensions/topic/597075) Assuming CPU optimizes such a read at all the data is loaded into a cache line-size buffer, which is separate from the data cache. There is not many of these buffers available so best when the content of the buffer is read in subsequent instructions otherwise it might be lost and needed to be reacquired with potentially big performance penalty. And the copying itself will not be faster than with cached memory - but will not evict entire working set from cache.\n\nIf this is true, then it shouldn’t affect me pre-fetching the next addresses as well.
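\n\nTo make the distinction concrete, here’s a minimal copy kernel written both ways (my own illustration, not the actual tuned stream code; assumes 32-byte aligned arrays and `n` divisible by 8):\n\n```c\n#include \u003cimmintrin.h\u003e\n\n// Regular stores: dst lines get allocated in cache, evicting whatever\n// was there before.\nvoid copy_cached(float *dst, const float *src, long n) {\n\tfor (long i = 0; i \u003c n; i += 8) {\n\t\t__m256 v = _mm256_load_ps(src + i);\n\t\t_mm256_store_ps(dst + i, v);\n\t}\n}\n\n// Non-temporal (streaming) stores: write around the cache entirely.\nvoid copy_streamed(float *dst, const float *src, long n) {\n\tfor (long i = 0; i \u003c n; i += 8) {\n\t\t__m256 v = _mm256_load_ps(src + i);\n\t\t_mm256_stream_ps(dst + i, v);\n\t}\n\t_mm_sfence(); // order the non-temporal stores before returning\n}\n```\n\n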
Further, since I have 16 vector registers available and independent loads / stores happening, I’ll attempt to unroll the loops as well.\n\n**Update:** Yes, streaming the writes back instead of storing gave a $2 \\times$ increase in bandwidth.\n\n## Results\n\n**GCC / Clang without `-DTUNED`**\n\n```jsx\nFunction Best Rate MB/s Avg time Min time Max time\nCopy: 36853.5 0.035270 0.034732 0.038729\nScale: 21994.3 0.059661 0.058197 0.065483\nAdd: 26976.9 0.072385 0.071172 0.076895\nTriad: 26512.4 0.073506 0.072419 0.076874\n```\n\n**GCC / Clang with `-DTUNED`**\n\n```jsx\nFunction Best Rate MB/s Avg time Min time Max time\nCopy: 41040.2 0.031695 0.031189 0.034857\nScale: 41048.0 0.031470 0.031183 0.032065\nAdd: 40111.3 0.048329 0.047867 0.049275\nTriad: 40077.0 0.048333 0.047908 0.049962\n```\n\nWe see massive improvements (more than 19 GB/s in `Scale` for example!). We also hit our highest recorded bandwidths of 41GB/s. `-DTUNED` simply tells the benchmark to run the custom tuned code I wrote instead of the default code. The Intel compiler results aren’t as interesting since ICC was already hitting $\\approx39-40GB/s$, but tuning did add roughly $\\approx +1GB/s$.\n\n- **ICC without `-DTUNED`**\n \n ```jsx\n Function Best Rate MB/s Avg time Min time Max time\n Copy: 39773.8 0.032854 0.032182 0.034661\n Scale: 39858.0 0.032708 0.032114 0.035328\n Add: 40366.6 0.047950 0.047564 0.050785\n Triad: 40395.6 0.047940 0.047530 0.048804\n ```\n \n- **ICC with `-DTUNED`**\n \n ```jsx\n Function Best Rate MB/s Avg time Min time Max time\n Copy: 41038.6 0.031655 0.031190 0.033246\n Scale: 41009.7 0.031747 0.031212 0.033515\n Add: 40725.3 0.047596 0.047145 0.049452\n Triad: 40672.7 0.047437 0.047206 0.048021\n ```\n \n\nThese were the exact flags passed to each compiler:\n\n1. GCC - `-O3 -ffast-math -mavx2 -fopenmp -DSTREAM_ARRAY_SIZE=80000000 -DNTIMES=20 -march=znver3`\n2. Clang - `-O3 -ffast-math -mavx2 -fopenmp -DSTREAM_ARRAY_SIZE=80000000 -DNTIMES=20 -march=znver3`\n3. ICC - `-O3 -xCORE-AVX2 -qopenmp -g -ffreestanding -DSTREAM_ARRAY_SIZE=80000000 -DNTIMES=20`\n\n**Note:** Two modifications were made to the original stream benchmark code. The first was limiting the number of OpenMP threads created to just 2. The second was using `__attribute__((aligned(32)))` to align all the arrays to 32-byte boundaries. This alignment was in place for the runs without `-DTUNED` as well.\n\nMy code for the tuned version of the stream benchmark can be found here:\n\n[kBLIS/benchmarks/stream/stream.c](https://github.com/akcube/kBLIS/blob/main/benchmarks/stream/stream.c)\n\nThe main optimizations were using AVX2 and FMA instructions, loop unrolling, and using streaming writes instead of regular writes.\n\n## Writing mGBPS\n\nWhile Stream is a decent memory bandwidth benchmark, I’m not sure it measures the absolute max bandwidth I can squeeze out of my system. [Wikipedia](https://en.wikipedia.org/wiki/Memory_bandwidth) defines memory bandwidth as,\n\n\u003e **Memory bandwidth** is the rate at which data can be read from or stored into a [semiconductor memory](https://en.wikipedia.org/wiki/Semiconductor_memory) by a [processor](https://en.wikipedia.org/wiki/Central_processing_unit)\n\nAll of Stream’s benchmarks involve more complicated operations such as, say, copies or adds, which require multiple load/store operations that are **dependent** on each other. The primary issue with such an operation is that the corresponding `store` operations cannot be performed until the `load` operation is done.
This dependency has the potential to slow down certain writes.\n\nApril 5, 2022 - My goal is to get the absolute maximum possible bandwidth that I can out of my system by removing the dependency in my read/write operations. The idea is to design something like this:\n\n1. Have two arrays, one meant for reading and one meant for writing.\n2. In each iteration half the registers are dedicated to reading and summing values in array `A` and the other half are writing some constant value to array `B`. Perhaps I can use just 1 read value to write to `B` to not waste any registers on storing constants.\n3. Experiment with the stream and non-stream versions of instructions.\n\nPreviously, I got nearly 1.5x - 2x speedups by using the `stream` instructions. Only `int` supports 256-bit stream reads, so I’ll use the integer datatype.\n\n**Update:** Sadly the compiler optimizes out any independent reads in the for loop as I’m not really doing anything with the read values. Hack-y tricks like using functions with `O0` enforced pragmas and inlining don’t work either. I’ll have to use inline asm.\n\nApril 6, 2022 - [This talk by Chandler Carruth](https://www.youtube.com/watch?v=nXaxk27zwlk\u0026t=2398s) is one of the best talks I’ve seen on benchmarking. He described a couple of neat tricks that work on any compiler that supports the GCC syntax for inserting inline asm. The simple asm we write allows us to insert “empty” instructions into the assembly, which essentially trick the compiler into seeing artificial uses for variables / code, rendering its dead code elimination (DCE) pass ineffective. Using this, I was able to record bandwidth up to $\\approx 45 GB/s$ with both GCC and ICC.\n\n```objectivec\n------------------------------------------------------------\nSingle-thread:\tBenchmark information - Independent R/W\n------------------------------------------------------------\nTotal runtime:\t\t\t5.000988\nResult computed:\t\t0.000000\nTotal FLOPS computed:\t\t0\nTotal memory accesses:\t\t227520000000\n\nGFLOPS:\t\t\t\t0.000000\nBandwidth:\t\t\t45.495008 GB/s\n```\n\nOne caveat with this approach: the values read from memory also get written to stack memory once. This is a consequence of the fact that I must reference the address of each read 256-bit value to `escape` it. This counts as a second write operation, but because it hits the exact same location millions of times, all of these writes go to locations already loaded into cache. Hence a bandwidth reading of $80+GB/s$ that counts them relies heavily on cache and is the “effective” bandwidth, not the actual bandwidth. So I’d guess my actual bandwidth is slightly higher than $45GB/s$. I’ll update the code eventually with inline assembly which does not have this same issue. The code for this benchmark can be found here:\n\n[kBLIS/benchmarks/mGBPS](https://github.com/akcube/kBLIS/tree/main/benchmarks/mGBPS)\n\n## Writing mGFLOPS\n\nApril 7, 2022 - First thing: if there’s any chance of reaching close to the computed theoretical peak, I **cannot** let memory bandwidth bottleneck my program. This means I _cannot_ read any data from memory during the entire computation. We’ll have to initialize registers to some constant/input data and keep performing computations using only this initial data. Since we’re just using registers here, we’ll just talk about SIMD `ADD`'s of registers `A` and `B` as `A+B` without regard for the packed `ADD`.\n\n1. We have exactly 16 256-bit registers free in x86-64 architectures.
Any more and I’ll have to spill over some register values to memory. So if I want to max FLOPS I’ll want to use all 16 registers, and no more. As for which instruction to pick, we are mainly dealing with throughput here. Ideally we’ll want to pick instructions with low latency **and** low CPI (high throughput). But in conflicting cases we’ll probably be better off prioritizing lower CPI. This immediately rules out almost all categories except the [**Arithmetic** category of AVX2 instructions](https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#techs=AVX,AVX2,FMA)\n \n2. As for load/store ops, ideally we’d want to eliminate them as well. They have latency 1 and 0.5 CPI, but they don’t count as floating point operations and using them usually means interfacing with memory, which we want to avoid at all costs. FMA has the highest chance of maximizing throughput thanks to the fused multiply and add it performs, so we should be trying to fully utilize just FMA instructions.\n \n `_mm256_fmadd_ps` has throughput 0.5 CPI and latency 4, which is about the same as the simpler `_mm256_add_ps` instruction but gets two floating-point operations done instead of one.\n \n3. If we’re using FMA we’re computing an expression of the type $DST = A \\times B + C$. To have some meaningful computation, I think I’ll fix registers $A$ and $C$ to be constants and keep $14$ accumulator registers. $A, C$ and $B[0] \\dots B[13]$ will be the input for the program.\n \n\n# Setting up the benchmark environment for BLIS\n\nPretty much an entire week and some more went into the creation of a neat benchmark environment for this.\n\n## The benchmark library\n\nFirst up, the benchmark library that I wrote was specifically tailored for running kernels that followed a specific function signature. I can’t make `BLIS`, `CBLAS` or `KBLAS` conform to that signature. I don’t think C provides a way to typecast function pointers to an arbitrary type and recast them easily, although I’m sure it’s possible. Even if we went this route, each function has specific arguments and we’d end up with a lot of conditional checks for creating arguments to pass to these functions either way. If this were C++, we’d have more options thanks to templates, which would allow me to offload this work to the compiler. Since we don’t have that luxury, the next best way to imitate this behavior in C is with, well... `#define`. This is a terribly hack-y way of doing it, but it works and lets me keep my code looking somewhat “clean”. I essentially made parts of the function a set of macros so I can write my benchmark like so:\n\n```bash\n * Usage:\n * BENCH_START(FRUN, MRUN, DUR, NAME) \n * reset_vars(...)\n * START_RECORD\n * fun(arg1, arg2, ...) \n * END_RECORD\n * BENCH_END\n```\n\nNotice the section where I allow myself to call some function `reset_vars`. A hurdle I came across while benchmarking is that the BLAS prototypes all modify the vector given to them as input. Repeatedly calling the function in a loop, for example in `scal`, just makes the elements go to $+inf$ and that’s not ideal. Floating point computations near the extremities ($+inf/-inf$ or $\\approx0$) are not representative of true performance. This means I’ll need a section of code to reset my vector, and this must not be timed. Hence we make this inclusion here to allow for such reset operations.\n\n### `mem_flush`\n\nAnother thing we’ll be needing is `mem_flush`.
[This function](https://github.com/akcube/kBLIS/blob/main/lib/benchmark.c#L129) is how Intel recommends we flush the cache. This function becomes relevant when we’re dealing with small vectors/matrices where the effects of cache heavily influence the benchmark. It’s very important that the benchmarks are fair and all 3 library implementations are provided with environments as similar as possible.\n\n### `fill_cache`\n\nIn contrast to `mem_flush`, [this function](https://github.com/akcube/kBLIS/blob/main/lib/benchmark.c#L145) does the opposite. It performs a series of sequential reads and populates the cache with cache-lines from whatever memory was passed to it. I realized that `fill_cache` pretty much ends up flushing most unnecessary stuff from cache thanks to cache eviction policies and works pretty well even without `mem_flush`. However, especially for large vectors this might end up being a very slow operation and I don’t want the benchmark to take much time. Hence we relied on the same tools used to write the `mGBPS` benchmark. Using the `escape` trick to turn off compiler optimizations and 256-bit unrolled AVX loads gives us a really fast `fill_cache` function.\n\nRelevant code can be found here\n\n[kBLIS/lib](https://github.com/akcube/kBLIS/tree/main/lib)\n\nNow that this is done, we still need a way to check if our functions return CORRECT results. When making optimizations it’s really easy to get carried away chasing higher numbers and forget about verification. I also want to benchmark across different data-sizes. To do this, I decided to set up the following workflow. The idea is that the benchmark program **must** be placed in a top level directory which looks like this\n\n```bash\n├── benchmark_program\n├── generator_program\n├── input\n│   ├── dscal\n│   │   ├── 1\n│   │   ├── ...\n│   │   └── 7\n│   └── sscal\n│       ├── 1\n│       ├── ...\n│       └── 9\n└── verification\n    ├── dscal\n    │   ├── 1\n    │   ├── ...\n    │   └── 7\n    └── sscal\n        ├── 1\n        ├── ...\n        └── 9\n```\n\nA generator program generates files with some input in the `input/` directory and then also outputs files in the `verification/` directory with the CORRECT output for that input.\n\n## The generator program\n\nFile generation is a standalone program, so I decided to allow myself to use C++ here. With templates in C++, life is a lot easier, and with minimal code we have a neat little program which outputs binary files in a specific format that can be read by a C program. All I have to do is modify the `config` map whenever required and it’ll generate the files for me. Further, a minor modification in the `cmake` file allows me to generate this directory structure.\n\nThe code for the generator program can be found here:\n\n[kBLIS/benchmarks/generator/gen.cpp](https://github.com/akcube/kBLIS/blob/main/benchmarks/generator/gen.cpp)\n\nC++ provides us with nice tools to generate random numbers which follow some probability distribution. I’m using `std::mt19937 gen`, a Mersenne Twister engine that I seed with `std::random_device`. This engine gives us high quality random numbers, and we can combine it with `std::uniform_real_distribution\u003cT\u003e` to get random real numbers in some interval. I chose $1 - 10^6$ as the default range.\n\nAs mentioned in the comments, this is the expected target for the config files.
I believe this will cover enough of a data range to get a good idea of how our modifications affect program results.\n\n```bash\n /**\n\t * The goal here is to benchmark the functions for varying sizes in memory. Ultimately\n\t * memory will end up becoming the benchmark for large vector / matrix operations.\n\t * Here is a reasonable guide to picking config sizes: \n\t * + 1-3 options should fit in L1-L2 cache\n\t * + 4-6 options should fit in L2-L3 cache\n\t * + 7-9 options should overflow cache and start occupying large amounts of system memory\n\t * \n\t * Tip: For the double versions of benchmarks, simply half the memory usage of the float versions\n\t */\n```\n\nFor verification, I rely on the fact that the results provided by `CBLAS` are accurate. These are the results I will test my `KBLAS` implementations against.\n\n## The benchmark\n\nThe benchmark itself is a bit involved. All the files are pretty much duplicates of each other (we are back in C land :() with special argument parsing for each type of function and function call. We can go over one such file, say `sscal_bench.c` [here](https://github.com/akcube/kBLIS/blob/main/benchmarks/sscal_bench.c).\n\n1. The first part of the code simply parses arguments using `getopt`. Setting `min` and `max` limits is useful when I just want quick results for some memory size or when running using `valgrind` or other slow profiling tools.\n \n2. Each file is read one by one and the sizes of the files are displayed. The config I’ve given for the generator program generates files in ascending order of memory usage, and we can see how increasing the data size of our vectors affects FLOPS/sec and bandwidth as more and more cache misses occur.\n \n3. The files all contain binary data in this format: `[int] dim N, [int] dim M, [float/double] N*M matrix`. The functions `get_farg` and `get_darg` in our benchmark library get one argument and move the file pointer to the next argument, if any. This lets us quickly get arguments and populate the right variables. We then make copies of some vectors so we can quickly reset arguments later on.\n \n4. For resetting variables, we call two lines of code.\n \n ```c\n memcpy(X, xcpy, N * sizeof(float));\n fill_cache((const char*) X, N * sizeof(float));\n ```\n \n `memcpy` resets the vector, but pulls both the copy and `X` into cache, where they compete for space. We fill the entire cache with `X` next to evict `xcpy` and keep our working set in cache.\n \n5. After this is just the call to the function being benchmarked. Only this call is in the timed region. It is looped until the function execution time crosses a minimum threshold of $2$ seconds. Then the results are output.\n \n\nThe order of benchmarks being run is $CBLAS \\to KBLAS \\to BLIS$ (per file).\n\nThe following is an example of the file sizes generated for the `scal` benchmarks:\n\n![meg-1](/images/meg-1.webp)\n\nTo make sure the benchmark actually works and is measuring what we want it to measure, I ran the program with `perf` (`perf record -g ./\u003cprogram_name\u003e`). This generates a `perf.data` file which can be used to visualize all the information I want about the program, via `perf report -g 'graph,0.5,caller'`.\n\n![meg-2](/images/meg-2.webp)\n\nTime spent executing code **in** main (not in a child function) is $\\approx0.12\\%$ which is pretty good. Most of the time is sadly spent in the `memcpy` function which has to be called to reset the arguments passed to `#scal`.
I can’t think of a better way around this: for smaller data sizes the function being benchmarked runs relatively fast, and hence is called many times to get an accurate reading. The same is the case for `fill_cache`. It is often executed when we know the contents of `X` are not fully in cache, and hence it takes the brunt of the cache misses and has a slow execution time. These functions are **not** included in the timed region, so they shouldn’t affect benchmark results.\n\nFinally, `bli_sscalv_zen_int10` in `libblis.so.4.0.0` takes about $\\approx18\\%$ execution time while `kblas` takes around $\\approx 9\\%$. The time in the `libblis` shared object contains both the time spent running the `cblas` function call **and** the `blis` function call. A simple way to verify this is to run `perf` on the benchmark while omitting one of the calls to either `CBLAS` or `BLIS`, which gives us\n\n![meg-3](/images/meg-3.webp)\n\nWe now see a much more even distribution. Opening the call graph on the first `perf` report gives a few weird symbols, but it does seem to indicate a split between two calls each taking $\\approx9\\%$ execution time.\n\n**Note:** A lower $\\%$ of execution time on the `perf` report does **NOT** mean a function was faster. Because of the dynamic nature of the benchmark, faster functions might get called more times, and hence the program might spend more time executing the faster functions. What we expect to see is the dynamic nature of the benchmark keeping the execution times approximately equal between the functions, and that is exactly what we see. Hence I’m fairly sure the benchmark does exactly what we expect it to do.\n\nAs a final check, I also deliberately broke the `kblas` code to return incorrect results to ensure that the `verify` functions work, and I can confirm that they passed that test at the very least.\n\nThe relevant benchmarking code can be found here:\n\n[kBLIS/benchmarks](https://github.com/akcube/kBLIS/tree/main/benchmarks)\n\nAs we keep running the benchmark in an iterative feedback → improvement loop, it becomes important to quickly get results for specific memory sizes. So I added an option to pass in flags like `--min=6KB` or `--max=64MB` to run the benchmark only on files that are within this memory capacity. This allows for quicker feedback.\n\n# Starting with KBLAS optimizations\n\n## BLAS Level 1\n\n### `sscal` \u0026 `dscal`\n\nThese are probably the simplest functions to implement in the entire library. Given a vector $X$ and a scalar $\\alpha$, the function performs the following calculation: $X = \\alpha \\cdot X$. A basic implementation would probably look like this. We can augment it with pragmas, but that is about all we can do with them.\n\n```c\nfor (int i = 0; i \u003c N; i++) {\n\tX[i * incX] *= alpha;\n}\n```\n\nThe function signature of the `scal` set of functions looks like this: `void #blas_#scal(const int N, const float alpha, float *X, const int incX)`. I’m particularly interested in a specialization of this function where $incX = 1$. When this is true, we know that the elements we are accessing are all exactly $1$ stride away from each other. That is, they are contiguous and I can make full use of SIMD here. When working with large data, a single branch `if` doesn’t matter much. So we forward this function to a specialized `void kblas_sscal_inc1(const int N, const float alpha, float *X)` when $incX = 1$.
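\n\nA rough sketch of that dispatch and the shape of the vectorized kernel (my simplified illustration; the real kBLIS kernel described next adds far more aggressive unrolling and threading):\n\n```c\n#include \u003cimmintrin.h\u003e\n\nstatic void kblas_sscal_inc1(const int N, const float alpha, float *X) {\n\t__m256 va = _mm256_set1_ps(alpha); // broadcast alpha to all 8 lanes\n\tint i = 0;\n\tfor (; i + 8 \u003c= N; i += 8) {\n\t\t__m256 vx = _mm256_loadu_ps(X + i);\n\t\t_mm256_storeu_ps(X + i, _mm256_mul_ps(vx, va));\n\t}\n\tfor (; i \u003c N; i++) // scalar tail\n\t\tX[i] *= alpha;\n}\n\nvoid kblas_sscal(const int N, const float alpha, float *X, const int incX) {\n\tif (incX == 1) { kblas_sscal_inc1(N, alpha, X); return; }\n\tfor (int i = 0; i \u003c N; i++) // generic strided path\n\t\tX[i * incX] *= alpha;\n}\n```\n\n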
In the actual specialized function, we combine standard loop unrolling with SIMD intrinsics to get maximum performance.\n\n**Observations and design space search:**\n\n1. A simple SIMD implementation of the above function, along with $12\\times$ loop unrolling to facilitate register renaming, does pretty well and is mostly on par with the CBLAS / BLIS implementations. Even without using `#pragma omp simd`, inspecting the instructions generated at `O3` shows us that the compiler pretty much tries to do the same, though with less use of all $16$ SIMD registers.\n \n2. The next step is to parallelize over cores. Here we face the first major hurdle. **False sharing.** Quoting Wikipedia,\n \n \u003e When a system participant attempts to periodically access data that is not being altered by another party, but that data shares a cache block with data that _is_ being altered, the caching protocol may force the first participant to reload the whole cache block despite a lack of logical necessity. The caching system is unaware of activity within this block and forces the first participant to bear the caching system overhead required by true shared access of a resource.\n \n Essentially, with all 16 threads trying to access data, we get an access pattern where each thread pre-fetches some cache lines. If these cache-lines are then requested by another thread, we have to bear the cost of shifting the data across caches. This is especially prevalent with the private per-core L1/L2 caches. This leads to what is known as [thrashing](https://en.wikipedia.org/wiki/Thrashing_(computer_science)), and the resulting cache misses deal huge slowdowns. Hence a simple `#pragma omp parallel for` will not do.\n \n3. Design space search. OpenMP provides a lot of interesting modifiers we can add to the pragmas to get more fine-grained control of the multi-thread allocation. Thread affinity means a lot in these scenarios. Adding `proc_bind(spread)` instructs OpenMP to pick allocations such that the threads are placed as far apart from each other as possible. Similarly, `close` implies they are as close as possible, which includes two hyper-threads operating on the same core.
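\n\nFor illustration, the pragma ends up looking something like this (thread count hypothetical; the counts actually used are discussed below):\n\n```c\n// Spread the threads out so each lands on its own core, keeping their\n// working sets in separate per-core L1/L2 caches.\n#pragma omp parallel for proc_bind(spread) num_threads(2)\nfor (int i = 0; i \u003c N; i++)\n\tX[i] *= alpha;\n```\n\n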
Using 2 threads gives interesting results.\n \n\n**KBLAS - sscal**\n\n```c\n+----+---------------+------------+\n| | Memory used | GFLOPS/s |\n|----+---------------+------------|\n| 0 | 15.6439 | 14.8958 |\n| 1 | 16.6439 | 24.3836 |\n| 2 | 18.2288 | 16.0736 |\n| 3 | 19.2288 | 10.3453 |\n| 4 | 20.2288 | 21.1504 |\n| 5 | 21 | 33.0208 |\n| 6 | 22.3219 | 32.2854 |\n| 7 | 23 | 23.0411 |\n| 8 | 23.585 | 7.6065 |\n| 9 | 24 | 6.10622 |\n| 10 | 24.3219 | 5.61181 |\n| 11 | 25 | 5.15223 |\n| 12 | 26 | 4.20697 |\n| 13 | 27 | 3.59869 |\n| 14 | 28 | 3.57481 |\n| 15 | 29 | 3.86786 |\n| 16 | 30 | 3.75847 |\n+----+---------------+------------+\n```\n\n**CBLAS - sscal**\n\n```c\n+----+---------------+------------+\n| | Memory used | GFLOPS/s |\n|----+---------------+------------|\n| 0 | 15.6439 | 25.8997 |\n| 1 | 16.6439 | 26.482 |\n| 2 | 18.2288 | 22.9104 |\n| 3 | 19.2288 | 22.6618 |\n| 4 | 20.2288 | 19.3967 |\n| 5 | 21 | 19.5086 |\n| 6 | 22.3219 | 18.2897 |\n| 7 | 23 | 13.9195 |\n| 8 | 23.585 | 10.1656 |\n| 9 | 24 | 4.18144 |\n| 10 | 24.3219 | 3.99074 |\n| 11 | 25 | 3.87764 |\n| 12 | 26 | 3.87251 |\n| 13 | 27 | 3.83174 |\n| 14 | 28 | 3.67364 |\n| 15 | 29 | 3.73085 |\n| 16 | 30 | 3.68187 |\n+----+---------------+------------+\n```\n\n**BLIS - sscal**\n\n```c\n+----+---------------+------------+\n| | Memory used | GFLOPS/s |\n|----+---------------+------------|\n| 0 | 15.6439 | 22.2387 |\n| 1 | 16.6439 | 24.5427 |\n| 2 | 18.2288 | 20.8703 |\n| 3 | 19.2288 | 21.7485 |\n| 4 | 20.2288 | 18.7602 |\n| 5 | 21 | 20.1412 |\n| 6 | 22.3219 | 17.9279 |\n| 7 | 23 | 13.5539 |\n| 8 | 23.585 | 9.65506 |\n| 9 | 24 | 4.12919 |\n| 10 | 24.3219 | 3.98638 |\n| 11 | 25 | 3.88883 |\n| 12 | 26 | 3.86826 |\n| 13 | 27 | 3.83379 |\n| 14 | 28 | 3.71239 |\n| 15 | 29 | 3.71288 |\n| 16 | 30 | 3.54376 |\n+----+---------------+------------+\n```\n\n![meg-4](/images/meg-4.webp)\n\n1. The benchmarking workflow I use iteratively increases the size of the files it runs the benchmark on. The first observation from using 2 threads is that performance horribly degrades for small data $(\\leq 1MB)$.\n2. However, after the $1MB$ mark, performance is sky high. Then we again see degradation near $64MB$, with a big drop near $16MB$. This is exactly the point where we run out of cache memory and have to go all the way to main memory. This drop is expected and there is not much we can do about it.\n\n**Optimizing this**\n\n**Why is it that below $\\approx1MB$ using multiple threads has horrible performance?**\n\n→ The answer is fairly simple. $L1/L2$ caches are _really_ fast, and any data accesses made here are so fast that the entire operation finishes really quickly. This means that the overhead that comes from thread initialization and division of cache greatly outweighs the gain over what a single core can crunch through on its own; if it can be supplied data fast enough, it can compute fast enough.\n\n1. This leads us to our first optimization: I disable multi-threading until we reach a certain amount of memory usage. This point clearly comes only _after_ we overflow the $L2$ cache of the core. Each of my cores has $512KB$ of $L2$ cache. For me the tipping point is somewhere around $2 \\times L2 \\ cap$. Capping threads at $2$ is fine but not ideal.\n\n**Can we somehow boost the thread count higher without facing the consequences of false sharing?**\n\nHere is where the OpenMP pragmas which give us more fine-tuned control of thread affinity and allocation come in very useful.
Using `proc_bind(spread)` allows us to increase running threads to $4$ while maintaining a low percentage of false sharing. This gives a considerable boost in performance, but as we keep increasing memory the amount of false sharing only keeps increasing between the 4 competing threads.\n\nNotice the higher GFLOPS/s compared to the 2-thread run.\n\n```c\n+----+---------------+------------+\n| | Memory used | GFLOPS/s |\n|----+---------------+------------|\n| 0 | 15.6439 | 27.5244 |\n| 1 | 16.6439 | 26.9237 |\n| 2 | 18.2288 | 23.145 |\n| 3 | 19.2288 | 19.0416 |\n| 4 | 20.2288 | 26.6045 |\n| 5 | 21 | 35.5922 |\n| 6 | 22.3219 | 39.9146 |\n| 7 | 23 | 18.351 |\n| 8 | 23.585 | 7.57891 |\n| 9 | 24 | 6.17116 |\n| 10 | 24.3219 | 5.78736 |\n| 11 | 25 | 5.08657 |\n| 12 | 26 | 3.88079 |\n| 13 | 27 | 3.56862 |\n| 14 | 28 | 3.61083 |\n| 15 | 29 | 3.48344 |\n| 16 | 30 | 3.19347 |\n+----+---------------+------------+\n```\n\n![meg-5](/images/meg-5.webp)\n\n**Can we do something to reduce false sharing when the memory being read overflows cache size? The drop comes mainly _after_ we overflow all the cache memory we have.**\n\n→ Using non-temporal hints and stores seems like a good idea here. The problem is too many threads competing for limited cache, because each thread ends up filling part of the cache with memory it won’t be accessing ever again. Non-temporal store instructions do not work on all types of memory, but where they apply they hint that the stored data should be evicted from cache once used, whenever possible. Implementing this for sizes that overflow cache turns out to be _very successful_. This, combined with a minor `schedule(static, 128)` while we are in $L2 \\to L3$ territory, gives us our final result.\n\n**KBLAS - sscal**\n\n```c\n+----+---------------+------------+\n| | Memory used | GFLOPS/s |\n|----+---------------+------------|\n| 0 | 15.6439 | 26.7844 |\n| 1 | 16.6439 | 27.0707 |\n| 2 | 18.2288 | 26.3522 |\n| 3 | 19.2288 | 22.8382 |\n| 4 | 20.2288 | 25.8374 |\n| 5 | 21 | 38.0308 |\n| 6 | 22.3219 | 44.9139 |\n| 7 | 23 | 19.6356 |\n| 8 | 23.585 | 8.92693 |\n| 9 | 24 | 6.92278 |\n| 10 | 24.3219 | 6.07048 |\n| 11 | 25 | 5.10881 |\n| 12 | 26 | 4.45963 |\n| 13 | 27 | 4.44933 |\n| 14 | 28 | 4.43135 |\n| 15 | 29 | 4.41979 |\n| 16 | 30 | 4.36659 |\n+----+---------------+------------+\n```\n\n**CBLAS - sscal**\n\n```c\n+----+---------------+------------+\n| | Memory used | GFLOPS/s |\n|----+---------------+------------|\n| 0 | 15.6439 | 26.1373 |\n| 1 | 16.6439 | 27.3349 |\n| 2 | 18.2288 | 26.1963 |\n| 3 | 19.2288 | 22.5089 |\n| 4 | 20.2288 | 19.7618 |\n| 5 | 21 | 19.1649 |\n| 6 | 22.3219 | 19.6915 |\n| 7 | 23 | 12.4671 |\n| 8 | 23.585 | 4.63941 |\n| 9 | 24 | 4.13325 |\n| 10 | 24.3219 | 3.92989 |\n| 11 | 25 | 3.81249 |\n| 12 | 26 | 3.56251 |\n| 13 | 27 | 3.75595 |\n| 14 | 28 | 3.72533 |\n| 15 | 29 | 3.65334 |\n| 16 | 30 | 3.48209 |\n+----+---------------+------------+\n```\n\n**BLIS - sscal**\n\n```c\n+----+---------------+------------+\n| | Memory used | GFLOPS/s |\n|----+---------------+------------|\n| 0 | 15.6439 | 22.8759 |\n| 1 | 16.6439 | 25.4255 |\n| 2 | 18.2288 | 24.7427 |\n| 3 | 19.2288 | 21.9082 |\n| 4 | 20.2288 | 19.2777 |\n| 5 | 21 | 19.2581 |\n| 6 | 22.3219 | 17.7078 |\n| 7 | 23 | 12.3045 |\n| 8 | 23.585 | 4.60922 |\n| 9 | 24 | 4.03432 |\n| 10 | 24.3219 | 3.91525 |\n| 11 | 25 | 3.89575 |\n| 12 | 26 | 3.8172 |\n| 13 | 27 | 3.72308 |\n| 14 | 28 | 3.72022 |\n| 15 | 29 | 3.66334 |\n| 16 | 30 | 3.37475 
|\n+----+---------------+------------+\n```\n\n![meg-6](/images/meg-6.webp)\n\nOur `sscal` implementation pretty much beats `BLIS` and `CBLAS`, by quite high margins, in practically all workloads. It never under-performs: it is either on par or much better. The massive gain is seen for vectors ranging in size from $1MB$ to $16MB$. Even after we are past $L3$, however, we are still able to consistently produce at least $\\approx1 GFLOPS/s$ more than BLIS and CBLAS.\n\nBecause our optimizations rely entirely on the memory sizes we’re working with, the code is pretty much the exact same for `dscal`.\n\n**KBLAS - dscal**\n\n```c\n+----+---------------+------------+\n| | Memory used | GFLOPS/s |\n|----+---------------+------------|\n| 0 | 15.6439 | 13.6625 |\n| 1 | 16.6439 | 13.8026 |\n| 2 | 18.2288 | 10.9766 |\n| 3 | 19.2288 | 11.3643 |\n| 4 | 20.2288 | 9.42722 |\n| 5 | 21 | 19.7346 |\n| 6 | 22.3219 | 19.2039 |\n| 7 | 23 | 8.89872 |\n| 8 | 23.585 | 4.8018 |\n| 9 | 24 | 2.39269 |\n| 10 | 24.3219 | 2.25448 |\n| 11 | 25 | 2.19517 |\n| 12 | 26 | 2.19255 |\n| 13 | 27 | 2.20313 |\n| 14 | 28 | 2.15703 |\n| 15 | 29 | 2.16863 |\n| 16 | 30 | 2.19908 |\n+----+---------------+------------+\n```\n\n**CBLAS - dscal**\n\n```c\n+----+---------------+------------+\n| | Memory used | GFLOPS/s |\n|----+---------------+------------|\n| 0 | 15.6439 | 13.3615 |\n| 1 | 16.6439 | 13.5734 |\n| 2 | 18.2288 | 12.2449 |\n| 3 | 19.2288 | 11.2757 |\n| 4 | 20.2288 | 9.47351 |\n| 5 | 21 | 9.92226 |\n| 6 | 22.3219 | 9.2906 |\n| 7 | 23 | 5.29672 |\n| 8 | 23.585 | 2.52696 |\n| 9 | 24 | 2.10817 |\n| 10 | 24.3219 | 2.00232 |\n| 11 | 25 | 1.95269 |\n| 12 | 26 | 1.91755 |\n| 13 | 27 | 1.89866 |\n| 14 | 28 | 1.82919 |\n| 15 | 29 | 1.80406 |\n| 16 | 30 | 1.80269 |\n+----+---------------+------------+\n```\n\n**BLIS - dscal**\n\n```c\n+----+---------------+------------+\n| | Memory used | GFLOPS/s |\n|----+---------------+------------|\n| 0 | 15.6439 | 11.6296 |\n| 1 | 16.6439 | 12.8131 |\n| 2 | 18.2288 | 11.4942 |\n| 3 | 19.2288 | 10.9634 |\n| 4 | 20.2288 | 9.58147 |\n| 5 | 21 | 9.65832 |\n| 6 | 22.3219 | 7.65926 |\n| 7 | 23 | 5.11889 |\n| 8 | 23.585 | 2.54898 |\n| 9 | 24 | 2.10908 |\n| 10 | 24.3219 | 1.99358 |\n| 11 | 25 | 1.94815 |\n| 12 | 26 | 1.92111 |\n| 13 | 27 | 1.89673 |\n| 14 | 28 | 1.82711 |\n| 15 | 29 | 1.77158 |\n| 16 | 30 | 1.76492 |\n+----+---------------+------------+\n```\n\n![meg-7](/images/meg-7.webp)\n\nRelevant code for `scal` can be found here:\n\n1. `sscal`\n\n[kBLIS/src/kblas_sscal.c](https://github.com/akcube/kBLIS/blob/main/src/kblas_sscal.c)\n\n2. `dscal`\n\n[kBLIS/src/kblas_dscal.c](https://github.com/akcube/kBLIS/blob/main/src/kblas_dscal.c)\n\n### `sdot` \u0026 `ddot`\n\nAll the memory-based parameter tuning we did applies to every benchmark, not just `sscal` and `dscal`. What’s left is simply figuring out the best set of instructions to compute `sdot` or `ddot`. There are a few possible ideas we can consider.\n\n1. Load + FMA: `hadd` is a slow instruction; keeping an accumulator we can do FMA with might be useful.\n2. Load + Regular multiply + add: Not sure if FMA is necessary / worth it. Should experiment with regular multiply + add as well.\n3. Load + DPPS: Could be a possibility to explore, although on Zen3, according to Agner Fog’s tables, this instruction is slower and limiting as we can only work with 128 bits now.
This is especially limiting when working with double precision numbers, and hence not scalable enough for consideration.\n\n**Data races**\n\nAnother challenge we come across here is the possibility of dealing with data races when parallelizing this operation over multiple threads. Each thread has to update the shared result variable. This is problematic as this write is not atomic and can lead to us having wildly incorrect results. One idea is to simply mark the update with `#pragma omp atomic` inside our loop. This makes the write atomic. However, it means the shared sum is updated on every iteration of the loop, which has considerable overhead, especially since the writes _must_ be atomic.\n\nThe alternative solution I have in mind is to allow each thread to privately keep accumulating its copy of the variable during its run. Finally, when the thread is done with its execution, we add the value of the **thread local** accumulated variable to the **shared** result variable. Doing this with OpenMP is not very straightforward, however. This is what we are required to do:\n\n```c\n#pragma omp parallel reduction(+:result) num_threads(...)\n{\n\t#pragma omp for nowait\n}\n```\n\nWhat this does is tell OpenMP to parallelize code within that block. Adding a `#pragma omp for` inside this block causes the for loop to tell the parent parallel block that it requires a thread group, and then parallelizes it over a set of threads. The private/shared/reduction rules imposed by the parent apply to these worker threads as well. Now, I can create local variables inside this block which the loop sums over. Once this is done, we can use the reduction rule to add these partial accumulated sums to the global result variable. The `nowait` clause simply instructs each thread not to wait for the rest of the `for` threads to finish before continuing execution.
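\n\nFleshed out, the pattern looks something like this (a minimal sketch; the thread count is hypothetical and `X`, `Y`, `N` are assumed to exist):\n\n```c\nfloat result = 0.0f;\n#pragma omp parallel reduction(+:result) num_threads(4)\n{\n\tfloat local = 0.0f; // thread-local accumulator\n\t#pragma omp for nowait // skip the implicit barrier at the end of the loop\n\tfor (int i = 0; i \u003c N; i++)\n\t\tlocal += X[i] * Y[i];\n\tresult += local; // folded into the shared sum by the reduction\n}\n```\n\n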
This is fine as the computation following the for loop is not dependent on anything that is being computed by any of the other threads in the group or vice versa.\n\nImplementing all of this gives us our final performance results:\n\n**KBLAS - sdot**\n\n```c\n+----+---------------+------------+\n| | Memory used | GFLOPS/s |\n|----+---------------+------------|\n| 0 | 15.6439 | 32.4182 |\n| 1 | 16.6439 | 33.4257 |\n| 2 | 18.2288 | 33.7023 |\n| 3 | 19.2288 | 25.3582 |\n| 4 | 20.2288 | 22.8789 |\n| 5 | 21 | 64.2133 |\n| 6 | 22.3219 | 62.8844 |\n| 7 | 23 | 78.8638 |\n| 8 | 23.585 | 53.4039 |\n| 9 | 24 | 49.8571 |\n| 10 | 24.3219 | 33.9986 |\n| 11 | 25 | 16.6836 |\n| 12 | 26 | 13.309 |\n| 13 | 27 | 11.0947 |\n| 14 | 28 | 11.1318 |\n| 15 | 29 | 11.2067 |\n| 16 | 30 | 11.0233 |\n+----+---------------+------------+\n```\n\n**CBLAS - sdot**\n\n```c\n+----+---------------+------------+\n| | Memory used | GFLOPS/s |\n|----+---------------+------------|\n| 0 | 15.6439 | 30.5628 |\n| 1 | 16.6439 | 32.1508 |\n| 2 | 18.2288 | 32.8369 |\n| 3 | 19.2288 | 29.2845 |\n| 4 | 20.2288 | 23.0307 |\n| 5 | 21 | 26.1548 |\n| 6 | 22.3219 | 26.7091 |\n| 7 | 23 | 24.1143 |\n| 8 | 23.585 | 21.4597 |\n| 9 | 24 | 17.921 |\n| 10 | 24.3219 | 15.7512 |\n| 11 | 25 | 12.1024 |\n| 12 | 26 | 9.4867 |\n| 13 | 27 | 8.50288 |\n| 14 | 28 | 8.36098 |\n| 15 | 29 | 9.42331 |\n| 16 | 30 | 8.35616 |\n+----+---------------+------------+\n```\n\n**BLIS - sdot**\n\n```c\n+----+---------------+------------+\n| | Memory used | GFLOPS/s |\n|----+---------------+------------|\n| 0 | 15.6439 | 27.5369 |\n| 1 | 16.6439 | 30.5812 |\n| 2 | 18.2288 | 31.9513 |\n| 3 | 19.2288 | 28.1317 |\n| 4 | 20.2288 | 22.9166 |\n| 5 | 21 | 26.6008 |\n| 6 | 22.3219 | 20.0403 |\n| 7 | 23 | 22.6566 |\n| 8 | 23.585 | 23.7281 |\n| 9 | 24 | 16.7784 |\n| 10 | 24.3219 | 15.6218 |\n| 11 | 25 | 12.6145 |\n| 12 | 26 | 9.31534 |\n| 13 | 27 | 8.63364 |\n| 14 | 28 | 8.49679 |\n| 15 | 29 | 9.32126 |\n| 16 | 30 | 8.37292 |\n+----+---------------+------------+\n```\n\n![meg-8](/images/meg-8.webp)\n\n**No write-backs**\n\nSomething I noticed then is the very high FLOPS/s I got on `dot` compared to `scal`. An obvious reason for this is that in `dot`, I don’t have to perform any write-backs to memory. Does this mean the amount of false sharing I have to compete against reduces by a good amount? I decided to run the benchmark again but this time letting myself access more cores and threads than previously. This turns out to be an _excellent_ idea and I get the following results. Amazing performance boost. 
With this we are able to clearly beat BLIS and CBLAS in both small and large data sizes and give almost $5\\times$ the performance in the medium ranges $\\approx1-16MB$.\n\n**KBLAS - sdot**\n\n```c\n+----+---------------+------------+\n| | Memory used | GFLOPS/s |\n|----+---------------+------------|\n| 0 | 15.6439 | 30.4949 |\n| 1 | 16.6439 | 28.8954 |\n| 2 | 18.2288 | 54.5931 |\n| 3 | 19.2288 | 74.1311 |\n| 4 | 20.2288 | 89.6297 |\n| 5 | 21 | 80.4699 |\n| 6 | 22.3219 | 68.7903 |\n| 7 | 23 | 68.5272 |\n| 8 | 23.585 | 65.7599 |\n| 9 | 24 | 47.4145 |\n| 10 | 24.3219 | 32.0242 |\n| 11 | 25 | 16.6444 |\n| 12 | 26 | 12.7048 |\n| 13 | 27 | 10.9017 |\n| 14 | 28 | 10.5428 |\n| 15 | 29 | 10.823 |\n| 16 | 30 | 10.5882 |\n+----+---------------+------------+\n```\n\n**CBLAS - sdot**\n\n```c\n+----+---------------+------------+\n| | Memory used | GFLOPS/s |\n|----+---------------+------------|\n| 0 | 15.6439 | 29.5658 |\n| 1 | 16.6439 | 30.9696 |\n| 2 | 18.2288 | 31.3796 |\n| 3 | 19.2288 | 21.7191 |\n| 4 | 20.2288 | 21.2733 |\n| 5 | 21 | 26.1364 |\n| 6 | 22.3219 | 24.5054 |\n| 7 | 23 | 22.3898 |\n| 8 | 23.585 | 21.1398 |\n| 9 | 24 | 17.3707 |\n| 10 | 24.3219 | 15.4571 |\n| 11 | 25 | 12.7241 |\n| 12 | 26 | 9.03112 |\n| 13 | 27 | 8.587 |\n| 14 | 28 | 8.41701 |\n| 15 | 29 | 8.29715 |\n| 16 | 30 | 8.47759 |\n+----+---------------+------------+\n```\n\n**BLIS - sdot**\n\n```c\n+----+---------------+------------+\n| | Memory used | GFLOPS/s |\n|----+---------------+------------|\n| 0 | 15.6439 | 26.5525 |\n| 1 | 16.6439 | 29.7291 |\n| 2 | 18.2288 | 29.3259 |\n| 3 | 19.2288 | 26.7111 |\n| 4 | 20.2288 | 26.242 |\n| 5 | 21 | 22.6719 |\n| 6 | 22.3219 | 22.0736 |\n| 7 | 23 | 24.3165 |\n| 8 | 23.585 | 23.2249 |\n| 9 | 24 | 16.7992 |\n| 10 | 24.3219 | 15.5576 |\n| 11 | 25 | 12.6779 |\n| 12 | 26 | 9.20955 |\n| 13 | 27 | 8.56137 |\n| 14 | 28 | 8.3971 |\n| 15 | 29 | 8.29895 |\n| 16 | 30 | 8.48299 |\n+----+---------------+------------+\n```\n\nThe implementation for `ddot` is not very different. We just have to modify the horizontal add instructions a little bit. 
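\n\nFor example, one way the double-precision horizontal sum could look (my own illustration of the intrinsic sequence, not necessarily the kBLIS one):\n\n```c\n#include \u003cimmintrin.h\u003e\n\n// Reduce a __m256d accumulator to a single double.\nstatic inline double hsum256_pd(__m256d v) {\n\t__m128d lo = _mm256_castpd256_pd128(v); // lower 128 bits\n\t__m128d hi = _mm256_extractf128_pd(v, 1); // upper 128 bits\n\tlo = _mm_add_pd(lo, hi); // two pairwise sums\n\t__m128d shuf = _mm_unpackhi_pd(lo, lo); // move the high half down\n\treturn _mm_cvtsd_f64(_mm_add_sd(lo, shuf));\n}\n```\n\n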
This isn’t really ever going to be a hot-spot in our code, so I’ve just gone with a simple, readable but not ultra-efficient implementation for both float/double versions.\n\n![meg-9](/images/meg-9.webp)\n\n**KBLAS - ddot**\n\n```c\n+----+---------------+------------+\n| | Memory used | GFLOPS/s |\n|----+---------------+------------|\n| 0 | 15.6439 | 15.1987 |\n| 1 | 16.6439 | 16.2712 |\n| 2 | 18.2288 | 27.3627 |\n| 3 | 19.2288 | 38.0566 |\n| 4 | 20.2288 | 44.6123 |\n| 5 | 21 | 39.0425 |\n| 6 | 22.3219 | 35.3251 |\n| 7 | 23 | 26.071 |\n| 8 | 23.585 | 26.3589 |\n| 9 | 24 | 21.2313 |\n| 10 | 24.3219 | 15.2867 |\n| 11 | 25 | 8.10269 |\n| 12 | 26 | 6.00609 |\n| 13 | 27 | 4.95448 |\n| 14 | 28 | 5.06161 |\n| 15 | 29 | 5.11223 |\n| 16 | 30 | 5.02176 |\n+----+---------------+------------+\n```\n\n**CBLAS - ddot**\n\n```c\n+----+---------------+------------+\n| | Memory used | GFLOPS/s |\n|----+---------------+------------|\n| 0 | 15.6439 | 15.0312 |\n| 1 | 16.6439 | 15.5627 |\n| 2 | 18.2288 | 16.5811 |\n| 3 | 19.2288 | 14.0308 |\n| 4 | 20.2288 | 11.8751 |\n| 5 | 21 | 12.0909 |\n| 6 | 22.3219 | 11.5726 |\n| 7 | 23 | 10.9247 |\n| 8 | 23.585 | 8.31322 |\n| 9 | 24 | 7.65811 |\n| 10 | 24.3219 | 7.40659 |\n| 11 | 25 | 5.92358 |\n| 12 | 26 | 4.44579 |\n| 13 | 27 | 4.608 |\n| 14 | 28 | 4.13111 |\n| 15 | 29 | 4.00058 |\n| 16 | 30 | 4.1858 |\n+----+---------------+------------+\n```\n\n**BLIS - ddot**\n\n```c\n+----+---------------+------------+\n| | Memory used | GFLOPS/s |\n|----+---------------+------------|\n| 0 | 15.6439 | 13.3349 |\n| 1 | 16.6439 | 15.0041 |\n| 2 | 18.2288 | 15.6681 |\n| 3 | 19.2288 | 13.8185 |\n| 4 | 20.2288 | 13.5891 |\n| 5 | 21 | 11.3213 |\n| 6 | 22.3219 | 11.2515 |\n| 7 | 23 | 10.7465 |\n| 8 | 23.585 | 7.80139 |\n| 9 | 24 | 7.52648 |\n| 10 | 24.3219 | 6.80121 |\n| 11 | 25 | 5.93768 |\n| 12 | 26 | 4.5354 |\n| 13 | 27 | 4.42863 |\n| 14 | 28 | 4.14201 |\n| 15 | 29 | 4.01051 |\n| 16 | 30 | 4.17515 |\n+----+---------------+------------+\n```\n\n![meg-10](/images/meg-10.webp)\n\nRelevant code for `dot` can be found here:\n\n1. `sdot`\n\n[kBLIS/src/kblas_sdot.c](https://github.com/akcube/kBLIS/blob/main/src/kblas_sdot.c)\n\n2. `ddot`\n\n[kBLIS/src/kblas_ddot.c](https://github.com/akcube/kBLIS/blob/main/src/kblas_ddot.c)\n\n### `saxpy` \u0026 `daxpy`\n\nThe `axpy` functions are sort of like a mix of `dot` and `scal`. We’ll need to handle working with 2 vectors like in `dot`, and we’ll need to handle write-backs like we did in `scal`. The learnings we had while implementing those two functions alone ought to be enough to implement this. Also, just like in `dot`, we can use FMA instructions to gain speedup on the FLOPS side as well, although our program will ultimately end up memory bound more than anything.\n\n`FMA` turns out to be really beneficial here and gives us a noticeable increase in performance compared to, say, something like `scal`. Something else I noticed is that `BLIS` also performs reasonably well here: much better than `CBLAS`, close to the performance of `KBLAS`, and beating it at times. FMA is helpful enough that switching to 2 threads right when crossing the $L2$ boundary ends up being a net loss, unlike in `scal`. Multiple threads, however, again end up being beneficial as we keep increasing the data size.
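\n\nThe core of the contiguous kernel is just one fused multiply-add per vector of elements; a simplified sketch (unrolling and thread dispatch omitted, `n` assumed divisible by 8):\n\n```c\n#include \u003cimmintrin.h\u003e\n\n// y = alpha * x + y over contiguous vectors\nvoid saxpy_fma(long n, float alpha, const float *x, float *y) {\n\t__m256 va = _mm256_set1_ps(alpha);\n\tfor (long i = 0; i \u003c n; i += 8) {\n\t\t__m256 vx = _mm256_loadu_ps(x + i);\n\t\t__m256 vy = _mm256_loadu_ps(y + i);\n\t\t_mm256_storeu_ps(y + i, _mm256_fmadd_ps(va, vx, vy)); // alpha*x[i..] + y[i..]\n\t}\n}\n```\n\n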
The final results are here:\n\n**KBLAS - saxpy**\n\n```c\n+----+---------------+------------+\n| | Memory used | GFLOPS/s |\n|----+---------------+------------|\n| 0 | 15.6439 | 49.3417 |\n| 1 | 16.6439 | 50.1562 |\n| 2 | 18.2288 | 50.7136 |\n| 3 | 19.2288 | 35.1016 |\n| 4 | 20.2288 | 24.0396 |\n| 5 | 21 | 52.2806 |\n| 6 | 22.3219 | 59.2832 |\n| 7 | 23 | 55.3022 |\n| 8 | 23.585 | 32.9809 |\n| 9 | 24 | 26.1802 |\n| 10 | 24.3219 | 23.2948 |\n| 11 | 25 | 19.4666 |\n| 12 | 26 | 13.5502 |\n| 13 | 27 | 10.8165 |\n| 14 | 28 | 11.3821 |\n| 15 | 29 | 11.4666 |\n| 16 | 30 | 11.0536 |\n+----+---------------+------------+\n```\n\n**CBLAS - saxpy**\n\n```c\n+----+---------------+------------+\n| | Memory used | GFLOPS/s |\n|----+---------------+------------|\n| 0 | 15.6439 | 29.8596 |\n| 1 | 16.6439 | 31.7331 |\n| 2 | 18.2288 | 31.0241 |\n| 3 | 19.2288 | 25.6688 |\n| 4 | 20.2288 | 24.2299 |\n| 5 | 21 | 23.6875 |\n| 6 | 22.3219 | 23.3208 |\n| 7 | 23 | 21.5013 |\n| 8 | 23.585 | 17.2897 |\n| 9 | 24 | 8.55303 |\n| 10 | 24.3219 | 7.55571 |\n| 11 | 25 | 7.12983 |\n| 12 | 26 | 6.1104 |\n| 13 | 27 | 5.63097 |\n| 14 | 28 | 5.46632 |\n| 15 | 29 | 5.33872 |\n| 16 | 30 | 5.38933 |\n+----+---------------+------------+\n```\n\n**BLIS - saxpy**\n\n```c\n+----+---------------+------------+\n| | Memory used | GFLOPS/s |\n|----+---------------+------------|\n| 0 | 15.6439 | 24.8586 |\n| 1 | 16.6439 | 27.4797 |\n| 2 | 18.2288 | 29.0903 |\n| 3 | 19.2288 | 24.9761 |\n| 4 | 20.2288 | 22.4395 |\n| 5 | 21 | 24.4729 |\n| 6 | 22.3219 | 25.6598 |\n| 7 | 23 | 24.5933 |\n| 8 | 23.585 | 17.1534 |\n| 9 | 24 | 8.51503 |\n| 10 | 24.3219 | 7.55425 |\n| 11 | 25 | 7.1247 |\n| 12 | 26 | 6.13848 |\n| 13 | 27 | 5.59954 |\n| 14 | 28 | 5.47486 |\n| 15 | 29 | 5.33735 |\n| 16 | 30 | 5.13565 |\n+----+---------------+------------+\n```\n\n![meg-11](/images/meg-11.webp)\n\nPretty much same implementation for `daxpy`.\n\n**KBLAS - daxpy**\n\n```c\n+----+---------------+------------+\n| | Memory used | GFLOPS/s |\n|----+---------------+------------|\n| 0 | 15.6439 | 15.6674 |\n| 1 | 16.6439 | 16.0682 |\n| 2 | 18.2288 | 15.5728 |\n| 3 | 19.2288 | 12.1665 |\n| 4 | 20.2288 | 11.7703 |\n| 5 | 21 | 25.9659 |\n| 6 | 22.3219 | 27.036 |\n| 7 | 23 | 26.9574 |\n| 8 | 23.585 | 13.9522 |\n| 9 | 24 | 7.29532 |\n| 10 | 24.3219 | 5.08099 |\n| 11 | 25 | 3.43613 |\n| 12 | 26 | 2.95682 |\n| 13 | 27 | 2.65331 |\n| 14 | 28 | 2.80176 |\n| 15 | 29 | 2.8763 |\n| 16 | 30 | 2.81742 |\n+----+---------------+------------+\n```\n\n**CBLAS - daxpy**\n\n```c\n+----+---------------+------------+\n| | Memory used | GFLOPS/s |\n|----+---------------+------------|\n| 0 | 15.6439 | 14.7215 |\n| 1 | 16.6439 | 15.8373 |\n| 2 | 18.2288 | 15.248 |\n| 3 | 19.2288 | 12.8118 |\n| 4 | 20.2288 | 11.3691 |\n| 5 | 21 | 11.6618 |\n| 6 | 22.3219 | 11.7689 |\n| 7 | 23 | 10.2316 |\n| 8 | 23.585 | 9.17419 |\n| 9 | 24 | 4.22879 |\n| 10 | 24.3219 | 3.74399 |\n| 11 | 25 | 3.25647 |\n| 12 | 26 | 2.76513 |\n| 13 | 27 | 2.67007 |\n| 14 | 28 | 2.66673 |\n| 15 | 29 | 2.75282 |\n| 16 | 30 | 2.41476 |\n+----+---------------+------------+\n```\n\n**BLIS - daxpy**\n\n```c\n+----+---------------+------------+\n| | Memory used | GFLOPS/s |\n|----+---------------+------------|\n| 0 | 15.6439 | 12.379 |\n| 1 | 16.6439 | 13.378 |\n| 2 | 18.2288 | 14.4791 |\n| 3 | 19.2288 | 11.6106 |\n| 4 | 20.2288 | 12.0553 |\n| 5 | 21 | 10.2339 |\n| 6 | 22.3219 | 11.4045 |\n| 7 | 23 | 9.68807 |\n| 8 | 23.585 | 8.84999 |\n| 9 | 24 | 4.24071 |\n| 10 | 24.3219 | 3.57199 |\n| 11 | 25 | 3.2515 |\n| 12 | 26 | 2.92566 |\n| 13 | 27 | 
2.67459 |\n| 14 | 28 | 2.66347 |\n| 15 | 29 | 2.66854 |\n| 16 | 30 | 2.44301 |\n+----+---------------+------------+\n```\n\n![meg-12](/images/meg-12.webp)\n\nWe’re pretty much able to beat BLIS and CBLAS in all of the level 1 functions we implemented. Pretty satisfactory results.\n\n## BLAS Level 2\n\n### `kblas_sgemv`\n\nGeneralized matrix-vector multiply. Given an $m\\times n$ matrix $M$ and an $n \\times 1$ vector $v$, it computes the product vector $A = Mv$. The final dimensions of $A$ are $m \\times 1$. There are 2 main cases we need to consider here\n\n1. Let’s say the matrix is stored in ROW order and we are computing $Mv$. The memory is stored contiguously and all our matrix reads will be prefetched by the pre-fetcher. The vector access pattern is also relatively simple to predict. This means we can just do normal reads and do the multiplication like we normally code matrix-vector multiplication.\n\n![meg-13](/images/meg-13.webp)\n\n2. Consider the case where we have to compute $M^Tv$, or the matrix $M$ is stored in COL order and we have to compute $Mv$. In both these cases our matrix data access pattern looks like the figure above. Massive jumps in memory accesses mean pre-fetching these addresses is very difficult. We would have to pre-fetch every single time we iterate over the indices in the column, and cannot rely on the fetched cache-line to contain more than a single useful element. To counter this, we can instead scale the first row of the matrix by $v[0]$ and accumulate it into $A$. This gives us a partial summation for $A$. Repeatedly accumulating this summation means we have race conditions to take care of, but it at least ensures that our data access pattern is cache friendly.\n\nWhat this means is we want to divide up the matrix-vector multiply into 2 cases: one where we do normal strided multiplication and one where we account for the large jumps. To account for the leading dimension, where sometimes we’ll have to multiply sub-matrices with vectors, we can experiment with adding pre-fetch statements in code to hint the CPU to fetch those lines into cache. This should also hopefully help us reduce the effect of cache misses on our function.\n\n### Case #1\n\nI have quite a few ideas I’d like to try out here. After the end of the operation, our final vector $Y$ has the form\n\n$$ Y[i] = \\beta \\times Y[i] + \\alpha \\times(A[i][0] \\cdot X[0] + A[i][1] \\cdot X[1] + \\dots + A[i][n-1] \\cdot X[n-1]) \\\\ \\implies Y[i] = \\beta \\times Y[i] + \\alpha \\times \\sum_{j=0}^{n-1}A[i][j]\\cdot X[j] $$\n\nOne idea here is to run our optimized version of `scal` first to transform $Y \\to \\beta \\times Y$ and then compute the summation, following which we can multiply it by $\\alpha$ before adding it to $\\beta Y[i]$. However, we can’t do this with FMA as we’re only computing one sum at a time. And I highly doubt that it is profitable to store each element in memory before doing the final computation with FMA. This method also requires that I first iterate over all of $Y$ once and then do it again when adding the second term to it. The lack of FMA and this double iteration might prove to be slower. But again, since the computation using $A$, which is an $M \\times N$ matrix, is the bottleneck, the double $Y$ access might not be a big problem after all.\n\nAnother idea I have here is to compute $\\alpha \\times \\sum_{j=k}^{k+8}A[i][j]\\cdot X[j]$ first. We can then use FMA to add this sum $s$ to $Y$ like so: $Y[i] = \\beta \\times Y[i] + s$.
This allows using FMA and we don’t access $Y$ twice, but this has the penalty that I’ll end up accessing up to 8 rows of $A$ at the same time. That’s not a _very_ nice cache access pattern.\n\nI can’t really reason out anything as to which method might be faster, so the only option left is to code both up, benchmark and let the results speak for themselves.\n\nThe first idea when implemented gives the following results\n\n**KBLAS - sgemv**\n\n```c\n+----+----------+----------+\n| | Memory | Flops |\n|----+----------+----------|\n| 0 | 15.6437 | 48.404 |\n| 1 | 16.6438 | 69.8283 |\n| 2 | 18.2287 | 107.688 |\n| 3 | 19.2288 | 119.767 |\n| 4 | 20.2288 | 120.499 |\n| 5 | 20.9997 | 128.732 |\n| 6 | 22.3219 | 91.4613 |\n| 7 | 22.9997 | 118.113 |\n| 8 | 23.5848 | 112.7 |\n| 9 | 24 | 73.694 |\n| 10 | 24.3222 | 63.1474 |\n| 11 | 25.0001 | 46.3859 |\n| 12 | 26 | 40.752 |\n| 13 | 26.9999 | 36.2615 |\n| 14 | 28 | 33.7448 |\n| 15 | 28.9999 | 34.0236 |\n| 16 | 30 | 34.1832 |\n+----+----------+----------+\n```\n\n**CBLAS - sgemv**\n\n```c\n+----+----------+---------+\n| | Memory | Flops |\n|----+----------+---------|\n| 0 | 15.6437 | 44.5799 |\n| 1 | 16.6438 | 46.048 |\n| 2 | 18.2287 | 56.5389 |\n| 3 | 19.2288 | 46.6531 |\n| 4 | 20.2288 | 51.3103 |\n| 5 | 20.9997 | 44.292 |\n| 6 | 22.3219 | 47.1806 |\n| 7 | 22.9997 | 51.8916 |\n| 8 | 23.5848 | 44.6651 |\n| 9 | 24 | 36.2907 |\n| 10 | 24.3222 | 44.9791 |\n| 11 | 25.0001 | 32.7863 |\n| 12 | 26 | 28.3062 |\n| 13 | 26.9999 | 28.809 |\n| 14 | 28 | 26.7651 |\n| 15 | 28.9999 | 27.4012 |\n| 16 | 30 | 27.1336 |\n+----+----------+---------+\n```\n\n**BLIS - sgemv**\n\n```c\n+----+----------+---------+\n| | Memory | Flops |\n|----+----------+---------|\n| 0 | 15.6437 | 36.9656 |\n| 1 | 16.6438 | 41.2038 |\n| 2 | 18.2287 | 54.2682 |\n| 3 | 19.2288 | 48.5784 |\n| 4 | 20.2288 | 50.0798 |\n| 5 | 20.9997 | 48.5103 |\n| 6 | 22.3219 | 50.0211 |\n| 7 | 22.9997 | 47.2412 |\n| 8 | 23.5848 | 45.4841 |\n| 9 | 24 | 39.0531 |\n| 10 | 24.3222 | 39.2671 |\n| 11 | 25.0001 | 34.6988 |\n| 12 | 26 | 29.3634 |\n| 13 | 26.9999 | 28.87 |\n| 14 | 28 | 26.8491 |\n| 15 | 28.9999 | 27.1517 |\n| 16 | 30 | 26.9858 |\n+----+----------+---------+\n```\n\n![meg-14](/images/meg-14.webp)\n\n**KBLAS - dgemv**\n\n```c\n+----+----------+---------+\n| | Memory | Flops |\n|----+----------+---------|\n| 0 | 15.6436 | 27.3551 |\n| 1 | 16.6437 | 40.9623 |\n| 2 | 18.2289 | 62.0474 |\n| 3 | 19.2287 | 67.2187 |\n| 4 | 20.2288 | 66.3189 |\n| 5 | 21 | 65.1162 |\n| 6 | 22.3218 | 52.9969 |\n| 7 | 23 | 58.5811 |\n| 8 | 23.5846 | 52.6212 |\n| 9 | 23.9997 | 29.5139 |\n| 10 | 24.3218 | 23.6324 |\n| 11 | 25 | 23.1014 |\n| 12 | 26.0001 | 17.6225 |\n| 13 | 27 | 16.8314 |\n| 14 | 27.9999 | 16.0933 |\n| 15 | 29 | 15.6468 |\n| 16 | 29.9999 | 16.1884 |\n+----+----------+---------+\n```\n\n**CBLAS - dgemv**\n\n```c\n+----+----------+---------+\n| | Memory | Flops |\n|----+----------+---------|\n| 0 | 15.6436 | 30.6115 |\n| 1 | 16.6437 | 34.3677 |\n| 2 | 18.2289 | 35.0149 |\n| 3 | 19.2287 | 24.8823 |\n| 4 | 20.2288 | 25.6034 |\n| 5 | 21 | 22.1813 |\n| 6 | 22.3218 | 22.2271 |\n| 7 | 23 | 23.7892 |\n| 8 | 23.5846 | 21.9209 |\n| 9 | 23.9997 | 21.3864 |\n| 10 | 24.3218 | 16.6219 |\n| 11 | 25 | 15.3128 |\n| 12 | 26.0001 | 14.4105 |\n| 13 | 27 | 12.0814 |\n| 14 | 27.9999 | 13.09 |\n| 15 | 29 | 11.4815 |\n| 16 | 29.9999 | 12.575 |\n+----+----------+---------+\n```\n\n\n**BLIS - dgemv**\n\n```c\n+----+----------+---------+\n| | Memory | Flops |\n|----+----------+---------|\n| 0 | 15.6436 | 24.0806 |\n| 1 | 16.6437 | 30.0972 |\n| 2 | 
18.2289 | 33.0577 |\n| 3 | 19.2287 | 27.7414 |\n| 4 | 20.2288 | 26.6681 |\n| 5 | 21 | 24.6108 |\n| 6 | 22.3218 | 24.2242 |\n| 7 | 23 | 23.7262 |\n| 8 | 23.5846 | 24.394 |\n| 9 | 23.9997 | 18.3102 |\n| 10 | 24.3218 | 16.1597 |\n| 11 | 25 | 16.7286 |\n| 12 | 26.0001 | 13.944 |\n| 13 | 27 | 12.1705 |\n| 14 | 27.9999 | 12.4518 |\n| 15 | 29 | 13.0285 |\n| 16 | 29.9999 | 13.3451 |\n+----+----------+---------+\n```\n\n![meg-15](/images/meg-15.webp)", - "lastmodified": "2026-01-11T05:06:16.672861952+05:30", + "lastmodified": "2026-01-11T10:59:29.141183388+05:30", "tags": [ "domain-cs-systems-high-performance-computing" ] }, "/blog/more-greedy-algorithms-kruskal-s-disjoint-set-union": { "title": "More Greedy Algorithms! Kruskal's \u0026 Disjoint Set Union", - "content": "# Greedy Algorithms\nPicking off from [Activity Selection \u0026 Huffman Encoding](/blog/activity-selection-huffman-encoding), the _Greedy_ idea is as follows. At every step, our algorithm picks the **locally optimum** choice in the hope that this choice will also be the **global** optimum. The greedy idea is often the easiest to come up with. Picking the local optimum, in some sense, is often a much easier problem to solve than picking the global minimum. Picking the global minimum often requires seeing _ahead_ to figure out if a global optimum can be reached by picking non-locally optimum choices.\n\nThis often requires recursively solving and there are techniques to speed up computation, but just looking at the local options and picking the best option is much easier in general. The implementation is simple as well. We only need to consider the local choices.\n\nThese properties of greedy algorithms make them quite desirable. An easy to implement an algorithm that is _also_ very fast? That's a great algorithm. Except, for one caveat. As with any algorithm, the first most important thing to prove about it is its **correctness**. This is sadly the case with most greedy algorithms, they fail this test. Picking the local optimum is often **not** the right way to proceed in many algorithms. They might give a _desirable_ result, something _close_ to the global optimum. But not the global optimum itself. And often, it will be possible to generate a counter-case where the greedy solution can be forced to produce a very poor result.\n\nIf a shoddy, quick solution that provides a \"good\" result in most cases is the desired result, then the greedy solution is a great choice! In fact, there are many \"hard\" problems today whose global optimums cannot be computed in feasible time, even with the best-known algorithms for them. In such situations, the best we can hope is to produce a _good_ greedy solution that generates a \"good\" result and hopes that it is _close_ to the global optimum.\n\n## Matroid Theory\n\nMatroid theory gives the sufficient condition for greedy strategies to be applicable to a problem. If we can express a problem in the terms described by Matroid theory, then we can be guaranteed that a greedy solution exists for this problem. From here on forth, when we say \"greedy solutions\", we will refer to greedy solutions that always give the global optimum. Also, note that the reverse is not true. Matroid theory is simply a sufficient condition. If a problem does not fit the matroid theory, it does not mean that it cannot have a greedy solution. 
Dijkstra is one such problem that does not fit the terms described by Matroid theory, yet has a greedy solution.\n\n# The Minimum Spanning Tree (MST) problem\n\nThe MST problem asks the following question, _\"Given some undirected graph G, find a fully connected tree such that it contains every vertex of the graph G and the set of all edges of the tree must be a subset of the edges of G and its total cost is minimized.\"_\n\nMore formally\n\nGiven some undirected graph $G = \\langle V, E\\rangle$ where each edge has some cost $w_e$ associated with it, find a tree $T = \\langle V, E'\\rangle$ where $E' \\subseteq E$ and the total cost of the tree $\\sum^{e \\in E} w_e$ is minimized.\n\nConsider the naïve approach which involves finding **every** possible spanning tree of the graph and finally outputting the one with the least cost. This is not feasible as the number of spanning trees we can generate for some graph $G$ grows exponentially.\n\nThis is where the idea of \"Greedy\" comes in. However, to facilitate proving the correctness of our solution later, let us cover an interesting property about graphs first.\n\n## The Cut Property\n\n### Cut\n\nIn graph theory, we define a cut as a partition that divides a **connected** graph into two disjoint subsets.\n\n![cut-1](/images/cut-1.webp)\n\nNotice that in the above graph, the \"cut\" depicted by the pink line divides our graph into two connected **disjoint** subgraphs. A cut can remove multiple edges, but the end result is two disjoint connected subgraphs.\n\nHere we also define what is known as the **Cut Set**. It is simply the set of all edges in the cut. That is, it is the set of all edges which must be removed to achieve the result of the cut. In the above example, the cut set would be $E_c= \\{e_4\\}$\n\n### The Cut property - Statement\n\nLet's say $X$ is the set of all the edges belonging to the MST of some undirected graph $G$.\n\nNow, pick some subset of nodes $S$ such that none of the edges in $X$ provide a connection between 2 vertices in $S$ and $S^c$. In more intuitive terms, this subset must either all belong to the MST or all not. We can now imagine this as a _cut_ between all the nodes in the MST and all the nodes _not_ included in the MST yet.\n\nThe cut property states that the **minimum weight** edge in the cut set should be included in the minimum spanning tree of the graph $G$. That is, the minimum weight edge that crosses $S$ and $S^c$ must be a part of the MST of the graph.\n\n**Proof:**\n\n- We have some set of edges $X$ which belong to the MST $T$ of our undirected graph $G = \\langle V, E \\rangle$.\n- Let us begin by assuming that we have picked some edge $e$ which is _not_ the minimum weight edge in the cut set\n- If we do so, it will lead to constructing a _different_ MST $T'$ of our graph compared to the MST that we would generate if we included the minimum weight edge $e_{min}$.\n- Now, because $T'$ is an MST, it must be connected and acyclic. Also, by proof of its construction, $e_{min}$ does _not_ belong to $T'$\n- Now, since the graph is a tree, if we _include_ $e_{min}$ to $T'$, notice that there **must** exist some edge(s) in $e' \\in E'$ such that $e_{min}$ forms a cycle with $e'$. [$E'$ is the edge set of $T'$ ]. This must be true as every node is connected, and the graph is acyclic. This implies there is a unique path between any 2 pairs of vertices in the graph. 
If a new edge is added connecting two nodes, a new path is created between them which creates a cycle.\n- Now, by nature of how $T'$ was constructed, $w_{emin} \\lt w_{e'}$. If this was not true we would have picked $e'$ to be $e_{min}$\n- Next, _remove_ edge $e'$. Notice that removing an edge from a cycle does **not** make the graph acyclic. Further, we added and subtracted one edge each. This implies that the number of edges in the graph is still $|V|-1$. This implies that the graph **must** be acyclic as well. That is, our new graph is a **tree.**\n- The cost of our new tree is $W_{T'} - w_{e'} + w_{emin}$ .\n- $W_{T'} - w_{e'} + w_{emin} \\lt W_{T'}$ as $w_{emin} \\lt w_{e'}$. This implies that $T'$ is not the MST as a better tree can be constructed which includes $e_{min}$.\n\n## Kruskal's Algorithm\n\nKruskal's approach isolates all of the nodes in the original graph, forming a forest of single node trees, and then progressively merges these trees, merging any two of all the trees with some edge of the original graph at each iteration. All edges are sorted by weight before the algorithm is run (in non-decreasing order). The unification procedure then begins: choose all edges from first to last (in sorted order), and if the endpoints of the presently selected edge belong to separate subtrees, these subtrees are merged, and the edge is added to the answer. After iterating through all of the edges, we'll find that all of the vertices belong to the same sub-tree, and we'll have the solution.\n\nFurther, note that there may be multiple possible solutions. Kruskal will simply give us _one_ such solution.\n\n### Proof\n\nMost greedy algorithms often have their proof in induction, as it is a methodical and elegant way to approach the reasoning of picking the local optimum to get the global optimum. Notice that at every step of the algorithm, we pick the local optimum. That is, we pick the lowest weight edge that belongs to the cut set of the MST and the graph. Hence, by the cut property, the edge we pick **must** belong to the MST. Doing so repeatedly allows us to pick all $n-1$ edges for the graph.\n\n#### A _small problem_\n\nNotice that sorting takes $O(nlogn)$ time. But however, _checking_ if a chosen edge belongs to the cut set or not takes $O(n)$ for each edge. This is not ideal and pushes the algorithm to the time complexity of $O(n^2)$. However, it is possible to eliminate this cost by introducing a data structure that can perform a unification operation and parent lookup operation in an amortized constant time complexity. This will bring down the total time complexity of Kruskal's to $O(MlogN)$ where $|E| = M, |V| = N$.\n\n# Disjoint Set Union\n\nThe DSU is a data structure that allows for queries of two types.\n\n1. Merge 2 sets\n2. Query the root element of some set $S$\n\nThe idea is to maintain a structure that maintains the sets as nodes in a tree where the root is the primary identifier of any set and a merging operation is simply the unification of two trees.\n\n![dsu-1](/images/dsu-1.webp)\n\nThe DSU is initially initialized as an array like so `dsu[i]=i`. `dsu[i]` essentially contains the parent element of set $i$. If $dsu[i]=i$, then $i$ is the root node. Following is the code for the DSU:\n\n**Querying for parent:**\n\n```cpp\nint parent(int i){\n if(dsu[i]==i) return i;\n else return parent(dsu[i]);\n}\n```\n\nLooking at just this, it is easy to come up with a case for which this algorithm will take $O(n)$ time. 
However, by introducing a small factor in the merging step, it is possible to guarantee $O(logn)$ complexity. Here is the code for the unification of two sets in the DSU.\n\n**Query to merge two sets:**\n\n```cpp\nvoid unify(int a, int b){\n a = parent(a);\n b = parent(b);\n if(rank[a] \u003c rank[b])\n swap(a, b);\n dsu[b] = a;\n if(a!=b \u0026\u0026 rank[a] == rank[b])\n rank[a]++;\n}\n```\n\n## What is rank[x]?\n\nWe can think of `rank[x]` as simply a variable that helps us construct _balanced_ tree structures when we perform the merging operation. Notice that the following statements always hold true for `rank[x]`.\n\n1. For all $x$ in our DSU, $rank[x] \\lt rank[parent(x)]$\n2. Let's say some root node in our DSU has rank $k$. This implies that this root node has at least $2^k$ nodes in its subtree. Why? Notice that to make a tree of rank $k$, we need at least two trees of rank $k-1$. `if(a!=b \u0026\u0026 rank[a] == rank[b])` implies this. We can then extend this by induction to prove this.\n3. From statement 2, it is implied that if there are $n$ elements in the DSU, at most $\\frac{n}{2^k}$ nodes can have rank $k$\n\nThis gives us a balanced tree construction in the unification stage that ensures that our $parent(x)$ queries are no more than $log(n)$ per query.\n\nHowever... _can we do better?_\n\nIt turns out that indeed, we can!\n\n## Path compression\n\nLet's consider the following alternative to our initially proposed `parent(x)` function.\n\n```cpp\nint parent(int i){\n if(dsu[i]==i) return i;\n else return dsu[i] = parent(dsu[i]);\n}\n```\n\nNotice that the only line that has changed is the last line. We simply assign $DSU(i)$ to the parent of $DSU(i)$ at every query operation. This has the effect of shortening the path we must traverse on our journey to find the root node from any child.\n\nSay we break the numbers in intervals of $log^*n$. We get the following split.\n\n$$ [1],[2],[3, 4],[5,\\dots, 2^4],[2^4+1,\\dots,2^{16}],[2^{16}+1,\\dots,2^{65536}],\\dots $$\n\nNotice the following\n\n1. If a node $x$ on the path to the root is of the same rank as the parent, say in the interval $[k+1, \\dots, 2^k]$, then the parent can increase its rank a maximum of $2^k$ times. After these many jumps, it is incremented to the next interval.\n2. If a node $x$ on the path to the root has a rank lesser than the rank of the node's parent, then there can be only $log^*n$ nodes of this type.\n\nThis tells us that $2^k\\times|\\text{nodes with rank} \\gt k| \\leq nlog^*n$\n\nCombined with the unification via rank optimization, it is possible to prove that the amortized bound over all operations can be as low as $O(\\alpha(n))$ where $\\alpha(n)$ is the inverse Ackermann function. This can be reasonably approximated to a constant as the inverse Ackermann function is a function that grows _extremely slowly_. In fact, $\\alpha(n) \\lt 4$ for $n \\lt 10^{600}$.\n# Code!\nBelow are links to `C++` implementations of both the fully equipped Disjoint Set Union data structure and Kruskal's.\n[algorithms-notebook/dsu.cpp at main · akcube/algorithms-notebook](https://github.com/akcube/algorithms-notebook/blob/main/code/graph/dsu.cpp)\n[algorithms-notebook/kruskals.cpp at main · akcube/algorithms-notebook](https://github.com/akcube/algorithms-notebook/blob/main/code/graph/kruskals.cpp)\n# References\nThese notes are old and I did not rigorously horde references back then. If some part of this content is your's or you know where it's from then do reach out to me and I'll update it. \n1. 
Professor [Kannan Srinathan's](https://www.iiit.ac.in/people/faculty/srinathan/) course on Algorithm Analysis \u0026 Design in IIIT-H\n2. [Disjoint Set Union - cp-algorithms](https://cp-algorithms.com/data_structures/disjoint_set_union.html)\n\n---\n\n## Related Reading\n\n- [Set Cover \u0026 Approximation Algorithms](/blog/set-cover-approximation-algorithms)\n- [DP as DAGs, Shortest path on DAGs \u0026 LIS in O(nlogn)](/blog/dp-as-dags-shortest-path-on-dags-lis-in-o-nlogn)\n", - "lastmodified": "2026-01-11T05:06:16.840863646+05:30", + "content": "# Greedy Algorithms\nPicking off from [Activity Selection \u0026 Huffman Encoding](/blog/activity-selection-huffman-encoding), the _Greedy_ idea is as follows. At every step, our algorithm picks the **locally optimum** choice in the hope that this choice will also be the **global** optimum. The greedy idea is often the easiest to come up with. Picking the local optimum, in some sense, is often a much easier problem to solve than picking the global minimum. Picking the global minimum often requires seeing _ahead_ to figure out if a global optimum can be reached by picking non-locally optimum choices.\n\nThis often requires recursively solving and there are techniques to speed up computation, but just looking at the local options and picking the best option is much easier in general. The implementation is simple as well. We only need to consider the local choices.\n\nThese properties of greedy algorithms make them quite desirable. An easy to implement an algorithm that is _also_ very fast? That's a great algorithm. Except, for one caveat. As with any algorithm, the first most important thing to prove about it is its **correctness**. This is sadly the case with most greedy algorithms, they fail this test. Picking the local optimum is often **not** the right way to proceed in many algorithms. They might give a _desirable_ result, something _close_ to the global optimum. But not the global optimum itself. And often, it will be possible to generate a counter-case where the greedy solution can be forced to produce a very poor result.\n\nIf a shoddy, quick solution that provides a \"good\" result in most cases is the desired result, then the greedy solution is a great choice! In fact, there are many \"hard\" problems today whose global optimums cannot be computed in feasible time, even with the best-known algorithms for them. In such situations, the best we can hope is to produce a _good_ greedy solution that generates a \"good\" result and hopes that it is _close_ to the global optimum.\n\n## Matroid Theory\n\nMatroid theory gives the sufficient condition for greedy strategies to be applicable to a problem. If we can express a problem in the terms described by Matroid theory, then we can be guaranteed that a greedy solution exists for this problem. From here on forth, when we say \"greedy solutions\", we will refer to greedy solutions that always give the global optimum. Also, note that the reverse is not true. Matroid theory is simply a sufficient condition. If a problem does not fit the matroid theory, it does not mean that it cannot have a greedy solution. 
The shortest-path problem solved by Dijkstra's algorithm is one such example: it does not fit the terms described by Matroid theory, yet it has a greedy solution.\n\n# The Minimum Spanning Tree (MST) problem\n\nThe MST problem asks the following question, _\"Given some undirected graph G, find a connected tree that contains every vertex of G, whose edge set is a subset of the edges of G, and whose total cost is minimized.\"_\n\nMore formally,\n\nGiven some undirected graph $G = \\langle V, E\\rangle$ where each edge has some cost $w_e$ associated with it, find a tree $T = \\langle V, E'\\rangle$ where $E' \\subseteq E$ and the total cost of the tree $\\sum_{e \\in E'} w_e$ is minimized.\n\nConsider the naïve approach which involves finding **every** possible spanning tree of the graph and finally outputting the one with the least cost. This is not feasible as the number of spanning trees we can generate for some graph $G$ grows exponentially.\n\nThis is where the idea of \"Greedy\" comes in. However, to facilitate proving the correctness of our solution later, let us cover an interesting property about graphs first.\n\n## The Cut Property\n\n### Cut\n\nIn graph theory, we define a cut as a partition that divides a **connected** graph into two disjoint subsets.\n\n![cut-1](/images/cut-1.webp)\n\nNotice that in the above graph, the \"cut\" depicted by the pink line divides our graph into two connected **disjoint** subgraphs. A cut can remove multiple edges, but the end result is two disjoint connected subgraphs.\n\nHere we also define what is known as the **Cut Set**. It is simply the set of all edges in the cut. That is, it is the set of all edges which must be removed to achieve the result of the cut. In the above example, the cut set would be $E_c= \\{e_4\\}$.\n\n### The Cut property - Statement\n\nLet's say $X$ is the set of all the edges belonging to the MST of some undirected graph $G$.\n\nNow, pick some subset of nodes $S$ such that none of the edges in $X$ provide a connection between 2 vertices in $S$ and $S^c$. In more intuitive terms, no edge we have picked so far may cross between $S$ and $S^c$. We can now imagine this as a _cut_ between all the nodes in the MST and all the nodes _not_ included in the MST yet.\n\nThe cut property states that the **minimum weight** edge in the cut set should be included in the minimum spanning tree of the graph $G$. That is, the minimum weight edge that crosses $S$ and $S^c$ must be a part of the MST of the graph.\n\n**Proof:**\n\n- We have some set of edges $X$ which belong to the MST $T$ of our undirected graph $G = \\langle V, E \\rangle$.\n- Let us begin by assuming that we have picked some edge $e$ which is _not_ the minimum weight edge in the cut set.\n- If we do so, it will lead to constructing a _different_ MST $T'$ of our graph compared to the MST that we would generate if we included the minimum weight edge $e_{min}$.\n- Now, because $T'$ is an MST, it must be connected and acyclic. Also, by construction, $e_{min}$ does _not_ belong to $T'$.\n- Now, since the graph is a tree, if we _add_ $e_{min}$ to $T'$, notice that there **must** exist some edge $e' \\in E'$ such that $e_{min}$ forms a cycle with $e'$. [$E'$ is the edge set of $T'$.] This must be true as every node is connected, and the graph is acyclic. This implies there is a unique path between any pair of vertices in the graph. 
If a new edge is added connecting two nodes, a new path is created between them, which creates a cycle.\n- Now, by nature of how $T'$ was constructed, $w_{emin} \\lt w_{e'}$. If this were not true, we would have picked $e'$ to be $e_{min}$.\n- Next, _remove_ edge $e'$. Notice that removing an edge from a cycle does **not** disconnect the graph. Further, we added and subtracted one edge each. This implies that the number of edges in the graph is still $|V|-1$. This implies that the graph **must** be acyclic as well. That is, our new graph is a **tree.**\n- The cost of our new tree is $W_{T'} - w_{e'} + w_{emin}$.\n- $W_{T'} - w_{e'} + w_{emin} \\lt W_{T'}$ as $w_{emin} \\lt w_{e'}$. This implies that $T'$ is not the MST as a better tree can be constructed which includes $e_{min}$.\n\n## Kruskal's Algorithm\n\nKruskal's approach isolates all of the nodes in the original graph, forming a forest of single-node trees, and then progressively merges these trees, joining two of them with some edge of the original graph at each iteration. All edges are sorted by weight (in non-decreasing order) before the algorithm is run. The unification procedure then begins: consider the edges from first to last (in sorted order), and if the endpoints of the presently selected edge belong to separate subtrees, these subtrees are merged, and the edge is added to the answer. After iterating through all of the edges, we'll find that all of the vertices belong to the same sub-tree, and we'll have the solution.\n\nFurther, note that there may be multiple possible solutions. Kruskal will simply give us _one_ such solution.\n\n### Proof\n\nMost greedy algorithms have their proofs by induction, as it is a methodical and elegant way to reason about why picking the local optimum gets us to the global optimum. Notice that at every step of the algorithm, we pick the local optimum. That is, we pick the lowest weight edge that belongs to the cut set separating the component it joins from the rest of the graph. Hence, by the cut property, the edge we pick **must** belong to the MST. Doing so repeatedly allows us to pick all $n-1$ edges of the MST.\n\n#### A _small problem_\n\nNotice that sorting takes $O(n\\log n)$ time. However, _checking_ whether a chosen edge belongs to the cut set takes $O(n)$ for each edge. This is not ideal and pushes the algorithm to a time complexity of $O(n^2)$. However, it is possible to eliminate this cost by introducing a data structure that can perform a unification operation and a parent lookup operation in amortized near-constant time. This will bring down the total time complexity of Kruskal's to $O(M\\log N)$ where $|E| = M, |V| = N$.\n\n# Disjoint Set Union\n\nThe DSU is a data structure that allows for queries of two types.\n\n1. Merge 2 sets\n2. Query the root element of some set $S$\n\nThe idea is to maintain a structure that represents each set as a tree, where the root is the primary identifier of the set and a merging operation is simply the unification of two trees.\n\n![dsu-1](/images/dsu-1.webp)\n\nThe DSU is initialized as an array like so: `dsu[i]=i`. `dsu[i]` essentially contains the parent of element $i$. If $dsu[i]=i$, then $i$ is a root node. Following is the code for the DSU:\n\n**Querying for parent:**\n\n```cpp\nint parent(int i){\n    if(dsu[i]==i) return i;\n    else return parent(dsu[i]);\n}\n```\n\nLooking at just this, it is easy to come up with a case for which this algorithm will take $O(n)$ time. 
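For instance (a small sketch; `link` below is a hypothetical naive merge that simply hangs one root under the other, with no balancing, and `parent` is copied from above):\n\n```cpp\nint dsu[100000];\n\n// parent() exactly as defined above\nint parent(int i){\n    if(dsu[i]==i) return i;\n    else return parent(dsu[i]);\n}\n\n// Hypothetical naive merge: always hang a's root under b's root\nvoid link(int a, int b){\n    dsu[parent(a)] = parent(b);\n}\n\nint main(){\n    int n = 100000;\n    for(int i = 0; i \u003c n; i++) dsu[i] = i; // every element starts as its own set\n    // link(0,1), link(1,2), ... builds the chain 0 -\u003e 1 -\u003e ... -\u003e n-1\n    for(int i = 0; i + 1 \u003c n; i++) link(i, i + 1);\n    parent(0); // walks the entire chain: O(n)\n}\n```\n\n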
However, by introducing a small factor in the merging step, it is possible to guarantee $O(\\log n)$ complexity. Here is the code for the unification of two sets in the DSU.\n\n**Query to merge two sets:**\n\n```cpp\nvoid unify(int a, int b){\n    a = parent(a);\n    b = parent(b);\n    if(rank[a] \u003c rank[b])\n        swap(a, b);\n    dsu[b] = a;\n    if(a!=b \u0026\u0026 rank[a] == rank[b])\n        rank[a]++;\n}\n```\n\n## What is rank[x]?\n\nWe can think of `rank[x]` as simply a variable that helps us construct _balanced_ tree structures when we perform the merging operation. Notice that the following statements always hold true for `rank[x]`.\n\n1. For every non-root $x$ in our DSU, $rank[x] \\lt rank[parent(x)]$\n2. Let's say some root node in our DSU has rank $k$. This implies that this root node has at least $2^k$ nodes in its subtree. Why? Notice that to make a tree of rank $k$, we need at least two trees of rank $k-1$. `if(a!=b \u0026\u0026 rank[a] == rank[b])` implies this. We can then extend this by induction to prove the claim.\n3. From statement 2, it is implied that if there are $n$ elements in the DSU, at most $\\frac{n}{2^k}$ nodes can have rank $k$.\n\nThis gives us a balanced tree construction in the unification stage that ensures that our $parent(x)$ queries take no more than $O(\\log n)$ time per query.\n\nHowever... _can we do better?_\n\nIt turns out that indeed, we can!\n\n## Path compression\n\nLet's consider the following alternative to our initially proposed `parent(x)` function.\n\n```cpp\nint parent(int i){\n    if(dsu[i]==i) return i;\n    else return dsu[i] = parent(dsu[i]);\n}\n```\n\nNotice that the only line that has changed is the last line. We simply re-point `dsu[i]` at the root of its set every time we answer a query for $i$. This has the effect of shortening the path we must traverse on our journey to find the root node from any child.\n\nSay we break the ranks into intervals based on $\\log^*n$. We get the following split.\n\n$$ [1],[2],[3, 4],[5,\\dots, 2^4],[2^4+1,\\dots,2^{16}],[2^{16}+1,\\dots,2^{65536}],\\dots $$\n\nNotice the following\n\n1. If a node $x$ on the path to the root has its rank in the same interval as its parent's rank, say $[k+1, \\dots, 2^k]$, then the parent can increase its rank a maximum of $2^k$ times before it is incremented into the next interval.\n2. If a node $x$ on the path to the root has its rank in a lower interval than its parent's rank, then there can be only $\\log^*n$ nodes of this type.\n\nThis tells us that $2^k\\times|\\text{nodes with rank} \\gt k| \\leq n\\log^*n$\n\nCombined with the union-by-rank optimization, it is possible to prove that the amortized bound over all operations can be as low as $O(\\alpha(n))$ where $\\alpha(n)$ is the inverse Ackermann function. This can be reasonably approximated to a constant as the inverse Ackermann function grows _extremely slowly_. In fact, $\\alpha(n) \\lt 4$ for $n \\lt 10^{600}$.\n# Code!\nBelow are links to `C++` implementations of both the fully equipped Disjoint Set Union data structure and Kruskal's.\n[algorithms-notebook/dsu.cpp at main · akcube/algorithms-notebook](https://github.com/akcube/algorithms-notebook/blob/main/code/graph/dsu.cpp)\n[algorithms-notebook/kruskals.cpp at main · akcube/algorithms-notebook](https://github.com/akcube/algorithms-notebook/blob/main/code/graph/kruskals.cpp)\n# References\nThese notes are old and I did not rigorously hoard references back then. If some part of this content is yours or you know where it's from then do reach out to me and I'll update it. \n1. 
Professor [Kannan Srinathan's](https://www.iiit.ac.in/people/faculty/srinathan/) course on Algorithm Analysis \u0026 Design in IIIT-H\n2. [Disjoint Set Union - cp-algorithms](https://cp-algorithms.com/data_structures/disjoint_set_union.html)", + "lastmodified": "2026-01-11T10:59:50.08806525+05:30", "tags": [ "domain-cs-algorithms-analysis" ] @@ -289,7 +289,7 @@ "/blog/multi-agent-systems-harnessing-collective-intelligence-a-survey": { "title": "Multi-Agent Systems: Harnessing Collective Intelligence - A Survey", "content": "\nIn my previous post ([Reasoning, Acting, and Learning ; A Survey of Single-Agent LLM Patterns](/blog/reasoning-acting-and-learning-a-survey-of-single-agent-llm-patterns)), I explored strategies to improve the performance of a single agent, using structures like Tree-of-Thoughts to explore complex solution spaces. However, most of these patterns are already internally implemented by frontier labs and there is not as significant a gain to expect from implementing extra compute intensive patterns on top manually. \n\nThis brings us to **Multi-Agent Systems (MAS)**. Instead of a single monolithic agent, we can employ multiple specialized agents collaborating or debating or orchestrating to improve accuracy. Gemini tends to be good at certain tasks, Anthropic in others, etc. We can effectively utilize these ideas in practice using multi-agent architectures. Karpathy's [LLM Council](https://x.com/karpathy/status/1992381094667411768?s=20) is a great example! For more long horizon orchestration, there are two popular architectures which are proposed today, both by Anthropic \u0026 OpenAI. \n# Architectures (Manager vs. Network)\n[Multi-agent Systems \\- LangChain](https://langchain-ai.github.io/langgraph/concepts/multi_agent/) | [A practical guide to building agents \\- OpenA](https://cdn.openai.com/business-guides-and-resources/a-practical-guide-to-building-agents.pdf) | [Building Effective Agents \\- Anthropic](https://www.anthropic.com/engineering/building-effective-agents)\n## Building Multi-Agent Systems\n\n![pasted-image-20251204063547](/images/pasted-image-20251204063547.webp)\n\n\u003e “*Regardless of the orchestration pattern, the same principles apply: keep components flexible, composable, and driven by clear, well-structured prompts.”*\n\nWe primarily care about two architectures here. The “Manager / Supervisor” model and the “Network” model. Hierarchical / Custom come under them for the most part.\n### Manager\nThe supervisor architecture employs a central agent, to manage and direct the workflow of other specialized agents. For example, the supervisor agent receives the initial incident report and then delegates specific diagnostic tasks, such as log analysis, metric monitoring, and configuration checking, to the appropriate specialized agents. You can also involve hierarchy here which organizes agents into a tree-like structure, with supervisor agents at different levels of the hierarchy overseeing groups of subordinate agents.\n\n**Example:** A Head Doctor agent gets an alert and asks the Metric Agent for CPU data and the Log Agent for errors; they report back *only* to the Head Doctor.\n### Network\nIn a network architecture, multiple agents interact with each other as peers who, within the system, can communicate directly with every other agent. This many-to-many communication pattern is well-suited for problems where a clear hierarchy of agents or a specific sequence of agent calls is not predefined. 
\n\n**Example:** The Database Metric Agent detects high disk latency and *directly* notifies/triggers the Cloud Metric Agent to check underlying disk health.\n# Multi-Agent Debate (MAD) Strategies\nApart from the above defined ‘workflow’ patterns, there has also been a significant amount of exploratory research in structured interactions between different agents (as equals, as debaters in front of a ‘judge’, etc.) for improving reasoning. \n## MAD (Persona / Tit-for-Tat) \u0026 Degeneration-of-Thought\n[Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate [Tsinghua, Shanghai Jiao Tong, Tencent AI Lab]](https://aclanthology.org/2024.emnlp-main.992.pdf)\n\nOne new concept this paper introduces is that of **Degeneration-of-Thought (DoT).** It’s the idea that self-reflection mechanisms in LLMs often fail because once an LLM-based agent has established confidence in its answers, it is unable to generate novel thoughts later through self-reflection even if the initial stance is correct. To address this problem, they propose a **Multi-Agent Debate (MAD)** framework specifically designed to encourage divergent thinking, contradiction and debate. The core workflow is as follows:\n### Debate Setup\nThe proposed environment has two debaters, one playing the affirmative role (a ‘devil’ proposing an initial, likely intuitive but flawed solution) and the other plays the negative role (an ‘angel’ who disagrees and corrects the initial solution). These agents are prompted using a *tit-for-tat* meta prompt like so:\n\n\u003e *You are a debater. Hello and welcome to the debate competition. It’s not necessary to fully agree with each other’s perspectives, as our objective is to find the correct answer. The debate topic is stated as follows: .*\n\nNote that this is a key ‘hyperparameter’ for this persona-based MAD framework. The more the prompt encourages contradiction, the more it influences outcomes. For example, in benchmarks meant to challenge counter-intuitive thinking, this technique works better. But in other ‘simpler’ benchmarks, this technique actually hurts performance. \n### Judge\nA third agent acts as a judge or moderator. It monitors the debate and has two modes:\n\n1. **Discriminative***:* Decides if a satisfactory solution has been reached after a round, allowing for an early break to end the debate early. \n2. **Extractive:** If the debate reaches a limit without a clear resolution, the judge extracts the final answer based on the history.\n\nHere’s an example of a judge prompt:\n\n\u003e*You are a moderator. There will be two debaters involved in a debate competition. They will present their answers and discuss their perspectives on them. At the end of each round, you will evaluate both sides’ answers and decide which one is correct.*\t\n## MAD (Society of Minds)\n[Improving Factuality and Reasoning in Language Models through Multiagent Debate](https://arxiv.org/pdf/2305.14325)\n\nThis paper introduces another MAD strategy, inspired by the book Society of Mind, by [Marvin Minsky](https://en.wikipedia.org/wiki/Marvin_Minsky) (Turing Awardee, Co-founder MIT AI Lab). In this framework, they set up multiple instances of the same (or different) LLMs to act as agents engaging in a collaborative debate to refine the answer to a proposed problem over multiple rounds of debate. The process is as follows:\n### Debate\n1. Each debater agent composes an initial response to the proposed question. \n2. 
The agents are then shown the responses of the other agents and prompted with something like:\n\n \u003e “*These are the solutions to the problem from other agents: [other answers] Using the opinion of other agents as additional advice, can you give an updated response . . .”*\n\n Or\n\n \u003e*\" These are the solutions to the problem from other agents: [other answers] Based off the opinion of other agents, can you give an updated response . . .\"*\n\n Note that in the first one, the LLM knows its own response and hence is more likely to be stubborn about its own response. In experiments, they found the first version led to longer debates and better answers. \n3. The agents converge on a single, agreed-upon answer or they hit a limit on the number of iterations and we pick the majority consensus answer. \n\nThe idea of ‘debates’ between individual agents is orthogonal to other work on improving individual agent performance. So we can still use ideas like few-shot learning, CoT / ToT / GoT, Medprompt, Reflexion, etc. to improve single-agent performance and stack this society-of-minds model on top to improve performance. They also showed that the debate doesn't just amplify an initially correct answer present among the agents. The paper shows cases where all agents initially provide incorrect answers but arrive at the correct solution through the process of mutual critique and refinement during the debate.\n\nHowever, as we’ll see in the paper comparing MAD strategies, medprompt actually beats Society-of-minds more often than not. However, I believe this similar framework can boost information sharing between individual expert agents in a framework where individual expert agents are trying to correlate information across domains and diagnose incidents. \n## MAD Is Not Always Better: Medprompt\nWhile MAD has shown promising results, it’s not always better and performance can vary significantly depending on ‘hyperparameter’ (prompt) tuning and choice of dataset. \n### MEDPROMPT: Generalist Foundational Models Outperforming Special-Purpose Tuning via Prompt Engineering\n[Can Generalist Foundation Models Outcompete Special-Purpose Tuning? Case Study in Medicine [Microsoft]](https://arxiv.org/pdf/2311.16452)\n\nThis paper attempts to prove that the ‘old’ notion of requiring domain-specific fine-tuning to achieve SOTA benchmarks is no longer necessary, and that newer models (like the then SOTA GPT-4) can match or surpass SOTA benchmark performance purely through sophisticated prompt engineering and **in-context learning (ICL)** techniques. They then designed a prompt-engineered setup on top of GPT-4, which achieved SOTA results on several medical benchmarks, **Medprompt**. There’s three core components to it:\n\n1. **Self-Generated Chain of Thought (CoT):** Use a simple prompt template to get the GPT-4 models to generate CoT examples for future few-shot example ICL training. Here’s an example prompt:\n \n ![pasted-image-20251204064029](/images/pasted-image-20251204064029.webp)\n\n2. **Dynamic Few-Shot Selection:** During test-time, query the vector database by generating the same embedding for the unseen test questions, use k-means or a similar search model to identify the most similar few-shot CoT examples to provide as ICL examples for the model. This is essentially dynamically generating the few-shot examples for the model’s prompt. This entire process is almost completely automated. \n\n3. 
**Choice Shuffling Ensemble:** They noticed the models tend to have some bias towards picking options in certain positions. So they used a classic CoT-SC type ensemble approach by asking the model to repeat the CoT prediction process *m* times (with temperature \\\u003e 0\\) and choose the final answer by scoring the aggregate. Additionally, shuffle the order of the options for each run to further improve the randomness (apart from just temperature \\\u003e 0).\n \n ![pasted-image-20251204064106](/images/pasted-image-20251204064106.webp)\n\n\tThe results were pretty convincing\n\t\n\t![pasted-image-20251204064130](/images/pasted-image-20251204064130.webp)\n\n# Comparing MAD Strategies\n[Should we be going MAD? A Look at Multi-Agent Debate Strategies for LLMs](https://arxiv.org/pdf/2311.17371)\n\nThis paper benchmarks several Multi-Agent Debate (MAD) strategies like Society of Minds, Multi-Persona, etc. against other single-agent prompting techniques like self-consistency, ensemble refinement, and Medprompt across a bunch of Q\\\u0026A datasets (medical and reasoning). The main results from the paper are that:\n## MAD isn’t always better\n\n![pasted-image-20251204064236](/images/pasted-image-20251204064236.webp)\n\nNote the top-right corner X in the first diagram. Multi-persona was actually able to score the highest on that particular benchmark simply by tuning the degree to which the angel was asked to disagree with the devil. In short, MAD protocols seem a lot more sensitive to their ‘hyperparameters’ as compared to single-agent strategies. There is also some bias of LLMs towards their own responses in a mult-model multi-agent setup. \n## Tuning Can Greatly Affect Results\nThey introduced a concept of *agreement modulation*. They allowed ‘tuning’ the verbal prompt given to the multi-persona model by introducing a percentage into the prompt. Example: \n\n\u003e *\"You should agree with the other agents 90% of the time.\"* \n\nThis actually made Multi-Persona go from being the lowest scoring strategy to the highest scoring strategy in that benchmark. But it also negatively affected its performance in a different benchmark. In benchmarks like CIAR, which was meant to be more counter-intuitive, higher percentages helped. And in more ‘straightforward’ benchmarks, lower numbers helped. The tuning factor was benchmark dependent. \n\nThe ‘winner’ is likely benchmark and use-case dependent. No single strategy dominated all the benchmarks. It’s important to have our own benchmark for our case and be able to experiment with it.\n\n---\nFinally, implementing complex LLM prompting / interaction patterns can get relatively complicated if done wrong. Classic software programming wasn't quite built for handling probabilistic outputs and large dumps of prompt text in code. In this context, I find [DSPy](https://docs.databricks.com/aws/en/generative-ai/dspy/) to be one of the best practical ways to implement agentic code. With DSPy you can almost quite go back to a *\"Program, don't prompt.\"* mode of operation. If you found this interesting, check out my blog on [Building a Type-Safe Tool Framework for LLMs in Scala](/blog/building-a-type-safe-tool-framework-for-llms-in-scala) for details on implementing such patterns yourself from scratch! 
", - "lastmodified": "2026-01-11T05:06:17.817873501+05:30", + "lastmodified": "2026-01-11T10:59:30.119963496+05:30", "tags": [ "domain-cs-ai-ml-llms" ] @@ -297,7 +297,7 @@ "/blog/network-flow-algorithms-ford-fulkerson": { "title": "Network-Flow Algorithms, Ford Fulkerson", "content": "Let's learn another really cool tool that can be used to solve optimization problems, network flows!\n\n# What is the network flow graph?\n\nA network flow graph $G = \\langle V, E \\rangle$ is nothing but a directed graph, with 2 distinctive features.\n\n1. It has 2 distinct vertices $S$ and $T$ marked. $S$ is the **source** vertex and $T$ is the **sink** vertex. These vertices are distinct.\n2. Every edge $e \\in E$ has some capacity $c_i$ associated with it. It is implicitly assumed that $\\forall e \\in E, c_i = 0$.\n\nAn example of one such graph is given below\n\n![nf-1](/images/nf-1.webp)\n\nHere, $S = 1$ and $T = 6$. We will use this same example when discussing further ideas.\n\n## The problem\n\nThe problem that network flow attempts to solve is pretty simple. It asks the questions, _\"Given an infinite amount of \"flow\" at source $S$, what is the maximum amount of \"flow\" you can push through the network at any point of time and reach sink $T$?\"_\n\nAn intuitive way to think about this is to pretend that the source $S$ is an infinite source of water and that the capacities on each edge are sort of like the maximum amount of water that can flow through each of the \"pipes\" or edges. If we think of edges in terms of pipes, the question basically asks how much water we can push through the pipes so that the maximum amount of water reaches sink $T$ per unit time.\n\nWhy is this helpful? Think about traffic scheduling, for example, we could replace water with traffic and the problem would be similar to scheduling traffic through a busy set of streets. Replace it with goods flowing in a warehouse system and we begin to see how powerful this model of the optimization problem is.\n\nTo define this more formally, the only primary constraints are as follows:\n\n1. The flow through any edge **must** be $\\leq$ the capacity of that edge.\n2. The flow entering and leaving any given vertex (except $S$ or $T$) must be the same. (Pretty similar to Kirchhoff's current laws.)\n\nHere is an example of a valid network flow assignment:\n\n![nf-2](/images/nf-2.webp)\n\nWe can manually go over every vertex and ensure that the two constraints are obeyed everywhere. Further, notice that the flow of this network $= 3$. (Just sum up the flow going to $T$, i.e., the edges incident on $T$)\n\nAn interesting observation is that we appear to have \"cyclic flow\" within our graph with this particular assignment of flow. Eliminating this does not change the total flow going to $T$, so this is pretty much the same assignment without that cyclic flow within the network:\n\n![nf-3](/images/nf-3.webp)\n\nBut what about the max flow assignment for this network? is 3 the maximum flow we can achieve? Or can we do better? After a bit of fiddling around, we can notice that we can do better by pushing more flow on the bottom half of this network instead of sending 1 flow up to the top from node 3. Fixing this ends up giving this network:\n\n![nf-4](/images/nf-4.webp)\n\nIt can be proven that we cannot do better than this for this particular network. The max flow of this network is 4.\n\nHopefully, the above examples have managed to convey the true difficulty that flow algorithms face. 
Solving network flow is **not** easy, primarily because from any given state, the optimal state might not be reached by just monotonically increasing flow through edges. We might have to reduce the flow through some edges to increase the flow in others. Changing the flow amount through any one edge ends up affecting the entire network. So we need to find ways to iteratively increase the flow in our network, **BUT** it is not a monotonic increase. So we must sometimes backtrack and reduce flow in some edges. But perhaps by focusing on monotonically increasing the _max flow_ of our network, we might be able to figure out a proper algorithm that incorporates this backtracking. This is the primary goal we keep in mind when trying to solve max flow.\n\n## Defining the problem formally\n\n### Some useful notation\n\nFor the remainder of this article, we will use \"implicit summation\" notation. All sets will be named by capital letters, and whenever we use sets in the place of elements, like for example $f(s, V)$, this means the summation of flow $\\sum_{v\\in V} f(s, v)$. We use this notation to simplify the math we will be writing.\n\n### Formal definition\n\n**Flow:** We define the _flow_ of a network $G$ as a function $f:V \\times V \\to R$ satisfying the following 3 constraints,\n\n1. $\\forall u,v \\in V, f(u, v) \\leq c(u,v)$. That is, flow through any edge must be at most the capacity of that edge.\n2. $\\forall u\\in V - \\{ s, t \\}, f(u, V) = \\sum_{v\\in V}f(u,v) = 0$. That is, the net flow entering and exiting every node except the source and sink is 0. It is conserved.\n3. $\\forall u, v \\in V, f(u,v) = -f(v,u)$. That is, given any two vertices $u$ and $v$ on the network, the flow going from $u$ to $v$ should be the negation of the flow from $v$ to $u$. This property is called _skew-symmetry._\n\n### Defining flow\n\nLet us denote the value of the flow through a network by $|f|$. Then we define this quantity as\n\n$$ |f| = f(s, V) $$\n\nIntuitively, this is essentially all the flow (sum) that is going from the source node to every other vertex on the graph. It is important to note that the summation is not of all positive terms; if there is flow going from some vertex $v$ to $s$, then this term would be negative (skew-symmetry).\n\nUsing this, it is possible to prove that $|f| = f(s, V) = f(V, t)$. That is, it is all the flow going to vertex $t$ and is more \"intuitive\" to understand as the definition of flow. But before we can prove this, let's go over some key properties of flow-networks which we can derive from the constraints.\n\n**Properties:**\n\n1. $f(X, X) = 0, X \\subset V$. This is derivable from skew-symmetry.\n2. $f(X,Y) = -f(Y,X), X,Y \\subset V$. Direct consequence of skew-symmetry.\n3. $f(X \\cup Y, Z) = f(X,Z)+f(Y,Z) \\text{ if } X\\cap Y = \\phi$. If the intersection of $X$ and $Y$ is null, then we can safely add the two flows separately as there is no risk of double counting.\n\nNow that we know these properties, let's prove it!\n\n$$ |f| = f(s, V) $$\n\nLet's start from the definition of our flow amount $|f|$. Using property 3, we can transform it to mean\n\n$$\n\\begin{aligned}\nf(V, V) = f(s \\cup (V-s), V) = f(s, V) + f(V-s, V) \\\\ \\implies |f| = f(s,V) = f(V,V)-f(V-s, V) \\\\ \\implies |f| = 0 - f(V-s, V) \\\\\n\\end{aligned}\n$$\n\nThis is intuitively just saying that the flow from $s \\to V$ is the negative of the flow from all other vertices to all vertices. 
This is because the net flow at non-source-sink vertices is 0, so everything must eventually flow out at the sink. Now, notice that we want to try to prove that $|f| = f(V,t)$. To do this, we will attempt to isolate $t$ from the above equation using the 3rd property again.\n\n$$ \n\\begin{aligned}\nf(V-s, V) = f(t \\cup (V-s-t), V) = f(t, V) + f(V-s-t, V) \\\\ \\implies |f| = -f(t, V) - f(V-s-t, V) \\\\ \\implies |f| = f(V,t) + 0 \\\\ \\implies |f| = f(V,t) \n\\end{aligned}\n$$\n\nThe tricky part here is understanding why $f(V-s-t,V) = 0$. This is because of flow conservation. Flipping it around, we get $f(V, V-s-t)$. By the 2nd constraint imposed on our flow network, this quantity is constrained to be 0 always. Hence we have now proved that\n\n$$ |f| = f(s, V) = f(V,t) $$\n\n# Ford-Fulkerson\n\n## Residual networks\n\nWe denote the residual network of a flow network $G$ by $G_R(V_R, E_R)$.\n\nThe only constraints on the edges are that all the edges have strictly positive residual capacities. 0 means the edge is deleted. And, if $(u,v) \\notin E$, we take $c(u,v) = 0$ and, by skew-symmetry, $f(u,v) = -f(v,u)$.\n\nEssentially, $\\forall e\\in E_R, c_{Re} = c_e-f_e$. The residual edges represent edges that \"could\" admit more flow if required. Here $c_e$ is the capacity of the edge in the original flow graph and $f_e$ is the flow passing through the edge in the original flow network.\n\nThe idea behind these edges becomes more apparent when we actually construct the network.\n\nConsider the old suboptimal max flow network we had.\n\n![nf-5](/images/nf-5.webp)\n\nWe'll begin by constructing the residual graph for this network. Remember, for each edge in the network, we add an edge with capacity $c_e - f_e$ as long as this quantity is $\\gt 0$. And now, to respect the last constraint, we must ensure that we add a back-edge in the opposite direction with value = $f_e$ as long as $f_e \\gt 0$. This is the **key** idea behind what the residual network hopes to accomplish. Recall how we said one of the reasons the flow problem was very difficult was that it is hard to account for having to _reduce_ flow in some edges to increase max flow? This residual network is what helps the algorithm get around this problem. Here is the residual network:\n\n![nf-6](/images/nf-6.webp)\n\nNow, the Ford Fulkerson algorithm becomes extremely simple. It simply says: use any graph traversal algorithm such as BFS or DFS to find _an augmenting path_ in this graph, and apply it to the original graph.\n\nWe formally define an augmenting path as a path from $s_R$ to $t_R$ in the residual graph. Recall that every edge in the residual graph **must** have a positive value. If such a path is found, then it **must** be possible to increment the value of max flow in the network by **at least** 1. This is because the residual graph is essentially an entire encoding of every possible increase/decrease in flow that we can perform on the original graph. The presence of a path with all edges $\\gt 0$ implies I can increase flow from $s_R$ to $t_R$ by at least 1.\n\nIf this is understood, the Ford Fulkerson algorithm becomes pretty simple.\n\n### Pseudocode\n\n1. Construct the residual graph for some given flow network $G$.\n2. While we can find an augmenting path in the residual graph:\n    1. Get the `min` of the edges that constitute this path and increment the flow in the original graph by this value along the edges in the residual graph. If it is a direct edge, increment by `min`. If it is a back-edge, decrease flow by `min`.\n    2. Reconstruct the residual graph.\n    3. Repeat. If no more augmenting paths are found, we have achieved max flow.\n\n
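To make the loop concrete, here is a minimal sketch of this in C++ (an illustrative toy, not a tuned implementation: it stores residual capacities in an adjacency matrix and finds augmenting paths with BFS, i.e., the Edmonds-Karp variant discussed below; the `max_flow` name and the matrix representation are my own choices):\n\n```cpp\n#include \u003cbits/stdc++.h\u003e\nusing namespace std;\n\n// cap[u][v] = residual capacity of edge (u, v). Back-edges start at 0\n// and grow as flow is pushed forward, encoding the \"undo\" operations.\nint max_flow(vector\u003cvector\u003cint\u003e\u003e cap, int s, int t){\n    int n = (int) cap.size(), flow = 0;\n    while(true){\n        // Find an augmenting path in the residual graph using BFS\n        vector\u003cint\u003e par(n, -1);\n        par[s] = s;\n        queue\u003cint\u003e q;\n        q.push(s);\n        while(!q.empty() \u0026\u0026 par[t] == -1){\n            int u = q.front(); q.pop();\n            for(int v = 0; v \u003c n; v++)\n                if(par[v] == -1 \u0026\u0026 cap[u][v] \u003e 0){ par[v] = u; q.push(v); }\n        }\n        if(par[t] == -1) break; // no augmenting path =\u003e max flow reached\n        // Bottleneck = min residual capacity along the path we found\n        int aug = INT_MAX;\n        for(int v = t; v != s; v = par[v]) aug = min(aug, cap[par[v]][v]);\n        // Apply the path: shrink forward residuals, grow the back-edges\n        for(int v = t; v != s; v = par[v]){\n            cap[par[v]][v] -= aug;\n            cap[v][par[v]] += aug;\n        }\n        flow += aug;\n    }\n    return flow;\n}\n```\n\n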
## Proof\n\nWhy does this algorithm work optimally all the time? To prove the correctness of this algorithm, we will first prove the correctness of the Max-flow, Min-cut theorem.\n\n## Max-Flow, Min-Cut\n\nThe theorem states that the following statements are equivalent.\n\n1. $|f| = c(S, T)$ for some cut $(S, T)$.\n2. $f$ is the maximum flow.\n3. $f$ admits no augmenting paths.\n\nWe will prove this theorem by proving $1 \\implies 2 \\implies 3 \\implies 1$.\n\n### Proving $1 \\implies 2$\n\nWe know that $|f| \\leq c(s, t)$ for any cut $(s, t)$. Hence, if $|f| = c(s, t)$ then $f$ must be the maximum flow through this network.\n\n### Proving $2 \\implies 3$\n\nWe can prove this by contradiction. Assume that there existed some augmenting path. This would imply that we could increase the max flow by some amount, hence contradicting the fact that $f$ is the maximum flow. Hence $f$ cannot admit any augmenting paths.\n\n### Proving $3 \\implies 1$\n\nLet us assume that $f$ admits no augmenting paths. That means we have no path from $s$ to $t$ in $G_R$. We now define the set $S = \\{v\\in V : \\text{ there exists a path in } G_R \\text{ from } s \\to v\\}$. The other set is defined as $T = V-S$. Trivially, $s \\in S$ and $t \\in T$, as we cannot reach $t$ from $s$. Therefore, these two sets form a cut $(S, T)$.\n\nNow, we pick two vertices $u \\in S$ and $v \\in T$. By definition, there is a path from $s$ to $u$, but no path from $u \\to v$. Otherwise $v \\in S$, which is false.\n\nNow, $c_R(u,v)$ **must** be zero. $c_R(u,v)$ is by definition always non-negative, and if $c_R(u, v) \\gt 0$ it would imply that $v$ is reachable from $s$, i.e., $v \\in S$. This is a contradiction. Therefore, $c_R(u,v) = 0$.\n\nNow, we know that $c_R(u, v) = c(u,v) - f(u,v) \\implies f(u,v) = c(u,v)$\n\nFor our arbitrary choices of $u \\in S$ and $v \\in T$, we arrive at the conclusion that $f(S, T) = c(S, T)$. Since $|f| = f(S, T)$ across any cut, this gives us $|f| = c(S, T)$.\n\nSince $1 \\implies 2 \\implies 3 \\implies 1$, the Min-Cut Max Flow theorem is true.\n\n## Proving Ford-Fulkerson\n\nNow, the Ford Fulkerson algorithm terminates when there are no longer any more augmenting paths in $G_R$. According to the Maxflow MinCut theorem, this is equivalent to our network reaching maximum flow. Hence we have proved the correctness of our algorithm.\n\n## Complexity\n\nIt is easy to see that for integral capacities and flow constraints, finding an augmenting path implies increasing the value of maximum flow by **at least** one. This means that the algorithm will increment the flow in the network by at least 1 per iteration. Hence it will terminate and we can bound the complexity by $O((V+E)U)$ where $V+E$ is the complexity of the BFS and $U$ is the max flow.\n\nFor non-integer capacities, the complexity is unbounded.\n\nThis... isn't great. Because our complexity depends on the maxflow of the graph. If we construct a graph such that at each iteration we hit the worst case and the algorithm increases flow in the network by only one unit, and the capacity on the edges is large, we might end up doing millions of iterations for a small graph.\n\n## Edmonds-Karp\n\nEdmonds and Karp were the first to put a polynomial bound on this algorithm. They noticed that BFS implementations of Ford Fulkerson's outperformed DFS versions a lot. Upon analyzing these implementations, they were able to reduce it to the following bound: $O(VE^2)$. 
**The coolest part about this is that this is true even for _irrational_ capacities!**\n\nThe intuition is that every time we find an augmenting path, one of the edges becomes saturated, and the distance from that edge to $s$ will be longer if it appears again in a later augmenting path. And the length of a simple path is bounded by $V$.\n\n## Dinic's\n\nDinic's algorithm solves the maximum flow problem in $O(V^2E)$.\n\n## More recent research\n\nThe asymptotically fastest algorithm, found in 2011, runs in $O(VE\\log_{\\frac{E}{V\\log V}}V)$ time.\n\nAnd more recently, Orlin's algorithm solves the problem in $O(VE)$ for $E \\leq O(V^{\\frac{16}{15}-\\epsilon})$ while KRT (King, Rao and Tarjan)'s does it in $O(VE)$ for $E \\gt V^{1+\\epsilon}$.\n\nThere's a lot of research going on in this field and we know of no proven lower bound for this problem. Who knows, we might be able to get even faster! Techniques like push-relabel with a greedy optimization have managed to achieve a running time of $O(V^3)$. This modification was proposed by Cheriyan and Maheshwari in 1989.\n# References\nThese notes are old and I did not rigorously hoard references back then. If some part of this content is yours or you know where it's from then do reach out to me and I'll update it. \n1. [Incremental Improvement: Max Flow, Min Cut - MIT OCW 6.046J - Srinivas Devadas](https://youtu.be/VYZGlgzr_As?si=iPvwN-x_4ac5yQHG)\n", - "lastmodified": "2026-01-11T05:06:18.321878587+05:30", + "lastmodified": "2026-01-11T10:59:30.62797044+05:30", "tags": [ "domain-cs-algorithms-analysis" ] }, "/blog/new-and-delete": { "title": "`New` and `Delete`", "content": "# `new` \u0026 `delete`\n\nBefore we go ahead and figure out what smart pointers are, let's take a moment to look back at how we handle heap allocated memory in C++. Back in C, we had the `malloc` and `free` functions to handle heap memory. `new` and `delete` are C++ operators that try to do the same task, but cleaner. \n\n```cpp\nint *b = (int*) malloc(sizeof(int)); // Old C-style heap allocation\nint *c = new int; // New C++ heap allocation\n```\n\n`malloc` returns `void*`. You will notice that back in C, we did not have to explicitly cast a `void*` to `int*`; in C++, however, this implicit pointer conversion is a compile error. `int *b = malloc(sizeof(int))` will throw:\n\u003e `error: invalid conversion from ‘void*’ to ‘int*’ [-fpermissive]`\n\nIn C++, `new` and `delete` are **operators.** They are **not** functions like their C counterparts. This means that just like any other operator, they can be overloaded to do pretty much anything. The behaviour of `new` \u0026 `delete` is therefore dependent on the C++ library and compiler that you are using. \n\nHowever, most implementations simply have `new` call the underlying `malloc` function, and similarly have `delete` call `free`. But one key additional task that `new` does is it **will** also **call the constructor** for the object it is creating. 
And `delete` the destructor.\n\nWe'll use the following `Entity` class as a toy example when playing around with `new` / `delete`.\n\n```cpp\n#include \u003ciostream\u003e\n#include \u003cstring\u003e\n\nclass Entity{\npublic:\n\tEntity(){\n\t\tstd::cout \u003c\u003c \"Constructor!\" \u003c\u003c std::endl; }\n\tEntity(const std::string \u0026name) : name(name) {\n\t\tstd::cout \u003c\u003c \"P-Constructor!\" \u003c\u003c std::endl; }\n\t~Entity() {\n\t\tstd::cout \u003c\u003c \"Destructor!\" \u003c\u003c std::endl; }\nprivate:\n\tstd::string name;\n};\n```\n\nHere are a few ways to use `new` in C++.\n\n```cpp\nint main(void){\n\tEntity *obj = new Entity; // Output: Constructor!\n\tEntity *same_thing = new Entity(); // Output: Constructor!\n\tEntity *pobj = new Entity(\"abcd\"); // Output: P-Constructor!, name: abcd\n\tEntity *obj_arr = new Entity[5];\n\t/**\n\t * Output:\n\t * Constructor!\n\t * Constructor!\n\t * Constructor!\n\t * Constructor!\n\t * Constructor!\n\t */\n\t// This is called \"placement new\"\n\tstd::cout \u003c\u003c sizeof(Entity) \u003c\u003c std::endl; // Output: 32\n\tint *space = new int[16]; // 64 bytes: enough room for two 32-byte Entity objects\n\tEntity *placement_new = new(space) Entity[2];\n\t/*\n\t * Output:\n\t * Constructor!\n\t * Constructor!\n\t */\n}\n```\n\nSo the first 4 examples are the basic ones. You'll notice that `new` always makes it a point to call the constructor of the class we're allocating memory for. This is an attempt to work around the uninitialized memory problem we have with `malloc`. Links back to [RAII - Resource Acquisition Is Initialization](/blog/raii-resource-acquisition-is-initialization) principles as well. We **don't** want uninitialized memory. We can use `new` to also initialize an object with its parameterized constructor instead of the default one. However, we can't parameter-initialize an array of them :) \n\nGetting a pointer to an array of `Entity` objects is also quite simple. `obj_arr` is a pointer to a contiguous chunk of memory holding an array of 5 `Entity` objects.\n\nThe interesting `new` use-case here is the \"placement new\". Here, `new` isn't actually allocating a block of memory. It simply uses the previously allocated memory for `space` and just initializes `Entity` in that memory by calling its constructor. \n\nFor `delete`, it's pretty similar. \n\n```cpp\ndelete obj; // Output: Destructor!\ndelete[] obj_arr;\n/**\n * Output:\n * Destructor!\n * Destructor!\n * Destructor!\n * Destructor!\n * Destructor!\n*/\n// Note! The following also compiles\ndelete obj_arr; // Output: Destructor!\n```\n\nYou'll notice the last way to call `delete` just calls the destructor once; mismatching `new[]` with a plain `delete` like this is, strictly speaking, undefined behavior. So when de-allocating a pointer to an array of elements in memory it's important to always remember to use `delete[]` instead of `delete` to properly clean this memory. \n\n### Why `new` \u0026 `delete`?\n\nOne, it's a lot cleaner than the C-style way. Two, it is a paradigm that avoids the uninitialized memory issue we can have when using the C-style `malloc` and `free` functions. `new` and `delete` prevent this from ever happening by **always** calling the constructor and destructor. \n\nHowever, a problem they still don't solve is the problem of memory leaks and dangling pointers.
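\n\nTo see what `new` \u0026 `delete` still can't protect us from, here's a contrived sketch (reusing the `Entity` class from above; the function name and the shape of the bug are mine, purely for illustration):\n\n```cpp\nvoid leak_and_dangle() {\n\tEntity *e = new Entity; // Output: Constructor!\n\te = new Entity;         // Output: Constructor!\n\t// The first Entity is now unreachable and never freed: a memory leak.\n\tEntity *alias = e;\n\tdelete e;               // Output: Destructor! The memory is released...\n\t// ...but alias still holds the old address: a dangling pointer.\n\t// Dereferencing it (or deleting it again) is undefined behavior.\n}\n```\n\n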
To solve this, we have the idea of [Smart Pointers](/blog/smart-pointers).\n", - "lastmodified": "2026-01-11T05:06:36.79406577+05:30", + "lastmodified": "2026-01-11T10:59:48.871233035+05:30", "tags": [ "domain-cs-languages-cpp" ] }, "/blog/ottertune-automatic-database-management-system-tuning-through-large-scale-machine-learning": { "title": "OtterTune - Automatic Database Management System Tuning Through Large-Scale Machine Learning", "content": "# Abstract\nTuning a MySQL node (well) can be a challenging task for most DBAs, primarily because the variables that control the performance of the node are all interdependent. A simple version of this problem might be dealing with the hardware prefetcher in a processor. For strided accesses prefetching might do a great job, but it might be counterproductive in a workload that has completely random accesses, as it will fetch several useless lines into the cache, causing some thrashing. Add dependent variables like fixing a good \"cache size\" for a workload and we're immediately forced to try every combination of values to pick the \"optimal\" values for a workload, since we cannot solve them independently.\n\nIn one of his talks, Andy mentions how whenever a developer working on MySQL / Postgres implements something new where they have to decide on a \"good\" size for a buffer or similar, they just expose it as a variable that can be configured. This problem is not restricted to database systems; you'll find similar themes in any program that has to crunch through a lot of data. For example, genomic pipelines which have to crunch through gigabytes of data using heuristic algorithms (which have extreme random-access patterns) also provide a bunch of variables \u0026 caching mechanisms you can tune to improve performance. I believe the automation and ideas described in this paper can be used to tune a lot of things beyond DBMSs. \n\n\u003e **Abstract**\n\u003e *Database management system (DBMS) configuration tuning is an essential aspect of any data-intensive application effort. But this is historically a difficult task because DBMSs have hundreds of configuration “knobs” that control everything in the system, such as the amount of memory to use for caches and how often data is written to storage. The problem with these knobs is that they are not standardized (i.e., two DBMSs use a different name for the same knob), not independent (i.e., changing one knob can impact others), and not universal (i.e., what works for one application may be sub-optimal for another). Worse, information about the effects of the knobs typically comes only from (expensive) experience.* \n\u003e \n\u003e *To overcome these challenges, we present an automated approach that leverages past experience and collects new information to tune DBMS configurations: we use a combination of supervised and unsupervised machine learning methods to (1) select the most impactful knobs, (2) map unseen database workloads to previous workloads from which we can transfer experience, and (3) recommend knob settings. We implemented our techniques in a new tool called OtterTune and tested it on three DBMSs. Our evaluation shows that OtterTune recommends configurations that are as good as or better than ones generated by existing tools or a human expert.*\n\n# The Problem\n## Big Data Era\nProcessing \u0026 analyzing large amounts of data is crucial in the \"big data\" era we live in now.
We measure the performance of these data processing systems in metrics like throughput \u0026 latency. Both of these quantities can be significantly impacted by the parameters a DBMS is configured with for a given workload and cluster spec. \n## Too Much To Tune\nModern database management systems are notorious for having a bazillion parameters that can be \"tuned\" to better fit the user's runtime environment and workload. \n\n![pasted-image-20240906170105](/images/pasted-image-20240906170105.webp)\n- [Why Machine Learning for Automatically Optimizing Databases Doesn't Work by Andy Pavlo - JOTB23](https://www.youtube.com/@Jonthebeach)\n\nIn the past 20 years alone, MySQL has grown from having some 30 knobs or so to 700+ now. That's simply far too many parameters for any single human or even group of humans to optimize. You might be able to classify some parameters as useless (name of output file, port, etc.) but a lot of other parameters may be interrelated and affect performance significantly. The optimal configuration cannot be reached by splitting the knobs up and solving for each independently, and solving the joint problem is beyond what humans can reason about.\n## Previous Attempt(s) Shortcomings\nThe authors claim that most of the previous attempts either suffered from vendor lock-in or required integrating several manual steps in the process. They were geared more toward \"assisting\" DBAs than automating the tuning process. \n\n\u003e All of these tools also examine each DBMS deployment independently and thus are unable to apply knowledge gained from previous tuning efforts. This is inefficient because each tuning effort can take a long time and use a lot of resources.\n\u003e - [Automatic Database Management System Tuning Through Large-Scale Machine Learning](https://db.cs.cmu.edu/papers/2017/p1009-van-aken.pdf)\n## Expensive Humans, Cheap Machines\nThis is the classic problem of some designs being optimized for cost-savings in the old days when human labour was cheap and computers were extremely expensive to acquire. Nowadays, especially with the advent of the Cloud, getting access to expensive hardware has become much cheaper. In contrast, the supply of DBAs who are capable of making any decent progress in tuning the complex DBMSs we have today has dwindled, consequently also making them extremely expensive labour for companies. \n\n![pasted-image-20240906194040](/images/pasted-image-20240906194040.webp)\n- [U.S. BUREAU OF LABOR STATISTICS](https://www.bls.gov/oes/current/oes151242.htm)\n## Tuning Is A Difficult Problem\n\n### Dependencies\n#### Standard ML Problem?\nKnobs cannot be tuned independently of each other. A subset of knobs may end up changing the effect of a different subset. The figure below is what they obtained just by plotting log file size vs buffer pool size. In reality we're trying to find a global optimum of an $n$-dimensional function (where $n$ is the number of knobs). This alone would be a classic machine learning problem.\n\n![pasted-image-20240906202812](/images/pasted-image-20240906202812.webp)\n- [Automatic Database Management System Tuning Through Large-Scale Machine Learning](https://db.cs.cmu.edu/papers/2017/p1009-van-aken.pdf)\n#### Not Really\nHowever, it's not that simple. For one, this $n$-dimensional function is fixed only for a very specific workload and for a specific system configuration that it runs on. If the workload or configuration changes, the function changes. This makes using \"past\" data for optimization very difficult.
Further, we also cannot afford to run the workload with varying configurations many times, since the costs will easily shoot through the roof. The function is also not perfectly constant between multiple replays of the same workload, but it should be close enough.\n#### Non-Reusability\nContinuing the previous section, non-reusability of data is by far the hardest problem here. If we could just reuse the data we have across hundreds of databases, we could collect a lot of data and then easily optimize things. But even for a fixed instance \u0026 database combination, a change in the workload can drastically change the function we are trying to optimize. This means for each database, instance \u0026 workload combination, we have to run an expensive data collection \u0026 tuning process.\n# OtterTune \n\n![pasted-image-20240907115234](/images/pasted-image-20240907115234.webp)\n- [Automatic Database Management System Tuning Through Large-Scale Machine Learning](https://db.cs.cmu.edu/papers/2017/p1009-van-aken.pdf)\n\nIn short, you can divide the architecture into two halves: the client-side controller interacting with the runtime database, and the server-side tuning manager which handles data collection and the recommender systems. The server-side tuning manager has a repository of data from previous tuning sessions as well. The goal is to make the database **self-driving**, as Andy Pavlo puts it. You want a database that is capable of automatically tuning itself when required. \n## Assumptions \u0026 Limitations\nBefore we discuss the details of the architecture, let's quickly go over the assumptions made by the paper and its limitations.\n### Not \"Completely\" Self-Driving\nThere are certain limitations that make this solution much more difficult to deploy in real-world environments. For example, some of the \"knobs\" that the databases expose for tuning require the database to be restarted to actually have an effect. However, in the real world, restarting prod databases is a major no-go. \nFurther, due to the specifics of how a company deploys a database \u0026 the side-cars attached to it, you might need a DBA to place limits on what knobs can and cannot be tuned. As such, OtterTune maintains a hardcoded blacklist of knobs that require restarts and also allows DBAs to pass in extra knobs that they want to blacklist from tuning to guard against these issues.\n### Availability of a Near-Perfect Testing Environment\nAs we've highlighted already, the function we are trying to optimize depends heavily on the workload and instance configuration. To optimize a database in production, we assume that we have a copy of such a database that has near-identical configuration \u0026 load. In practice, this might be very difficult to acquire.\n### Database Design Must be Reasonable\nIn short,\n\n\u003e Lastly, we also assume that the physical design of the database is reasonable. That means that the DBA has already installed the proper indexes, materialized views, and other database elements. There has been a considerable amount of research into automatic database design ([Self-Tuning Database Systems: A Decade of Progress](https://15799.courses.cs.cmu.edu/spring2022/papers/01-background/p3-chaudhuri.pdf)) that the DBA can utilize for this purpose.
As discussed in Appendix C, we plan to investigate how to apply these same techniques to tune the database’s physical design.\n\u003e - [Automatic Database Management System Tuning Through Large-Scale Machine Learning](https://db.cs.cmu.edu/papers/2017/p1009-van-aken.pdf)\n\n## The Architecture\n\n![pasted-image-20240907141744](/images/pasted-image-20240907141744.webp)\n- [Automatic Database Management System Tuning Through Large-Scale Machine Learning](https://db.cs.cmu.edu/papers/2017/p1009-van-aken.pdf)\n\nOtterTune works in three broad steps.\n1. Workload Characterization\n2. Knob Identification\n3. Automatic Tuning by Applying Knob configs\n### Workload Characterization\nPerhaps the biggest problem that this paper, and OtterTune, solves is that of workload characterization: the ability to map previously unseen database workloads to known workloads. Running an optimization algorithm like gradient descent from scratch, collecting data by replaying a test workload on a test instance many times over, is not practically feasible. The costs would far outweigh the benefits. But by solving the problem of workload characterization, they unlock the ability to use a wealth of previously collected data for ML purposes. \n\nBut this is a difficult problem. When we say workload, we mean a collection of the following items at minimum:\n- The instance configuration (This is easy)\n- The RDBMS software \u0026 version (This is easy)\n- A replay of the SQL queries the DBMS received (Difficult)\n\nYou can imagine that in an environment that supports both sharding \u0026 replication, this \"workload\" characteristic might have even more features to consider. How do we solve this complicated problem of taking two different workloads and getting an algorithm to judge their \"similarity\"? There were two main approaches considered.\n#### Differing Approaches to Capture Similarity\n##### Logical Analysis\nYou can attempt to analyze the workload at the logical level. \n\n\u003e This means examining the queries and the database schema to compute metrics, such as the number of tables/columns accessed per query and the read/write ratio of transactions. These metrics could be further refined using the DBMS’s “what-if” optimizer API to estimate additional runtime information ([AutoAdmin “What-if Index Analysis Utility](https://dl.acm.org/doi/pdf/10.1145/276304.276337)), like which indexes are accessed the most often.\n\u003e - [Automatic Database Management System Tuning Through Large-Scale Machine Learning](https://db.cs.cmu.edu/papers/2017/p1009-van-aken.pdf)\n\nHowever, going back to the original problem, we want to be able to analyze how workload performance can be optimized by changing knobs. Given two differing workloads, changing knobs in each workload does not affect its logical definition. DBMSs execute queries by pushing the query through a query optimizer which may generate differing query plans with the slightest of changes to a query. The information we need cannot be captured by just examining the logical data.\n##### Runtime Counter Analysis\nThe authors claim that a better approach is to use the DBMS's internal runtime metrics to characterize how a workload behaves. \n\n\u003eAll modern DBMSs expose a large amount of information about the system. For example, MySQL’s InnoDB engine provides statistics on the number of pages read/written, query cache utilization, and locking overhead. OtterTune characterizes a workload using the runtime statistics recorded while executing it.
These metrics provide a more accurate representation of a workload because they capture more aspects of its runtime behavior. Another advantage of them is that they are directly affected by the knobs’ settings. For example, if the knob that controls the amount of memory that the DBMS allocates to its buffer pool is too low, then these metrics would indicate an increase in the number of buffer pool cache misses.\n\u003e- [Automatic Database Management System Tuning Through Large-Scale Machine Learning](https://db.cs.cmu.edu/papers/2017/p1009-van-aken.pdf)\n\nThere may be a little bit of manual effort in re-labelling the information presented by different DBMS software under the same labels, but it is doable. In practice, the main difference between the data collected by differing DBMS software lies in the granularity at which they capture and record this data. Some DBMSs might capture it at the level of individual components (per buffer pool, for example) while others might only capture this information at the physical database level. However, in all instances, it is possible to aggregate this information at the physical database level for all modern DBMS software.\n##### Personal Take: Why Not a Bit of Both?\nWhile runtime metrics are very useful to judge the difference adjusting a knob makes, I believe it might be very easy for a knob change to drastically affect the query plan generated by the query optimizer for differing workloads. In these scenarios, a different query plan might suddenly make two previously similar workloads behave quite differently. True, in this scenario the runtime metrics should behave differently too, but it is possibly something that can be explored. \n#### Statistics Collection\nUnder the assumption that it is possible to get a near-identical copy of the workload to work with, OtterTune starts tuning sessions by wiping the runtime counters clean, executing the workload and then grabbing the metrics right after the workload terminates. OtterTune grabs aggregated values to make this system work across multiple databases. But this also implies that it can only tune knobs that are global to the database, instead of having more fine-grained control over specifics like each individual buffer pool's size, etc. \n#### Pruning Redundant Metrics\nThere may be a fair chunk of not-so-useful metrics in this proposed runtime statistics collection strategy. One reason is that certain metrics might be highly correlated. The other is differing granularities for the same metric being reported.\n\n\u003e For example, MySQL reports the amount of data read in terms of bytes and pages.\n\u003e ...\n\u003e For example, we found from our experiments that the Postgres metric for the number of tuples updated moves almost in unison with the metric that measures the number of blocks read from the buffer for indexes.\n\u003e - [Automatic Database Management System Tuning Through Large-Scale Machine Learning](https://db.cs.cmu.edu/papers/2017/p1009-van-aken.pdf)\n\nThis is essentially just the classic dimensionality reduction problem. They use two such techniques: *factor analysis* ([scikit-learn Documentation – Factor Analysis](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.FactorAnalysis.html)) and *k-means* ([scikit-learn Documentation – KMeans](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html)). This greatly helps reduce the search space for the ML algorithm \u0026 also eliminates \"noise\" in the data.
If you want a quick primer on dimensionality reduction, the paper does a great job here:\n\n\u003e Given a set of real-valued variables that contain arbitrary correlations, FA reduces these variables to a smaller set of factors that capture the correlation patterns of the original variables. Each factor is a linear combination of the original variables; the factor coefficients are similar to and can be interpreted in the same way as the coefficients in a linear regression. Furthermore, each factor has unit variance and is uncorrelated with all other factors. This means that one can order the factors by how much of the variability in the original data they explain. **We found that only the initial factors are significant for our DBMS metric data, which means that most of the variability is captured by the first few factors.**\n\u003e - [Automatic Database Management System Tuning Through Large-Scale Machine Learning](https://db.cs.cmu.edu/papers/2017/p1009-van-aken.pdf)\n\nThis is the detailed description from the paper:\n\n\u003eThe FA algorithm takes as input a matrix $X$ whose rows correspond to metrics and whose columns correspond to knob configurations that we have tried. The entry $X_{ij}$ is the value of metric $i$ on configuration $j$. FA gives us a smaller matrix $U$: the rows of $U$ correspond to metrics, while the columns correspond to factors, and the entry $U_{ij}$ is the coefficient of metric $i$ in factor $j$. We can scatter-plot the metrics using elements of the $i^{th}$ row of $U$ as coordinates for metric $i$. Metrics $i$ and $j$ will be close together if they have similar coefficients in $U$ — that is, if they tend to correlate strongly in $X$. Removing redundant metrics now means removing metrics that are too close to one another in our scatter-plot. \n\u003e\n\u003eWe then cluster the metrics via $k$-means, using each metric’s row of $U$ as its coordinates. We keep a single metric for each cluster, namely, the one closest to the cluster center. One of the drawbacks of using $k$-means is that it requires the optimal number of clusters ($K$) as its input. We use a simple heuristic [40] to fully automate this selection process and approximate $K$. Although this approach is not guaranteed to find the optimal solution, it does not require a human to manually interpret a graphical representation of the problem to determine the optimal number of clusters. We compared this heuristic with other techniques [55, 48] for choosing $K$ and found that they select values that differ by one to two clusters at most from our approximations. Such variations made little difference in the quality of configurations that OtterTune generated in our experimental evaluation in Sect. 7.\n\u003e- [Automatic Database Management System Tuning Through Large-Scale Machine Learning](https://db.cs.cmu.edu/papers/2017/p1009-van-aken.pdf)\n##### Factor Analysis\nLet me explain that with a simple example. Let's say we fix the workload and measure metrics $m_1, m_2, \\cdots, m_n$ for configurations $c_1, c_2, \\cdots, c_m$. Note that each configuration is in itself a vector of key value pairs for each individual knob that you can tune, but at the moment, let's just give these vectors an ID and call each unique configuration ID a column. 
Our table looks like:\n$$\n\\begin{array}{c|cccc} \u0026 c_1 \u0026 c_2 \u0026 \\cdots \u0026 c_m \\\\ \\hline m_1 \u0026 x_{11} \u0026 x_{12} \u0026 \\cdots \u0026 x_{1m} \\\\ m_2 \u0026 x_{21} \u0026 x_{22} \u0026 \\cdots \u0026 x_{2m} \\\\ \\vdots \u0026 \\vdots \u0026 \\vdots \u0026 \\ddots \u0026 \\vdots \\\\ m_n \u0026 x_{n1} \u0026 x_{n2} \u0026 \\cdots \u0026 x_{nm} \\end{array} \n$$\nFactor analysis treats each row of this table (a metric, observed across the $m$ configurations) as one variable, and tries to find a set of \"underlying\" factors with smaller cardinality that can represent these variables as linear combinations of the factors it identifies $\\pm \\text{unique variance}$. In short, let's suppose we get $s \\lt n$ factors $f_1, f_2, \\cdots, f_s$ (each also a vector across the $m$ configurations). This essentially means that we can represent each of our original metrics as some $m_i = \\lambda_{i1}f_1 + \\lambda_{i2}f_2 + \\cdots + \\lambda_{is}f_s + \\epsilon_i$ where $\\epsilon_i$ is the \"unique\" variance of metric $i$. \n\n\u003eThe factor coefficients are similar to and can be interpreted in the same way as the coefficients in a linear regression. Furthermore, each factor has unit variance and is uncorrelated with all other factors. This means that one can order the factors by how much of the variability in the original data they explain. We found that only the initial factors are significant for our DBMS metric data, which means that most of the variability is captured by the first few factors.\n\u003e- [Automatic Database Management System Tuning Through Large-Scale Machine Learning](https://db.cs.cmu.edu/papers/2017/p1009-van-aken.pdf)\n\n##### $k$-Means Clustering\nNow, let's suppose that the number of factors identified was exactly $s = 2$. If we plot each metric using its two factor coefficients $(\\lambda_{i1}, \\lambda_{i2})$ as coordinates, each metric becomes a data point in 2d space that we can run $k$-means on to identify *similar* metrics. With this, we can significantly reduce the number of metrics we want to use in our final optimization ML algorithm to reduce the search space. \n\n![pasted-image-20240907164426](/images/pasted-image-20240907164426.webp)\n- [Automatic Database Management System Tuning Through Large-Scale Machine Learning](https://db.cs.cmu.edu/papers/2017/p1009-van-aken.pdf)\n\nAn interesting consequence of this clustering is that a lot of the useless metrics like output file name, etc. get mapped to the same cluster, since their values don't really depend on the configuration in any way whatsoever. The paper suggests providing some hints to the model to discard clusters containing parameters that we know for sure are useless. \n\n\u003eFrom the original set of 131 metrics for MySQL and 57 metrics for Postgres, we are able to reduce the number of metrics by 93% and 82%, respectively. Note that OtterTune still collects and stores data for all of the DBMS’s metrics in its repository even if they are marked as redundant. The set of metrics that remain after pruning the FA reduction is only considered for the additional ML components that we discuss in the next sections.\n\u003e- [Automatic Database Management System Tuning Through Large-Scale Machine Learning](https://db.cs.cmu.edu/papers/2017/p1009-van-aken.pdf)\n### Identifying Important Knobs\nAs mentioned before, there is a huge number of configurable variables or \"knobs\" that most modern RDBMS software exposes for tuning, most of which are useless. So on the note of reducing search space, the next step we want to carry out is reducing the number of knobs we want to consider for tuning.
The authors use a linear regression technique called *Lasso* regression ([Lasso / Ridge / Elastic Net Regression](/blog/lasso-ridge-elastic-net-regression)) to find the knobs that have the highest correlation to the system's overall performance. \n\n\u003eOtterTune’s tuning manager performs these computations continuously in the background as new data arrives from different tuning sessions. In our experiments, each invocation of Lasso takes ∼20 min and consumes ∼10 GB of memory for a repository comprised of 100k trials with millions of data points. The dependencies and correlations that we discover are then used in OtterTune’s recommendation algorithms, presented in Sect. 6.\n\u003e- [Automatic Database Management System Tuning Through Large-Scale Machine Learning](https://db.cs.cmu.edu/papers/2017/p1009-van-aken.pdf)\n\n#### Lasso Regression\nLasso regression performs regression with slightly increased bias in exchange for a reduction in variance. The technique is particularly useful when the feature set contains several useless variables that do not contribute any variance to the parameter we are predicting: lasso drives their weights to 0. \n\nThe paper, in particular, uses a version of lasso called the *Lasso path algorithm* as described in [The Elements of Statistical Learning](https://hastie.su.domains/Papers/ESLII.pdf). \n\n\u003e The algorithm starts with a high penalty setting where all weights are zero and thus no features are selected in the regression model. It then decreases the penalty in small increments, recomputes the regression, and **tracks what features are added back to the model at each step**. OtterTune uses the order in which the knobs first appear in the regression to determine how much of an impact they have on the target metric (e.g., the first knob selected is the most important).\n\u003e \n\u003e Before OtterTune computes this model, it executes two preprocessing steps to normalize the knobs data. This is necessary because Lasso provides higher quality results when the features are (1) continuous, (2) have approximately the same order of magnitude, and (3) have similar variances. It first transforms all of the categorical features to “dummy” variables that take on the values of zero or one. Specifically, each categorical feature with n possible values is converted into n binary features. Although this encoding method increases the number of features, all of the DBMSs that we examined have a small enough number of categorical features that the performance degradation was not noticeable. Next, OtterTune scales the data. We found that standardizing the data (i.e., subtracting the mean and dividing by the standard deviation) provides adequate results and is easy to execute. We evaluated more complicated approaches, such as computing deciles, but they produced nearly identical results as the standardized form\n\u003e - [Automatic Database Management System Tuning Through Large-Scale Machine Learning](https://db.cs.cmu.edu/papers/2017/p1009-van-aken.pdf)\n\nTo capture relationships *between* variables, such as dependencies between memory allocations (since maximizing every individual knob that allocates memory would lead to thrashing from the system running out of memory), we introduce polynomial terms in the equation. The output of the lasso path algorithm is a list of all knobs, sorted by the amount of impact they have on a target metric.
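\n\nFor reference, this is the generic lasso objective that produces this behaviour (standard textbook form; the notation here is mine, not the paper's):\n$$\n\\hat{w} = \\arg\\min_{w}\\ \\|y - Xw\\|_2^2 + \\lambda\\|w\\|_1\n$$\nThe $\\ell_1$ penalty $\\lambda\\|w\\|_1$ is what forces the weights of unhelpful features to exactly zero; the lasso path algorithm sweeps $\\lambda$ from large to small and records the order in which knob weights first become non-zero.\n\n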
Now we need to figure out how many terms of the equation we want to keep and how many to discard. We can obviously consider binary searching the \"optimal\" point, but the paper suggests using an incremental approach where OtterTune dynamically increases the number of knobs used in a tuning session over time. They claim this incremental approach has been effective in other optimization settings as well, and it is likely fine here since the prefix of knobs that yields an optimal configuration is likely pretty small. \n### Automated Tuning\nNow that we've gotten the \"pre-processing\" parts, if you want to call them that, out of the way, we can move on to the fun stuff. When tuning a new workload, OtterTune divides the task into two broad steps.\n1. Workload Mapping $\\to$ Tries to map the new workload to some existing workload data in the database.\n2. Configuration Recommendation $\\to$ Tries to use the data points from the \"similar\" workload identified in the previous step and then adds new data points from the new workload by trading off exploration vs exploitation using Gaussian Process Regression. (Which I will hopefully write a blog about in the future. I can understand the overarching ideas, but the math is black magic to me at the moment :)\n#### Workload Mapping\nThe first thing OtterTune does when given a new tuning task is to run a few replays of the workload on a few different previously seen **configurations** and record the metrics of each run. Once this is done, it tries to find the previously seen **workload** that has the most similar metric readings for the configurations tested. As we do more test runs on different configurations, the quality of the match made by OtterTune increases, which is what we'd expect to see. The below describes this matching process in more detail:\n\n\u003e For each DBMS version, we build a set $S$ of $N$ matrices — one for every non-redundant metric — from the data in our repository. Similar to the Lasso and FA models, these matrices are constructed by background processes running on OtterTune’s tuning manager (see Sect. 3). The matrices in $S$ (i.e., $X_0, X_1, \\cdots, X_{N−1}$) have identical row and column labels. Each row in matrix $X_m$ corresponds to a workload in our repository and each column corresponds to a DBMS configuration from the set of all unique DBMS configurations that have been used to run any of the workloads. The entry $X_{m,i,j}$ is the value of metric $m$ observed when executing workload $i$ with configuration $j$. If we have multiple observations from running workload $i$ with configuration $j$, then entry $X_{m,i,j}$ is the median of all observed values of metric $m$. \n\u003e \n\u003e The workload mapping computations are straightforward. OtterTune calculates the Euclidean distance between the vector of measurements for the target workload and the corresponding vector for each workload $i$ in the matrix $X_m$ (i.e., $X_{m,i,:}$). It then repeats this computation for each metric $m$. In the final step, OtterTune computes a “score” for each workload $i$ by taking the average of these distances over all metrics $m$. The algorithm then chooses the workload with the lowest score as the one that is most similar to the target workload for that observation period. \n\u003e - [Automatic Database Management System Tuning Through Large-Scale Machine Learning](https://db.cs.cmu.edu/papers/2017/p1009-van-aken.pdf)\n\nThat was the crux of it.
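\n\nIn symbols (my notation, condensing the quote above): if $\\hat{X}_{m,:}$ is the vector of (binned) measurements of metric $m$ for the target workload over the configurations it has been run on, and $X_{m,i,:}$ is the corresponding vector for candidate workload $i$, then\n$$\n\\text{score}(i) = \\frac{1}{N}\\sum_{m=0}^{N-1}\\|X_{m,i,:} - \\hat{X}_{m,:}\\|_2\n$$\nand OtterTune maps the target to the workload $i$ with the lowest score.\n\n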
Essentially, we compute the similarity score between our new workload and each previously seen workload by computing the Euclidean distance between the performance metrics of both workloads on the same configurations. To make this work correctly, you'll also need to do some normalization / binning, which is described below. \n\n\u003e Before computing the score, it is critical that all metrics are of the same order of magnitude. Otherwise, the resulting score would be unfair since any metrics much larger in scale would dominate the average distance calculation. OtterTune ensures that all metrics are the same order of magnitude by computing the deciles for each metric and then binning the values based on which decile they fall into. We then replace every entry in the matrix with its corresponding bin number. With this extra step, we can calculate an accurate and consistent score for each of the workloads in OtterTune’s repository.\n\u003e - [Automatic Database Management System Tuning Through Large-Scale Machine Learning](https://db.cs.cmu.edu/papers/2017/p1009-van-aken.pdf)\n#### Configuration Recommendation\nOtterTune uses a regression technique called [Gaussian Process Regression](/blog/gaussian-process-regression) which they claim to be the state-of-the-art technique with power approximately equal to that of deep networks. I know too little about ML \u0026 the math behind how Gaussian Process Regression works is still black magic for me, so I'll try to give the intuitive understanding that I got from reading [A Visual Exploration of Gaussian Processes - Distill](https://distill.pub/2019/visual-exploration-gaussian-processes/).\n##### Gaussian Processes (A *very* high level overview)\n\n**Regression** is just a term we give to techniques used to find a function that represents the 'best-fit' for a set of data points as closely as possible. We usually define some notion of cost to minimize and regression minimizes this cost, giving us our 'best-fit' function. \n\n**Gaussian Processes** are a probabilistic method that gives a confidence interval for the predicted function. The mean of the distribution would be the \"best-fit\" line, and the variance helps us gauge our confidence in a given prediction.\n\nThere are infinitely many functions that can fit your data. Gaussian processes offer an elegant solution to this problem by assigning a probability to each of these functions. \n\n![pasted-image-20240909143736](/images/pasted-image-20240909143736.webp)\n*The distribution without any points to fit on*\n\n![pasted-image-20240909143836](/images/pasted-image-20240909143836.webp)\n*The distribution given two data points*\n\n![pasted-image-20240909143909](/images/pasted-image-20240909143909.webp)\n*The distribution once all known points are given*\n\nYou'll also notice that at locations where multiple data points are present, the distribution is very narrow, signifying high confidence. On the other hand, near the horizontal ends you'll see the distribution widen, signifying lower confidence, which is what we want to see.\n##### Searching with GPs\nThe paper also claims that GPs are able to provide a theoretically justified way to trade off exploration (acquiring new knowledge) and exploitation (making decisions based on existing knowledge). \n\nRemember that OtterTune has mapped the new workload to an existing, previously seen workload whose execution on multiple different configurations has already been recorded.
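\n\nIn very rough symbols (standard GP notation, not the paper's): the model assumes the performance function over configurations $x$ is drawn as $f \\sim \\mathcal{GP}(\\mu_0(x), k(x, x'))$, and conditioning on the runs observed so far yields a posterior mean $\\mu(x)$ and variance $\\sigma^2(x)$ for every untried configuration. Exploration favours configurations where $\\sigma^2(x)$ is large; exploitation favours configurations where $\\mu(x)$ already looks good.\n\n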
OtterTune starts from these values and then adds the new data points collected by actually running the new workload. \n\nSince the mapped workload might not exactly match the unknown workload, the model’s predictions are not fully trusted. To address this, we increase the noise parameter variance for all untried points in the GP model by adding a ridge term to the covariance. We also add a smaller ridge term for each configuration selected. The claim is that this approach helps in handling variability in virtualized environments where DBMS metrics like throughput and latency can vary between observations.\n\nIn the context of our proposed solution:\n\n**Exploration** $\\to$ Search an unknown region in its GP $\\to$ Run the workload on a configuration far away from the points the GP has already sampled, possibly adjusting several config parameters at the same time. (Remember, this GP lives in $m$-dimensional space, where $m$ is the number of knobs to tune.) This can be particularly useful when OtterTune is trying to change knob values where the upper or lower limit for the knob's best value might depend on the underlying hardware. (Ex: Total memory available.)\n\n**Exploitation** $\\to$ Select a configuration that is near the best configuration in its GP $\\to$ Run the workload on a configuration that is somewhere close to where the GP has high confidence, to confirm \u0026 slightly improve performance with fine-tuned adjustments. \n\n\u003e Which of these two strategies OtterTune chooses when selecting the next configuration depends on the variance of the data points in its GP model. It always chooses the configuration with the greatest expected improvement. The intuition behind this approach is that each time OtterTune tries a configuration, it “trusts” the result from that configuration and similar configurations more, and the variance for those data points in its GP decreases. The expected improvement is near-zero at sampled points and increases in between them (although possibly by a small amount). Thus, it will always try a configuration that it believes is optimal or one that it knows little about. Over time, the expected improvement in the GP model’s predictions drops as the number of unknown regions decreases. This means that it will explore the area around good configurations in its solution space to optimize them even further.\n\u003e - [Automatic Database Management System Tuning Through Large-Scale Machine Learning](https://db.cs.cmu.edu/papers/2017/p1009-van-aken.pdf)\n##### Gradient Descent\nThe output of the GP is the function, in this case an $m$-dimensional surface, that it believes best models the function for our new workload. Once we have this surface, we have finally reduced our problem to the standard ML problem we described all those ages back :)\n\n\u003e There are two types of configurations in the initialization set: the first are the top-performing configurations that have been completed in the current tuning session, and the second are configurations for which the value of each knob is chosen at random from within the range of valid values for that knob. Specifically, the ratio of top-performing configurations to random configurations is 1-to-10. During each iteration of gradient descent, the optimizer takes a “step” in the direction of the local optimum until it converges or has reached the limit on the maximum number of steps it can take. OtterTune selects from the set of optimized configurations the one that maximizes the potential improvement to run next.
This search process is quick; in our experiments OtterTune’s tuning manager takes 10–20 sec to complete its gradient descent search per observation period. Longer searches did not yield better results. \n\u003e \n\u003e Similar to the other regression-based models that we use in OtterTune (see Sects. 5.1 and 6.1), we employ preprocessing to ensure that features are continuous and of approximately the same scale and range. We encode categorical features with dummy variables and standardize all data before passing it as input to the GP model. \n\u003e \n\u003e Once OtterTune selects the next configuration, it returns this along with the expected improvement from running this configuration to the client. The DBA can use the expected improvement calculation to decide whether they are satisfied with the best configuration that OtterTune has generated thus far.\n\u003e - [Automatic Database Management System Tuning Through Large-Scale Machine Learning](https://db.cs.cmu.edu/papers/2017/p1009-van-aken.pdf)\n## Results\n### OtterTune vs iTuned\nRefer to the paper to obtain a more comprehensive overview of the test suite and results. In short, the major contribution OtterTune brought is the ability to re-use previously seen workloads to tune new unseen workloads. As such, they compare the performance of OtterTune with [iTuned](https://users.cs.duke.edu/~shivnath/papers/ituned.pdf), a similar automatic DBMS tuning tool that uses GPs. However, instead of starting from previously seen data, iTuned uses a stochastic sampling technique called Latin Hypercube Sampling to generate an initial set of 10 DBMS configurations that are executed at the start of the tuning session. Optimizing for the 99th percentile latency metric, they obtain the following results:\n\n![pasted-image-20240909153031](/images/pasted-image-20240909153031.webp)\n- [Automatic Database Management System Tuning Through Large-Scale Machine Learning](https://db.cs.cmu.edu/papers/2017/p1009-van-aken.pdf)\n\nIn general, we observe that OtterTune is able to converge to its optimal configuration much faster than iTuned. It also outperforms it by a significant margin on OLTP workloads. In contrast, you'll notice that the gap is nowhere near as pronounced on OLAP workloads. It even seems to lose to iTuned on the OLAP workload. The authors claim that this difference is mainly due to the fact that the OLAP DBMS, Vector, exposes far fewer knobs for tuning and is also less permissive about what values they can be set to. This makes tuning Vector a much simpler problem than MySQL or Postgres, limiting the room for improvement. \n### OtterTune vs DBAs \u0026 Other Usual Tuning Strategies\n\n![pasted-image-20240909154811](/images/pasted-image-20240909154811.webp)\n- [Automatic Database Management System Tuning Through Large-Scale Machine Learning](https://db.cs.cmu.edu/papers/2017/p1009-van-aken.pdf)\n# Conclusion\n\u003e We presented a technique for tuning DBMS knob configurations by reusing training data gathered from previous tuning sessions. Our approach uses a combination of supervised and unsupervised machine learning methods to (1) select the most impactful knobs, (2) map previously unseen database workloads to known workloads, and (3) recommend knob settings. Our results show that OtterTune produces configurations that achieve up to 94% lower latency compared to their default settings or configurations generated by other tuning advisors.
We also show that OtterTune generates configurations in under 60 min that are comparable to ones created by human experts.\n\u003e - [Automatic Database Management System Tuning Through Large-Scale Machine Learning](https://db.cs.cmu.edu/papers/2017/p1009-van-aken.pdf)\n", - "lastmodified": "2026-01-11T05:06:22.051004749+05:30", + "lastmodified": "2026-01-11T10:59:34.40836999+05:30", "tags": [ "domain-cs-ai-ml-machine-learning", "domain-cs-systems-databases" ] }, "/blog/p-vs-np": { "title": "P vs NP", "content": "This is one of the most famous unsolved questions in computer science. I mean, seriously, the Clay Math Institute offers a reward of a **million** dollars to the first person that is able to solve this problem. [https://www.claymath.org/millennium-problems/p-vs-np-problem](https://www.claymath.org/millennium-problems/p-vs-np-problem)\n\nWhy? What's so special about this problem and what even _is_ the problem?\n\nLet's begin by defining the problem. The problem asks, is $P = NP$? That is, is the set of _all_ the problems in $NP$ the same as the set of all the problems in $P$? A more intuitive way to phrase this question would be asking, \"Can all problems that can be _verified_ in polynomial time also be _solved_ in polynomial time?\"\n\nBut why is this one of the most famous unsolved problems in computer science? What are the implications of such a result? Why is this even a question? Do we even have _any_ reason to believe that $P$ _might_ equal $NP$?\n\nHere are a few _interesting_ answers to these questions.\n\n1. If $P$ _did_ equal $NP$, it would mean that simply being able to _check_ if a solution is correct would be **no harder** than solving the problem itself. Optimization problems like transport routing, production of goods, circuit design, etc. are **all** $NP$ problems. We would be able to get optimal answers to these problems _much_ faster than we are able to today. The economy could be made so much more efficient. Protein folding is an $NP$ problem. If we could make protein folding a problem in $P$ then we would be able to make huge breakthroughs in biology. We would be able to cure cancer! One of my favorite quotes describing the implications of $P=NP$ is from an MIT researcher,\n \n \u003e _\"If P=NP, then the world would be a profoundly different place than we usually assume it to be. There would be no special value in “creative leaps,” no fundamental gap between solving a problem and recognizing the solution once it’s found. Everyone who could appreciate a symphony would be Mozart; everyone who could follow a step-by-step argument would be Gauss; everyone who could recognize a good investment strategy would be Warren Buffet.\"_ - Scott Aaronson\n \n One small downside is that breaking RSA is also an $NP$ problem. If $P=NP$, all known security encryption measures would break down and none of our passwords would be safe :)\n \n2. But the truth is, most computer science researchers do **not** believe that $P=NP$. Consider the first definition of $NP$ problems that we gave. We are essentially relying on non-determinism in our Turing machine. We are relying on the fact that the Turing machine is somehow able to \"magically\" or \"luckily\" _always_ pick the right path of traversal. Luck or magic is not something we can model in a deterministic Turing machine. However, despite all this, no one has been able to prove $P \\neq NP$.\n \n3. Finally, problems once believed to be outside $P$ have indeed been shown to be in $P$.
Consider sorting an array by going through all its different permutations; such an algorithm would take $O(n! \\cdot n)$ time, which is nowhere near polynomial. However, after we cleverly came up with a better algorithm such as bubble sort or merge sort, we managed to show this problem is in $P$, with merge sort giving an $O(n \\log n)$ algorithm for solving it. Similarly, problems we once only had super-polynomial algorithms for have been shown to be in $P$ after someone managed to come up with a clever algorithm to solve the problem faster. But just because some problems we thought were hard were later found to be in $P$ does not mean that the two classes are equal. In fact, what the question $P=NP$? is really asking is if $P = NP-Complete$. Recall that $NP-Complete$ problems are the hardest problems in $NP$. Every single problem that belongs in $NP$, including the $NP-Complete$ problems themselves, is reducible to an $NP-Complete$ problem. This means that if we could somehow show that even **one** problem belonging to the $NP-Complete$ class is in $P$, we would be able to prove $P=NP$. So far, various problems in $NP$ have been shown to be in $P$, but never an $NP-Complete$ problem. As mentioned on the Clay Institute website, _\"However, this apparent difficulty may only reflect the lack of ingenuity of your programmer.\"_ Someday, someone just might be able to come up with a radical new algorithm to show one of the $NP-Complete$ problems is in $P$. There is a possibility, even if highly unlikely.\n \n\nThis is a view of the complexity classes as we know them, depending on the result of the $P$ vs $NP$ problem.\n![pnp-x](/images/pnp-x.webp)\n\nCourtesy: [https://brilliant.org/wiki/complexity-classes/](https://brilliant.org/wiki/complexity-classes/)\n# References\nThese notes are old and I did not rigorously hoard references back then. If some part of this content is yours or you know where it's from then do reach out to me and I'll update it. \n1. Professor [Kannan Srinathan's](https://www.iiit.ac.in/people/faculty/srinathan/) course on Algorithm Analysis \u0026 Design in IIIT-H\n2. [Why We Might Use Different Numbers in the Future - Up and Atom](https://youtu.be/JS40jPaogM4?si=2DM7YS6xnipvUO5C) (Great Channel, recommend checking out)\n3. [Complexity Classes - Brilliant.org](https://brilliant.org/wiki/complexity-classes/)\n", - "lastmodified": "2026-01-11T05:06:22.144784141+05:30", + "lastmodified": "2026-01-11T10:59:34.502023379+05:30", "tags": [ "domain-cs-algorithms-complexity" ] }, "/blog/parallelism-with-omp": { "title": "Parallelism With OMP", "content": "We learnt a bit about vectorization in [Vectorization \u0026 Analyzing Loop Dependencies](/blog/vectorization-analyzing-loop-dependencies); we'll now try to add more parallelism to our programs by leveraging more hardware features (multiple processing units / cores). \n# Processes vs Threads\n\nProcesses are independent of each other and work in completely sandboxed environments. They don’t share memory / any other resources and this makes it notoriously difficult to communicate among processes. Threads on the other hand are interdependent and share memory and other resources.\n\n# OpenMP\n\nAn easy, quick and dirty way to parallelize your program. **NOTE:** OpenMP is not smart; if you parallelize a loop that has dependencies, you will get races, wrong results, or outright segfaults.\n\nIts primary advantages are that it works cross-platform and independently of the number of cores on a machine. Disabling multi-threading is also as simple as omitting the `-fopenmp` flag.
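\n\nAs a quick first taste, here's the classic hello-world of OpenMP (a minimal sketch; compile with something like `gcc -fopenmp hello.c`):\n\n```c\n#include \u003comp.h\u003e\n#include \u003cstdio.h\u003e\n\nint main(void) {\n\t// OMP spawns a team of threads; each thread executes the block once.\n\t#pragma omp parallel\n\tprintf(\"Hello from thread %d of %d\\n\",\n\t       omp_get_thread_num(), omp_get_num_threads());\n\treturn 0;\n}\n```\n\n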
Some useful info:\n\n- Can use environment variables to limit the number of threads code is parallelized over: `export OMP_NUM_THREADS=x`\n- Can use `#pragma omp parallel shared(A) private(B)` to force OpenMP to treat certain variables as private data (ensuring every thread has its own copy) and to keep other variables' data shared across multiple threads. **NOTE:** When variables are shared, their data is NOT accessed under mutex locks. Race conditions are very much possible.\n\nSide note: compilers _can_ analyze loops for dependencies and auto-parallelize code when possible. `icc` has the `-parallel` option and `gcc` has Graphite. [https://gcc.gnu.org/wiki/Graphite](https://gcc.gnu.org/wiki/Graphite)\n\n## Static vs Dynamic thread scheduling \u0026 Thread affinity\n\nThe [BLIS Library’s document on multi-threading](https://github.com/flame/blis/blob/master/docs/Multithreading.md#choosing-openmp-vs-pthreads) provides a very comprehensive description of thread affinity and why it is important:\n\n\u003e This is important because when the operating system causes a thread to migrate from one core to another, the thread will typically leave behind the data it was using in the L1 and L2 caches. That data may not be present in the caches of the destination core. Once the thread resumes execution from the new core, it will experience a period of frequent cache misses as the data it was previously using is transmitted once again through the cache hierarchy. If migration happens frequently enough, it can pose a significant (and unnecessary) drag on performance. The solution to thread migration is setting processor affinity. In this context, affinity refers to the tendency for a thread to remain bound to a particular compute core. There are at least two ways to set affinity in OpenMP. The first way offers more control, but requires you to understand a bit about the processor topology and how core IDs are mapped to physical cores, while the second way is simpler but less powerful. Let's start with an example. Suppose I have a two-socket system with a total of eight cores, four cores per socket. By setting `GOMP_CPU_AFFINITY` as follows\n\u003e \n\u003e ```bash\n\u003e $ export GOMP_CPU_AFFINITY=\"0 1 2 3 4 5 6 7\"\n\u003e ```\n\u003e \n\u003e I am communicating to OpenMP that the first thread to be created should be spawned on core 0, from which it should not migrate. The second thread to be created should be spawned on core 1, from which it should not migrate, and so forth. If socket 0 has cores 0-3 and socket 1 has 4-7, this would result in the first four threads on socket 0 and the second four threads on socket 1. (And if more than eight threads are spawned, the mapping wraps back around, starting from the beginning.) So with `GOMP_CPU_AFFINITY`, you are doing more than just preventing threads from migrating once they are spawned--you are specifying the cores on which they will be spawned in the first place.\n\u003e \n\u003e Another example: Suppose the hardware numbers the cores alternatingly between sockets, such that socket 0 gets even-numbered cores and socket 1 gets odd-numbered cores. In such a scenario, you might want to use `GOMP_CPU_AFFINITY` as follows\n\u003e \n\u003e ```bash\n\u003e $ export GOMP_CPU_AFFINITY=\"0 2 4 6 1 3 5 7\"\n\u003e ```\n\u003e \n\u003e Because the first four entries are 0 2 4 6, threads 0-3 would be spawned on the first socket, since that is where cores 0, 2, 4, and 6 are located.
Similarly, the subsequent 1 3 5 7 would cause threads 4-7 to be spawned on the second socket, since that is where cores 1, 3, 5, and 7 reside. Of course, setting `GOMP_CPU_AFFINITY` in this way implies that BLIS benefits from this kind of grouping of threads--which, generally, it does. As a general rule, you should try to fill up a socket with one thread per core before moving to the next socket.\n\u003e \n\u003e A second method of specifying affinity is via `OMP_PROC_BIND`, which is much simpler to set:\n\u003e \n\u003e ```bash\n\u003e $ export OMP_PROC_BIND=close\n\u003e ```\n\u003e \n\u003e This binds the threads close to the master thread, in contiguous \"place\" partitions. (There are other valid values aside from close.) Places are specified by another variable, `OMP_PLACES`:\n\u003e \n\u003e ```bash\n\u003e $ export OMP_PLACES=cores\n\u003e ```\n\u003e \n\u003e The cores value is most appropriate for BLIS since we usually want to ignore hardware threads (symmetric multi-threading, or \"hyper-threading\" on Intel systems) and instead map threads to physical cores.\n\u003e \n\u003e Setting these two variables is often enough. However, it obviously does not offer the level of control that `GOMP_CPU_AFFINITY` does. Sometimes, it takes some experimentation to determine whether a particular mapping is performing as expected. If multi-threaded performance on eight cores is only twice what is observed of single-threaded performance, the affinity mapping may be to blame. But if performance is six or seven times higher than sequential execution, then the mapping you chose is probably working fine.\n\u003e \n\u003e Unfortunately, the topic of thread-to-core affinity is well beyond the scope of this document. (A web search will uncover many great resources discussing the use of `GOMP_CPU_AFFINITY` and `OMP_PROC_BIND`.) It's up to the user to determine an appropriate affinity mapping, and then choose your preferred method of expressing that mapping to the OpenMP implementation.\n\n## Processes vs Threads\n\nOne question to address before we begin using threads is: “Why threads?” Processes are also capable of offering the same type of functionality that threads offer. Processes can also execute in parallel and work on different data at the same time. The reason why we prefer threads to processes is simple: communication. Processes by nature are completely isolated systems, which means they are in a sandboxed environment and inter-process communication is not easy. If there is some sort of accumulation variable whose task is split over multiple threads, it is much easier to do the final accumulation in a thread-based environment than with multiple processes. This is because threads share the same address space (the code and data segments are shared; only the registers and stacks are per-thread). To achieve the same level of ease of use with processes, we would have to do something like `mmap`ing a shared address space and manually managing it, which is both a lot of extra effort and possibly has a higher overhead.\n\n# OMP (Again)\n\nOMP is a highly customizable API and there are a great many things that we can achieve with it.\n\n## `parallel` \u0026 `for`\n\nTwo of the most commonly used and most basic pragmas to know are `#pragma omp parallel` and `#pragma omp for`.
The `parallel` pragma simply makes a code block execute in parallel, whereas `for` is a pragma that instructs a `parallel` code block to divide the loop into multiple chunks and assign each one to a different thread associated with that pool.\n\n```c\n// This instructs OMP to assign a thread worker group to the below code block\n#pragma omp parallel\n{\n\t\t// This instructs the parent parallel block to split the for loop into chunks \n\t\t// and assign each thread in the worker group a chunk to execute in parallel\n\t\t#pragma omp for\n\t\tfor(int i=0; i\u003c10; i++) printf(\"%d\\n\", i);\n}\n```\n\nThe above code can also be compressed and written as a single pragma `#pragma omp parallel for` and put above a `for` loop.\n\nNote that `for` is simply a directive that instructs the enclosing `parallel` block to split the loop among its threads. Further, the extended syntax even allows us to run other blocks of code in parallel with the `for` loop, and is a powerful construct to be aware of.\n\nThere are also a lot of modifiers that you can apply to the `parallel` pragma to control its behavior.\n\n1. `num_threads(n)`\n \n This instructs OMP to allocate exactly `n` threads to the worker group associated with the `parallel` block. The default is to associate one thread with each hardware thread supported by your CPU. This can be useful to limit the number of threads generated.\n \n2. `proc_bind(close/master/spread)`\n \n Let’s say you were working on a NUMA architecture where the distances of threads from each other mattered. `spread` instructs OMP to bind each virtual thread to physical threads that are as far apart from each other as possible. This is useful when the threads are accessing memory unrelated to each other and you don’t want to end up with multiple threads pulling unrelated data into the cache and causing thrashing.\n \n `master` makes each virtual thread execute on the same physical thread as the master thread.\n \n `close` instructs OMP to bind each virtual thread to physical threads that are close to each other. This is useful when the threads are accessing common data. Now the threads can pull shared data into the cache and both might benefit from increased cache hit rates.\n \n\n`for` also has some useful modifiers. In particular, `schedule(guided/static/dynamic)`. The default option for `for` is `static`. This is best explained through an example.\n\nLet’s say we had a `for` loop like so and a total of 4 threads on our machine.\n\n```c\n#pragma omp parallel for\nfor(int i=0; i\u003c16; i++) {\n\t\t// S++\n}\n```\n\nLet an asterisk `*` represent the work done in each iteration of the loop and let the threads associated with this block of code be `0, 1, 2, 3`. This is how the work distribution across threads would look like.\n\n```c\nThread 0: ****\nThread 1: ****\nThread 2: ****\nThread 3: ****\n```\n\nEssentially, the loop is divided into chunks of $chunksize$ iterations each and the chunks are handed out to the threads in order. When $chunksize$ is not specified, OMP simply divides the $N$ iterations as uniformly as possible, giving each thread one contiguous chunk of roughly $\frac{N}{num\_threads}$ iterations. Let’s say I used `schedule(static, 2)` (a compilable sketch of this clause follows). 
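\n\nThis sketch is my own example, not from the original notes; `omp_get_thread_num()` lets you see the round-robin hand-out when compiled with `gcc -fopenmp`:\n\n```c\n#include \u003comp.h\u003e\n#include \u003cstdio.h\u003e\n\nint main(void) {\n\t// schedule(static, 2): iterations are handed out round-robin in chunks of 2.\n\t// With 4 threads, thread 0 runs {0,1, 8,9}, thread 1 runs {2,3, 10,11}, ...\n\t#pragma omp parallel for schedule(static, 2)\n\tfor (int i = 0; i \u003c 16; i++)\n\t\tprintf(\"iteration %d ran on thread %d\\n\", i, omp_get_thread_num());\n}\n```\n\n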
We would get work division like below:\n\n```c\nThread 0: ** **\nThread 1: ** **\nThread 2: ** **\nThread 3: ** **\n```\n\nEach thread still ends up with roughly $\frac{N}{num\_threads}$ iterations of work in total, but the work is now handed out in chunks of size 2 in a round-robin manner.\n\nWhen we use `dynamic`, the work division looks like this:\n\n![omp-1](/images/omp-1.webp)\n\nIf we’re working with data in a manner where preserving thread locality is crucial, `dynamic` is a terrible choice, but\n\n\u003e Take as an example the case where the time to complete an iteration grows linearly with the iteration number. If the iteration space is divided statically between two threads the second one will have three times more work than the first one and hence for 2/3 of the compute time the first thread will be idle. Dynamic schedule introduces some additional overhead but in that particular case will lead to much better workload distribution. A special kind of `dynamic` scheduling is `guided`, where smaller and smaller iteration blocks are given to each task as the work progresses.\n\nAs mentioned before, OMP does **not** make your code magically thread-safe.\n\n![omp-2](/images/omp-2.webp)\n\nThis will not work as expected. To make a specific line of code execute atomically we can add the pragma `#pragma omp atomic` to instruct OMP that the next instruction must be executed atomically. To instruct OMP that an entire block of code must be executed atomically we can use the pragma `#pragma omp critical` on a code block.\n\nHowever, let’s say we were trying to parallelize the summing of an array.\n\n```c\nfor(int i=0; i\u003cn; i++) sum += arr[i];\n```\n\nIf we parallelized this using `atomic`, every iteration would acquire and release a lock and the code is hardly parallel anymore. For special types of operations like accumulation we can instead use\n\n```c\nint sum = 0;\n#pragma omp parallel for reduction(+:sum)\nfor(int i=0; i\u003cn; i++)\n\t\tsum += arr[i];\n```\n\n`reduction` has far less overhead than `critical` or `atomic`. It essentially instructs each thread to keep its own accumulator and then sum up the per-thread accumulators at the very end. This reduces the number of lock acquire/release operations we have to perform to `num_threads` instead of $N$.\n\n## Execution model\n\n![omp-3](/images/omp-3.webp)\n\nIn general, this is how OMP handles the parallel execution of code. However, OMP is pretty advanced and we can do `async` tasks as well using the `task` pragma. Instead of having to deal with an implicit barrier at every join after a `fork`, we can continue execution and only `wait` (`#pragma omp taskwait`) when we really need to wait for a dependency to finish computing.\n\n# References\nThese notes are quite old, and I wasn't rigorously collecting references back then. 
If any of the content used above belongs to you or someone you know, please let me know, and I'll attribute it accordingly.", - "lastmodified": "2026-01-11T05:06:22.708922904+05:30", + "lastmodified": "2026-01-11T10:59:35.066129292+05:30", "tags": [ "domain-cs-systems-high-performance-computing" ] }, "/blog/profiling-a-program": { "title": "Profiling a Program", "content": "# Profiling Programs\n\nAny examples shown below are from running the profiler on a small piece of code I wrote to test the Goldbach Conjecture, usually run until $N \\leq 5000$.\n[Sample Goldbach Conjecture test program to profile](https://gist.github.com/akcube/bbb98e9a97321c452cc406de66d49d03)\n\n## `time`\n\nThe simplest way to profile a program is perhaps to run the Linux `time` command, which simply executes the given program and returns 3 values.\n\n- `real` - The elapsed real time between invocation and termination\n- `user` - The amount of time spent by the calling process and its children executing instructions in user space\n- `system` - The amount of time spent by the kernel in kernel space performing tasks on behalf of the calling process and its children\n\nIt’s the quickest and simplest way to profile a program; however, it gives very little information to go off of. _Which part of the code is the bottleneck? Are algorithms or memory usage the bottleneck?_\n\n\u003e **Program Instrumentation** A natural idea to benchmark code is to insert a tiny chunk of code just before and after the function call site, like so:\n\u003e\n\u003e`tick()` `foo()` `tock()`\n\u003e\n\u003eTick and tock are small bits of code that log information like CPU time or wall-clock time and other state info before and after the function call. We call this “Instrumenting” the code. This would essentially give us the time each function in our code takes to execute, but it is understandably laborious to do, especially in large codebases.\n\u003e\n\u003eThe obvious solution here is to offload this manual work to the compiler. There are two ways to do compile-time program instrumentation.\n\u003e\n\u003e1. The `-pg` flag - We discuss this in great detail in the section about the `gprof` program.\n\u003e2. The `-finstrument-functions` flag - This essentially allows us to write our own custom `tick` and `tock` functions which the compiler automatically instruments every function in our code with. We can optionally exclude functions as well. Details can be found in [this Jacob Sorber video](https://www.youtube.com/watch?v=M6RCUiZzl8Y).\n\n## `gprof`\n\nThe `gprof` tool relies on compile-time program instrumentation tactics to produce an execution profile of a program.\n\n**How to use?**\n\n1. Compile and link the source code with the `-pg` flag.\n2. Run the generated executable normally. If your program takes some input, it might be worthwhile to run the program on both slow and fast cases.\n3. Run `gprof a.out gmon.out \u003e fast-tc.txt`. Here, `a.out` is the name of your executable, and `gmon.out` is the name of the file created upon initial execution. Since you might be profiling different source codes/test cases/compiler options, it’s good practice to pipe the output to well-named files.\n\nIt provides two main information _views_: a **Flat Profile** and a **Call Graph**.\n\n- **Flat Profile:** It shows how much time our program spends in each function and the number of times it is called. 
It gives concise information about which functions were the big bottlenecks in the program.\n\n```python\nFlat profile:\n\nEach sample counts as 0.01 seconds.\n % cumulative self self total \n time seconds seconds calls ms/call ms/call name \n 77.01 2.11 2.11 2497 0.85 1.05 verify_odd_goldbach\n 18.98 2.63 0.52 1216197574 0.00 0.00 is_prime\n 4.01 2.74 0.11 1 110.00 110.00 prec\n 0.00 2.74 0.00 4995 0.00 0.53 verify_goldbach\n 0.00 2.74 0.00 2498 0.00 0.00 verify_even_goldbach\n```\n\nMost of the columns are pretty self-explanatory. `self seconds` is the amount of time spent in that function call frame and `calls` is the number of times it was called. `total ms/call` is the total amount of time spent in the function's call frame **+ its descendants** per call.\n\n- **Call Graph**: For some function $f$, it shows the functions which called it and which other functions it called, and how many times they were called. It also provides an estimate of how much time was spent in the subroutines of each function. This can suggest places where you might try to eliminate function calls that use a lot of time.\n\n```python\nindex % time self children called name\n \u003cspontaneous\u003e\n[1] 100.0 0.00 2.74 main [1]\n 0.00 2.63 4995/4995 verify_goldbach [2]\n 0.11 0.00 1/1 prec [5]\n-----------------------------------------------\n 0.00 2.63 4995/4995 main [1]\n[2] 96.0 0.00 2.63 4995 verify_goldbach [2]\n 2.11 0.52 2497/2497 verify_odd_goldbach [3]\n 0.00 0.00 2498/2498 verify_even_goldbach [6]\n-----------------------------------------------\n 2.11 0.52 2497/2497 verify_goldbach [2]\n[3] 96.0 2.11 0.52 2497 verify_odd_goldbach [3]\n 0.52 0.00 1216186360/1216197574 is_prime [4]\n-----------------------------------------------\n 0.00 0.00 11214/1216197574 verify_even_goldbach [6]\n 0.52 0.00 1216186360/1216197574 verify_odd_goldbach [3]\n[4] 19.0 0.52 0.00 1216197574 is_prime [4]\n-----------------------------------------------\n 0.11 0.00 1/1 main [1]\n[5] 4.0 0.11 0.00 1 prec [5]\n-----------------------------------------------\n 0.00 0.00 2498/2498 verify_goldbach [2]\n[6] 0.0 0.00 0.00 2498 verify_even_goldbach [6]\n 0.00 0.00 11214/1216197574 is_prime [4]\n-----------------------------------------------\nIndex by function name\n\n [4] is_prime [6] verify_even_goldbach [3] verify_odd_goldbach\n [5] prec [2] verify_goldbach\n```\n\nReading this is simple and mostly self-explanatory as well. Each function is assigned a _unique id_ number. Now, in each entry of the table, the line containing the index id in the leftmost column is the current function. Every line in the entry above this line is a function that _called_ the current function, and everything below it is a function that _was called_ by the current function.\n\nThe good part about it is that it is extremely fast and provides _some_ worthwhile profile information about the program.\n\nThe primary issue with `gprof` is that it requires the executable to be instrumented, which means recompiling the program with extra profiling code baked into the binary. To be more specific, the extra instrumenting functions called before and after every function call site add extra overhead to each function call. 
This can skew the profile information returned, especially when dealing with small, fast functions which are called a huge number of times.\n\nTo avoid having to deal with skewed results due to instrumentation, we might consider using a different method of profiling: **sampling**.\n\n## `valgrind --tool=callgrind`\n\n`callgrind` is similar to `gprof` in the sense that it also provides a flat profile and call graph, except it is a lot more visually appealing, and because it runs on `valgrind`, it can capture a lot more detailed information about the executable. However, this extra information and visual appeal come at a major cost: running the program on `valgrind` is computationally much more expensive.\n\nFor context, compiling the instrumented `-pg` program and executing it with no optimizations took 14.563 seconds total as measured by `time`. Running it using `callgrind` on the other hand took a whopping 3 mins and 15 seconds total. Running with `-O3` optimizations turned on, we find that `-pg` takes 0.622 seconds to run while `callgrind` took 1 min and 1 second.\n\n`callgrind` works by emulating a CPU and observing the program's execution at every point in time, rather than sampling it. This slows down the program overall but slows down all parts by relatively the same amount, so the final ratios/percentages it returns are fairly accurate and not skewed by any external overheads. We can use tools like `kcachegrind` to present the output of `callgrind` in a very visually pleasing manner. `kcachegrind` provides us with all the information that `gprof` does and more. We also get to see a line-by-line analysis of the source code showing what percentage of time a particular instruction/line of code is executed.\n\n![kcachegrind-1](/images/kcachegrind-1.webp)\n\n## `perf`\n\nYou can run `perf` using the command `sudo perf stat ./goldbach`. It uses a type of sampling which relies on dedicated hardware performance-counter registers used for profiling. Sample output looks something like this,\n\nNote: **Could not run it for `goldbach` because it was unable to track cache-misses, branch-misses, instructions, cycles, etc. due to lack of hardware support.** (Unsure if it is due to the VM or AMD CPUs.)\n\n```python\nPerformance counter stats for 'dd if=/dev/zero of=/dev/null count=1000000':\n\n 5,099 cache-misses # 0.005 M/sec (scaled from 66.58%)\n 235,384 cache-references # 0.246 M/sec (scaled from 66.56%)\n 9,281,660 branch-misses # 3.858 % (scaled from 33.50%)\n 240,609,766 branches # 251.559 M/sec (scaled from 33.66%)\n 1,403,561,257 instructions # 0.679 IPC (scaled from 50.23%)\n 2,066,201,729 cycles # 2160.227 M/sec (scaled from 66.67%)\n 217 page-faults # 0.000 M/sec\n 3 CPU-migrations # 0.000 M/sec\n 83 context-switches # 0.000 M/sec\n 956.474238 task-clock-msecs # 0.999 CPUs utilized\n\n 0.957617512 seconds time elapsed\n```\n\nThe output is useful in determining which type of operation is becoming the major bottleneck in the program. `perf report` can be used to get the flat profile and call graph for the program as well.\n\nWe can quickly determine if the program is parallelized or not (_0.999 CPUs utilized_, for example, certainly means that the program is not parallelized), and also quickly determine what is severely limiting the program: cache misses or instruction execution. Many times, memory accesses are a limiting factor due to the slow speeds of memory, and `perf` can quickly provide you statistics about that.\n\nHardware sampling is also not the most accurate it can be. 
It’s better than instrumentation but not as accurate as `valgrind` either.\n\n## `PIN`, `PAPI`, `VTune`, etc.\n\nA plethora of other tools exist as well. `PIN` is a fairly advanced tool that works by instrumenting binary code. `PAPI` is an API that helps profile specific sections of code. `VTune` is a similar profiling tool provided by Intel that has received high praise.\n\n# Profiling via Time Measurement vs Stack sampling\n\nWhile profiling by measuring the time of execution of different parts of the program is taught as the mainstream way to locate “hotspots” in code, there’s another very interesting idea one might use to locate bottlenecks. Most of the following content comes from what [Mike Dunlavey](https://stackoverflow.com/users/23771/mike-dunlavey) pitches on multiple Stack Overflow posts. The following is long but **definitely** worth going over once.\n\n1. [Why _not_ use `gprof` and similar tools for profiling?](https://stackoverflow.com/questions/1777556/alternatives-to-gprof/1779343#1779343)\n2. [How to profile code using stack samples and why is it often better?](https://stackoverflow.com/questions/375913/how-can-i-profile-c-code-running-on-linux/378024#378024)\n3. [A very simple explained example workflow](https://stackoverflow.com/questions/890222/analyzing-code-for-efficiency/893272#893272)\n4. [A more complex example explained](https://stackoverflow.com/questions/926266/performance-optimization-strategies-of-last-resort/927773#927773)\n\n## A brief overview of what the above posts try to convey\n\nA lot of profiling software looks at the correct data: it takes samples of the call stack, which contain enough info to infer where the hotspots in the code are, but most of it ends up “summarizing” the data and loses a lot of valuable information in the process.\n\nLet the cloud here represent the bottleneck,\n\n![dundundun-1](/images/dundundun-1.webp)\n\nThe measurement in the profile tells us what function took up a major chunk of the time, but it fails to give a clear understanding of _why_, and it is this information that we are trying to hang on to. Without knowing the “why?”, it’s as good as an educated guess when trying to identify the hotspots. When sampling call stacks, we get a detailed picture of the sequence of events leading up to something and get a much better idea of _why_ something might be eating up a lot of time and what specifically to address.\n\nThis specific information is often lost when profilers end up summarizing the data for end users to see. To quote,\n\n\u003e Measurement is horizontal; it tells you what fraction of time specific routines take. Sampling is vertical. If there is any way to avoid what the whole program is doing at that moment, _and if you see it on a second sample_, you've found the bottleneck. That's what makes the difference - seeing the whole reason for the time being spent, not just how much.\n\nThe stack samples don't just tell you how much inclusive time a function or line of code costs, they tell you why it's being done, and what possible silliness it takes to accomplish it.\n# References\nThese notes are quite old, and I wasn't rigorously collecting references back then. 
If any of the content used above belongs to you or someone you know, please let me know, and I'll attribute it accordingly.", - "lastmodified": "2026-01-11T05:06:23.169047261+05:30", + "lastmodified": "2026-01-11T10:59:35.531358314+05:30", "tags": [ "domain-cs-systems-high-performance-computing" ] }, "/blog/public-key-cryptography-coming-up-with-rsa": { "title": "Public Key Cryptography, Coming Up With RSA", "content": "# Mathematical operations across number systems\n\nWhy do we use the decimal number system? We as humans have been taught to count, add, subtract, multiply and divide all in the base 10 number system. We could've just as easily used binary, or maybe even Roman numerals. But we chose 10. Why?\n\nIn fact, humans didn't always use the decimal number system. Back in the day, counting began with something akin to tally marks. However, it was pretty much impossible to do much more than count small numbers. Working with large numbers meant we had to make a LOT of marks.\n\nEventually, this led to the birth of Roman numerals. But even this wasn't great. It was very difficult to add and multiply numbers. Comparisons were not easy either. Over the evolution of number systems, at some point, we decided to settle on decimal because it provided a convenient representation to add and compare numbers in. Multiplication is still slow but not as bad as it was with the Roman or tally systems.\n\n![pkc-1](/images/pkc-1.webp)\n\nA rough comparison of the different systems of representation.\n\nI watched this video recently and I think it provides a great overview of how our choice of number systems came to be what it is and how it might change in the future.\n\n[Why We Might Use Different Numbers in the Future - Up and Atom](https://www.youtube.com/watch?v=JS40jPaogM4)\n\nOur choice of number system might change. If we find a different base that gives us something more, there's a very good chance that we might indeed ditch the decimal number system and switch to something else altogether.\n\nTwo cool number systems to consider are the base 8 and base 12 number systems.\n\n## Making multiplication _faster?_\n\nWell, we might not change how fast computers are able to multiply two numbers, but what about humans?\n\nConsider multiplying in base 10: we're able to multiply numbers quickly because we remembered the times tables for the different numbers as kids. The _easiest_ times tables to remember are the ones for numbers that are a factor of the base. For decimal, we have the easy 2 and 5 times tables.\n\nThis is intuitively understood from the fact that dividing 10 by those numbers gives us an integer. This has the overarching implication that their times tables follow a regular pattern that is easier to remember.\n\nConsider octal (base 8) and duodecimal (base 12).\n\nOctal has a smaller set of times tables while having the same number of (non-trivial) factors: 2 and 4. Further, it gives us an additional property: simple repeated halving of the base. $\frac{8}{2} = 4 \text{ and } \frac{4}{2} = 2 \text{ and } \frac{2}{2} = 1$.\n\nDuodecimal has bigger times tables but even more factors, four of them: 2, 3, 4 and 6. It also has better halving than decimal. 
$\frac{12}{2} = 6 \text{ and } \frac{6}{2} = 3$.\n\nThese simple properties might not make them any better to work with for computational purposes, but for humans, they might make handling different computations easier :)\n\n# Public Key Cryptography \u0026 RSA Encryption\n\nSo far we only talked about ways to make these operations faster, because in general, we always consider faster to be better. However, there is a field where making things slower is better. That is the field of public-key cryptography.\n\nLet's say we have two people, Alice and Bob, trying to communicate with each other. Most cryptography revolving around how they can securely send each other a message relies on the fact that they both had previously agreed upon a secret key. If this setup were possible, they could do something as simple as a Caesar cipher to encrypt the message. However, in many situations, it is not possible for Alice and Bob to previously agree upon such a key.\n\nThe public key cryptography problem is the problem of sending this key itself privately between Alice and Bob.\n\n**Diffie-Hellman** key exchange is a well-known algorithm that proposes a great solution to this problem. It is nice in theory, but it relies on the fact that we can mathematically come up with a construct where we are able to generate a trapdoor function with the following properties.\n\n1. Let's say Alice and Bob each get to keep a public key and a private key. $A_{public}, A_{private}, B_{public}, B_{private}$\n2. We want to have a method where we can **encrypt** a message using the public key quickly.\n3. But at the same time, the **decryption** process must be very slow using the public key.\n4. However, the decryption process must be very quick using the private key.\n\n## The trapdoor\n\nWe use the idea of modular inverse here to come up with a sound mathematical model of such a trapdoor function.\n\n### The slow operation\n\nFor our trapdoor to work, we need some operation that is extremely slow to compute. The operation that we will be looking at is the factorization of an integer into its prime factors. Integer prime factorization is a **hard** problem. There is no known polynomial-time algorithm that can factorize a number into its primes.\n\n_However,_ multiplying the factors to get the original number is **easy**.\n\nI found [this visualization](https://www.khanacademy.org/computer-programming/time-complexity-exploration/1466763719) quite nice to understand the idea from.\n\n![pkc-2](/images/pkc-2.webp)\n\n![pkc-3](/images/pkc-3.webp)\n\nFor smaller inputs, integer prime factorization is quite fast. But with an increasing number of bits in the input, the algorithm shows its exponential complexity. It becomes pretty much infeasible for any computational device that we have today to solve the problem in a reasonable amount of time. Notice that multiplication, however, remains quite fast.\n\nNow, to understand the RSA algorithm better, it is important to have an understanding of [Wilson's Theorem, Fermat's Little Theorem \u0026 Euler's Totient Function](/blog/wilson-s-theorem-fermat-s-little-theorem-euler-s-totient-function). Once we have those tools to help us, we can build the rest of the machinery we need to construct our algorithm.\n### One last trapdoor\n\nLet's suppose we had some integer $m$ and we performed the following operation on it.\n\n$$ m^e \ mod \ n \equiv c $$\n\nNotice that computing $c$ is **easy**. We can compute the above expression quickly using techniques like binary exponentiation. 
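\n\nAs a quick aside, here's what binary exponentiation looks like in code. This is a minimal sketch; the helper name `binpow` and the choice of types are mine, not something from the original post:\n\n```c\n#include \u003cstdio.h\u003e\n\n// Computes (base^exp) mod m by repeated squaring: O(log exp) multiplications.\n// Assumes m is small enough that (m-1)*(m-1) fits in an unsigned long long.\nunsigned long long binpow(unsigned long long base, unsigned long long exp, unsigned long long m) {\n\tunsigned long long result = 1 % m;\n\tbase %= m;\n\twhile (exp \u003e 0) {\n\t\tif (exp \u0026 1) result = (result * base) % m; // lowest bit set: multiply this power in\n\t\tbase = (base * base) % m; // square for the next bit\n\t\texp \u003e\u003e= 1;\n\t}\n\treturn result;\n}\n\nint main(void) {\n\tprintf(\"%llu\\n\", binpow(5, 117, 19)); // 5^117 mod 19\n}\n```\n\n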
However, given just $c$, $n$ and $e$, it is **very hard** to compute $m$. Any algorithm that attempts to do this would have to perform a lot of trial and error.\n\n**Notice that there are no proofs for why these trapdoor functions are like so. If it can be proven that we can compute these \"inverse\" operations in polynomial time, we would be able to break the RSA encryption algorithm. The safety of RSA hinges on the hope that no efficient algorithm for these inverse operations exists, a hope related to (though not guaranteed by) $P \neq NP$.**\n\n## The RSA Algorithm\n\nNow all that is left to do is to tie up these mathematical trapdoors we've constructed into an algorithm that can effectively solve the key exchange problem.\n\nWe will begin by demarcating the different variables used in the algorithm and the domain they are visible in.\n\n**Private domain:**\n\n1. The private key $d$. This contains info about the prime factorization of $n$.\n2. The decoded message $m$.\n\n**Public domain:**\n\n1. The public encryption key $E$, which consists of the following two things\n - A public exponent $e$\n - The product of two large primes $n$. Note that the factorization is not known in the public domain. Only the product is visible.\n2. The encrypted message $c$.\n\nNote that the variables in the private domain are visible **ONLY** to their owner. They must never be sent in the public domain. Only the public encryption key and encoded message are sent in the public domain.\n\n### Working\n\nLet's suppose that Bob wants to send a secret message to Alice. The secret message here is represented by an integer $m$. Notice that because Alice's encryption key is available in the public domain, Bob can use Alice's encryption key to encrypt the message as follows.\n\n1. **Encryption**\n \n Bob performs the following operation on his message $m$ to encrypt it.\n \n $$ m^e \ mod \ n \equiv c $$\n \n2. **Sending the message**\n \n Bob now sends his encrypted message $c$ in the public domain. Notice that in the public domain, only the values of $c$, $e$ and $n$ are known. This is **not** enough to compute the value of $m$ easily. It is a hard problem and computationally not feasible to solve. Hence no potential attacker in the public domain can compromise / gain access to the secret message $m$.\n \n3. **Decoding**\n \n Once Alice has received the message $c$, she needs a fast way of computing back $m$. Recall that $n$ was the product of two huge primes and Alice knows the prime factorization of $n$. Now, she needs to somehow use this additional knowledge to quickly compute the inverse of the encryption. For this, we will go back to Euler. Notice that, for $m$ co-prime to $n$,\n \n $$ \n \begin{aligned} \n m^{\phi(n)} \equiv 1 \ mod \ n \\ \implies m^{k\phi(n)} \equiv 1 \ mod \ n \\ \implies m\times m^{k\phi(n)} \equiv m \ mod \ n \\ \implies m^{k\phi(n) + 1} \equiv m \ mod \ n \n \end{aligned}\n $$\n \n Recall that Alice needed an easy way to get the **inverse** of the encryption that Bob performed. That is, if Bob computed $c \equiv m^e \ mod \ n$, Alice needed an integer $d$ such that $(m^e)^d \ mod \ n \equiv m$.\n \n Notice that this means that she needed $m^{e \times d} \equiv m \ mod \ n$. From the above-derived equation, we can see how the puzzle finally fits together.\n \n If we set\n \n $$ d = \frac{k \times \phi(n) + 1}{e} $$\n \n (picking $k$ such that $e$ divides $k \times \phi(n) + 1$, which is possible whenever $e$ is co-prime to $\phi(n)$), then $e \times d = k\phi(n) + 1$, exactly as required. Notice that the value of $d$ **depends on $\phi(n)$.** And $\phi(n)$ is a **hard** problem to compute if the factorization of $n$ is unknown. 
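\n \n For completeness, here is the step this asymmetry leans on: when $n$ is the product of two distinct primes $p$ and $q$ (as in RSA) and the factorization is known, the totient is immediate,\n \n $$ \phi(n) = \phi(p \times q) = (p-1) \times (q-1) $$\n \n while computing $\phi(n)$ *without* the factorization is believed to be essentially as hard as factoring $n$ itself.\n \n 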
Therefore, even if $n$ and $e$ are visible in the public domain, an attacker cannot compute $d$ as he/she cannot compute the value of $\phi(n)$ easily without knowing the prime factorization of $n$.\n \n However, Alice knows the prime factorization of $n$! This means that she can compute and store the value of $d$ privately and use it to **decode** any encrypted message sent to her **quickly**.\n \n\nAnd that's it! We have an algorithm that solves the key exchange problem effectively by using the idea of modular inverse and number theory to generate trapdoor functions that allow us to construct this beautiful cryptography algorithm, RSA.\n# References\nThese notes are old and I did not rigorously hoard references back then. If some part of this content is yours or you know where it's from, then do reach out to me and I'll update it. \n1. Professor [Kannan Srinathan's](https://www.iiit.ac.in/people/faculty/srinathan/) course on Algorithm Analysis \u0026 Design in IIIT-H\n2. [Time Complexity (Exploration) - Khan Academy](https://www.khanacademy.org/computer-programming/time-complexity-exploration/1466763719)\n3. [Why We Might Use Different Numbers in the Future - Up and Atom](https://youtu.be/JS40jPaogM4?si=2DM7YS6xnipvUO5C) (Great Channel, recommend checking out)", - "lastmodified": "2026-01-11T05:06:23.358902166+05:30", + "lastmodified": "2026-01-11T10:59:35.722302086+05:30", "tags": [ "domain-cs-algorithms-analysis", "domain-math-number-theory" ] }, "/blog/randomization-primality-testing-algorithms": { "title": "Randomization, Primality Testing Algorithms", "content": "# Randomized Algorithms\nSo far, we've discussed a lot of cool algorithms which we can use to solve many different problems. However, all these problems had a polynomial-time solution that we were able to come up with. This is not true for all problems. There are many **hard** problems for which there exists **no known** polynomial-time algorithm. A lot of these problems are quite important and an efficient way to solve them is a must. Without a polynomial-time solution to them, it is not feasible to compute their solutions for large inputs on any known computational device that man has access to.\n\nIn situations like these, we try to _probabilistically_ solve the problem. We sacrifice being 100% accurate for an immense boost in speed. In some sense our algorithm is _wrong_. It does not pass the test of giving the correct answer to _every_ test case. But if it can do so with great accuracy, it might be the best \"solution\" we have.\n\n## A more \"mathematical\" explanation\n\nLet's suppose we're given a **hard** problem that has no known polynomial-time solution. However, what we do have is a set of $n$ _efficient_ but **not-correct** algorithms which output the correct answer for only $\frac{2}{3}$ of all valid inputs. Let us denote the set of these \"probabilistic\" algorithms by\n\n$A = \{ a_1, a_2, \dots,a_n\}$\n\nNow, let's say I pick some random input $i$ and give it to my algorithm $a_1$. Since it gives me the correct answer for $\frac{2}{3}$ of all possible inputs, my chances of getting a wrong answer are equal to $1-\frac{2}{3} = \frac{1}{3}$.\n\nThis is not terrible, but still not great. However, notice that I still have $n-1$ other algorithms that give me a probabilistically correct answer. If we run some $m \leq n$ such algorithms on the same input, notice that the chances of getting a wrong answer diminish to $\left(\frac{1}{3}\right)^m$, assuming the algorithms err independently. 
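\n\nConcretely, with $m = 5$ runs:\n\n$$ \left(\frac{1}{3}\right)^5 = \frac{1}{243} \approx 0.41\% $$\n\n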
So after running just 5 such algorithms, our chances of getting a wrong answer are already below half a percent. This is a **very** good approximation and we can always do this since running 5 such efficient algorithms is _always_ much faster than running an exponential-time algorithm.\n\n# Primality testing\n\nA known computationally hard problem is primality testing. There is no easy way to test if a number is prime or not without iterating through at least all potential factors $\leq \sqrt n$. Notice that here, the number is given as input in bits, so for every added bit we have an exponential increase in complexity. The actual complexity for an $n$-bit input is $\sqrt{2^n} = 2^{\frac{n}{2}}$. This is assuming that we are able to test divisibility in $O(1)$, which might not be true for large numbers.\n\nPrimality testing is also a very important problem. Algorithms like RSA are used worldwide to secure communication in web browsers, email, VPNs, etc., and they rely on us knowing very large prime numbers. Large prime numbers are difficult to find, especially if we don't have an algorithm that can test primality very quickly. But since it is a **hard** problem, we have come up with probabilistic algorithms to efficiently \"almost\" solve this problem.\n\n## Fermat Primality Test\n\nJust recently, we talked about Fermat's little theorem ([Wilson's Theorem, Fermat's Little Theorem \u0026 Euler's Totient Function](/blog/wilson-s-theorem-fermat-s-little-theorem-euler-s-totient-function)) which gives us the following equation. For any prime $p$ and _any_ integer $a$ co-prime to $p$,\n\n$$ a^{p-1} \equiv 1 \ mod \ p $$\n\nThis equation always holds for primes and in general does _not_ hold for composite numbers. Notice that for different values of $a$, we essentially have an all-new algorithm to test the primality of $p$. If the equation does not hold for _some_ value $a$ co-prime to $p$, then we know for sure that the number is not prime. We can prove that by trying all values of $2 \leq a \leq p-2$ we can indeed guarantee that $p$ is prime. However, doing that would be worse than just iterating over all its factors and testing primality. Hence we can choose to just try the algorithm for many different values of $a$. This turns out to be a _very efficient_ probabilistic test for checking the primality of some number $p$.\n\nBelow is the implementation of such an algorithm that relies on randomness.\n\n```cpp\n// binpow(a, e, m) computes a^e mod m via binary exponentiation\nbool fermatPrimalityTest(int p, int rep=10){\n\t\tif(p \u003c= 3) return p == 2 || p == 3;\n\t\t\n\t\tfor(int _=0; _\u003crep; _++){\n\t\t\t\tint a = rand()%(p-3) + 2; // picks a uniformly from [2, p-2]\n\t\t\t\tif(binpow(a, p-1, p) != 1) return false; // Fermat witness\n\t\t}\n\t\treturn true;\n}\n```\n\nThis algorithm will return the right answer most of the time. Further, notice that we don't care if $a$ is co-prime to $p$ or not. The condition is imposed on this version of Fermat's because if $p$ divides $a$ then $a \equiv 0 \ mod \ p$. But this will not be an issue for the values of $a$ that we are picking.\n\nWhile performing the check, if our equation fails for some base $a$, then we call $a$ a **Fermat witness** for the compositeness of $p$. If our number $p$ passes the test for some base $a$ but $p$ is actually composite, then we call base $a$ a **Fermat liar**.\n\nA natural question to ask here is: how many composite numbers pass this test very frequently? 
Are there any composite numbers that pass this test for **all** $a$ co-prime to $n$, maybe?\n\n### Carmichael numbers\n\nSadly, there are such composite numbers for which this test returns true for **all** $a$ co-prime to $p$. They are the [Carmichael numbers](https://en.wikipedia.org/wiki/Carmichael_number). We can identify these false positives only if we try bases that are not co-prime to $p$. This makes Fermat's primality test a weak prime test. However, it is not very bad and Carmichael numbers are fairly rare. There exist only 646 such numbers $\leq 10^9$ and only 1,401,644 such numbers $\leq 10^{18}$. This is still reasonable for such a fast and efficient algorithm.\n\n## Miller-Rabin Primality Test\n\nThe idea behind this primality test is somewhat of an extension of Fermat's. Let's say we are testing the primality of some integer $p$. If $p$ is even, it is obviously not prime for all values of $p \neq 2$.\n\nLet us eliminate all even numbers (excluding the trivial case of 2). Now, given that $p$ is an odd number whose primality we're testing, $p$ being odd $\implies p-1$ is even. This means that it has _at least_ one factor of two.\n\nLet us write $p-1 = 2^k\cdot q$; we are essentially factoring out all the factors of $2$ from $p-1$. From this construction, it must be true that $q$ is odd. Substituting this back in Fermat's test we can write it as\n\n$$ a^{p-1} \equiv 1 \ mod \ p \iff a^{2^k \cdot q}-1 \equiv 0 \ mod \ p $$\n\nNotice that we can factorize this expression further. Any term of the form $x^2-1$ factors as $(x+1)(x-1)$. So we can write the above term as\n\n$$ a^{2^k \cdot q}-1 \equiv 0 \ mod \ p \iff (a^{2^{k-1}\cdot q}+1)(a^{2^{k-1}\cdot q}-1) \equiv 0 \ mod \ p $$\n\nNotice that the 2nd term on the RHS can be factorized further until we run out of powers of 2. That is, we can factorize it $k-1$ times to get the following expression.\n\n$$ \n\begin{aligned}\n(a^{2^{k-1}\cdot q}+1)(a^{2^{k-1}\cdot q}-1) \equiv 0 \ mod \ p \\ \iff (a^{2^{k-1}\cdot q}+1)(a^{2^{k-2}\cdot q}+1)\cdots(a^q+1)(a^q-1)\equiv 0 \ mod \ p \n\end{aligned}\n$$\n\nThis equation must be true for $p$ to be prime. Since a prime dividing a product must divide at least one of the factors, this means that _at least_ one of these terms must be divisible by $p$. That is, either\n\n$$ a^q - 1 \equiv 0 \ mod \ p \iff a^q \equiv 1 \ mod \ p $$\n\nholds, or for some $0 \leq r \leq k-1$,\n\n$$ a^{2^rq}+1 \equiv 0 \ mod \ p \iff a^{2^rq} \equiv -1 \ mod \ p $$\n\nholds.\n\nIf none of these statements hold (for all values of $r$) then we know that $p$ **must** be composite. We call the base $a$ a _witness_ for the compositeness of $p$. However, recall that this test is only probabilistic. It is possible for certain bases to pass this test even for composite $p$. We call such a base a _strong liar_.\n\n## Why Miller-Rabin over Fermat?\n\nThe nice part about this test is that unlike with Fermat, there are _no_ numbers like the Carmichael numbers where all non-trivial bases lie.\n\nWe give the name [Strong pseudoprime](https://en.wikipedia.org/wiki/Strong_pseudoprime) to composite numbers which pass the Miller-Rabin test. From the Wiki,\n\n\u003e A composite number $n$ is a strong pseudoprime to at most one quarter of all bases below $n$. 
Of the first 25,000,000,000 positive integers, there are 1,091,987,405 integers that are probable primes to **base 2**, but only 21,853 of them are pseudoprimes, and even fewer of them are strong pseudoprimes.\n\nThe proof of the bound that for any composite $p$, the probability that a random integer $a \in [1, p-1]$ is a witness for the compositeness of $p$ is at least $\frac{3}{4}$ can be found here: [Primality Proving - Lecture Notes 12 from MIT](https://math.mit.edu/classes/18.783/2017/LectureNotes12.pdf)\n\nWe can take this even further!\n\nThe Miller-Rabin primality test can be made **deterministic** by _only_ testing all bases $a \leq 2(\ln p)^2$. The proof for this claim relies on the [Generalized Riemann Hypothesis](https://en.wikipedia.org/wiki/Generalized_Riemann_hypothesis) being true. However, if it does hold true, then we have a polynomial-time deterministic test for the primality of some number $p$!\n\n# References\nThese notes are old and I did not rigorously hoard references back then. If some part of this content is yours or you know where it's from, then do reach out to me and I'll update it. \n1. Professor [Kannan Srinathan's](https://www.iiit.ac.in/people/faculty/srinathan/) course on Algorithm Analysis \u0026 Design in IIIT-H", - "lastmodified": "2026-01-11T05:06:23.367883579+05:30", + "lastmodified": "2026-01-11T10:59:35.730835375+05:30", "tags": [ "domain-cs-algorithms-analysis", "domain-math-number-theory" ] }, "/blog/reasoning-acting-and-learning-a-survey-of-single-agent-llm-patterns": { "title": "Reasoning, Acting, and Learning : A Survey of Single-Agent LLM Patterns", "content": "\nLarge Language Models (LLMs) are fundamentally probabilistic engines designed to predict the next token. We've seen them achieve several impressive feats purely by throwing more data and compute at them. However, they still tend to occasionally hallucinate, have certain \"model-specific tendencies\", etc., which are difficult to overcome purely by training. A recent popular example is Gemini 3, whose performance varies by a sizable margin on coding tasks depending on the harness used to prompt it. There's been a good deal of research done in the space of prompting / agent architectures. \n\nIn this post, I'll attempt to survey the academic literature surrounding single-agent reasoning strategies: from linear prompt engineering techniques like **Chain-of-Thought (CoT)** to their natural non-linear successors like **Tree-of-Thoughts (ToT)** and **Graph-of-Thoughts (GoT)**. Finally, we will look at how to ground these \"thoughts\" in reality using action loops (**ReAct**) and persistent \"learnt\" memory (**Reflexion**).\n# Evolving Single-Agent Reasoning Strategies: From CoT to GoT\nThis section focuses on improving single-agent LLM reasoning beyond simple Input-Output (IO) prompting.\n## Chain-of-Thought (CoT)\n[Chain-of-Thought Prompting Elicits Reasoning in Large Language Models [Google Research, Brain Team]](https://arxiv.org/pdf/2201.11903)\n\nCoT at its core just suggests that LLMs are able to perform significantly better at solving hard ‘reasoning’ problems such as arithmetic, common-sense reasoning and symbolic reasoning when prompted to “think step-by-step” before generating a final verdict, versus just outputting the final answer. This attempts to mimic human problem-solving, where we start by breaking down a complex problem into digestible and ‘solvable’ chunks and then put the whole thing together to solve the complete complex problem. 
\n\nIn general, you could trivially make any ‘non-thinking’ variant of an LLM ‘think’ by adding this sentence\n\n\u003e “Let’s think through the problem step-by-step in `\u003cthink\u003e \u003c/think\u003e` tags before outputting the final answer”\n\nTo the prompt. Of course, the actual thinking models use Few-shot CoT and are fine tuned (SFT / RLHF / DRPO etc.) to perform better, but it’s still a cheap trick to elicit reasoning in non-thinking LLMs. Here’s an example interacting with Llama 4 Maverick:\n\n\u003e *\"What is the answer to the universe?\"*\n\n```\nWhat a profound and intriguing question! Unfortunately, there is no straightforward answer to this query, as it is a complex and multifaceted one that has puzzled philosophers, scientists, and thinkers for centuries.\n\nThe concept of \"the universe\" encompasses a vast and intricate web of existence, including matter, energy, space, and time. As such, attempting to distill a single \"answer\" to the universe is a daunting task, if not impossible.\n\nDifferent perspectives and disciplines offer various insights into the nature of the universe. For instance:\n\n1. Scientific perspective: From a scientific viewpoint, our understanding of the universe is based on empirical evidence and observations. The laws of physics, such as gravity, electromagnetism, and quantum mechanics, govern the behavior of the universe. However, there is still much to be discovered, and ongoing research continues to refine our understanding of the cosmos.\n\n2. Philosophical perspective: Philosophers have long grappled with the meaning and purpose of the universe. Some propose that the universe is governed by inherent laws and principles, while others argue that it is the result of chance or randomness. Others still explore the idea of a multiverse or the concept of a creator or higher power.\n\n3. Cosmological perspective: Cosmologists study the origins, evolution, and fate of the universe. The Big Bang theory suggests that the universe began as a singularity around 13.8 billion\n```\n\n\u003e *\"What is the answer to the universe? Let’s think through the problem step-by-step in `\u003cthink\u003e \u003c/think\u003e` tags before outputting the final answer\"*\n\n```\n\u003cthink\u003e\nThe question \"What is the answer to the universe?\" is a classic reference to Douglas Adams' science fiction series \"The Hitchhiker's Guide to the Galaxy.\" In the book, a supercomputer named Deep Thought is asked to calculate the \"Answer to the Ultimate Question of Life, the Universe, and Everything.\" After 7.5 million years of computation, Deep Thought reveals that the answer is 42. However, the characters then realize that they don't actually know what the ultimate question is. This narrative serves as a satirical commentary on the search for meaning and answers in life and the universe. Given this context, a straightforward or serious answer to the question isn't possible without more specific context or clarification on what \"the answer to the universe\" is supposed to mean. A humorous or thematic response related to \"42\" could be appropriate, or a more philosophical discussion about the nature of the question could be had.\n\u003c/think\u003e\n\nThe answer is, of course, 42. But what's the question again?\n```\n### Pros\n1. **Simplicity:** Compared to the others, CoT is relatively easy to implement, and is a very quick-and-cheap way to get some extra accuracy out of the models.\n2. 
**Interpretability:** Compared to single-shot prompts, the generated step-by-step reasoning provides very good insight into the LLM’s thought process and how it arrives at a decision. This reduces the likelihood of hallucination and also allows users to quickly and easily fact-check and build trust with the LLM’s analysis.\n### Cons\n1. **Lack of Exploration / Generality:** It has to follow a single reasoning path, which makes it less effective (compared to ToT) for handling more complex scenarios where there may be multiple cause-effect chains to debug or where parallel investigation paths are needed. (e.g., high latency could be a slow query or high connection wait time due to RPC spam or resource saturation – CoT might explore only one).\n## Tree-of-Thoughts (ToT) Reasoning\n[Tree of Thoughts: Deliberate Problem Solving with Large Language Models [DeepMind + Princeton]](https://arxiv.org/pdf/2305.10601)\n\nA picture is worth a thousand words, so here’s the picture.\n\n![pasted-image-20251204071654](/images/pasted-image-20251204071654.webp)\n### CoT-SC \n[Self-Consistency Improves Chain of Thought Reasoning in Language Models [Google Research, Brain Team]](https://arxiv.org/pdf/2203.11171)\n\nYou may ask what “CoT-SC” is here. In short, when LLMs do token prediction, there’s usually a “temperature” variable that is used to control the sampling from the token probability distribution. This temperature is what causes LLMs to give differently worded responses to the same query on different runs. Theoretically, if you set the temperature to 0, the LLM would always pick the next token with the highest probability and give you deterministic results. Google lets you set this value in [AIStudio](https://aistudio.google.com/prompts) for example. You can experiment by asking questions using temperature 0 and you’ll notice that the responses are deterministic. \n\nCoT-SC lets you capitalize on this ‘randomness’ by sampling *k* different CoTs and returning the most frequent output. The hope would be that with non-zero temperature, each CoT may explore different thought processes. \n\n\u003e*“CoT-SC improves upon CoT, because there are generally different thought processes for the same problem (e.g. different ways to prove the same theorem), and the output decision can be more faithful by exploring a richer set of thoughts. However, within each chain there is no local exploration of different thought steps, and the “most frequent” heuristic only applies when the output space is limited (e.g. multi-choice QA).”*\n### ToT\nTree of Thoughts (ToT) extends CoT by allowing the model to explore multiple reasoning paths simultaneously rather than a single linear chain. This is more or less inspired by research on human problem-solving, which suggests that humans search through a combinatorial problem space using a tree search similar to BFS / Monte-Carlo Tree Search (MCTS). In this tree, nodes represent ‘partial solutions’ and branches correspond to operators that modify them. ([A. Newell, H. A. Simon, et al. Human problem solving. Prentice-Hall, 1972. [CMU]](https://iiif.library.cmu.edu/file/Newell_box00018_fld01306_doc0001/Newell_box00018_fld01306_doc0001.pdf))\n\n*“A genuine problem-solving process involves the repeated use of available information to initiate exploration, which discloses, in turn, more information until a way to attain the solution is finally discovered.”*\n\nTo do ToT, we need the following 4 components. \n\n1. 
**Thought decomposition**: ToT explicitly benefits from better problem / thought decomposition. This doesn’t matter much in CoT since it’s a single LLM eliciting reasoning prowess. For ToT, each thought will be judged. So it’s important that each thought is “small” enough that LLMs can generate promising \u0026 diverse samples, yet “big” enough that LLMs can evaluate its prospect toward problem solving. \n\n2. **Thought Generator**: Given a thought ‘chain’, we need a way to generate *k* candidates for the next thought step. You can do something similar to CoT-SC, where you just sample *k* times independently with non-zero temperature, or you can ask the model to propose *k* different thoughts in one go, with awareness of the other proposed thoughts, and follow each branch as a separate thought. The former works better when the search space is rich and the latter when the search space is more constrained.\n\n3. **State Evaluator:** Given the ‘frontier’ or leaf-node in the thoughts of our tree search, we need a state evaluator to evaluate the progress that was made towards solving the problem. For example, you can imagine a chess engine that implements standard mini-max to some depth, but needs a way to *heuristically* evaluate the utility score of a position when it needs to prune branches. The paper proposes two strategies to heuristically evaluate state:\n\n 1. **Value:** A secondary ‘judge’ LLM reasons about a given state *s* and generates a scalar value (*score*) to associate the ‘thought node’ with a score, that can be used to rank and sort between the frontiers. The key point here is that each ‘frontier’ or ‘leaf-node’ **independently** has a value associated with it.\n\n\t \u003e*“ The basis of such evaluative reasoning can vary across problems and thought steps. In this work, we explore evaluation via few lookahead simulations (e.g. quickly confirm that 5, 5, 14 can reach 24 via 5 + 5 + 14, or “hot l” can mean “inn” via filling “e” in “ ”) plus commonsense (e.g. 1 2 3 are too small to reach 24, or no word can start with “tzxc”). While the former might promote “good” states, the latter could help eliminate “bad” states. Such valuations do not need to be perfect, and only need to be approximately helpful for decision making”*\n\n 2. **Vote:** Here, we vote *across* different frontiers. You have the judge LLM **compare** between different states and vote out ‘bad’ states by deliberately comparing the different frontiers. This is most similar to the concept of **differential diagnosis** and we’ll explore this in detail later.\n\n4. **Search Algorithm:** This is fairly simple, but given that the underlying data structure of this tree-style reasoning is well… a tree, you can explore, vote and continue the search using any well-known tree search algorithms. This includes DFS, BFS, MCTS, etc. \n\n Note that with DFS, you would explore the most promising idea before backtracking (after hitting an ‘impossible-to-win-from’ state); in BFS you’d explore multiple options and likely prune by comparing frontiers, etc. \n\nOnce the best *final* leaf node with the complete solution is found, all other branches are ignored and only the winning path from root-to-solution\_node is presented to the user as the ‘best’ CoT the model managed to achieve. This way you still have the interpretability benefit from normal CoT.\n### Pros\n1. **Generality:** It’s much better at handling ambiguity and can explore a wider search space. 
You can think of CoT, CoT-SC and normal single-shot prompts as all special cases of ToT. \n2. **Self-Correction / Resilience:** Evaluation / differential diagnosis helps it prune unpromising diagnostic branches if they lead to dead ends.\n### Cons\n1. **More compute / cost:** Pretty obvious, but generating and evaluating multiple branches involves more LLM calls and more tool/API executions. More cost.\n## Graph-of-Thoughts (GoT) Reasoning\n[Graph of Thoughts: Solving Complex Problems with Large Language Models [Google DeepMind + ETH Zürich + Princeton]](https://arxiv.org/pdf/2308.09687)\n\nWhile ToT improves upon CoT by allowing exploration of multiple reasoning paths, it still largely follows a tree-based branching structure. Graph-of-Thoughts (GoT) generalizes this further by modeling the thought process as an arbitrary graph. Once again, this is motivated by the fact that humans often re-use \u0026 backtrack between different ‘branches of thought’ in their head when performing reasoning. This paper makes two key sets of contributions. We’ll cover both here.\n### Extending Transformations \u0026 Generalizability\nThe paper proposed a few thought transformations and other ‘generalizable’ ideas which extend the operations and ideas we had across all the above constructions: CoT, CoT-SC \u0026 ToT. \n#### Heterogeneous Graphs\nNodes in the thought process can belong to different *classes*. This is particularly helpful in aiding / providing the agents with a human-organization-esque clarity on different sections of their job. For example, in writing tasks, some vertices model plans of writing a paragraph, while other vertices model the actual paragraphs of text. \n\nMore formally, the reasoning process can be modeled as *G = (V, E, c)*, where *V* is the set of thought-vertices, *E* is the set of dependency-edges, and *c* maps vertices to different classes (*C*). This is useful for complex tasks like incident diagnosis. We can define different classes of \"thought nodes,\" essentially creating specialized \"agents\" or \"experts\" within the graph.\n#### GoT Components and Thought Transformations\nGoT operates through a framework involving two key components, the Graph Reasoning State (GRS) and the Graph of Operations (GoO): transformations are applied to the graph of thoughts (the GRS), guided by a predefined plan (the GoO). Refer to this picture when reading the below section for a clearer understanding.\n\n![pasted-image-20251204061722](/images/pasted-image-20251204061722.webp)\n#### Thought Decomposition\nLike in ToT, GoT benefits from breaking the problem down. But additionally, the graph structure allows for more complex decomposition and recombination strategies.\n#### Thought Transformations\nIn previous ideas like ToT \u0026 CoT, we had a forward-fixing-esque strategy for reasoning. In GoT, we formalize this by defining three distinct types of graph operations you can do with the ‘thought nodes’ in the graph, along with some supporting machinery (ranking, and the GoO / GRS bookkeeping described below).\n\n1. **Generation:** Create one or more new thoughts *(v+1, ..., v+k)* based on a single existing thought (*v*). This is the only type of operation we’ve seen so far (ex: branching in ToT or extending a chain in CoT). Graphically, edges *(v, v+1), ..., (v, v+k)* are added. \n\n2. **Refining:** This involves improving an existing thought *v*. 
Graphically, this is represented as a self-loop *(v, v)* or, unrolled into edges, a chain *v -\u003e v' -\u003e v'' ...*, which represents iterative improvement of the same thought based on new feedback or to enable further analysis (e.g., self-correction, asking the LLM to improve its previous analysis of a metric, or maybe it doubts a previous diagnosis and wants more metrics analyzed to validate itself).\n\n3. **Aggregation:** This is the key new operation the graph structure allows for. It allows combining multiple distinct thought vertices *(v\_1, ..., v\_k)* into a new, aggregated synergistic thought *(v\_+)*. Graphically, this means creating a new vertex *v\_+* with incoming edges from *v\_1, ..., v\_k*. Here are some examples where this could help: \n 1. Merging findings from different diagnostic paths (e.g., combining insights from log analysis and metric spikes). \n 2. Synthesizing information from various \"experts\" into a higher-level conclusion.\n\n4. **Ranking/Selection (R):** This ranking function selects the *h* most relevant or highest-scoring thoughts from the current graph state *G*. This is run on the GRS (graph reasoning state) to decide which thoughts to aggregate or which paths to explore further (ex: `KeepBest(N)` to preserve the top *N* most ‘successful’ nodes).\n\n5. **Graph of Operations (GoO):** This is a static, predefined plan that dictates the sequence and dependencies of operations (like *Generate, Aggregate, Score, Improve, KeepBest*) to be applied during the reasoning process. It defines the ‘workflow’ for solving the problem.\n\n6. **Graph Reasoning State (GRS):** This is the dynamic graph *G = (V, E)* at any point during the reasoning process. It represents the state at any point in time, with the actual thoughts generated and their relationships as the GoO plan is executed. It's continuously updated by the thought transformations.\n### Walkthrough\nSince GoT is slightly more complex than the other approaches, let’s do a quick walkthrough of the GoT loop to see how we might imagine progress with this architecture.\n\n1. Initialize an empty GRS. Create the root node (ex: generating a diagnosis plan from an alert node). \n2. The **Controller** consults the GoO to determine the next operation(s) to perform (e.g., Generate hypotheses based on the initial alert node). \n3. It selects the relevant input thought(s) from the GRS (e.g., the alert node). \n4. The **Prompter** constructs the appropriate prompt for the LLM based on the operation and input thought(s). \n5. The LLM generates a response (a new set of thoughts). \n6. The **Parser** extracts the relevant information from the LLM's response and updates the GRS, adding new vertices (thoughts) and edges (dependencies) according to the operation type (e.g., adding 3 hypothesis nodes connected to the alert node via a *Generate(root, k=3)* operation). \n7. The **Scoring** module evaluates the newly generated thought(s) using the ranking function *R*, via calls to a judge LLM. Scores are stored in the GRS. \n8. The Controller might now use the `KeepBest(N)` operation in the GoO to prune less promising thoughts or select the best ones for the next step. \n9. Loop back to step 2 and keep iterating until the scoring module is able to judge that a satisfactory solution node is reached. (We can bound iterations with static limits to avoid infinite loops.) \n### Pros\n1. 
**Maximum Generalizability:** By modeling reasoning as an arbitrary graph of operations, we allow for self-loops (refinement) \u0026 aggregation, enabling more complex workflows that the previously discussed techniques cannot represent. \n2. **(Potentially) Better Resource Use:** A well-designed GoO can guide the LLM more efficiently than naively exploring all branches of a deep/wide tree. This is because with a good ranking function, it can focus computation on aggregating promising paths rather than fully exploring less likely ones.\n### Cons\n1. **Complexity:** Defining an effective Graph of Operations (GoO) requires **significant** upfront effort. The user needs to carefully plan the workflow, dependencies, and types of transformations needed for the specific task. Debugging graph-of-thought reasoning traces while iterating on prompts is even harder. The controller for handling control flow is a lot more complicated. So is scoring: a node’s value might now depend on multiple predecessors, its aggregated value, or its potential contribution to multiple future paths, instead of just the node itself. \n2. **(Potentially) Worse Resource Use:** If the GoO is not well designed, the ranking function has to rank all the (at least recently updated) nodes in the GRS, which can easily cost more resources as well.\n# Enabling Action and Learning: ReAct and Reflexion\nThinking helps LLMs a great deal in improving reasoning performance and in [explainability](https://transformer-circuits.pub/2025/attribution-graphs/biology.html#dives-medical). However, ‘thinking’ alone is insufficient. Agents can benefit a lot from interacting with the environment (via tool calling) and from learning from their mistakes. When an agent is tasked with solving the same type of task over several ‘trials’, it can reflect on mistakes, maintain ‘notes’ and learn from them. \n## ReAct (Reason+Act)\n[REACT: SYNERGIZING REASONING AND ACTING IN LANGUAGE MODELS [Google Brain, Princeton]](https://arxiv.org/pdf/2210.03629)\n\nIn short, ReAct is a prompting technique to interleave ‘reasoning’ and ‘tool calling’ in LLMs. In ReAct, the LLM is prompted to generate verbal reasoning traces both for making a tool call and for interpreting the observations returned by it. In hindsight, the paper is pretty ‘obvious’, but it essentially formalized a central template for LLMs to follow: \n\n![pasted-image-20251204062016](/images/pasted-image-20251204062016.webp)\n\nMaking this a formal ‘template’ also implies that it’s now easier to ‘teach’ this template to models via **few-shot prompting**. The paper also performs extensive experiments to empirically show that ReAct outperforms using CoT alone or doing tool calls alone without interleaved reasoning.\n## Reflexion\n[Reflexion: Language Agents with Verbal Reinforcement Learning [NU, MIT, Princeton]](https://arxiv.org/pdf/2303.11366)\n\nAt its core, Reflexion is an optimization technique that uses natural language for policy optimization, similar to reinforcement learning, but instead of updating weights, it relies on linguistic feedback. Here’s a diagram showing how Reflexion works in three different use cases:\n\n![pasted-image-20251204062105](/images/pasted-image-20251204062105.webp)\n### The Reflection Step\nThe *“evaluation”* section is key to Reflexion working well. 
If we have a way of evaluating the trajectory of an LLM’s progression in solving a task, the **reflection** step articulates why a failure occurred and proposes specific, actionable changes to the plan or strategy for the next attempt. This is then fed back to the LLM to incorporate as feedback and improve its original solution.\n### Episodic Memory\nReflexion provides a framework to improve an LLM across multiple *trials* of the same task. Here’s how:\n\n![pasted-image-20251204062151](/images/pasted-image-20251204062151.webp)\n#### Short Term Episodic Memory\nThe verbal ‘reflections’ generated during one trial of the task are stored in a short-term episodic memory buffer specific to that task-solving instance (trial). They are then also persisted to ‘longer term’ storage (like a SQL database). \n#### Long Term Memory\nFor subsequent attempts at the same task, these stored reflections are added to the agent's prompt/context. This explicitly guides the agent to avoid repeating the same mistakes and incorporate the learned strategies. \n\nFor example, let’s say that when asked to implement a sqrt function, the agent failed to handle the \u003c 0 case. This was caught during unit tests or during an LLM-judge stage and stored as a reflection. In a future run to code a binary exponentiation function, the reflection to handle special cases around \u003c 0 is injected into the prompt under the ‘reflection’ section. This allows the LLM to sort of develop its own “lessons learned” notes as it attempts the task multiple times. \n\nThe core loop for our building-block LLM can now be: \n\nAct (using, e.g., ReAct) -\u003e Evaluate -\u003e Reflect (if failed) -\u003e Store Reflection -\u003e Act Again (using reflections).
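Here’s a minimal C++ sketch of that loop, with stub functions standing in for the actual LLM calls (the helper names, the task string and the trial budget are all hypothetical):\n\n```cpp\n#include \u003cstring\u003e\n#include \u003cvector\u003e\n\n// Hypothetical stand-ins for LLM calls; a real agent would call a model here.\nstd::string act(const std::string\u0026 task, const std::vector\u003cstd::string\u003e\u0026 reflections) {\n    return \"attempt at: \" + task; // the prompt would embed the stored reflections\n}\nbool evaluate(const std::string\u0026 attempt) { return false; } // unit tests / LLM judge\nstd::string reflect(const std::string\u0026 attempt) {\n    return \"lesson: handle the edge cases missed in '\" + attempt + \"'\";\n}\n\nint main() {\n    std::vector\u003cstd::string\u003e memory; // long-term episodic memory of reflections\n    for (int trial = 0; trial \u003c 3; ++trial) {\n        std::string attempt = act(\"implement sqrt\", memory); // Act (e.g., via ReAct)\n        if (evaluate(attempt)) break;                        // Evaluate\n        memory.push_back(reflect(attempt));                  // Reflect + Store\n    } // the next trial acts again, now with reflections injected into the prompt\n}\n```\n\n---\nAnd that's about all I have for this survey. We can further greatly improve accuracy \u0026 orchestration over long horizon tasks by using multiple \"agents\". Read on to [Multi-Agent Systems; Harnessing Collective Intelligence - A Survey](/blog/multi-agent-systems-harnessing-collective-intelligence-a-survey) for more details! ",
- "lastmodified": "2026-01-11T05:06:24.580941841+05:30",
+ "lastmodified": "2026-01-11T10:59:36.958056921+05:30",
 "tags": [
 "domain-cs-ai-ml-llms"
 ]
 }
@@ -372,15 +372,15 @@
 "/blog/riemann-series-rearrangement": {
 "title": "Riemann Series Rearrangement",
 "content": "# Riemann Series Rearrangement\n\nTake an arbitrary infinite sequence of real numbers $\left( a_1, a_2, a_3, \ldots \right)$ such that $\sum\limits_{n=1}^\infty a_n$ is conditionally convergent. Let $K$ be any number belonging to the set of the extended real numbers. Then there exists a permutation\n\n$g: \mathbb{N}\to \mathbb{N}$ such that\n$\sum\limits_{n=1}^\infty a_{g(n)} = K$\n## Proof\n\n### Existence of a rearrangement that converges to a finite real number\n\nConsider $K$ to be any positive real number. Let the series be denoted by $S = \sum\limits_{i=0}^\infty a_i$. It is conditionally convergent. This means that it has infinitely many positive terms and infinitely many negative terms. Let us denote them as follows:\n\nLet $\left(p_1, p_2, p_3, \ldots\right)$ denote the sub-sequence of all positive terms in $S$ and $\left(n_1, n_2, n_3, \ldots\right)$ denote the sub-sequence of all negative terms in $S$. Since the series is conditionally convergent, the positive and negative series $(p_i) \ \\u0026 \ (n_i)$ will diverge to $+\infty$ and $-\infty$ respectively. 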
Hence, we have:\n\n$$\sum{p} = +\infty$$ $$\sum{n} = -\infty$$\n\nSince $\sum{p}$ tends to $\infty$, it implies that there exists a minimum natural number $N_1$ such that for all $N \geq N_1$ the following holds true: If $S_k$ denotes the partial sum of the first $k$ terms of this rearranged series, $$S_{N} = \sum\limits_{i=1}^{N} p_i \u003e K$$Since $N_1$ is the minimum such number, it implies that:\n\n$$\sum\limits_{i=1}^{N_1-1} p_i \leq K \u003c \sum\limits_{i=1}^{N_1} p_i$$\nWe can begin to develop a mapping $\sigma : {\mathbb{N}}\to {\mathbb{N}}$ such that,\n$$\sum\limits_{i=1}^{N_1} p_i = \left( a_{\sigma(1)} + a_{\sigma(2)} + a_{\sigma(3)} + \cdots + a_{\sigma(N_1)} \right)$$\n\nNow, since $\sum n$ diverges to $-\infty$, it is possible to add just enough terms from $(n_i)$ so that the resulting sum\n$$S_{N_1+M} = \sum\limits_{i=1}^{N_1} p_i + \sum\limits_{i=1}^{M} n_i \leq K$$\n\nLet $M_1$ be the minimum number of terms required from $(n_i)$ for the above statement to hold true. This implies that,\n\n$$\sum\limits_{i=1}^{N_1} p_i + \sum\limits_{i=1}^{M_1-1} n_i \u003e K \geq S_{N_1+M_1}$$\n\nConsider the inequality defining $N_1$; if we subtract $S_{N_1}$ from it and flip the signs, we get: $$0 \leq S_{N_1} - K \u003c p_{N_1}$$\n\nSimilarly, in the inequality defining $M_1$, if we subtract $S_{N_1+M_1}$, we get: $$0 \leq K - S_{N_1+M_1} \u003c -n_{M_1}$$\n\nNow, we can write $S_{N_1+M_1}$ as\n\n$$S_{N_1+M_1} = a_{\sigma(1)} + a_{\sigma(2)} + a_{\sigma(3)} + \ \cdots \ + a_{\sigma(N_1)} + a_{\sigma(N_1+1)} + a_{\sigma(N_1+2)} + a_{\sigma(N_1+3)} + \ \cdots \ + a_{\sigma(N_1+M_1)}$$\n\nNotice that this mapping $\sigma$ is injective. Now, we can repeat the process we performed above. Add just enough positive terms from $\sum p$ till the partial sum of this new rearranged series is just greater than $K$, then add just enough negative terms from $\sum n$ till the partial sum is lesser than or equal to $K$. Because $\sum p \ \\u0026 \ \sum n$ diverge to $+\infty$ and $-\infty$ respectively, this process can be carried out infinitely many times.\n\nIn general, our rearranged series would look like\n$$p_1 + p_2 + \cdots + p_{N_1} + n_1 + n_2 + \cdots + n_{M_1} + p_{N_1+1} + p_{N_1+2} + \cdots + p_{N_2} + n_{M_1+1} + n_{M_1+2} \cdots + n_{M_2} + \ldots$$\n\nNote that for every partial sum whose last summation step was adding terms from the positive series, $$S_{p_i} - K \u003c p_{N_i}$$ \nand for every partial sum whose last summation step was adding terms from the negative series, $$K - S_{n_i} \u003c |n_{M_i}|$$\n\nMore generally, we can say that at every \"change in direction\", the partial sum of the rearranged series at that point differs from our real number $K$ by at most $|p_{N_i}|$ or $|n_{M_i}|$. But we know that $\sum\limits_{n=1}^{\infty} a_n$ converges. Therefore, as $n$ tends to $\infty$, $a_n$ also tends to 0. Consequently, $|p_{N_i}|$ \\\u0026 $|n_{M_i}|$ must also tend to 0.\nFrom the above two observations, we can say that the following is true:\n\nAs $n$ tends to $\infty$, the partial sums of our rearranged series $\sum a_{\sigma(n)}$ tend to $K$.\n\n$$\sum\limits_{n=1}^{\infty} a_{\sigma(n)} = K$$\n\nThis same method can be used to show convergence to any negative real number $K$ or $K = 0$.\n\n### Existence of A Rearrangement That Diverges to Infinity\n\nLet the series be denoted by $S = \sum\limits_{i=0}^\infty a_i$. It is conditionally convergent. 
This means that it has infinitely many positive terms and infinitely many negative terms. Let us denote them as follows:\n\nLet $\left(p_1, p_2, p_3, \ldots\right)$ denote the sub-sequence of all positive terms in $S$ and $\left(n_1, n_2, n_3, \ldots\right)$ denote the sub-sequence of all negative terms in $S$. Since the series is conditionally convergent, the positive and negative series $(p_i) \ \\u0026 \ (n_i)$ will diverge to $+\infty$ and $-\infty$ respectively. Hence, we have:\n\n$$\sum{p} = +\infty$$ $$\sum{n} = -\infty$$\n\nSince $\sum{p}$ tends to $\infty$, it implies that there exists a minimum natural number $N_1$ such that the following holds true:\n$$ \sum\limits_{i=1}^{N_1} p_i \u003e |n_1| + c$$\nwhere $c$ is some positive real constant.\nSimilarly, we can find an $N_2$ such that it is the smallest natural number for which the following holds true:\n$$ \sum\limits_{i=N_1+1}^{N_2} p_i \u003e |n_2| + c $$\nWe can do this repeatedly an infinite number of times because the sub-sequence of positive terms diverges.\nThis gives us our rearranged series:\n$$\sum\limits_{n=1}^{\infty} a_{\sigma(n)} = p_1 + p_2 + \cdots + p_{N_1} + n_1 + p_{N_1+1} + p_{N_1 + 2} + \cdots p_{N_2} + n_2 + p_{N_2 + 1} + \ldots$$\n\nOwing to the way we chose $N_1$, the first $N_1 + 1$ terms of the series have a partial sum that is at least $c$ and no partial sum in this group is negative. Similarly, the partial sum of the first $N_2 + 2$ terms of this series is at least $2c$ and no partial sum in this group is negative. In general, after the group ending with $n_i$ (the first $N_i + i$ terms), the partial sum is at least $i \cdot c$ and no partial sum in that group is negative. Hence, as $i$ tends to $\infty$, the sequence of partial sums of the series tends to $\infty$.\n\n## Code to Analyze Series Rearrangement!\n\nAll programs related to this post can be found here: [Repository Link](https://github.com/akcube/notes/tree/main/series-rearrangement).\n\nFrom the proof, we can observe the algorithm one can use to rearrange a conditionally convergent series to sum up to any such real number K.\nHere, we will attempt to do two things. \n\nWe will observe what the rearranged series looks like when it converges to some real number M. You can use the programs in the above repo to print the series up to a certain number of terms for any real number M. Here, we will attach the output for what the beginning of the series looks like when we attempt to rearrange it to sum to 0.535. \nWe run the program like so: `./print_rearrangement 0.535 100 100`\n\nThe program will print what the series looks like for the first 100 groups of positive and negative terms.\n\n![sum_to_0-535](/images/sum_to_0-535.webp)\n\nFurther, we can use prog1.cpp and prog2.cpp to generate data points for plotting. prog1.cpp will generate data points of the partial sums of the alternating harmonic series up to a given number of terms. We can use prog2.cpp to generate data points of the partial sums for a rearranged alternating harmonic series that converges to some given real number M. 
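If you'd rather not pull the repository, here's a minimal C++ sketch of the rearrangement algorithm from the proof, hard-coded for the alternating harmonic series (the constants and the output format here are illustrative, not the repo's exact programs):\n\n```cpp\n#include \u003ciostream\u003e\n\n// Rearranges the alternating harmonic series 1 - 1/2 + 1/3 - ... to approach K,\n// exactly as in the proof: add positive terms until we exceed K, then negative\n// terms until we drop back below it, and repeat.\nint main() {\n    double K = 0.535, sum = 0;\n    long long p = 1, q = 2; // next positive term is 1/p, next negative term is -1/q\n    for (int flips = 0; flips \u003c 100; ++flips) {\n        while (sum \u003c= K) { std::cout \u003c\u003c \"+1/\" \u003c\u003c p \u003c\u003c ' '; sum += 1.0 / p; p += 2; }\n        while (sum \u003e K)  { std::cout \u003c\u003c \"-1/\" \u003c\u003c q \u003c\u003c ' '; sum -= 1.0 / q; q += 2; }\n    }\n    std::cout \u003c\u003c \"\\npartial sum after 100 direction changes: \" \u003c\u003c sum \u003c\u003c '\\n';\n}\n```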
\n\nFor the sake of illustration, we have chosen to plot the partial sums as n keeps increasing for the following rearrangements.\n\n- The normal alternating harmonic series\n\n- A rearrangement of the alternating harmonic series that sums to 0.5512\n\n- A rearrangement of the alternating harmonic series that sums to 0.8932\n\nPlotting them gives us the below graph.\n\n![partial_sums](/images/partial_sums.webp)\n\nHopefully, this graph paints a more intuitive picture of why rearranging the terms of an infinite conditionally convergent series changes its sum. By rearranging the terms such that the sequence of its partial sums oscillates around a limit of our choice, we're able to effectively choose the limit we wish the sequence of partial sums to approach. This is due to the fact that the sums of the positive and negative terms individually diverge to infinity. But the series itself converges to some limit, hence the $n^{th}$ term of the series approaches 0. \n\nBoth these properties are true for conditionally convergent series and it is for this very reason that we're able to rearrange the infinite sum to converge to whatever real sum we choose.",
- "lastmodified": "2026-01-11T05:06:24.994909412+05:30",
+ "lastmodified": "2026-01-11T10:59:37.380062682+05:30",
 "tags": [
 "domain-math-analysis"
 ]
 },
 "/blog/set-cover-approximation-algorithms": {
 "title": "Set Cover \u0026 Approximation Algorithms",
- "content": "# Greedy (cont.)\n\nWe previously discussed how the greedy strategy to solving problems is often the **best** way to solve a problem ([More Greedy Algorithms! Kruskal's \u0026 Disjoint Set Union](/blog/more-greedy-algorithms-kruskal-s-disjoint-set-union), [Activity Selection \u0026 Huffman Encoding](/blog/activity-selection-huffman-encoding)). It almost always provides a very simple implementation of an algorithm which is also very efficient. This is because we are able to reduce the overarching problem to a simple local problem that we can solve quickly at every step. This makes it a great solution when it works.\n\nHowever, as is the case with all things that appear amazing, not all problems can be broken down and solved for a local optimum that restructures the problem into smaller versions of itself.\n\nGreedy algorithms can also often trick you into believing that they are right. This is because they appear to always do \"the right\" thing. However, choices taken locally often affect the global optimum. They are enticing but often **not** optimal. Hence it is quite important for an algorithm analyst to ensure that their greedy strategy is indeed optimal and avoid getting baited.\n\n## Use as approximation algorithms\n\nThat said, greedy algorithms often give us a very _good_ answer. The answer may not be optimal, but it gives us a _\"decent\"_ approximation of the answer for an average case. This is somewhat intuitively understood from the fact that since the greedy is taking the optimal path at every step, it should at least give a decent result. While this is also the reason for it baiting people into believing it is optimal, it is also a good approximation algorithm and comes in clutch when we are tasked with **hard** problems.\n\nConsider the set of **NP-Complete** problems. The **Set Cover** problem belongs to the set of NP-Complete problems. This means that it is one of the hardest problems to solve in NP. There exists no polynomial-time algorithm to solve **Set Cover** _deterministically_. 
(At least, as of now.)\n\nComputers take a long long long time to solve NP-Complete problems. It is not feasible to expect a computer to solve the set cover problem for n \u003e 100000 even given a few hundred years. However, Set Cover is a common problem, and solving it could be _very_ useful to us.\n\n- [Solving Sudoku can be reduced to an exact cover problem](https://en.wikipedia.org/wiki/Exact_cover#Sudoku)\n- Companies (ex: airlines) trying to plan personnel shifts often find themselves tasked with solving this exact problem\n- Many tiling problems and fuzz-testing of programs also need to solve set cover\n- Determining the fewest locations to place Wi-Fi routers to cover the entire campus\n\nBut it is not physically feasible for a computer to solve Set Cover exactly at this scale. In cases like these, we turn to our savior, the enticing greedy algorithms. The greedy solutions for this problem are **not** optimal. But they run quickly and in most cases, provide a _\"close-to-optimal\"_ answer.\n\nBecause the strategy is not optimal and relies on picking the local optimum, it is obviously going to be possible to reverse engineer a test case against our greedy which often makes it output a not-very-optimal answer, but the point is, in the real world, we have a high probability of not facing such specific cases. This makes them a great solution to our problem.\n\n# The Set Cover Problem\n\nWe mentioned why the set cover problem is useful \u0026 said that it belonged to the **NP-Complete** set of problems. But we never stated the problem formally. The Set Cover problem asks the following question, _Given a set of elements $U$ (called the universe) and a collection $S$ of $m$ sets whose union equals the universe, the set cover problem is to identify the smallest sub-collection of $S$ whose union is the universe $U$._\n\nThe brute force for this problem is $O(m^n)$. Since this is not feasible to compute, let us consider greedy approximations.\n\n## A greedy approximation algorithm\n\nAn intuitive greedy that comes to mind is the following, _\"at every local step, pick the set which covers the most uncovered elements in the universe.\"_ This intuitively makes sense because we are trying to pick the set $s_i$ which contributes the most towards completing the set cover. However, note that this is not optimal and it can, in fact, be _tricked_ into picking the wrong solution at every step.\n\n### Code \u0026 Complexity\n\nThe following code snippet is a C++ implementation of the greedy algorithm. Let's try to put a bound on the complexity.\n\n- The initial sorting step takes $O(nlogn) + [O(|s_1|log|s_1|)+\dots+O(|s_m|log|s_m|)]$\n- The outer while loop may run as many as $O(n)$ iterations in the worst case. (Consider all disjoint singleton sets)\n - The loop inside may run as many as $O(m)$ iterations\n - Finally, applying two pointers on these strings will again take linear time. We can write this as $O(max\{|s_1|, \dots, |s_m|\})$.\n\nThe dominant term in this definitely comes from the nested while loop and not the sorting. Discarding the complexity from sorting and focusing on the loop, we see that the total complexity is\n\n$O(nm*max\{|s_1|,\dots, |s_m|\})$\n\nIn general, we can say the greedy runs in **cubic** time complexity. 
This is a **huge** improvement from our NP-Hard $O(m^n)$.\n\n```cpp\n// Input\n\tstring U = \"adehilnorstu\";\n\tvector\u003cstring\u003e S = {\"arid\", \"dash\", \"drain\", \"heard\", \"lost\", \"nose\", \"shun\", \"slate\", \"snare\", \"thread\", \"lid\", \"roast\"};\n\n\t// Sort to allow 2 pointers later\n\tsort(U.begin(), U.end());\n\tfor(auto \u0026s:S) sort(s.begin(), s.end());\n\n\tint left = U.size();\n\tint ans = 0;\n\t// The greedy loop\n\twhile(left){\n\t\tint max_covered = 0;\n\t\tint best_pick = -1;\n\t\t// Go through all sets in S and pick the best one\n\t\tfor(int i=0; i\u003c(int)S.size(); i++){\n\t\t\t// Reset per set; otherwise the count accumulates across sets\n\t\t\tint covered = 0;\n\t\t\t// Two pointers over the sorted strings to count newly covered elements\n\t\t\tfor(int j=0, k=0; j\u003c(int)S[i].size() \u0026\u0026 k\u003c(int)U.size(); ){\n\t\t\t\tif(S[i][j]==U[k]) covered++, j++, k++;\n\t\t\t\telse if(S[i][j]\u003cU[k]) j++; // this char is no longer in U\n\t\t\t\telse k++;\n\t\t\t}\n\t\t\t// Update pick choice\n\t\t\tif(covered\u003emax_covered) best_pick = i, max_covered = covered;\n\t\t}\n\t\t// Remove the newly covered elements from U\n\t\tans++;\n\t\tstring new_string;\n\t\tset\u003cchar\u003e temp; for(auto \u0026c:S[best_pick]) temp.insert(c);\n\t\tfor(auto \u0026c:U) if(temp.find(c)==temp.end()) new_string += c;\n\t\tswap(U, new_string); left = U.size();\n\t}\n\tcout\u003c\u003cans\u003c\u003cendl;\n```\n\n### Tricking the greedy\n\nHowever, since greedy is not optimal, we can trick it into always giving the wrong answer.\n\nConsider this following case,\n\n![set-cover-1](/images/set-cover-1.webp)\n\nOur greedy strategy will end up picking $\{ s_1, s_2, s_3 \}$ while the optimal answer is actually $\{ s_4, s_5 \}$. Notice that this logic for \"hacking\" the algorithm can be extended to work for universes of any power-of-two size $2^k$ with $k \geq 3$.\n\nThis isn't a very specifically tailored case and something we might even end up finding in real life. This is a little worrying and naturally raises the question, _\"What is the worst approximation that the algorithm can give?\"_\n\nThis might seem a little difficult to put a bound on, but it is possible to do so with just one key observation.\n\n### Putting a bound on the approximation\n\nLet's suppose that our universe set is $U$ and we are attempting to cover $U$ using the $n$ sets belonging to the collection $B$.\n\nNow, let us suppose that we **know** the optimal answer beforehand. Let this optimal answer be $k$. This means that we can always pick some $k$ sets from $B$ such that $\cup_{b_i}^k = U$.\n\nNow, following along with the greedy strategy, we know that there will be a certain number of elements left **uncovered** after the $t^{th}$ iteration. Let's call this number $n_t$. In the beginning, the entire set is uncovered, and hence $n_0 = n$.\n\n\u003e The pigeonhole principle states that if $n$ items are put into $m$ containers, with $n\gt m$, then at least one container must contain more than one item.\n\nNote that at the $t^{th}$ iteration, **if** we have $n_t$ elements left and the optimal answer is $k$, then by the pigeonhole principle, there **must** be a set that has not been picked yet that can cover **at least** $\frac{n_t}{k}$ elements. This is the key observation which we can use to bound our approximation strategy. Our greedy will (by definition) pick the largest such set which covers $\geq \frac{n_t}{k}$ elements. This lets us put the following bound,\n\n$$ n_{t+1}\leq n_t - \frac{n_t}{k} = n_t \cdot 
\\left( 1-\\frac{1}{k} \\right) \\\\ \\implies n_t \\leq n_0 \\left(1-\\frac{1}{k}\\right)^t \\\\ \\text{Now, } 1-x\\leq e^{-x} \\text{ and this equality only holds for } x=0\\\\ \\implies n_t \\leq n_0\\left(1-\\frac{1}{k}\\right)^t \\lt n_0(e^\\frac{-1}{k})^t=ne^{\\frac{-t}{k}} $$\n\nFurther, if we substitute $t = k \\ ln(n)$\n\n$$ n_t \\lt ne^\\frac{-t}{k} = ne^{\\frac{-k\\ ln(n)}{k}} \\\\ = ne^{-ln(n)} = ne^{ln(\\frac{1}{n})} = n.\\frac{1}{n} = 1 $$\n\nNote that $n_t$ is the number of elements left at the $i^{th}$ iteration. Therefore it must be a non-negative integer $\\lt 1$. The only possible answer is 0. When $n_t=0$, notice that the set has been completely covered and we have our answer.\n\nThis **must** mean that the algorithm will terminate after $t=k\\ ln(n)$ iterations. Our algorithm picks exactly 1 set per iteration. This also implies that if our optimal answer is $k$, our greedy strategy will pick at most $k \\ ln(n)$ sets. Hence we have successfully managed to put a bound on the approximation.\n# References\nThese notes are old and I did not rigorously horde references back then. If some part of this content is your's or you know where it's from then do reach out to me and I'll update it. \n1. Professor [Kannan Srinathan's](https://www.iiit.ac.in/people/faculty/srinathan/) course on Algorithm Analysis \u0026 Design in IIIT-H\n\n---\n\n## Related Reading\n\n- [DP as DAGs, Shortest path on DAGs \u0026 LIS in O(nlogn)](/blog/dp-as-dags-shortest-path-on-dags-lis-in-o-nlogn)\n", - "lastmodified": "2026-01-11T05:06:25.145084756+05:30", + "content": "# Greedy (cont.)\n\nWe previously discussed how the greedy strategy to solving problems is often the **best** way to solve a problem ([More Greedy Algorithms! Kruskal's \u0026 Disjoint Set Union](/blog/more-greedy-algorithms-kruskal-s-disjoint-set-union), [Activity Selection \u0026 Huffman Encoding](/blog/activity-selection-huffman-encoding)). It, almost always, provides a very simple implementation of an algorithm which is also very efficient. This is because we are able to reduce the overarching problem to a simple local problem that we can solve quickly at every step. This makes it a great solution when it works.\n\nHowever, as is the case with all things that appear amazing, not all problems can be broken down be solved for a local optimum which restructures the problem into smaller versions of itself.\n\nGreedy algorithms can also often trick the person into believing that they are right. This is because it appears to always do \"the right\" thing. Often changes taken locally affect the global optimum. They are enticing but often **not** optimal. Hence it is quite important for an algorithm analyst to ensure that his greedy strategy is indeed optimal and avoid getting baited.\n\n## Use as approximation algorithms\n\nThat said, greedy algorithms often give us a very _good_ answer. The answer may not be optimal, but it gives us a _\"decent\"_ approximation of the answer for an average case. This is somewhat intuitively understood from the fact that since the greedy is taking the optimal path at every step, it should at least give a decent result. While this is also the reason for it baiting people into believing it is optimal, it is also a good approximation algorithm and comes in clutch when we are tasked with **hard** problems.\n\nConsider the set of **NP-Complete** problems. The **Set Cover** problem belongs to the set of NP-Complete problems. This means that it is one of the hardest problems to solve in NP. 
There exists no polynomial-time algorithm to solve **Set Cover** _deterministically_. (At least, as of now.)\n\nComputers take a long long long time to solve NP-Complete problems. It is not feasible to expect a computer to solve the set cover problem for n \u003e 100000 even given a few hundred years. However, Set Cover is a common problem, and solving it could be _very_ useful to us.\n\n- [Solving Sudoku can be reduced to an exact cover problem](https://en.wikipedia.org/wiki/Exact_cover#Sudoku)\n- Companies (ex: airlines) trying to plan personnel shifts often find themselves tasked with solving this exact problem\n- Many tiling problems and fuzz-testing of programs also need to solve set cover\n- Determining the fewest locations to place Wi-Fi routers to cover the entire campus\n\nBut it is not physically feasible for a computer to solve Set Cover exactly at this scale. In cases like these, we turn to our savior, the enticing greedy algorithms. The greedy solutions for this problem are **not** optimal. But they run quickly and in most cases, provide a _\"close-to-optimal\"_ answer.\n\nBecause the strategy is not optimal and relies on picking the local optimum, it is obviously going to be possible to reverse engineer a test case against our greedy which often makes it output a not-very-optimal answer, but the point is, in the real world, we have a high probability of not facing such specific cases. This makes them a great solution to our problem.\n\n# The Set Cover Problem\n\nWe mentioned why the set cover problem is useful \u0026 said that it belonged to the **NP-Complete** set of problems. But we never stated the problem formally. The Set Cover problem asks the following question, _Given a set of elements $U$ (called the universe) and a collection $S$ of $m$ sets whose union equals the universe, the set cover problem is to identify the smallest sub-collection of $S$ whose union is the universe $U$._\n\nThe brute force for this problem is $O(m^n)$. Since this is not feasible to compute, let us consider greedy approximations.\n\n## A greedy approximation algorithm\n\nAn intuitive greedy that comes to mind is the following, _\"at every local step, pick the set which covers the most uncovered elements in the universe.\"_ This intuitively makes sense because we are trying to pick the set $s_i$ which contributes the most towards completing the set cover. However, note that this is not optimal and it can, in fact, be _tricked_ into picking the wrong solution at every step.\n\n### Code \u0026 Complexity\n\nThe following code snippet is a C++ implementation of the greedy algorithm. Let's try to put a bound on the complexity.\n\n- The initial sorting step takes $O(nlogn) + [O(|s_1|log|s_1|)+\dots+O(|s_m|log|s_m|)]$\n- The outer while loop may run as many as $O(n)$ iterations in the worst case. (Consider all disjoint singleton sets)\n - The loop inside may run as many as $O(m)$ iterations\n - Finally, applying two pointers on these strings will again take linear time. We can write this as $O(max\{|s_1|, \dots, |s_m|\})$.\n\nThe dominant term in this definitely comes from the nested while loop and not the sorting. Discarding the complexity from sorting and focusing on the loop, we see that the total complexity is\n\n$O(nm*max\{|s_1|,\dots, |s_m|\})$\n\nIn general, we can say the greedy runs in **cubic** time complexity. 
This is a **huge** improvement from our NP-Hard $O(m^n)$.\n\n```cpp\n// Input\n\tstring U = \"adehilnorstu\";\n\tvector\u003cstring\u003e S = {\"arid\", \"dash\", \"drain\", \"heard\", \"lost\", \"nose\", \"shun\", \"slate\", \"snare\", \"thread\", \"lid\", \"roast\"};\n\n\t// Sort to allow 2 pointers later\n\tsort(U.begin(), U.end());\n\tfor(auto \u0026s:S) sort(s.begin(), s.end());\n\n\tint left = U.size();\n\tint ans = 0;\n\t// The greedy loop\n\twhile(left){\n\t\tint max_covered = 0;\n\t\tint best_pick = -1;\n\t\t// Go through all sets in S and pick the best one\n\t\tfor(int i=0; i\u003c(int)S.size(); i++){\n\t\t\t// Reset per set; otherwise the count accumulates across sets\n\t\t\tint covered = 0;\n\t\t\t// Two pointers over the sorted strings to count newly covered elements\n\t\t\tfor(int j=0, k=0; j\u003c(int)S[i].size() \u0026\u0026 k\u003c(int)U.size(); ){\n\t\t\t\tif(S[i][j]==U[k]) covered++, j++, k++;\n\t\t\t\telse if(S[i][j]\u003cU[k]) j++; // this char is no longer in U\n\t\t\t\telse k++;\n\t\t\t}\n\t\t\t// Update pick choice\n\t\t\tif(covered\u003emax_covered) best_pick = i, max_covered = covered;\n\t\t}\n\t\t// Remove the newly covered elements from U\n\t\tans++;\n\t\tstring new_string;\n\t\tset\u003cchar\u003e temp; for(auto \u0026c:S[best_pick]) temp.insert(c);\n\t\tfor(auto \u0026c:U) if(temp.find(c)==temp.end()) new_string += c;\n\t\tswap(U, new_string); left = U.size();\n\t}\n\tcout\u003c\u003cans\u003c\u003cendl;\n```\n\n### Tricking the greedy\n\nHowever, since greedy is not optimal, we can trick it into always giving the wrong answer.\n\nConsider this following case,\n\n![set-cover-1](/images/set-cover-1.webp)\n\nOur greedy strategy will end up picking $\{ s_1, s_2, s_3 \}$ while the optimal answer is actually $\{ s_4, s_5 \}$. Notice that this logic for \"hacking\" the algorithm can be extended to work for universes of any power-of-two size $2^k$ with $k \geq 3$.\n\nThis isn't a very specifically tailored case and something we might even end up finding in real life. This is a little worrying and naturally raises the question, _\"What is the worst approximation that the algorithm can give?\"_\n\nThis might seem a little difficult to put a bound on, but it is possible to do so with just one key observation.\n\n### Putting a bound on the approximation\n\nLet's suppose that our universe set is $U$ and we are attempting to cover $U$ using the $n$ sets belonging to the collection $B$.\n\nNow, let us suppose that we **know** the optimal answer beforehand. Let this optimal answer be $k$. This means that we can always pick some $k$ sets from $B$ such that $\cup_{b_i}^k = U$.\n\nNow, following along with the greedy strategy, we know that there will be a certain number of elements left **uncovered** after the $t^{th}$ iteration. Let's call this number $n_t$. In the beginning, the entire set is uncovered, and hence $n_0 = n$.\n\n\u003e The pigeonhole principle states that if $n$ items are put into $m$ containers, with $n\gt m$, then at least one container must contain more than one item.\n\nNote that at the $t^{th}$ iteration, **if** we have $n_t$ elements left and the optimal answer is $k$, then by the pigeonhole principle, there **must** be a set that has not been picked yet that can cover **at least** $\frac{n_t}{k}$ elements. This is the key observation which we can use to bound our approximation strategy. Our greedy will (by definition) pick the largest such set which covers $\geq \frac{n_t}{k}$ elements. This lets us put the following bound,\n\n$$ n_{t+1}\leq n_t - \frac{n_t}{k} = n_t \cdot 
\\left( 1-\\frac{1}{k} \\right) \\\\ \\implies n_t \\leq n_0 \\left(1-\\frac{1}{k}\\right)^t \\\\ \\text{Now, } 1-x\\leq e^{-x} \\text{ and this equality only holds for } x=0\\\\ \\implies n_t \\leq n_0\\left(1-\\frac{1}{k}\\right)^t \\lt n_0(e^\\frac{-1}{k})^t=ne^{\\frac{-t}{k}} $$\n\nFurther, if we substitute $t = k \\ ln(n)$\n\n$$ n_t \\lt ne^\\frac{-t}{k} = ne^{\\frac{-k\\ ln(n)}{k}} \\\\ = ne^{-ln(n)} = ne^{ln(\\frac{1}{n})} = n.\\frac{1}{n} = 1 $$\n\nNote that $n_t$ is the number of elements left at the $i^{th}$ iteration. Therefore it must be a non-negative integer $\\lt 1$. The only possible answer is 0. When $n_t=0$, notice that the set has been completely covered and we have our answer.\n\nThis **must** mean that the algorithm will terminate after $t=k\\ ln(n)$ iterations. Our algorithm picks exactly 1 set per iteration. This also implies that if our optimal answer is $k$, our greedy strategy will pick at most $k \\ ln(n)$ sets. Hence we have successfully managed to put a bound on the approximation.\n# References\nThese notes are old and I did not rigorously horde references back then. If some part of this content is your's or you know where it's from then do reach out to me and I'll update it. \n1. Professor [Kannan Srinathan's](https://www.iiit.ac.in/people/faculty/srinathan/) course on Algorithm Analysis \u0026 Design in IIIT-H", + "lastmodified": "2026-01-11T10:59:49.781799282+05:30", "tags": [ "domain-cs-algorithms-analysis" ] @@ -388,7 +388,7 @@ "/blog/shortest-common-superstring-de-brujin-graphs": { "title": "Shortest Common Superstring \u0026 De Brujin Graphs", "content": "# Preface \u0026 References\nI document topics I've discovered and my exploration of these topics while following the course, [Algorithms for DNA Sequencing, by John Hopkins University](https://www.coursera.org/learn/dna-sequencing) on [Coursera](https://www.coursera.org/). The course is taken by two instructors [Ben Langmead](https://scholar.google.com/citations?user=2JMaTKsAAAAJ\u0026hl=en) and [Jacob Pritt](https://www.coursera.org/instructor/jacobpritt).\n\nWe will study the fundamental ideas, techniques, and data structures needed to analyze DNA sequencing data. In order to put these and related concepts into practice, we will combine what we learn with our programming expertise. Real genome sequences and real sequencing data will be used in our study. We will use Boyer-Moore to enhance naïve precise matching. We then learn indexing, preprocessing, grouping and ordering in indexing, K-mers, k-mer indices and to solve the approximate matching problem. Finally, we will discuss solving the alignment problem and explore interesting topics such as De Brujin Graphs, Eulerian walks and the Shortest common super-string problem. \n\nAlong the way, I document content I've read about while exploring related topics such as suffix string structures and relations to my research work on the STAR aligner.\n# Shortest Common Superstring (SCP)\n\nWe will now attempt to model the assembly problem ([De-Novo Assembly \u0026 Overlap Graphs](/blog/de-novo-assembly-overlap-graphs)) as computational problems. 
Our first attempt at this will be modelling it as the SCS problem.\n\nA **shortest common superstring** of a collection of two or more strings is the shortest possible string that contains all of the original strings as sub-strings.\n\nThe problem of finding the shortest common superstring of these sequences can then stand in for finding the original genome sequence, under the assumption that the genome is the shortest possible sequence that contains all of the original sequences as sub-strings. Thus, solving the shortest common superstring problem can be used to assemble a genome from a set of overlapping DNA sequences. However, a sad reality is that this problem is **NP-Complete**.\n\n\u003e***Proof sketch:** The shortest common superstring problem is NP-Complete because the NP-Complete Shortest Hamiltonian Path problem can be reduced to it. Represent each vertex of the graph as a string, constructed so that two strings overlap heavily exactly when there is an edge between the corresponding vertices. A shortest common superstring must then lay all of these strings out in an order that maximizes the total overlap between adjacent strings, and reading the strings off in that order traces out a shortest Hamiltonian path in the graph. Because the shortest common superstring problem is at least as hard as the Shortest Hamiltonian Path problem, it is NP-Complete.*\n\n## Greedy Approach\n\nA greedy approach to solving the shortest common superstring problem involves iteratively selecting the pair of strings that overlap the most, and merging them into a single string. This process is repeated until all of the strings have been merged into a single superstring. It selects the pair of strings that appears to be the best choice without considering the overall optimality of the solution. This can give us a decent reconstruction but is sadly still pretty inaccurate in practice, as the sketch below illustrates.\n
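Here's what that greedy merge loop might look like in C++. This is a toy sketch with made-up reads; real assemblers index k-mers rather than brute-forcing every pair like this:\n\n```cpp\n#include \u003calgorithm\u003e\n#include \u003ciostream\u003e\n#include \u003cstring\u003e\n#include \u003cvector\u003e\n\n// Length of the longest suffix of a that is also a prefix of b.\nint overlap(const std::string\u0026 a, const std::string\u0026 b) {\n    int best = 0, n = (int)std::min(a.size(), b.size());\n    for (int len = 1; len \u003c= n; ++len)\n        if (a.compare(a.size() - len, len, b, 0, len) == 0) best = len;\n    return best;\n}\n\nint main() {\n    std::vector\u003cstd::string\u003e reads = {\"ACGGT\", \"GGTCA\", \"TCAAC\"};\n    while (reads.size() \u003e 1) {\n        int bi = 0, bj = 1, bo = -1;\n        for (int i = 0; i \u003c (int)reads.size(); ++i)     // find the pair of reads\n            for (int j = 0; j \u003c (int)reads.size(); ++j) // with the largest overlap\n                if (i != j) {\n                    int o = overlap(reads[i], reads[j]);\n                    if (o \u003e bo) { bo = o; bi = i; bj = j; }\n                }\n        reads[bi] += reads[bj].substr(bo); // merge the best pair into one string\n        reads.erase(reads.begin() + bj);\n    }\n    std::cout \u003c\u003c reads[0] \u003c\u003c '\\n'; // ACGGTCAAC for the toy reads above\n}\n```\n## 3rd Law of Assembly: Repeats Are Bad\n\nThis is probably the most **frustrating** problem in genome assembly and what makes it pretty much impossible to solve the assembly problem with $100\%$ certainty.\n\nConsider the following example,\n\n![greedy-repeat-fail](/images/greedy-repeat-fail.webp)\n\nOur greedy solution gave us a shorter sequence than the original genome; this is due to the presence of overlapping reads from a repeating portion of our genome which is **extremely hard** to unambiguously resolve. The primary problem here is that we are aware of the repeat's existence from the multiple pieced-together reads, but we are not sure about the **frequency** of these repeats.\n\n# De Bruijn Graphs\n\nDe Bruijn graphs are a mathematical construct that is often used in the field of computational biology, particularly in the context of genome assembly. In a De Bruijn graph, each vertex represents a (k-1)-mer, a sub-sequence of length $k-1$ from a given string, and each k-mer from the string contributes an edge from its length-$(k-1)$ prefix to its length-$(k-1)$ suffix. 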
The graph can then be used to efficiently represent the overlaps between the k-mers in the original string, and can be used to reconstruct the original string by finding a walk through the graph that traverses every edge exactly once. This is also called an **Eulerian walk.** \n\nIt has exactly one node per *distinct* (k-1)-mer and one edge per *each* k-mer.\n\n![euler-walk-in-de-brujin-graph](/images/euler-walk-in-de-brujin-graph.webp)\n\nHowever, we have still not dealt with the problem of repeats. For example, if the graph contains multiple cycles, then it may not be possible to find an Eulerian walk that correctly reconstructs the original genome sequence, as the walk may not be able to distinguish between the different cycles. Additionally, if the graph contains errors or missing k-mers, then an Eulerian walk may not be able to correctly reconstruct the original genome. This is all mainly caused by the presence of repeats in the original genome. \n\n![debrujin-fail](/images/debrujin-fail.webp)\n\nThe issue in the above example occurs primarily due to the repeating term *AB*. This gives us multiple reshuffles of the sequence and we cannot deterministically figure out which reconstruction is correct. \n\n## Fixing What We Can\n\n![prune-useless-edges](/images/prune-useless-edges.webp)\n\nWe often have edges like these showing up in the De Bruijn graph, where the existence of the blue edges nullifies any information we might gain from the green edge. We can prune these from the graph.\n\n![dbg-mp-fail](/images/dbg-mp-fail.webp)\n\nMaternal / paternal chromosomes can differ by a single base in a read, causing cycles like these to form. We can attempt to prune these from the graph as well. \n\n![independent-solving](/images/independent-solving.webp)\n\nBecause repeats **always** cause ambiguity, we can attempt to break up the graph into parts, solve only the deterministic chunks first, and mark the chunks with repeats as *ambiguous*. In fact, this is how most assemblers work in practice nowadays. Excluding small genomes, it is very difficult to get accurate reconstructions of a complete genome. Even the Human Genome, the most widely studied genome on the planet, still has many gaps in it today due to the uncertainties caused by repeats in the genome.\n\n## Attempts at Discerning the Ambiguity\n\nOne simple solution we could provide here is not from a computational point of view but from the point of view of the technology that generates the sequences. Increasing the lengths of the reads could allow the repeating fragments to also contain some portion of distinct / unique sequence, which allows them to be matched with better certainty. Another type of sequencing which gathers some metadata from the surrounding reads is also making its way into the mainstream. In the end, we'll need to get more data than we already have and then develop algorithms to solve these new problems with the additional metadata to try to get better certainty about the ambiguous portions of sequenced genomes. ",
- "lastmodified": "2026-01-11T05:06:25.856052557+05:30",
+ "lastmodified": "2026-01-11T10:59:38.087334632+05:30",
 "tags": [
 "domain-cs-algorithms-analysis",
 "domain-science-bioinformatics"
 ]
 },
 "/blog/shortest-reliable-path-floyd-warshall-max-independent-set-tree": {
 "title": "Shortest Reliable Path, Floyd Warshall \u0026 Max-Independent Set (Tree)",
 "content": "Last time we discussed [A Deep Dive into the Knapsack Problem](/blog/a-deep-dive-into-the-knapsack-problem). 
Today, we'll look at three more interesting problems with cool Dynamic Programming solutions.\n# Shortest Reliable Path\n\nConsider the following dispatch problem. Often when trying to schedule deliveries of goods, it is not good enough to only determine the shortest path from source to destination. One needs to also take into account the number of points at which the goods must switch transport vehicles. This could have an effect on the quality of goods received. We can have similar applications in networking where we do not want to switch _edges_ multiple times. In these cases, we try to solve a slight variation of the shortest path problem.\n\nThe shortest reliable path problem asks the following question, _\"Given a graph $G$ with weighted edges, what is the shortest path from location $s$ to location $t$ such that the path consists of **at most** $k$ edges?\"_\n\nWe can solve this problem using dynamic programming.\n\n## The DP solution\n\nLet's think about the following recurrence. If I know what the shortest path to some vertex $v$ is using $i$ edges, I can just go over all my edges again in a \"relaxation\" step and find out what the shortest path to vertex $v$ is using $i+1$ edges. We have identified our subproblem!\n\nLet's define $dp[i][j]$ as the shortest path to reach vertex $i$ using just $j$ edges.\n\n1. **Number of subproblems**\n \n Notice that we have $|V|$ vertices and will have to compute the answer for $i:1\to k$ edges. Therefore we will have $|V|k$ subproblems. $k$ can be as large as $m$, which would then require solving $O(|V|m)$ subproblems.\n \n2. **Finding how to brute force the solution to some subproblem state**\n \n To go from knowing the shortest paths using $i$ edges to knowing the solution when using $i+1$ edges, we will have to \"relax\" all the edges once. We can solve **all** the subproblems for some number of edges $k'$ by just iterating over the entire edge list in $O(m)$\n \n3. **Finding the recurrence**\n \n As mentioned previously, relaxing all edges will net us the desired result.\n \n We can write the recurrence as\n \n $$ dp[v][i] = min_{(u,v)\in E}(dp[u][i-1]+l(u,v), dp[v][i]) $$\n \n Notice that this implies that we initially consider the distance to every vertex other than the source as $\infty$.\n \n4. **Figuring out DAG structure**\n \n We can visualize this as a simple linear chain. We solve the problem for **all** vertices using $i+1$ edges in one go. So we can just think of it as a linear chain going from $i=1\to 2 \to 3\to \dots \to k$ edges.\n \n5. **Completing the solution**\n \n Armed with all the information we need, all we need to do now is calculate the final solution. Since we're computing $k$ rounds in $O(m)$ each, the solution has overall $O(km)$ complexity where $m = |E|$. In the worst case when $k \to m$ we can have $O(m^2)$ complexity.\n \n\n### A tighter bound\n\nNotice that our solution is **very** similar to the Bellman-Ford algorithm. It's because Bellman-Ford and our algorithm work on the same principle. Both the algorithms solve the very same subproblems. But notice that **any spanning tree** of our graph will connect all vertices, and this implies that there will always be a path between any two vertices using at most $n-1$ edges. This means repeating our algorithm $n-1$ times will converge to the optimal shortest distance solution.\n\nFrom this fact, we can naturally conclude that the bound on the value of $k$ is $|V|$. Therefore our solution will not have $O(m^2)$ complexity as we can bound $k = min(k, |V|-1)$. 
This gives our algorithm a better runtime of $O(|V||E|)$.\n\n### 1D Row Optimization\n\nNotice that again, we are computing the answer for all $v \in V$ using $i$ edges from the answer for $i-1$ edges. This means that we in fact do not need to store the solution for **all** $O(|V|k)$ subproblems. Storing the previous round's answers, plus a scratch row for the round being computed, i.e. $O(|V|)$ values, is enough.\n\nHence we can optimize it down to a pair of 1D rows. (The snapshot in the code below guards against relaxations cascading within a single round, which would otherwise build paths with more than the allowed number of edges.)\n\n### Code\n\nThe code for this DP solution is quite beautiful and short. Vector `d` stores the DP values for any given state. Here, we assume the graph is stored in edge list representation. `e` is the edge list.\n\n```cpp\nvector\u003cint\u003e d (n, INF);\nd[v] = 0;\nfor (int i=0; i\u003cmin(k, n-1); ++i) {\n    vector\u003cint\u003e nd = d; // snapshot: each round may add at most one edge\n    for (int j=0; j\u003cm; ++j)\n        if (d[e[j].a] \u003c INF)\n            nd[e[j].b] = min (nd[e[j].b], d[e[j].a] + e[j].cost);\n    d = nd;\n}\n```\n\n## An alternate Greedy + DP solution\n\nDijkstra is a greedy algorithm that computes the shortest paths solution in $O(ElogV)$ with the help of a priority queue implementation using some heap. Notice that we can modify how the heap stores its top element and eliminate some skipping to arrive at a solution for the shortest reliable paths problem!\n\nLet's say my new criterion for highest priority is the pair $(i, dis[v])$. In this notation, we first sort priority using $i$. The pair with the lowest $i$ is given the highest priority. Once sorted by $i$, we assign priority based on the smallest $dis[v]$.\n\n### Our claim\n\nI claim that with this additional bookkeeping, we will be able to solve this problem once we eliminate a speedup check in the original Dijkstra.\n\nLet's think about what this additional bookkeeping is doing. By enforcing this constraint, we are essentially saying that we **must** first update all reachable vertices using the smallest number of edges $i$. So we are simply running Dijkstra on a more constrained graph. This means that I will be able to compute the solution using $i$ edges.\n\nHowever, Dijkstra skips over all the nodes already visited. This is essential in keeping the complexity down. Consider this case.\n\n![srp-1](/images/srp-1.webp)\n\nWe will not be able to update the third node from the left to distance 3 once it has already been processed for reachability using 2 edges. Hence we will have to eliminate this skipping and force the algorithm to process such vertices again.\n\n### How is this different from the previous solution?\n\nNotice that in the previous solution, for any randomized sparse graph, we would, in the beginning, be iterating over many edges that are far from the reach of the source node. This is redundant work that we were doing. Here, we are **only** iterating over the edges that are reachable.\n\n**The complexity of this solution**\n\nAssuming we are using a priority queue, our solution has a worst-case runtime of $O(kElogV)$. However, notice that because we are not iterating over every edge on every iteration, for sparse graphs where $k$ is small, we might have a better/faster runtime using this solution.\n
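Here's a sketch of this modified Dijkstra in C++ (the function name, the adjacency-list layout and the `INF` sentinel are my own choices for illustration):\n\n```cpp\n#include \u003cqueue\u003e\n#include \u003ctuple\u003e\n#include \u003cvector\u003e\nusing namespace std;\nconst long long INF = 1e18;\n\n// adj[u] holds {v, w} pairs. Returns the shortest s -\u003e t distance using at most k edges.\nlong long shortest_reliable(const vector\u003cvector\u003cpair\u003cint, long long\u003e\u003e\u003e\u0026 adj,\n                            int s, int t, int k) {\n    int n = adj.size();\n    // dist[v][i] = best known distance to v using exactly i edges\n    vector\u003cvector\u003clong long\u003e\u003e dist(n, vector\u003clong long\u003e(k + 1, INF));\n    // min-heap ordered by (edges used, distance), i.e. the (i, dis[v]) criterion\n    priority_queue\u003ctuple\u003cint, long long, int\u003e,\n                   vector\u003ctuple\u003cint, long long, int\u003e\u003e, greater\u003c\u003e\u003e pq;\n    dist[s][0] = 0;\n    pq.emplace(0, 0LL, s);\n    long long best = INF;\n    while (!pq.empty()) {\n        auto [i, d, u] = pq.top(); pq.pop();\n        if (d \u003e dist[u][i]) continue;     // stale heap entry\n        if (u == t) best = min(best, d);\n        if (i == k) continue;             // edge budget exhausted\n        for (auto [v, w] : adj[u])        // note: vertices get re-processed per i,\n            if (d + w \u003c dist[v][i + 1]) { // there is no global \"visited\" skip\n                dist[v][i + 1] = d + w;\n                pq.emplace(i + 1, d + w, v);\n            }\n    }\n    return best;\n}\n```\n# Floyd Warshall\n\nThe problem is as follows, _\"Given a graph G, find the shortest distance between all pairs of points.\"_\n\nNotice that we can compute the answer to this problem simply by running Dijkstra $|V|$ times. This would have an overall runtime of $O(|V||E|log|V|)$. For dense graphs, the complexity might reach $O(n^3logn)$ where $n = |V|$. 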
We also require, at the very minimum, a binary heap implementation of a priority queue.\n\nFurther, this solution will **not** work if the graph contains any negative edge weights.\n\n## The DP Solution\n\nThe first step to solving it with DP is identifying a subproblem. Let's say I order my nodes in some arbitrary fashion. This implies that my nodes are always in some order and the concept of _\"first k nodes\"_ can be applied to them. Now, I can define my DP state as follows:\n\nLet $dp[i][j][k]$ represent the length of the **shortest** path from nodes $i \to j$ using _just the first_ $k$ nodes as _intermediaries._ Notice that now, we can define a recurrence between subproblems as follows:\n\n$$ dp[i][j][k] = min(dp[i][k][k-1] + dp[k][j][k-1], dp[i][j][k-1]) $$\n\nLet's see what this means. When computing the shortest distance between any two nodes $i, j$ using the first $k$ nodes as intermediaries, we assume that we **know** the optimal solution to the distance between them when using just the first $k-1$ nodes.\n\nIf these subproblems have been solved, then when computing the shortest distance between $i,j$ using $k$ intermediaries, the question essentially boils down to asking _\"Should we include intermediate node $k$ in the shortest path?\"_\n\nTo answer this, we check what the shortest path from $i \to k$ is and $k \to j$ is using the first $k-1$ intermediate nodes. If the sum of these distances is less than the min computed so far, we can include node $k$. Notice that we are simply including 1 node. Therefore our computation for the DP state will be correct.\n\n![fw-1](/images/fw-1.webp)\n\nThis is a visual representation of the sub-problem we're attempting to solve.\n\nNow for the base case, the distance between any two nodes using 0 intermediary nodes will be $\infty$ when they're not connected, $l(u,v)$ when they are connected, and $0$ from a node to itself. It's essentially the adjacency matrix representation of the graph with unconnected pairs marked with $\infty$.\n\n### Time complexity\n\nWe have $i\cdot j \cdot k$ subproblems to solve and each sub-problem takes $O(1)$ computation to solve.\n\nTherefore overall time complexity of our algorithm will be $O(n\times n\times n) = O(n^3)$ . Here $n = |V|$.\n\n### Space complexity\n\nNotice that naively, we must store the computation for $O(n^3)$ subproblems and hence require $O(n^3)$ space. However, notice that we can do something very similar to 1D row optimization. Notice that for computing all subproblems with DP state $k$, we only require the solution of all-pairs shortest paths using $k-1$ intermediaries. This means we only need to store $O(n^2)$ solutions at any point in time. Hence we can reduce the space complexity down to $O(n^2)$.\n\n### Code\n\nAgain, as with most DP solutions, the code is quite short and sweet :)\n\n```cpp\nfor (int k = 0; k \u003c n; ++k) {\n    for (int i = 0; i \u003c n; ++i) {\n        for (int j = 0; j \u003c n; ++j) {\n            if (d[i][k] \u003c INF \u0026\u0026 d[k][j] \u003c INF) // guard so \"infinity\" doesn't overflow\n                d[i][j] = min(d[i][j], d[i][k] + d[k][j]); \n        }\n    }\n}\n```\n\n# Independent Set in a tree\n\nThe problem we're trying to solve here is as follows, _\"Given a tree G, find the **largest** independent set of vertices belonging to the tree. Here, we define a subset of vertices $S$ of $V$ as independent if there are **no** edges belonging to G which connect **any** pair of vertices in the subset $S$.\"_\n\nFrom the definition of \"independent set\", we can easily conclude that the set $S$ must be a bipartite subset of $G$. However, notice that any bipartite coloring won't do. 
More specifically, a bipartite coloring where we color one node, then skip its children and proceed, won't do.\n\n![ist-1](/images/ist-1.webp)\n\nThis is a simple counter case to that solution. We require both the lumps of vertices, at the bottom and at the top, for the optimal solution. Notice that this hints at the sub-problem we need to solve. \n\n## The DP Solution\nWe can define our sub-problem as follows, _\"Should we include node $u$ in the answer or not?\"_ To further this and make it more useful, we can define a DP state as follows: _\"How many nodes would I get in my optimal independent set if I included node $u$ in the subset?\"_\n\nIf our DP stores this, notice that every node $u$ is the root of some subtree. This means we can calculate the answer for each subtree of $G$ and the final answer will be the $DP$ state for the root of the tree.\n\nNow, how do we find the recurrence?\n\nThis can be done greedily.\n\nNotice that **if** we include $u$ in the answer, we **cannot** include any child of $u$. The next **best** option is to include the grandchildren (children of children) of $u$.\n\nNotice that this is optimal. Because every $DP$ state stores some positive quantity, **not** choosing to take the best answer for a grandchild's subtree would imply we missed a chance to increase the value. Further, each DP state is only dependent on its children and grandchildren. Hence this decision does not affect future DP states.\n\nIf we **do not** include $u$ in the answer, then we simply take the optimal answer for each of its children's subtrees. The reasoning for this is the same as the above.\n\nNow, we have a recurrence.\n\nWith our $DP$ state defined as\n\n$$ DP[i] = \text{ size of largest indepdendent set in subtree rooted at i} $$\n\nwe can define the recurrence as follows\n\n$DP[i] = max(1 + \sum_{grandchildren \ x} DP[x] , \sum_{children \ y} DP[y])$\n\nThe first term is the maximum answer attainable when **including $i$.** The second term is the maximum attainable when **not including $i$.** These are the only two conditions possible. A short implementation follows below.\n\n### Time complexity\n\nNotice that we have $O(n)$ subproblems to solve, where $n = |V|$. Each subproblem does work proportional to the number of children and grandchildren of its node, which sums to $O(n)$ over the whole tree. Therefore the overall time complexity of our algorithm is $O(|V|)$.\n\n### Space complexity\n\nWe have $O(n)$ subproblems to solve. This gives us a space complexity of $O(n)$.\n
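Here's a minimal C++ sketch of this tree DP (the function name and adjacency-list representation are my own choices for illustration):\n\n```cpp\n#include \u003calgorithm\u003e\n#include \u003cvector\u003e\nusing namespace std;\n\n// dp[u] = size of the largest independent set in the subtree rooted at u.\nint solve(int u, int parent, const vector\u003cvector\u003cint\u003e\u003e\u0026 adj, vector\u003cint\u003e\u0026 dp) {\n    int take = 1, skip = 0; // take u vs. don't take u\n    for (int c : adj[u]) {\n        if (c == parent) continue;\n        skip += solve(c, u, adj, dp);  // best answers for children's subtrees\n        for (int g : adj[c])           // if u is taken, children are skipped and\n            if (g != u) take += dp[g]; // we take the grandchildren's answers\n    }\n    return dp[u] = max(take, skip);\n}\n```\n# References\nThese notes are old and I did not rigorously hoard references back then. If some part of this content is yours or you know where it's from then do reach out to me and I'll update it. \n1. Professor [Kannan Srinathan's](https://www.iiit.ac.in/people/faculty/srinathan/) course on Algorithm Analysis \u0026 Design in IIIT-H",
- "lastmodified": "2026-01-11T05:06:26.026999644+05:30",
+ "lastmodified": "2026-01-11T10:59:38.254074614+05:30",
 "tags": [
 "domain-cs-algorithms-analysis"
 ]
 },
 "/blog/smart-pointers": {
 "title": "Smart Pointers",
 "content": "\nResources referred to:\n1. [SMART POINTERS in C++ (std::unique_ptr, std::shared_ptr, std::weak_ptr) - The Cherno](https://www.youtube.com/@TheCherno)\n2. [Back to Basics: Smart Pointers and RAII - Inbal Levi - CppCon 2021](https://www.youtube.com/@CppCon)\n3. [Back to Basics: C++ Smart Pointers - David Olsen - CppCon 2022](https://www.youtube.com/@CppCon)\n\nBefore reading this section, I recommend reading the previous section on [`new` and `delete`](/blog/new-and-delete) to get a better idea of the problem(s) we have with memory allocation and manipulation and how we're trying to fix them. 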
\n\n# References\nThese notes are old and I did not rigorously hoard references back then. If some part of this content is yours or you know where it's from then do reach out to me and I'll update it. \n1. Professor [Kannan Srinathan's](https://www.iiit.ac.in/people/faculty/srinathan/) course on Algorithm Analysis \u0026 Design in IIIT-H", - "lastmodified": "2026-01-11T05:06:26.026999644+05:30", + "lastmodified": "2026-01-11T10:59:38.254074614+05:30", "tags": [ "domain-cs-algorithms-analysis" ] }, @@ -405,7 +405,7 @@ "/blog/smart-pointers": { "title": "Smart Pointers", "content": "\nResources referred to:\n1. [SMART POINTERS in C++ (std::unique_ptr, std::shared_ptr, std::weak_ptr) - The Cherno](https://www.youtube.com/@TheCherno)\n2. [Back to Basics: Smart Pointers and RAII - Inbal Levi - CppCon 2021](https://www.youtube.com/@CppCon)\n3. [Back to Basics: C++ Smart Pointers - David Olsen - CppCon 2022](https://www.youtube.com/@CppCon)\n\nBefore reading this section, I recommend reading the previous section on [`new` and `delete`](/blog/new-and-delete) to get a better idea of the problem(s) we have with memory allocation and manipulation and how we're trying to fix them. We figured out how to work around / solve the uninitialized memory problem, but we still have to deal with the issue of memory leaks and dangling pointers.\n# Preface\n\nAs programmers, when working in a large code base, it is often difficult to manually keep track of all the memory allocations and remember to free them correctly. Tools like `valgrind` can help identify memory leaks, but it's still a pain to run a massive project on it. Even worse, sometimes programmers are just too lazy to free memory correctly. \n\nJava and other more \"modern\" languages have the idea of keeping [garbage collectors](/blog/garbage-collectors). Its primary purpose is to automatically manage memory by identifying and reclaiming memory that is no longer needed or accessible by the program, thus preventing memory leaks and ensuring efficient memory usage. However, the existence of a garbage collector means that the program can only be run on a system with a runtime. For Java, this would be the JVM. This introduces a performance overhead because the garbage collector is a tool that is actively working in the background to identify unreachable objects and free them. Being C++ nerds, we don't want performance bottlenecks. \n\n![pasted-image-20230909082901](/images/pasted-image-20230909082901.webp)\nsource: [Back to Basics: Smart Pointers and RAII - Inbal Levi - CppCon 2021](https://www.youtube.com/@CppCon)\n\nInstead, the goal for C++ is to introduce an \"API\" of sorts that programmers can use to manage their memory right. Smart pointers are a cool interface provided by the C++ standard library to leverage the power of runtime stack allocation to manage memory efficiently by automating the process of calling `new` and `delete`. In essence, smart pointers are just simple wrappers around raw pointers.\n\n# The ownership model\n\nThe way C++ tries to solve the automatic memory management problem is by introducing the ownership model. Smart pointers **enforce** this model for dynamically allocated heap memory. \n- `std::unique_ptr` =\u003e Represents a \"single owner\" model. Memory managed by an `std::unique_ptr` can only \"be owned\" by that one instance of a unique pointer. It cannot be owned (copied) by multiple unique pointer instances. The only way to *change* ownership is to **move** the ownership to a different instance of a unique pointer. Original pointer releases, new pointer acquires. \n- `std::shared_ptr` =\u003e This builds on the unique pointer by now allowing a resource to be shared / \"owned\" by multiple `std::shared_ptr` instances. Multiple shared pointers can now share and copy the ownership rights over the managed resource. The memory being managed is freed only when *every last* owner of the shared pointer has gone out of scope.\n- `std::weak_ptr` =\u003e This is a weaker version of the shared pointer. Weak pointers can copy / gain weak ownership from a shared pointer. This means that it does **not** have *strong* ownership rights. If the strong owners go out of scope, the weak pointer will be invalidated. In essence, it does not hold any power over when the pointer may be invalidated / cleaned up.\n# `std::unique_ptr`\n\nA `unique_ptr` is perhaps the simplest type of a smart pointer. It's a scoped object, which just means that when the pointer goes out of scope, it gets destroyed. Unique pointers are called unique pointers because you **cannot copy** a unique pointer. Why? Because all a unique pointer really is, is just a `class` wrapper around your raw pointer. 
If it was copied, we would now have **2** instances of this *manager class*. When they go out of scope, we call the destructor twice, and the second `delete` would be a double free on memory that has already been released.\n\n```cpp\n// My sample simple implementation of an unique_ptr\ntemplate\u003ctypename T\u003e\nclass unique_ptr{\npublic:\n\texplicit unique_ptr() : obj(nullptr) {}\n\texplicit unique_ptr(T* obj) : obj(obj) {}\n\tunique_ptr(const unique_ptr \u0026other) = delete;\n\t~unique_ptr() { delete obj; }\nprivate:\n\tT* obj;\n};\n```\n\nLet's go back to our toy-class `Entity` (from [`new` and `delete`](/blog/new-and-delete)) and see how it works now:\n\n```cpp\nstd::unique_ptr\u003cEntity\u003e oobj(new Entity); // Output: Constructor!\\nDestructor!\\n\nunique_ptr\u003cEntity\u003e fobj(new Entity); // Output: Constructor!\\nDestructor!\\n\n```\n\nNote that the `explicit` constructors and deleted copy-constructor mean that the following code will not compile.\n\n```cpp\nstd::unique_ptr\u003cEntity\u003e obj = new Entity; \n```\n\u003e `conversion from ‘Entity*’ to non-scalar type ‘std::unique_ptr\u003cEntity\u003e’ requested`\n\n\u003chr\u003e\n### `std::make_unique`\n\nThe 'recommended' way to initialize a unique pointer in C++ is using `std::make_unique\u003cT\u003e`. The primary reason this is recommended is because of **exception safety.** If the constructor happens to throw an exception, we won't end up with a memory leak or dangling pointer. `make_unique` is basically a way to shorten writing:\n\n```cpp\n// allocation. `new` can throw an exception if constructor fails.\nEntity *b = new Entity(\"42\"); \n// Hand the memory to unique_ptr to manage\nstd::unique_ptr\u003cEntity\u003e uptr(b); \n\n// OR do both in one-step with std::make_unique\nstd::unique_ptr\u003cEntity\u003e uptr = std::make_unique\u003cEntity\u003e(\"42\"); \n```\n\nHere is an example where this exception safety is of good use:\n\n```cpp\n// unsafe\nfoo(std::unique_ptr\u003cint\u003e(new int(4)), std::unique_ptr\u003cint\u003e(new int(2))); \n// safe\nfoo(std::make_unique\u003cint\u003e(4), std::make_unique\u003cint\u003e(2));\n```\n\nIn the first example, first of all we have no guarantee of order of evaluation. From [cppreference.com](https://en.cppreference.com/w/cpp/language/eval_order):\n\u003e Order of evaluation of any part of any expression, including order of evaluation of function arguments is _unspecified_ (with some exceptions listed below). The compiler can evaluate operands and other sub-expressions in any order, and may choose another order when the same expression is evaluated again.\n\n[IBM Docs](https://www.ibm.com/docs/en/zos/2.2.0?topic=only-stack-unwinding-c)\n\u003e When an exception is thrown and control passes from a try block to a handler, the C++ run time calls destructors for all automatic objects constructed since the beginning of the try block. This process is called stack unwinding. The automatic objects are destroyed in reverse order of their construction.\n\nSay the 2nd one throws an exception in the constructor, its guarding destructor will not be called and we'll be left with a memory leak. Because order of evaluation is not guaranteed we can't even easily determine the leak.\n\n\u003chr\u003e\n\nComing back to unique pointers, we know that it is unique because the copy and copy-assignment constructors have been deleted and that the memory gets freed only when the destructor is called. Not true! There are 4 cases in which a unique pointer lets go of the memory it manages. \n1. 
When the object goes out of scope\n2. When we **`move`** a resource from one unique pointer to another\n\n```cpp\n\tauto e1 = std::make_unique\u003cEntity\u003e(\"abcd\");\n\tauto e2 = std::move(e1); \n\t// Leaves e1 in an 'invalidated' state (implementation defined). Accessing e1 is UB.\n```\n\n3. Explicitly `release` ownership. This stops the unique pointer instance from actively managing the raw memory and returns the raw pointer.\n\n```cpp\n\tT *raw_ptr = e1.release(); // Does NOT free the memory; we now own raw_ptr and must delete it ourselves\n```\n\n4. Terminate the object and replace the ownership\n\n\t```cpp\n\te1.reset(new Entity(\"efgh\")); // replace the ownership\n\t```\n\n#### Custom Destructors!\n\nA cool thing about smart pointers is that they accept custom destructors. For example:\n\n```cpp\nEntity *e = new Entity;\nstd::unique_ptr\u003cEntity, std::function\u003cvoid(Entity*)\u003e\u003e uptr(e, [\u0026](Entity *e){\n\tstd::cout \u003c\u003c \"Custom destructor!\" \u003c\u003c std::endl;\n});\n// Output: Constructor!\\nCustom destructor!\\n\n```\n### A revised implementation of `std::unique_ptr`\n\n```cpp\n// A slightly superior implementation.\ntemplate\u003ctypename T\u003e\nclass unique_ptr{\npublic:\n\texplicit unique_ptr() noexcept : obj(nullptr) {}\n\texplicit unique_ptr(T* obj) noexcept : obj(obj) {}\n\tunique_ptr(const unique_ptr\u0026) = delete;\n\tunique_ptr\u0026 operator=(const unique_ptr\u0026) = delete;\n\tunique_ptr(unique_ptr\u0026\u0026 other) noexcept : obj(other.release()) {}\n\tunique_ptr\u0026 operator=(unique_ptr\u0026\u0026 other) noexcept {\n\t\tif(this != \u0026other)\n\t\t\treset(other.release());\n\t\treturn *this;\n\t}\n\t~unique_ptr() noexcept { delete obj; }\n\n\tT* get() const { return obj; }\n\tT* release() {\n\t\tT* cpy = obj;\n\t\tobj = nullptr;\n\t\treturn cpy;\n\t}\n\tvoid reset(T *upd) noexcept {\n\t\tdelete obj;\n\t\tobj = upd;\n\t}\nprivate:\n\tT* obj;\n};\n```\n\n# `std::shared_ptr`\n\nAs previously mentioned, a `std::shared_ptr` is like a unique pointer that additionally allows 'sharing' ownership. This means that shared pointers can be copied and assigned. \n\n```cpp\nstd::shared_ptr\u003cEntity\u003e outer;\n{\n\tstd::cout \u003c\u003c \"Start of inner scope\" \u003c\u003c std::endl;\n\tstd::shared_ptr\u003cEntity\u003e inner = std::make_shared\u003cEntity\u003e(\"abcd\");\n\touter = inner; // copy ownership out to the enclosing scope\n\tstd::cout \u003c\u003c \"End of inner scope\" \u003c\u003c std::endl;\n}\n/**\n * Output:\n * Start of inner scope\n * P-Constructor!, name: abcd\n * End of inner scope\n * Destructor!\n */\n```\n\nYou will notice that we were now able to use the copy-assignment operator with our shared pointer object. And further, even though the inner shared pointer is out of scope, the destructor is called only after the outer shared pointer (which received ownership via the copy-assignment operator) goes out of scope.\n\n\u003chr\u003e\n## `std::make_shared`\n\nApart from the same reasons listed for `std::make_unique`, there are more reasons to use `std::make_shared` instead of `std::shared_ptr\u003cEntity\u003e inner(new Entity())`. The reason somewhat comes down to the implementation and overhead associated with shared pointers.\n\n\u003chr\u003e\n### Implementation Notes\n\nHow `std::shared_ptr` is implemented is ultimately up to the compiler and what standard library we are using. It is implementation specific, and there is no standard defined for *how* the sharing must be implemented. 
However, it is almost always implemented in the popular libraries using **reference counting.**\n\nWhat this means is that a shared pointer essentially manages two blocks of memory. There is a \"control block\" which contains information regarding the reference count and then there's the memory that's being managed. In essence, you can think of the control block as being a dynamically allocated integer object that keeps count of the number of in-scope shared pointers which hold ownership over the managed memory. \n\nThis memory only needs to be allocated once, in the normal constructor of a shared pointer, because this is when we are stating the existence of a new ownership. Now when ownership is being copied, we just need to increment this reference count. This can be done in the copy constructor and copy-assignment operator calls. Finally, as each shared pointer goes out of scope and its destructor is called, it can just decrement the value of the reference count. When the final shared pointer goes out of scope, decrementing the reference count to zero, we know that there exist no more shared pointers with ownership of the memory being managed, and hence we can then safely de-allocate both blocks of memory.
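\n\nMirroring the earlier `unique_ptr` exercise, here is my own simplified sketch of what such a reference-counted shared pointer could look like (assumptions: single-threaded, no weak-pointer support; real implementations use atomic counters and pack more state into the control block):\n\n```cpp\n#include \u003ccstddef\u003e\n\n// Toy reference-counted shared_ptr. NOT production quality: no atomics, no weak counts, no custom deleters.\ntemplate\u003ctypename T\u003e\nclass shared_ptr{\npublic:\n\texplicit shared_ptr(T* obj = nullptr) : obj(obj), ref_count(new std::size_t(1)) {}\n\tshared_ptr(const shared_ptr \u0026other) : obj(other.obj), ref_count(other.ref_count) {\n\t\t++*ref_count; // copying ownership: just bump the shared counter\n\t}\n\tshared_ptr\u0026 operator=(const shared_ptr \u0026other){\n\t\tif(this != \u0026other){\n\t\t\tdrop();\n\t\t\tobj = other.obj;\n\t\t\tref_count = other.ref_count;\n\t\t\t++*ref_count;\n\t\t}\n\t\treturn *this;\n\t}\n\t~shared_ptr() { drop(); }\nprivate:\n\tvoid drop(){\n\t\tif(--*ref_count == 0){ // last owner gone: free both blocks\n\t\t\tdelete obj;\n\t\t\tdelete ref_count;\n\t\t}\n\t}\n\tT* obj; // the managed memory\n\tstd::size_t* ref_count; // the \"control block\": one counter shared by all owners\n};\n```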
\n\n\u003chr\u003e\n\nWhy does this matter? Unlike with `std::unique_ptr`, there is an overhead associated with using a `std::shared_ptr` in the form of the control block memory that it must additionally allocate and share. If we call `std::shared_ptr\u003cEntity\u003e inner(new Entity())`, there is first an allocation in the inner call to the `new` operator. This is then followed by an extra call to allocate the control block memory. That's 2 allocations. \n\nHowever, with `std::make_shared`, it can actually construct them **together**, essentially halving the allocation requests and also keeping them close by in memory. This is significantly faster. (Remember, memory allocation cost is often **very** expensive in comparison to the other book-keeping operations here). Hence it's almost always a good idea to use `std::make_shared` instead of passing an already allocated memory-pointer to `std::shared_ptr`.\n\n\u003chr\u003e\n\n# `std::weak_ptr`\n\n`std::weak_ptr` is the final member of our little group of smart pointers which completes C++'s ownership ideology. A `std::weak_ptr` can copy ownership from a `std::shared_ptr`, except this ownership is *weak*. You can imagine it as a shared pointer which, when copying, does **not** increase the reference count of the original shared pointer. Due to this, it is possible the weak pointer is still in scope but because all the *strong* owners of the managed memory have gone out of scope, the memory has been freed and the weak pointer is now in an **invalidated** state.\n\nIt's like saying, I don't actually *want ownership* of the object, but I just want to keep a reference to the allocated entity. This means `std::weak_ptr` has member functions that allow querying things like \"is the memory that the weak pointer is pointing to still alive?\" \n\n```cpp\nstd::weak_ptr\u003cEntity\u003e outer;\nstd::cout \u003c\u003c \"Outer weak_ptr use count: \" \u003c\u003c outer.use_count() \u003c\u003c std::endl;\nstd::cout \u003c\u003c \"Outer weak_ptr expired: \" \u003c\u003c outer.expired() \u003c\u003c std::endl;\n{\n\tstd::cout \u003c\u003c \"Start of inner scope\" \u003c\u003c std::endl;\n\tstd::shared_ptr\u003cEntity\u003e inner = std::make_shared\u003cEntity\u003e(\"abcd\");\n\tstd::cout \u003c\u003c \"Inner shared_ptr use count: \" \u003c\u003c inner.use_count() \u003c\u003c std::endl;\n\touter = inner;\n\tstd::cout \u003c\u003c \"Outer weak_ptr use count: \" \u003c\u003c outer.use_count() \u003c\u003c std::endl;\n\tstd::cout \u003c\u003c \"Outer weak_ptr expired: \" \u003c\u003c outer.expired() \u003c\u003c std::endl;\n\tstd::cout \u003c\u003c \"End of inner scope\" \u003c\u003c std::endl;\n}\nstd::cout \u003c\u003c \"Outer weak_ptr use count: \" \u003c\u003c outer.use_count() \u003c\u003c std::endl;\nstd::cout \u003c\u003c \"Outer weak_ptr expired: \" \u003c\u003c outer.expired() \u003c\u003c std::endl;\n/**\n * Output:\n * Outer weak_ptr use count: 0\n * Outer weak_ptr expired: 1\n * \t\tStart of inner scope\n * \t\tP-Constructor!, name: abcd\n * \t\tInner shared_ptr use count: 1\n * \t\tOuter weak_ptr use count: 1\n * \t\tOuter weak_ptr expired: 0\n * \t\tEnd of inner scope\n * Destructor!\n * Outer weak_ptr use count: 0\n * Outer weak_ptr expired: 1\n */\n```\n\nThe above code block shows the working of the two pointers succinctly. The weak pointer does not increase the `use_count` of the shared pointer. And as soon as the shared pointer exits the inner scope, the memory is freed and our weak pointer now points to invalidated memory, as shown in the output of `outer.expired()`.\n", - "lastmodified": "2026-01-11T05:06:26.409989481+05:30", + "lastmodified": "2026-01-11T10:59:38.632717095+05:30", "tags": [ "domain-cs-languages-cpp" ] }, @@ -413,7 +413,7 @@ "/blog/stock-multiples": { "title": "Stock Multiples", "content": "When deciding how to invest in stocks, we all know that it's best to \"buy low and sell high\", but when is a stock price *low*? And when is it considered *high*? Are there more quantifiable ways to measure these qualitative terms? Stock multiples try to solve this problem by helping traders figure out how much you pay for a stock's underlying business and if this price has changed over time. Essentially, a stock multiple is a ratio that compares the current stock price to some *fundamental* quantity of the stock's underlying business. In general, the higher the multiple, the more expensive the stock is considered to be. The idea can be more intuitively explained via the following example:\n\u003eIn shopping for Pork, Beef, or Chicken it's difficult to compare the total prices since the quantity you get is different for each cut. But if we look at the *price per pound*, you can easily figure out which cut is the best bang for your buck. Multiples work in a similar way, allowing us to compare the price of a stock to the underlying fundamentals you get with the purchase - [Stock Multiples: How to Tell When a Stock is Cheap/Expensive - The Plain Bagel](https://www.youtube.com/watch?v=21STUhQ-iP0)\n# PE Ratio\nA PE Ratio is a stock multiple which compares a company's current stock price to its earnings per share (EPS). 
It's one of the most popularly used stock multiples and it helps assess the relative value of a company's stock. It's very useful when used to compare a company's valuation against its *historical performance*, or even other firms in the industry or the entire market in general. ([What is the Stock Market?](/blog/what-is-the-stock-market))\n\nFor example, let's say company $A$ was split into 20,000 shares and each share was currently trading for ₹10 in the open market, and the company's earnings (net income) for the *previous* year was say ₹10,000. Then the EPS is $\\frac{10000}{20000} = 0.5$. Computing the PE from this: $PE = \\frac{\\text{Price}}{\\text{EPS}} = \\frac{10}{0.5} = 20$. \n\nNote that in this example, we used the *previous* year's earnings. But this might not be very representative of how the company will do this year. Perhaps it's an oil company and the government has just placed several environmental restrictions on it that might restrict the profit-making abilities of the company significantly. Or perhaps the company was involved in some massive scandal which caused consumers to lose faith in the company's products. Regardless, this measure of EPS is a *trailing* measure. And hence this computation of the PE ratio is called the **Trailing P/E**. However, it is also possible for analysts to try to estimate the earnings of the business for the current financial year using publicly available data and compute a new PE using these *expected* earnings per share. It is then called the **Forward P/E**. While forward P/E can escape the traps that trailing P/E is susceptible to, it has its own drawbacks. The primary one being that expected EPS is, as the name says, expected. These estimations may not pan out and then we would have some unexpected error margin to deal with. \n\nNow, let's say we have decided to use one of the P/E measures and compute the value to be $x$. What can we infer about a stock's price based on this value? Using just $x$, we can't say much. This is because P/E is a relative measure. It does not make sense by itself. But we *can* compare the stock's current P/E to its historical P/E values, or even with the stocks of other companies in the same sector. \n## Historical Comparison\n\u003e\n\u003e![pasted-image-20240507113833](/images/pasted-image-20240507113833.webp)\n\u003e\n\u003eIf the company's earnings are expected to increase, but the price of the stock has fallen, it would mean that the multiple has contracted, and investors don't value the profitability of the firm as much as they used to. Alternatively, if earnings are falling but the price has risen, the multiple has expanded, meaning people are paying more for less profit. We could also compare the P/E multiple to the stock's long-term average to see whether the margin is larger or smaller than normal. If the stock's ten-year average P/E is 15 times, for example, we can assume that the stock's multiple is temporarily cheaper than normal and may want to pick up the stock in case the multiple later expands back to its long-term average. Then we could earn a return even if the company's earnings are flat. 
A key assumption here is that a multiple is expected to revert to its mean over time, and while that doesn't always hold true, investors sometimes look for extreme variations from the mean, with many believing that short-term volatility in the stock, which could be caused by a bad press release or negative near-term headwind, will eventually subside, causing the multiple to return to its normal level. - [Stock Multiples: How to Tell When a Stock is Cheap/Expensive - The Plain Bagel](https://www.youtube.com/watch?v=21STUhQ-iP0)\n## Industry Comparison\nComparison of the multiple w.r.t. its historical performance is useful, but it's important to compare it against other companies in the industry as well. Of course, a company might have a higher P/E than its peers simply due to having a better culture, better marketing team, etc., but it's still useful to directly compare how two companies are valued against each other simply due to the fact that the underlying product they are selling remains the same. If a stock $A$ has a higher multiple than stock $B$, where both the underlying companies only sell footballs, it does not immediately imply that company $A$'s stock is worse bang for your buck. P/Es are limited in the information that they are able to capture. Company $A$ might be growing at a much faster pace than company $B$, or it might have the necessary comparative advantage to quickly explore and capture market share in other industries, say football pumps or soccer shoes. This is why stocks in the tech space usually have higher multiples than their peers since they tend to have high growth potential. \n\nA multiple contraction might imply that it is trading at a lower price than it's supposed to, and might signal a potential buying opportunity, but it could also be a value trap. A value trap is where an investor buys a cheaper lower-quality item just because it's cheaper. Sometimes the contraction could be justified, and in other situations it might signal a good buying opportunity. It is therefore important to use stock multiples along with a strong understanding of the fundamentals of the company to decide which of the two we as investors believe the contraction to be a result of. Multiples are a very handy way to quickly gauge a stock's relative price. Some investors even contend that the P/E ratio in particular is very limited since EPS can easily be manipulated by accounting decisions and manipulation. \n\n\u003e\"It's far better to buy a wonderful company at a fair price... than a fair company at a wonderful price.\" - Warren Buffett\n\n# References\n1. [Stock Multiples: How to Tell When a Stock is Cheap/Expensive - The Plain Bagel](https://www.youtube.com/watch?v=21STUhQ-iP0)", - "lastmodified": "2026-01-11T05:06:27.207759926+05:30", + "lastmodified": "2026-01-11T10:59:39.417884866+05:30", "tags": [ "domain-finance-quantitative" ] }, @@ -421,7 +421,7 @@ "/blog/technical-analysis": { "title": "Technical Analysis", "content": "# Systematic vs Discretionary Trading\nTrading strategies can be classified into two broad categories:\n## Systematic Trading\nSystematic trading involves adhering to a predefined set of rules or algorithms to execute trades. These rules are usually based on historical data analysis and quantitative modelling. The advantage of these systems is that you can provably verify your hypothesis / system on real-world data using back-testing / forward-testing and obtain quantifiable metrics to evaluate and build trust in your algorithm. 
It is dependable. \n### Caution\n- It is however **very** important to note that such an algorithmic system is usually NOT one that can self-sustain unmonitored. Essentially, having an algorithmic system in place does not mean that the trader employing the strategy can just expect it to work in all markets and leave it unmonitored. Some systems might work very well in bull markets and work very poorly in ranging markets. It is important to identify the right situations to use the right model and have sufficient risk management policies in place to protect against any catastrophic failures. There are very few systems which work on all types of markets. Focus on a portfolio of strategies; no single strategy will save you. Your strategy will stop working at some point. Don't grow attached; it won't feel anything when it loses you money.\n- It is also important to note that even in algorithmic trading, a strategy is only as good as your discipline to follow it. Many algorithmic traders have made large losses due to changing their strategy when it was live before ample testing and not trusting their own back-tested data. You must have belief in your system and not make discretionary choices.\n## Discretionary System\nDiscretionary trading, on the other hand, relies on the trader's judgement, intuition, and subjective interpretation of the market conditions to make trading decisions. Discretionary trading is subject to heavy emotional bias, and discretionary traders only have their own experience, market knowledge, and instinct to base trading decisions on. Most retail investors do discretionary trading. This is not advised unless you are extremely experienced in the field and have a lot of intuition to rely on when making trading decisions. A common form of professional discretionary trading is Scalp Trading. The primary disadvantage here is that as a human you are extremely susceptible to your decisions being swayed by news, social media, and other such irrelevant factors, negatively affecting your ability to make good trading decisions.\n# Trading System\nA system in trading refers to a structured framework of rules that guides the trader in making decisions regarding when to enter and exit positions in the market. These rules are developed through market analysis and optimization to maximize profit potential while minimizing risk. To test whether a system is effective or not we back-test it on old data. Back-testing is the backbone of creating your system. After this, we can also forward test the model on live data to further build confidence. \n\n\u003e Good systems generally have a very small difference between the profit and loss percentage, but that small margin is enough to make big profits because in any good system when you earn, you earn big and when you lose you lose small, so effectively you end up with a pretty high profit. - [Basics of Stock Trading - Honestly by Tanmay Bhat](https://www.youtube.com/playlist?list=PLhKwz7hYMTDVUXV-hkJ2wnwnQECzn-egm)\n\nThe idea here is that we set strict stop losses. This might increase our loss percentage overall but it ensures that we can set a strict upper limit on our max loss even before we enter any trade. We trade some wins for the security of limiting our loss amount in the losing trades. And with a good system we should ensure that the money we gain in winning trades is much higher than the losses realized in losing trades.
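\n\nTo make this concrete with some purely hypothetical numbers: say a system wins only $40\\%$ of its trades, but strict stops cap the average loss at $1R$ (one unit of risk) while the average win runs to $3R$. The expected value per trade is then\n$$E = 0.4 \\times 3R - 0.6 \\times 1R = 0.6R$$\nso the system is comfortably profitable despite losing more often than it wins.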
\n\nOne trick to handle both greed and risk in winning trades is to incorporate a trailing stop-loss. \n## Trailing Stop-Loss\nThe idea is pretty simple. Let's say our stop loss on some trade is 10% from entry and now let's say the stock has gained 10% profit. We can then dynamically move the stop loss to the entry point. This way we can greedily stick with the trade in hopes of getting more profit while ensuring that the maximum possible loss on the trade is 0%. Similarly, if we get to 20% profit we can move the stop loss to 10% and so on. We greedily increase the stop loss by some function while sticking with the trade. \n## How to Make a System\nTo make a system we need to learn technical analysis. Technical analysis is essentially a fancy term for \"finding patterns in charts and price-volume data and using that to predict price movements.\" In theory, technical analysis is bullshit. The stock market ([What is the Stock Market?](/blog/what-is-the-stock-market)) is just a market for companies to request investment from speculative investors and the stock price is supposed to be reflective of the company's earnings and growth in that period. It should theoretically not be possible to obtain this number simply from reading candlestick chart patterns. But the arguments given for technical analysis are as follows:\n- Large fundamental analysis firms, who control most of the stock price movements (since they're the massive institutional investors) do their research well in advance and they are far more accurate than the research an average retail investor can afford to conduct. As a consequence, they are likely to take decisions long before indicators that retail investors use for investment come out. Essentially, the stock price should've already corrected for the negative/positive news or earnings report that might be released to the public later as we can trust these large institutional investors to have done their due diligence well in advance. When an institutional investor makes a large volume trade on the market, it is often broken down into smaller trades over a longer period of time, and it is these patterns that technical analysis hopes to find. Following these patterns is essentially trusting the well-funded research of institutional investors over your own. \n- As someone once said, *\"Technical analysis works because other people believe in technical analysis.\"* If we model the stock market as a game that we are designing an algorithm to win money in, our algorithm which follows some strategy $A$ will always have a counter strategy $A'$ that it will lose a lot of money to. Because there are so many players with large amounts of money playing this \"game\", it is often possible to design a strategy that beats the strategy of other investors and hence technical analysis is able to make a profit. \n# How to Do Technical Analysis? \nTechnical analysis is the study of price movement.\n- **Recognition of Patterns:** Technical analysis operates on the premise that historical price patterns tend to repeat themselves due to recurring human emotions such as greed and fear. Patterns can also indicate institutional investment as mentioned previously. These patterns manifest as identifiable trends in the market. \n- **Emphasis on Price:** Contrary to fundamental analysis, which considers external factors such as news and economic indicators, technical analysis prioritizes price action. The price reflects all available information, making it a reliable indicator of market sentiment and future price movements. Stems from [THE EFFICIENT MARKET HYPOTHESIS](/blog/the-efficient-market-hypothesis). 
Slightly paradoxical. \n## Trends\n\n ![pasted-image-20240311214037](/images/pasted-image-20240311214037.webp)\n \n- **Uptrend →** Higher lows\n- **Downtrend →** Lower highs\n- **Sideways →** Consistent highs and lows around the same value\n\n**Don’t follow the news, follow the price.** Retailers and institutional investors involved in the market who do fundamental analysis have done far more research than you could hope to do and have good deductions about the condition of the market, say even 6 months down the line. This is not something that is feasible for us to do. However, based on these deductions, big groups of retailers start buying/selling stocks and this will be reflected in the price. The news always follows the market. The news' findings should've already been reflected and accounted for in the stock price. \n## Candlestick Charts\nThe solid body of the candlestick represents the price range between the open and close of that day’s trading. When the real body is filled in black / red then the close was lower than the open. When it is not filled in / filled in green then the close was higher than the open. The two “wicks” on either end represent the **highest / lowest** price at which the stock was traded for on that day.\n\n![pasted-image-20240311214150](/images/pasted-image-20240311214150.webp)\n\n### Time-Frames\nChoosing different time-frames can greatly alter the granularity at which data is visualized and can show varying trends. Shorter time-frames, like hourly or even minute-by-minute charts, provide detailed insights into intraday price fluctuations. Conversely, longer time-frames, such as weekly or monthly charts, smooth out noise and highlight broader trends, making them useful for identifying significant market movements and potential long-term patterns. Patterns in larger scale time-frames are often an indication that there is a lot of money behind them. This is interesting because when a lot of money is invested, it is very likely to be backed by big institutional investment. In weekly time-frames, the open is the open of the market on Monday morning and the close is the close on Friday evening.\n## Todo: Support \u0026 Resistance\n\n## Indicators\n### Moving Average (`Ma`)\nAn $n$ day moving average is a single number used to obtain the average of a financial instrument's price over the last $n$ days. Computing the moving average over several days can just be thought of as sliding a window containing the terms we're averaging along the time axis. It's essentially a constantly updated *average price*.\n$$\n\\begin{aligned}\n\\text{MA} = \\frac{1}{N} \\sum_{i=0}^{N-1} P_{i} \\\\\n\\end{aligned}\n\\begin{aligned}\n\\quad \\quad \\quad\n\u0026MA \\text{ is the Moving Average} \\\\\n\u0026N \\text{ is the number of time periods used to calculate the } MA \\\\\n\u0026P_i \\text{ is the representative price of the stock at time period } i \\\\\n\\end{aligned}\n$$\nThe moving average can be thought of as a more smoothed-out representation of all the noise and fluctuations inherent to a stock's price movement at shorter timescales. Also note, the moving average is a *trend-following* or *lagging* indicator. This is because it is based on past prices and is not dependent on future prices.
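\n\nAs a quick illustration (my own sketch, not from the referenced sources), the sliding-window computation looks like this:\n\n```cpp\n#include \u003cvector\u003e\n\n// n-day moving average over a series of closing prices.\nstd::vector\u003cdouble\u003e moving_average(const std::vector\u003cdouble\u003e \u0026prices, int n) {\n std::vector\u003cdouble\u003e ma;\n double window_sum = 0;\n for (int i = 0; i \u003c (int)prices.size(); ++i) {\n window_sum += prices[i]; // newest price enters the window\n if (i \u003e= n) window_sum -= prices[i - n]; // oldest price leaves it\n if (i + 1 \u003e= n) ma.push_back(window_sum / n); // a full window exists: record its average\n }\n return ma;\n}\n```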
\n### Use Case #1: Identifying Trends\nMoving averages are very useful visual indicators to identify the general trend of stock movements. There are multiple ways you can use the $MA$ to judge trends. \n#### Price vs Ma\nIf the price is consistently and currently above the moving average line, then you can consider it an uptrend and should look for buying opportunities. On the other hand, if the price is consistently below, then it's a downtrend and you should look for exit opportunities. \n\n![pasted-image-20240506185056](/images/pasted-image-20240506185056.webp)\n\nAs can be seen with the sideways ranging markets, it's important to wait for a consistent showing of price above or below the trend line before categorizing a stock to be in an upward or downward trend. \n#### Using Two Moving Averages\nWe can also use 2 MAs to identify trends. For example, the 20 MA is a short term moving average which gives you the short term trend of the market. And the 200 MA is a long term moving average which gives you the long term trend of the market. If we just used the 20 MA, the few times the price negatively crossed the 20 MA might have been identified as possible sell opportunities. But comparing it with the 200 MA, we can tell that the price is still in an overall uptrend and hence conclude that selling in those short term dip periods is risky. \n\n![pasted-image-20240506190104](/images/pasted-image-20240506190104.webp)\n\n*\"Using two moving averages will not provide you with better trades. But it will help you avoid bad trades.\"*\n\nWhen working with moving averages, it is very important to understand that moving averages should not be used as a buy or sell signal, but primarily as a confirmatory indicator to ensure that we aren't entering a bad trade against the trend when using trend based strategies. As you might note from the above figures, the moving average strategy works exceptionally poorly in sideways ranging markets and will often flip continuously between short-term buy / sell opportunities. A trader acting on these impulses in a sideways ranging market is likely to make a loss. It is hence imperative to note that this strategy works only in **trending** markets. As mentioned before, it is very important to be able to decide when to employ what strategy. \n### Use Case #2: Identifying Trend Reversals\nThis idea is extremely similar to the idea of using two moving averages. Say we use the 50 MA and the 200 MA; we always want a smaller and a larger MA. The cross over points of these lines could often signify potential trend reversals in the market. This is one of the most common beginner 'learn-indicators' introductions to the stock market. In practice, do not place trades at these reversals, but use these signals along with other factors to make better trade decisions. \n\n![pasted-image-20240506191205](/images/pasted-image-20240506191205.webp)\n\n### Use Case #3: Identifying Dynamic Support and Resistance\nWhen talking about support and resistance, we usually talk about a fixed price level at which we expect the price to reverse and bounce back and forth between. However, these 'fixed' levels are often not fixed, especially in trending markets. \n\n![pasted-image-20240506192308](/images/pasted-image-20240506192308.webp)\n\nIn these situations it makes a lot of sense to also use the moving average lines as a dynamic moving set of support and resistance points. You can then (in an uptrend) choose to buy close to the MA line. Similarly in a downtrend, you can identify selling opportunities when the stock price is close to the MA line. \n### Conclusion\nIn short, this indicator works relatively well only in trending markets. 
It is hence imperative to use it only in markets that you can clearly classify as trending. Some pros include that the indicator is very simple and easy to build or incorporate into strategies. There are clear entry OR exit points. The primary disadvantage is that you need to be able to identify whether a market is trending or ranging sideways. Further, you also don't have any easy way to identify take-profit targets using just this indicator. Another possible pro with this indicator is that in practice, you are likely to misclassify several entry / exit points and take several short term minor losses, but the few times when the trend following strategy works, you are likely to enter long term investments that generate a lot of profit. \n\n\u003e![pasted-image-20240506193412](/images/pasted-image-20240506193412.webp)\n\u003e- [11. Moving averages - Zerodha Varsity](https://www.youtube.com/watch?v=810jmf7drFw)\n\nThe moving average shines best when trading on less-volatile stocks like the NIFTY and other index / large-cap stocks on large time-frames (weekly / monthly). \n### Exponential Moving Average (`Ema`)\nThe exponential moving average is similar to a moving average, but works on the idea that recent prices should be given more weight than older prices in the average. For example, a large volume move on a single day (due to either institutional trading or world events) would be better captured in an EMA than a MA. Hence it is more suited to acting as a crossover in the crossover strategy for more volatile stocks or for shorter time frame swing trading. \n$$\n\\begin{aligned}\nEMA_t = \\Big [ P_t \\times \\Big(\\frac{s}{d+1}\\Big) \\Big] + EMA_{t-1} \\times \\Big[ 1 - \\Big(\\frac{s}{d+1}\\Big) \\Big]\\\\\n\\end{aligned}\n\\begin{aligned}\n\\quad \\quad \n\u0026EMA_t \\text{ is the EMA at time period } t \\\\\n\u0026P_t \\text{ is the representative price of the stock at time period } t \\\\\n\u0026s \\text{ is the smoothing factor } \\\\\n\u0026d \\text{ is the number of time periods } \\\\\n\\end{aligned}\n$$\nEssentially, the quantity $\\frac{s}{d+1}$ is the constant factor (or multiplier) $\\alpha$ used for smoothing and calculating the EMA. If we look at how the terms are expanded, \n$$\n\\begin{aligned}\nEMA_0 = \\alpha P_0 \\\\\nEMA_1 = \\alpha P_1 + (\\alpha P_0) \\times (1 - \\alpha) = \\alpha \\times (P_1 + P_0) - \\alpha^2 P_0 \\\\\nEMA_2 = \\alpha P_2 + (\\alpha \\times (P_1 + P_0) - \\alpha^2 P_0) \\times (1 - \\alpha) = \\alpha \\times (P_2 + P_1 + P_0) - \\alpha^2(2P_0 + P_1) + \\alpha^3 P_0 \\\\\n\\end{aligned}\n$$\nFor a 20-day EMA with $s = 2$, the factor $\\alpha$ would be around $0.0952$.
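\n\nHere is a minimal sketch of that recurrence (my own code; seeding conventions vary, so I seed at zero, which makes $EMA_0 = \\alpha P_0$ exactly as in the expansion above):\n\n```cpp\n#include \u003cvector\u003e\n\n// d-period EMA of a price series with smoothing factor s (s = 2 is the common choice).\nstd::vector\u003cdouble\u003e ema(const std::vector\u003cdouble\u003e \u0026prices, int d, double s = 2.0) {\n std::vector\u003cdouble\u003e out;\n double alpha = s / (d + 1); // the multiplier discussed above\n double prev = 0; // seed so that EMA_0 = alpha * P_0\n for (double p : prices) {\n prev = alpha * p + (1 - alpha) * prev; // EMA_t = alpha*P_t + (1 - alpha)*EMA_{t-1}\n out.push_back(prev);\n }\n return out;\n}\n```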
\n\nThe below diagram represents the difference between using the MA and the EMA to study stock trend movements. \n\n\u003eIn the figure below, the number of periods used in each average is 15, but the EMA responds more quickly to the changing prices than the SMA. The EMA has a higher value when the price is rising than the SMA and it falls faster than the SMA when the price is declining. This responsiveness to price changes is the main reason why some traders prefer to use the EMA over the SMA.\n\u003e\n\u003e![pasted-image-20240507020420](/images/pasted-image-20240507020420.webp)\n\u003e\n\u003e- [Moving Average (MA): Purpose, Uses, Formula, and Examples](https://www.investopedia.com/terms/m/movingaverage.asp)\n### Conclusion\nIn short, the EMA is a more responsive indicator than the MA. The whole idea with a MA was to smooth out the volatility in daily price changes over larger time-frames. However, when working with more volatile stocks / shorter time periods it might make sense to consult the EMA instead. It is also especially useful to detect trend reversals using the crossover strategy faster. The EMA crossover is one of the best-known strategies in use, similar to the MA crossover. \n## Moving Average Convergence Divergence (`Macd`)\nThe Moving Average Convergence Divergence (MACD) is a technical analysis tool utilized by traders to gauge the relationship between two exponential moving averages (EMAs) of an asset's price. It helps in identifying potential trend reversals, momentum shifts, and confirming the strength of a trend. The MACD is calculated by subtracting the 26-day EMA from the 12-day EMA: \n $$\\text{MACD}=\\text{12-day EMA}−\\text{26-day EMA}$$\nThe MACD line is then smoothed by calculating a 9-day EMA, known as the signal line:\n $$\n \\text{Signal Line}=\\text{9-day EMA of MACD}\n $$\nThis signal line assists in identifying potential buy or sell signals and in confirming the strength of a trend. Also, since MACD uses just EMAs, it is also a lagging indicator.
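\n\nIn code, this is just a couple of lines on top of the `ema` helper sketched in the EMA section above (again my own illustration, not a canonical implementation):\n\n```cpp\n// MACD and signal series for a given price series; assumes ema() from the EMA sketch above.\nvoid compute_macd(const std::vector\u003cdouble\u003e \u0026prices, std::vector\u003cdouble\u003e \u0026macd_line, std::vector\u003cdouble\u003e \u0026signal_line) {\n auto ema12 = ema(prices, 12), ema26 = ema(prices, 26);\n for (int i = 0; i \u003c (int)prices.size(); ++i)\n macd_line.push_back(ema12[i] - ema26[i]); // MACD = 12-day EMA - 26-day EMA\n signal_line = ema(macd_line, 9); // signal = 9-day EMA of the MACD line\n}\n```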
\n\n\u003e The MACD lines, however, do not have concrete overbought/oversold levels like the RSI and other oscillator studies. Rather, they function on a relative basis. An investor or trader should focus on the level and direction of the MACD/signal lines compared with preceding price movements in the security at hand, as shown below. \n\u003e \n\u003e ![pasted-image-20240507023344](/images/pasted-image-20240507023344.webp)\n\u003e \n\u003e MACD measures the relationship between two EMAs, while the RSI measures price change to recent price highs and lows. Both measure momentum in a market, but because they measure different factors, they sometimes give contrary results. The RSI may show a reading above 70 (overbought) for a sustained period, indicating a market is overextended to the buy side of recent prices. In contrast, the MACD indicates that the market is still increasing in buying momentum. Either indicator may signal an upcoming trend change by showing divergence from price (price continues higher while the indicator turns lower, or vice versa).\n\u003e - [What Is MACD? - Investopedia](https://www.investopedia.com/terms/m/macd.asp)\n### Use Case #1: Crossover Strategy\nThe most common use of this indicator is to identify buy / sell signals using the crossover of the MACD and Signal lines. Let's first try to understand what the MACD terms signify. If the MACD value is positive, it implies that the short-term average (12-day EMA) is above the long-term average (26-day EMA). As discussed in the explanations of EMA and MA above, this implies that in the short term, there is an upward momentum in the price and signals a likely uptrend. Similarly, negative MACD values signal likely downward momentum. Now, we can use the MACD line as a sort of indicator of short term price trends. Combining this with the signal line, if we see the 9-day EMA of the MACD cross over the MACD line while the MACD is also below the 0-line, it implies that the trend until now was a downtrend (since MACD was below 0), and the crossover with the signal line (which is the EMA of the MACD) implies that there was a sudden reversal and upward momentum. Hence it is a very useful indicator to identify swing trading opportunities in the short-term. This is also a very popular strategy.\n\n![pasted-image-20240507022648](/images/pasted-image-20240507022648.webp)\n\nThe green line on top of the price movement chart is the 50-day MA, the yellow line is the 200-day MA and at the bottom, the blue line is the MACD indicator and the orange line is the signal line. The histogram represents the difference between the signal and the MACD, serving as a measure of strength of the trend. It is also centered on the 0 line. \n### Use Case #2: Confirming Trend Strength \nYou can also use the difference / gap between the MACD line and the signal line as an identification of trend strength. A large gap implies a fast upward momentum push, whereas a smaller gap would imply a weakening trend that is losing steam. However, you should note that often when this happens, it implies that the underlying financial security is probably overbought or oversold and might soon pull back to normal levels. You can combine this analysis with the RSI to verify such conditions. \n### Use Case #3: Macd Divergence\n\u003eWhen MACD forms highs or lows that exceed the corresponding highs and lows on the price, it is called a divergence. A bullish divergence appears when MACD forms two rising lows that correspond with two falling lows on the price. This is a valid bullish signal when the long-term trend is still positive. Some traders will look for bullish divergences even when the long-term trend is negative because they can signal a change in the trend, although this technique is less reliable.\n\u003e\n\u003e![pasted-image-20240507024424](/images/pasted-image-20240507024424.webp)\n\u003e\n\u003eWhen MACD forms a series of two falling highs that correspond with two rising highs on the price, a bearish divergence has been formed. A bearish divergence that appears during a long-term bearish trend is considered confirmation that the trend is likely to continue. Some traders will watch for bearish divergences during long-term bullish trends because they can signal weakness in the trend. However, it is not as reliable as a bearish divergence during a bearish trend.\n\u003e- [What Is MACD? - Investopedia](https://www.investopedia.com/terms/m/macd.asp)\n### Conclusion\nMACD is a very versatile indicator that attempts to compress all the info you could get from MAs / EMAs into a single, easy to use, user-friendly indicator. This is also one of the reasons why it's very popular among those just getting into trading. Since it relies completely on EMAs, it has the same advantages / disadvantages as the MA strategies. It works very well in trending markets and will definitely help avoid bad trades that rely on trends, however, it will do poorly in ranging sideways markets. MACD is best used with daily periods, where the traditional settings of 26/12/9 days is the default.\n## Relative Strength Index (`Rsi`)\nThe RSI is a *momentum* indicator that was developed by J. Welles Wilder Jr. and introduced in his 1978 book, *New Concepts in Technical Trading Systems*. The RSI was developed as an indicator that compares a stock's strength on days when prices go up against its strength on days when prices go down. It is often used in strategies to check when a stock is *overbought* or *oversold*. 
\n\n\u003eThe RSI uses a two-part calculation that starts with the following formula:\n\u003e $$\n\\begin{aligned}\nRSI_{\\text{step one}} = 100 - \\Big[ \\frac{100}{1 + \\frac{\\text{Average Gain}}{\\text{Average Loss}}}\\Big]\n\\end{aligned}\n\\quad\n\\begin{aligned}\n\\text{The average gain or loss used in this calculation is the average percentage} \\\\ \\text{gain or loss during a look-back period.} \\\\ \\text{The formula uses a positive value for the average loss.} \\\\ \\text{Periods with price losses are counted as zero in the calculations of average gain.} \\\\ \\text{Periods with price increases are counted as zero in the calculations of average loss.}\n \\end{aligned}\n $$ \n\u003e The standard number of periods used to calculate the initial RSI value is 14. For example, imagine the market closed higher seven out of the past 14 days with an initial average gain of 1%. The remaining seven days all closed lower with an initial average loss of −0.8%. The first calculation for the RSI would look like the following expanded calculation:\n\u003e $$\n RSI_{\\text{step one}} = 100 - \\Big[ \\frac{100}{1 + \\frac{\\frac{1\\%}{14}}{\\frac{0.8\\%}{14}}} \\Big] = 55.55\n $$\n\u003e Once there are 14 periods of data available, the second calculation can be done. Its purpose is to smooth the results so that the RSI only nears 100 or zero in a strongly trending market. \n\u003e $$\n RSI_{\\text{step two}} = 100 - \\Big[ \\frac{100}{1 + \\frac{(\\text{Previous Average Gain} \\times 13) + \\text{Current Gain}}{(\\text{Previous Average Loss} \\times 13) + \\text{Current Loss}}}\\Big]\n $$\n\u003e- [Relative Strength Index (RSI) Indicator Explained With Formula - Investopedia](https://www.investopedia.com/terms/r/rsi.asp)\n\nLet's start by clearing a common misconception. The RSI crossing a value of 70 does imply that it is overbought, but that does not imply that it cannot stay overbought for long periods of time. The below image is a clear example of this:\n\n![pasted-image-20240507055535](/images/pasted-image-20240507055535.webp)\n\nSimilarly, it can also remain oversold for long periods of time. The RSI is a *momentum indicator*. So if it is overbought, that implies that the momentum is to the upside. It should not be used as a *reversal indicator*. \n### Use Case #1: Trend Strength\nDuring uptrends, the RSI should frequently stay near the upper band of 70. Similarly, during a downtrend, it is likely to frequently hover around 30. For example, if the RSI can’t reach 70 on a number of consecutive price swings during an uptrend, but then drops below 30, the trend has weakened and could be reversing lower.\n### Use Case #2: Rsi Divergence\n\u003eAn RSI divergence occurs when price moves in the opposite direction of the RSI. In other words, a chart might display a change in momentum before a corresponding change in price. A bullish divergence occurs when the RSI displays an oversold reading followed by a higher low that appears with lower lows in the price. This may indicate rising bullish momentum, and a break above oversold territory could be used to trigger a new long position. A bearish divergence occurs when the RSI creates an overbought reading followed by a lower high that appears with higher highs on the price.\n\u003eAs you can see in the following chart, a bullish divergence was identified when the RSI formed higher lows as the price formed lower lows. This was a valid signal, but divergences can be rare when a stock is in a stable long-term trend. 
Using flexible oversold or overbought readings will help identify more potential signals.\n\u003e\n\u003e![pasted-image-20240507060616](/images/pasted-image-20240507060616.webp)\n\u003e\n\u003e- [Relative Strength Index (RSI) Indicator Explained With Formula - Investopedia](https://www.investopedia.com/terms/r/rsi.asp)\n### Conclusion\nSimilar to MACD, RSI is also a momentum indicator that works well in trending markets. There is a common misconception among new traders that RSI crossing 70 / 30 mark is a reversal indicator. This is not true, the RSI simply measures the relative strength in gains to the relative strength of losses. In an uptrend for example, you would *expect* the RSI to constantly be very high as the relative gains must be higher than the relative losses in an uptrend. Hence we should instead use it as a confirmation for determining trends. Similar to MACD and EMA, the RSI will also do poorly in ranging markets as we would expect the relative strength of gains and losses to be roughly equal with a few random spikes here and there. \n## Lorentzian Classifier \nA Lorentzian Distance Classifier (LDC) is a Machine Learning classification algorithm capable of categorizing historical data from a multi-dimensional feature space. This indicator demonstrates how Lorentzian Classification can also be used to predict the direction of future price movements when used as the distance metric for a novel implementation of an Approximate Nearest Neighbors (ANN) algorithm.\n\nIn physics, Lorentzian space is well-known for its role in Einstein's General Relativity, describing the curvature of space-time. Interestingly, this abstract concept has practical applications in trading. Recent studies suggest Lorentzian space's suitability for analyzing time-series data. Empirical evidence shows that Lorentzian distance handles outliers and noise better than Euclidean distance and outperforms other distance metrics like Manhattan distance, Bhattacharyya similarity, and Cosine similarity. It consistently yields higher mean accuracy across various time series datasets compared to other metrics, except for Dynamic Time Warping approaches, which are too computationally demanding for current PineScript capabilities.\n \nEuclidean distance is commonly used as the default distance metric for NN-based search algorithms, but it may not always be the best choice when dealing with financial market data. This is because financial market data can be significantly impacted by proximity to major world events such as FOMC Meetings and Black Swan events. This event-based distortion of market data can be framed as similar to the gravitational warping caused by a massive object on the space-time continuum. For financial markets, the analogous continuum that experiences warping can be referred to as \"price-time\".\n\nBelow is a side-by-side comparison of how neighborhoods of similar historical points appear in three-dimensional Euclidean Space and Lorentzian Space:\n\n![pasted-image-20240507100741](/images/pasted-image-20240507100741.webp)\n\nThis figure demonstrates how Lorentzian space can better accommodate the warping of price-time since the Lorentzian distance function compresses the Euclidean neighborhood in such a way that the new neighborhood distribution in Lorentzian space tends to cluster around each of the major feature axes in addition to the origin itself. 
This means that, even though some nearest neighbors will be the same regardless of the distance metric used, Lorentzian space will also allow for the consideration of historical points that would otherwise never be considered with a Euclidean distance metric. \n \nIntuitively, the advantage inherent in the Lorentzian distance metric makes sense. For example, it is logical that the price action that occurs in the hours after Chairman Powell finishes delivering a speech would resemble at least some of the previous times when he finished delivering a speech. This may be true regardless of other factors, such as whether or not the market was overbought or oversold at the time or if the macro conditions were more bullish or bearish overall. These historical reference points are extremely valuable for predictive models, yet the Euclidean distance metric would miss these neighbors entirely, often in favor of irrelevant data points from the day before the event. By using Lorentzian distance as a metric, the ML model is instead able to consider the warping of price-time caused by the event and, ultimately, transcend the temporal bias imposed on it by the time series.\n\nSource: [https://www.ig.com/en/trading-strategies/16-candlestick-patterns-every-trader-should-know-180615](https://www.ig.com/en/trading-strategies/16-candlestick-patterns-every-trader-should-know-180615)\n# References\n1. [Basics of Stock Trading - Honestly by Tanmay Bhat](https://www.youtube.com/playlist?list=PLhKwz7hYMTDVUXV-hkJ2wnwnQECzn-egm)\n2. [https://www.ig.com/en/trading-strategies/16-candlestick-patterns-every-trader-should-know-180615](https://www.ig.com/en/trading-strategies/16-candlestick-patterns-every-trader-should-know-180615)\n3. [Relative Strength Index (RSI) Indicator Explained With Formula - Investopedia](https://www.investopedia.com/terms/r/rsi.asp)\n4. [What Is MACD? - Investopedia](https://www.investopedia.com/terms/m/macd.asp)\n5. [Moving Average (MA): Purpose, Uses, Formula, and Examples - Investopedia](https://www.investopedia.com/terms/m/movingaverage.asp)\n6. [11. Moving averages - Zerodha Varsity](https://www.youtube.com/watch?v=810jmf7drFw)", - "lastmodified": "2026-01-11T05:06:30.283896764+05:30", + "lastmodified": "2026-01-11T10:59:42.501132564+05:30", "tags": [ "domain-finance-quantitative" ] @@ -429,7 +429,7 @@ "/blog/the-black-scholes-merton-equation": { "title": "The Black-Scholes-Merton Equation", "content": "This single equation spawned multi-trillion dollar industries and transformed everyone's approach to risk.\n$$\n\\frac{\\partial V}{\\partial t} + rS\\frac{\\partial V}{\\partial S} + \\frac{1}{2}\\sigma^2S^2\\frac{\\partial^2V}{\\partial S^2}-rV = 0\n$$\nBut to understand how we arrived here, we need to go back and understand what options are, and understand the evolution of this equation over time.\n# Phase 1 - Louis Bachelier - Théorie De La Spéculation\nLouis Bachelier (born in 1870) stands as a pioneer in the application of mathematics to financial markets, particularly in the realm of option pricing. Both of his parents died when he was 18, and he had to take over his father's wine business. He sold the business a few years later and moved to Paris to study physics, but since he needed a job to support himself and his family financially, he took up a job at the Paris Stock Exchange (the Bourse). This experience, exposed him to the chaotic world of trading. 
In particular, his interest was drawn to a specific type of financial instrument that was being traded, contracts known as **options**. (Covered in [Derivatives - Options](/blog/derivatives-options))\n\nEven though options had been around for hundreds of years, no one had found a good way to price them. Traders would solely rely on bargaining and feel to come to an agreement about what the price of an option should be. Pricing an 'option' to buy an asset at some fixed strike price in the future was difficult, primarily due to the inherent randomness in stock price movements. Bachelier, who was already interested in probability, thought that there had to be a mathematical solution to this problem, and proposed this as his PhD topic to his advisor Henri Poincaré. Although finance wasn't really something mathematicians looked into back then, Poincaré agreed. It was this doctoral thesis that would later lay the foundation for applying mathematical pricing models to options trading. \n\nAs mentioned previously, the difficulty in pricing options is primarily due to it being pretty much impossible for any individual to account for a multitude of unpredictable factors responsible for influencing the price of a stock. It's basically determined by a tug of war between buyers and sellers, and the numbers on either side can be influenced by nearly anything from weather, politics, competitors, etc. Bachelier's key insight here was to model stock prices as a random walk, with each movement up or down equally likely. Randomness is a hallmark of an **efficient market** ([THE EFFICIENT MARKET HYPOTHESIS](/blog/the-efficient-market-hypothesis)). The hypothesis essentially states that the more people try to make money by predicting stock prices and trading, the less predictable those prices are. The argument is essentially that if we were able to predict that some stock $A$ would go up tomorrow and we bought it, our actions would make the stock price go up today. The very act of predicting essentially influences the stock price. That said, there are plenty of instances throughout history of mathematicians, physicists, etc. finding 'edges' in the stock market ([What is the Stock Market?](/blog/what-is-the-stock-market)) and using them to make consistent profits over long periods of time. The most famous example is Jim Simons' Medallion fund, averaging a $71.8\\%$ annual return (before fees) for almost a decade. \n\nAn important property of random walks is that over time, the expected outcomes of a random walk take up the shape of a normal distribution. \n\n![pasted-image-20240311040835](/images/pasted-image-20240311040835.webp)\n![pasted-image-20240311040740](/images/pasted-image-20240311040740.webp)\n\nEssentially, over a short period of time, there's not much influence on the stock price by random-walk steps to allow it to reach extreme deviations from the stock's current price. But over a longer period of time, the probability of it reaching more extreme prices increases, while the majority of the probability mass still stays close to the stock's current price. This may not be very consistent with our observation of the general trend of the market to increase over a long period of time, but back then, there wasn't a lot of data available and this is how Bachelier modeled it. So after a short time, the stock price could only move up or down a little, but after more time, a wider range of prices is possible. 
He modeled the expected future price of a stock by a normal distribution, centered on the current price, which spreads out over time. \n\n\u003e**Side note**: He realized that he had rediscovered the exact equation which describes how heat radiates from regions of high temperature to regions of low temperature, originally discovered by Joseph Fourier in 1822. Thus, he called his discovery the 'radiation of probabilities'. Bachelier's random walk theory would later find application in solving the longstanding physics mystery of Brownian motion, the erratic movement of microscopic particles observed by botanist Robert Brown. Remarkably, Albert Einstein, in his explanation of Brownian motion in 1905, unknowingly built upon the same random walk principles established by Bachelier years earlier.\n\nBachelier's crowning achievement was that he had finally figured out a mathematical way to price an option by applying his random walk theory. \n\n![pasted-image-20240311042335](/images/pasted-image-20240311042335.webp)\n\n- The probability that the option buyer makes profit is the probability that the **stock price increases by more than the price paid for the option**. We call this the **stock price at exercise**. Otherwise the buyer would just let the option expire. This is the green shaded area.\n\n![pasted-image-20240311042301](/images/pasted-image-20240311042301.webp)\n\n- The probability that the option seller makes profit is the probability that the **stock price stays low enough that the buyer doesn't earn more than they paid for it**. Note that this is sufficient, because even if the stock price has increased from the strike price, but not by enough to let the buyer profitably exercise the option, the premium paid by the buyer is enough to give the seller more profit than what would be obtained if he didn't sell the option. This is the red shaded area.\n\nNote that you can influence the region of probabilities simply by changing the premium (price) of the option. Increase the premium, and the stock price required for the option buyer to exercise the option increases, pushing the probability region where he makes a profit further toward the edges. You can calculate the expected return of buying / selling an option simply by multiplying the profit / loss each individual stands to gain / lose by the probability of each outcome. Note that each probability here is just a function of the price of the option. Bachelier argued that a fair price for an option is what makes the expected return for buyers and sellers equal. \n\n![pasted-image-20240311042939](/images/pasted-image-20240311042939.webp)\n\n\u003eWhen Bachelier finished his thesis, he had beaten Einstein to inventing the random walk and solved the problem that had eluded options traders for hundreds of years. But no one noticed. The physicists were uninterested and traders weren't ready. The key thing missing was a way to make a ton of money.\n\n## The Bachelier Model\nWhat Bachelier essentially gave us was a closed-form equation for pricing a call / put option under the Bachelier model. The Bachelier model basically represents a forward price contract (process) as a stochastic differential equation. Here, $\\sigma$ is **volatility**. \n\n$$dF_t = \\sigma dW_t, \\ t \\in [0, T]$$\nYou can think of $[0, T]$ as sort of representing a single time-step. Although this is a continuous process, we can think of it as a discrete process where we're using very small values for the time-step $(T = dt)$. 
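\n\nBefore solving it analytically, here's a minimal sketch of what simulating this process looks like once discretized (all parameter values are illustrative, not taken from the text):\n\n```cpp\n#include \u003ccmath\u003e\n#include \u003ccstdio\u003e\n#include \u003crandom\u003e\n\n// Euler discretization of dF_t = sigma * dW_t: each step adds an independent\n// N(0, dt) increment scaled by sigma.\nint main() {\n\tconst double F0 = 100.0, sigma = 5.0, T = 1.0;\n\tconst int steps = 252; // e.g. daily steps over one year\n\tconst double dt = T / steps;\n\tstd::mt19937_64 rng(42);\n\tstd::normal_distribution\u003cdouble\u003e dW(0.0, std::sqrt(dt)); // W_{t+dt} - W_t ~ N(0, dt)\n\tdouble F = F0;\n\tfor (int i = 0; i \u003c steps; ++i)\n\t\tF += sigma * dW(rng); // F_{t+dt} = F_t + sigma * (W_{t+dt} - W_t)\n\tstd::printf(\"One sample of F_T: %f\\n\", F);\n\t// Averaging the payoff max(F_T - K, 0) over many such samples gives a Monte\n\t// Carlo estimate of the call price derived in closed form below.\n}\n```\n\n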
Solving for the forward price process, we get:\n$$ \n\\begin{align}\n\u0026 \\int_0^TdF_t = \\int_0^T\\sigma dW_t \\\\ \\\\\n\u0026 F_T - F_0 = \\sigma(W_T-W_0) \\quad | \\ W_0 \\text{ is 0 by the definition of Brownian motion} \\\\ \\\\\n\u0026 F_T = F_0 + \\sigma W_T\n\\end{align}\n$$\nAnd that's it. An elegant way to model the future price and derive the closed form for pricing options. More generally, we can write the above result as $F_{t+1} = F_t + \\sigma (W_{t+1} - W_t)$. We can even prove that $F_t$ is a **martingale**. That is:\n$$\n\\mathbb{E}[F_{t+1}|F_t] = F_t\n$$\nIt's essentially saying that our best guess for the next step in the process is just the latest point computed in the process. Proof (the increment $W_{t+1} - W_t$ is independent of the past and has mean 0): \n$$\n\\mathbb{E}[F_{t+1}|F_t] = \\mathbb{E}[F_t + \\sigma (W_{t+1} - W_t)|F_t] = F_t + \\sigma\\mathbb{E}[W_{t+1} - W_t] = F_t\n$$\n### Pricing a Call Option\nWe are going to be pricing European style options, that is, we will be considering the payoff at **maturity**, at time $T$. We don't know what the future holds for the derivative, but we know what the value of that derivative **could be** at some time $T$ in the future. Essentially, based on the price of the underlying asset that the derivative is tracking at expiration, we know that the payoff is going to take the shape of a hockey-stick figure as shown previously. A call option at time $T$, will give us:\n$$\n\\begin{align*}\n\u0026 K \\text{ - Strike Price} \\\\\n\u0026 T \\text{ - Time to Maturity} \\\\\n\u0026 C_T = max((F_T-K), 0)=(F_T - K)^+\n\\end{align*}\n$$\nWe use the $(\\cdots)^+$ notation just to simplify the expression. At time $T$, this is a deterministic expression for how much payoff we make. But the issue is we do not know what $F_T$ will be. So the best thing to do today would be to compute the expectation of that payoff and hope to derive a closed form equation to compute the call price. The call price today is given by the expectation of the future:\n$$\n\\begin{align*}\n\u0026 C_0 = \\mathbb{E}[(F_T - K)^+] \\\\\n\u0026 = \\mathbb{E}[(F_0 + \\sigma W_T - K)^+]\n\\end{align*}\n$$\nNow, $W_T$ is still an increment in Brownian motion, that is, it is **distributed normally** with a mean of 0 and a variance of $dt$. Note $dt = T$. And since variance is equivalent to the square of the standard deviation, we can write the equation as:\n$$\n= \\mathbb{E}[(F_0 - \\sigma \\sqrt{(T - 0)}Z - K)^+]\n$$\nWhere $Z \\sim N(0, 1)$, $Z$ is a **standard normal random variable**. Essentially, we use the fact that we have independent stationary increments with mean 0 and variance $dt$ to substitute for $W_T$ (and since $Z$ and $-Z$ have the same distribution, the sign in front of $Z$ doesn't matter). Let's rearrange some terms to get:\n$$\n= \\mathbb{E}[(F_0 - K - \\sigma\\sqrt{T}Z)^+]\n$$\nWe want better algebraic tools to substitute for the $max$ function. We will use indicators to make this equation easier to solve. Recall that:\n$$\n\\mathbb{1}(x) = \\begin{cases}\n1 \u0026 \\text{condition of } x\\\\\n0 \u0026 \\sim \\text{condition of } x\\\\\n\\end{cases}\n$$\nThe $max$ function in this context essentially just implies that when exercising an option, if there is positive payoff, take it, otherwise don't take it (let it expire). And the indicator function lets us express the same thing in the equation. 
So we can substitute the indicator function in for the $max$ function by defining our indicator $\\mathbb{1}$ as follows:\n$$\n\\mathbb{1}(Z) = \\begin{cases}\n1 \u0026 Z \\leq \\frac{F_0 - K}{\\sigma \\sqrt T} \\\\\n0 \u0026 Z \\gt \\frac{F_0 - K}{\\sigma \\sqrt T} \\\\\n\\end{cases}\n$$\nSubstituting this in:\n$$\n= \\mathbb{E}[((F_0 - K - \\sigma\\sqrt TZ))\\mathbb{1}_{Z \\leq\\frac{F_0-K}{\\sigma\\sqrt T}}]\n$$\nDistributing the indicator function yields:\n$$\n= \\mathbb{E}[(F_0 - K)\\mathbb{1}_{Z \\leq\\frac{F_0-K}{\\sigma\\sqrt T}} - \\sigma \\sqrt TZ\\mathbb{1}_{Z \\leq\\frac{F_0-K}{\\sigma\\sqrt T}}]\n$$\nNow, since we know that $Z$ is distributed standard normally, the expectation that $Z$ is less than some quantity can be found by using the cumulative distribution function for the normal distribution. Essentially, the first term indicator function can be replaced by just substituting it with the normal cumulative distribution, $\\Phi$, up to the indicator function value:\n$$\n= (F_0 - K) \\Phi(\\frac{F_0 - K}{\\sigma \\sqrt T}) - \\sigma \\sqrt T \\mathbb{E}[Z\\mathbb{1}_{Z \\leq \\frac{F_0 - K}{\\sigma \\sqrt T}}]\n$$\nUsing properties of normal distributions, the derivative of the PDF satisfies $\\phi'(x) = -x\\phi(x)$, where $\\phi$ is the probability density function of the normal distribution. \n$$\n\\phi(x) = \\frac{1}{\\sqrt{2\\pi}}e^{\\frac{-x^2}{2}}\n$$\nWe can use this property to solve the second term since:\n$$\n\\mathbb{E}[Z\\mathbb{1}_{Z \\leq y}] = \\int_{-\\infty}^y x\\phi(x)dx = -\\phi(y)\n$$\nApplying this to the original equation by letting $y = \\frac{F_0 - K}{\\sigma \\sqrt T}$, we get:\n$$\nC_0 = (F_0 - K)\\Phi(\\frac{F_0 - K}{\\sigma \\sqrt T}) + \\sigma\\sqrt T\\phi(\\frac{F_0 - K}{\\sigma \\sqrt T})\n$$\nA closed form equation for pricing a call option given the current asset price $F_0$, the strike price $K$, the volatility $\\sigma$ and the time to maturity $T$ of the option!\n\nWe can similarly use the Bachelier model to price all other kinds of future contracts, including put options, call / put futures, etc. 
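\n\nTo make the result concrete, here's that closed form as a minimal sketch in code (the parameter values in `main` are illustrative):\n\n```cpp\n#include \u003ccmath\u003e\n#include \u003ccstdio\u003e\n\nconst double PI = 3.141592653589793;\n\n// Standard normal PDF and CDF.\ndouble phi(double x) { return std::exp(-0.5 * x * x) / std::sqrt(2.0 * PI); }\ndouble Phi(double x) { return 0.5 * std::erfc(-x / std::sqrt(2.0)); }\n\n// Bachelier call price: C0 = (F0 - K) * Phi(d) + sigma * sqrt(T) * phi(d),\n// where d = (F0 - K) / (sigma * sqrt(T)).\ndouble bachelier_call(double F0, double K, double sigma, double T) {\n\tdouble d = (F0 - K) / (sigma * std::sqrt(T));\n\treturn (F0 - K) * Phi(d) + sigma * std::sqrt(T) * phi(d);\n}\n\nint main() {\n\t// At the money (F0 == K) this reduces to sigma * sqrt(T) / sqrt(2 * pi).\n\tstd::printf(\"C0 = %f\\n\", bachelier_call(100.0, 100.0, 5.0, 1.0)); // ~1.9947\n}\n```\n\n# Phase 1.5 - Brownian Motion $B_t$ (Wiener Process)\n\n\u003eSo Brown discovered that any particles, if they were small enough, exhibited this random movement, which came to be known as Brownian motion. But what caused it remained a mystery. 80 years later in 1905, Einstein figured out the answer. Over the previous couple hundred years, the idea that gases and liquids were made up of molecules became more and more popular. But not everyone was convinced that molecules were real in a physical sense. Just that the theory explained a lot of observations. The idea led Einstein to hypothesize that Brownian motion is caused by the trillions of molecules hitting the particle from every direction, every instant. Occasionally, more will hit from one side than the other, and the particle will momentarily jump. To derive the mathematics, Einstein supposed that as an observer we can't see or predict these collisions with any certainty. So at any time we have to assume that the particle is just as likely to move in one direction as in another. So just like stock prices, microscopic particles move like a ball falling down a Galton board: the expected location of a particle is described by a normal distribution, which broadens with time. It's why even in completely still water, microscopic particles spread out. This is diffusion. By solving the Brownian motion mystery, Einstein had found definitive evidence that atoms and molecules exist. 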
Of course, he had no idea that Bachelier had uncovered the random walk five years earlier. - [The Trillion Dollar Equation](https://www.youtube.com/watch?v=A5w-dEgIU1M\u0026t=148s)\n\nThe random walk that Bachelier came up with and the Brownian motion that Robert Brown discovered are pretty similar, and following the developments that mathematically formalized Brownian motion will help us understand more complex futures-contract pricing models. \n\n\u003e **Definition**: A *standard (one-dimensional)* *Brownian Motion* (also called *Wiener Process*) is a stochastic process $\\{W_t\\}_{t \\geq 0+}$ indexed by non-negative real numbers $t$ with the following properties:\n\u003e1. $W_0 = 0$.\n\u003e2. With probability 1, the function $t \\to W_t$ is continuous in $t$. \n\u003e3. The process $\\{W_t\\}_{t \\geq 0+}$ has *stationary, independent increments*.\n\u003e4. The increment $W_{t+s} - W_s$ has the $\\text{NORMAL}(0, t)$ distribution\n\u003e - [BROWNIAN MOTION - Galton UChicago](https://galton.uchicago.edu/~lalley/Courses/313/BrownianMotionCurrent.pdf)\n\nI'll explain these properties in more detail below. Let's call them the axioms that govern all Wiener processes / Brownian motion. \n## Axioms\n1. **Brownian Motion has independent increments.** \n\tSay we have time values $r$, $s$ and $t$. We have some Brownian motion associated with each of these time values. The time from $s \\to t$ is an increment. So is the time from $r \\to s$. We're essentially saying that the increment from $s \\to t$ is **totally** independent of other time periods, even the immediately preceding $r \\to s$ time period. In short, this axiom essentially says that whatever happens in any given time period is **totally random** and does not depend on what happens in any other time period. \n\t\n\t![pasted-image-20240312084016](/images/pasted-image-20240312084016.webp)\n2. **Brownian Motion has stationary increments.**\n\tIt's sort of related to the previous axiom. But what it essentially says is that the distribution of the increment between $s \\to t$ only depends on the time values $t$ and $s$ and nothing else. \n3. **Brownian Motion has Normal Distribution.** \n If we look at the distribution in any time-step, the data points will be normally distributed. That is:\n $$\n\tB_t - B_s \\sim N(\\mu(t - s), \\sigma^2(t-s))\n\t$$\n\tHere, the term $\\mu (t-s)$ is the **mean** of the normal distribution. This term is also often called **drift**. The $\\sigma^2(t-s)$ term is the **variance** of the normal distribution. $\\sigma$ is just the standard deviation.\n4. **Brownian Motion has continuous sample paths**.\n\tThis simply means that at any time value, the Brownian motion graph is continuous at all points. \n### Standard Brownian Motion\nStandard Brownian Motion is a specialized case of Brownian Motion. It is the case that Bachelier studied and used to model future stock prices in his PhD Thesis. Here, Brownian motion has a **standard normal distribution**. A standard normal distribution has mean $(\\mu) = 0$ and variance $\\sigma^2 = 1$. \n$$\nB_t - B_s \\sim N(0, t-s)\n$$\n## Random Walks\nA **symmetric** random walk is a mathematical model that describes a path consisting of a series of random steps, where each step has an equal probability of being taken in either direction. We will limit our discussion to **symmetric** random walks. Here symmetric just means that the probability of each step being chosen is equal. \n\nLet $X_n$ denote the position of the walker after $n$ steps. 
Then, a symmetric random walk can be defined recursively as: $$ X_n = X_{n-1} + Z_n$$\nHere, $Z_n$ are independent and identically distributed random variables taking values $+1$ or $-1$ with equal probability, i.e., $P(Z_n = 1) = P(Z_n = -1) = \\frac{1}{2}$.\n\n![pasted-image-20240312091108](/images/pasted-image-20240312091108.webp)\n- [Eight different random walks - Wikipedia](https://en.m.wikipedia.org/wiki/File:Random_Walk_example.svg)\n\nEffectively, when we consider the discrete case, we call it a random walk. But as we keep reducing our time-steps, that is, $\\Delta t \\to 0$, it's the same as Brownian motion. Say we want the walk to cover a time interval $t$ using $n$ steps. We need to scale the steps so that the variance after $n$ steps stays fixed at $t$, so let us write $Z_k = \\pm \\sqrt \\frac{t}{n}$ (the computation below shows why this scaling works). \n### Expectation\nThe expectation of $Z_k$, $\\mathbb{E}[Z_k] = 0 \\iff \\mathbb{E}[X_n] = 0$. The expectation, $\\mathbb{E}[Z_k^2] = \\frac{t}{n}$. Now, since the steps are independent (see [LINEARITY OF EXPECTATION](/blog/linearity-of-expectation)), $\\mathbb{E}[Z_i Z_j] = \\mathbb{E}[Z_i] \\cdot \\mathbb{E}[Z_j] = 0$ for $i \\neq j$. \n$$\n\\begin{align}\n\u0026 \\mathbb{E}[X_n^2] = \\mathbb{E}[(\\sum Z_k)^2] \\\\\n\u0026 = \\mathbb{E}[(Z_1 + Z_2 + \\cdots + Z_n)(Z_1 + Z_2 + \\cdots + Z_n)] \\\\\n\u0026 = \\mathbb{E}[Z_1^2 + Z_1Z_2 + \\cdots + Z_1Z_n + Z_2Z_1 + \\cdots + Z_2Z_n + \\cdots + Z_nZ_1 + Z_nZ_2+\\cdots+Z_n^2] \\ | \\text{Since } \\mathbb{E}[Z_iZ_j] = 0 \\text{ for } i \\neq j \\\\\n\u0026 = \\mathbb{E}[\\sum Z_k^2] \\\\\n\u0026 = \\mathbb{E}[Z_1^2] + \\mathbb{E}[Z_2^2] + \\cdots + \\mathbb{E}[Z_n^2] \\\\ \n\u0026 = \\frac{t}{n} + \\frac{t}{n} + \\cdots + \\frac{t}{n} \\\\ \n\u0026 \\implies \\mathbb{E}[X_n^2] = t\n\\end{align}\n$$\nThe important property here is that this expectation is completely independent of $n$. No matter how many time steps we take, the expectation is just $t$. To go from the discrete case to the continuous case, the size of the time-steps goes to 0 as $n \\to \\infty$. Because $\\mathbb{E}[X_n] = 0$ and $\\mathbb{E}[X_n^2] = t$ (both are independent of $n$), we know that the exact same expectations apply to the Brownian Motion case as well. \nAs $n \\to \\infty$, our random walk becomes Brownian Motion. Therefore, we get:\n$$\n\\begin{align}\n\u0026 \\mathbb{E}[B_t] = 0 \\\\\n\u0026 \\mathbb{E}[B_t^2] = t\n\\end{align}\n$$\n- **Brownian Motion is finite**. At any fixed time, $B_t$ is finite (almost surely); in the context of share prices, a share price cannot be at $\\infty$ at any given time. \n- **Brownian Motion is a Markov process**. This follows from the definition. \n- **Brownian Motion is a Martingale**. This is sort of like saying, *the best guess for what happens next (in the future), is what's happening now*. More formally, $\\mathbb{E}[B_{t+1}|B_t] = B_t$. Kind of paradoxical. \n## Geometric Brownian Motion\nRemember that in Bachelier's Thesis, he modeled share prices using a standard normal distribution. But looking at share prices almost immediately indicates an issue with his model. We notice that over time, stocks tend to *drift* in one direction or the other, with total markets having an overall upwards drift. This is sort of like having the normal distribution have its mean drifted up from 0. This is the idea that we want to model using geometric Brownian motion. \n\nWe sort of expect share prices to grow in an exponential manner. We mathematically write this as $S_t = S_0e^{\\alpha t}$. Just the formula to denote standard exponential growth. 
But we know that share prices follow Brownian motion (random walk), and the price keeps constantly fluctuating. Effectively, we need to introduce a parameter in this equation to account for the Brownian motion. We can do this by modifying the model slightly to $S_t = S_0 e^{\\alpha t + \\beta B_t}$. The term $\\beta B_t$ accounts for the Brownian motion. $\\beta$ is a constant, which is very difficult to measure for a stock. The term is essentially supposed to be a measure of volatility. You can see that with higher $\\beta$, you have more contribution from the Brownian motion term and hence have more random volatility. \n\n![pasted-image-20240312094517](/images/pasted-image-20240312094517.webp)\n\nIf we play around with the formula a bit, we can do the following:\n$$\n\\begin{align}\n\u0026 \\frac{S_t}{S_0} = e^{\\alpha t + \\beta B_t} \\\\\n\u0026 \\ln(\\frac{S_t}{S_0}) = \\alpha t + \\beta B_t \\quad \\text{You can think of the } \\alpha t \\text{ term as contributing to the mean and } \\beta B_t \\text{ as a normal distribution with mean } 0\\\\\n\u0026 \\text{Since, } B_t \\sim N(0, t) \\\\\n\u0026 \\alpha t + \\beta B_t \\text{ is normally distributed, but we want to know its mean and variance} \\\\\n\u0026 \\alpha t + \\beta B_t \\sim N(\\alpha t, \\beta^2t) \\quad | \\text{ Since } Var(x) = a \\implies Var(kx) = k^2a\\\\\n\u0026 \\implies \\ln(\\frac{S_t}{S_0}) \\sim N(\\alpha t, \\beta^2 t)\n\\end{align}\n$$\nThis is what is known as log-normal. In other words, the ratio of the share price at time $t$ to the share price at the beginning is a log-normal distribution. The log part essentially just skews the curve. \n![pasted-image-20240312095342](/images/pasted-image-20240312095342.webp)\n- [Log-Normal Distribution: Definition, Uses, and How To Calculate - Investopedia](https://www.investopedia.com/terms/l/log-normal-distribution.asp)\n# Phase 2 - The Black-Scholes-Merton Equation\nEd Thorp wasn't satisfied with Bachelier's model for pricing options. For one thing, stock prices aren't entirely random. They can increase over time if the business is doing well or fall if it isn't. Bachelier's model ignores this. So Thorp came up with a more accurate model for pricing options, which took this drift into account. He used his model to gain an edge in the market and make a lot of money. Black and Scholes (and, independently, Merton) later came up with a way to price future contracts that would then revolutionize the trading industry forever. Their equation, like Thorp's, was an improved version of Bachelier's model. \n\n## Dynamic Hedging\n### A Toy Example\nLet's say Bharat sells Arya a call option on a stock, and let's say the stock price has gone up. So it's now in the money for Arya. For every additional rupee that the stock price goes up from the strike price, Bharat will now lose a rupee. **BUT**, he can eliminate this risk by owning 1 unit of stock. He would lose 1 rupee from the option, but gain that rupee back from the stock. And if the stock drops below the strike price, making the option go out of the money for Arya, he can just sell the stock at the strike price so he doesn't risk losing any money from that either. This is the idea behind dynamic hedging.\n\n### A Hedged Portfolio\nA hedged portfolio, at any one time, will offset an option $V$ with some amount ($\\Delta$) of stock $S$. Letting $\\Pi$ represent the portfolio, we have $\\Pi = V - \\Delta S$. It basically means you can sell something without taking the opposite side of the trade. 
You have a no-risk trade you could make profit from. However, this isn't very practical because the amount of stock to hold, $\\Delta$, changes based on current stock prices.\n\n## Deriving Black-Scholes-Merton\nWe're essentially constructing a portfolio of a single option $V$, and a certain number of shares $\\Delta$ of $S$ that we're going to sell against the option to dynamically hedge against it. So the value of our portfolio is essentially $\\Pi = V(S, t) - \\Delta S$. We're interested in tracking the time evolution of our portfolio. This is difficult because again, the future cash-flow of our option is not easy to price in. So we use the principles from Brownian motion to essentially model the underlying asset as a stochastic process that follows geometric Brownian motion. \n$$\n\\begin{align}\n\u0026 \\Pi = V(S, t) - \\Delta S \\\\\n\u0026 d\\Pi = dV - \\Delta dS \\quad | \\ \\text{Modelling } dS \\text{ using Geometric Brownian Motion,}\\\\\n\u0026 dS = \\mu Sdt + \\sigma S dW \\quad | \\ \\text{The first term accounts for drift. The second term accounts for volatility.}\\\\\n\u0026 \\text{You can then apply Ito's Lemma to get: } \\\\\n\u0026 dV = \\frac{\\partial V}{\\partial t}dt + \\frac{\\partial V}{\\partial S}dS + \\frac{1}{2}\\frac{\\partial^2 V}{\\partial S^2}dS^2 \\\\\n\u0026 dV = \\frac{\\partial V}{\\partial t}dt + \\frac{\\partial V}{\\partial S}dS + \\frac{1}{2} \\sigma^2S^2\\frac{\\partial^2V}{\\partial S^2}dt \\\\ \n\u0026 \\text{Substituting this back into the original portfolio formula,} \\\\\n\u0026 d\\Pi = (\\frac{\\partial V}{\\partial t} + \\frac{1}{2}\\sigma^2S^2\\frac{\\partial^2V}{\\partial S^2})dt + (\\frac{\\partial V}{\\partial S} - \\Delta) dS \\\\\n\u0026 \\text{If we take } \\Delta = \\frac{\\partial V}{\\partial S} \\text{ as the hedge,} \\\\\n\u0026 d\\Pi = (\\frac{\\partial V}{\\partial t} + \\frac{1}{2}\\sigma^2S^2\\frac{\\partial^2V}{\\partial S^2})dt \\\\ \n\\end{align}\n$$\nOur portfolio is now just a $dt$ term which means that the portfolio is now deterministic, and as such, doesn't carry any risk. A risk-free portfolio should yield the risk-free rate ($r$), which lets us write a different equation for $d\\Pi$. \n$$\n\\begin{align}\n\u0026 d\\Pi = r\\Pi dt = r(V - S\\frac{\\partial V}{\\partial S})dt \\\\\n\u0026 \\text{By equating this to our previous formula, and re-grouping terms, we get the famous equation:} \\\\\n\u0026 \\frac{\\partial V}{\\partial t} + rS\\frac{\\partial V}{\\partial S} + \\frac{1}{2}\\sigma^2S^2\\frac{\\partial^2V}{\\partial S^2}-rV = 0\n\\end{align}\n$$\nThe risk-free rate in the Black-Scholes formula represents the **theoretical return on an investment with no risk of default**. For example, government bonds. \n\nWe can now set $V$ equal to a call option or a put option and then solve the differential equation to get a closed-form equation for the price of a call-option given:\n$$\n\\begin{align}\n\u0026 C = \\text{call option price} \\\\\n\u0026 N = \\text{cumulative distribution function of the normal distribution} \\\\\n\u0026 S_t = \\text{spot price of an asset} \\\\\n\u0026 K = \\text{strike price} \\\\\n\u0026 r = \\text{risk-free rate} \\\\\n\u0026 t = \\text{time to maturity} \\\\\n\u0026 \\sigma = \\text{volatility of asset} \\\\\n\u0026 \\\\\n\u0026 C = N(d_1)S_t - N(d_2)Ke^{-rt} \\\\ \n\u0026 \\text{where } d_1 = \\frac{\\ln(\\frac{S_t}{K}) + (r + \\frac{\\sigma^2}{2})t}{\\sigma\\sqrt t} \\\\\n\u0026 \\text{and } d_2 = d_1 - \\sigma\\sqrt t\n\\end{align}\n$$
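\n\nFor reference, here's that closed form as a minimal sketch in code (the inputs in `main` are illustrative, not from the text):\n\n```cpp\n#include \u003ccmath\u003e\n#include \u003ccstdio\u003e\n\n// Standard normal CDF via the complementary error function.\ndouble N(double x) { return 0.5 * std::erfc(-x / std::sqrt(2.0)); }\n\n// Black-Scholes call price: C = N(d1) * S - N(d2) * K * exp(-r * t).\ndouble bs_call(double S, double K, double r, double sigma, double t) {\n\tdouble d1 = (std::log(S / K) + (r + 0.5 * sigma * sigma) * t) / (sigma * std::sqrt(t));\n\tdouble d2 = d1 - sigma * std::sqrt(t);\n\treturn N(d1) * S - N(d2) * K * std::exp(-r * t);\n}\n\nint main() {\n\t// S = K = 100, r = 5%, sigma = 20%, t = 1 year\n\tstd::printf(\"C = %f\\n\", bs_call(100.0, 100.0, 0.05, 0.2, 1.0)); // ~10.45\n}\n```\n\n# References\n1. 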
[The Trillion Dollar Equation](https://www.youtube.com/watch?v=A5w-dEgIU1M\u0026t=148s)\n2. [Bachelier Model Call Option Price Derivation](https://www.youtube.com/watch?v=J1sBj9K-BhE)\n3. [BROWNIAN MOTION - Galton UChicago](https://galton.uchicago.edu/~lalley/Courses/313/BrownianMotionCurrent.pdf)\n4. [Geometric Brownian Motion](https://www.youtube.com/watch?v=sIKD1tQryHg\u0026list=PLg5nrpKdkk2BZoZYAg2d6Ma3HZ5p9h35i\u0026index=5)\n5. [The Easiest Way to Derive the Black-Scholes Model](https://www.youtube.com/watch?v=NHvQ5CSSgw0\u0026t=70s)\n", - "lastmodified": "2026-01-11T05:06:32.761997775+05:30", + "lastmodified": "2026-01-11T10:59:44.924165605+05:30", "tags": [ "domain-finance-economics", "domain-finance-quantitative" ] }, "/blog/the-economics-of-foreign-exchange-forex": { "title": "The Economics of Foreign Exchange (Forex)", "content": "# History\n## Globalization\nToday, we live in an increasingly globalized world. The usual growth path for a business today always involves adding more dependencies on global supply-chains to ensure that their products and services are produced in the best country possible. To boost profits, you might want to cut manufacturing costs, and where else to get cheap components and parts from than China? Or maybe you want high quality parts from Germany to better satisfy your customers. Regardless, internationalizing supply chains is a very important, near-compulsory step for every business to increase their growth. But this wasn't the case 100 years ago. \"Imports\" didn't exist. You either built the entire product locally or you couldn't. Trade was minimal and owning foreign products was a luxury. Maybe it's one of the reasons why the words \"foreign goods\" still command so much respect today. \n## Comparative Advantage\nLet's explain this concept through a simple case study of 2 countries, Germany and China. Germany is well-known for being a supplier of high quality, efficient and luxury parts. China on the other hand is well-known for being able to mass-produce cheap goods. \n### Germany \nLet's say we asked Germany to produce iPhones. They could do it pretty easily because they already have all the infrastructure and technological expertise required to mass-produce them from their individual components without much difficulty. However, Germany has very expensive, limited labor and their government also has very stringent rules regarding industrial waste disposal. Add to this, it's not geographically close to the mass-suppliers of the components used in iPhones. According to educated estimates, iPhones built in Germany might cost as much as $\\approx 4,000\\$$. \n### China\nNow let's ask China to build luxury sports cars. They could probably do it. China mass produces a lot of EVs, but their manufacturing is a lot more targeted to more medium-quality mass manufacturing. China's labor force is much larger than Germany's and is also comparatively lower-skilled. Their government also has much looser environmental regulations than Germany. A lot of the major NAND and other similar component suppliers are also in or near China. This makes it extremely well suited to mass manufacturing iPhones, but not luxury cars. \n### Symbiosis\nGermany _could_ allocate some resources to building overpriced iPhones and China _could_ allocate resources to build *somewhat decent* luxury cars, but this requires a lot of research, exploration and budget to be allocated to these projects. But instead of doing that, they could just agree to focus on their strengths. 
Germany makes cars, and China makes phones. Now to ensure their citizens can enjoy both cheap phones and good quality cars, Germany over-produces and exports its cars to China, and China does the same with iPhones for Germany. Both countries get the best of both worlds. \n\nThis is because each country had a **comparative advantage** for the product they specialize in. However, note that this is an extremely simplified example. We have evolved past the barter system; trade is not done by putting up 100 iPhones for a car, but through currency. However, German manufacturers don't want Yuan and Chinese manufacturers don't want Euros. Hence we trade *currencies*. This creates a secondary market for Euros and Yuan. The foreign exchange market is created because global trade requires it. ([What is the Stock Market?](/blog/what-is-the-stock-market))\n## Bretton Woods Conference\nThe Bretton Woods conference, also known as the United Nations Monetary and Financial Conference, was a landmark international gathering held in July 1944 in Bretton Woods, USA. The conference was attended by 730 delegates from 44 representative countries. Countries wanted to establish multilateral economic cooperation to avoid the complications and painful situations faced during World War 2, the [Great Depression](/blog/great-depression) and the trade wars that spread the depression globally. \n\n\u003eThe seminal idea behind the Bretton Woods Conference was the notion of [open markets](https://en.wikipedia.org/wiki/Free_market \"Free market\"). In his closing remarks at the conference, its president, U.S. Treasury Secretary [Henry Morgenthau](https://en.wikipedia.org/wiki/Henry_Morgenthau_Jr. \"Henry Morgenthau Jr.\"), stated that the establishment of the IMF and the IBRD marked the end of [economic nationalism](https://en.wikipedia.org/wiki/Economic_nationalism \"Economic nationalism\"). This meant countries would maintain their national interest, but trade blocs and economic spheres of influence would no longer be their means. - [Bretton Woods Conference - Wikipedia](https://en.wikipedia.org/wiki/Bretton_Woods_Conference)\n\nTwo notable results of this conference include the establishment of the **International Monetary Fund (IMF)** and the **International Bank for Reconstruction and Development (IBRD)** which was later renamed to something you might recognize today, the **World Bank**. The other important result was the establishment of an adjustably pegged foreign exchange market rate system, where exchange rates were pegged to gold. Governments were only allowed to alter the rates by at most $10\\%$ without the involvement of the IMF and only to correct a \"fundamental disequilibrium.\"\n\nThe system established a gold price of $35 per ounce and participating countries pegged their currencies to the dollar. There was no \"market\" for currencies, just a somewhat adjustable fixed rate for currency conversion irrespective of supply and demand. This is simple, but completely disregards the fluctuating nature of supply and demand that economics tries to model \u0026 understand. \n\n\u003e For anybody that knows anything about economics, you will know that fighting the forces of supply and demand is like fighting the flow of water on a riverbank. Eventually, the continuous forces are going to win out. 
- [Economics Explained - YouTube](https://www.youtube.com/watch?v=ig_EO805rpA)\n\n# The End of Bretton Woods, Today's Forex Market\n\n\u003eThe first large crack in the system appeared in 1967, with a run on gold and an attack on the British Pound that led to a 14.3% devaluation in the currency despite the efforts by the government to pull it back in line with its predetermined level. Finally, President Nixon took the United States off the gold standard in 1971 (temporarily). And by late 1973, the system had collapsed and participating countries were allowed to float their currencies freely. This meant that currencies were traded on open markets just like shares or oil or beanie babies. The growth of global trade and the liberation of currencies to live their life at whatever value the market decided for them is why the market is so influential today - [Economics Explained - YouTube](https://www.youtube.com/watch?v=ig_EO805rpA)\n\n# Foreign Exchange Players\n- **General public:** Makes up a very tiny percentage of market transactions. Primarily from holiday travel currency conversions and retail shopping on sites which do not support automatic currency conversions. \n- **Institutional Investors:** Companies, Investment firms and Governments are big players in this market. Imagine company $A$ is based in country $a$ and wants to expand to country $b$, which uses $b'$ currency. Company $A$ now has to use its $a'$ currency revenue from country $a$ to buy up currency $b'$ through the exchange market and then use it as investment in country $b$ to expand there. However, remember that exchange rates are always fluctuating. Let's say company $A$ makes $50\\%$ profit in country $b$, but $b'$ has fallen in value harshly, completely negating the profit the business made in country $b$. This is a lot of risk that companies don't want to take. They would be willing to take risk in betting that their products and services would be a hit in country $b$, but they wouldn't want to deal with fluctuations in the exchange market which is outside their area of expertise. This is where other investment firms and speculative investors can help these businesses out by taking bets and helping the company by trading using [Derivatives - Options](/blog/derivatives-options), which give peace of mind to the company and risk-reward potential for the speculative investors. Similarly, governments can buy up chunks of foreign currency to influence the market, play politics, etc. \n# References\n1. [Economics Explained - YouTube](https://www.youtube.com/watch?v=ig_EO805rpA)\n2. [Bretton Woods Conference - Wikipedia](https://en.wikipedia.org/wiki/Bretton_Woods_Conference)", - "lastmodified": "2026-01-11T05:06:32.770736171+05:30", + "lastmodified": "2026-01-11T10:59:44.933836076+05:30", "tags": [ "domain-finance-economics" ] }, "/blog/the-fast-fourier-transform-fft": { "title": "The Fast Fourier Transform (FFT)", "content": "# FFT (Fast Fourier Transform)\n\n**The problem:** _Given two d-degree polynomials, compute their product_\n\nLet $A(x) = a_0 + a_1x + ... + a_dx^d \\ \\text{and} \\ B(x) = b_0 + b_1x+...+b_dx^d$\n\nThen,\n\n$C(x) = A(x)\\times B(x) = c_0 + c_1x+...+ c_{2d}x^{2d}$ has coefficients $c_k = a_0b_k+a_1b_{k-1}+...+a_kb_0 = \\sum_{i=0}^ka_ib_{k-i}$\n\nThe naïve solution here would be to compute in $O(d^2)$ steps. There are $2d+1$ terms in the final expression and each of these terms requires order $O(d)$ multiplications to compute. 
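\n\nAs a baseline, here's what the naive approach looks like as a minimal sketch (the example polynomials are made up):\n\n```cpp\n#include \u003ccstdio\u003e\n#include \u003cvector\u003e\n\n// Naive polynomial multiplication: c_k = sum over i of a_i * b_{k-i}.\n// Two degree-d inputs give a degree-2d product, using O(d^2) multiplications.\nstd::vector\u003cdouble\u003e multiply(const std::vector\u003cdouble\u003e \u0026a, const std::vector\u003cdouble\u003e \u0026b) {\n\tstd::vector\u003cdouble\u003e c(a.size() + b.size() - 1, 0.0);\n\tfor (size_t i = 0; i \u003c a.size(); ++i)\n\t\tfor (size_t j = 0; j \u003c b.size(); ++j)\n\t\t\tc[i + j] += a[i] * b[j];\n\treturn c;\n}\n\nint main() {\n\t// (1 + 2x)(3 + 4x) = 3 + 10x + 8x^2\n\tfor (double coeff : multiply({1, 2}, {3, 4})) std::printf(\"%g \", coeff);\n}\n```\n\n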
The question is, _can we do better?_\n\nDivide and conquer is an approach that works well when we are able to introduce/identify some sort of overlap in subproblems. But for each coefficient, the multiplication terms **do not** have much overlap. Perhaps a different view is in order.\n\nThe **co-efficient** representation of polynomials is essentially an equation that can _uniquely_ identify some function on a graph. There are definitely other representations that will allow us to do the same.\n\nThe one we will be looking at today is the **value** representation of a function. Consider any function defined by some $d$ degree polynomial. Notice that such a function can always be _uniquely_ identified by any set of $d+1$ points that satisfy the equation (are on its graph).\n\n**Proof:** Say we have a $d$ degree polynomial $P$ and we evaluate it at $d + 1$ unique points. We end up with the set of points $\\{ (x_0, P(x_0)), (x_1, P(x_1)), \\dots, (x_d, P(x_d)) \\}$.\n\nIf $P(x) = p_dx^d + p_{d-1}x^{d-1}+\\dots+p_2x^2+p_1x^1+p_0$\n\nNotice that there are $d+1$ coefficients for each such $P(x)$. Writing our equation in matrix form,\n\n$$ \\begin{bmatrix} P(x_0) \\\\ P(x_1) \\\\ \\vdots \\\\ P(x_d) \\end{bmatrix} = \\begin{bmatrix} 1 \u0026 x_0 \u0026 x_0^2 \u0026 \\dots \u0026 x_0^d \\\\ 1 \u0026 x_1 \u0026 x_1^2 \u0026 \\dots \u0026 x_1^d \\\\ \\vdots \u0026 \\vdots \u0026 \\vdots \u0026 \\ddots \u0026 \\vdots \\\\ 1 \u0026 x_d \u0026 x_d^2 \u0026 \\dots \u0026 x_d^d \\end{bmatrix} \\begin{bmatrix} p_0 \\\\ p_1 \\\\ \\vdots \\\\ p_d \\end{bmatrix} $$\n\nNotice that there are $d+1$ variables and $d+1$ equations. If we had any fewer, we would not be able to uniquely solve for this system. Hence we need at least $d+1$ points. Another way to visualize this is that our matrix of equations is invertible for unique points $x_0, x_1, \\dots, x_d$. This can be proved by solving for the determinant. This implies that we have a unique set of coefficients with which we can identify the polynomial.\n\nBut _why_ do we care about the value representation of a polynomial?\n\n## Value representation: The good and the bad\n\nNotice that if we have some polynomial $C(x)$ which is the result of multiplication of two $d$ degree polynomials $A(x)$ and $B(x)$, the degree of polynomial $C(x)$ must be $2d$. This means that it can be _uniquely_ identified by just $2d+1$ points.\n\nNow, for some point $x_0$, $C(x_0) = A(x_0)\\times B(x_0)$.\n\nThis means that, if we could pick and evaluate polynomials $A(x)$ and $B(x)$ at $2d+1$ points, we can generate the $2d+1$ points that _uniquely_ identify $C(x)$ in **linear time**.\n\nHowever, this is assuming that converting the polynomial from coefficient form to value form and back takes $O(n)$ time or less. This is **not** true. We must evaluate a polynomial with $d$ terms at $2d+1$ points. This calculation is of the order $O(d^2)$ and hence, no better than the naïve method. This is where the idea of **FFTs** comes in.\n\n## Evaluating faster (Applying divide and conquer)\n\nThe problem we wish to solve is as follows. Given a polynomial function $A(x)$ and a set of points $X$, we wish to compute $A(x) \\ \\ \\forall x \\in X$.\n\nLet $A(x) = a_0+a_1x+\\dots+a_{n-1}x^{n-1}$, where $n = d + 1$ is the number of coefficients.\n\nNotice that we can divide our polynomial into two halves, one containing the even powers of $x$ and another containing the odd powers. 
Let's call them $A_e(x)$ and $A_o(x)$.\n\n$A_e(x) = \\sum_{k=0}^{\\frac{n}{2}-1}a_{2k}x^k$\n\n$A_o(x) = \\sum_{k=0}^{\\frac{n}{2}-1}a_{2k+1}x^k$\n\nNotice that $x$ is no longer raised to the power of each coefficient's original index, and in doing so, we have effectively cut in half the degree of the polynomial. But in doing so, we have lost the original polynomial. We still require an algebraically correct way to merge these two divisions into the original polynomial.\n\nNotice that if we evaluate $A_e$ at $x^2$ instead of $x$, the algebra checks out. $(x^2)^k = x^{2k}$. Every polynomial term matches its counterpart in the original polynomial. Similarly, we can do the same for $A_o$, but we are now missing a $+1$ in the powers of every term. This can be easily corrected by simply multiplying the whole polynomial by a single $x$, similar to Horner's rule. This gives us our final equation,\n\n$$ A(x) = A_e(x^2)+xA_o(x^2) $$\n\nThis has allowed us to effectively calculate the value of $A(x)$ for some point using a technique that uses divide and conquer. But **is** it truly faster than any of the previous algorithms?\n\n### Analyzing time complexity\n\nNotice that\n\n$$ T(n, |X|) = 2T(\\frac{n}{2}, |X|)+O(n+|X|) $$\n\nThe $\\frac{n}{2}$ comes from dividing the input to each recurrence in half. We have 2 such recursive calls. These 2 factors account for the first term in the expression. Now, at each \"node\" of our recursive tree, we have to do $O(n)$ computation for traversing the polynomial list and splitting it into two halves. And finally, $O(|X|)$ time for computing the polynomial at each $x\\in X$.\n\nTo solve this recurrence, let us imagine the recursion tree. The base case for this recursion is when $n=1$. When $n=1$, the answer is the value in the set itself. However, notice that at no point are we **ever** changing the size of the set $X$. The original size of $X$ was $n$, and it remains $n$ at every step of the algorithm. This will span out to be a binary tree of depth $log_2(n)$, with each node doing $O(n)$ computation.\n\n![fft-1](/images/fft-1.webp)\n\nAt the bottom-most level, notice that we still have order $n$ leaves, each of which is doing order $n$ computation. This will sadly give us a time complexity of $O(n \\times n) = O(n^2)$.\n\nThe reason why every node must do $O(n)$ computation is because we haven't been able to change the size of the set $X$ like we have managed to with $n$. If we could somehow halve the size of $X$ just like we did with $n$, we would get a much simpler recurrence, $T(n) = 2T(\\frac{n}{2})+O(n)$ with $|X|$ shrinking alongside $n$, which evaluates to just $O(nlogn)$. But how can we reduce the size of the set of all points we need to evaluate our polynomial at?\n\n### The final piece of the puzzle\n\nLet's take a look at our equation again\n\n$$ A(x) = A_e(x^2)+xA_o(x^2) $$\n\nIn the recursive call to $A_e$ and $A_o$, we have so far managed to _reduce_ the value of $n$ (no. of terms in the polynomial), by half. But we haven't managed to halve the size of $X$, the set of all points we require to evaluate our polynomial at. So let's take our attention off $n$ and think about $x$.\n\nAt every step, or \"node\" of our algorithm, notice that we are passing the value of $x^2$. 
Another key realization is that we are **free** to choose any $X$ we want as long as all the points in $X$ are unique.\n\nThis has allowed us to transform the problem of reducing the size of $X$ into a simpler question, _\"Does there exist some $x^2$ for which there are multiple unique roots $x_0$ and $x_1$?\"_\n\nNotice that at least in the real plane, the answer is **no.** Well, it might work for the first \"root node\" of our recursion tree. Every real number except zero satisfies the property that $x^2=(-x)^2$. Hence we can just evaluate the polynomial at some set of points $x$ and $-x$. But in the second level of our recursion, we have a huge problem. $x^2$ will **always** be a _positive_ value. This means, we no longer have positive-negative pairs to work with. Our set $X$ is no longer free to choose. It has the constraint on it that it **must** be all positive. Without our $\\pm x$ pairs, we cannot proceed.\n\n**Breaking out of the real plane**\n\nHere comes the last piece of our puzzle. While the above was true for real numbers, it is **not** true for complex numbers. Let's assume our set at the final depth of its recursion was $X = \\{ 1 \\}$.\n\nFor the set to be halved in the level just above, we require **two** values of $x_0$ and $x_1$ such that $x_0^2=x_1^2=1$.\n\nTwo such values are $-1$ and $+1$. Let's try thinking one level above this. We would require two values $x_0$ and $x_1$ such that $x_0^2=x_1^2=-1$. Two values that fit this equation are $i$ and $-i$.\n\nNotice that we can keep doing this at every step of our recursion, and we would just keep picking the $k^{th}$ roots of $1$ at every level.\n\nThis is the key realization to solving the problem of reducing the size of $X$. By choosing our set $X$ as the set of all the $n^{th}$ roots of unity (with $n$ rounded up to a power of $2$), we have effectively managed to halve the size of $X$ along with $n$ at **every** step of our algorithm: squaring maps the $n^{th}$ roots of unity onto the $\\frac{n}{2}^{th}$ roots of unity. Our recursion tree now looks more like this\n\n![fft-2](/images/fft-2.webp)\n\nBy simply computing $A_e(x)$ and $A_o(x)$ at $\\frac{n}{2}$ points each, we can compute the answer at $n$ points. The roots of unity always occur in $\\pm$ pairs and evaluate to the **same** value when squared. This means we can write it as follows.\n\n$$ A(x) = A_e(x^2) \\pm xA_o(x^2) \\quad \\text{for one } x \\text{ from each } \\pm \\text{ pair in } X $$\n\nThis has allowed us to transform our original equation for calculating time complexity into the following\n\n$$ T(n, |X|) = 2T(\\frac{n}{2}, \\frac{|X|}{2})+O(n) = O(nlogn) $$\n\nWe have managed to come up with an algorithm that can compute the value of some polynomial function $A(x)$ with $n$ terms at every point in some set $X$ of size of the order $n$ in $O(nlogn)$ time.\n\n### Converting back to polynomial form [Interpolation]\n\nNow, we have an algorithm that can **almost** do it all. We can convert from polynomial representation to value representation in just $O(nlogn)$ complexity. We can multiply the corresponding values at the $2d+1$ points and find the value representation of the polynomial product in $O(n)$ complexity. The only thing left is to convert the polynomial obtained back from value form to polynomial form.\n\nWith a little thought, we can use the same FFT algorithm we just came up with to interpolate our values back to give us our polynomial in coefficient form. 
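\n\nHere's a minimal sketch of the evaluation half we just described (assuming the coefficient count is a power of two; the inverse transform, discussed next, additionally uses conjugate roots and a $\\frac{1}{n}$ scaling):\n\n```cpp\n#include \u003ccomplex\u003e\n#include \u003cvector\u003e\n\nusing cd = std::complex\u003cdouble\u003e;\nconst double PI = 3.141592653589793;\n\n// Recursive FFT: evaluates the polynomial with coefficient vector a (size a\n// power of two) at the n-th roots of unity in O(n log n), using\n// A(x) = A_e(x^2) + x * A_o(x^2).\nstd::vector\u003ccd\u003e fft(const std::vector\u003ccd\u003e \u0026a) {\n\tsize_t n = a.size();\n\tif (n == 1) return a; // base case: a constant polynomial\n\tstd::vector\u003ccd\u003e even(n / 2), odd(n / 2);\n\tfor (size_t i = 0; i \u003c n / 2; ++i) {\n\t\teven[i] = a[2 * i];\n\t\todd[i] = a[2 * i + 1];\n\t}\n\tstd::vector\u003ccd\u003e fe = fft(even), fo = fft(odd);\n\tstd::vector\u003ccd\u003e out(n);\n\tfor (size_t k = 0; k \u003c n / 2; ++k) {\n\t\tcd w = std::polar(1.0, 2.0 * PI * k / n); // omega^k, an n-th root of unity\n\t\tout[k] = fe[k] + w * fo[k]; // the +x member of the pair\n\t\tout[k + n / 2] = fe[k] - w * fo[k]; // the -x member of the pair\n\t}\n\treturn out;\n}\n```\n\n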
Let us think about the _original_ equation that we managed to simplify and solve using FFT.\n\n$$ \\begin{bmatrix} P(x_0) \\\\ P(x_1) \\\\ \\vdots \\\\ P(x_d) \\end{bmatrix} = \\begin{bmatrix} 1 \u0026 x_0 \u0026 x_0^2 \u0026 \\dots \u0026 x_0^d \\\\ 1 \u0026 x_1 \u0026 x_1^2 \u0026 \\dots \u0026 x_1^d \\\\ \\vdots \u0026 \\vdots \u0026 \\vdots \u0026 \\ddots \u0026 \\vdots \\\\ 1 \u0026 x_d \u0026 x_d^2 \u0026 \\dots \u0026 x_d^d \\end{bmatrix} \\begin{bmatrix} p_0 \\\\ p_1 \\\\ \\vdots \\\\ p_d \\end{bmatrix} $$\n\nWe chose the $x_i$ such that every $x_i \\in X$ is some $n^{th}$ root of unity. To get our original coefficient vector back, we only need the inverse of this matrix of roots of unity, and we can use the FFT itself to compute the product of the inverse matrix and the values vector.\n\nThis was our choice for the $X$ matrix,\n\n$$ M_n(\\omega) =\\begin{bmatrix}1 \u0026 1 \u0026 1 \u0026 \\dots\u0026 1\\\\1 \u0026\\omega \u0026 \\omega^2 \u0026 \\dots \u0026\\omega^{n-1}\\\\1 \u0026 \\omega^2\u0026 \\omega^4\u0026 \\dots \u0026\\omega^{2(n-1)}\\\\\u0026\u0026\\vdots\\\\1 \u0026 \\omega^j \u0026 \\omega^{2j} \u0026\\dots\u0026\\omega^{(n-1)j} \\\\\u0026\u0026\\vdots\\\\1 \u0026 \\omega ^{n-1}\u0026 \\omega^{2(n-1)} \u0026 \\dots \u0026\\omega ^{(n-1)(n-1)}\\end{bmatrix} $$\n\n$M_n(\\omega)$ is a Vandermonde matrix with the property that it is invertible if and only if every choice of $x_i$ is unique. This is true in our case and hence $M_n(\\omega)$ is invertible. Once this proof has been done for the sake of proving correctness, we have a **complete** solution to solve the problem of polynomial multiplication in just $O(nlogn)$ time.\n\n$$ \\text{Compute values of } A(x) \\text{ and } B(x) \\text{ at } 2d+1 \\text{ points using FFT. Multiply the corresponding points with each other to obtain the value representation of the product } C(x) \\text{ at } 2d+1 \\text{ points. Use the inverse FFT to compute the coefficients of } C(x) \\text{ for each of its } 2d+1 \\text{ terms.} $$\n\nThe overall time complexity is $O(nlogn)+O(n)+O(nlogn) = O(nlogn)$\n# References\nThese notes are old and I did not rigorously hoard references back then. If some part of this content is yours or you know where it's from then do reach out to me and I'll update it. \n1. Professor [Kannan Srinathan's](https://www.iiit.ac.in/people/faculty/srinathan/) course on Algorithm Analysis \u0026 Design in IIIT-H\n2. [Divide \u0026 Conquer: FFT - MIT 6.046J OCW - Erik Demaine](https://youtu.be/iTMn0Kt18tg?si=MkE2euxhcF1whuSR)\n", - "lastmodified": "2026-01-11T05:06:33.032008196+05:30", + "lastmodified": "2026-01-11T10:59:45.189766148+05:30", "tags": [ "domain-cs-algorithms-analysis" ] }, "/blog/theoretical-metrics-for-benchmarking": { "title": "Theoretical Metrics for Benchmarking", "content": "# Benchmarking a system\n\n[Profiling a Program](/blog/profiling-a-program) tells us where hotspots in a program's execution are, what parts are the bottleneck and where we should focus our attention when trying to improve performance, but it’s just as important for us to know what the theoretical best performance is that we can get given some hardware. (More on this in [Brent's Theorem \u0026 Task Level Parallelism](/blog/brent-s-theorem-task-level-parallelism) as well).\n\n## Why benchmark?\n\nThere is no point repeatedly trying to optimize some code just because it takes up the most time when that part of the code is already performing very close to its theoretical maximum. 
Some code can only execute so fast on some given hardware and there is no point trying to optimize it further.\n\n[https://stackoverflow.com/questions/329174/what-is-flop-s-and-is-it-a-good-measure-of-performance](https://stackoverflow.com/questions/329174/what-is-flop-s-and-is-it-a-good-measure-of-performance)\n\n\u003e If you know the CPU's theoretical peak performance in FLOPS, you can work out how efficiently you use the CPU's floating point units, which are often one of the hard to utilize efficiently. A program which runs 30% of the FLOPS the CPU is capable of, has room for optimization. One which runs at 70% is probably not going to get much more efficient unless you change the basic algorithm. For math-heavy algorithms like yours, that is pretty much the standard way to measure performance. You could simply measure how long a program takes to run, but that varies wildly depending on CPU. But if your program has a 50% CPU utilization (relative to the peak FLOPS count), that is a somewhat more constant value (it'll still vary between radically different CPU architectures, but it's a lot more consistent than execution time). But knowing that \"My CPU is capable of X GFLOPS, and I'm only actually achieving a throughput of, say, 20% of that\" is very valuable information in high-performance software. It means that something other than the floating point ops is holding you back, and preventing the FP units from working efficiently. And since the FP units constitute the bulk of the work, that means your software has a problem. It's easy to measure \"My program runs in X minutes\", and if you feel that is unacceptable then sure, you can go \"I wonder if I can chop 30% off that\", but you don't know if that is possible unless you work out exactly how much work is being done, and exactly what the CPU is capable of at peak. How much time do you want to spend optimizing this, if you don't even know whether the CPU is fundamentally capable of running any more instructions per second?\n\n## Defining some metrics to measure performance\n\nDefining “how fast” a system performs is not an easy task. A specialized system might perform with varying speeds for different types of tasks. Plus, we don’t even have a unit to measure “performance.”\n\nAgain, performance of a system depends on the type of task it is computing. A graphics card will perform exceptionally well in highly parallelized workloads but might be terrible in serial workloads. When evaluating devices we must keep into account our use case and use a performance metric that is appropriate for our use case. Keeping this in mind, the world of computer science has several units on which we can measure the performance of a device. Perhaps the most famous of which, is the GFLOP.\n\n1. **FLOPS -** FLOPS in CS mean “Floating point operations per second”, and as the name implies, this metric is quite useful when we are evaluating some system for performance in scientific computation involving workloads. A CPU capable of 2 GFLOPS is twice as fast as another CPU only capable of doing 1 GFLOP, _while doing floating point operations_.\n \n A natural question to ask here is, “Why FLOPS?”\n \n → The reason for this is that this method of computation is widely employed in scientific simulation work. Furthermore, when compared to integer operations, FLOPs are substantially more complex for a CPU to execute. Both of these are major factors why FLOPs have become a universal benchmark for HPC. 
Computers that perform online services or database queries, on the other hand, generally use integer arithmetic and for them we have MIPS. For single PC systems, we compute\n \n $$ FLOPS = cores \\times \\frac{cycles}{second} \\times \\frac{FLOPs}{cycle} $$\n \n2. **MIPS -** MIPS stands for “Millions of instructions per second” and it is essentially a measure of the integer performance of a computer. However, in CISC machines different instructions take different amounts of time to execute, hence the reported value depends heavily on the mix of the instructions used in the benchmark. MIPS tries to model the peak performance of a machine with few branches and no cache contention. It is a decent estimate of computer performance on commonly used apps which rely on integer arithmetic.\n \n $$ IPS = sockets \\times \\frac{cores}{socket}\\times clock\\times\\frac{instructions}{cycle} $$\n \n Note: Instructions/cycle depends on the test instructions, data and many such test-specific factors.\n \n ## Benchmarking software\n \n There’s a bunch of programs out there written to stress-test your machine and record the max `insert-quantity-you're-measuring` that your machine is capable of. One such tool is the [Whetstone Benchmark](https://www.google.com/url?sa=t\u0026rct=j\u0026q=\u0026esrc=s\u0026source=web\u0026cd=\u0026ved=2ahUKEwj2mdK7vuX2AhVCR2wGHbtFAocQFnoECAcQAQ\u0026url=https%3A%2F%2Fen.wikipedia.org%2Fwiki%2FWhetstone_(benchmark)\u0026usg=AOvVaw0Letk1QFpBEe0FOy7rJpyU). I found code for it on [Netlib](https://www.google.com/url?sa=t\u0026rct=j\u0026q=\u0026esrc=s\u0026source=web\u0026cd=\u0026ved=2ahUKEwiPmKrKvuX2AhX4RmwGHfzNB_4QFnoECBAQAQ\u0026url=https%3A%2F%2Fwww.netlib.org%2Fbenchmark%2Fwhetstone.c\u0026usg=AOvVaw0axRP6mDSotBx0v1OCU_Ho). Although, looking at the code and results it produces, it looks like Netlib’s version measures MIPS, not FLOPS.\n \n ## Detour into compiler benchmarks and analysis\n \n I decided to try out different compilers and compiler flags and I’m glad I did; the results I got are pretty surprising.\n \n ### ICC\n \n - `-O0` : 5882.4 MIPS\n - `-O3` : 100000.0 MIPS _(17x speedup!)_\n \n ### GCC\n \n - `-O0` : 6250.0 MIPS _(Faster than `icc -O0`!)_\n - `-O3` : 25000.0 MIPS _(4x slower than `icc -O3`)_\n \n This wasn’t what I expected at all. You’d expect the numbers to at least be close, perhaps some `avx` optimizations aren’t on?\n \n - `-O3 -mavx2 -mfma` : 25000.0 MIPS\n \n No change at all. This was quite disappointing, until...\n \n - `-Ofast` : 100000.0 MIPS\n \n What is going on? Why is there a sudden increase of **4x** performance from this flag? More on this later.\n \n ### Clang - LLVM\n \n - `-O0` : 5555.6 MIPS _(On par with `icc` and slightly slower than `gcc`)_\n - `-O3` : 16666.7 MIPS _(1.5x slower than `gcc`)_\n - `-Ofast` : 100000.0 MIPS\n - `-Ofast -mavx2` : 125000.0 MIPS _(Highest we’ve managed to record)_\n \n So what is going on here? How is `icc` so fast with just `O3` and why do the other two compilers need an extra flag to reach `icc`'s `O3` speeds? And no, `icc` does not speed up more on `Ofast`; the above were the highest scores I managed to get out of each compiler by trying a lot of different flags.\n \n Upon investigating further, we discover that `-Ofast` turns on a flag called `-ffast-math` which is basically allowing the compiler to break some rules when doing floating point arithmetic for the sake of performance. 
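\n\nTo see what rule-breaking means here, a tiny check (the constants are arbitrary) shows that floating point addition isn't associative, which is exactly the kind of reordering `-ffast-math` permits:\n\n```cpp\n#include \u003ccstdio\u003e\n\n// IEEE double addition is not associative; -ffast-math lets the compiler\n// reassociate (and thus vectorize) anyway.\nint main() {\n\tdouble a = 0.1, b = 0.2, c = 0.3;\n\tstd::printf(\"(a + b) + c = %.17g\\n\", (a + b) + c); // 0.60000000000000009\n\tstd::printf(\"a + (b + c) = %.17g\\n\", a + (b + c)); // 0.59999999999999998\n}\n```\n\n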
[This blog](https://simonbyrne.github.io/notes/fastmath/) by Simon Byrne explains the risks of `-ffast-math` in detail, but in brief, the flag allows the compiler to rearrange and vectorize more code, hence the huge performance boost. Turns out, in floating point land, $a + (b+c) \\neq (a+b) + c$, so reassociating operations can change results and introduce more floating point precision errors. Further, `-ffast-math` assumes all computed math is finite, so no `inf`, `nan`, etc. Subnormals (numbers _very_ close to zero) get flushed to 0 via a hardware FPU control register, which may even affect _other code_ running on the same thread.\n \n To be fair, most of these optimizations won’t affect code in non-scientific environments much, but at the same time these are real risks. **The reason why `icc` gets such high performance on `O3` is because Intel has its version of `-ffast-math` on by default.** `-fp-model=fast` is the default in [ICC](https://www.intel.com/content/www/us/en/develop/documentation/cpp-compiler-developer-guide-and-reference/top/compiler-reference/compiler-options/compiler-option-details/floating-point-options/fp-model-fp.html).\n \n ### Reality of theoretical benchmarks\n \n In practice, while our theoretical estimates give us a _very loose_ upper bound on the performance we can hope to expect, they’re not very accurate. Consider FLOPS, for example: most floating point operations do not take one cycle. FMA operations combine add and multiply and have their own throughput / latency to consider. SIMD vectorization affects the number of data units on which we apply the FLOP at the same time. Even SIMD instructions have their own throughput / latency. Add to this pipelining effects and, most importantly, memory limits (small cache sizes, finite memory bandwidth, etc.), and there are many factors which give us much lower practical peak performance vs theoretical peak performance. Hence, while theoretical numbers are not a _terrible_ idea, it’s almost always good practice to run software benchmarks and get a good practical estimate as well.\n \n This site (translated from Russian) gives a pretty nice overview of why theoretical computation of GFLOPS/sec is a bad estimate and suggests an alternative which involves profiling a program to count the number of floating point operations it does and using this to measure GFLOPS/sec instead. [https://habr-com.translate.goog/ru/company/intel/blog/144388/?_x_tr_sl=auto\u0026_x_tr_tl=en\u0026_x_tr_hl=en-US\u0026_x_tr_pto=wapp](https://habr-com.translate.goog/ru/company/intel/blog/144388/?_x_tr_sl=auto\u0026_x_tr_tl=en\u0026_x_tr_hl=en-US\u0026_x_tr_pto=wapp)\n\n# Operational Intensity and Achievable peak throughput\n\nA slightly better estimate involves using the max memory bandwidth our CPU can sustain. Let’s say we’re running the `saxpy` program, which computes $S = \\alpha X +Y$. Here we’ll assume $\\alpha$ is a scalar stored in memory and $X$ and $Y$ are floating point vector arrays. Now, we’ll disregard the load and store operations required to load the operands into registers, as these will mostly be pipelined. We are doing $2$ floating point computations: one add and one multiply. We’ll ignore FMA for now.\n\nSo we’re doing $2$ FLOPs on $2 \\times 4 = 8$ bytes of data (two 4-byte floats per element). This gives us $operational \\ intensity = \\frac{2}{8} = 0.25$.\n \nOperational intensity is essentially the number of FLOPs we’re doing per byte of data moved. Multiplying this number by the max bandwidth gives us an idea of the peak throughput our program can hope to achieve. 
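\n\nFor concreteness, here is a minimal sketch of the `saxpy` kernel being analyzed (my own illustration; the function signature is just the conventional one):\n\n```cpp\n// saxpy: s[i] = alpha * x[i] + y[i]\n// Per element: 2 FLOPs (one multiply, one add) on 2 x 4 = 8 bytes of loaded data.\nvoid saxpy(float alpha, const float *x, const float *y, float *s, int n) {\n\tfor (int i = 0; i \u003c n; i++)\n\t\ts[i] = alpha * x[i] + y[i];\n}\n```\n\n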
If my CPU had a max bandwidth of, let’s say, 50 GB/s, I would have a peak throughput of around $0.25 \\times 50 = 12.5 \\ GFLOPS / sec$", - "lastmodified": "2026-01-11T05:06:33.04200757+05:30", + "lastmodified": "2026-01-11T10:59:45.198907735+05:30", "tags": [ "domain-cs-systems-high-performance-computing" ] }, @@ -462,7 +462,7 @@ "/blog/towards-a-completely-self-driven-dbms-challenges-ottertune-faced": { "title": "Towards a Completely Self-Driven DBMS - Challenges OtterTune Faced", "content": "\n# Abstract\nThis blog / set of notes is not based on a paper, but rather a set of talks given by the founders of OtterTune, describing their vision for the product and the challenges they had faced moving the OtterTune project from Academia to a sellable product. Not surprisingly, a lot of real-life problems surfaced when the product started being run in production environments. The goal of this blog is to explore Andy's vision for a completely self-driving database, the challenges they faced with OtterTune, how they were forced to introduce a significant amount of manual work in their automatic pipeline to make things work, and maybe some thoughts on how we can try to get around that.\nThe talks I am referring to are:\n- [OtterTune: AI-Powered Database Optimization as a Service! (Dana Van Aken)](https://www.youtube.com/watch?v=glBoRyShD7E)\n- [Andy Pavlo - OtterTune: Using Machine Learning to Automatically Optimize Database Configurations](https://www.youtube.com/watch?v=fVZTnqgXSyw)\n- [Why Machine Learning for Automatically Optimizing Databases Doesn't Work by Andy Pavlo - JOTB23](https://www.youtube.com/watch?v=-p0fX1rsoxY)\nNote that I assume that you've either read the original [OtterTune Paper](https://db.cs.cmu.edu/papers/2017/p1009-van-aken.pdf) or my blog on it [OtterTune - Automatic Database Management System Tuning Through Large-scale Machine Learning](/blog/ottertune-automatic-database-management-system-tuning-through-large-scale-machine-learning). If you have, you'll already know why this is such a difficult problem to solve \u0026 why solving it would make things a lot less expensive \u0026 faster for all database maintainers. \n# What Can Be Optimized?\nWhat we want to have is a completely external \u0026 automated service. One that can hook onto any external database (including sharded systems like CockroachDB). Everything in the pipeline must be automated. All the points listed below sort of fall into the same bucket in that you want to automate the tuning of these properties to speed up execution of your SQL queries. All of these are difficult to deterministically optimize and we have grown to rely on machine learning heuristics to optimize these properties.\n## Indexes\nDeciding on the set of indexes to maintain can be a challenging task. The more indexes you maintain, the more queries you can make run faster. However, keep too many indexes and all your write operations will have to perform updates on multiple indexes, slowing them down by a significant amount. There has been a significant amount of work in this field, something I plan on exploring soon. 
\nRecommended papers:\n- [Automatically Indexing Millions of Databases in Microsoft Azure SQL Database](https://www.microsoft.com/en-us/research/uploads/prod/2019/02/autoindexing_azuredb.pdf)\n- Oracle Autonomous Database Service\n- [Demonstrating UDO: A Unified Approach for Optimizing Transaction Code, Physical Design, and System Parameters via Reinforcement Learning](https://dl.acm.org/doi/pdf/10.1145/3448016.3452754)\n- [openGauss: An Autonomous Database System](https://dbgroup.cs.tsinghua.edu.cn/ligl/papers/vldb21-opengauss.pdf)\n## Partitioning\nThe linked paper is regarding finding an optimal sharding scheme for the database in a Cloud environment.\n- [Learning a Partitioning Advisor for Cloud Databases](https://15799.courses.cs.cmu.edu/spring2022/papers/10-partitioning/hilprecht-sigmod2020.pdf)\n## Knob Configuration\nWith time, the number of configurable variables or \"knobs\" exposed by the developers of popular DBMS software has risen. Depending on your workload, tuning these parameters from default to the optimal configuration can lead to much better utilization of hardware \u0026 give significant cost and speed benefits to DBMS workloads.\n- [Automatic Database Management System Tuning Through Large-scale Machine Learning](https://db.cs.cmu.edu/papers/2017/p1009-van-aken.pdf) / [OtterTune - Automatic Database Management System Tuning Through Large-scale Machine Learning](/blog/ottertune-automatic-database-management-system-tuning-through-large-scale-machine-learning) :)\n- [An End-to-End Automatic Cloud Database Tuning System Using Deep Reinforcement Learning](https://dbgroup.cs.tsinghua.edu.cn/ligl/papers/sigmod19-cdbtune.pdf)\n- [Akamas](https://www.akamas.io/resources/)\n- [ResTune: Resource Oriented Tuning Boosted by Meta-Learning for Cloud Databases](https://15799.courses.cs.cmu.edu/spring2022/papers/08-knobs3/zhang-sigmod2021.pdf)\n- [QTune: A Query-Aware Database Tuning System with Deep Reinforcement Learning](https://www.vldb.org/pvldb/vol12/p2118-li.pdf)\n## Query Optimization\nMySQL query optimizers are known to be notoriously bad. But providing the right hints \u0026 suggesting better execution plans to said optimizers can also lead to huge increases in hardware utilization. \n- [Bao: Making Learned Query Optimization Practical](https://15799.courses.cs.cmu.edu/spring2022/papers/17-queryopt1/marcus-sigmod2021.pdf)\n- [Neo: A Learned Query Optimizer](https://www.vldb.org/pvldb/vol12/p1705-marcus.pdf)\n- [MySQL HeatWave Goes on Autopilot Delivering Automated Real-Time Query Analytics Faster, Cheaper](https://www.oracle.com/a/ocom/docs/mysql/idc-mysql-heatwave-autopilot-report.pdf)\n# OtterTune: Challenges in Transitioning from Academia to Industry\nThe efficacy of ML algorithms for tuning databases in the research literature is impressive, but there are many assumptions made in the papers which allow them to obtain those numbers. However, while the **academic project only focused on configuration knobs, the commercial project had expanded to include knobs, indexes, cloud configurations \u0026 queries.**\n## Assumption 1: Availability of Training Environments Mirroring Production Exactly\nOne of the biggest pain points they faced when optimizing production databases is that the paper assumed the availability of a training environment that mirrors production exactly. 
One might expect companies to maintain such *staging* environments, but the truth is that the staging and dev environments are often on much smaller instance configurations (for saving costs) and do not face a workload identical to that faced by their production counterparts. An ML algorithm trained on this configuration will do excellently in staging, but may perform very poorly when moved to production. Sometimes you also have features like *dynamic hardware scaling (burst credits)* that allow the servers to crank up CPUs during random short intervals which might throw off your readings by quite a bit. \n\nIn one of their field studies,\n\u003e Customer $X$ got a $15\\%$ reduction in Aurora PostgreSQL IOPS in their staging DB, but only $1\\%$ when applying the same optimizations to prod DB.\n## Assumption 2: Availability of Tools for DBAs to Replay Workload Traces in Identical Environments\nEven if the company had identical staging \u0026 production environments, the two databases are often under very different loads. The staging database might be used much more sparingly in comparison to the production database. \"Replaying\" a trace becomes a difficult problem to solve. To carry out such a task you'd need to log the time of arrival of queries and then mirror this execution as a replay trace on the copied staging database. Further, without a **repeatable** workload as a baseline, it is difficult for ML models to learn whether they are improving a database. \n- Tools for open-source DBMSs are less sophisticated than those for commercial DBMSs.\n- Existing built-in slow query log methods do not capture transaction boundaries.\n\nOne of the best solutions to this problem identified so far was how Oracle did it.\n![pasted-image-20240909180058](/images/pasted-image-20240909180058.webp)\n- [Why Machine Learning for Automatically Optimizing Databases Doesn't Work by Andy Pavlo - JOTB23](https://www.youtube.com/watch?v=-p0fX1rsoxY)\n\nBy having two identical instance replicas of the database and a proxy at the application level which mirrors the traffic to both the databases, you have a very reliable and identical measure of the workload that the production database has to process. \n\nRegarding repeatability, the product moved from the **5-minute monitoring period mentioned in the paper to use a much larger 1-day window for monitoring**. This presumably is long enough to learn the general characteristics of the workload, even in a distributed setting where the placement driver may schedule queries to different nodes differently at a higher layer.\n\nFurther, to build generalizable models and transfer experience, they needed training data for a diverse set of **workloads**, **hardware** \u0026 **software configurations** $\\to$ They needed a strategy change.\n## Assumption 3: Users Know What Their Databases Are Doing / Models Working In Isolation\nSometimes, bugs in the CI/CD pipeline could cause random spikes in usage which the company might need days to debug and track down. If the humans are unable to immediately identify the cause of the issue, the model does not know this either and might start incorrectly tuning the query since the function landscape captured during that period would be significantly different from what it is supposed to otherwise optimize.\n\nFurther, the OtterTune software works on isolated database nodes. 
If two databases are working as replicas of each other or sharing similar data, it is definitely useful for an automated tuner to know that the two databases are replicas and that they might benefit greatly from starting at the same configs instead of wasting expensive time \u0026 resources retraining.\n## Assumption 4: Optimization process is a one-time, offline activity\nTo some extent I believe this was pitched as an assumption partly to convince customers to not use OtterTune as a once-and-done tool. That said, it is true that depending on the service the customer is providing, there will be moments in time where the workload might change significantly enough to warrant another optimization bout with OtterTune. I highly doubt this ever happens often enough to run OtterTune very frequently, but suppose the customer decides to change their architecture or backend, or handles an influx of customers, all of which are common in developing startups (and even much bigger ones); it would then be worthwhile to re-run OtterTune. But this opens up another can of worms in that you now need to solve another open problem, \"When has a workload shifted enough to warrant running OtterTune again?\" This is a monitor, detect and alert problem which they pose as an open problem to solve in the future.\n## Other Problems\n### Safety\nYou do NOT want to crash your customer's production database at any point in time. If you are tuning on a production database, you have to be **very** careful not to degrade performance by enough to either mess up readings or crash the server. Stability is a lot more important to most customers than peak performance.\n### Time-To-Value Selling Proposition\nHow do you convince a customer that running OtterTune for just days is sufficient to show significant gains? What if it takes weeks? Without a repeatable workload, converging on a near-optimal configuration could take much longer. **This was one of the reasons why OtterTune also diverted to providing health checks, index optimization suggestions, etc. They had to provide value faster.**\n# OtterTune: Solutions (or \"Keeping it Street\")\n## Solving Safety\nAs a consequence of getting around the first two assumptions they made, they realized that running OtterTune on production databases was the only real solution for many of their customers who did not have the relevant tools \u0026 stack to set up a perfect traffic replication system like Oracle's. In these scenarios, the only solution was to run OtterTune like a sidecar attached to production. This obviously has major safety risks. Here's how they tackled it:\n- Provide a significant amount of manual control for:\n\t- **Tuning Periods:** Allow DBAs to schedule exactly when the DBMS will run tuning workloads, collect samples, etc.\n\t- **Human-In-The-Loop-Approval:** Allow DBAs to manually verify all the changes that OtterTune was making before applying. This also required making the changes more human-understandable. The team leveraged a mix of GPT-generated \u0026 human-hardcoded descriptions of knob changes to explain what OtterTune's changes did and why it thought they were correct.\n\t- **Restart Tracking \u0026 Scheduling:** Allow DBAs to schedule exactly when the DBMS will restart. 
This allows them to ensure that it happens when they are doing non-important stuff and that they are available to monitor it closely following the restart.\n\t- **Knob Limits (External Factors):** There may be instances where the DBMS shares the host machine with other software, in un-containerized environments or similar, where there are factors **outside** of OtterTune's control. For example, companies may want to (in instances where **extreme** reliability is paramount, medical software for example) limit memory usage to say $60\\%$ of all available memory even though the recommended value might be $70\\%$. So it was important to make allowances for DBAs to provide a safe range within which OtterTune was allowed to tweak the knob. It cannot exceed or undercut this range under any circumstance. Trade performance for (even if it is only perceived) reliability.\n## Fleet Management\nPeople have a lot of databases. Often it's just the same database with the same workload, duplicated and sharded across many regions. In these scenarios you really don't need machine learning. It makes a lot more sense to just identify that these two tables are the same logically (even though they are two separate physical instances), by comparing schema, name, etc. and then applying the config that was trained in one location in the other place too. In one of their field studies, they noticed that the same table was $10\\times$ faster in one region than the other. Upon manual debugging, they discovered that the DBA had forgotten to build, in the slower region, an index that had been built in the faster region. These mistakes don't need machine learning to solve, but they still need to be identified and solved.\n## Providing Peace of Mind\nThis I believe was somewhat of a pivot for OtterTune to solve the \"optimization is a once and done\" problem. Reliability is a lot more important to customers than performance. So provide health scores \u0026 performance tracking to ensure that customers use the product and also have an abstracted out easy-to-see score to see how \"safe\" and reliable their database is. I believe something like Netflix's Simian Army, with automated runs of its monkeys, would be a useful addition to such a health-scoring system. \n## Give Up \u0026 Integrate Manual Help\n### Knob Selection\nDomain knowledge is still very valuable to help the software at the moment, since the sheer amount of previously recorded data we need for each configuration is a lot more than what is publicly available today. You can think of it as doing machine learning in an era before social media and the internet. Without the \"big data\" collected for this purpose, it is very difficult \u0026 time-consuming for OtterTune to carry out the optimization without manual domain knowledge.\n\nThe configuration search space is very high-dimensional. However, research shows that tuning $\\lt 20$ knobs can achieve near-optimal performance and expedite the tuning process. We can manually mark knobs that don't make sense to tune, require human judgement, or require database restarts. Further, in two years, nobody ever turned on tuning any of the knobs that needed database restarts. So they swept this under the rug even though they could support it. They combined the Lasso \u0026 **SHAP** techniques that they previously used with domain knowledge from PGTune, Percona Blogs and other DBMS manuals to make reducing the search space more efficient.\n\nThey recommend customers only use 10-20 knobs to start tuning. 
Adding all knobs to the search space for tuning makes the program extremely inefficient while not producing any significant gains in comparison.\n### Config Selection\nInstead of starting with only the closest workload that was mapped in the workload characterization step, especially early on in the loop, try \"generally good\" knob values recommended by heuristics like PGTune or MySQLTuner. Adding these to the search space at the start of the tuning session helps bootstrap the process. Further, integrate periodically choosing settings based on data-driven recommendations. Example: If the buffer cache hit ratio is low, increase the buffer pool size.\n\nIn short, they had to give up on the **completely automatic** solution in favor of quickly resolving customer issues by leveraging domain knowledge \u0026 integrating it with their product to provide a seamless experience to their customers.\n# OtterTune: Performance Improvements \u0026 Success\n![pasted-image-20240909202741](/images/pasted-image-20240909202741.webp)\n![pasted-image-20240909202809](/images/pasted-image-20240909202809.webp)\n- [OtterTune: AI-Powered Database Optimization as a Service! (Dana Van Aken)](https://www.youtube.com/watch?v=glBoRyShD7E)\nGiven that $61\\%$ of their customers did attempt to tune the database, OtterTune was able to get significant performance improvements. Further, Andy claims that a decent number of the databases at the bottom of the scatter plot are dead databases which just did not have enough queries being fed to them to notice any performance increase, since they were dormant for the majority of the time anyway. Dana does mention that some of them were likely very well-tuned databases too. \n# OtterTune: Field Studies\n## French Bank\nThe bank had 1000 Postgres instances they wanted optimized. But when they actually went to look at the databases they realized the number of actually used databases was one. For political reasons, saying you'll have a \"database\" for your service made certain approvals easier; this bad practice just took over the bank and they ended up with 1000 useless database instances.\n\nThey came back saying they primarily ran Oracle DBs which had been manually tuned by their expert DBAs. Minimal work had to be done by the OtterTune folk on the driver \u0026 data transformation side, but all the ML algorithms used remained the same. This is what they meant when they wanted this service to be truly plug-and-play on top of any DBMS provider. This is even considering that the bank wanted to optimize something called \"Oracle DB Time\", which is an arbitrary user-defined metric from OtterTune's perspective, but due to how the ML algorithms were defined, it could be supported seamlessly.\n\nThe shared disk had crazy variance, which had to be handled manually.\n![pasted-image-20240909205118](/images/pasted-image-20240909205118.webp)\n- [Andy Pavlo - OtterTune: Using Machine Learning to Automatically Optimize Database Configurations](https://www.youtube.com/watch?v=fVZTnqgXSyw)\n\n**Results:** They managed to get Oracle's resource consumption down by $50\\%$. \n\n![pasted-image-20240909205220](/images/pasted-image-20240909205220.webp)\n\n**Lesson Learned - Experts Make Mistakes:** The DBAs had tuned the DBMS before upgrading from Oracle v11 to v12. They did not check the configuration after the upgrade. 
This is why automation is very useful to catch these changes \u0026 correct them.\n**Lesson Learned - Dealing With Bad Configurations:** With little prior data, algorithms may choose bad configurations that may cause:\n- Extremely slow query execution (Cut the query execution halfway through when we know the data gained from it is redundant anyway?)\n- DBMS fails to start: In practice, they realized that returning, say, $2\\times$ the worst recorded value so far was good enough to make the algorithm converge on a very efficient / optimal configuration. \n- DBMS fails after delay: This is trickier to solve. For example, mid-run memory allocation which breaks the server. They were able to solve this by scraping the logs and identifying these errors. Again, once identified, just give it a very bad score like in the previous scenario and the algorithm performs without any issues.\n## Unnamed Travel Company\nThey had a very read-heavy workload \u0026 wanted to tune their read-replicas. They currently generate configs using in-house formulas based on hardware specifications, with manual tuning from DBAs if required. For OtterTune, they tuned using 20 knobs. \n\n**Results**:\n\n![pasted-image-20240909205912](/images/pasted-image-20240909205912.webp)\n**Lesson Learned - External Factors:** Provide DBAs with control over the minimum and maximum range they are comfortable with a value taking. \n# Open Problems\n## Workload Synthesis\nAs mentioned previously, mimicking production database traffic is not an easy problem to solve for most customers. Can we perhaps solve it by using generative models like Dall-E, etc. to synthesize these workloads to exercise the staging database just like the production database?\n- Relevant Paper Suggested: [HyBench: A New Benchmark for HTAP Databases](https://www.vldb.org/pvldb/vol17/p939-zhang.pdf)\n## Sample-Efficient Tuning Techniques\nHow do we speed up the model to collect fewer samples \u0026 converge faster? Can we modify the internals of MySQL \u0026 Postgres, etc. to terminate query execution early even in the middle of data collection? If a query has been running for half an hour but the data collected is already enough to identify that the data is redundant, can we cut early and give \"infinity\" or \"very bad\" feedback to the model and save costs?\n- Relevant Paper Suggested (Highly): [LlamaTune: Sample-Efficient DBMS Configuration Tuning](https://arxiv.org/pdf/2203.05128)\n\nA student suggests checking out transfer learning for instance optimization, which can cast these workloads into large vectors and apparently works reasonably well for workload mapping. Andy says hardware and instance differences are an issue, but the paper is for a fixed instance anyway? But Dana does say that they do not use the average difference of metrics method used for workload characterization in the paper in the production model. We do need more data, whether from the `EXPLAIN` plans or similar, to improve workload characterization. \n## Stopping / Starting Criteria\n![pasted-image-20240909203024](/images/pasted-image-20240909203024.webp)\n- [OtterTune: AI-Powered Database Optimization as a Service! (Dana Van Aken)](https://www.youtube.com/watch?v=glBoRyShD7E)\n## Application Development Integration\nA suggestion is to integrate with CI/CD, GitHub, etc. to identify PR changes to indexes etc. 
and catch these errors early on in the pipeline, way before they make it all the way to production.\n## Extending Beyond DBMS Config Optimization\nMaking instance-size optimizations \u0026 even optimizing the RDBMS software itself would be great. But lack of data is a very hard problem. ", - "lastmodified": "2026-01-11T05:06:36.503277552+05:30", + "lastmodified": "2026-01-11T10:59:48.596353993+05:30", "tags": [ "domain-cs-ai-ml-machine-learning", "domain-cs-systems-databases" ] }, @@ -471,7 +471,7 @@ "/blog/vectorization-analyzing-loop-dependencies": { "title": "Vectorization \u0026 Analyzing Loop Dependencies", "content": "# Vectorization\n\nThe basic idea behind vectorization is just SIMD from [Flynn's Taxonomy](/blog/flynn-s-taxonomy). It allows us to perform the same instruction on multiple data elements in parallel. This is achieved in hardware because of the existence of extended register files on the CPU. The CPU is modified to contain registers which can be anywhere from 128-512 bits or even larger (GPUs). These 512 bit registers can load 512 bits of data in one instruction and perform $\\frac{512}{32} = 16$ packed single-precision floating point additions in one instruction.\n\n![simd-1](/images/simd-1.webp)\n\nA comprehensive list of all the SIMD instructions can be found [here](https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html).\n\n## Memory alignment\n\nWhen memory is aligned to 32/64 bit word boundaries, we can expect our effective memory bandwidth to be higher. When memory is fetched into the CPU cache, it is fetched in cache-line-sized chunks, which are usually 64 bytes long. When data is not aligned to some nice power of 2, there will be multiple scenarios where our data element crosses a cache-line boundary and hence forces the CPU to make 2 fetches, as one cache line does not contain the entire data element.\n\n`SSE` / `AVX` loads expect 16/32 byte alignment of memory for the normal `load` / `store` instructions. You’d have to use the special unaligned variants (`loadu` / `storeu`) to access unaligned memory. These might be slower.\n\n## Loop unrolling\n\nLoop unrolling can be done automatically by the compiler and is also often used by programmers to speed up `for` loops. For example, consider the following code\n\n```c\nfor(int i=0; i\u003cn; i++)\n\t\tc[i] = a[i] + b[i];\n```\n\nThis code can be unrolled to\n\n```c\nfor(int i=0; i+4\u003c=n; i+=4){\n\t\tc[i] = a[i] + b[i];\n\t\tc[i+1] = a[i+1] + b[i+1];\n\t\tc[i+2] = a[i+2] + b[i+2];\n\t\tc[i+3] = a[i+3] + b[i+3];\n}\n// a small cleanup loop handles the remaining n % 4 elements\n```\n\nSuddenly this code becomes vectorize-able. Loops can be unrolled by the compiler if it decides it is worth it.\n\n## Advantages\n\nThere are 2 main reasons to want to unroll loops.\n\n1. Lesser conditional statement execution. Notice that in the original (rolled) loop the conditional check `i\u003cn` runs `n` times. Conditional statements are bad as they could introduce a potential branch in the code path. Usually branch predictors are good enough to cut back on major losses but it still helps to reduce the execution of the conditional check to $\\frac{n}{4}$ times.\n2. Facilitate register renaming. The compiler can use its full set of registers to perform `load` and `store` operations out of order to speed up execution of some of these instructions. For example, you can now perform multiple reads to different `src` registers and then perform writes during the `store` of other values. Compilers can recognize such opportunities and employ register renaming here. This would not be possible in the rolled (non-unrolled) loop code.\n\n## Disadvantages\n\n1. 
The primary disadvantage is the increase in code size. Increased code size could potentially increase the number of instructions the CPU has to load into its instruction cache and hence lead to some slowdown. Another obvious disadvantage is that the code looks terrible.\n\n# Pointer Aliasing\n\nSometimes, given code like this\n\n```c\nvoid element_sum(int *a, int *b, int N){\n\t\tfor(int i=0; i\u003cN; i++) a[i] = a[i] + b[i];\n}\n```\n\nThe compiler might not be able to parallelize it due to potentially aliased pointers. Essentially, the compiler has no guarantee that pointers `a` and `b` point to two independent arrays. The vectorized code would give incorrect results if `b` pointed to some element of `a` itself. Hence it will not explicitly auto vectorize the code unless given some form of guarantee, usually via `restrict`-qualified pointers (e.g. `int *restrict a`) or pragmas like `#pragma GCC ivdep`.\n\n# Dependence Analysis\n\n1. True (flow) dependence (RAW → Read after Write)\n \n ```c\n 1: a = 1;\n 2: b = a;\n ```\n \n2. Anti dependence (WAR → Write after Read)\n \n ```c\n 1: b = a;\n 2: a = 2+b;\n ```\n \n3. Output dependence (WAW → Write after Write)\n \n ```c\n 1: a = 1;\n 2: a = 2+b;\n ```\n \n\nIf $S_j$ is dependent on $S_i$, we write $S_i \\ \\delta \\ S_j$. Sometimes we also indicate the type of dependence using superscripts: $\\delta^f, \\ \\delta^a, \\ \\delta^o$.\n\n## Loop carried dependence\n\n```c\nfor i = 1 to 3:\n\t\tx[i] = y[i] + 1\n\t\tx[i] = x[i] + x[i-1]\n```\n\nThis update requires `x[i]` to know the value of `x[i-1]` beforehand. This is a loop carried dependence and cannot be parallelized easily.\n\n**Key points:**\n- True (flow), anti, and output dependences are identified based on the order of read and write operations.\n- Loop-carried dependences arise when an iteration depends on the result of a previous iteration, hindering parallelization.\n- Dependence analysis is crucial for identifying parallelization opportunities and potential data hazards.\n\n### An Algorithm to Test for Dependence\n\nTake the read and write sets of 2 iterations and check for dependence. If there is an intersection between $R_{s1}$ and $W_{s2}$ or $W_{s1}$ and $R_{s2}$ then there is a dependence.\n\nExample,\n\n```c\nfor(int i=16; i\u003cn; i++)\n\t\ta[i] += a[i-16];\n```\n\nFor iterations $s_1$ (i = 16) and $s_{16}$ (i = 32):\n\n- $R_{s1} = \\{ a[16], \\ a[0]\\}$\n- $W_{s1} = \\{ a[16]\\}$\n- $R_{s16} = \\{ a[32], \\ a[16]\\}$\n- $W_{s16} = \\{ a[32]\\}$\n\nSince $W_{s1}$ and $R_{s16}$ intersect at $a[16]$, there is a data dependence.\n\n**The key steps are:**\n1. Identify read and write sets for consecutive iterations.\n2. Check for intersections between the read set of one iteration and the write set of the other iteration, or vice versa.\n3. If an intersection exists, there is a data dependence.\n\n## Loop Un-Switching\n\n```c\nfor(int i=0; i\u003cn; i++)\n\t\tfor(int j=0; j\u003cm; j++)\n\t\t\t\tif(x[i] \u003e 0) // S++\n\t\t\t\telse // T++\n```\n\nCode like this can be converted to\n\n```c\nfor(int i=0; i\u003cn; i++)\n\t\tif(x[i] \u003e 0)\n\t\t\t\tfor(int j=0; j\u003cm; j++) // S++\n\t\telse\n\t\t\t\tfor(int j=0; j\u003cm; j++) // T++\n```\n\nThe idea is that we eliminate repeated conditional branch checks from inside the `for` loop. 
The execution of the branch statement is reduced from $n \\times m$ times to $n$ times, while the inner loop is duplicated in code (the number of iterations actually executed stays $n \\times m$). But usually the internal loop can now employ vectorization thanks to the removal of the conditional statement, essentially reducing the executed loop iterations to $\\frac{n \\times m}{vec \\ len}$.\n\nThis gives the added speedup from vectorization + the speedup from fewer executions of conditional branch statements.\n\n## Index Set-Splitting\n\nThe concept of index set-splitting can be better illustrated with a practical example. Let's consider a scenario where you have an array of integers, and you want to perform a specific operation on the even-indexed elements and a different operation on the odd-indexed elements. This situation often arises in image processing or signal processing algorithms. Suppose we have the following code:\n```cpp\nint arr[] = {1, 2, 3, 4, 5, 6, 7, 8};\nint n = sizeof(arr) / sizeof(int);\n\nfor (int i = 0; i \u003c n; i++) {\n if (i % 2 == 0) arr[i] *= 2;\n else arr[i] /= 2;\n}\n```\n\nThis code flow includes conditionals within the loop iteration which makes it difficult to vectorize. But compilers can use index set-splitting here to split the loop into two separate loops, one for even-indexed elements and another for odd-indexed elements, as follows:\n\n```cpp\nint arr[] = {1, 2, 3, 4, 5, 6, 7, 8};\nint n = sizeof(arr) / sizeof(int);\n\nfor (int i = 0; i \u003c n; i += 2) arr[i] *= 2;\nfor (int i = 1; i \u003c n; i += 2) arr[i] /= 2;\n```\n\nBy separating the loops based on the index set, we eliminate the need for conditional statements, which can improve vectorization and overall performance.\n\n# Polyhedral Compilation\n\nConsider the following nested loop:\n```cpp\nfor (int i = 1; i \u003c 3; i++)\n for (int j = 1; j \u003c 3; j++)\n x[i][j] = x[i][j - 1] + x[i - 1][j - 1];\n```\n\nThis loop exhibits loop-carried dependences due to the use of `x[i][j - 1]` and `x[i - 1][j - 1]`, which depend on values from previous iterations. Polyhedral compilation is a technique for analyzing loop dependencies and transforming loops to enable parallelization. The key idea is to represent the iteration space of a loop as a polyhedron and perform transformations on this polyhedron to eliminate dependencies. \n\n![polyhedral-1](/images/polyhedral-1.webp)\n\nThe steps involved in polyhedral compilation are:\n\n1. **Modeling:** Map the loop iteration space onto a polyhedron, representing each iteration as a point in the polyhedron.\n2. **Dependence Analysis:** Analyze the dependencies between iterations by examining the polyhedron.\n3. **Transformation:** Apply transformations to the polyhedron, such as skewing, tiling, or permutation, to eliminate dependencies and expose parallelism.\n4. **Code Generation:** Generate optimized loop code based on the transformed polyhedron.\n\nPolyhedral compilation provides a systematic way to analyze and transform loop nests, teaching compilers to automatically identify and exploit parallelism in complex loop nests by leveraging the power of polyhedral representations and transformations.\n\n# References\nThese notes are quite old, and I wasn't rigorously collecting references back then. 
If any of the content used above belongs to you or someone you know, please let me know, and I'll attribute it accordingly.", - "lastmodified": "2026-01-11T05:06:36.767065496+05:30", + "lastmodified": "2026-01-11T10:59:48.85416215+05:30", "tags": [ "domain-cs-systems-high-performance-computing" ] }, @@ -479,15 +479,15 @@ "/blog/what-goes-around-comes-around-the-history-of-database-systems-part-1-1960-2000": { "title": "\"What Goes Around Comes Around\" - The History of Database Systems - Part 1 (1960 - 2000)", "content": "\n# Abstract\nThis will be my first blog post / set of notes taken for a paper I've read. The paper, titled [What Goes Around Comes Around](https://people.cs.umass.edu/~yanlei/courses/CS691LL-f06/papers/SH05.pdf), was written by [Michael Stonebraker](https://scholar.google.com/citations?user=nXYv4nEAAAAJ\u0026hl=en) (Turing Award Winner and the man behind the POSTGRES (and INGRES) database(s)) and [Joseph Hellerstein](https://scholar.google.com/citations?user=uFJi3IUAAAAJ\u0026hl=en) (Has a casual h-index of 105). Usually, the abstract of the paper provides an excellent summary, and this one is no exception.\n\n\u003e**Abstract** \n\u003e*This paper provides a summary of 35 years of data model proposals, grouped into 9 different eras. We discuss the proposals of each era, and show that there are only a few basic data modeling ideas, and most have been around a long time. Later proposals inevitably bear a strong resemblance to certain earlier proposals. Hence, it is a worthwhile exercise to study previous proposals.* \n\u003e\n\u003e*In addition, we present the lessons learned from the exploration of the proposals in each era. Most current researchers were not around for many of the previous eras, and have limited (if any) understanding of what was previously learned. There is an old adage that he who does not understand history is condemned to repeat it. By presenting “ancient history”, we hope to allow future researchers to avoid replaying history.* \n\u003e\n\u003e*Unfortunately, the main proposal in the current XML era bears a striking resemblance to the CODASYL proposal from the early 1970’s, which failed because of its complexity. Hence, the current era is replaying history, and “what goes around comes around”. Hopefully the next era will be smarter.*\n\nThis paper primarily analyzes the evolution of various data models and their rise to popularity or extinction, contingent on their features and commercial decisions by big market players. I will also sprinkle in a bit of history about the evolution of databases that I learnt from [01 - History of Databases (CMU Advanced Databases / Spring 2023)](https://www.youtube.com/@CMUDatabaseGroup) and [Postgres pioneer Michael Stonebraker promises to upend the database once more - The Register](https://www.theregister.com/2023/12/26/michael_stonebraker_feature/) as well.\n\n# Useful Concepts\nBelow we'll define a few concepts that'll be useful to know about when reading about the evolution of database systems.\n## Physical Data Independence\nWe can define physical data independence as the ability to change the core algorithms and / or access patterns related to how data is accessed and stored at the physical (disk) level without affecting the logical layer (application level code) written on top of it. \n\nIn short, changing the DBMS's underlying core data structure from a [B+ Tree](/blog/b-tree) to a [Hashtable](/blog/hashtable) should not require rewriting any part(s) of the applications written on top of this DBMS. 
The APIs provided by the DBMS must ensure this independence. \n\nThis allows the DBMS software to optimize its performance by altering storage formats, using new hardware features, etc. while also providing guarantees to application developers for a stable \u0026 consistent interface to the DBMS.\n## Logical Data Independence\nThis is harder to define clearly, but in short, changes to the logical table definition (such as schema, relations, attributes, etc.) should not require a rewrite of application-level code. For example, if I recently started logging information about whether or not a patient had been vaccinated for COVID-19 in a Hospital's DBMS, I would not want the application software that did not need this extra attribute to break. \n## The Supplier / Parts Table\nThis isn't a concept, but we'll be using the standard supplier-parts table or the employer-employee table for giving examples in the future, so it's useful to know the structure beforehand. \n\nLet's suppose you're NASA, and you need a bunch of parts for your new Space Mission. There are also a set of suppliers who provide some subset of these parts, in various batch order sizes for varying prices. You want to build a system that allows you to query this data to figure out useful information such as:\n- Which supplier(s) supply part $x$?\n- Which part(s) are supplied by supplier $x$?\n- What all parts of type $y$ are supplied by supplier $x$?\n- etc.\n\n# The (Real) Eras Tour\n Stonebraker and Hellerstein roughly summarize the period from the 1960s to the 2000s into 10 distinct (but not disjoint) eras. Each era has its own intriguing idea and starts a debate of old model vs new model between people in the opposing camps. Ultimately, ease of use and commercial requirement + adoption is the primary driver of success for these models. \n \n Over the years, SQL and the relational model have come out as the juggernaut in the space. Every decade, someone invents a challenger or replacement to SQL, which then proceeds to fail and/or have its key ideas absorbed into the standard. Some of these will be discussed in [What Goes Around Comes Around... and Around... - The History of Database Systems - Part 2 - (2000 - 2020)](/blog/what-goes-around-comes-around-and-around-the-history-of-database-systems-part-2-2000-2020). \n \n It is also useful to know that many of the older models mentioned below (even IMS) are still in use today, but almost every instance of such a database is used in legacy code. ATMs for example still use IMS because they don't have a reason to migrate their legacy code; however, no startup starting off today would ever choose to use IMS. \n \n## IMS Era (Late 1960s - 1970s)\n### Integrated Data Store (IDS)\nBefore we get to IMS, there was IDS. This is perhaps the earliest known instance of a \"DBMS\" product. It was designed at the computer division of General Electric by [Charles William Bachman](https://amturing.acm.org/award_winners/bachman_9385610.cfm), who received the ACM Turing Award in 1973 for his work on DBMS. \n#### Motivation\nIn the 1950s, there was a huge rush to buy computers and use them to automate work. However, getting computers to do useful tasks turned out to be a lot harder than expected. Companies mostly used them for narrow clerical tasks and needed more from computers to justify their cost. \n\n\u003e *Various firms tried to build such “totally integrated management information systems”, but the hardware and software of the era made that difficult. 
Each business process ran separately, with its own data files stored on magnetic tape. A small change to one program might mean rewriting related programs across the company. But business needs change constantly, so integration never got very far.* \n\u003e - [Charles William Bachman - ACM Page](https://amturing.acm.org/award_winners/bachman_9385610.cfm)\n\n#### The Birth of IDS\nWhen working at GE, partially thanks to the invention of the disk drive, his department managed to successfully build a management system called the *Manufacturing Information and Control System (MIACS)*. This then grew to become IDS. IDS provided application programmers with a powerful API to manipulate data, an early expression of what would soon be called a Data Manipulation Language (DML).\n\nIDS managed \"records\" on disk and provided programmers with an abstraction over the physical data layer. This provided programmers with **physical data independence**: they need not rewrite all their application logic if a minor change was made to how the disk accesses were made. He managed to squeeze MIACS and IDS into just 40 KB of memory. \n#### Honeywell\nGE built this technology for a timber company in Seattle, and then later ended up spinning out the custom solution as a standalone software product. GE was around the third best computer seller in the market, which wasn't good enough for them. So they sold their computing division to Honeywell, who continued to sell the product for a while. \n#### Characteristics of IDS\n- **Tuple-at-a-time queries:** This essentially means that when we define operations or queries, IDS would use `for` loops to iterate one tuple at a time and do computations.\n- **Network Data Model:** More on this and Bachman when we talk about CODASYL. \n### Information Management System (IMS)\nIMS was a DBMS product released by IBM around 1968. It was one of the earliest DBMS systems to introduce the notion of a **record type**, that is, a collection of named fields with their associated data types. It also introduced the notion of **enforcing constraints**. Every record instance had to satisfy the record type description. In simpler terms, it was perhaps the first DBMS to introduce the idea of **schema**. \n#### Hierarchical Data Model\nIMS was also the first DBMS to use a *hierarchical data model*. Every record type, except the root, had a single parent record type. In other words, you had to define the record types such that they formed a directed tree. This is how we'd have to represent our supplier-parts table using this model:\n\n![pasted-image-20240819045449](/images/pasted-image-20240819045449.webp)\n- [What Goes Around Comes Around](https://people.cs.umass.edu/~yanlei/courses/CS691LL-f06/papers/SH05.pdf)\n\nEither `Supplier` as the parent to `Part` or vice-versa. While it was an interesting take, it had some fundamental issues:\n\n1. **Difficult to Eliminate Redundant Information**: In the first schema, for every supplier who supplies a part, we'd have to repeat part information. Essentially, if we had 2 vendors selling the same product for different prices, we'd need to have multiple records with the same `pname`. Now if we wanted to change the name of the part, we'd need to update **every** duplicated `pname` field. 
This can lead to inconsistency issues when updates fail midway, etc.\n\t![pasted-image-20240819050043](/images/pasted-image-20240819050043.webp)\n\t- [CMU SCS 15-721 (Spring 2023) :: History of Databases](https://15721.courses.cs.cmu.edu/spring2023/slides/01-history.pdf)\n2. **Corner Case Issues**: In a strict hierarchy, we cannot have a supplier who does not supply anything and vice versa since the `parts` record is a part of a `Supplier` record.\n3. **Record-at-a-time**: IMS ordered records by a **hierarchical sequence key (HSK)**. In short, it's basically records ordered in DFS order of the record-type tree. You could use its DML language (DL/1) to fetch the next record or fetch the next record within parent. You could do interesting tree / sub-tree type traversals but it was still record-at-a-time. Optimization of queries was completely left to the programmer. \n4. **Lack of Physical Data Independence:** IMS supported multiple physical data storage formats.\n\t- Sequential storage\n\t- Indexed B-Tree\n\t- Hash table\n\tHowever, if you switched between formats because you needed support for range queries or for faster lookups, the API exposed to the application programmer was also different. \n5. **Limited Logical Independence**: Because DL/1 was not defined on the physical data layer, IMS supported limited logical independence. If we modified record types, they'd essentially be some subtrees in the logical database record tree. A DL/1 program can just use the logical database definition it was originally written for by allowing the logical database to exclude the subtrees that contain redefined record types. \n6. **Response to fix data redundancy failed**: The response to fixing the redundancy issue was to allow for the following:\n\tPhysical storage:\n\t![pasted-image-20240819052120](/images/pasted-image-20240819052120.webp)\n\tLogical storage:\n\t![pasted-image-20240819052143](/images/pasted-image-20240819052143.webp)\n\tIt allowed one to \"graft\" two tree-structured physical databases into a logical database (with many restrictions). Essentially, since `Part` and `Supply` are two *separate* physical tables themselves, there is no repeated information. However, computing the logical grafted block likely would involve joining the two tables for queries on the `Supplier` logical database. \n\tThis introduced a lot of undesirable computational and design complexity.\n\t\n### Lessons From the Paper\n- **Lesson 1**: Physical and logical data independence are highly desirable \n- **Lesson 2**: Tree structured data models are very restrictive \n- **Lesson 3**: It is a challenge to provide sophisticated logical reorganizations of tree structured data\n- **Lesson 4**: A record-at-a-time user interface forces the programmer to do manual query optimization, and this is often hard.\n\n## CODASYL Era (1970s)\nRemember Charles Bachman? The man behind IDS? He didn't stop there. COBOL programmers proposed a standard for how programs would access a database. They were essentially trying to build a standard for DBMS, and Bachman led the work on CODASYL. \n\nThe natural next thought from IMS was to move from a tree-like structure to a general graph network structure. Here's how we'd represent the Supplier-Parts table in CODASYL.\n\n![pasted-image-20240819055108](/images/pasted-image-20240819055108.webp)\n\nNotice that in this DAG, the directed edges have names. In CODASYL, these directed edges represent **sets**. 
It indicates that for each record instance of the owner record type (the tail of the arrow) there is a relationship with zero or more record instances of the child record type (the head of the arrow). It represents 1-n relationships between owner and children. \n\nThis solved some of the issues from the hierarchical model; for example, we can have suppliers who don't supply any parts (empty set). However, the fact that you had to maintain sets of \"relation\" info also implied that there existed lots of different ways to implement certain things. There was **no logical or physical independence.** It is also a record-at-a-time language. \n\nConsider this example of pseudocode for a program that is tasked with finding all \"red\" parts supplied by supplier $x$. \n```\nFind Supplier (SNO = x) \nUntil no-more { \n\tFind next Supply record in Supplies \n\tFind owner Part record in Supplied_by \n\tGet current record (check for red) \n}\n```\nFor each record, you had to possibly traverse multiple sets of information to obtain what you actually wanted. Several implementations of sets were proposed that entailed various combinations of pointers between the parent records and child records. \n\nCODASYL, with its graph model that allowed for more expressive relations, solved many of the issues that IMS faced. However, it still lacked physical and logical independence, and the added complexity was simply too much, both for the developer implementing the database internals **and** for the developer programming an application layer on top of CODASYL.\n\n\u003e *In IMS a programmer navigates in a hierarchical space, while a CODASYL programmer navigates in a multi-dimensional hyperspace. In IMS the programmer must only worry about his current position in the data base, and the position of a single ancestor (if he is doing a “get next within parent”). In contrast, a CODASYL programmer must keep track of the:* \n\u003e *- The last record touched by the application* \n\u003e *- The last record of each record type touched* \n\u003e *- The last record of each set type touched*\n\u003e \n\u003e *In his 1973 Turing Award lecture, Charlie Bachmann called this “navigating in hyperspace”*\n\u003e \n\u003e *In addition, a CODASYL load program tended to be complex because large numbers of records had to be assembled into sets, and this usually entailed many disk seeks. As such, it was usually important to think carefully about the load algorithm to optimize performance. Hence, there was no general purpose CODASYL load utility, and each installation had to write its own*\n\u003e \n\u003e - [\"What Goes Around Comes Around\"](https://people.cs.umass.edu/~yanlei/courses/CS691LL-f06/papers/SH05.pdf)\n### Lessons From the Paper\n- **Lesson 5**: Networks are more flexible than hierarchies but more complex \n- **Lesson 6**: Loading and recovering networks is more complex than hierarchies\n## Relational Era (1970s and Early 1980s)\nRoughly around the same time, [Edgar F. (\"Ted\") Codd](https://amturing.acm.org/award_winners/codd_1000892.cfm) was working on his relational model. Codd was originally a mathematician who was motivated by the fact that IMS programmers spent a lot of their time working on maintenance due to IMS neither supporting logical nor physical data independence.\n\n\u003e *It should be noted, incidentally, that the relational model was in fact the very first abstract database model to be defined. 
Thus, Codd not only invented the relational model in particular, he actually invented the data model concept in general.*\n\u003e - [Edgar F. (\"Ted\") Codd - ACM Page](https://amturing.acm.org/award_winners/codd_1000892.cfm)\n\nHis proposal was threefold:\n1. Store the data in simple data structures (tables)\n\t\tIt was difficult for previous databases to provide logical independence due to their use of complex data structures such as B-trees, hierarchical models, etc.\n2. Access it through a high level **set-at-a-time** DML\n\t\tUsing a high enough level language, it was possible to provide a high degree of physical data independence where you don't need to specify a fixed storage proposal. Think of the modern-day SQL queries you are used to vs the \"getNext\" and similar methods programmers had to use back in the day. Also, set-at-a-time would allow programmers to reduce the number of lines of code per query significantly while also opening the door to a slew of database query optimizations. \n3. There was no necessity for a physical storage proposal\n\t\tFor a database that supports complete logical and physical data independence, it was not necessary to specify a physical storage format. DBMS users should be able to specify the data structure that best suits their queries and this should be hot-swappable with no rewrite at the application layer.\n\n\u003e*Ted Codd was also interested in the possibility of extending his relational ideas to support complex data analysis, coining the term OLAP (On-Line Analytical Processing) as a convenient label for such activities. At the time of his death, he was investigating the possibility of applying his ideas to the problem of general business automation.*\n\u003e- [Edgar F. (\"Ted\") Codd - ACM Page](https://amturing.acm.org/award_winners/codd_1000892.cfm)\n\nMoreover, the relational model has the added advantage that it is flexible enough to represent almost anything. It fixed all the issues IMS had in representing complex relationships using the hierarchical model while also providing logical and physical data independence, something CODASYL could not. This immediately sparked off a huge debate between CODASYL supporters and Relational Model supporters.\n### Issues with CODASYL\n1. CODASYL was extremely complex to work with, for software developers working on both the database code and the application code.\n2. No logical or physical data independence meant a lot of time and money was spent on labour rewriting codebases.\n3. Record-at-a-time programming was too difficult to optimize and had to be done by each application that interacted with a DBMS. I find this extremely similar to the \"compiler cannot optimize as well as a human\" debate from the programming languages world. \n4. CODASYL was not flexible enough to represent certain relationships.\n### Issues with the Relational Model\n1. It was complicated, extremely rigorous, formal and difficult to understand for your average programmer (Ted Codd was a mathematician).\n2. It is extremely difficult to implement the relational model efficiently due to its lack of advanced data structure usage. (This would mostly be solved later due to advances in the field of Query Optimization. Pretty much \"compilers\" that beat all but the world's best at query plan generation.)\n\n\u003e*A debate between the two and their supporters, held at an ACM workshop (SIGMOD) in 1974, is remembered among database researchers as a milestone in the development of their field. 
Bachman stood for engineering pragmatism and proven high performance technology, while Codd personified scientific rigor and elegant but unproven theory. Their debate was inconclusive, which was perhaps inevitable given that no practical relational systems had yet been produced.*\n\u003e- [Charles William Bachman - ACM Page](https://amturing.acm.org/award_winners/bachman_9385610.cfm)\n\nOver a period of time, once System R and INGRES had proved that efficient implementations of the Relational Model were possible, the relational advocates also agreed that Codd's mathematical language was too complicated and changed their proposed languages to SQL and QUEL. Meanwhile, on the CODASYL side, LSL was a language which allowed set-at-a-time querying for network databases, offering physical data independence. They also showed that it was possible to clean up the complexities of the network data model somewhat. (TODO, don't know how.)\n\nThis debate eventually led to the commercial war between CODASYL and Relational Systems, which would decide which specification lived and which would die.\n### The Commercial War for CODASYL vs Relational Model\nVAXes (minicomputers implementing the idea of a virtual memory space) were a market dominated by relational databases. Virtual memory made implementing relational ideas easier, and these systems were also very fast. Further, CODASYL engines were written in assembler, which made migration to the VAX hard. In contrast, the mainframe market was still dominated by IMS and other non-relational database systems.\n\nHowever, this changed abruptly in 1984. IBM, who controlled most of the mainframe market share, introduced DB2, a relational DB which was comparatively easier to use and was the \"new tech.\" This signaled that IBM was serious about RDBMS and backing it, which eventually made it win the war. They effectively declared that SQL was the de-facto query language. \n\n\u003e Interestingly, there was a standards committee set up to decide the standard language that would be used to query RDBMS. At this time, the two main competitors were QUEL and SQL. QUEL, backed by Stonebraker, had a lot of nicer semantics compared to SQL (for example, you could use `from` right after the `select` instead of at the end of your query). However, Stonebraker refused to attend the conference due to his dislike of standards committees and such (average academician :p), which led to SQL becoming the standard. - Andy Pavlo in a CMU lecture\n\n\u003eInteresting fact number 2: IBM tried to build a relational frontend (a transpiler-like interface) on top of IMS, to provide a more elegant migration path to RDBMS. But the complexity \u0026 logical data dependence of DL/1 made it very difficult to implement. IBM had to abandon it and adopt a dual-database strategy, which consequently also made them declare a clear winner for the debate.\n### Lessons From the Paper\n- **Lesson 7**: Set-at-a-time languages are good, regardless of the data model, since they offer much improved physical data independence. \n- **Lesson 8**: Logical data independence is easier with a simple data model than with a complex one. 
\n- **Lesson 9**: Technical debates are usually settled by the elephants of the marketplace, and often for reasons that have little to do with the technology.\n- **Lesson 10**: Query optimizers can beat all but the best record-at-a-time DBMS application programmers.\n## Entity-Relationship Era (1970s)\nIn the 1970s, [Peter Chen](https://en.wikipedia.org/wiki/Peter_Chen) came up with the Entity-Relationship model as an alternative to all the other models. The model he proposed can be described as follows:\n### The Model\n- **Entities**: Loosely speaking, objects that have an 'existence', independent of any other entities in the database. Examples: `Supplier`, `Part`.\n- **Attributes**: Data elements that describe entities. For `Part`, attributes would include `pno`, `pname`, `psize`, `pcolor`, etc.\n- **Keys**: A designated set of attributes that together identify an entity uniquely.\n- **Relationships**: Connections between entities. Example: `Supply` connects `Part` and `Supplier`. Similar to the CODASYL model. They can be of multiple types:\n - **1-to-1**: One entity relates to one other.\n - **1-to-n**: One entity relates to multiple others.\n - **n-to-1**: Multiple entities relate to one.\n - **m-to-n**: Multiple entities relate to multiple others (e.g., `Supply` is m-to-n because suppliers can supply multiple parts, and parts can be supplied by multiple suppliers).\n- **Relationship Attributes**: Properties describing the relationship. Example: `qty` and `price` in the `Supply` relationship.\n### Failure In Acceptance as DBMS Data Model\nThe model did not gain acceptance as a DBMS data model for several reasons, as speculated by the authors: it lacked an accompanying query language, it was overshadowed by the more popular relational model, and it resembled an updated version of the CODASYL model, which may have contributed to its lack of distinction.\n### Success In Database Schema Design\nThe ER model ended up helping solve the problem of finding the \"initial tables\" on which to apply normalization in the relational model. Database Administrators (DBAs) used to struggle with coming up with good database schema designs. The Entity-Relationship model, with its notion of \"Entities\", made it a lot simpler for DBAs to model initial tables on paper quickly and get schemas for tables to use in the relational model. It was easy to convert E-R models to the 3rd Normal Form. ([Normalization Theory in DBMS](/blog/normalization-theory-in-dbms))\n### Lessons From the Paper\n- **Lesson 11**: Functional dependencies are too difficult for mere mortals to understand. Another reason for KISS (Keep it simple stupid).\n## R++ Era (1980s)\nThe authors use the term \"R++\" for an era where most of the research involved showing example programs which performed poorly or were difficult to implement on an RDBMS, and adding features to the Relational Model to improve / fix it.\n\nLots of application-specific additions were proposed to the relational model. These were some of the constructs identified as most useful:\n1. **set-valued attributes**: In a `Parts` table, it is often the case that there is an attribute, such as `available_colors`, which can take on a set of values. It would be nice to add a data type to the relational model to deal with sets of values.\n2. **aggregation (tuple-reference as a data type)**: In the RM model for the Supply relation, we had two foreign keys `sno` and `pno` which point to tuples in other tables. 
Instead of this, we can just have pointers to these tuples. This “cascaded dot” notation allowed one to query the Supply table and then effectively reference tuples in other tables. It allowed one to traverse between tables without having to specify an explicit join.\n3. **inheritance**: Gem implemented a variation of the inheritance you find in OOP languages in the DBMS context. Inherited types inherited all the attributes of their parent. However, the problem was that while this allowed easier query formulation than in a conventional relational model, there was *very little* performance improvement, especially since you could simulate it in an RM model by substituting a tuple for a data type.\n\nMost commercial vendors were focusing on improving transaction performance and scalability. Since R++ ideas offered little improvement and not much revenue potential, there was little technology transfer of R++ ideas from academia into the commercial world.\n### Lessons From the Paper\n- **Lesson 12**: Unless there is a big performance or functionality advantage, new constructs will go nowhere.\n## Semantic Data Model Era (Late 1970s and 1980s)\nThis was pretty much an effort at quite literally bringing Object Oriented Programming (OOP) ideas to the DBMS world (please no!). They suggested that the relational model was \"semantically impoverished\" and wanted to allow for expressing classes and such. These efforts were usually called \"Semantic Data Models.\"\n\nThey expanded on aggregation, inheritance and set-valued attributes from the R++ era, allowing classes to extend aggregation to refer to an entire set of record instances in some other class, and allowing inverses of these attributes to be defined. They also wanted generalized inheritance graphs, extending the earlier idea of inheritance; this was basically just multiple inheritance. Lastly, classes could have class variables; for example, the Ships class could have a class variable holding the number of members of the class. \n\nHowever... These models were *extremely* complex and did not offer any significant value over RDBMS. SQL had also gained popularity as the intergalactic standard for database querying, which made it very difficult to displace its position in the market. As with the R++ proposals, the reward-to-cost ratio was not high enough to justify them. And in the case of SDMs, they were also far too complex. Throwback to the CODASYL / IMS era :)\n### Lessons From the Paper\n- None :( \n- On a more serious note, the exact same lesson that was gained from the R++ era.\n## Object-Oriented Era (Late 1980s and Early 1990s)\nDuring this period there was a new wave of interest in \"Object Oriented DBMSs\" (god no why?!). Advocates pointed to an \"impedance mismatch\" between RDBMS and OO languages like C++. Since RDBMSs had their own type systems, naming conventions, and return formats, conversion layers were needed to translate code between the two interfaces. They claimed that it would be nicer if DBMS operations were done via built-in language constructs, with persistent variables that could point to locations either in memory or on disk (to aid with implementing / interfacing with DBMS operations). \n\nWhile a 'persistent programming language' would allow for much cleaner constructs than a SQL embedding, each programming language would have to be extended with DBMS-oriented functionality. 
However,\n\n\u003e*Unfortunately, programming language experts have consistently refused to focus on I/O in general and DBMS functionality in particular. Hence, all programming languages that we are aware of have no built-in functionality in this area. Not only does this make embedding data sublanguages tedious, but also the result is usually difficult to program and error prone. Lastly, language expertise does not get applied to important special purpose data-oriented languages, such as report writers and so-called fourth generation languages.* \n\u003e- [\"What Goes Around Comes Around\"](https://people.cs.umass.edu/~yanlei/courses/CS691LL-f06/papers/SH05.pdf)\n\nIn the 1980s there was another surge in implementing a persistent version of C++ with its own runtime (PLEASE WHY?!). These vendors mainly focused on niche markets such as engineering CAD applications. However, they did not see much commercial success. The authors of the paper list a few possible reasons:\n\n\u003e1. ***Absence of leverage**: The OODB vendors presented the customer with the opportunity to avoid writing a load program and an unload program. This is not a major service, and customers were not willing to pay big money for this feature.* \n\u003e*2. **No standards**: All of the OODB vendor offerings were incompatible.* \n\u003e*3. **Relink the world**: If anything changed, for example a C++ method that operated on persistent data, then all programs which used this method had to be relinked. This was a noticeable management problem.* \n\u003e*4. **No programming language Esperanto**: If your enterprise had a single application not written in C++ that needed to access persistent data, then you could not use one of the OODB products.*\n\u003e- [\"What Goes Around Comes Around\"](https://people.cs.umass.edu/~yanlei/courses/CS691LL-f06/papers/SH05.pdf)\n\nThere was an outlier company called O2 which had a high level declarative language called OQL embedded into a programming language and also focused on business data processing. \n\n\u003e *There is a saying that “as goes the United States goes the rest of the world”. This means that new products must make it in North America, and that the rest of the world watches the US for market acceptance.*\n\u003e - [\"What Goes Around Comes Around\"](https://people.cs.umass.edu/~yanlei/courses/CS691LL-f06/papers/SH05.pdf)\n\nUnfortunately for O2, they were a French company and entered the US market too late. \n### Lessons From the Paper\n- **Lesson 13**: Packages will not sell to users unless they are in “major pain”\n- **Lesson 14**: Persistent languages will go nowhere without the support of the programming language community.\n## Object-Relational Era (Late 1980s and Early 1990s)\n### Motivation\nThis model was motivated by INGRES's inability to efficiently solve problems related to two-dimensional search. B+ trees allow efficient search on one dimension, via one index. But when the query involves finding rectangle intersections or points inside a rectangle, it cannot be answered efficiently by traditional B+ tree implementations in a DBMS.\n### Idea\nThe pioneering idea here is to generalize relational systems to allow for user defined types (UDTs), user defined functions (UDFs), user defined operators and user defined access methods. 
This allows more sophisticated users to solve the 2D search problem using data structures optimized for these types of queries, such as Quad trees or R-trees.\n\nIn essence, we want to replace the hard-coded B+ tree logic with a framework that handles the general case well and allows sophisticated users to go beyond it and define custom access methods. Postgres UDTs and UDFs generalized this notion to allow code to be written in a conventional programming language and to be called in the middle of processing conventional SQL queries.\n\nCompared to the previous R++ and SDM eras, instead of providing built-in support for aggregation and generalization, Postgres UDTs and UDFs provide a better framework for allowing users to optimize for their own types and queries.\n### UDFs\nThere is another notion of UDFs in use today. Many DBMS systems call stored procedures UDFs. Instead of making transactions use one round-trip message between DBMS and client per statement, they allowed the client to store \"defined procedures\" on the DBMS which can be called via a single message, eliminating round-trip and context switch time. These UDFs are \"brain dead\" in the sense that they can only be executed with constants for their parameters. They also required the DBMS software to handle errors on the DBMS side, since some procedures might have runtime errors which need to be handled well. \n\n### Postgres\nIn addition, Postgres also implemented less sophisticated notions of inheritance, and type constructors for pointers (references), sets, and arrays. This latter set of features allowed Postgres to become “object-oriented” at the height of the OO craze. Postgres was commercialized by Illustra, which was then acquired by Informix. This gave Postgres access to a fast OLTP engine, and the increased market share helped convince more businesses to adopt Postgres's UDFs / UDTs. It worked well for the GIS market and for large content repositories (used by CNN \u0026 BBC).\n### Conclusion\n- This new model blurs the distinction between data and code by allowing you to put code in the database, and adds user-defined access methods. \n- However, widespread adoption was significantly hindered by the lack of standards, which seems to be a huge requirement for gaining significant adoption by the big tech players in the market.\n### Lessons From the Paper\n- **Lesson 15**: The major benefits of OR are two-fold: putting code in the data base (and thereby blurring the distinction between code and data) and user-defined access methods. \n- **Lesson 16**: Widespread adoption of new technology requires either standards and/or an elephant pushing hard.\n## Semi-Structured Era (Late 1990s and 2000s)\nAround 2004, we saw a push toward databases for semi-structured data, which exemplified two characteristics that we had already seen do poorly in the past:\n### Schema Last\nThe data instances that need to be stored must be self-describing. Without a self-describing format, a record is merely \"a bucket of bits\". To make a record self-describing, one must tag each attribute with metadata that defines the meaning of the attribute. However, by not requiring a schema in advance, we lose a lot of the integral properties we see in today's database systems, like constraint checking and validation. \nConsider the records of the following two people:\n\n```\nPerson: \n\tName: Joe Jones \n\tWages: 14.75 \n\tEmployer: My_accounting \n\tHobbies: skiing, bicycling \n\tWorks for: ref (Fred Smith) \n\tFavorite joke: Why did the chicken cross the road? 
To get to the other side \n\tOffice number: 247 \n\tMajor skill: accountant \nEnd Person\n```\n\n```\nPerson: \n\tName: Smith, Vanessa \n\tWages: 2000 \n\tFavorite coffee: Arabian \n\tPasstimes: sewing, swimming \n\tWorks_for: Between jobs \n\tFavorite restaurant: Panera \n\tNumber of children: 3 \nEnd Person\n```\n\nIn this example, we can see attributes which appear in only one of the two records, attributes which appear under a different name (alias) in the other record, and attributes which appear with varying formats or meanings under the same name. This is an example of **semantic heterogeneity**. Such data is extremely difficult to carry out query processing on, since there is no structure on which to base indexing decisions and query execution strategies. However, there are very few instances where we encounter such semantically heterogeneous data in business practice. \n\nFor applications that deal with rigidly structured data, or rigidly structured data with some text fields, a standard RDBMS is more than capable of handling all business needs. For applications dealing with only text, the schema-last framework does not work, since schema-last requires there to be some self-tagged metadata or \"semi-structure\" in the data it stores, which free text does not have. The problem of dealing with free text data is tackled by people working on Information Retrieval systems.\n\nIt is very difficult to come up with applications which might have to deal with \"semi-structured\" data. The authors cite advertisements and resumes as examples, but even in this field we have seen companies mandate fields for resume entry, which leads to more structured data parsing. In essence, it is better to avoid designing a system that requires \"semi-structured\" data than to use a schema-last system.\n### XML Data Model\nDocument Type Definitions (DTDs) and XML Schema were intended to deal with the structure of formatted documents. They are both essentially document markup languages. DTDs \u0026 XML Schema can, for example, be used to define the schema used by a DBMS table. However, there were attempts to use these models for actual DBMS applications, and these were categorized by the authors of the paper as seriously flawed (and I would agree, since I'm reading this 20 years after the date this paper was published :)). The primary concern cited is the sheer amount of complexity such a model introduces. We have already seen every DBMS model not following KISS (Keep It Simple Stupid) fail miserably. XML then decides to define a model where:\n\n1. Records can be hierarchical (like in IMS)\n2. Records can have \"links\" or references to other records (like in CODASYL \u0026 other network models)\n3. Records can have set-based attributes (like in SDM)\n4. Records can inherit from other records in several ways (like in SDM \u0026 OO)\n\nOn **top** of all this, XML also touted support for other features. One notable feature is **union types**, in the same sense as their C counterpart: an attribute can be one of multiple types. For example, the \"works for\" field in an employee's record could be a department number or the name of an outside firm. (Yes, you can also just give outside firms unique IDs, but for the sake of the example...) However, B-tree indexing of a union attribute requires one index per base type in the union. And if you did joins between union types containing $N$ and $M$ base types, you'd need $N \\times M$ plans to coordinate. 
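\n\nTo make the schema-last pain concrete before moving on to the paper's predictions, here is a toy Python sketch (my own illustration, not from the paper) of what even a trivial query has to do against self-describing records like the two `Person` records earlier. Every alias list and unit guess below is an assumption that has to live in application code:\n```python\n# Toy illustration: querying semantically heterogeneous, schema-last records.\n# Each record is a self-describing bag of attributes; nothing guarantees that\n# two records use the same tag, format, or units for the same concept.\njoe = {\"Name\": \"Joe Jones\", \"Wages\": 14.75, \"Hobbies\": [\"skiing\", \"bicycling\"]}\nvanessa = {\"Name\": \"Smith, Vanessa\", \"Wages\": 2000, \"Passtimes\": [\"sewing\", \"swimming\"]}\n\ndef hobbies(record):\n\t# The query, not the schema, must know every alias for \"hobbies\".\n\tfor alias in (\"Hobbies\", \"Passtimes\", \"Pastimes\"):\n\t\tif alias in record:\n\t\t\treturn record[alias]\n\treturn []\n\ndef monthly_wages(record):\n\t# \"Wages\" might be an hourly rate or a monthly salary; the tag alone cannot\n\t# tell us, so the query guesses from the magnitude (a fragile heuristic).\n\tw = record[\"Wages\"]\n\treturn w * 160 if w \u003c 100 else w\n\nfor person in (joe, vanessa):\n\tprint(person[\"Name\"], monthly_wages(person), hobbies(person))\n```\nWith a schema-first system these decisions are made once, in the schema; here they are re-made (and re-broken) in every query.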
\n\nThis is how the paper states its predictions.\n\n\u003e *Obviously, XMLSchema is far and away the most complex data model ever proposed. It is clearly at the other extreme from the relational model on the “keep it simple stupid” (KISS) scale. It is hard to imagine something this complex being used as a model for structured data. We can see three scenarios off into the future.* \n\u003e \n\u003e *Scenario 1: XMLSchema will fail because of excessive complexity.*\n\u003e \n\u003e *Scenario 2: A “data-oriented” subset of XMLSchema will be proposed that is vastly simpler.* \n\u003e \n\u003e *Scenario 3: XMLSchema will become popular. Within a decade all of the problems with IMS and CODASYL that motivated Codd to invent the relational model will resurface. At that time some enterprising researcher, call him $Y$, will “dust off” Codd’s original paper, and there will be a replay of the “Great Debate”. Presumably it will end the same way as the last one. Moreover, Codd won the Turing award in 1981 for his contribution. In this scenario, $Y$ will win the Turing award circa 2015*\n\nAs a person from the future, I can spoil it and let you know that Scenario 1 played out as expected. \n#### Conclusions (Predictions) for XML\nThey claim that XML will be popular for \"on-the-wire\" data transfer due to its ability to pass through firewalls. XML can be used to transfer data to and from machines, and OR functions can be written to import and export this data. They claim that it will take at least a decade for XML DBMSs to become high performance engines capable of competing with the best of the current RDBMSs. It is more likely that a subset of XMLSchema is implemented, which would likely just map to a current RDBMS anyway, making it not very useful. In short, the future for XML DBMSs is very bleak. \n\nXML was sometimes marketed as the solution to the semantic heterogeneity problem. But this is not true: two people can tag the same field as \"salary\", but one could be a post-tax amount in French Francs and the other pre-tax in USD. The fields are not in any way comparable to each other and should not be stored as the same attribute. \n\nThey also make a couple of claims regarding cross-enterprise information sharing, essentially data being shared by different businesses in the same field with an external party. For example, there are hundreds of vacation / airplane ticket booking websites with varying schemas used under the hood, but they all communicate with the same airline company to book the ticket. They also note that Microsoft had initially pushed \"OLE-DB\" when it perceived a competitive advantage there and killed it off as soon as it didn't see one. Similarly, Microsoft was pushing hard on XML because it saw a threat from Java and J2EE. The closing note is worth reading:\n\n\u003e*Less cynically, we claim that technological advances keep changing the rules. For example, it is clear that the micro-sensor technology coming to the market in the next few years will have a huge impact on system software, and we expect DBMSs and their interfaces to be affected in some (yet to be figured out) way.*\n\u003e\n\u003e*Hence, we expect a succession of new DBMS standards off into the future. In such an ever changing world, it is crucial that a DBMS be very adaptable, so it can deal with whatever the next “big thing” is. 
OR DBMSs have that characteristic; native XML DBMSs do not.*\n\u003e- [\"What Goes Around Comes Around\"](https://people.cs.umass.edu/~yanlei/courses/CS691LL-f06/papers/SH05.pdf)\n### Lessons From the Paper\n- **Lesson 17**: Schema-last is probably a niche market \n- **Lesson 18**: XQuery is pretty much OR SQL with a different syntax \n- **Lesson 19**: XML will not solve the semantic heterogeneity problem, either inside or outside the enterprise.\n# Full Circle\n\n\u003e*This paper has surveyed three decades of data model thinking. It is clear that we have come “full circle”. We started off with a complex data model, which was followed by a great debate between a complex model and a much simpler one. The simpler one was shown to be advantageous in terms of understandability and its ability to support data independence.*\n\u003e\n\u003e*Then, a substantial collection of additions were proposed, none of which gained substantial market traction, largely because they failed to offer substantial leverage in exchange for the increased complexity. The only ideas that got market traction were user-defined functions and user-defined access methods, and these were performance constructs not data model constructs. The current proposal is now a superset of the union of all previous proposals. I.e. we have navigated a full circle.*\n\u003e\n\u003e*The debate between the XML advocates and the relational crowd bears a suspicious resemblance to the first “Great Debate” from a quarter of a century ago. A simple data model is being compared to a complex one. Relational is being compared to “CODASYL II”. The only difference is that “CODASYL II” has a high level query language. Logical data independence will be harder in CODASYL II than in its predecessor, because CODASYL II is even more complex than its predecessor. We can see history repeating itself. If native XML DBMSs gain traction, then customers will have problems with logical data independence and complexity. To avoid repeating history, it is always wise to stand on the shoulders of those who went before, rather than on their feet. As a field, if we don’t start learning something from history, we will be condemned to repeat it yet again.*\n\u003e\n\u003e*More abstractly, we see few new data model ideas. Most everything put forward in the last 20 years is a reinvention of something from a quarter century ago. The only concepts noticeably new appear to be:*\n\u003e\n\u003e- Code in the data base (from the OR camp) \n\u003e- Schema last (from the semi-structured data camp) \n\u003e\n\u003e*Schema last appears to be a niche market, and we don’t see it as any sort of watershed idea. Code in the data base appears to be a really good idea. Moreover, it seems to us that designing a DBMS which made code and data equal class citizens would be very helpful. If so, then add-ons to DBMSs such as stored procedures, triggers, and alerters would become first class citizens. 
The OR model got part way there; maybe it is now time to finish that effort.*\n\u003e\n\u003e- [\"What Goes Around Comes Around\"](https://people.cs.umass.edu/~yanlei/courses/CS691LL-f06/papers/SH05.pdf)\n", -    "lastmodified": "2026-01-11T05:05:40.721967835+05:30", +    "lastmodified": "2026-01-11T10:58:59.240825986+05:30", "tags": [ "domain-cs-systems-databases" ] }, "/blog/what-is-the-stock-market": { "title": "What Is the Stock Market?", -    "content": "# Origins\n## Why Stock Market - A Toy Example\nTo understand the origins of the stock market, we need to understand why there was a need for the stock market in the first place. Let's say you're the owner of a coffee company, you've got a local shop set up and your business is booming. Your method of sourcing the right coffee beans and technique of preparation works! You start thinking bigger and want to expand your local outlet to shops across the country. You need money for this, and so you look for investors (this is too risky for bank loans!). Let's say you get some initial investors Arya, Bharat and Chandan. They give you some amount of capital $C$ in exchange for $S$ shares in your company. Shares are just a written/electronic object that represents a unit of ownership in a company. Each share of your company is initially valued by them at Rs. $C/S$. If there are $S'$ shares of your company in total, your company is valued at Rs. $S' \\times C/S$. \n\nThings go well, but you realize you need more capital to expand. Your company can now choose to do what is known as an **IPO** (Initial Public Offering) on the public stock market. Now, **any company or individual** who believes your business could be profitable might buy some shares. These shares allow investors on the public market to become partial owners in your business. Their investment also helps your company to grow, and as it generates more profit, more buyers might see potential and start buying your stock. As the demand for your company's shares increases, since the supply is fixed, the price increases. This raises the value of the company's stock that people already own, making a lot of profit for your earlier investors. For your company, increased public exposure on the market serves as publicity and also helps you fund new initiatives. \n\nHowever, the converse is also true. Let's suppose you cut costs at one of your coffee shops, decreasing the quality of coffee, and the public gets to know about it. The negative news might convince investors that your company won't make much profit in the future, leading them to expect your company's share value to decline and to sell their holdings. As stocks are sold and demand for your stock goes down, the price falls, and with it, the company's market value. This could leave investors with big losses, unless the company starts looking profitable again.\n\nThis see-saw of supply and demand is responsible for the ever-fluctuating prices on the stock market. Companies are under the unavoidable influence of fluctuating market forces, such as the cost of materials or labor, unexpected new laws, bad publicity or changes in leadership. All these variables cause day-to-day noise in the market, which can make companies appear more or less successful. The sad reality is that *appearing to lose value* often leads to losing investors, which in turn makes your company lose actual value. Human confidence in the market has the power to trigger everything from economic booms to financial crises. 
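\n\nAs a quick sanity check on the valuation arithmetic above, here is a tiny Python sketch. The numbers are made up for illustration; only the formulas come from the text:\n```python\n# Toy share-valuation arithmetic for the coffee-company example.\nC = 1_000_000              # capital invested by Arya, Bharat and Chandan (Rs.)\nS = 10_000                 # shares issued to them in exchange\nprice_per_share = C / S    # Rs. 100 per share\nS_total = 50_000           # total shares outstanding (S' in the text)\ncompany_value = S_total * price_per_share\nprint(price_per_share, company_value)  # 100.0, 5000000.0\n```\nNote that the company's value is the total share count times the *per-share* price $C/S$, not times the raised capital $C$ itself.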
\n## The Actual Origin Story\nIn the 1600s, European companies and governments had an operation going where the company would set up a colony in a different country, take valuable goods for free or at cheap prices and ship them back home to sell for massive profits. However, there were many situations back then where bad weather, pirates, etc. could take down the ship, incurring huge losses for the ship owners. To fund this whole (sometimes risky) operation, ship owners would find rich investors to pay for the cost of the voyage. If the ship returned, everyone would get a share in the loot and reap profits. If the ship didn't, everyone would have lost the invested money and made no profit. Oftentimes, the investors would invest in multiple ships at once to spread their risk. \n\nSoon, the Dutch East India Company took this concept to the next level. Instead of doing the investment on a ship-by-ship basis, they began to sell shares of their company. Since they had a fleet of ships under their command, they essentially bypassed the invest-in-multiple-ships part of the process for investors. Investors invested in the company, and the company gained access to more funds to carry out more expeditions to India and hence make more profits. Investors were able to sell their shares in the company at local bars, public gathering spots, etc. or just reap **dividends** from the company. Dividends are payouts made by the company to investors (these payments are not mandatory, but can help boost investor sentiment for the company). This is how the basis for the first major stock market was created. \n\n\u003e The Amsterdam stock exchange is considered to be the oldest '**modern**' securities market in the world. Created shortly after the establishment of the Dutch East India Company in 1602 when equities began trading on a regular basis as a secondary market to trade its shares. - [Wikipedia - Euronext Amsterdam](https://en.wikipedia.org/wiki/Euronext_Amsterdam)\n\n# Is the Stock Market a Force for Good?\n## The Good Stuff\n- Back in the day, just one person used to call all the shots in a business. Vanderbilt and Rockefeller are famous names that come to mind. They exercised tight control over their businesses and built huge empires. But this all began to change at the beginning of the 20th century, when companies like General Motors (GM) and General Electric were started. Companies realized that they could grow a **lot faster** if they opened up the company to the public and had public investments to accelerate growth. \n- Shareholders in the company want to make money, and if they see the company going in the wrong direction, they'll sell their shares. This will lead to more selling and the share price dropping. But risky, ambitious ideas might also encourage people to invest and see the stock price go up. This is sort of the whole idea of the stock market as a force for good. It drives companies to **make more profitable decisions** =\u003e they have more money to give back to shareholders \u0026 more money to **create jobs** and **grow the business further**. This is good for everybody.\n- It serves as a vehicle for providing investment opportunities not just to the rich and wealthy investors, but to all people. 
People's strong belief in the market leads them to invest in the market, which causes more businesses to get the funding they need to create more jobs and make their business better, thereby directly impacting the people who invested by providing them with more job opportunities, better services / quality of life, and even profits via dividends / increases in share price. Very similar idea to that described in [How the Economic Machine Works by Ray Dalio](/blog/how-the-economic-machine-works-by-ray-dalio).\n## The Bad Stuff\n- It's often not the **real value** of companies that drives their stock prices. It's often the most popular stories that people **believe** about those companies. Sometimes these stories are backed by facts, but sometimes it's all just hype or misinformation. There's often no way to calibrate or contain the spread of hype/misinformation. A famous example is the time when the dotcom bubble exploded. There was crazy hype about internet companies, and then when the bubble burst, there were harsh repercussions. Shareholders lost a lot of money and companies collapsed, leading to a huge loss of jobs and an economic recession. Essentially the polar opposite of the good cycle. \n- Because corporations are owned by shareholders, and because most of the stock market runs on greed, the **only obligation of businesses is to make profits.** Often, top executives' (CEO, CTO, etc.) pay is linked to share price performance. For example, 80% of their pay could be via stock grants. This drives them to make sure that the share price goes up. This could lead the top decision makers in these companies to take decisions like cutting costs, buying back their own shares, etc. to artificially bump up the stock price in the short term. Between 2007 and 2016, companies in the S\u0026P 500 spent $55\\%$ of their earnings on stock buybacks, $39\\%$ on dividends to investors, and only $6\\%$ on jobs and growing the economy, the things which are actually beneficial for company growth and thereby the country's economy. We've evolved to have a shorter-term view on shareholder rights versus a longer-term view on stakeholder responsibilities. Laying off workers, cutting costs to deliver inferior products, reducing wages, etc. negatively affect the economy but could be great for driving up a company's profits in the shorter term. But this is what the stock market encourages companies to optimize.\n- As the stock market has grown, so has inequality. In the USA, in 1970 the average CEO made about $22\\times$ more than the average worker. By 2016, this number had grown to $271\\times$ more. \n\n\u003eWhen the stock market is booming, we're made to believe the economy is booming. In America, the stock market has been booming for nearly 40 years. But if we add up all the goods and services bought and sold in the United States (the **actual** economy), that number isn't growing as quickly as it used to $(\\lt 3\\%)$. Wages have hardly budged in decades ($\\$20.19$ in 1965 to $\\$22.49$ in 2020) and the average American family's net worth still hasn't really recovered from the Great Recession ($\\$119$k in 2007 to $\\$78$k in 2016). So what exactly is the stock market measuring? \n\u003e- Quoted from [Explained | The Stock Market | FULL EPISODE | Netflix](https://www.youtube.com/watch?v=ZCFkWDdmXG8\u0026t=4s)\n# References\n1. [How does the stock market work? - Oliver Elfenbaum](https://www.youtube.com/watch?v=p7HKvqRI_Bo)\n2. 
[Explained | The Stock Market | FULL EPISODE | Netflix](https://www.youtube.com/watch?v=ZCFkWDdmXG8\u0026t=4s)\n3. [Wikipedia - Euronext Amsterdam](https://en.wikipedia.org/wiki/Euronext_Amsterdam)\n4. [How The Economic Machine Works by Ray Dalio](https://www.youtube.com/watch?v=PHe0bXAIuk0)\n\n\n---\n\n## Related Reading\n\n- [Derivatives - Options](/blog/derivatives-options)\n- [Efficient-Market-Hypothesis](/blog/efficient-market-hypothesis)\n- [Stock Multiples](/blog/stock-multiples)\n", -    "lastmodified": "2026-01-11T05:06:36.777185226+05:30", +    "content": "# Origins\n## Why Stock Market - A Toy Example\nTo understand the origins of the stock market, we need to understand why there was a need for the stock market in the first place. Let's say you're the owner of a coffee company, you've got a local shop set up and your business is booming. Your method of sourcing the right coffee beans and technique of preparation works! You start thinking bigger and want to expand your local outlet to shops across the country. You need money for this, and so you look for investors (this is too risky for bank loans!). Let's say you get some initial investors Arya, Bharat and Chandan. They give you some amount of capital $C$ in exchange for $S$ shares in your company. Shares are just a written/electronic object that represents a unit of ownership in a company. Each share of your company is initially valued by them at Rs. $C/S$. If there are $S'$ shares of your company in total, your company is valued at Rs. $S' \\times C/S$. \n\nThings go well, but you realize you need more capital to expand. Your company can now choose to do what is known as an **IPO** (Initial Public Offering) on the public stock market. Now, **any company or individual** who believes your business could be profitable might buy some shares. These shares allow investors on the public market to become partial owners in your business. Their investment also helps your company to grow, and as it generates more profit, more buyers might see potential and start buying your stock. As the demand for your company's shares increases, since the supply is fixed, the price increases. This raises the value of the company's stock that people already own, making a lot of profit for your earlier investors. For your company, increased public exposure on the market serves as publicity and also helps you fund new initiatives. \n\nHowever, the converse is also true. Let's suppose you cut costs at one of your coffee shops, decreasing the quality of coffee, and the public gets to know about it. The negative news might convince investors that your company won't make much profit in the future, leading them to expect your company's share value to decline and to sell their holdings. As stocks are sold and demand for your stock goes down, the price falls, and with it, the company's market value. This could leave investors with big losses, unless the company starts looking profitable again.\n\nThis see-saw of supply and demand is responsible for the ever-fluctuating prices on the stock market. Companies are under the unavoidable influence of fluctuating market forces, such as the cost of materials or labor, unexpected new laws, bad publicity or changes in leadership. All these variables cause day-to-day noise in the market, which can make companies appear more or less successful. The sad reality is that *appearing to lose value* often leads to losing investors, which in turn makes your company lose actual value. 
Human confidence in the market has the power to trigger everything from economic booms to financial crises. \n## The Actual Origin Story\nIn the 1600s, European companies and governments had an operation going where the company would set up a colony in a different country, take valuable goods for free or at cheap prices and ship them back home to sell for massive profits. However, there were many situations back then where bad weather, pirates, etc. could take down the ship, incurring huge losses for the ship owners. To fund this whole (sometimes risky) operation, ship owners would find rich investors to pay for the cost of the voyage. If the ship returned, everyone would get a share in the loot and reap profits. If the ship didn't, everyone would have lost the invested money and made no profit. Oftentimes, the investors would invest in multiple ships at once to spread their risk. \n\nSoon, the Dutch East India Company took this concept to the next level. Instead of doing the investment on a ship-by-ship basis, they began to sell shares of their company. Since they had a fleet of ships under their command, they essentially bypassed the invest-in-multiple-ships part of the process for investors. Investors invested in the company, and the company gained access to more funds to carry out more expeditions to India and hence make more profits. Investors were able to sell their shares in the company at local bars, public gathering spots, etc. or just reap **dividends** from the company. Dividends are payouts made by the company to investors (these payments are not mandatory, but can help boost investor sentiment for the company). This is how the basis for the first major stock market was created. \n\n\u003e The Amsterdam stock exchange is considered to be the oldest '**modern**' securities market in the world. Created shortly after the establishment of the Dutch East India Company in 1602 when equities began trading on a regular basis as a secondary market to trade its shares. - [Wikipedia - Euronext Amsterdam](https://en.wikipedia.org/wiki/Euronext_Amsterdam)\n\n# Is the Stock Market a Force for Good?\n## The Good Stuff\n- Back in the day, just one person used to call all the shots in a business. Vanderbilt and Rockefeller are famous names that come to mind. They exercised tight control over their businesses and built huge empires. But this all began to change at the beginning of the 20th century, when companies like General Motors (GM) and General Electric were started. Companies realized that they could grow a **lot faster** if they opened up the company to the public and had public investments to accelerate growth. \n- Shareholders in the company want to make money, and if they see the company going in the wrong direction, they'll sell their shares. This will lead to more selling and the share price dropping. But risky, ambitious ideas might also encourage people to invest and see the stock price go up. This is sort of the whole idea of the stock market as a force for good. It drives companies to **make more profitable decisions** =\u003e they have more money to give back to shareholders \u0026 more money to **create jobs** and **grow the business further**. This is good for everybody.\n- It serves as a vehicle for providing investment opportunities not just to the rich and wealthy investors, but to all people. 
People's strong belief in the market leads them to invest in the market, which causes more businesses to get the funding they need to create more jobs and make their business better, thereby directly impacting the people who invested by providing them with more job opportunities, better services / quality of life, and even profits via dividends / increases in share price. Very similar idea to that described in [How the Economic Machine Works by Ray Dalio](/blog/how-the-economic-machine-works-by-ray-dalio).\n## The Bad Stuff\n- It's often not the **real value** of companies that drives their stock prices. It's often the most popular stories that people **believe** about those companies. Sometimes these stories are backed by facts, but sometimes it's all just hype or misinformation. There's often no way to calibrate or contain the spread of hype/misinformation. A famous example is the time when the dotcom bubble exploded. There was crazy hype about internet companies, and then when the bubble burst, there were harsh repercussions. Shareholders lost a lot of money and companies collapsed, leading to a huge loss of jobs and an economic recession. Essentially the polar opposite of the good cycle. \n- Because corporations are owned by shareholders, and because most of the stock market runs on greed, the **only obligation of businesses is to make profits.** Often, top executives' (CEO, CTO, etc.) pay is linked to share price performance. For example, 80% of their pay could be via stock grants. This drives them to make sure that the share price goes up. This could lead the top decision makers in these companies to take decisions like cutting costs, buying back their own shares, etc. to artificially bump up the stock price in the short term. Between 2007 and 2016, companies in the S\u0026P 500 spent $55\\%$ of their earnings on stock buybacks, $39\\%$ on dividends to investors, and only $6\\%$ on jobs and growing the economy, the things which are actually beneficial for company growth and thereby the country's economy. We've evolved to have a shorter-term view on shareholder rights versus a longer-term view on stakeholder responsibilities. Laying off workers, cutting costs to deliver inferior products, reducing wages, etc. negatively affect the economy but could be great for driving up a company's profits in the shorter term. But this is what the stock market encourages companies to optimize.\n- As the stock market has grown, so has inequality. In the USA, in 1970 the average CEO made about $22\\times$ more than the average worker. By 2016, this number had grown to $271\\times$ more. \n\n\u003eWhen the stock market is booming, we're made to believe the economy is booming. In America, the stock market has been booming for nearly 40 years. But if we add up all the goods and services bought and sold in the United States (the **actual** economy), that number isn't growing as quickly as it used to $(\\lt 3\\%)$. Wages have hardly budged in decades ($\\$20.19$ in 1965 to $\\$22.49$ in 2020) and the average American family's net worth still hasn't really recovered from the Great Recession ($\\$119$k in 2007 to $\\$78$k in 2016). So what exactly is the stock market measuring? \n\u003e- Quoted from [Explained | The Stock Market | FULL EPISODE | Netflix](https://www.youtube.com/watch?v=ZCFkWDdmXG8\u0026t=4s)\n# References\n1. [How does the stock market work? - Oliver Elfenbaum](https://www.youtube.com/watch?v=p7HKvqRI_Bo)\n2. 
[Explained | The Stock Market | FULL EPISODE | Netflix](https://www.youtube.com/watch?v=ZCFkWDdmXG8\u0026t=4s)\n3. [Wikipedia - Euronext Amsterdam](https://en.wikipedia.org/wiki/Euronext_Amsterdam)\n4. [How The Economic Machine Works by Ray Dalio](https://www.youtube.com/watch?v=PHe0bXAIuk0)\n", +    "lastmodified": "2026-01-11T10:59:49.094919188+05:30", "tags": [ "domain-finance-economics" ] }, "/blog/wilson-s-theorem-fermat-s-little-theorem-euler-s-totient-function": { "title": "Wilson's Theorem, Fermat's Little Theorem \u0026 Euler's Totient Function", "content": "Last time, we covered the [Extended Euclidean Algorithm](/blog/extended-euclidean-algorithm). Now, we'll delve into some cooler number theory. \n# Wilson's Theorem\nWilson's Theorem states that for any number $p$, the following congruence holds $\\iff p$ is prime:\n$$(p-1)! \\equiv -1 \\pmod{p}$$\n## Proof\n### Proof for composite numbers\nWe can prove that this statement does not hold for any composite $p$ easily. Let $p$ be a composite number $\\gt 2$. Then $p$ can be represented as the product of two numbers $a \\cdot b = p$ for some $1 \\lt a \\leq b \\lt p$. Since $a \\leq p-1$, we have $a \\mid (p-1)!$, hence $(p-1)! \\equiv 0 \\pmod{a}$. Now suppose $(p-1)! \\equiv -1 \\pmod{p}$. Since $a \\mid p$, reducing modulo $a$ gives $(p-1)! \\equiv -1 \\pmod{a}$. Combining the two congruences, $0 \\equiv -1 \\pmod{a}$, i.e. $a \\mid 1$, which is impossible for $a \\gt 1$. This is a contradiction. \nTherefore, if $p$ is composite, $(p-1)! \\not \\equiv -1 \\pmod{p}$. Equivalently, if the congruence does hold, then $p$ cannot be composite. \n### Proof for prime numbers\nLet's prove the case for $p = 2$ first. $(2-1)! = 1! \\equiv -1 \\pmod{2}$ is seen trivially. Now we will prove for all odd primes $p$. \nNote that in $Z_{p+} = \\{1, 2, 3, \\ldots, p-1\\}$, $\\forall x \\in Z_{p+}, \\ \\exists \\ x' \\mid x \\cdot x' \\equiv 1 \\pmod{p}$. This is essentially the existence of an inverse. Also note that the inverse must always be unique for each $x \\in Z_{p+}$. Now, there are two possible cases, $x = x'$ or $x \\neq x'$. \n\nLet's assume $x = x'$. Then, \n$$\n\\begin{aligned}\nx \\cdot x' \\equiv 1 \\pmod{p} \\\\\nx^2 \\equiv 1 \\pmod{p} \\\\\n\\implies x \\equiv \\pm1 \\pmod{p} \\\\ \n\\implies x = 1 \\ \\lor \\ x = p-1\n\\end{aligned}\n$$\n(The step from $x^2 \\equiv 1$ to $x \\equiv \\pm 1$ uses the fact that $p$ is prime: $p \\mid (x-1)(x+1)$ forces $p \\mid x-1$ or $p \\mid x+1$.) Therefore, the only two elements in this field with inverses equivalent to themselves are $1$ and $p-1$. Now, let's consider the entire product of $(p-1)!$. \n$$\n\\begin{aligned}\n(p-1)! \\pmod{p} \\equiv (p-1)\\cdot(p-2)\\cdot(p-3)\\cdots1 \\pmod{p} \\\\\n\\text{Pairing off all the other elements with their unique inverses gives us} \\\\\n(p-1)! \\equiv 1\\cdot (p-1) \\pmod{p} \\\\\n\\implies (p-1)! \\equiv -1 \\pmod{p}\n\\end{aligned}\n$$\nHence we have proved Wilson's theorem. \n# Fermat's Little Theorem\n\nFermat's little theorem states the following:\n\n_If $p$ is a prime number, then for any integer $a$, the number $a^p - a$ is an integer multiple of $p$._\n\nIn other words,\n\n$$ a^p \\equiv a \\pmod{p} $$\n\nFurther, if $a$ is not divisible by $p$, then\n\n$$ a^{p-1} \\equiv 1 \\pmod{p} $$\n\n\u003e Fun fact, this theorem is used to come up with a **very** accurate probabilistic [Randomization, Primality Testing Algorithms](/blog/randomization-primality-testing-algorithms)!\n## Proof\nThe proof is as follows: Consider the set $Z_p = \\{1, 2, 3, \\ldots, p-1\\}$, which contains all the non-zero integers modulo $p$, and let $a$ be an integer not divisible by $p$. Let's construct the following equation and work on rearranging / substituting terms. 
\n$$\n\\begin{aligned} \n(a \\cdot 1)(a \\cdot 2)(a \\cdot 3) \\cdots (a \\cdot (p-1)) \u0026\\equiv a^{p-1} \\cdot (1 \\cdot 2 \\cdot 3 \\cdots (p-1)) \\pmod{p} \\\\ \u0026\\equiv a^{p-1} \\cdot (p-1)! \\pmod{p} \n\\end{aligned}$$ Next, observe that since $\\gcd(a, p) = 1$, the products $a \\cdot 1, a \\cdot 2, \\ldots, a \\cdot (p-1)$ are all distinct and non-zero modulo $p$ (if $a \\cdot i \\equiv a \\cdot j \\pmod{p}$, we can cancel $a$ to get $i \\equiv j \\pmod{p}$). So they are simply a permutation of $1, 2, \\ldots, p-1$ modulo $p$, and hence: $$(a \\cdot 1)(a \\cdot 2) \\cdots (a \\cdot (p-1)) \\equiv (p-1)! \\pmod{p}$$ Equating the two expressions for the left-hand side, we get: $$a^{p-1} \\cdot (p-1)! \\equiv (p-1)! \\pmod{p}$$ By Wilson's Theorem, $(p-1)! \\equiv -1 \\pmod{p}$, and $-1$ is invertible modulo $p$, so we can cancel $(p-1)!$ from both sides: $$a^{p-1} \\equiv 1 \\pmod{p}$$ Multiplying both sides by $a$ gives: $$a^{p} \\equiv a \\pmod{p}$$ Thus, we have proved Fermat's Little Theorem.\n# Euler's Totient Function\n\nEuler came along later and gave a more generalized version of Fermat's little theorem. He stated that for _any_ modulus $n$ and any integer $a$ co-prime to $n$, the following holds true.\n\n$$ a^{\\phi(n)} \\equiv 1 \\pmod{n} $$\n\nHere, $\\phi(n)$ is known as **Euler's Totient function.** It counts the number of integers between 1 and $n$ inclusive, which are co-prime to $n$. Or in simpler words, it is the number of numbers up to $n$ that share no common divisor with $n$ other than 1.\n\n**Some interesting properties:**\n\n1. Notice that for any prime number $p$, $\\phi (p) = p-1$. By virtue of being prime, $p$ does not share any factor with any number less than itself.\n \n2. The totient function is a **multiplicative function**. This is not a trivial thing to see and follows from the Chinese remainder theorem. This [stack link](https://math.stackexchange.com/questions/192452/whats-the-proof-that-the-euler-totient-function-is-multiplicative) has a really nice write up of the proof. This property essentially means that for relatively prime $a$ and $b$,\n \n $$ \\phi (ab) = \\phi(a)\\cdot\\phi(b) $$\n \n\nNotice that Fermat's is indeed a special case of this theorem. When $n$ is prime, we get Fermat's little theorem.\n\nFurther, just like factorization, computing the value of $\\phi(n)$ is a **hard** problem. However, notice that if the factorization of $n$ is known, we can compute the value easily. We can write $n = p_1^{k_1} p_2^{k_2} \\cdots p_m^{k_m}$ and use multiplicativity to get $\\phi(n) = \\phi(p_1^{k_1}) \\cdot \\phi(p_2^{k_2}) \\cdots \\phi(p_m^{k_m})$. And it is easy to compute $\\phi(p^k) = p^k - p^{k-1}$ for prime $p$: the only numbers up to $p^k$ sharing a factor with it are the $p^{k-1}$ multiples of $p$.\n\n# References\nThese notes are old and I did not rigorously hoard references back then. If some part of this content is yours or you know where it's from then do reach out to me and I'll update it. \n1. Professor [Kannan Srinathan's](https://www.iiit.ac.in/people/faculty/srinathan/) course on Algorithm Analysis \u0026 Design in IIIT-H", -    "lastmodified": "2026-01-11T05:06:36.785908413+05:30", +    "lastmodified": "2026-01-11T10:59:48.862917257+05:30", "tags": [ "domain-math-number-theory" ] } diff --git a/assets/indices/linkIndex.json b/assets/indices/linkIndex.json index 603c9d9..b484ab8 100644 --- a/assets/indices/linkIndex.json +++ b/assets/indices/linkIndex.json @@ -45,23 +45,6 @@ "text": "Introduction to Complexity Theory" } ], -  "/blog/activity-selection-huffman-encoding": [ - { - "source": "/blog/activity-selection-huffman-encoding", - "target": "/blog/more-greedy-algorithms-kruskal-s-disjoint-set-union", - "text": "More Greedy Algorithms! 
Kruskal's \u0026 Disjoint Set Union" - }, - { - "source": "/blog/activity-selection-huffman-encoding", - "target": "/blog/set-cover-approximation-algorithms", - "text": "Set Cover \u0026 Approximation Algorithms" - }, - { - "source": "/blog/activity-selection-huffman-encoding", - "target": "/blog/dp-as-dags-shortest-path-on-dags-lis-in-o-nlogn", - "text": "DP as DAGs, Shortest path on DAGs \u0026 LIS in O(nlogn)" - } - ], "/blog/algorithms-for-approximate-string-matching-alignment": [ { "source": "/blog/algorithms-for-approximate-string-matching-alignment", @@ -147,16 +130,6 @@ "source": "/blog/chain-matrix-multiplication", "target": "/blog/levenshtein-edit-distance", "text": "Levenshtein Edit Distance" - }, - { - "source": "/blog/chain-matrix-multiplication", - "target": "/blog/dp-as-dags-shortest-path-on-dags-lis-in-o-nlogn", - "text": "DP as DAGs, Shortest path on DAGs \u0026 LIS in O(nlogn)" - }, - { - "source": "/blog/chain-matrix-multiplication", - "target": "/blog/a-deep-dive-into-the-knapsack-problem", - "text": "A Deep Dive into the Knapsack Problem" } ], "/blog/covered-calls-strategy": [ @@ -164,16 +137,6 @@ "source": "/blog/covered-calls-strategy", "target": "/blog/derivatives-options", "text": "Derivatives - Options" - }, - { - "source": "/blog/covered-calls-strategy", - "target": "/blog/efficient-market-hypothesis", - "text": "Efficient-Market-Hypothesis" - }, - { - "source": "/blog/covered-calls-strategy", - "target": "/blog/what-is-the-stock-market", - "text": "What is the Stock Market?" } ], "/blog/de-novo-assembly-overlap-graphs": [ @@ -200,16 +163,6 @@ "source": "/blog/derivatives-options", "target": "/blog/what-is-the-stock-market", "text": "What is the Stock Market?" - }, - { - "source": "/blog/derivatives-options", - "target": "/blog/covered-calls-strategy", - "text": "Covered Calls Strategy" - }, - { - "source": "/blog/derivatives-options", - "target": "/blog/efficient-market-hypothesis", - "text": "Efficient-Market-Hypothesis" } ], "/blog/dna-sequencing": [ @@ -244,21 +197,6 @@ "source": "/blog/dp-as-dags-shortest-path-on-dags-lis-in-o-nlogn", "target": "/blog/complexity-theory-reductions", "text": "Complexity Theory - Reductions" - }, - { - "source": "/blog/dp-as-dags-shortest-path-on-dags-lis-in-o-nlogn", - "target": "/blog/levenshtein-edit-distance", - "text": "Levenshtein Edit Distance" - }, - { - "source": "/blog/dp-as-dags-shortest-path-on-dags-lis-in-o-nlogn", - "target": "/blog/chain-matrix-multiplication", - "text": "Chain Matrix Multiplication" - }, - { - "source": "/blog/dp-as-dags-shortest-path-on-dags-lis-in-o-nlogn", - "target": "/blog/a-deep-dive-into-the-knapsack-problem", - "text": "A Deep Dive into the Knapsack Problem" } ], "/blog/finite-and-infinite-sums": [ @@ -355,16 +293,6 @@ "source": "/blog/levenshtein-edit-distance", "target": "/blog/dp-as-dags-shortest-path-on-dags-lis-in-o-nlogn", "text": "DP as DAGs, Shortest path on DAGs \u0026 LIS in O(nlogn)" - }, - { - "source": "/blog/levenshtein-edit-distance", - "target": "/blog/chain-matrix-multiplication", - "text": "Chain Matrix Multiplication" - }, - { - "source": "/blog/levenshtein-edit-distance", - "target": "/blog/a-deep-dive-into-the-knapsack-problem", - "text": "A Deep Dive into the Knapsack Problem" } ], "/blog/log-structured-merge-tree-lsm-trees": [ @@ -423,16 +351,6 @@ "source": "/blog/more-greedy-algorithms-kruskal-s-disjoint-set-union", "target": "/blog/activity-selection-huffman-encoding", "text": "Activity Selection \u0026 Huffman Encoding" - }, - { - "source": 
"/blog/more-greedy-algorithms-kruskal-s-disjoint-set-union", - "target": "/blog/set-cover-approximation-algorithms", - "text": "Set Cover \u0026 Approximation Algorithms" - }, - { - "source": "/blog/more-greedy-algorithms-kruskal-s-disjoint-set-union", - "target": "/blog/dp-as-dags-shortest-path-on-dags-lis-in-o-nlogn", - "text": "DP as DAGs, Shortest path on DAGs \u0026 LIS in O(nlogn)" } ], "/blog/multi-agent-systems-harnessing-collective-intelligence-a-survey": [ @@ -509,11 +427,6 @@ "source": "/blog/set-cover-approximation-algorithms", "target": "/blog/activity-selection-huffman-encoding", "text": "Activity Selection \u0026 Huffman Encoding" - }, - { - "source": "/blog/set-cover-approximation-algorithms", - "target": "/blog/dp-as-dags-shortest-path-on-dags-lis-in-o-nlogn", - "text": "DP as DAGs, Shortest path on DAGs \u0026 LIS in O(nlogn)" } ], "/blog/shortest-common-superstring-de-brujin-graphs": [ @@ -663,21 +576,6 @@ "source": "/blog/what-is-the-stock-market", "target": "/blog/how-the-economic-machine-works-by-ray-dalio", "text": "How the Economic Machine Works by Ray Dalio" - }, - { - "source": "/blog/what-is-the-stock-market", - "target": "/blog/derivatives-options", - "text": "Derivatives - Options" - }, - { - "source": "/blog/what-is-the-stock-market", - "target": "/blog/efficient-market-hypothesis", - "text": "Efficient-Market-Hypothesis" - }, - { - "source": "/blog/what-is-the-stock-market", - "target": "/blog/stock-multiples", - "text": "Stock Multiples" } ], "/blog/wilson-s-theorem-fermat-s-little-theorem-euler-s-totient-function": [ @@ -695,26 +593,11 @@ }, "backlinks": { "/blog/a-deep-dive-into-the-knapsack-problem": [ - { - "source": "/blog/chain-matrix-multiplication", - "target": "/blog/a-deep-dive-into-the-knapsack-problem", - "text": "A Deep Dive into the Knapsack Problem" - }, - { - "source": "/blog/dp-as-dags-shortest-path-on-dags-lis-in-o-nlogn", - "target": "/blog/a-deep-dive-into-the-knapsack-problem", - "text": "A Deep Dive into the Knapsack Problem" - }, { "source": "/blog/knapsack-using-branch-and-bounding", "target": "/blog/a-deep-dive-into-the-knapsack-problem", "text": "A Deep Dive into the Knapsack Problem" }, - { - "source": "/blog/levenshtein-edit-distance", - "target": "/blog/a-deep-dive-into-the-knapsack-problem", - "text": "A Deep Dive into the Knapsack Problem" - }, { "source": "/blog/shortest-reliable-path-floyd-warshall-max-independent-set-tree", "target": "/blog/a-deep-dive-into-the-knapsack-problem", @@ -860,16 +743,6 @@ "source": "/blog/a-deep-dive-into-the-knapsack-problem", "target": "/blog/chain-matrix-multiplication", "text": "Chain Matrix Multiplication" - }, - { - "source": "/blog/dp-as-dags-shortest-path-on-dags-lis-in-o-nlogn", - "target": "/blog/chain-matrix-multiplication", - "text": "Chain Matrix Multiplication" - }, - { - "source": "/blog/levenshtein-edit-distance", - "target": "/blog/chain-matrix-multiplication", - "text": "Chain Matrix Multiplication" } ], "/blog/church-turing-hypothesis": [ @@ -896,13 +769,6 @@ "text": "Complexity Theory - Reductions" } ], - "/blog/covered-calls-strategy": [ - { - "source": "/blog/derivatives-options", - "target": "/blog/covered-calls-strategy", - "text": "Covered Calls Strategy" - } - ], "/blog/de-novo-assembly-overlap-graphs": [ { "source": "/blog/shortest-common-superstring-de-brujin-graphs", @@ -932,11 +798,6 @@ "source": "/blog/the-economics-of-foreign-exchange-forex", "target": "/blog/derivatives-options", "text": "Derivatives - Options" - }, - { - "source": 
"/blog/what-is-the-stock-market", - "target": "/blog/derivatives-options", - "text": "Derivatives - Options" } ], "/blog/dna-sequencing": [ @@ -952,47 +813,10 @@ "target": "/blog/dp-as-dags-shortest-path-on-dags-lis-in-o-nlogn", "text": "DP as DAGs, Shortest path on DAGs \u0026 LIS in O(nlogn)" }, - { - "source": "/blog/activity-selection-huffman-encoding", - "target": "/blog/dp-as-dags-shortest-path-on-dags-lis-in-o-nlogn", - "text": "DP as DAGs, Shortest path on DAGs \u0026 LIS in O(nlogn)" - }, - { - "source": "/blog/chain-matrix-multiplication", - "target": "/blog/dp-as-dags-shortest-path-on-dags-lis-in-o-nlogn", - "text": "DP as DAGs, Shortest path on DAGs \u0026 LIS in O(nlogn)" - }, { "source": "/blog/levenshtein-edit-distance", "target": "/blog/dp-as-dags-shortest-path-on-dags-lis-in-o-nlogn", "text": "DP as DAGs, Shortest path on DAGs \u0026 LIS in O(nlogn)" - }, - { - "source": "/blog/more-greedy-algorithms-kruskal-s-disjoint-set-union", - "target": "/blog/dp-as-dags-shortest-path-on-dags-lis-in-o-nlogn", - "text": "DP as DAGs, Shortest path on DAGs \u0026 LIS in O(nlogn)" - }, - { - "source": "/blog/set-cover-approximation-algorithms", - "target": "/blog/dp-as-dags-shortest-path-on-dags-lis-in-o-nlogn", - "text": "DP as DAGs, Shortest path on DAGs \u0026 LIS in O(nlogn)" - } - ], - "/blog/efficient-market-hypothesis": [ - { - "source": "/blog/covered-calls-strategy", - "target": "/blog/efficient-market-hypothesis", - "text": "Efficient-Market-Hypothesis" - }, - { - "source": "/blog/derivatives-options", - "target": "/blog/efficient-market-hypothesis", - "text": "Efficient-Market-Hypothesis" - }, - { - "source": "/blog/what-is-the-stock-market", - "target": "/blog/efficient-market-hypothesis", - "text": "Efficient-Market-Hypothesis" } ], "/blog/extended-euclidean-algorithm": [ @@ -1102,11 +926,6 @@ "source": "/blog/chain-matrix-multiplication", "target": "/blog/levenshtein-edit-distance", "text": "Levenshtein Edit Distance" - }, - { - "source": "/blog/dp-as-dags-shortest-path-on-dags-lis-in-o-nlogn", - "target": "/blog/levenshtein-edit-distance", - "text": "Levenshtein Edit Distance" } ], "/blog/linearity-of-expectation": [ @@ -1124,11 +943,6 @@ } ], "/blog/more-greedy-algorithms-kruskal-s-disjoint-set-union": [ - { - "source": "/blog/activity-selection-huffman-encoding", - "target": "/blog/more-greedy-algorithms-kruskal-s-disjoint-set-union", - "text": "More Greedy Algorithms! 
Kruskal's \u0026 Disjoint Set Union" - }, { "source": "/blog/dp-as-dags-shortest-path-on-dags-lis-in-o-nlogn", "target": "/blog/more-greedy-algorithms-kruskal-s-disjoint-set-union", @@ -1257,20 +1071,10 @@ } ], "/blog/set-cover-approximation-algorithms": [ - { - "source": "/blog/activity-selection-huffman-encoding", - "target": "/blog/set-cover-approximation-algorithms", - "text": "Set Cover \u0026 Approximation Algorithms" - }, { "source": "/blog/dp-as-dags-shortest-path-on-dags-lis-in-o-nlogn", "target": "/blog/set-cover-approximation-algorithms", "text": "Set Cover \u0026 Approximation Algorithms" - }, - { - "source": "/blog/more-greedy-algorithms-kruskal-s-disjoint-set-union", - "target": "/blog/set-cover-approximation-algorithms", - "text": "Set Cover \u0026 Approximation Algorithms" } ], "/blog/sfinae": [ @@ -1287,13 +1091,6 @@ "text": "Smart Pointers" } ], - "/blog/stock-multiples": [ - { - "source": "/blog/what-is-the-stock-market", - "target": "/blog/stock-multiples", - "text": "Stock Multiples" - } - ], "/blog/templates-in-c": [ { "source": "/blog/a-c-programmer-s-guide-to-c", @@ -1380,11 +1177,6 @@ } ], "/blog/what-is-the-stock-market": [ - { - "source": "/blog/covered-calls-strategy", - "target": "/blog/what-is-the-stock-market", - "text": "What is the Stock Market?" - }, { "source": "/blog/derivatives-options", "target": "/blog/what-is-the-stock-market", @@ -1466,21 +1258,6 @@ "target": "/blog/introduction-to-complexity-theory", "text": "Introduction to Complexity Theory" }, - { - "source": "/blog/activity-selection-huffman-encoding", - "target": "/blog/more-greedy-algorithms-kruskal-s-disjoint-set-union", - "text": "More Greedy Algorithms! Kruskal's \u0026 Disjoint Set Union" - }, - { - "source": "/blog/activity-selection-huffman-encoding", - "target": "/blog/set-cover-approximation-algorithms", - "text": "Set Cover \u0026 Approximation Algorithms" - }, - { - "source": "/blog/activity-selection-huffman-encoding", - "target": "/blog/dp-as-dags-shortest-path-on-dags-lis-in-o-nlogn", - "text": "DP as DAGs, Shortest path on DAGs \u0026 LIS in O(nlogn)" - }, { "source": "/blog/algorithms-for-approximate-string-matching-alignment", "target": "/blog/levenshtein-edit-distance", @@ -1556,31 +1333,11 @@ "target": "/blog/levenshtein-edit-distance", "text": "Levenshtein Edit Distance" }, - { - "source": "/blog/chain-matrix-multiplication", - "target": "/blog/dp-as-dags-shortest-path-on-dags-lis-in-o-nlogn", - "text": "DP as DAGs, Shortest path on DAGs \u0026 LIS in O(nlogn)" - }, - { - "source": "/blog/chain-matrix-multiplication", - "target": "/blog/a-deep-dive-into-the-knapsack-problem", - "text": "A Deep Dive into the Knapsack Problem" - }, { "source": "/blog/covered-calls-strategy", "target": "/blog/derivatives-options", "text": "Derivatives - Options" }, - { - "source": "/blog/covered-calls-strategy", - "target": "/blog/efficient-market-hypothesis", - "text": "Efficient-Market-Hypothesis" - }, - { - "source": "/blog/covered-calls-strategy", - "target": "/blog/what-is-the-stock-market", - "text": "What is the Stock Market?" - }, { "source": "/blog/de-novo-assembly-overlap-graphs", "target": "/blog/dna-sequencing", @@ -1601,16 +1358,6 @@ "target": "/blog/what-is-the-stock-market", "text": "What is the Stock Market?" 
}, - { - "source": "/blog/derivatives-options", - "target": "/blog/covered-calls-strategy", - "text": "Covered Calls Strategy" - }, - { - "source": "/blog/derivatives-options", - "target": "/blog/efficient-market-hypothesis", - "text": "Efficient-Market-Hypothesis" - }, { "source": "/blog/dna-sequencing", "target": "/blog/algorithms-for-approximate-string-matching-alignment", @@ -1641,21 +1388,6 @@ "target": "/blog/complexity-theory-reductions", "text": "Complexity Theory - Reductions" }, - { - "source": "/blog/dp-as-dags-shortest-path-on-dags-lis-in-o-nlogn", - "target": "/blog/levenshtein-edit-distance", - "text": "Levenshtein Edit Distance" - }, - { - "source": "/blog/dp-as-dags-shortest-path-on-dags-lis-in-o-nlogn", - "target": "/blog/chain-matrix-multiplication", - "text": "Chain Matrix Multiplication" - }, - { - "source": "/blog/dp-as-dags-shortest-path-on-dags-lis-in-o-nlogn", - "target": "/blog/a-deep-dive-into-the-knapsack-problem", - "text": "A Deep Dive into the Knapsack Problem" - }, { "source": "/blog/finite-and-infinite-sums", "target": "/blog/riemann-series-rearrangement", @@ -1736,16 +1468,6 @@ "target": "/blog/dp-as-dags-shortest-path-on-dags-lis-in-o-nlogn", "text": "DP as DAGs, Shortest path on DAGs \u0026 LIS in O(nlogn)" }, - { - "source": "/blog/levenshtein-edit-distance", - "target": "/blog/chain-matrix-multiplication", - "text": "Chain Matrix Multiplication" - }, - { - "source": "/blog/levenshtein-edit-distance", - "target": "/blog/a-deep-dive-into-the-knapsack-problem", - "text": "A Deep Dive into the Knapsack Problem" - }, { "source": "/blog/log-structured-merge-tree-lsm-trees", "target": "/blog/rocksdb", @@ -1796,16 +1518,6 @@ "target": "/blog/activity-selection-huffman-encoding", "text": "Activity Selection \u0026 Huffman Encoding" }, - { - "source": "/blog/more-greedy-algorithms-kruskal-s-disjoint-set-union", - "target": "/blog/set-cover-approximation-algorithms", - "text": "Set Cover \u0026 Approximation Algorithms" - }, - { - "source": "/blog/more-greedy-algorithms-kruskal-s-disjoint-set-union", - "target": "/blog/dp-as-dags-shortest-path-on-dags-lis-in-o-nlogn", - "text": "DP as DAGs, Shortest path on DAGs \u0026 LIS in O(nlogn)" - }, { "source": "/blog/multi-agent-systems-harnessing-collective-intelligence-a-survey", "target": "/blog/reasoning-acting-and-learning-a-survey-of-single-agent-llm-patterns", @@ -1866,11 +1578,6 @@ "target": "/blog/activity-selection-huffman-encoding", "text": "Activity Selection \u0026 Huffman Encoding" }, - { - "source": "/blog/set-cover-approximation-algorithms", - "target": "/blog/dp-as-dags-shortest-path-on-dags-lis-in-o-nlogn", - "text": "DP as DAGs, Shortest path on DAGs \u0026 LIS in O(nlogn)" - }, { "source": "/blog/shortest-common-superstring-de-brujin-graphs", "target": "/blog/de-novo-assembly-overlap-graphs", @@ -1996,21 +1703,6 @@ "target": "/blog/how-the-economic-machine-works-by-ray-dalio", "text": "How the Economic Machine Works by Ray Dalio" }, - { - "source": "/blog/what-is-the-stock-market", - "target": "/blog/derivatives-options", - "text": "Derivatives - Options" - }, - { - "source": "/blog/what-is-the-stock-market", - "target": "/blog/efficient-market-hypothesis", - "text": "Efficient-Market-Hypothesis" - }, - { - "source": "/blog/what-is-the-stock-market", - "target": "/blog/stock-multiples", - "text": "Stock Multiples" - }, { "source": "/blog/wilson-s-theorem-fermat-s-little-theorem-euler-s-totient-function", "target": "/blog/extended-euclidean-algorithm", diff --git 
a/content/blog/activity-selection-huffman-encoding.md b/content/blog/activity-selection-huffman-encoding.md index fe39356..527aae9 100644 --- a/content/blog/activity-selection-huffman-encoding.md +++ b/content/blog/activity-selection-huffman-encoding.md @@ -187,12 +187,4 @@ The following video was referenced while making this diary and is the source of # References These notes are old and I did not rigorously hoard references back then. If some part of this content is yours or you know where it's from, then do reach out to me and I'll update it. 1. Professor [Kannan Srinathan's](https://www.iiit.ac.in/people/faculty/srinathan/) course on Algorithm Analysis & Design in IIIT-H -2. [Huffman Codes: An Information Theory Perspective - Reducible](https://youtu.be/B3y0RsVCyrw?si=5kFqRPa_XsGxpqBr) - ---- - -## Related Reading - -- [More Greedy Algorithms! Kruskal's & Disjoint Set Union](/blog/more-greedy-algorithms-kruskal-s-disjoint-set-union) -- [Set Cover & Approximation Algorithms](/blog/set-cover-approximation-algorithms) -- [DP as DAGs, Shortest path on DAGs & LIS in O(nlogn)](/blog/dp-as-dags-shortest-path-on-dags-lis-in-o-nlogn) +2. [Huffman Codes: An Information Theory Perspective - Reducible](https://youtu.be/B3y0RsVCyrw?si=5kFqRPa_XsGxpqBr) \ No newline at end of file diff --git a/content/blog/chain-matrix-multiplication.md b/content/blog/chain-matrix-multiplication.md index d3538c0..4812997 100644 --- a/content/blog/chain-matrix-multiplication.md +++ b/content/blog/chain-matrix-multiplication.md @@ -117,11 +117,4 @@ In this DP solution, we cannot simply construct a DAG structure and find the lon Last time, we were able to reduce the space complexity of our DP by realizing that the DP only relied on the states of the DP solution exactly **one** level below the current level. However, here we realize that this is sadly not the case. The solution at some node $[i, j]$ is very much reliant on every level below it. 1D row optimization, etc., does not seem to be of much use here. There is also no monotonicity that can be exploited to make the linear computation at some node logarithmic, similar to how we did with LIS. Hence I do not think there is a better way to solve this problem. # References These notes are old and I did not rigorously hoard references back then. If some part of this content is yours or you know where it's from, then do reach out to me and I'll update it. -1. Professor [Kannan Srinathan's](https://www.iiit.ac.in/people/faculty/srinathan/) course on Algorithm Analysis & Design in IIIT-H - ---- - -## Related Reading - -- [DP as DAGs, Shortest path on DAGs & LIS in O(nlogn)](/blog/dp-as-dags-shortest-path-on-dags-lis-in-o-nlogn) -- [A Deep Dive into the Knapsack Problem](/blog/a-deep-dive-into-the-knapsack-problem) +1. Professor [Kannan Srinathan's](https://www.iiit.ac.in/people/faculty/srinathan/) course on Algorithm Analysis & Design in IIIT-H \ No newline at end of file diff --git a/content/blog/covered-calls-strategy.md b/content/blog/covered-calls-strategy.md index 887a429..5f3a31a 100644 --- a/content/blog/covered-calls-strategy.md +++ b/content/blog/covered-calls-strategy.md @@ -34,11 +34,4 @@ The premiums received from selling call options are generally considered short-t ## Share Holding Power Reduction If the stock price fluctuates significantly, even if it remains flat over the long term, the covered call strategy may result in a gradual reduction in the number of shares held due to the exercise of call options. 
This can lead to realizing losses, despite the overall position being flat. ## Risk-Adjusted Return Considerations -While some people argue that covered call strategies may offer superior risk-adjusted returns, meaning higher returns per unit of risk taken, this claim is subject to debate and may depend on the specific market conditions and underlying assets involved. - ---- - -## Related Reading - -- [Efficient-Market-Hypothesis](/blog/efficient-market-hypothesis) -- [What is the Stock Market?](/blog/what-is-the-stock-market) +While some people argue that covered call strategies may offer superior risk-adjusted returns, meaning higher returns per unit of risk taken, this claim is subject to debate and may depend on the specific market conditions and underlying assets involved. \ No newline at end of file diff --git a/content/blog/derivatives-options.md b/content/blog/derivatives-options.md index b8b3f7c..2ccaea4 100644 --- a/content/blog/derivatives-options.md +++ b/content/blog/derivatives-options.md @@ -157,11 +157,3 @@ The Indian National Stock Exchange (since July 2018 expiry contracts), uses phys 4. [Physical Delivery - Investopedia](https://www.investopedia.com/terms/p/physicaldelivery.asp) 5. [Policy on settlement of compulsory delivery derivative contracts — Update Oct 2019 - Zerodha](https://zerodha.com/z-connect/general/policy-on-settlement-of-compulsory-delivery-derivative-contracts-update-oct-2019) 6. [Physical delivery of stock F&O & their risks - Zerodha](https://zerodha.com/z-connect/general/physical-delivery-of-stock-fo-their-risks) - - ---- - -## Related Reading - -- [Covered Calls Strategy](/blog/covered-calls-strategy) -- [Efficient-Market-Hypothesis](/blog/efficient-market-hypothesis) diff --git a/content/blog/dp-as-dags-shortest-path-on-dags-lis-in-o-nlogn.md b/content/blog/dp-as-dags-shortest-path-on-dags-lis-in-o-nlogn.md index af122ff..dca53d2 100644 --- a/content/blog/dp-as-dags-shortest-path-on-dags-lis-in-o-nlogn.md +++ b/content/blog/dp-as-dags-shortest-path-on-dags-lis-in-o-nlogn.md @@ -190,12 +190,4 @@ int lis(int arr[], int n) { # References These notes are old and I did not rigorously hoard references back then. If some part of this content is yours or you know where it's from, then do reach out to me and I'll update it. 1. Professor [Kannan Srinathan's](https://www.iiit.ac.in/people/faculty/srinathan/) course on Algorithm Analysis & Design in IIIT-H -2. [Huffman Codes: An Information Theory Perspective - Reducible](https://youtu.be/B3y0RsVCyrw?si=5kFqRPa_XsGxpqBr) - ---- - -## Related Reading - -- [Levenshtein Edit Distance](/blog/levenshtein-edit-distance) -- [Chain Matrix Multiplication](/blog/chain-matrix-multiplication) -- [A Deep Dive into the Knapsack Problem](/blog/a-deep-dive-into-the-knapsack-problem) +2. [Huffman Codes: An Information Theory Perspective - Reducible](https://youtu.be/B3y0RsVCyrw?si=5kFqRPa_XsGxpqBr) \ No newline at end of file diff --git a/content/blog/getting-rid-of-short-form-content.md b/content/blog/getting-rid-of-short-form-content.md index d5dd22b..0d3a88e 100644 --- a/content/blog/getting-rid-of-short-form-content.md +++ b/content/blog/getting-rid-of-short-form-content.md @@ -1,7 +1,7 @@ --- author: Kishore Kumar date: 2024-06-25 14:25:53+0530 -doc: 2024-06-05 05:06:01+0530 +doc: 2025-05-18 17:07:02+0530 title: Getting Rid of Short-Form Content --- Social media platforms are universally competing to capture all our time and attention by spamming us with brain-rot short form content... and it's working. 
And unlike other forms of addiction, it's not even true that we get dopamine hits from consuming short form content. More often than not, we don't even realize how much time has passed while consuming tidbits of random brain rot. YouTube, for example, randomly starts on the Shorts page, and users don't even realize they're scrolling through shorts until much later. Below we'll quickly outline a few software fixes we can implement at the moment to try to purge short form content from our lives. Needless to say, if you're an iOS user, please chuck your phone. diff --git a/content/blog/levenshtein-edit-distance.md b/content/blog/levenshtein-edit-distance.md index 9b24bbc..bfc05f1 100644 --- a/content/blog/levenshtein-edit-distance.md +++ b/content/blog/levenshtein-edit-distance.md @@ -155,11 +155,3 @@ The single row optimized dp code for calculating the Levenshtein distance betwee These notes are old and I did not rigorously hoard references back then. If some part of this content is yours or you know where it's from, then do reach out to me and I'll update it. 1. Professor [Kannan Srinathan's](https://www.iiit.ac.in/people/faculty/srinathan/) course on Algorithm Analysis & Design in IIIT-H 2. [How do Spell Checkers work? Levenshtein Edit Distance - Creel](https://youtu.be/Cu7Tl7FGigQ?si=pFru3JaBAeKStvtz) (Excellent channel, do check him out. Has a lot of unique, amazing content!) - - ---- - -## Related Reading - -- [Chain Matrix Multiplication](/blog/chain-matrix-multiplication) -- [A Deep Dive into the Knapsack Problem](/blog/a-deep-dive-into-the-knapsack-problem) diff --git a/content/blog/more-greedy-algorithms-kruskal-s-disjoint-set-union.md b/content/blog/more-greedy-algorithms-kruskal-s-disjoint-set-union.md index 85cafe9..25b9b0b 100644 --- a/content/blog/more-greedy-algorithms-kruskal-s-disjoint-set-union.md +++ b/content/blog/more-greedy-algorithms-kruskal-s-disjoint-set-union.md @@ -161,11 +161,4 @@ Below are links to `C++` implementations of both the fully equipped Disjoint Set # References These notes are old and I did not rigorously hoard references back then. If some part of this content is yours or you know where it's from, then do reach out to me and I'll update it. 1. Professor [Kannan Srinathan's](https://www.iiit.ac.in/people/faculty/srinathan/) course on Algorithm Analysis & Design in IIIT-H -2. [Disjoint Set Union - cp-algorithms](https://cp-algorithms.com/data_structures/disjoint_set_union.html) - ---- - -## Related Reading - -- [Set Cover & Approximation Algorithms](/blog/set-cover-approximation-algorithms) -- [DP as DAGs, Shortest path on DAGs & LIS in O(nlogn)](/blog/dp-as-dags-shortest-path-on-dags-lis-in-o-nlogn) +2. [Disjoint Set Union - cp-algorithms](https://cp-algorithms.com/data_structures/disjoint_set_union.html) \ No newline at end of file diff --git a/content/blog/set-cover-approximation-algorithms.md b/content/blog/set-cover-approximation-algorithms.md index 26c9fcc..08541e4 100644 --- a/content/blog/set-cover-approximation-algorithms.md +++ b/content/blog/set-cover-approximation-algorithms.md @@ -131,10 +131,4 @@ Note that $n_t$ is the number of elements left at the $t^{th}$ iteration. Theref This **must** mean that the algorithm will terminate after $t=k\ln(n)$ iterations. Our algorithm picks exactly 1 set per iteration. This also implies that if our optimal answer is $k$, our greedy strategy will pick at most $k\ln(n)$ sets. Hence we have successfully managed to put a bound on the approximation. 
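To make the greedy strategy concrete, here is a minimal C++ sketch (illustrative, not from the original notes): it assumes the universe is $\{0, 1, \ldots, n-1\}$ and that the candidate sets are given as a `std::vector<std::set<int>>`, and each iteration simply picks whichever set covers the most still-uncovered elements.

```cpp
#include <bits/stdc++.h>
using namespace std;

// Greedy set cover sketch: repeatedly pick the set that covers the most
// still-uncovered elements. By the bound derived above, if the optimal
// cover uses k sets, this loop picks at most k*ln(n) of them.
vector<int> greedySetCover(int n, const vector<set<int>> &sets) {
    set<int> uncovered;
    for (int e = 0; e < n; e++) uncovered.insert(e);
    vector<int> picked; // indices of the chosen sets
    while (!uncovered.empty()) {
        int best = -1, bestGain = 0;
        for (int i = 0; i < (int)sets.size(); i++) {
            int gain = 0;
            for (int e : sets[i]) gain += uncovered.count(e);
            if (gain > bestGain) { bestGain = gain; best = i; }
        }
        if (best == -1) break; // remaining elements cannot be covered
        for (int e : sets[best]) uncovered.erase(e);
        picked.push_back(best);
    }
    return picked;
}
```

Each iteration is a linear scan over all the sets, so the whole thing runs in roughly $O(k\ln(n) \cdot \sum_i |S_i|)$ time; the point here is the picking rule, not efficiency. 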
# References These notes are old and I did not rigorously hoard references back then. If some part of this content is yours or you know where it's from, then do reach out to me and I'll update it. -1. Professor [Kannan Srinathan's](https://www.iiit.ac.in/people/faculty/srinathan/) course on Algorithm Analysis & Design in IIIT-H - ---- - -## Related Reading - -- [DP as DAGs, Shortest path on DAGs & LIS in O(nlogn)](/blog/dp-as-dags-shortest-path-on-dags-lis-in-o-nlogn) +1. Professor [Kannan Srinathan's](https://www.iiit.ac.in/people/faculty/srinathan/) course on Algorithm Analysis & Design in IIIT-H \ No newline at end of file diff --git a/content/blog/what-is-the-stock-market.md b/content/blog/what-is-the-stock-market.md index 995798a..205a88c 100644 --- a/content/blog/what-is-the-stock-market.md +++ b/content/blog/what-is-the-stock-market.md @@ -39,12 +39,3 @@ Soon, the Dutch East India Company took this concept to the next level. Instead 2. [Explained | The Stock Market | FULL EPISODE | Netflix](https://www.youtube.com/watch?v=ZCFkWDdmXG8&t=4s) 3. [Wikipedia - Euronext Amsterdam](https://en.wikipedia.org/wiki/Euronext_Amsterdam) 4. [How The Economic Machine Works by Ray Dalio](https://www.youtube.com/watch?v=PHe0bXAIuk0) - - ---- - -## Related Reading - -- [Derivatives - Options](/blog/derivatives-options) -- [Efficient-Market-Hypothesis](/blog/efficient-market-hypothesis) -- [Stock Multiples](/blog/stock-multiples) diff --git a/publisher-v2/config.yaml b/publisher-v2/config.yaml index 9c9d47d..65a191f 100644 --- a/publisher-v2/config.yaml +++ b/publisher-v2/config.yaml @@ -25,7 +25,6 @@ filters: # Feature Flags features: - enable_related_reading: true optimize_images: true # Image Optimization Settings diff --git a/publisher-v2/link_processor.py b/publisher-v2/link_processor.py index 46e4e37..196b679 100644 --- a/publisher-v2/link_processor.py +++ b/publisher-v2/link_processor.py @@ -14,7 +14,6 @@ class LinkProcessor: def __init__(self, config: Dict): self.config = config - self.enable_related = config['features'].get('enable_related_reading', True) def parameterize_title(self, title: str) -> str: """Convert note title to URL-safe slug""" @@ -120,53 +119,6 @@ def process_images(self, content: str) -> Tuple[str, Set[str]]: return processed_content, image_deps - def generate_related_section(self, frontmatter: Dict, note_slug: str, already_linked: Set[str] = None) -> str: - """ - Generate Related Reading section from frontmatter - - Args: - frontmatter: Note frontmatter dict - note_slug: URL slug of current note (to avoid self-reference) - already_linked: Set of note names already linked in content (to avoid duplicates) - - Returns: - Markdown string for related section - """ - if not self.enable_related: - return "" - - related = frontmatter.get('related', []) - if not related: - return "" - - if already_linked is None: - already_linked = set() - - # Filter out self-references, already-linked notes, and convert to links - related_links = [] - for item in related: - # Extract note name from wikilink format - note_name = item.strip('[]').split('|')[0] - slug = self.parameterize_title(note_name) - - # Skip self-references - if slug == note_slug: - continue - - # Skip notes already linked in content - if note_name in already_linked: - continue - - related_links.append(f"- [{note_name}](/blog/{slug})") - - if not related_links: - return "" - - section = "\n\n---\n\n## Related Reading\n\n" - section += "\n".join(related_links) - section += "\n" - - return section def main(): @@ -206,21 +158,6 
@@ def main(): print("After processing images:") print(processed) print(f"\nImage dependencies: {images}") - print("\n" + "="*60 + "\n") - - # Test related section - frontmatter = { - 'title': 'Test Note', - 'related': [ - '[[Another Note]]', - '[[Deep Dive into Algorithms]]', - '[[Complex Note Title]]' - ] - } - - related_section = processor.generate_related_section(frontmatter, 'test-note') - print("Generated related section:") - print(related_section) if __name__ == '__main__': diff --git a/publisher-v2/publisher.py b/publisher-v2/publisher.py index eb0b7a3..2a0eaef 100755 --- a/publisher-v2/publisher.py +++ b/publisher-v2/publisher.py @@ -185,11 +185,6 @@ def process_note(self, note_path: Path) -> bool: processed_content, referenced_notes = self.link_processor.process_wikilinks(content) processed_content, image_deps = self.link_processor.process_images(processed_content) - # Add related reading section (uses original frontmatter for 'related' field) - # Pass referenced_notes to avoid duplicating links already in content - if self.config['features']['enable_related_reading']: - related_section = self.link_processor.generate_related_section(frontmatter, slug, referenced_notes) - processed_content += related_section # Process and copy images self._process_images(image_deps, note_path) @@ -337,11 +332,12 @@ def _clean_orphaned_images(self): # Match ![alt](/images/name.webp) pattern matches = re.findall(r'!\[[^\]]*\]\(/images/([^)]+)\)', content) for match in matches: - # Add both the referenced file and its base name (for .png fallback) + # Add the referenced file referenced_images.add(match) - # Also add the .png version if .webp is referenced - if match.endswith('.webp'): - referenced_images.add(match.replace('.webp', '.png')) + # Also add all related formats (webp, png, jpg) for fallback handling + base_name = Path(match).stem + for ext in ['.webp', '.png', '.jpg', '.jpeg', '.gif']: + referenced_images.add(f"{base_name}{ext}") # Get all images in static/images existing_images = set() diff --git a/static/linkmap b/static/linkmap index 49373bc..2e6f478 100644 --- a/static/linkmap +++ b/static/linkmap @@ -1,62 +1,62 @@ -/blog/vectorization-analyzing-loop-dependencies/index.{html} /blog/vectorization-analyzing-loop-dependencies/ -/blog/derivatives-options/index.{html} /blog/derivatives-options/ -/blog/finite-and-infinite-sums/index.{html} /blog/finite-and-infinite-sums/ -/blog/theoretical-metrics-for-benchmarking/index.{html} /blog/theoretical-metrics-for-benchmarking/ -/blog/towards-a-completely-self-driven-dbms-challenges-ottertune-faced/index.{html} /blog/towards-a-completely-self-driven-dbms-challenges-ottertune-faced/ -/blog/log-structured-merge-tree-lsm-trees/index.{html} /blog/log-structured-merge-tree-lsm-trees/ -/blog/riemann-series-rearrangement/index.{html} /blog/riemann-series-rearrangement/ -/blog/the-fast-fourier-transform-fft/index.{html} /blog/the-fast-fourier-transform-fft/ +/blog/defining-computational-problems/index.{html} /blog/defining-computational-problems/ +/blog/network-flow-algorithms-ford-fulkerson/index.{html} /blog/network-flow-algorithms-ford-fulkerson/ +/blog/randomization-primality-testing-algorithms/index.{html} /blog/randomization-primality-testing-algorithms/ +/blog/shortest-common-superstring-de-brujin-graphs/index.{html} /blog/shortest-common-superstring-de-brujin-graphs/ +/blog/the-black-scholes-merton-equation/index.{html} /blog/the-black-scholes-merton-equation/ 
/blog/what-goes-around-comes-around-the-history-of-database-systems-part-1-1960-2000/index.{html} /blog/what-goes-around-comes-around-the-history-of-database-systems-part-1-1960-2000/ -/index.html / -/blog/a-deep-dive-into-the-knapsack-problem/index.{html} /blog/a-deep-dive-into-the-knapsack-problem/ +/blog/are-there-computational-problems-that-computers-cannot-solve/index.{html} /blog/are-there-computational-problems-that-computers-cannot-solve/ /blog/getting-rid-of-short-form-content/index.{html} /blog/getting-rid-of-short-form-content/ /blog/new-and-delete/index.{html} /blog/new-and-delete/ -/blog/parallelism-with-omp/index.{html} /blog/parallelism-with-omp/ -/blog/profiling-a-program/index.{html} /blog/profiling-a-program/ -/blog/reasoning-acting-and-learning-a-survey-of-single-agent-llm-patterns/index.{html} /blog/reasoning-acting-and-learning-a-survey-of-single-agent-llm-patterns/ -/blog/shortest-reliable-path-floyd-warshall-max-independent-set-tree/index.{html} /blog/shortest-reliable-path-floyd-warshall-max-independent-set-tree/ -/blog/defining-computational-problems/index.{html} /blog/defining-computational-problems/ -/blog/knapsack-using-branch-and-bounding/index.{html} /blog/knapsack-using-branch-and-bounding/ -/blog/public-key-cryptography-coming-up-with-rsa/index.{html} /blog/public-key-cryptography-coming-up-with-rsa/ -/blog/how-to-analyze-algorithms-proving-a-lower-bound-for-comparison-based-sorting/index.{html} /blog/how-to-analyze-algorithms-proving-a-lower-bound-for-comparison-based-sorting/ -/blog/a-c-programmer-s-guide-to-c/index.{html} /blog/a-c-programmer-s-guide-to-c/ -/blog/booyer-moore-knuth-morris-pratt-for-exact-matching/index.{html} /blog/booyer-moore-knuth-morris-pratt-for-exact-matching/ -/blog/dna-sequencing/index.{html} /blog/dna-sequencing/ -/blog/how-the-economic-machine-works-by-ray-dalio/index.{html} /blog/how-the-economic-machine-works-by-ray-dalio/ -/blog/complexity-theory-reductions/index.{html} /blog/complexity-theory-reductions/ -/blog/extended-euclidean-algorithm/index.{html} /blog/extended-euclidean-algorithm/ -/blog/p-vs-np/index.{html} /blog/p-vs-np/ /blog/set-cover-approximation-algorithms/index.{html} /blog/set-cover-approximation-algorithms/ -/blog/analyzing-fibonacci-karatsuba-multiplication/index.{html} /blog/analyzing-fibonacci-karatsuba-multiplication/ -/blog/church-turing-hypothesis/index.{html} /blog/church-turing-hypothesis/ -/blog/smart-pointers/index.{html} /blog/smart-pointers/ -/blog/are-there-computational-problems-that-computers-cannot-solve/index.{html} /blog/are-there-computational-problems-that-computers-cannot-solve/ +/blog/wilson-s-theorem-fermat-s-little-theorem-euler-s-totient-function/index.{html} /blog/wilson-s-theorem-fermat-s-little-theorem-euler-s-totient-function/ +/blog/amdahl-s-law-gustafson-s-law/index.{html} /blog/amdahl-s-law-gustafson-s-law/ /blog/flynn-s-taxonomy/index.{html} /blog/flynn-s-taxonomy/ -/blog/levenshtein-edit-distance/index.{html} /blog/levenshtein-edit-distance/ -/blog/shortest-common-superstring-de-brujin-graphs/index.{html} /blog/shortest-common-superstring-de-brujin-graphs/ +/blog/riemann-series-rearrangement/index.{html} /blog/riemann-series-rearrangement/ /blog/stock-multiples/index.{html} /blog/stock-multiples/ -/blog/what-is-the-stock-market/index.{html} /blog/what-is-the-stock-market/ -/blog/wilson-s-theorem-fermat-s-little-theorem-euler-s-totient-function/index.{html} /blog/wilson-s-theorem-fermat-s-little-theorem-euler-s-totient-function/ +/blog/dna-sequencing/index.{html} 
/blog/dna-sequencing/ +/blog/more-greedy-algorithms-kruskal-s-disjoint-set-union/index.{html} /blog/more-greedy-algorithms-kruskal-s-disjoint-set-union/ +/blog/shortest-reliable-path-floyd-warshall-max-independent-set-tree/index.{html} /blog/shortest-reliable-path-floyd-warshall-max-independent-set-tree/ /blog/basics-of-supervised-learning-linear-regression/index.{html} /blog/basics-of-supervised-learning-linear-regression/ -/blog/multi-agent-systems-harnessing-collective-intelligence-a-survey/index.{html} /blog/multi-agent-systems-harnessing-collective-intelligence-a-survey/ -/blog/ottertune-automatic-database-management-system-tuning-through-large-scale-machine-learning/index.{html} /blog/ottertune-automatic-database-management-system-tuning-through-large-scale-machine-learning/ -/blog/brent-s-theorem-task-level-parallelism/index.{html} /blog/brent-s-theorem-task-level-parallelism/ -/blog/the-black-scholes-merton-equation/index.{html} /blog/the-black-scholes-merton-equation/ -/blog/de-novo-assembly-overlap-graphs/index.{html} /blog/de-novo-assembly-overlap-graphs/ -/blog/the-economics-of-foreign-exchange-forex/index.{html} /blog/the-economics-of-foreign-exchange-forex/ -/blog/in-search-of-an-understandable-consensus-algorithm-extended-version-raft/index.{html} /blog/in-search-of-an-understandable-consensus-algorithm-extended-version-raft/ +/blog/levenshtein-edit-distance/index.{html} /blog/levenshtein-edit-distance/ +/blog/log-structured-merge-tree-lsm-trees/index.{html} /blog/log-structured-merge-tree-lsm-trees/ +/blog/smart-pointers/index.{html} /blog/smart-pointers/ +/blog/introduction-to-complexity-theory/index.{html} /blog/introduction-to-complexity-theory/ +/blog/vectorization-analyzing-loop-dependencies/index.{html} /blog/vectorization-analyzing-loop-dependencies/ +/blog/dp-as-dags-shortest-path-on-dags-lis-in-o-nlogn/index.{html} /blog/dp-as-dags-shortest-path-on-dags-lis-in-o-nlogn/ +/blog/finite-and-infinite-sums/index.{html} /blog/finite-and-infinite-sums/ +/blog/technical-analysis/index.{html} /blog/technical-analysis/ /blog/activity-selection-huffman-encoding/index.{html} /blog/activity-selection-huffman-encoding/ -/blog/amdahl-s-law-gustafson-s-law/index.{html} /blog/amdahl-s-law-gustafson-s-law/ +/blog/derivatives-options/index.{html} /blog/derivatives-options/ +/blog/p-vs-np/index.{html} /blog/p-vs-np/ +/blog/a-deep-dive-into-the-knapsack-problem/index.{html} /blog/a-deep-dive-into-the-knapsack-problem/ /blog/chain-matrix-multiplication/index.{html} /blog/chain-matrix-multiplication/ -/blog/network-flow-algorithms-ford-fulkerson/index.{html} /blog/network-flow-algorithms-ford-fulkerson/ -/blog/randomization-primality-testing-algorithms/index.{html} /blog/randomization-primality-testing-algorithms/ +/blog/how-the-economic-machine-works-by-ray-dalio/index.{html} /blog/how-the-economic-machine-works-by-ray-dalio/ +/blog/in-search-of-an-understandable-consensus-algorithm-extended-version-raft/index.{html} /blog/in-search-of-an-understandable-consensus-algorithm-extended-version-raft/ +/blog/master-s-theorem-strassen-s-matrix-multiplication-kth-order-statistics/index.{html} /blog/master-s-theorem-strassen-s-matrix-multiplication-kth-order-statistics/ +/blog/theoretical-metrics-for-benchmarking/index.{html} /blog/theoretical-metrics-for-benchmarking/ +/index.html / +/blog/a-c-programmer-s-guide-to-c/index.{html} /blog/a-c-programmer-s-guide-to-c/ +/blog/knapsack-using-branch-and-bounding/index.{html} /blog/knapsack-using-branch-and-bounding/ 
+/blog/complexity-theory-reductions/index.{html} /blog/complexity-theory-reductions/ +/blog/towards-a-completely-self-driven-dbms-challenges-ottertune-faced/index.{html} /blog/towards-a-completely-self-driven-dbms-challenges-ottertune-faced/ +/blog/algorithms-for-approximate-string-matching-alignment/index.{html} /blog/algorithms-for-approximate-string-matching-alignment/ +/blog/analyzing-fibonacci-karatsuba-multiplication/index.{html} /blog/analyzing-fibonacci-karatsuba-multiplication/ +/blog/booyer-moore-knuth-morris-pratt-for-exact-matching/index.{html} /blog/booyer-moore-knuth-morris-pratt-for-exact-matching/ +/blog/reasoning-acting-and-learning-a-survey-of-single-agent-llm-patterns/index.{html} /blog/reasoning-acting-and-learning-a-survey-of-single-agent-llm-patterns/ +/blog/what-is-the-stock-market/index.{html} /blog/what-is-the-stock-market/ /blog/building-a-type-safe-tool-framework-for-llms-in-scala/index.{html} /blog/building-a-type-safe-tool-framework-for-llms-in-scala/ +/blog/church-turing-hypothesis/index.{html} /blog/church-turing-hypothesis/ +/blog/mega-project-kblas-writing-a-benchmark-library-in-c-optimizing-l1-l2-basic-linear-algebra-subprograms/index.{html} /blog/mega-project-kblas-writing-a-benchmark-library-in-c-optimizing-l1-l2-basic-linear-algebra-subprograms/ +/blog/public-key-cryptography-coming-up-with-rsa/index.{html} /blog/public-key-cryptography-coming-up-with-rsa/ +/blog/brent-s-theorem-task-level-parallelism/index.{html} /blog/brent-s-theorem-task-level-parallelism/ /blog/covered-calls-strategy/index.{html} /blog/covered-calls-strategy/ -/blog/dp-as-dags-shortest-path-on-dags-lis-in-o-nlogn/index.{html} /blog/dp-as-dags-shortest-path-on-dags-lis-in-o-nlogn/ +/blog/how-to-analyze-algorithms-proving-a-lower-bound-for-comparison-based-sorting/index.{html} /blog/how-to-analyze-algorithms-proving-a-lower-bound-for-comparison-based-sorting/ +/blog/multi-agent-systems-harnessing-collective-intelligence-a-survey/index.{html} /blog/multi-agent-systems-harnessing-collective-intelligence-a-survey/ /interests/index.{html} /interests/ -/blog/algorithms-for-approximate-string-matching-alignment/index.{html} /blog/algorithms-for-approximate-string-matching-alignment/ -/blog/introduction-to-complexity-theory/index.{html} /blog/introduction-to-complexity-theory/ -/blog/master-s-theorem-strassen-s-matrix-multiplication-kth-order-statistics/index.{html} /blog/master-s-theorem-strassen-s-matrix-multiplication-kth-order-statistics/ -/blog/more-greedy-algorithms-kruskal-s-disjoint-set-union/index.{html} /blog/more-greedy-algorithms-kruskal-s-disjoint-set-union/ -/blog/mega-project-kblas-writing-a-benchmark-library-in-c-optimizing-l1-l2-basic-linear-algebra-subprograms/index.{html} /blog/mega-project-kblas-writing-a-benchmark-library-in-c-optimizing-l1-l2-basic-linear-algebra-subprograms/ -/blog/technical-analysis/index.{html} /blog/technical-analysis/ +/blog/the-fast-fourier-transform-fft/index.{html} /blog/the-fast-fourier-transform-fft/ +/blog/extended-euclidean-algorithm/index.{html} /blog/extended-euclidean-algorithm/ +/blog/ottertune-automatic-database-management-system-tuning-through-large-scale-machine-learning/index.{html} /blog/ottertune-automatic-database-management-system-tuning-through-large-scale-machine-learning/ +/blog/parallelism-with-omp/index.{html} /blog/parallelism-with-omp/ +/blog/profiling-a-program/index.{html} /blog/profiling-a-program/ +/blog/the-economics-of-foreign-exchange-forex/index.{html} /blog/the-economics-of-foreign-exchange-forex/ 
+/blog/de-novo-assembly-overlap-graphs/index.{html} /blog/de-novo-assembly-overlap-graphs/