
Commit a948c3e

Uploading OMSCS ML notes so far
1 parent 99ad69f commit a948c3e

21 files changed, +439 -0 lines changed
Lines changed: 77 additions & 0 deletions
@@ -0,0 +1,77 @@
---
tags:
- OMSCS
- ML
---
# Information Theory

## Random Variables and Probability
- A variable is an object that can take on any value from a set of values.
- Can be discrete or continuous
- Random variables have unpredictable values
- A value that a random variable has taken on is called a "trial"
- A collection of trials is called a "sample"
- Every random variable has a probability distribution (discrete) or probability density (continuous)
- Random variables may be dependent on each other, or they may be independent.
- A joint distribution tells us everything about the co-occurrence of 2 different variables.
- [[Module 05 - Probability]]
- [[Module 6 - Bayes Nets]]
- Marginalizing over $Y$ recovers the distribution of $X$ alone:

$$
P(X) = \sum_{y \in \Omega_Y}P(X,Y=y)
$$

- Variables are independent iff: $P(X,Y)=P(X)*P(Y)$
- Dependent variables use the conditional distribution: $P(Y|X)=\frac{P(X,Y)}{P(X)}$
- This leads to Bayes' rule: $P(X|Y)=P(Y|X)\frac{P(X)}{P(Y)}$
- It's possible to do this with more than 2 variables.

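A small numerical check of these identities (the 2x2 joint table here is made up for illustration, not from the lecture):

```python
import numpy as np

# Hypothetical joint distribution P(X, Y) over two binary variables.
# Rows index x, columns index y; entries sum to 1.
P_xy = np.array([[0.30, 0.10],
                 [0.20, 0.40]])

P_x = P_xy.sum(axis=1)             # marginal P(X) = sum_y P(X, Y=y)
P_y = P_xy.sum(axis=0)             # marginal P(Y)
P_y_given_x = P_xy / P_x[:, None]  # conditional P(Y|X) = P(X,Y) / P(X)

# Bayes' rule: P(X|Y) = P(Y|X) * P(X) / P(Y)
P_x_given_y = P_y_given_x * P_x[:, None] / P_y[None, :]

# Same answer as conditioning the joint directly on Y.
assert np.allclose(P_x_given_y, P_xy / P_y[None, :])
```
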
## Moments
- average = mean = expected value
- "avg/mean/E.V. of X": $E[X]$
- Variance measures the variation of values above/below the mean: $Var(X)=E[X^2]-E[X]^2$
- The standard deviation: $\sigma(X)=\sqrt{Var(X)}$
- Variables have many "moments". The $k$-th moment is: $E[X^k]$
- Central moment: $E[(X-E[X])^k]$
    - Variance is the second central moment of $X$
- Normalized central moment: $\frac{E[(X-E[X])^k]}{\sigma(X)^k}$
    - The fourth normalized central moment is called "kurtosis", which measures the "peakedness" of a distribution.

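A quick sanity check of these definitions in numpy (the Gaussian sample is just an illustration; scipy is used only to cross-check the kurtosis):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=100_000)  # illustrative sample

mean = x.mean()                              # E[X]
var  = (x**2).mean() - mean**2               # E[X^2] - E[X]^2
kurt = ((x - mean)**4).mean() / x.std()**4   # 4th normalized central moment

print(mean, var, kurt)                       # ~2, ~9, ~3 for a Gaussian
# scipy's kurtosis() subtracts 3 ("excess kurtosis") unless fisher=False
print(stats.kurtosis(x, fisher=False))       # matches kurt above
```
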
## Entropy
- Fundamental measure in information theory
- Captures the amount of randomness or uncertainty in a variable

$$
H(X) = -E[\text{log}P(X)] = -\sum_{x \in \Omega_X}P(X=x)\text{log}P(X=x)
$$
(logarithm base is usually 2)

- Measure of the average length of a message that would have to be sent to describe a sample
- Fair coin:
    - $H(X)=-(0.5 \space\text{log}\space 0.5 + 0.5 \space\text{log}\space 0.5) = 1$.
    - 100 flips requires 100 bits.
- One-sided coin:
    - $H(X) = -(1 \space\text{log}\space 1 + 0\space\text{log}\space 0)=0$.
    - 100 flips requires 0 bits.
- 75% coin:
    - $H(X)=-(0.75\space\text{log}\space0.75 + 0.25\space\text{log}\space0.25)=0.8113$.
    - 100 flips requires about 82 bits (see the sketch at the end of this section).
- Just because you only need $M$ bits to describe a sample doesn't mean it's easy to formulate the message required to describe the sample.
    - "There exists a coder that can construct messages of length $H(X)+1$."
- Joint entropy
    - $H(X,Y)=-E_X[E_Y[\text{log}(\space P(X,Y) \space)]]$
    - $$H(X,Y)=-\sum_{x \in \Omega_X, y \in \Omega_Y}P(X=x,Y=y) \space\text{log}(\space P(X=x,Y=y)\space)$$
- Conditional entropy
    - $H(Y|X)=-E_X[E_Y[\text{log}(\space P(Y|X) \space)]]$
    - $$H(Y|X)=-\sum_{x \in \Omega_X, y \in \Omega_Y}P(X=x,Y=y) \space\text{log}(\space P(Y=y|X=x)\space)$$

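A quick check of the coin numbers above, using base-2 logarithms as noted (a minimal sketch, not lecture code):

```python
import numpy as np

def entropy_bits(p):
    """Entropy in bits of a distribution given as an array of probabilities."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                       # 0 log 0 is taken as 0
    return float(-(p * np.log2(p)).sum())

print(entropy_bits([0.5, 0.5]))        # 1.0    -> 100 fair flips ~ 100 bits
print(entropy_bits([1.0, 0.0]))        # 0.0    -> one-sided coin needs 0 bits
print(entropy_bits([0.75, 0.25]))      # 0.8113 -> 100 flips ~ 82 bits
```
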
## Mutual Information
- Conditional entropy can tell us when 2 variables are completely independent
    - But it is not an adequate measure of dependence on its own.
    - "A small value for $H(Y|X)$ implies that $X$ tells us a great deal about $Y$ or that $H(Y)$ is small to begin with."
- We measure dependence using mutual information: $I(X,Y)=H(Y)-H(Y|X)$
    - A measure of the reduction in randomness of a variable given knowledge of another variable.
- Equivalent forms:
    - $I(X,Y)=H(Y)-H(Y|X)$
    - $I(X,Y)=H(X)-H(X|Y)$
    - $I(X,Y)=H(X)+H(Y)-H(X,Y)$
    - $I(X,Y)=I(Y,X)$

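A minimal sketch (again with a made-up joint table) confirming $I(X,Y)=H(X)+H(Y)-H(X,Y)$:

```python
import numpy as np

def H(p):
    """Entropy in bits of the probabilities in p (1-D or 2-D, zeros ignored)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# Hypothetical joint distribution over two binary variables.
P_xy = np.array([[0.30, 0.10],
                 [0.20, 0.40]])
P_x, P_y = P_xy.sum(axis=1), P_xy.sum(axis=0)

I = H(P_x) + H(P_y) - H(P_xy)   # I(X,Y) = H(X) + H(Y) - H(X,Y)
print(I)  # > 0, so X and Y are dependent; it would be 0 for independent variables
```
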
Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
---
tags:
- OMSCS
- ML
---
# No Free Lunch Theorem

- An impossibility theorem
- A general-purpose, universal optimization strategy is impossible
- "the only way one strategy can outperform another is if it is specialized to the structure of the specific problem under consideration"

One can build a matrix where the rows are optimization algorithms and the columns are optimization problems. The cells of the matrix encode the effectiveness of each optimization algorithm on each optimization problem.

The paper posits that each row has the same average performance.

The effectiveness of a particular optimization algorithm is therefore meaningless without considering the set of optimization problems you hope to apply it to.

Lines changed: 132 additions & 0 deletions
@@ -0,0 +1,132 @@
---
tags:
- OMSCS
- ML
---
# SL04 - Instance Based Learning

- Other Supervised Learning algorithms train a model based on the data, then throw the data away.
- In IBL, we put the data into a database and then look it up when we need to make a prediction for a new datapoint.

- Simple Database
    - Just store the examples, look them up when asked
    - Advantages
        - Reliable / dependable ((X, Y) -> DB, Lookup(X) -> Y)
        - Fast (to "train", there's basically no training)
        - Simple
    - Disadvantages
        - No generalization
        - Overfitting (querying datapoints that had mistakes always yields the same mistake)

## Cost of a House
- Have a DB of house costs
    - size
    - date sold
    - price of property when sold
    - location
    - zip code
- Nearest Neighbor
    - Find the nearest existing datapoint, use that cost
    - Falls apart if the unclassified datapoint is too far from a neighbor
- K Nearest Neighbor (KNN)
    - Take the $K$ nearest existing datapoints
    - Take the average of those $K$ datapoints
    - Can (should?) be a weighted average based on "distance"/"similarity"
    - The weighting function used is a "hyperparameter" of KNN (see the weighted sketch after the example below)
    - The weighting function can be as simple as $1/k$ (unweighted)

## Comparison

![[Pasted image 20250128103204.png]]

- Do all the work upfront? (eager learner)
- Do all the work on the backend (at query time)? (lazy learner)
- Combination of approaches? No reason why you can't "cache" the result via a linear regression.

## KNN Example

![[Pasted image 20250128103648.png]]

```python
import sklearn.neighbors

X = [[1, 6], [2, 4], [3, 7], [6, 8], [7, 1], [8, 4]]
Y = [7, 8, 13, 44, 50, 68]
Q = [[4, 2]]

# We need to use KNeighborsRegressor. KNeighborsClassifier
# uses "majority vote" or "weighted majority vote".
# KNeighborsRegressor returns the "average" or "weighted avg".

sklearn.neighbors\
    .KNeighborsRegressor(1, metric='euclidean')\
    .fit(X, Y)\
    .predict(Q)
# 8

sklearn.neighbors\
    .KNeighborsRegressor(3, metric='euclidean')\
    .fit(X, Y)\
    .predict(Q)
# 42

sklearn.neighbors\
    .KNeighborsRegressor(1, metric='manhattan')\
    .fit(X, Y)\
    .predict(Q)
# 8

sklearn.neighbors\
    .KNeighborsRegressor(3, metric='manhattan')\
    .fit(X, Y)\
    .predict(Q)
# 23.66666667
```

| $d()$     | $K$ | `sklearn`        | Empirical | Notes                                   |
| --------- | --- | ---------------- | --------- | --------------------------------------- |
| Euclidean | 1   | $8$              |           |                                         |
| Euclidean | 3   | $42$             |           |                                         |
| Manhattan | 1   | $8$              | 29        | (4,2) is equidistant to (2,4) and (7,1) |
| Manhattan | 3   | $23 \frac{2}{3}$ | 35.5      | (4,2) is equidistant to (3,7) and (8,4) |

> Regarding the Nearest Neighbors algorithms, if it is found that two neighbors, neighbor `k+1` and `k`, have identical distances but different labels, the results will depend on the ordering of the training data.

![[Pasted image 20250128110752.png]]

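The weighting-function hyperparameter mentioned earlier can be tried on the same data. A minimal sketch using sklearn's built-in inverse-distance weighting (`weights='distance'`); this is not part of the lecture's numbers above:

```python
import sklearn.neighbors

X = [[1, 6], [2, 4], [3, 7], [6, 8], [7, 1], [8, 4]]
Y = [7, 8, 13, 44, 50, 68]
Q = [[4, 2]]

# weights='distance' weights each neighbor by 1/distance instead of 1/k,
# so closer neighbors pull the prediction toward their own value.
sklearn.neighbors\
    .KNeighborsRegressor(3, metric='euclidean', weights='distance')\
    .fit(X, Y)\
    .predict(Q)
# lower than the unweighted 42, pulled toward 8 because (2, 4) is the closest neighbor
```
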
## KNN Bias
Preference Bias:
- Locality -> Near Points are Similar
- Smoothness -> Averaging
- **All features matter equally**

## Curse of Dimensionality
> As the number of features or dimensions grows, the amount of data we need to generalize accurately grows exponentially.

Intuition says "let's add more features, that'll help it classify better". In reality, that makes the problem worse if you have insufficient data.

If you have one dimension and $N$ datapoints, each datapoint covers $1/N$ of the space. If you add another dimension, you need $N^2$ datapoints in order to achieve the same coverage. If you add a 3rd dimension, you need $N^3$ datapoints.

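A small illustration of the $N$, $N^2$, $N^3$ argument (uniform random points chosen for illustration, not lecture data):

```python
import numpy as np

N = 10  # points needed to cover [0, 1] at spacing 0.1 in one dimension

for d in [1, 2, 3]:
    # A regular grid with spacing 0.1 in d dimensions needs N**d points.
    grid_points = N ** d

    # With only N random points in d dimensions, the space is mostly empty:
    rng = np.random.default_rng(0)
    pts = rng.uniform(size=(N, d))
    # average distance from each point to its nearest neighbor
    dists = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)
    print(d, grid_points, dists.min(axis=1).mean())

# The grid size grows 10 -> 100 -> 1000, and the nearest-neighbor gaps
# between the same 10 points grow as the dimension increases.
```
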
![[Pasted image 20250128204050.png]]

Weighting different dimensions differently can help with the curse of dimensionality.

## Other Stuff
- Distance functions
    - Euclidean
    - Manhattan
    - Hamming
- Weighted vs Unweighted distances
- What's the best value for $K$?
- Weighted vs unweighted average
- Locally weighted regression
    - Locally weighted linear regression
    - Locally weighted quadratic regression
    - Locally weighted $WHATEVER regression

## Summary
- lazy vs eager learning
- KNN
- similarity = distance
- classification vs regression
- averaging
- domain knowledge matters

Lines changed: 83 additions & 0 deletions
@@ -0,0 +1,83 @@
---
tags:
- OMSCS
- ML
---
# SL05 - Ensemble Learning and Boosting

- Spam email
    - is it spam or not?
- Come up with some simple rules for classifying if it's spam
    - from spouse? not spam
    - body contains "manly"? spam
    - short? spam
    - just URLs? spam
    - just an image? spam
    - misspelled words? spam
    - blocklist of words? spam
    - "make money fast"? spam
- Each simple rule is not good enough on its own
- Combine the simple rules into a complex rule that works well enough on its own
- Learn over subsets of the data to generate those simple rules

## Algorithm
- What is this notion of "combine"?
- How do we pick subsets?

![[Pasted image 20250128210750.png]]

## Bagging
- This is called Bagging (Bootstrap Aggregation)
    - Take some random subset of the data
    - Train over that data
    - Keep doing it
    - Take the average result of the models (see the sketch after the figures below)

![[Pasted image 20250128210950.png]]

![[Pasted image 20250128211015.png]]

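A minimal sketch of the bagging loop above, assuming third-degree polynomial fits as the "simple" learners and toy noisy data (neither is the lecture's example):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 40)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)  # noisy toy data

models = []
for _ in range(10):
    idx = rng.integers(0, x.size, size=x.size)       # bootstrap: sample with replacement
    models.append(np.polyfit(x[idx], y[idx], deg=3))  # train a simple learner on the subset

# "Combine" = average the individual models' predictions.
bagged = np.mean([np.polyval(c, x) for c in models], axis=0)
```
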
## Boosting
- Similar to bagging, but:
    - take the "hardest" examples (instead of a random subset)
    - perform a weighted mean (instead of a simple average)
- Error calculations should be based on the "likelihood" of each data point
    - Which examples are important to learn? Which examples aren't important to learn?
- "Weak" learners
    - Does better than chance
    - Expected error is always less than half
- Given training data $(x_i, y_i)$ where $y_i \in \{-1, +1\}$
- For $t=1$ to $T$
    - Construct distribution $D_t$
    - Find weak classifier $h_t(x)$ with small error $\epsilon_t = P_{D_t}[h_t(x_i) \ne y_i]$
    - This looks crazy, but we're essentially doing a weighted average.
- Output $H_{\text{final}}$

- Start off with a uniform distribution: $D_1(i)=\frac{1}{n}$

$$
D_{t+1}(i)=D_t(i)e^{-\alpha_ty_ih_t(x_i)}Z_t^{-1}
$$
$$
\alpha_t=\frac{1}{2}\ln\left(\frac{1-\epsilon_t}{\epsilon_t}\right)
$$

When $h_t(x_i)=y_i$, $D_{t+1}(i) \le D_t(i)$ in most cases: it usually goes down, and sometimes it stays the same, depending on how the rest of the distribution is affected.

When $h_t(x_i) \ne y_i$, $D_{t+1}(i) \gt D_t(i)$: it always increases, putting more weight on the examples the learner got wrong.

$$
H_{\text{final}}=\text{sgn}\left(\sum_t\alpha_th_t(x)\right)
$$

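A compact sketch of this loop, assuming decision stumps (axis-aligned thresholds) as the weak learners; the stump search here is my own illustration, not code from the lecture:

```python
import numpy as np

def adaboost(X, y, T=10):
    """X: (n, d) array of features, y: labels in {-1, +1}.
    Returns a list of (alpha_t, stump) pairs."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    n = len(y)
    D = np.full(n, 1.0 / n)                      # D_1(i) = 1/n
    ensemble = []
    for _ in range(T):
        # Weak learner: the axis-aligned threshold (decision stump)
        # with the smallest weighted error under D.
        best = None
        for j in range(X.shape[1]):
            for thresh in np.unique(X[:, j]):
                for sign in (+1, -1):
                    pred = sign * np.where(X[:, j] > thresh, 1, -1)
                    eps = D[pred != y].sum()     # eps_t = P_D[h_t(x_i) != y_i]
                    if best is None or eps < best[0]:
                        best = (eps, j, thresh, sign)
        eps, j, thresh, sign = best
        eps = float(np.clip(eps, 1e-10, 1 - 1e-10))  # avoid log(0) in degenerate cases
        alpha = 0.5 * np.log((1 - eps) / eps)        # alpha_t
        pred = sign * np.where(X[:, j] > thresh, 1, -1)
        D = D * np.exp(-alpha * y * pred)            # up-weight mistakes, down-weight hits
        D = D / D.sum()                              # Z_t: renormalize so D sums to 1
        ensemble.append((alpha, (j, thresh, sign)))
    return ensemble

def predict(ensemble, X):
    X = np.asarray(X, dtype=float)
    # H_final(x) = sgn( sum_t alpha_t h_t(x) )
    votes = sum(a * s * np.where(X[:, j] > t, 1, -1) for a, (j, t, s) in ensemble)
    return np.sign(votes)
```
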
## Three Little Boxes
![[Pasted image 20250128214602.png]]

- $H$ is the set of axis-aligned semi-planes
    - (Everything on one side of a line is in the range)

![[Pasted image 20250128215046.png]]

![[Pasted image 20250128215156.png]]
