
Commit a948c3e

Uploading OMSCS ML notes so far
1 parent 99ad69f commit a948c3e

21 files changed, +439 -0 lines changed
Lines changed: 77 additions & 0 deletions
@@ -0,0 +1,77 @@
---
tags:
- OMSCS
- ML
---
# Information Theory

## Random Variables and Probability
- A variable is an object that can take on any value from a set of values.
- Can be discrete or continuous
- Random variables have unpredictable values
- A value that a random variable has taken on is called a "trial"
- A collection of trials is called a "sample"
- Every random variable has a probability distribution (discrete) or probability density (continuous)
- Random variables may be dependent on each other, or they may be independent.
- A joint distribution tells us everything about the co-occurrence of 2 different variables.
- [[Module 05 - Probability]]
- [[Module 6 - Bayes Nets]]
- Marginalizing over $Y$ recovers the distribution of $X$ alone:

$$
P(X) = \sum_{y \in \Omega_Y}P(X,Y=y)
$$

- Variables are independent iff: $P(X,Y)=P(X)*P(Y)$
- Dependent variables use the conditional distribution: $P(Y|X)=\frac{P(X,Y)}{P(X)}$
- This leads to Bayes' rule: $P(X|Y)=P(Y|X)\frac{P(X)}{P(Y)}$
- It's possible to do this with more than 2 variables.

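A small numerical check of these identities (the 2x2 joint table here is made up for illustration, not from the lecture):

```python
import numpy as np

# Hypothetical joint distribution P(X, Y) over two binary variables.
# Rows index x, columns index y; entries sum to 1.
P_xy = np.array([[0.30, 0.10],
                 [0.20, 0.40]])

P_x = P_xy.sum(axis=1)             # marginal P(X) = sum_y P(X, Y=y)
P_y = P_xy.sum(axis=0)             # marginal P(Y)
P_y_given_x = P_xy / P_x[:, None]  # conditional P(Y|X) = P(X,Y) / P(X)

# Bayes' rule: P(X|Y) = P(Y|X) * P(X) / P(Y)
P_x_given_y = P_y_given_x * P_x[:, None] / P_y[None, :]

# Same answer as conditioning the joint directly on Y.
assert np.allclose(P_x_given_y, P_xy / P_y[None, :])
```
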
## Moments
- average = mean = expected value
- "avg/mean/E.V. of X": $E[X]$
- Variance measures the variation of values above/below the mean: $Var(X)=E[X^2]-E[X]^2$
- The standard deviation: $\sigma(X)=\sqrt{Var(X)}$
- Variables have many "moments". The $k$-th moment is: $E[X^k]$
- Central moment: $E[(X-E[X])^k]$
    - Variance is the second central moment of $X$
- Normalized central moment: $\frac{E[(X-E[X])^k]}{\sigma(X)^k}$
    - The fourth normalized central moment is called "kurtosis", which measures the "peakedness" of a distribution.

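A quick sanity check of these definitions in numpy (the Gaussian sample is just an illustration; scipy is used only to cross-check the kurtosis):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=100_000)  # illustrative sample

mean = x.mean()                              # E[X]
var  = (x**2).mean() - mean**2               # E[X^2] - E[X]^2
kurt = ((x - mean)**4).mean() / x.std()**4   # 4th normalized central moment

print(mean, var, kurt)                       # ~2, ~9, ~3 for a Gaussian
# scipy's kurtosis() subtracts 3 ("excess kurtosis") unless fisher=False
print(stats.kurtosis(x, fisher=False))       # matches kurt above
```
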
## Entropy
- Fundamental measure in information theory
- Captures the amount of randomness or uncertainty in a variable

$$
H(X) = -E[\text{log}P(X)] = -\sum_{x \in \Omega_X}P(X=x)\text{log}P(X=x)
$$
(logarithm base is usually 2)

- Measure of the average length of a message that would have to be sent to describe a sample
- Fair coin:
    - $H(X)=-(0.5 \space\text{log}\space 0.5 + 0.5 \space\text{log}\space 0.5) = 1$.
    - 100 flips requires 100 bits.
- One-sided coin:
    - $H(X) = -(1 \space\text{log}\space 1 + 0\space\text{log}\space 0)=0$.
    - 100 flips requires 0 bits.
- 75% coin:
    - $H(X)=-(0.75\space\text{log}\space0.75 + 0.25\space\text{log}\space0.25)=0.8113$.
    - 100 flips requires about 82 bits (see the sketch at the end of this section).
- Just because you only need $M$ bits to describe a sample doesn't mean it's easy to formulate the message required to describe the sample.
    - "There exists a coder that can construct messages of length $H(X)+1$."
- Joint entropy
    - $H(X,Y)=-E_X[E_Y[\text{log}(\space P(X,Y) \space)]]$
    - $$H(X,Y)=-\sum_{x \in \Omega_X, y \in \Omega_Y}P(X=x,Y=y) \space\text{log}(\space P(X=x,Y=y)\space)$$
- Conditional entropy
    - $H(Y|X)=-E_X[E_Y[\text{log}(\space P(Y|X) \space)]]$
    - $$H(Y|X)=-\sum_{x \in \Omega_X, y \in \Omega_Y}P(X=x,Y=y) \space\text{log}(\space P(Y=y|X=x)\space)$$

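A quick check of the coin numbers above, using base-2 logarithms as noted (a minimal sketch, not lecture code):

```python
import numpy as np

def entropy_bits(p):
    """Entropy in bits of a distribution given as an array of probabilities."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                       # 0 log 0 is taken as 0
    return float(-(p * np.log2(p)).sum())

print(entropy_bits([0.5, 0.5]))        # 1.0    -> 100 fair flips ~ 100 bits
print(entropy_bits([1.0, 0.0]))        # 0.0    -> one-sided coin needs 0 bits
print(entropy_bits([0.75, 0.25]))      # 0.8113 -> 100 flips ~ 82 bits
```
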
## Mutual Information
- Conditional entropy can tell us when 2 variables are completely independent
    - But it is not an adequate measure of dependence on its own.
    - "A small value for $H(Y|X)$ implies that $X$ tells us a great deal about $Y$ or that $H(Y)$ is small to begin with."
- We measure dependence using mutual information: $I(X,Y)=H(Y)-H(Y|X)$
    - A measure of the reduction in randomness of a variable given knowledge of another variable.
- Equivalent forms:
    - $I(X,Y)=H(Y)-H(Y|X)$
    - $I(X,Y)=H(X)-H(X|Y)$
    - $I(X,Y)=H(X)+H(Y)-H(X,Y)$
    - $I(X,Y)=I(Y,X)$

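A minimal sketch (again with a made-up joint table) confirming $I(X,Y)=H(X)+H(Y)-H(X,Y)$:

```python
import numpy as np

def H(p):
    """Entropy in bits of the probabilities in p (1-D or 2-D, zeros ignored)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# Hypothetical joint distribution over two binary variables.
P_xy = np.array([[0.30, 0.10],
                 [0.20, 0.40]])
P_x, P_y = P_xy.sum(axis=1), P_xy.sum(axis=0)

I = H(P_x) + H(P_y) - H(P_xy)   # I(X,Y) = H(X) + H(Y) - H(X,Y)
print(I)  # > 0, so X and Y are dependent; it would be 0 for independent variables
```
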
Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
---
tags:
- OMSCS
- ML
---
# No Free Lunch Theorem

- An impossibility theorem
- A general-purpose, universal optimization strategy is impossible
- "the only way one strategy can outperform another is if it is specialized to the structure of the specific problem under consideration"

One can build a matrix where the rows are optimization algorithms and the columns are optimization problems. The cells of the matrix encode the effectiveness of each optimization algorithm on each optimization problem.

The paper posits that each row has the same average performance.

The effectiveness of a particular optimization algorithm is therefore meaningless without considering the set of optimization problems you hope to apply it to.

Lines changed: 132 additions & 0 deletions
@@ -0,0 +1,132 @@
---
tags:
- OMSCS
- ML
---
# SL04 - Instance Based Learning

- Other Supervised Learning algorithms train a model based on the data, then throw the data away.
- In IBL, we put the data into a database and then look it up when we need to make a prediction for a new datapoint.

- Simple Database
    - Just store the examples, look them up when asked
    - Advantages
        - Reliable / dependable ((X, Y) -> DB, Lookup(X) -> Y)
        - Fast (to "train", there's basically no training)
        - Simple
    - Disadvantages
        - No generalization
        - Overfitting (querying datapoints that had mistakes always yields the same mistake)

## Cost of a House
- Have a DB of house costs
    - size
    - date sold
    - price of property when sold
    - location
    - zip code
- Nearest Neighbor
    - Find the nearest existing datapoint, use that cost
    - Falls apart if the unclassified datapoint is too far from a neighbor
- K Nearest Neighbor (KNN)
    - Take the $K$ nearest existing datapoints
    - Take the average of those $K$ datapoints
    - Can (should?) be a weighted average based on "distance"/"similarity"
    - The weighting function used is a "hyperparameter" of KNN (see the weighted sketch after the example below)
    - The weighting function can be as simple as $1/k$ (unweighted)

## Comparison

![[Pasted image 20250128103204.png]]

- Do all the work upfront? (eager learner)
- Do all the work on the backend (at query time)? (lazy learner)
- Combination of approaches? No reason why you can't "cache" the result via a linear regression.

## KNN Example

![[Pasted image 20250128103648.png]]

```python
import sklearn.neighbors

X = [[1, 6], [2, 4], [3, 7], [6, 8], [7, 1], [8, 4]]
Y = [7, 8, 13, 44, 50, 68]
Q = [[4, 2]]

# We need to use KNeighborsRegressor. KNeighborsClassifier
# uses "majority vote" or "weighted majority vote".
# KNeighborsRegressor returns the "average" or "weighted avg".

sklearn.neighbors\
    .KNeighborsRegressor(1, metric='euclidean')\
    .fit(X, Y)\
    .predict(Q)
# 8

sklearn.neighbors\
    .KNeighborsRegressor(3, metric='euclidean')\
    .fit(X, Y)\
    .predict(Q)
# 42

sklearn.neighbors\
    .KNeighborsRegressor(1, metric='manhattan')\
    .fit(X, Y)\
    .predict(Q)
# 8

sklearn.neighbors\
    .KNeighborsRegressor(3, metric='manhattan')\
    .fit(X, Y)\
    .predict(Q)
# 23.66666667
```

| $d()$     | $K$ | `sklearn`        | Empirical | Notes                                   |
| --------- | --- | ---------------- | --------- | --------------------------------------- |
| Euclidean | 1   | $8$              |           |                                         |
| Euclidean | 3   | $42$             |           |                                         |
| Manhattan | 1   | $8$              | 29        | (4,2) is equidistant to (2,4) and (7,1) |
| Manhattan | 3   | $23 \frac{2}{3}$ | 35.5      | (4,2) is equidistant to (3,7) and (8,4) |

> Regarding the Nearest Neighbors algorithms, if it is found that two neighbors, neighbor `k+1` and `k`, have identical distances but different labels, the results will depend on the ordering of the training data.

![[Pasted image 20250128110752.png]]

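The weighting-function hyperparameter mentioned earlier can be tried on the same data. A minimal sketch using sklearn's built-in inverse-distance weighting (`weights='distance'`); this is not part of the lecture's numbers above:

```python
import sklearn.neighbors

X = [[1, 6], [2, 4], [3, 7], [6, 8], [7, 1], [8, 4]]
Y = [7, 8, 13, 44, 50, 68]
Q = [[4, 2]]

# weights='distance' weights each neighbor by 1/distance instead of 1/k,
# so closer neighbors pull the prediction toward their own value.
sklearn.neighbors\
    .KNeighborsRegressor(3, metric='euclidean', weights='distance')\
    .fit(X, Y)\
    .predict(Q)
# lower than the unweighted 42, pulled toward 8 because (2, 4) is the closest neighbor
```
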
## KNN Bias
Preference Bias:
- Locality -> Near Points are Similar
- Smoothness -> Averaging
- **All features matter equally**

## Curse of Dimensionality
> As the number of features or dimensions grows, the amount of data we need to generalize accurately grows exponentially.

Intuition says "let's add more features, that'll help it classify better". In reality, that makes the problem worse if you have insufficient data.

If you have one dimension and $N$ datapoints, each datapoint covers $1/N$ of the space. If you add another dimension, you need $N^2$ datapoints in order to achieve the same coverage. If you add a 3rd dimension, you need $N^3$ datapoints.

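A small illustration of the $N$, $N^2$, $N^3$ argument (uniform random points chosen for illustration, not lecture data):

```python
import numpy as np

N = 10  # points needed to cover [0, 1] at spacing 0.1 in one dimension

for d in [1, 2, 3]:
    # A regular grid with spacing 0.1 in d dimensions needs N**d points.
    grid_points = N ** d

    # With only N random points in d dimensions, the space is mostly empty:
    rng = np.random.default_rng(0)
    pts = rng.uniform(size=(N, d))
    # average distance from each point to its nearest neighbor
    dists = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)
    print(d, grid_points, dists.min(axis=1).mean())

# The grid size grows 10 -> 100 -> 1000, and the nearest-neighbor gaps
# between the same 10 points grow as the dimension increases.
```
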
![[Pasted image 20250128204050.png]]

Weighting different dimensions differently can help with the curse of dimensionality.

## Other Stuff
- Distance functions
    - Euclidean
    - Manhattan
    - Hamming
- Weighted vs Unweighted distances
- What's the best value for $K$?
- Weighted vs unweighted average
- Locally weighted regression
    - Locally weighted linear regression
    - Locally weighted quadratic regression
    - Locally weighted $WHATEVER regression

## Summary
- lazy vs eager learning
- KNN
- similarity = distance
- classification vs regression
- averaging
- domain knowledge matters

Lines changed: 83 additions & 0 deletions
@@ -0,0 +1,83 @@
---
tags:
- OMSCS
- ML
---
# SL05 - Ensemble Learning and Boosting

- Spam email
    - is it spam or not?
- Come up with some simple rules for classifying if it's spam
    - from spouse? not spam
    - body contains "manly"? spam
    - short? spam
    - just URLs? spam
    - just an image? spam
    - misspelled words? spam
    - blocklist of words? spam
    - "make money fast"? spam
- Each simple rule is not good enough on its own
- Combine the simple rules into a complex rule that works well enough on its own
- Learn over subsets of the data to generate those simple rules

## Algorithm
- What is this notion of "combine"?
- How do we pick subsets?

![[Pasted image 20250128210750.png]]

## Bagging
- This is called Bagging (Bootstrap Aggregation)
    - Take some random subset of the data
    - Train over that data
    - Keep doing it
    - Take the average result of the models (see the sketch after the figures below)

![[Pasted image 20250128210950.png]]

![[Pasted image 20250128211015.png]]

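A minimal sketch of the bagging loop above, assuming third-degree polynomial fits as the "simple" learners and toy noisy data (neither is the lecture's example):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 40)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)  # noisy toy data

models = []
for _ in range(10):
    idx = rng.integers(0, x.size, size=x.size)       # bootstrap: sample with replacement
    models.append(np.polyfit(x[idx], y[idx], deg=3))  # train a simple learner on the subset

# "Combine" = average the individual models' predictions.
bagged = np.mean([np.polyval(c, x) for c in models], axis=0)
```
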
## Boosting
- Similar to bagging, but:
    - take the "hardest" examples (instead of a random subset)
    - perform a weighted mean (instead of a simple average)
- Error calculations should be based on the "likelihood" of each data point
    - Which examples are important to learn? Which examples aren't important to learn?
- "Weak" learners
    - Does better than chance
    - Expected error is always less than half
- Given training data $(x_i, y_i)$ where $y_i \in \{-1, +1\}$
- For $t=1$ to $T$
    - Construct distribution $D_t$
    - Find weak classifier $h_t(x)$ with small error $\epsilon_t = P_{D_t}[h_t(x_i) \ne y_i]$
    - This looks crazy, but we're essentially doing a weighted average.
- Output $H_{\text{final}}$

- Start off with a uniform distribution: $D_1(i)=\frac{1}{n}$

$$
D_{t+1}(i)=D_t(i)e^{-\alpha_ty_ih_t(x_i)}Z_t^{-1}
$$
$$
\alpha_t=\frac{1}{2}\ln\left(\frac{1-\epsilon_t}{\epsilon_t}\right)
$$

When $h_t(x_i)=y_i$, $D_{t+1}(i) \le D_t(i)$ in most cases: it usually goes down, and sometimes it stays the same, depending on how the rest of the distribution is affected.

When $h_t(x_i) \ne y_i$, $D_{t+1}(i) \gt D_t(i)$: it always increases, putting more weight on the examples the learner got wrong.

$$
H_{\text{final}}=\text{sgn}\left(\sum_t\alpha_th_t(x)\right)
$$

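A compact sketch of this loop, assuming decision stumps (axis-aligned thresholds) as the weak learners; the stump search here is my own illustration, not code from the lecture:

```python
import numpy as np

def adaboost(X, y, T=10):
    """X: (n, d) array of features, y: labels in {-1, +1}.
    Returns a list of (alpha_t, stump) pairs."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    n = len(y)
    D = np.full(n, 1.0 / n)                      # D_1(i) = 1/n
    ensemble = []
    for _ in range(T):
        # Weak learner: the axis-aligned threshold (decision stump)
        # with the smallest weighted error under D.
        best = None
        for j in range(X.shape[1]):
            for thresh in np.unique(X[:, j]):
                for sign in (+1, -1):
                    pred = sign * np.where(X[:, j] > thresh, 1, -1)
                    eps = D[pred != y].sum()     # eps_t = P_D[h_t(x_i) != y_i]
                    if best is None or eps < best[0]:
                        best = (eps, j, thresh, sign)
        eps, j, thresh, sign = best
        eps = float(np.clip(eps, 1e-10, 1 - 1e-10))  # avoid log(0) in degenerate cases
        alpha = 0.5 * np.log((1 - eps) / eps)        # alpha_t
        pred = sign * np.where(X[:, j] > thresh, 1, -1)
        D = D * np.exp(-alpha * y * pred)            # up-weight mistakes, down-weight hits
        D = D / D.sum()                              # Z_t: renormalize so D sums to 1
        ensemble.append((alpha, (j, thresh, sign)))
    return ensemble

def predict(ensemble, X):
    X = np.asarray(X, dtype=float)
    # H_final(x) = sgn( sum_t alpha_t h_t(x) )
    votes = sum(a * s * np.where(X[:, j] > t, 1, -1) for a, (j, t, s) in ensemble)
    return np.sign(votes)
```
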
## Three Little Boxes
![[Pasted image 20250128214602.png]]

- $H$ is the set of axis-aligned semi-planes
    - (Everything on one side of a line is in the range)

![[Pasted image 20250128215046.png]]

![[Pasted image 20250128215156.png]]
