some small understandability improvements to the README #79
@@ -229,22 +229,24 @@

```python
plt.close()
```
Your results should look like the following. In these plots, the score is conveyed by the color of each point.

**LoOP Scores without Clustering**

*(figure: LoOP Sample without Clustering)*

**LoOP Scores with Clustering**

*(figure: LoOP Sample with Clustering)*

**DBSCAN Cluster Assignments**

*(figure: DBSCAN Cluster Assignments)*
Note the differences between using LocalOutlierProbability with and without clustering. In the example without clustering, samples are scored according to the distribution of the entire data set. In the example with clustering, each sample is scored according to the distribution of its cluster. Which approach is suitable depends on the use case.
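To make the distinction concrete, here is a minimal sketch of both modes. It assumes scikit-learn's DBSCAN and PyNomaly's `cluster_labels` keyword; treat it as an illustration rather than the README's canonical snippet.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from PyNomaly import loop

data = np.random.rand(100, 2)

# Without clustering: each sample is scored against the whole data set.
scores_global = loop.LocalOutlierProbability(
    data, n_neighbors=10
).fit().local_outlier_probabilities

# With clustering: pass per-sample cluster labels so each sample is
# scored against the distribution of its own cluster.
db = DBSCAN(eps=0.6, min_samples=10).fit(data)
scores_clustered = loop.LocalOutlierProbability(
    data, n_neighbors=10, cluster_labels=list(db.labels_)
).fit().local_outlier_probabilities
```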
**NOTE**: Data was not normalized in this example, but it's probably a good idea to do so in practice.

Why? (See the Owner's reply at the end of this thread.)
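A minimal sketch of normalizing before scoring, assuming scikit-learn's MinMaxScaler (any comparable scaler works):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from PyNomaly import loop

# Features on wildly different scales (illustrative data).
data = np.random.rand(100, 5) * [1, 10, 100, 1000, 10000]

# Rescale every feature to [0, 1] so no single column dominates the
# distance computations underlying the LoOP scores.
normalized = MinMaxScaler().fit_transform(data)
scores = loop.LocalOutlierProbability(
    normalized, n_neighbors=10
).fit().local_outlier_probabilities
```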
## Using Numpy
@@ -264,6 +266,7 @@

```python
scores = loop.LocalOutlierProbability(data, n_neighbors=3).fit().local_outlier_probabilities
print(scores)
```
-- I'll insert a table here

**Owner:** Sounds good ✅
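For readers trying this hunk in isolation, a self-contained version might look like the following; the data array is an illustrative stand-in, not the README's actual example:

```python
import numpy as np
from PyNomaly import loop

# Each row is an observation; each column is a feature.
data = np.array([
    [43.3, 30.2, 90.2],
    [62.9, 58.3, 49.3],
    [55.2, 56.2, 134.2],
    [48.6, 80.3, 50.3],
    [67.1, 60.0, 55.9],
    [421.5, 90.3, 50.0],  # an obvious outlier in the first feature
])

scores = loop.LocalOutlierProbability(data, n_neighbors=3).fit().local_outlier_probabilities
print(scores)
```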
The shape of the input array corresponds to the rows (observations) and columns (features) in the data:
@@ -279,7 +282,7 @@

```python
data = np.random.rand(100, 5)  # 100 observations (rows), 5 features (columns)
scores = loop.LocalOutlierProbability(data).fit().local_outlier_probabilities
print(scores)
```
-- I'll insert a table of the scores here

**Owner:** Sounds good ✅
## Specifying a Distance Matrix

PyNomaly provides the ability to specify a distance matrix so that any distance metric can be used.
@@ -317,6 +320,8 @@

```python
distances = np.delete(distances, 0, 1)
m = loop.LocalOutlierProbability(distance_matrix=d, neighbor_matrix=idx, n_neighbors=n_neighbors+1).fit()
scores = m.local_outlier_probabilities
```
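Since the hunk shows only a fragment, here is a hedged, self-contained sketch of the full flow, using scikit-learn's NearestNeighbors to build the distance and neighbor matrices; variable names and the self-neighbor handling follow the fragment above but may differ from the README's complete example:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from PyNomaly import loop

data = np.random.rand(100, 5)  # illustrative data
n_neighbors = 10

# Request one extra neighbor, since each point's nearest neighbor is itself.
neigh = NearestNeighbors(n_neighbors=n_neighbors + 1, metric='euclidean')
neigh.fit(data)
d, idx = neigh.kneighbors(data, return_distance=True)

# Drop the self-neighbor column from both matrices, as in the fragment above.
d = np.delete(d, 0, 1)
idx = np.delete(idx, 0, 1)

m = loop.LocalOutlierProbability(
    distance_matrix=d, neighbor_matrix=idx, n_neighbors=n_neighbors + 1
).fit()
scores = m.local_outlier_probabilities
print(scores[:10])  # a truncated view, per the review comment below
```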
-- insert a table of the results

**Owner:** A full table of the values may be too large, but showing a truncated view would be helpful.
What are the results telling us?

The visualization below shows the results for a few well-known distance metrics:
@@ -375,7 +380,7 @@

```python
print(rmse)
```

The root mean squared error (RMSE) between the two approaches is approximately 0.199 (your scores will vary depending on the data and specification). The plot below shows the scores from the stream approach as a colormap on the figures.
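For context, a hedged sketch of the stream approach this RMSE refers to, assuming PyNomaly's `stream()` method (fit on an initial batch, then score points one at a time):

```python
import numpy as np
from PyNomaly import loop

data = np.random.rand(1000, 2)
train, incoming = data[:500], data[500:]

# Fit on an initial batch, then score each new observation as it arrives.
m = loop.LocalOutlierProbability(train, n_neighbors=10).fit()
stream_scores = np.array([m.stream(point) for point in incoming])
```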
```python
fig = plt.figure(figsize=(7, 7))
```
**Owner** (replying to the "Why?" above): The main reason is to not give extra weight to a particular column of data. Data normalization ensures that all features in a dataset are on a similar scale, preventing features with larger values from disproportionately influencing the algorithm and improving model performance.