Commit 2630968

first commit

Author: Gustavo Muniz do Carmo
0 parents, commit 2630968

37 files changed: +1028 -0 lines changed

.gitignore

Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
+.vagrant/
+.vscode/
+*.retry
+ubuntu-*-cloudimg-console.log
+env
+__pycache__

.travis.yml

Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
+---
+language: python
+python: "3.6"
+
+addons:
+  apt:
+    packages:
+      - python-pip
+
+install:
+  - pip install -r requirements.txt
+
+script:
+  - molecule test -s aws

.yamllint

Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
+---
+extends: default
+ignore: |
+  **/lib/
+rules:
+  braces:
+    max-spaces-inside: 1
+    level: error
+  brackets:
+    max-spaces-inside: 1
+    level: error
+  line-length: disable
+  # NOTE(retr0h): Templates no longer fail this lint rule.
+  # Uncomment if running old Molecule templates.
+  # truthy: disable

LICENSE

Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2017 Esign Consulting Ltda.
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.

README.md

Lines changed: 29 additions & 0 deletions
@@ -0,0 +1,29 @@
+# Get metrics for alerting in advance and preventing trouble
+
+[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT) [![GitHub release](https://img.shields.io/github/release/codeyourinfra/get_metrics_for_alerting.svg)](https://github.com/codeyourinfra/get_metrics_for_alerting/releases/latest) [![Build status](https://travis-ci.org/codeyourinfra/get_metrics_for_alerting.svg?branch=master)](https://travis-ci.org/codeyourinfra/get_metrics_for_alerting)
+
+This solution is explained in detail in the Codeyourinfra project blog post [How to get metrics for alerting in advance and preventing trouble](http://codeyourinfra.today/how-to-get-metrics-for-alerting-in-advance-and-preventing-trouble). Check it out!
+
+## Problem
+
+You may already have a monitoring solution. After all, you are responsible for keeping all the IT services available. You don't want to be surprised by an unexpected outage, so you install an agent on every server to collect relevant monitoring data. In addition, you've configured automatic emails to be sent when something goes wrong. The problem is that you can't handle it anymore, because you now have more than a thousand servers to monitor. Furthermore, people no longer pay attention to the alerts received by email, due to the large number of false positives.
+
+## Solution
+
+The solution is based on [InfluxDB](https://docs.influxdata.com/influxdb), a high-performance time series database, on [Grafana](https://grafana.com/), a time series analytics and monitoring tool, and on [Ansible](https://www.ansible.com/), an agentless automation tool. They are all open source tools and can be easily integrated with each other to create a monitoring service. With Ansible it is possible to extract the servers' hardware metrics and store them in InfluxDB ([playbook-get-metrics.yml](templates/playbook-get-metrics.yml)). With Grafana it is possible to connect to InfluxDB and show the metrics graphically, define thresholds and configure alerts that can be delivered through different channels, including instant messaging apps like [Slack](https://slack.com) and [Telegram](https://telegram.org).
+
+![Solution picture](get_metrics_for_alerting.png)
+
+## Test
+
+First of all, run the command `vagrant up monitor` to start the **monitoring server**. Then, open your web browser and access the Grafana web application through the **URL** <http://192.168.33.10:3000>. The **user** and the **password** are both *admin*. After that, click on the **used_mem_pct** dashboard. You will see the **Used memory percentage** line chart, with data from the **monitoring server** itself. An alert is sent to a [Slack workspace](https://mygrafanaalerts.slack.com) (click [here](https://join.slack.com/t/mygrafanaalerts/shared_invite/enQtNjg2NTQ0MDM0MDgxLTA3NzhkNjliNjY5YWUwNTY1OWI3MjkwOGIwZjM2NDQzNzlhMDc3YjQzMjg0Mjc4MjYzYjYyNjc2MjQ5ZDA3OGU) to join) if the last 5 used memory percentage values are greater than or equal to 95%, the defined threshold.
+
+You can add the other servers to the monitoring service, if you want. To add **server1**, first boot it up with the command `vagrant up server1`. After that, execute the command `ansible-playbook playbook-add-server.yml -e "host=192.168.33.20 user=vagrant password=vagrant"`. The parameters **host**, **user** and **password** are used by Ansible to access the monitored hosts through SSH from the monitoring server. Once the server is added, wait at least 1 minute and check whether Ansible is properly getting the metrics from the new monitored server by executing the ad-hoc command `ansible monitor -m shell -a "cat /etc/ansible/playbooks/playbook-get-metrics.log"`. Repeat these steps for **server2**, if you wish.
+
+### Automated tests
+
+You can also test the solution automatically, by executing `./test.sh` or using [Molecule](https://molecule.readthedocs.io). With the latter, you can perform the test not only locally (the default), but in [AWS](https://aws.amazon.com) as well. During Codeyourinfra's *continuous integration* process in Travis CI, the solution is tested on [Amazon EC2](https://aws.amazon.com/ec2).
+
+To get your environment ready for using *Molecule*, prepare your [Python virtual environment](https://docs.python.org/3/tutorial/venv.html) by executing `python3 -m venv env && source env/bin/activate && pip install -r ../requirements.txt`. After that, just run the command `molecule test` to test the solution locally in a [VirtualBox](https://www.virtualbox.org) VM managed by [Vagrant](https://www.vagrantup.com).
+
+If you prefer performing the test in AWS, bear in mind that you must have your credentials properly set up in **~/.aws/credentials**. You can [configure them through the AWS CLI tool](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html). The test is performed in the AWS region *Europe - London (eu-west-2)*. Just run `molecule test -s aws` and check the running instances through your [AWS Console](https://eu-west-2.console.aws.amazon.com/ec2/v2).
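
Reviewer note: the README's Solution section says Ansible extracts the metrics and stores them in InfluxDB. As a rough sketch of what one stored point could look like, the snippet below formats a used-memory percentage in InfluxDB line protocol. The measurement `used_mem_pct`, the tag `host` and the field `value` are the names the dashboard in this commit queries. The function names and the formula (total minus available, divided by total) are assumptions, because the playbook that collects the metric is not part of this excerpt.

```python
def used_mem_pct(total_mb: float, available_mb: float) -> float:
    """Assumed formula: share of memory that is not available, in percent."""
    if total_mb <= 0:
        raise ValueError("total_mb must be positive")
    return round((total_mb - available_mb) / total_mb * 100, 2)


def to_line_protocol(host: str, pct: float) -> str:
    """Format one point as 'used_mem_pct,host=<host> value=<pct>'.

    No timestamp is appended, so InfluxDB would assign its own server time.
    """
    escaped = host
    # Commas, equals signs and spaces in tag values are backslash-escaped.
    for ch in (",", "=", " "):
        escaped = escaped.replace(ch, "\\" + ch)
    return f"used_mem_pct,host={escaped} value={float(pct)}"


# Hypothetical reading for server1: 2000 MB total, 500 MB available.
line = to_line_protocol("192.168.33.20", used_mem_pct(2000, 500))
print(line)
```

Tagging each point with `host` is what lets the dashboard group the series by server.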

Vagrantfile

Lines changed: 27 additions & 0 deletions
@@ -0,0 +1,27 @@
+# -*- mode: ruby -*-
+# vi: set ft=ruby :
+
+Vagrant.configure("2") do |config|
+  config.vm.define "monitor" do |monitor|
+    monitor.vm.box = "codeyourinfra/monitor"
+    monitor.vm.network "private_network", ip: "192.168.33.10"
+
+    monitor.vm.provision "ansible" do |ansible|
+      ansible.playbook = "monitoring-configuration.yml"
+      ansible.inventory_path = "inventory.yml"
+    end
+  end
+
+  (1..2).each do |i|
+    config.vm.define "server#{i}" do |server|
+      server.vm.box = "ubuntu/bionic64"
+      server.vm.network "private_network", ip: "192.168.33.#{i+1}0"
+
+      server.vm.provision "ansible" do |ansible|
+        ansible.limit = "server#{i}"
+        ansible.playbook = "servers-configuration.yml"
+        ansible.inventory_path = "inventory.yml"
+      end
+    end
+  end
+end
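
Reviewer note: the interpolation `"192.168.33.#{i+1}0"` in the Vagrantfile is easy to misread, so here is a small sketch of the addresses it produces. The helper name `vagrant_ips` is made up for illustration; only the address pattern and the machine names come from the Vagrantfile.

```python
def vagrant_ips(server_count: int = 2) -> dict:
    """Map each Vagrant machine name to its private-network address."""
    ips = {"monitor": "192.168.33.10"}
    for i in range(1, server_count + 1):
        # Mirrors the Ruby string "192.168.33.#{i+1}0": the digits of i+1
        # followed by a literal zero, not (i + 1) * 10 computed as a number.
        ips[f"server{i}"] = f"192.168.33.{i + 1}0"
    return ips


print(vagrant_ips())
```

The result for `server1` agrees with the `host=192.168.33.20` argument used in the README. Because the last octet is built by string concatenation, the pattern only yields valid addresses up to 24 servers (`192.168.33.250`).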

ansible.cfg

Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
+[defaults]
+callback_whitelist = profile_tasks
+host_key_checking = False
+inventory = inventory.yml
+localhost_warning = False
+
+[inventory]
+enable_plugins = yaml

files/monitor-datasource.json

Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
+{
+  "name": "monitor",
+  "isDefault": true,
+  "type": "influxdb",
+  "url": "http://localhost:8086",
+  "access": "proxy",
+  "database": "monitor"
+}
Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
+{
+  "name": "Slack notification channel",
+  "type": "slack",
+  "isDefault": false,
+  "settings": {
+    "url": "https://hooks.slack.com/services/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"
+  }
+}
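
Reviewer note: the two JSON files above look like payloads for Grafana's HTTP API. This excerpt does not show how the playbook actually registers them, so the following is only a sketch under that assumption. It builds, but does not send, an authenticated POST to `/api/datasources` with the data source payload. The helper name `build_grafana_post` is hypothetical; the base URL and the *admin* credentials are the ones given in the README.

```python
import base64
import json
import urllib.request


def build_grafana_post(base_url, path, payload, user="admin", password="admin"):
    """Build (without sending) a JSON POST request using HTTP basic auth."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        url=base_url.rstrip("/") + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )


# Same fields as files/monitor-datasource.json in this commit.
datasource = {
    "name": "monitor",
    "isDefault": True,
    "type": "influxdb",
    "url": "http://localhost:8086",
    "access": "proxy",
    "database": "monitor",
}
request = build_grafana_post("http://192.168.33.10:3000", "/api/datasources", datasource)
# urllib.request.urlopen(request) would send it; omitted to avoid side effects.
print(request.get_method(), request.full_url)
```

In Grafana versions with legacy alerting, the notification channel payload would presumably go to `/api/alert-notifications` in the same way.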

files/used_mem_pct-dashboard.json

Lines changed: 229 additions & 0 deletions
@@ -0,0 +1,229 @@
+{
+  "dashboard": {
+    "annotations": {
+      "list": [
+        {
+          "builtIn": 1,
+          "datasource": "-- Grafana --",
+          "enable": true,
+          "hide": true,
+          "iconColor": "rgba(0, 211, 255, 1)",
+          "name": "Annotations & Alerts",
+          "type": "dashboard"
+        }
+      ]
+    },
+    "editable": true,
+    "gnetId": null,
+    "graphTooltip": 0,
+    "hideControls": false,
+    "id": null,
+    "links": [],
+    "refresh": "1m",
+    "rows": [
+      {
+        "collapse": false,
+        "height": "250px",
+        "panels": [
+          {
+            "alert": {
+              "conditions": [
+                {
+                  "evaluator": {
+                    "params": [
+                      95
+                    ],
+                    "type": "gt"
+                  },
+                  "operator": {
+                    "type": "and"
+                  },
+                  "query": {
+                    "params": [
+                      "A",
+                      "5m",
+                      "now"
+                    ]
+                  },
+                  "reducer": {
+                    "params": [],
+                    "type": "last"
+                  },
+                  "type": "query"
+                }
+              ],
+              "executionErrorState": "alerting",
+              "frequency": "60s",
+              "handler": 1,
+              "message": "The last 5 used memory percentage values are greater than or equal to the threshold of 95%.",
+              "name": "Used memory percentage alert",
+              "noDataState": "no_data",
+              "notifications": [
+                {
+                  "id": 1
+                }
+              ]
+            },
+            "aliasColors": {},
+            "bars": false,
+            "dashLength": 10,
+            "dashes": false,
+            "datasource": null,
+            "fill": 1,
+            "id": 1,
+            "legend": {
+              "avg": false,
+              "current": false,
+              "max": false,
+              "min": false,
+              "show": true,
+              "total": false,
+              "values": false
+            },
+            "lines": true,
+            "linewidth": 1,
+            "links": [],
+            "nullPointMode": "null",
+            "percentage": false,
+            "pointradius": 5,
+            "points": false,
+            "renderer": "flot",
+            "seriesOverrides": [],
+            "spaceLength": 10,
+            "span": 12,
+            "stack": false,
+            "steppedLine": false,
+            "targets": [
+              {
+                "dsType": "influxdb",
+                "groupBy": [
+                  {
+                    "params": [
+                      "host"
+                    ],
+                    "type": "tag"
+                  }
+                ],
+                "measurement": "used_mem_pct",
+                "orderByTime": "ASC",
+                "policy": "default",
+                "refId": "A",
+                "resultFormat": "time_series",
+                "select": [
+                  [
+                    {
+                      "params": [
+                        "value"
+                      ],
+                      "type": "field"
+                    }
+                  ]
+                ],
+                "tags": []
+              }
+            ],
+            "thresholds": [
+              {
+                "colorMode": "critical",
+                "fill": true,
+                "line": true,
+                "op": "gt",
+                "value": 95
+              }
+            ],
+            "timeFrom": null,
+            "timeShift": null,
+            "title": "Used memory percentage",
+            "tooltip": {
+              "shared": true,
+              "sort": 0,
+              "value_type": "individual"
+            },
+            "type": "graph",
+            "xaxis": {
+              "buckets": null,
+              "mode": "time",
+              "name": null,
+              "show": true,
+              "values": []
+            },
+            "yaxes": [
+              {
+                "decimals": null,
+                "format": "percent",
+                "label": null,
+                "logBase": 1,
+                "max": 100,
+                "min": null,
+                "show": true
+              },
+              {
+                "format": "short",
+                "label": "",
+                "logBase": 1,
+                "max": null,
+                "min": null,
+                "show": true
+              }
+            ]
+          }
+        ],
+        "repeat": null,
+        "repeatIteration": null,
+        "repeatRowId": null,
+        "showTitle": false,
+        "title": "Dashboard Row",
+        "titleSize": "h6"
+      },
+      {
+        "collapse": false,
+        "height": 250,
+        "panels": [],
+        "repeat": null,
+        "repeatIteration": null,
+        "repeatRowId": null,
+        "showTitle": false,
+        "title": "Dashboard Row",
+        "titleSize": "h6"
+      }
+    ],
+    "schemaVersion": 14,
+    "style": "dark",
+    "tags": [],
+    "templating": {
+      "list": []
+    },
+    "time": {
+      "from": "now-30m",
+      "to": "now"
+    },
+    "timepicker": {
+      "refresh_intervals": [
+        "5s",
+        "10s",
+        "30s",
+        "1m",
+        "5m",
+        "15m",
+        "30m",
+        "1h",
+        "2h",
+        "1d"
+      ],
+      "time_options": [
+        "5m",
+        "15m",
+        "1h",
+        "6h",
+        "12h",
+        "24h",
+        "2d",
+        "7d",
+        "30d"
+      ]
+    },
+    "timezone": "",
+    "title": "used_mem_pct",
+    "version": 1
+  }
+}
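
Reviewer note: to make the alert rule in this dashboard easier to reason about, here is a minimal sketch of how its single condition appears to evaluate. It is my reading of the JSON, not Grafana's code: query `A` over the window from `5m` ago to `now`, reducer `last`, evaluator `gt` with parameter 95. The function name and the `(timestamp, value)` input shape are assumptions for illustration.

```python
def evaluate_alert(points, now, window_s=300, threshold=95.0):
    """Return 'alerting', 'ok' or 'no_data' for (timestamp, value) points.

    Timestamps and `now` are in seconds; window_s=300 mirrors the "5m" query.
    """
    in_window = [(t, v) for t, v in points if now - window_s <= t <= now]
    if not in_window:
        # Corresponds to "noDataState": "no_data" in the dashboard JSON.
        return "no_data"
    # Reducer "last": take the value of the most recent point in the window.
    _, last_value = max(in_window, key=lambda p: p[0])
    # Evaluator "gt": strictly greater than the threshold.
    return "alerting" if last_value > threshold else "ok"


samples = [(0, 99.0), (60, 99.0), (120, 96.0)]
print(evaluate_alert(samples, now=120))
```

Read this way, the rule fires on the most recent value in the last five minutes being strictly above 95. That is narrower than the alert message and the README, which both speak of the last 5 values being greater than or equal to 95%; it may be worth aligning the wording or the condition.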
