Airbnb Engineering & Data Science

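This repository contains the interferon gem, which lets you store your alerts configuration in code. You should create your own repository with a Gemfile that imports the interferon gem. For an example of such a repository, along with example configuration and alert files, see https://www.github.com/airbnb/alerts_example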

Running this gem

This gem provides an executable called interferon.

Invoke it like so:

$ bundle exec interferon --config /path/to/config_file

Additional options

-h, --help -- print usage information
-n, --dry-run -- run interferon without making any changes to alerting destinations

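Configuration file

The configuration file is written in YAML.

It accepts the following parameters:

verbose_logging -- whether to print more output
alerts_repo_path -- the location of your alerts repository, which contains your interferon DSL files
group_sources -- a list of sources that can return groups of people to alert
host_sources -- a list of sources that can read inventory systems and return lists of hosts to monitor
destinations -- a list of alerting providers that can monitor metrics and dispatch alerts as specified in your alert DSL files

For more information, see the config.example.yaml file in this repository.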
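As a rough illustration of the parameters listed above, the following Ruby sketch parses a small configuration and reports which of the five documented top-level keys are missing. Only the top-level key names come from this README. The nested `type` entries, the values, and the `missing_keys` helper are assumptions made for the example; config.example.yaml remains the reference for the real layout.

```ruby
require 'yaml'

# Hypothetical configuration. The top-level keys come from the list above;
# the nested "type" entries and all values are illustrative assumptions.
EXAMPLE_CONFIG = <<~YAML
  verbose_logging: true
  alerts_repo_path: /path/to/alerts_repo
  group_sources:
    - type: filesystem
  host_sources:
    - type: optica
  destinations:
    - type: datadog
YAML

EXPECTED_KEYS = %w[
  verbose_logging alerts_repo_path group_sources host_sources destinations
].freeze

# Returns the documented keys that are absent from the parsed config.
# This helper is hypothetical and not part of interferon.
def missing_keys(config)
  EXPECTED_KEYS.reject { |key| config.key?(key) }
end

config = YAML.safe_load(EXAMPLE_CONFIG)
puts "missing keys: #{missing_keys(config).inspect}"
```

A check like this can run before invoking interferon with --dry-run, so that a mistyped key shows up early.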

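The moving parts

This repository knows about four kinds of objects:

host_sources: these query various inventory systems and return lists of hosts or entities to alert on
destinations: these are metrics systems that can watch metrics and alert engineers
groups: these are the actual groups of engineers who can be alerted when something goes wrong
alerts: these are Ruby DSL files that specify when and how engineers and groups are alerted about hosts via destinations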

Host sources

optica: reads a list of AWS instances from optica
optica_services: returns SmartStack service information parsed from optica
aws_rds: lists RDS instances
aws_dynamo: lists DynamoDB tables
aws_elasticache: lists Elasticache nodes and clusters

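Destinations

Datadog

Datadog is currently our only alerting destination. Datadog's alert syntax is documented at http://docs.datadoghq.com/api/#alerts. The diagram below (generated with asciiflow) explains the Datadog metric syntax: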

    +---------+ alert condition +-------------------------------------------------+
    |                                                                             |
    |              +-----+ metric to alert on                                     |
    |              |                                                              |
    |              |    tags to slice the metric by +------+                      |
    |              |                                       |                      |
    v              v                                       v                      v
  |----------| |-------------------------||--------------------------|          |---|
  max(last_5m):avg:haproxy_count_by_status{role:<%= role %>,status:up} by {host} > 0
  ^      ^      ^                                                          ^
  |      |      |                                                          |
  |      | +----|------------------------------+                           |
  |      | | math on the metric over all tags  |                           |
  |      | |-----------------------------------|            +------------------------------------+
  |      | | * max, min, avg, sum              |            |trigger a separate alert for each   |
  |      + +-----------------------------------+            |different value of these tags the   |
  | +----|----------------------------------------------+   |entire `by {}` clause can be omitted|
  | | the interval to look at; always starts with last_ |   +------------------------------------+
  | |---------------------------------------------------|
  | | * 5m, 10m, 15m, 30m                               |
  | | * 1h, 2h, 4h                                      |
  + +---------------------------------------------------+
 +-------------------------------------------------------------------------------------------------+
 | metric condition, can be one of:                                                                |
 |-------------------------------------------------------------------------------------------------|
 | * max: the metric gets this high at least once during the interval                              |
 | * avg: the metric is this on average during the interval                                        |
 | * min: the metric is this small at least once during the interval                               |
 | * change: the metric changes this much between a value N minutes ago and now (raw difference).  |
 | * pct_change: the metric changes this much between a value N minutes ago and now (percentage).  |
 +-------------------------------------------------------------------------------------------------+
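
To make the anatomy in the diagram concrete, here is a minimal Ruby sketch that assembles the same query string from its labelled parts. The `build_query` method and its keyword arguments are hypothetical names chosen for this example and are not part of interferon or the Datadog API. The role value `web` stands in for the `<%= role %>` placeholder.

```ruby
# Assembles a Datadog-style alert query from the parts labelled in the diagram.
# The method name and keyword arguments are hypothetical, not part of interferon.
def build_query(condition:, interval:, aggregator:, metric:, tags:, comparison:, by: [])
  # Tags to slice the metric by, e.g. "role:web,status:up".
  tag_list = tags.map { |key, value| "#{key}:#{value}" }.join(',')
  query = "#{condition}(last_#{interval}):#{aggregator}:#{metric}{#{tag_list}}"
  # The whole `by {}` clause is optional.
  query += " by {#{by.join(',')}}" unless by.empty?
  "#{query} #{comparison}"
end

query = build_query(
  condition: 'max', interval: '5m', aggregator: 'avg',
  metric: 'haproxy_count_by_status',
  tags: { role: 'web', status: 'up' },
  by: ['host'], comparison: '> 0'
)
puts query
# prints: max(last_5m):avg:haproxy_count_by_status{role:web,status:up} by {host} > 0
```

Leaving out `by:` produces a single alert across all hosts instead of one alert per host.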

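Groups

Groups actually come from group_sources. We currently have just one group source, which reads groups from YAML files on the filesystem. However, we hope to add other group sources, such as an LDAP-based one.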
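
The following Ruby sketch shows one way a filesystem group source of this kind could work: it reads every YAML file in a directory and builds a map from group name to members. The `load_groups` helper and the `name` and `people` fields are assumptions made for illustration. This README does not specify the group file schema, so check the example alerts repository for the real format.

```ruby
require 'yaml'
require 'tmpdir'

# Reads every *.yaml file in dir and returns { group name => members }.
# The "name" and "people" fields are hypothetical; the real schema may differ.
def load_groups(dir)
  Dir.glob(File.join(dir, '*.yaml')).sort.each_with_object({}) do |path, groups|
    data = YAML.safe_load(File.read(path))
    groups[data['name']] = Array(data['people'])
  end
end

# Demonstrate with a throwaway directory holding a single group file.
groups = Dir.mktmpdir do |dir|
  File.write(File.join(dir, 'sre.yaml'), "name: sre\npeople:\n  - alice\n  - bob\n")
  load_groups(dir)
end
puts groups.inspect
```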