Using Metric Math in AWS CloudFormation Alarms

10 min read · Jan 5, 2019

The result with comments section

// resource name
"CloudformationALARM": {
  // resource type
  "Type": "AWS::CloudWatch::Alarm",
  // properties of that resource type
  "Properties": {
    // describe your alarm
    "AlarmDescription": "Alert when successful write on the dynamo table but the expected lambda does not run. Should return less than 1 or null",
    // what happens when you alarm; in our case we reference an SNS notification topic
    "AlarmActions": [
      {
        "Ref": "AlarmsSNSTopic"
      }
    ],
    // tell the alarm what to do with missing data: be cool with it, freak out and alarm, or go into a "yellow" state
    "TreatMissingData": "notBreaching",
    // the trickyish part: put your expression (metric math) and metrics in here
    "Metrics": [
      {
        // expression ID... they all need IDs
        "Id": "e1",
        // the math: use the IDs of the other metrics to do math... see m1 and m2 defined below
        "Expression": "(m1-m2) / m1",
        // you can label it
        "Label": "DynamoWritesandLambdaFires"
      },
      {
        // the metric's ID... stick this in your metric math above
        "Id": "m1",
        // make an actual metric here
        "MetricStat": {
          // treat this basically the same as you define any metric
          // put in stuff like a table or EC2 instance, read capacity, IOPS, CPU usage
          "Metric": {
            "Namespace": "AWS/DynamoDB",
            "MetricName": "ConsumedWriteCapacityUnits",
            "Dimensions": [
              {
                "Name": "TableName",
                "Value": {
                  "Fn::Sub": "${Environment}_My_table"
                }
              }
            ]
          },
          // put the period in as seconds, and get the right stat and unit
          // see your CloudWatch metrics console for help
          "Period": 300,
          "Stat": "Average",
          "Unit": "Count"
        },
        // Who knows how this works or really when you would want it... not I!
        // Leave it false and/or look at the documentation
        "ReturnData": false
      },
      {
        // do it all again for your other metric that goes into your math...
        // do it as many times as you want for lots of metric MATH! fun!
        "Id": "m2",
        "MetricStat": {
          "Metric": {
            "Namespace": "AWS/Lambda",
            "MetricName": "Invocations",
            "Dimensions": [
              {
                "Name": "FunctionName",
                "Value": {
                  "Fn::Sub": "${Environment}_My_LAMBDA"
                }
              }
            ]
          },
          "Period": 300,
          "Stat": "Minimum",
          "Unit": "Count"
        },
        "ReturnData": false
      }
    ],
    // the hard part of defining metrics is done
    // now set how many periods to evaluate,
    // the threshold for the alarm,
    // and the comparison (here: value >= threshold)
    "EvaluationPeriods": "1",
    "Threshold": 1.0,
    "ComparisonOperator": "GreaterThanOrEqualToThreshold"
  }
}

The result without comments section

"CloudformationALARM": {
  "Type": "AWS::CloudWatch::Alarm",
  "Properties": {
    "AlarmDescription": "Alert when successful write on the dynamo table but the expected lambda does not run. Should return less than 1 or null",
    "AlarmActions": [
      {
        "Ref": "AlarmsSNSTopic"
      }
    ],
    "TreatMissingData": "notBreaching",
    "Metrics": [
      {
        "Id": "e1",
        "Expression": "(m1-m2) / m1",
        "Label": "DynamoWritesandLambdaFires"
      },
      {
        "Id": "m1",
        "MetricStat": {
          "Metric": {
            "Namespace": "AWS/DynamoDB",
            "MetricName": "ConsumedWriteCapacityUnits",
            "Dimensions": [
              {
                "Name": "TableName",
                "Value": {
                  "Fn::Sub": "${Environment}_My_table"
                }
              }
            ]
          },
          "Period": 300,
          "Stat": "Average",
          "Unit": "Count"
        },
        "ReturnData": false
      },
      {
        "Id": "m2",
        "MetricStat": {
          "Metric": {
            "Namespace": "AWS/Lambda",
            "MetricName": "Invocations",
            "Dimensions": [
              {
                "Name": "FunctionName",
                "Value": {
                  "Fn::Sub": "${Environment}_My_LAMBDA"
                }
              }
            ]
          },
          "Period": 300,
          "Stat": "Minimum",
          "Unit": "Count"
        },
        "ReturnData": false
      }
    ],
    "EvaluationPeriods": "1",
    "Threshold": 1.0,
    "ComparisonOperator": "GreaterThanOrEqualToThreshold"
  }
}

The please explain this whole thing to me like you wish someone else had written for you section

Pretty often someone says “Hey, let’s do this thing!” and I’m supposed to do the thing. Unless it sounds really interesting, I start by trying to explain how we may not really need that thing. (Also, if it’s my idea, we definitely NEED that thing to happen.) When I’m convinced we do need the thing, or I’m interested, it usually means the thing has been done by someone else and it will be really easy to find the person who did it: find their blog post, read what they did, change a few words, and see if it works. This was one of those times I couldn’t find a clear answer to the thing.

So CloudFormation is nice if you have AWS infrastructure and you need to make sure it has nice labels and similar stacks [test, dev, eng-prod, qa, Prod, pre production, Jerry’s test area that everyone uses now]. We’re set up to do that sort of thing, which takes a big effort to move to. So if that’s what you want to move to, be sure everyone is going to do it. Or you’re going to do it for everyone else. It doesn’t really work halfway at all. Like most things, doing it halfway….

Our CloudFormation builds our alarms, and we had an issue we needed to alarm for. If this thing happens then this other thing should happen, and if not then we need an alarm. You’d think the AWS metrics would make it easy to smash those things together, or to make an alarm off of the state of two metrics. It doesn’t let you do that, which really surprised me. There’s a thread about it on the AWS forum:

https://forums.aws.amazon.com/thread.jspa?threadID=94984

And then I thought about it some more and it doesn’t surprise me too much. I haven’t been able to think through why, but I’m sure there are a number of cases and reasons why doing that would be a hassle that’s not worth it.

So Metric MATH!

Use Metric Math - Amazon CloudWatch (docs.aws.amazon.com): “Metric math enables you to query multiple CloudWatch metrics and use math expressions to create new time series based…”

Metric math lets you take some metrics and do math with them, and then you’ve got something to alarm on. NEAT.

Metric Math takes a little bit of thought to get through. It’s like your own little logic puzzle of metric 1 plus, minus, divided by, times metric 2, all over the sum of metric 3 and metric 4. There are some extra circumstances in this situation, but figuring out what I needed felt a lot like looking at this.

But like when you’re 20 years removed from your last math class, and without ever having tried to understand math anyway.

So I needed to handle this: if my DynamoDB table is doing stuff, then my lambda should kick off. My dynamo table write units metric is a number from 0 to a lot, or it’s null. If you have a metric where null is OK, you can set your alarm to be cool with not having data: https://stackoverflow.com/a/43678745/3634734

That’s not a special secret but it’s worth the reminder.

So metric 1 is consumed write capacity units. Null or a number greater than 0.

And metric 2 for me is invocations of my lambda. I used the min for a period so it’s always 1 if it fires and 0 if it doesn’t. I’m sure there is a better way. I’m sure there is a way I could have made consumed write capacity (m1) always 1 when it works, and then do something like m1+m2 with the threshold set to 2. If it equals 2, it’s solid. As I’m writing this I may need to go back and do that, but it’s done for now. Perfection takes too much time or something like that… AGILE!

So finally, for me (m1-m2)/m1 will always be less than 1 unless m2 doesn’t happen, in which case it should be m1/m1 = 1. When I get a 1, it alerts, because my table is writing but it’s not kicking off my lambda, which is what I want to know. My table write needs to kick off my lambda… I don’t actually know why. This is what my team said we needed and I said cool, that sounds interesting, and surely someone else has done it.
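To make the arithmetic concrete, here is a small Python sketch of the expression and the alarm comparison. The function names and sample values are made up for illustration, and it assumes m2 really shows up as 0 (rather than as missing data) when the lambda does not fire, which is how the reasoning above treats it.

```python
from typing import Optional

THRESHOLD = 1.0  # same value as "Threshold" in the template


def expression(m1: Optional[float], m2: float) -> Optional[float]:
    """Mirror of the metric math expression (m1 - m2) / m1.

    m1: average consumed write capacity units, None when the table was idle.
    m2: minimum lambda invocations for the period, assumed to be 0 or 1.
    """
    if m1 is None or m1 == 0:
        return None  # no writes, so there is nothing to evaluate
    return (m1 - m2) / m1


def breaches(value: Optional[float]) -> bool:
    """GreaterThanOrEqualToThreshold, with missing data treated as not breaching."""
    return value is not None and value >= THRESHOLD


# Table wrote and the lambda ran: result stays below 1, no alarm.
print(expression(4.0, 1), breaches(expression(4.0, 1)))
# Table wrote but the lambda did not run: result is m1/m1 = 1, alarm.
print(expression(4.0, 0), breaches(expression(4.0, 0)))
# Table idle: no data, no alarm.
print(expression(None, 0), breaches(expression(None, 0)))
```

Whether CloudWatch actually reports a 0 for a lambda that never ran is worth checking in the console for your own function before relying on this.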

So that part was all fine. The math was a little tricky, but shouldn’t have been. Then I needed to put it all in CloudFormation. CloudFormation is generally well documented and easy to find examples for. I could not find anyone with a complete example of an alarm using metric math. I found a lot of pieces. But I figured I’d find one like BOOM, here is the whole thing, change these words and identifiers, and here’s how it works also.

So here is what I ended up with, with some different names, and I’ll walk through how it works below.

"CloudformationALARM": {
  "Type": "AWS::CloudWatch::Alarm",
  "Properties": {
    "AlarmDescription": "Alert when successful write on the dynamo table but the expected lambda does not run. Should return less than 1 or null",
    "AlarmActions": [
      {
        "Ref": "AlarmsSNSTopic"
      }
    ],
    "TreatMissingData": "notBreaching",
    "Metrics": [
      {
        "Id": "e1",
        "Expression": "(m1-m2) / m1",
        "Label": "DynamoWritesandLambdaFires"
      },
      {
        "Id": "m1",
        "MetricStat": {
          "Metric": {
            "Namespace": "AWS/DynamoDB",
            "MetricName": "ConsumedWriteCapacityUnits",
            "Dimensions": [
              {
                "Name": "TableName",
                "Value": {
                  "Fn::Sub": "${Environment}_My_table"
                }
              }
            ]
          },
          "Period": 300,
          "Stat": "Average",
          "Unit": "Count"
        },
        "ReturnData": false
      },
      {
        "Id": "m2",
        "MetricStat": {
          "Metric": {
            "Namespace": "AWS/Lambda",
            "MetricName": "Invocations",
            "Dimensions": [
              {
                "Name": "FunctionName",
                "Value": {
                  "Fn::Sub": "${Environment}_My_LAMBDA"
                }
              }
            ]
          },
          "Period": 300,
          "Stat": "Minimum",
          "Unit": "Count"
        },
        "ReturnData": false
      }
    ],
    "EvaluationPeriods": "1",
    "Threshold": 1.0,
    "ComparisonOperator": "GreaterThanOrEqualToThreshold"
  }
}

The end result isn’t too impressive. The key is using “Metrics:” rather than “MetricName:”. And Metrics takes a lot of info… which also isn’t too impressive once you know what it wants in there.

This first part is all pretty cake.

"CloudformationALARM": {
  "Type": "AWS::CloudWatch::Alarm",
  "Properties": {
    "AlarmDescription": "Alert when successful write on the dynamo table but the expected lambda does not run. Should return less than 1 or null",
    "AlarmActions": [
      {
        "Ref": "AlarmsSNSTopic"
      }
    ],
    "TreatMissingData": "notBreaching",

You make your alarm resource, it’s of type alarm, and its normal properties are there: description, actions. And that treat missing data thing is what I was talking about before… if the data is missing or null or whatever, the alarm doesn’t show as insufficient data. Which is kind of nice when you look at your console and there are not 150 alarms that say insufficient data.
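As a rough mental model of TreatMissingData, here is a simplified Python sketch of what the alarm state becomes when an evaluation period has no data at all. It only covers the single-period case used in this template, and the function name is made up.

```python
def state_when_data_is_missing(treat_missing_data: str, current_state: str) -> str:
    """Alarm state after one evaluation period in which every data point is missing."""
    if treat_missing_data == "notBreaching":
        return "OK"  # missing data counts as good
    if treat_missing_data == "breaching":
        return "ALARM"  # missing data counts as bad
    if treat_missing_data == "ignore":
        return current_state  # alarm stays wherever it was
    if treat_missing_data == "missing":
        return "INSUFFICIENT_DATA"  # the "yellow" state
    raise ValueError(f"unknown TreatMissingData value: {treat_missing_data}")


for option in ("notBreaching", "breaching", "ignore", "missing"):
    print(option, "->", state_when_data_is_missing(option, current_state="OK"))
```

With "notBreaching", an idle table just leaves the alarm in OK, which is why the console is not full of insufficient data alarms.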

METRICS

"Metrics": [
      {
        "Id": "e1",
        "Expression": "(m1-m2) / m1",
        "Label": "DynamoWritesandLambdaFires"
      },
      {
        "Id": "m1",
        "MetricStat": {
          "Metric": {
            "Namespace": "AWS/DynamoDB",
            "MetricName": "ConsumedWriteCapacityUnits",
            "Dimensions": [
              {
                "Name": "TableName",
                "Value": {
                  "Fn::Sub": "${Environment}_My_table"
                }
              }
            ]
          },
          "Period": 300,
          "Stat": "Average",
          "Unit": "Count"
        },
        "ReturnData": false
      },
      {
        "Id": "m2",
        "MetricStat": {
          "Metric": {
            "Namespace": "AWS/Lambda",
            "MetricName": "Invocations",
            "Dimensions": [
              {
                "Name": "FunctionName",
                "Value": {
                  "Fn::Sub": "${Environment}_My_LAMBDA"
                }
              }
            ]
          },
          "Period": 300,
          "Stat": "Minimum",
          "Unit": "Count"
        },
        "ReturnData": false
      }
    ],

So this first group…

{
        "Id": "e1",
        "Expression": "(m1-m2) / m1",
        "Label": "DynamoWritesandLambdaFires"
      },

That’s our expression. Our math. It takes m1 and m2… and you can name m1 and m2 almost anything (an Id has to start with a lowercase letter and can only contain letters, numbers, and underscores), so if you need it to make sense for someone else, in this case they could have been named myDynamoTableCapacityUsed and myLambdaInvocationsCount. So whatever you went into in the console to figure out your math with… take that e1 expression and put it in the expression area. The label you can name whatever you want. The ID for the expression could also be something else. Taking those examples we’d have:

{
        "Id": "myExpressionId",
        "Expression": "(myFirstMetric-mySecondMetric) / myFirstMetric",
        "Label": "MyLabelForThisMetricMathThing"
      },
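If you rename IDs, it is easy to end up with an expression that points at an ID that no longer exists. Below is a minimal Python sketch of a local sanity check. The naming rule is my reading of the CloudWatch API reference for MetricDataQuery (an Id starts with a lowercase letter, then letters, digits, or underscores), the function name is made up, and the check only understands plain arithmetic expressions.

```python
import re

# Assumed rule: lowercase first letter, then letters, digits, or underscores.
ID_PATTERN = re.compile(r"^[a-z][A-Za-z0-9_]*$")


def check_metric_ids(metrics):
    """Return a list of human-readable problems; an empty list means it looks fine."""
    problems = []
    ids = [entry["Id"] for entry in metrics]
    for metric_id in ids:
        if not ID_PATTERN.match(metric_id):
            problems.append(f"invalid Id: {metric_id}")
    if len(set(ids)) != len(ids):
        problems.append("duplicate Id")
    for entry in metrics:
        # Only handles plain arithmetic; function names such as SUM would be flagged.
        for name in re.findall(r"[A-Za-z_]\w*", entry.get("Expression", "")):
            if name not in ids:
                problems.append(f"{entry['Id']} references unknown Id: {name}")
    return problems


# Entries trimmed down to the fields the check looks at.
print(check_metric_ids([
    {"Id": "e1", "Expression": "(m1-m2) / m1"},
    {"Id": "m1"},
    {"Id": "m2"},
]))
print(check_metric_ids([{"Id": "e1", "Expression": "m1 / m3"}, {"Id": "m1"}]))
```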

Now the metrics. These are defined a lot like a normal metric is in CloudFormation alarms. They take an ID, and then in the MetricStat you put in info like a normal CloudFormation metric reference.

{
        "Id": "m1",
        "MetricStat": {
          "Metric": {
            "Namespace": "AWS/DynamoDB",
            "MetricName": "ConsumedWriteCapacityUnits",
            "Dimensions": [
              {
                "Name": "TableName",
                "Value": {
                  "Fn::Sub": "${Environment}_My_table"
                }
              }
            ]
          },
          "Period": 300,
          "Stat": "Average",
          "Unit": "Count"
        },
        "ReturnData": false
      },
      {
        "Id": "m2",
        "MetricStat": {
          "Metric": {
            "Namespace": "AWS/Lambda",
            "MetricName": "Invocations",
            "Dimensions": [
              {
                "Name": "FunctionName",
                "Value": {
                  "Fn::Sub": "${Environment}_My_LAMBDA"
                }
              }
            ]
          },
          "Period": 300,
          "Stat": "Minimum",
          "Unit": "Count"
        },
        "ReturnData": false
      }
    ],

The confusing part can be the dimension Name. With DynamoDB it’s TableName, with Lambda it’s FunctionName, with EC2 it’s InstanceId. So be sure to look that up.

And you can pull the value from parameters or other stuff in your CF template. In this case the table is named the same thing in every stack except for the environment prefix.

"Dimensions": [
              {
                "Name": "TableName",
                "Value": {
                  "Fn::Sub": "${Environment}_My_table"
                }
              }
            ]
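To see what that Fn::Sub ends up producing, here is a loose Python imitation of the substitution. It is only a sketch: it assumes Environment is a plain template parameter (the parameter definition is not shown in this post), it ignores everything else the real Fn::Sub can do, and the function name is made up.

```python
import re


def fake_sub(template: str, values: dict) -> str:
    """Replace ${Name} placeholders with values, loosely imitating Fn::Sub."""
    return re.sub(r"\$\{(\w+)\}", lambda match: values[match.group(1)], template)


# One table name per environment, all from the same template string.
for env in ("dev", "qa", "prod"):
    print(fake_sub("${Environment}_My_table", {"Environment": env}))
```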

The period, stat, and unit are somewhat self-explanatory, and if you’re not sure which one to use, go back into the console and select only that metric. The stat is min, max, average, and so on. The graph in the console will display the unit. And the period is a number of seconds.

"Period": 300,
          "Stat": "Average",
          "Unit": "Count"
        },
        "ReturnData": false
      },

I don’t know what that return data stuff does because I didn’t have to read this paragraph enough to understand it…

ReturnData

When used in GetMetricData, this option indicates whether to return the timestamps and raw data values of this metric. If you are performing this call just to do math expressions and do not also need the raw data returned, you can specify False. If you omit this, the default of True is used.

When used in PutMetricAlarm, specify True for the one expression result to use as the alarm. For all other metrics and expressions in the same PutMetricAlarm operation, specify ReturnData as False.

Type: Boolean

Required: No

So that’s ReturnData and someday I may have to add to this because I may need to use it.
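Reading that quoted paragraph literally, exactly one entry in the alarm should return data, and an entry that leaves ReturnData out counts as true. A small Python sketch of that rule, with a made-up function name and the metric entries trimmed down to the relevant fields:

```python
def returned_ids(metrics):
    """Ids whose data is returned; ReturnData defaults to True when omitted."""
    return [entry["Id"] for entry in metrics if entry.get("ReturnData", True)]


metrics = [
    {"Id": "e1", "Expression": "(m1-m2) / m1"},  # no ReturnData, so it defaults to True
    {"Id": "m1", "ReturnData": False},
    {"Id": "m2", "ReturnData": False},
]

alarm_on = returned_ids(metrics)
if len(alarm_on) != 1:
    raise ValueError(f"expected exactly one entry to return data, got {alarm_on}")
print("the alarm watches:", alarm_on[0])
```

In the template above that works out to the e1 expression being the thing the alarm watches, with m1 and m2 only feeding the math.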

Metric 2 or M2

{
        "Id": "m2",
        "MetricStat": {
          "Metric": {
            "Namespace": "AWS/Lambda",
            "MetricName": "Invocations",
            "Dimensions": [
              {
                "Name": "FunctionName",
                "Value": {
                  "Fn::Sub": "${Environment}_My_LAMBDA"
                }
              }
            ]
          },
          "Period": 300,
          "Stat": "Minimum",
          "Unit": "Count"
        },
        "ReturnData": false
      }
    ],

So if we changed those IDs like in that other example, to make it all kind of fit, changing m1 and m2 to myFirstMetric and mySecondMetric (IDs have to start with a lowercase letter), it would look like this:

{
        "Id": "myExpressionId",
        "Expression": "(myFirstMetric-mySecondMetric) / myFirstMetric",
        "Label": "MyLabelForThisMetricMathThing"
      },
{
        "Id": "myFirstMetric",
        "MetricStat": {
          "Metric": {
            "Namespace": "AWS/DynamoDB",
            "MetricName": "ConsumedWriteCapacityUnits",
            "Dimensions": [
              {
                "Name": "TableName",
                "Value": {
                  "Fn::Sub": "${Environment}_My_table"
                }
              }
            ]
          },
          "Period": 300,
          "Stat": "Average",
          "Unit": "Count"
        },
        "ReturnData": false
      },
      {
        "Id": "mySecondMetric",
        "MetricStat": {
          "Metric": {
            "Namespace": "AWS/Lambda",
            "MetricName": "Invocations",
            "Dimensions": [
              {
                "Name": "FunctionName",
                "Value": {
                  "Fn::Sub": "${Environment}_My_LAMBDA"
                }
              }
            ]
          },
          "Period": 300,
          "Stat": "Minimum",
          "Unit": "Count"
        },
        "ReturnData": false
      }
    ],

The last bit is also pretty normal alarm stuff. Evaluation periods counts the periods defined in “Metrics:”. Comparison operators are pretty easy to understand. And the threshold is that line where you are setting your limit. One thing I messed up: I’m used to seeing stat and count and period all nearby this evaluation and threshold and comparison stuff… They are now in that Metrics[] list and you don’t need them twice. But CloudFormation does do a good job of giving you a clear error when you try to update your template.

"EvaluationPeriods": "1",
    "Threshold": 1.0,
    "ComparisonOperator": "GreaterThanOrEqualToThreshold"
  }
},
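The mistake described above, leaving the old top-level stat and period next to the new Metrics list, can be caught locally before CloudFormation complains. This is a rough Python sketch: the list of conflicting property names is my understanding of the CloudFormation documentation for Metrics and is worth double-checking there, and the function name is made up.

```python
# Top-level alarm properties assumed not to be allowed together with "Metrics".
CONFLICTS_WITH_METRICS = {
    "MetricName", "Namespace", "Dimensions", "Period",
    "Statistic", "ExtendedStatistic", "Unit",
}


def conflicting_properties(properties: dict) -> list:
    """Return top-level properties that clash with a Metrics list, sorted by name."""
    if "Metrics" not in properties:
        return []
    return sorted(CONFLICTS_WITH_METRICS & set(properties))


leftover = {
    "Metrics": [],  # contents do not matter for this check
    "Period": 300,
    "Statistic": "Average",
    "EvaluationPeriods": "1",
    "Threshold": 1.0,
    "ComparisonOperator": "GreaterThanOrEqualToThreshold",
}
print(conflicting_properties(leftover))
```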

Hopefully that helps get metric math into more CloudFormation templates. You could stick those metrics in dashboards and stuff. Personally I think I’d use it to avoid some put-metric stuff in CloudWatch, and it’s probably a cheaper way to get some “fancier” metrics.
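If the two nearly identical metric blocks bother you, one option is to generate the JSON instead of typing it. Here is a minimal Python sketch that rebuilds the same alarm resource with a small helper and prints it. The helper name is made up, and the values are copied from the template in this post.

```python
import json


def metric_stat(metric_id, namespace, metric_name, dimension_name, dimension_value, stat):
    """Build one input metric entry for the alarm's Metrics list."""
    return {
        "Id": metric_id,
        "MetricStat": {
            "Metric": {
                "Namespace": namespace,
                "MetricName": metric_name,
                "Dimensions": [
                    {"Name": dimension_name, "Value": {"Fn::Sub": dimension_value}}
                ],
            },
            "Period": 300,
            "Stat": stat,
            "Unit": "Count",
        },
        "ReturnData": False,  # input metrics only feed the expression
    }


alarm = {
    "CloudformationALARM": {
        "Type": "AWS::CloudWatch::Alarm",
        "Properties": {
            "AlarmDescription": "Alert when successful write on the dynamo table "
                                "but the expected lambda does not run.",
            "AlarmActions": [{"Ref": "AlarmsSNSTopic"}],
            "TreatMissingData": "notBreaching",
            "Metrics": [
                {"Id": "e1", "Expression": "(m1-m2) / m1",
                 "Label": "DynamoWritesandLambdaFires"},
                metric_stat("m1", "AWS/DynamoDB", "ConsumedWriteCapacityUnits",
                            "TableName", "${Environment}_My_table", "Average"),
                metric_stat("m2", "AWS/Lambda", "Invocations",
                            "FunctionName", "${Environment}_My_LAMBDA", "Minimum"),
            ],
            "EvaluationPeriods": "1",
            "Threshold": 1.0,
            "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        },
    }
}

print(json.dumps(alarm, indent=2))
```

The printed JSON is meant to be pasted under Resources in a template, the same place the hand-written version lives.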

If you get through this and are still stuck, reach out.