Export workflow metrics
This MR adds the following three metrics.
First, some general metrics about route performance and call counts, which look like this:
```
# TYPE http_request_timing summary
http_request_timing_count{route_name="get_healthcheck",status_code="200"} 41.0
http_request_timing_sum{route_name="get_healthcheck",status_code="200"} 0.10483574867248535
http_request_timing_count{route_name="create_workflow_request",status_code="200"} 1.0
http_request_timing_sum{route_name="create_workflow_request",status_code="200"} 0.0783846378326416
http_request_timing_count{route_name="submit_workflow_request",status_code="200"} 1.0
http_request_timing_sum{route_name="submit_workflow_request",status_code="200"} 0.1897449493408203
http_request_timing_count{route_name="create_and_submit_workflow_request",status_code="200"} 5.0
http_request_timing_sum{route_name="create_and_submit_workflow_request",status_code="200"} 0.5307431221008301
```
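A minimal sketch of how a summary metric like this could be recorded with `prometheus_client` follows; the `timed_route` wrapper and its wiring into the route handlers are illustrative assumptions, not the exact mechanism used in the services.

```python
# Hypothetical sketch: record per-route request timing with prometheus_client.
import time
from prometheus_client import Summary

HTTP_REQUEST_TIMING = Summary(
    "http_request_timing",
    "Duration of HTTP requests by route and status code",
    ["route_name", "status_code"],
)

def timed_route(route_name, handler):
    """Wrap a route handler and observe its duration when it completes."""
    def wrapper(*args, **kwargs):
        start = time.time()
        status_code = "500"
        try:
            response = handler(*args, **kwargs)
            status_code = str(getattr(response, "status_code", 200))
            return response
        finally:
            HTTP_REQUEST_TIMING.labels(
                route_name=route_name, status_code=status_code
            ).observe(time.time() - start)
    return wrapper
```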
Second, the Capability service exports a new metric describing the state of the capability queues, which looks like this:
```
capability_queue_total{capability="restore_cms",status="executing"} 0.0
capability_queue_total{capability="std_restore_imaging",status="waiting"} 0.0
capability_queue_total{capability="std_restore_imaging",status="executing"} 0.0
capability_queue_total{capability="null_dag",status="waiting"} 0.0
capability_queue_total{capability="null_dag",status="executing"} 0.0
capability_queue_total{capability="std_calibration",status="waiting"} 0.0
capability_queue_total{capability="std_calibration",status="executing"} 2.0
capability_queue_total{capability="std_cms_imaging",status="waiting"} 0.0
capability_queue_total{capability="std_cms_imaging",status="executing"} 0.0
capability_queue_total{capability="test_download",status="waiting"} 0.0
capability_queue_total{capability="test_download",status="executing"} 0.0
capability_queue_total{capability="null",status="waiting"} 0.0
capability_queue_total{capability="null",status="executing"} 2.0
```
Prometheus does not appear to distinguish between ints and floats, but this lets us collect a fresh copy of this report on every reporting interval, which could be useful for debugging.
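As a rough sketch of how this metric could be refreshed on each reporting interval with `prometheus_client`, consider the following; `get_queue_counts` is a hypothetical stand-in for however the Capability service actually inspects its queues.

```python
# Hypothetical sketch: publish per-capability queue sizes as a labeled gauge.
from prometheus_client import Gauge

CAPABILITY_QUEUE_TOTAL = Gauge(
    "capability_queue_total",
    "Number of capability requests per queue and execution state",
    ["capability", "status"],
)

def report_queue_sizes(get_queue_counts):
    """Refresh the gauge from a mapping of (capability, status) -> count."""
    for (capability, status), count in get_queue_counts().items():
        CAPABILITY_QUEUE_TOTAL.labels(capability=capability, status=status).set(count)
```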
Finally, the Workflow service exports the number of running wf_monitor processes:
```
# TYPE wf_monitors_running gauge
wf_monitors_running 0.0
```
This was obtained by wrapping the Popen call to wf_monitor, incrementing the gauge when the process starts, and dispatching a thread that waits on each process and decrements the gauge when it exits. I have watched this work with 5 concurrent workflow requests, so it appears to behave correctly.
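A minimal sketch of that Popen-wrapping approach, assuming `prometheus_client`; the wf_monitor command line shown here is a placeholder, not the actual invocation.

```python
# Hypothetical sketch: track running wf_monitor processes with a gauge.
import subprocess
import threading
from prometheus_client import Gauge

WF_MONITORS_RUNNING = Gauge(
    "wf_monitors_running",
    "Number of wf_monitor processes currently running",
)

def launch_wf_monitor(args):
    """Start wf_monitor, bump the gauge, and decrement it when the process exits."""
    process = subprocess.Popen(["wf_monitor", *args])
    WF_MONITORS_RUNNING.inc()

    def waiter():
        process.wait()
        WF_MONITORS_RUNNING.dec()

    threading.Thread(target=waiter, daemon=True).start()
    return process
```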