How to monitor your FastAPI service

In this article, I'll discuss how to monitor the latency and code performance of a FastAPI service.

API Monitoring vs API Profiling

Monitoring is essentially collecting data in the background of your application to help diagnose issues, debug errors, or report on the latency of a service.

At the infrastructure level, for example, you can monitor CPU and memory utilization. At the application level, you can monitor errors, code performance or database query performance. For a more complete introduction to monitoring and why it's necessary, see this excellent post from Full Stack Python.

In this post, we will focus on Application Performance Monitoring (APM) for a FastAPI application.

Error Tracking

In this post, I will not talk about monitoring application errors and warnings. For this purpose, check out Sentry: it has great ASGI support and will work out of the box with your FastAPI service.

API Profiling

Profiling is a coding best practice that is not specific to web development. The Python docs on profiling say:

the profilers run code and give you a detailed breakdown of execution times, allowing you to identify bottlenecks in your programs. Auditing events provide visibility into runtime behaviors that would otherwise require intrusive debugging or patching.

You can of course apply profiling in the context of a FastAPI application, in which case you might find this timing middleware handy.

However, with this approach, the timing data is logged to stdout. You can use it in development to find bottlenecks, but in practice, digging through production logs to get latency information is not very convenient.
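To make the profiling idea concrete outside of any web framework, here is a minimal sketch using the standard library's cProfile and pstats modules. The functions slow_sum and handle_request are hypothetical stand-ins for real application code, not something taken from my service.

```python
import cProfile
import io
import pstats


def slow_sum(n: int) -> int:
    """Deliberately naive loop standing in for a real bottleneck."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def handle_request() -> int:
    """Hypothetical request handler that calls the slow function."""
    return slow_sum(200_000)


def profile_call(func):
    """Run func under cProfile and return (result, text report)."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        result = func()
    finally:
        profiler.disable()
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Show the 5 entries with the highest cumulative time.
    stats.sort_stats("cumulative").print_stats(5)
    return result, buffer.getvalue()


result, report = profile_call(handle_request)
print(report)
```

The report lists each function with its call count and cumulative time, which is fine for a one-off investigation on your machine but, like the stdout logs above, is not something you want to collect by hand in production.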

Available Tools for Application Performance Monitoring (APM)

As with all things, there are many options. Some are open source, some are SaaS businesses. Most likely you or your organisation are already using one or more monitoring tools, so I'd suggest starting with the one you know. The tools on the list below do more than just APM, which sometimes makes them harder to understand. Example application monitoring tools you might have heard of:

  • New Relic (commercial with parts open source)
  • Datadog (commercial with parts open source)
  • StatsD (open source)
  • Prometheus (open source)
  • OpenTelemetry (open source)

This list is not exhaustive, but note OpenTelemetry, which is the most recent tool on this list and is now the de facto standard for application monitoring metrics.

At this point, the choice of tool doesn't matter. Let's rather understand what an APM tool does.

The 4 Steps of Monitoring

[Figure: the 4 steps of monitoring]

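  1. It all starts with your application code. You instrument your service with a library corresponding to your app's language (in our case Python). This is the monitoring client library. Examples include the New Relic Python agent (newrelic/newrelic-python-agent), the Datadog Python APM client (DataDog/dd-trace-py) and the OpenTelemetry Python API and SDK (open-telemetry/opentelemetry-python).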

  2. Then the monitoring client library sends each individual call to the monitoring server daemon over the network (UDP in particular, as opposed to TCP or HTTP).

  3. The monitoring server daemon listens for monitoring events coming from the applications. It packs the incoming data into batches and regularly sends them to the monitoring backend.

  4. The monitoring backend usually has two parts: a data processing application and a visualisation webapp. It turns the stream of monitoring data into human-readable charts and alerts. Examples:

    • app.datadoghq.com
    • one.newrelic.com

[Figure: monitoring backend]
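Steps 2 and 3 can be sketched with nothing but the standard library. The example below is an illustration under assumptions, not the code of any real agent: the class names UdpMonitoringClient and BatchingDaemon are made up, the metric name my_api.request is hypothetical, and the wire format is the StatsD-style `<name>:<value>|ms` timing line.

```python
import socket
from typing import List


def format_timing(name: str, millis: int) -> bytes:
    """Encode one timing event as a StatsD-style line: <name>:<value>|ms."""
    return f"{name}:{millis}|ms".encode("ascii")


class UdpMonitoringClient:
    """Step 2: fire-and-forget sender living inside the application."""

    def __init__(self, host: str, port: int) -> None:
        self._address = (host, port)
        self._sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def timing(self, name: str, millis: int) -> None:
        # UDP needs no connection, so the app does not wait on the daemon.
        self._sock.sendto(format_timing(name, millis), self._address)

    def close(self) -> None:
        self._sock.close()


class BatchingDaemon:
    """Step 3: listens for events and groups them into batches."""

    def __init__(self, batch_size: int) -> None:
        self._sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        self._sock.bind(("127.0.0.1", 0))  # port 0: the OS picks a free port
        self._sock.settimeout(2.0)  # do not hang forever if a datagram is lost
        self.port = self._sock.getsockname()[1]
        self._batch_size = batch_size

    def receive_batch(self) -> List[str]:
        """Block until batch_size events arrived; a real daemon would then ship them."""
        batch = []
        while len(batch) < self._batch_size:
            data, _sender = self._sock.recvfrom(1024)
            batch.append(data.decode("ascii"))
        return batch

    def close(self) -> None:
        self._sock.close()


daemon = BatchingDaemon(batch_size=3)
client = UdpMonitoringClient("127.0.0.1", daemon.port)
for latency_ms in (12, 8, 30):
    client.timing("my_api.request", latency_ms)
batch = daemon.receive_batch()
client.close()
daemon.close()
print(batch)
```

The trade-off of UDP is visible here: the client never learns whether the daemon received anything, which keeps the overhead in the request path small at the cost of possibly dropping events.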

The problem with monitoring ASGI webapps

ASGI is a relatively new standard for Python web servers. As with every new standard, it will take some time for all tools in the ecosystem to support it.

Given the 4 steps of monitoring laid out above, a problem arises if the monitoring client library doesn't support ASGI. For example, this is the case with NewRelic at the moment (see ASGI - Starlette/Fast API Framework · Issue #5 · newrelic/newrelic-python-agent for more details). I looked at Datadog too and saw that ASGI is also not supported at the moment.

On the open source side, however, OpenTelemetry had great support for ASGI. So I set out to instrument my FastAPI service with OpenTelemetry.
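To see what "supporting ASGI" means for a client library, remember that an ASGI application is an async callable taking scope, receive and send. Anything that wants to time a request has to wrap that callable. The sketch below is a hand-rolled illustration, not the code of any of the libraries mentioned: TimingMiddleware, hello_app, call_once and the record callback are hypothetical names, and the fake scope only carries the few keys this example reads.

```python
import asyncio
import time


class TimingMiddleware:
    """Pure ASGI middleware that measures the duration of each HTTP request."""

    def __init__(self, app, record):
        self.app = app
        self.record = record  # hypothetical callback: record(path, seconds)

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            # Pass lifespan and websocket traffic through untouched.
            await self.app(scope, receive, send)
            return
        start = time.perf_counter()
        try:
            await self.app(scope, receive, send)
        finally:
            self.record(scope["path"], time.perf_counter() - start)


async def hello_app(scope, receive, send):
    """Tiny ASGI app standing in for a FastAPI service."""
    await asyncio.sleep(0.01)  # pretend to do some work
    await send({
        "type": "http.response.start",
        "status": 200,
        "headers": [(b"content-type", b"text/plain")],
    })
    await send({"type": "http.response.body", "body": b"hello"})


async def call_once(app, path):
    """Drive the app with a fake request, the way an ASGI server would."""
    sent = []

    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}

    async def send(message):
        sent.append(message)

    scope = {"type": "http", "method": "GET", "path": path, "headers": []}
    await app(scope, receive, send)
    return sent


timings = []
app = TimingMiddleware(hello_app, lambda path, seconds: timings.append((path, seconds)))
messages = asyncio.run(call_once(app, "/ping"))
print(timings)
```

A client library written for WSGI expects a synchronous callable with a different signature, so it cannot simply be pointed at an app like this one.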

Update - Sep 19th, 2020: There seems to be support for ASGI in ddtrace

Update - Sep 22nd, 2020: There is now an API in the NewRelic agent to support ASGI frameworks, with uvicorn already supported and starlette on the way.

Update - Oct 23rd, 2020: The NewRelic python agent now supports Starlette and FastAPI out of the box.

Instrumenting FastAPI with OpenTelemetry and Jaeger

OpenTelemetry provides a standard for steps 1 (with Instrumentors) and 2 (with Exporters) from the 4 steps above. One of the big advantages of OpenTelemetry is that you can send the events to any monitoring backend (commercial or open source). This is especially awesome because you can use the same instrumentation setup for development, staging and production environments.

Update - May 30th, 2021: GitHub is now adopting OpenTelemetry

Note that depending on the language you use for your microservice, your mileage may vary. For example, there is no NewRelic OpenTelemetry Exporter in Python yet. But there are OpenTelemetry Exporters for many others, see the list here: Registry | OpenTelemetry (filter by language and with type=Exporter).

One of the available backends is Jaeger: open source, end-to-end distributed tracing. (Note that Jaeger is also a monitoring client library that you can instrument your application with, but that's not the part of interest here.)

[Figure: OpenTelemetry with Jaeger]

Although it's open source and was really easy to get working, the issue I had with Jaeger was that it doesn't have any data pipeline yet. This means that, in the visualisation webapp, you can browse traces but you cannot see any aggregated charts. Such a backend is on their roadmap though.

Still, Jaeger is my go-to tool for monitoring while in development. See the last part for more details.

Instrumenting FastAPI with OpenTelemetry and Datadog

I couldn't find any open source monitoring backend with a data pipeline that would provide the features I was looking for (latency percentile plots, bar chart of total requests and errors ...).
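For a sense of what such a data pipeline computes, here is a small sketch that turns raw per-request records into the numbers behind a latency percentile plot and a requests/errors bar chart. The summarize function and the (latency_ms, status_code) record shape are assumptions made for the example, and counting only 5xx responses as errors is one possible convention.

```python
import statistics
from typing import Dict, List, Tuple


def summarize(requests: List[Tuple[float, int]]) -> Dict[str, float]:
    """Aggregate (latency_ms, status_code) pairs into chart-ready numbers."""
    latencies = [latency for latency, _status in requests]
    # quantiles(n=100) returns the 99 cut points p1..p99 (index 0 is p1).
    cuts = statistics.quantiles(latencies, n=100, method="inclusive")
    errors = sum(1 for _latency, status in requests if status >= 500)
    return {
        "total": len(requests),
        "errors": errors,
        "p50_ms": cuts[49],
        "p95_ms": cuts[94],
        "p99_ms": cuts[98],
    }


# Synthetic data: latencies of 1..101 ms, the slowest few requests fail.
sample = [(float(ms), 500 if ms > 95 else 200) for ms in range(1, 102)]
summary = summarize(sample)
print(summary)
```

A trace browser shows you each of these requests one by one. The aggregation step is what lets you spot that the slowest few percent of requests behave very differently from the median.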

It became apparent that this is where commercial solutions like NewRelic and Datadog shine. So I set out to try the OpenTelemetry Datadog exporter.

[Figure: OpenTelemetry with Datadog]

With this approach, you get a fully featured monitoring backend that will allow you to have full observability for your microservice.

The 2 drawbacks are:

  • you need to deploy the Datadog agent yourself (with Docker, on Kubernetes, or on whatever environment fits your stack) and this can get a bit involved
  • Datadog being a commercial product, this solution will not be free. You will have to pay extra attention to the pricing of Datadog (especially if you deploy the Datadog agent to Kubernetes 😈).

Example FastAPI instrumentation using OpenTelemetry, Jaeger and Datadog

So how does it look in the code? This is how my application factory looks. If you have any questions, feel free to reach out on twitter or open a github issue. I will not share my instrumentation because it is specific to my application, but imagine that you can define any nested spans and that those traces will be sent the same way to Jaeger or to Datadog. This makes it really fast to iterate on your instrumentation code (e.g. add or remove spans), and even faster to find performance bottlenecks in your code.

main.py
"""FastAPI Application factory with OpenTelemetry instrumentation
sent to Jaeger in dev and to DataDog in staging and production."""
from fastapi import FastAPI
from opentelemetry import trace
from opentelemetry.exporter.datadog import DatadogExportSpanProcessor, DatadogSpanExporter
from opentelemetry.exporter.jaeger import JaegerSpanExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchExportSpanProcessor

from my_api.config import generate_settings
from my_api.routers import my_router_a, my_router_b


def get_application() -> FastAPI:
    """Application factory.

    Returns:
        ASGI application to be passed to ASGI server like uvicorn or hypercorn.

    Reference:
    - [FastAPI Middlewares](https://fastapi.tiangolo.com/advanced/middleware/)
    """
    # load application settings
    settings = generate_settings()

    if settings.environment != "development":
        # opentelemetry + datadog for staging or production
        trace.set_tracer_provider(TracerProvider())
        datadog_exporter = DatadogSpanExporter(
            agent_url=settings.dd_trace_agent_url,
            service=settings.dd_service,
            env=settings.environment,
            version=settings.dd_version,
            tags=settings.dd_tags,
        )
        trace.get_tracer_provider().add_span_processor(
            DatadogExportSpanProcessor(datadog_exporter)
        )
    else:
        # opentelemetry + jaeger for development
        # requires jaeger running in a container
        trace.set_tracer_provider(TracerProvider())
        jaeger_exporter = JaegerSpanExporter(
            service_name="my-app", agent_host_name="localhost", agent_port=6831,
        )
        trace.get_tracer_provider().add_span_processor(
            BatchExportSpanProcessor(jaeger_exporter, max_export_batch_size=10)
        )

    application = FastAPI(
        title="My API",
        version="1.0",
        description="Do something awesome, while being monitored.",
    )
    # Add your routers
    application.include_router(my_router_a)
    application.include_router(my_router_b)

    FastAPIInstrumentor.instrument_app(application)
    return application


app = get_application()

Conclusion

I hope that with this post you've learned:

  • the difference between profiling, monitoring and error tracking
  • the architecture of application monitoring
  • some of the application monitoring tools out there
  • that OpenTelemetry allows you to reuse the same instrumentation setup for all your environments, which speeds up finding performance bottlenecks in your application

I've used this setup to get a 10x speed-up on a multilingual NLP FastAPI service I built at OneFootball.

Resources

  1. StatsD, What It Is and How It Can Help You | Datadog
  2. Monitoring - Full Stack Python
  3. ASGI | Sentry Documentation
  4. Debugging and Profiling — Python 3.9.0 documentation
  5. Timing Middleware - FastAPI Utilities
  6. APM | New Relic Documentation
  7. APM & Distributed Tracing - Datadog
  8. OpenTelemetry
  9. newrelic/newrelic-python-agent: New Relic Python Agent
  10. DataDog/dd-trace-py: Datadog Python APM Client
  11. open-telemetry/opentelemetry-python: OpenTelemetry Python API and SDK
  12. Registry | OpenTelemetry
  13. Jaeger: open source, end-to-end distributed tracing
  14. Getting Started with OpenTelemetry Python — OpenTelemetry Python documentation