In this article, I’ll discuss how to monitor the latency and code performance of a FastAPI service.
Monitoring essentially means collecting data in the background of your application to help diagnose issues, debug errors, or report on the latency of a service.
For example, at the infrastructure level, you can monitor CPU and memory utilization. At the application level, you can monitor errors, code performance or database query performance. For a more complete introduction to monitoring and why it’s necessary, see this excellent post from Full Stack Python.
In this post, we will focus on Application Performance Monitoring (APM) for a FastAPI application.
I will not talk about monitoring application errors and warnings. For that purpose, check out Sentry: it has great ASGI support and will work out of the box with your FastAPI service.
Profiling is a coding best practice that is not specific to web development. The Python docs on profiling say:
the profilers run code and give you a detailed breakdown of execution times, allowing you to identify bottlenecks in your programs. Auditing events provide visibility into runtime behaviors that would otherwise require intrusive debugging or patching.
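As a minimal sketch of what such a breakdown looks like, the standard library's cProfile and pstats modules can be wrapped around any callable. The functions `slow_sum` and `handler` below are hypothetical stand-ins for real application code, not something from the original service.

```python
import cProfile
import io
import pstats


def slow_sum(n: int) -> int:
    """Deliberately naive loop so that it shows up in the profile."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def handler() -> int:
    """Hypothetical request handler that calls the slow helper."""
    return slow_sum(200_000)


def profile_call(func) -> str:
    """Run func under cProfile and return the top entries by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    func()
    profiler.disable()

    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(5)
    return buffer.getvalue()


if __name__ == "__main__":
    print(profile_call(handler))
```

The report lists each function with its call count and cumulative time, which is usually enough to spot the dominant bottleneck.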
You can of course apply profiling in the context of a FastAPI application, in which case you might find this timing middleware handy.
However, with this approach, the timing data is logged to stdout. You can use it in development to find bottlenecks, but in practice, digging through production logs to get latency information is not very convenient.
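To make the idea of a timing middleware concrete, here is a rough sketch written against the plain ASGI interface using only the standard library. It is not the middleware from FastAPI Utilities; `TimingMiddleware`, `demo_app` and `call_once` are names made up for this illustration.

```python
import asyncio
import time


class TimingMiddleware:
    """Pure-ASGI middleware that records how long each HTTP request takes."""

    def __init__(self, app, record):
        self.app = app
        self.record = record  # callable(path: str, seconds: float)

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            # Pass lifespan and websocket traffic through untouched.
            await self.app(scope, receive, send)
            return
        start = time.perf_counter()
        try:
            await self.app(scope, receive, send)
        finally:
            self.record(scope["path"], time.perf_counter() - start)


async def demo_app(scope, receive, send):
    """Tiny stand-in ASGI app used only for this demonstration."""
    await asyncio.sleep(0.01)  # simulate some work
    await send({"type": "http.response.start", "status": 200, "headers": []})
    await send({"type": "http.response.body", "body": b"ok"})


async def call_once(app, path):
    """Drive one request through an ASGI app without a real server."""
    sent = []

    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}

    async def send(message):
        sent.append(message)

    await app({"type": "http", "method": "GET", "path": path}, receive, send)
    return sent


if __name__ == "__main__":
    wrapped = TimingMiddleware(
        demo_app, lambda path, secs: print(f"{path} took {secs * 1000:.1f} ms")
    )
    asyncio.run(call_once(wrapped, "/health"))
```

With FastAPI, a class shaped like this can usually be registered with `application.add_middleware(TimingMiddleware, record=...)`, and swapping the `record` callable is what would turn stdout logging into a call to a monitoring client.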
As with all things, there are many options. Some are open source, some are SaaS businesses. Most likely you or your organisation are already using one or more monitoring tools, so I’d suggest starting with the one you know. The tools on the list below do more than just APM, which sometimes makes them harder to understand. Example application monitoring tools you might have heard of:
- New Relic (commercial with parts open source)
- Datadog (commercial with parts open source)
- StatsD (open source)
- Prometheus (open source)
- OpenTelemetry (open source)
This list is not exhaustive, but let’s note OpenTelemetry, which is the most recent on this list and is now the de facto standard for application monitoring metrics.
At this point, the choice of tool doesn’t matter. Let’s rather understand what an APM tool does.
It all starts with your application code. Monitoring then happens in 4 steps:
- Step 1: You instrument your service with a library corresponding to your app’s language (in our case Python). This is the monitoring client library. Examples from the references at the end of this post: the New Relic Python agent, Datadog’s dd-trace-py and opentelemetry-python.
- Step 2: The monitoring client library sends each individual call to the monitoring server daemon over the network (UDP in particular, as opposed to TCP or HTTP).
- Step 3: The monitoring server daemon listens for monitoring events coming from the applications. It packs the incoming data into batches and regularly sends them to the monitoring backend.
- Step 4: The monitoring backend usually has 2 parts: a data processing application and a visualisation webapp. It turns the stream of monitoring data into human-readable charts and alerts. Examples mentioned in this post: New Relic, Datadog and Jaeger.
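The client-to-daemon part of this pipeline can be sketched with plain UDP sockets. This is a toy illustration, assuming the StatsD line format (`name:value|ms`) for the payload; `UdpClient`, `BatchingDaemon` and the `forward` callback are hypothetical names, and a real daemon would aggregate and ship data on a timer rather than on every few datagrams.

```python
import socket


def format_timing(name: str, millis: float) -> bytes:
    """Encode a timing in the StatsD line format: <name>:<value>|ms."""
    return f"{name}:{millis:.1f}|ms".encode("ascii")


class UdpClient:
    """Fire-and-forget client: the application never waits for a reply."""

    def __init__(self, host: str, port: int):
        self.addr = (host, port)
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def timing(self, name: str, millis: float) -> None:
        self.sock.sendto(format_timing(name, millis), self.addr)

    def close(self) -> None:
        self.sock.close()


class BatchingDaemon:
    """Receives datagrams and hands them to the backend in batches."""

    def __init__(self, host="127.0.0.1", port=0, batch_size=3, forward=print):
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        self.sock.bind((host, port))  # port 0 lets the OS pick a free port
        self.sock.settimeout(1.0)  # never block forever in this demo
        self.port = self.sock.getsockname()[1]
        self.batch_size = batch_size
        self.forward = forward  # stand-in for the call to the monitoring backend
        self.buffer = []

    def receive_one(self) -> None:
        data, _ = self.sock.recvfrom(1024)
        self.buffer.append(data.decode("ascii"))
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        if self.buffer:
            self.forward(list(self.buffer))
            self.buffer.clear()

    def close(self) -> None:
        self.sock.close()


def demo():
    """Send three timings through the daemon and return the forwarded batches."""
    batches = []
    daemon = BatchingDaemon(batch_size=2, forward=batches.append)
    client = UdpClient("127.0.0.1", daemon.port)
    for millis in (12.5, 40.0, 7.0):
        client.timing("api.request", millis)
        daemon.receive_one()
    daemon.flush()  # ship whatever is left in the buffer
    client.close()
    daemon.close()
    return batches


if __name__ == "__main__":
    print(demo())
```

Because UDP has no handshake or acknowledgement, the instrumented application pays very little for each event, at the price of possibly losing some of them.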
ASGI is a relatively new standard for Python web servers. As with every new standard, it will take some time for all tools in the ecosystem to support it.
Given the 4 steps of monitoring laid out above, a problem arises if the monitoring client library doesn’t support ASGI. For example, this is the case with NewRelic at the moment (see ASGI - Starlette/Fast API Framework · Issue #5 · newrelic/newrelic-python-agent for more details). I looked at Datadog too and saw that ASGI is also not supported at the moment.
On the open source side, however, OpenTelemetry has great support for ASGI. So I set out to instrument my FastAPI service with OpenTelemetry.
Update - Sep 19th, 2020: There seems to be support for ASGI in ddtrace
Update - Sep 22nd, 2020: There is now an API in the NewRelic agent to support ASGI frameworks, with uvicorn already supported and starlette on the way.
Update - Oct 23rd, 2020: The NewRelic Python agent now supports Starlette and FastAPI out of the box.
OpenTelemetry provides a standard for steps 1 (with Instrumentors) and 2 (with Exporters) from the 4 steps above. One of the big advantages of OpenTelemetry is that you can send the events to any monitoring backend (commercial or open source). This is especially awesome because you can use the same instrumentation setup for all your environments (development, staging and production).
Update - May 30th, 2021: GitHub is now adopting OpenTelemetry
Note that depending on the language you use for your microservice, your mileage may vary. For example, there is no NewRelic OpenTelemetry Exporter in Python yet. But there are OpenTelemetry Exporters for many others, see the list here: Registry | OpenTelemetry (filter by language and with type=Exporter).
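The separation between instrumentation and exporters can be illustrated with a small standard-library toy. This is a conceptual sketch only and not the OpenTelemetry API: `Tracer`, `InMemoryExporter`, `JsonLinesExporter` and `build_tracer` are hypothetical names chosen to show that the instrumented code stays the same while the destination changes per environment.

```python
import json
import time
from contextlib import contextmanager


class InMemoryExporter:
    """Keeps spans in a list; handy in development and tests."""

    def __init__(self):
        self.spans = []

    def __call__(self, span: dict) -> None:
        self.spans.append(span)


class JsonLinesExporter:
    """Serialises spans as JSON lines; stands in for a network exporter."""

    def __init__(self, write):
        self.write = write  # any callable accepting a string

    def __call__(self, span: dict) -> None:
        self.write(json.dumps(span, sort_keys=True))


class Tracer:
    """Times a block of code and hands the result to whichever exporter it has."""

    def __init__(self, exporter):
        self.exporter = exporter

    @contextmanager
    def span(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.exporter({"name": name, "duration_s": time.perf_counter() - start})


def build_tracer(environment: str, sink=print) -> Tracer:
    """Pick the exporter from the environment; the instrumentation is unaffected."""
    if environment == "development":
        return Tracer(InMemoryExporter())
    return Tracer(JsonLinesExporter(sink))


def handle_request(tracer: Tracer) -> int:
    """Instrumented code: identical in every environment."""
    with tracer.span("handle_request"):
        return sum(range(1000))


if __name__ == "__main__":
    handle_request(build_tracer("production"))
```

Only `build_tracer` knows about environments, which mirrors the way an exporter is chosen once at start-up while the spans in the application code are left alone.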
One of the available backends is Jaeger: open source, end-to-end distributed tracing. (Note that Jaeger is also a monitoring client library that you can instrument your application with, but here that’s not the part of interest).
Although it’s open source and worked really easily, the issue I had with Jaeger was that it doesn’t have any data pipeline yet. This means that, in the visualisation webapp, you can browse traces but you cannot see any aggregated charts. Such a backend is on their roadmap though.
Still, Jaeger is my go-to tool for monitoring while in development. See the last part for more details.
I couldn’t find any open source monitoring backend with a data pipeline that would provide the features I was looking for (latency percentile plots, bar chart of total requests and errors …).
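To give an idea of what such a data pipeline computes from raw request data, here is a small sketch using only the standard library. The function names are hypothetical, and treating only 5xx status codes as errors is an assumption made for this example.

```python
import statistics
from collections import Counter


def latency_percentiles(latencies_ms):
    """Return the p50, p95 and p99 latency from a list of raw measurements."""
    # quantiles(n=100) returns 99 cut points; index k-1 is the k-th percentile.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


def count_by_status(status_codes):
    """Count successful and failed requests (assumption: only 5xx are errors)."""
    counts = Counter("error" if code >= 500 else "ok" for code in status_codes)
    return {"ok": counts["ok"], "error": counts["error"]}


if __name__ == "__main__":
    print(latency_percentiles([float(ms) for ms in range(1, 102)]))
    print(count_by_status([200, 200, 500, 404, 503]))
```

A backend with a data pipeline runs this kind of aggregation continuously over time windows, which is what makes percentile plots and request or error bar charts possible.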
It became apparent that this is where commercial solutions like NewRelic and Datadog shine. I hence set out to try the OpenTelemetry Datadog exporter.
With this approach, you get a fully featured monitoring backend that will allow you to have full observability for your microservice.
The 2 drawbacks are:
- you need to deploy the Datadog agent yourself (with Docker, on Kubernetes, or on whatever environment fits your stack) and this can get a bit involved
- Datadog being a commercial product, this solution will not be free. You will have to pay extra attention to the pricing of Datadog (especially if you deploy the Datadog agent to Kubernetes 😈).
So what does it look like in code? Here is my application factory. If you have any questions, feel free to reach out on Twitter or open a GitHub issue. I will not share my instrumentation because it is specific to my application, but imagine that you can define any nested spans and that those traces will be sent the same way to Jaeger or to Datadog. This makes it really fast to iterate on your instrumentation code (e.g. add or remove spans), and even faster to find performance bottlenecks in your code.

```python
"""FastAPI Application factory with OpenTelemetry instrumentation
sent to Jaeger in dev and to DataDog in staging and production."""
from fastapi import FastAPI
from opentelemetry import trace
from opentelemetry.exporter.datadog import DatadogExportSpanProcessor, DatadogSpanExporter
from opentelemetry.exporter.jaeger import JaegerSpanExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchExportSpanProcessor

from my_api.config import generate_settings
from my_api.routers import my_router_a, my_router_b


def get_application() -> FastAPI:
    """Application factory.

    Returns:
        ASGI application to be passed to ASGI server like uvicorn or hypercorn.

    Reference:
    - [FastAPI Middlewares](https://fastapi.tiangolo.com/advanced/middleware/)
    """
    # load application settings
    settings = generate_settings()

    if settings.environment != "development":
        # opentelemetry + datadog for staging or production
        trace.set_tracer_provider(TracerProvider())
        datadog_exporter = DatadogSpanExporter(
            agent_url=settings.dd_trace_agent_url,
            service=settings.dd_service,
            env=settings.environment,
            version=settings.dd_version,
            tags=settings.dd_tags,
        )
        trace.get_tracer_provider().add_span_processor(
            DatadogExportSpanProcessor(datadog_exporter)
        )
    else:
        # opentelemetry + jaeger for development
        # requires jaeger running in a container
        trace.set_tracer_provider(TracerProvider())
        jaeger_exporter = JaegerSpanExporter(
            service_name="my-app",
            agent_host_name="localhost",
            agent_port=6831,
        )
        trace.get_tracer_provider().add_span_processor(
            BatchExportSpanProcessor(jaeger_exporter, max_export_batch_size=10)
        )

    application = FastAPI(
        title="My API",
        version="1.0",
        description="Do something awesome, while being monitored.",
    )

    # Add your routers
    application.include_router(my_router_a)
    application.include_router(my_router_b)

    FastAPIInstrumentor.instrument_app(application)

    return application


app = get_application()
```
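Since the actual instrumentation is not shared, here is a purely illustrative sketch of what nested spans capture: each span knows its parent, so the time of a request can be attributed to its sub-steps. It uses only the standard library rather than the OpenTelemetry API, and `span`, `finished` and `predict` are hypothetical names.

```python
import contextvars
import time
from contextlib import contextmanager

# The span currently being executed, if any (None at the top level).
_current_span = contextvars.ContextVar("current_span", default=None)
finished = []  # spans are appended here when they end


@contextmanager
def span(name: str):
    """Time a block and remember which span it was nested in."""
    parent = _current_span.get()
    record = {"name": name, "parent": parent["name"] if parent else None}
    token = _current_span.set(record)
    start = time.perf_counter()
    try:
        yield record
    finally:
        record["duration_s"] = time.perf_counter() - start
        _current_span.reset(token)
        finished.append(record)


def predict(text: str) -> int:
    """Hypothetical endpoint logic broken down into two child spans."""
    with span("predict"):
        with span("tokenize"):
            tokens = text.split()
        with span("score"):
            score = len(tokens)  # placeholder for the real model call
    return score


if __name__ == "__main__":
    predict("monitoring makes bottlenecks visible")
    for record in finished:
        print(record)
```

Child spans finish before their parent, so the parent duration always covers its children, and a tracing UI can render the result as a waterfall.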
I hope that with this post you’ve learned:
- the difference between profiling, monitoring and tracking errors
- the architecture of application monitoring
- some of the application monitoring tools out there
- that OpenTelemetry allows you to reuse the same instrumentation setup for all your environments, which shortens the time it takes to find performance bottlenecks in your application
I’ve used this setup to get a 10x speed-up on a multi-lingual NLP FastAPI service I built at OneFootball.
- StatsD, What It Is and How It Can Help You | Datadog
- Monitoring - Full Stack Python
- ASGI | Sentry Documentation
- Debugging and Profiling — Python 3.9.0 documentation
- Timing Middleware - FastAPI Utilities
- APM | New Relic Documentation
- APM & Distributed Tracing - Datadog
- newrelic/newrelic-python-agent: New Relic Python Agent
- DataDog/dd-trace-py: Datadog Python APM Client
- open-telemetry/opentelemetry-python: OpenTelemetry Python API and SDK
- Registry | OpenTelemetry
- Jaeger: open source, end-to-end distributed tracing
- Getting Started with OpenTelemetry Python — OpenTelemetry Python documentation