Pamela Fox shares actionable best practices for building LLM-powered apps on Azure, emphasizing asynchronous Python frameworks to ensure speed and reliability when integrating Azure OpenAI Service and Azure AI Search.

Concurrency Best Practices for LLM-Powered Apps with Azure OpenAI and Python

Author: Pamela Fox

Introduction

Through her work on the Python advocacy team at Microsoft, Pamela Fox has maintained several open-source AI sample applications, most notably the RAG chat demo. She discusses critical lessons in making LLM-powered apps both fast and reliable—chief among them: using an asynchronous backend framework.

Why Concurrency Matters for LLM Apps

  • LLM applications (e.g., apps powered by Azure OpenAI Service) frequently handle multiple API calls and database queries concurrently.
  • Synchronous backend frameworks like Flask can block worker threads during long-running requests (e.g., to OpenAI APIs), reducing throughput and wasting resources.
  • Asynchronous frameworks let the Python event loop suspend coroutines that are waiting on I/O and switch to other requests, so workers stay busy instead of sitting idle.

Example

Running a Flask app on Gunicorn ties each worker up for the full duration of a slow Azure OpenAI call, so requests queue up behind it and are processed serially. Switching to an async framework lets a worker interleave requests, picking up new work while earlier requests wait on I/O.
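
To make the difference concrete, here is a minimal sketch (not from the original post) that simulates two slow LLM calls with asyncio; the two-second sleep is a stand-in for a real Azure OpenAI request.

import asyncio
import time

async def fake_llm_call(prompt: str) -> str:
    # Stand-in for a slow network call to Azure OpenAI.
    await asyncio.sleep(2)
    return f"response to {prompt!r}"

async def main():
    start = time.perf_counter()
    # Both coroutines wait on I/O at the same time, so the elapsed time
    # is roughly 2 seconds rather than 4.
    results = await asyncio.gather(
        fake_llm_call("question 1"),
        fake_llm_call("question 2"),
    )
    print(results, f"elapsed: {time.perf_counter() - start:.1f}s")

asyncio.run(main())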

Diagrams

  • Synchronous: Each worker handles one request at a time, waiting during API calls.
  • Asynchronous: Workers can process other requests while waiting for I/O.

Asynchronous Python Backends

Several Python frameworks support asynchronous patterns:

  • Quart: An async version of Flask.
  • FastAPI: Async-first, built on Starlette.
  • Litestar: Batteries-included async framework.
  • Django: Offers async view support.

Pamela describes the decision-making process for choosing among these frameworks and shares detailed considerations in a separate blog post.

Porting Flask to Quart: Asynchronous Handlers

Converting from Flask to Quart largely means defining handlers with async def and awaiting the coroutines they call, so the event loop can process other requests while a handler waits on I/O.

async def chat_handler():
    request_message = (await request.get_json())["message"]
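
For context, a complete minimal Quart handler might look like the sketch below; the route path, app name, and JSON shape are illustrative assumptions rather than code from the original samples.

from quart import Quart, request

app = Quart(__name__)

@app.post("/chat")  # hypothetical route; the real samples define their own
async def chat_handler():
    # In Quart, request.get_json() is a coroutine and must be awaited.
    request_message = (await request.get_json())["message"]
    return {"reply": f"You said: {request_message}"}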

For deployment:

  • Continue using Gunicorn for production, but run it with the Uvicorn worker (ASGI compatible).
  • Alternatively, run Uvicorn or Hypercorn directly.
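
As a sketch of the second option, a Quart or FastAPI app can be served directly with Uvicorn, since both are ASGI applications; the host and port below are arbitrary choices.

# Added at the bottom of the module that defines `app` above.
# Roughly equivalent in production to:
#   gunicorn -k uvicorn.workers.UvicornWorker app:app
if __name__ == "__main__":
    import uvicorn

    uvicorn.run(app, host="0.0.0.0", port=8000)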

Asynchronous API Calls with Azure OpenAI

To fully benefit from an async framework, the API calls themselves must also be asynchronous. With the OpenAI Python SDK, that means using the async client (shown here with keyless Microsoft Entra ID authentication via a token provider):

import os

import openai
from azure.identity.aio import DefaultAzureCredential, get_bearer_token_provider

# Fetches Microsoft Entra ID tokens on demand (keyless authentication)
token_provider = get_bearer_token_provider(
    DefaultAzureCredential(), "https://cognitiveservices.azure.com/.default"
)

openai_client = openai.AsyncOpenAI(
    base_url=os.environ["AZURE_OPENAI_ENDPOINT"] + "/openai/v1",
    api_key=token_provider,
)

chat_stream = await openai_client.chat.completions.create(
    model=os.environ["AZURE_OPENAI_CHAT_DEPLOYMENT"],  # deployment name
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": request_message},
    ],
    stream=True,
)
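
Because stream=True was requested, awaiting the call returns an async iterator of chunks rather than a single response. A minimal sketch of consuming it (guarding against the empty first chunk that Azure content filtering can produce):

async for chunk in chat_stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)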

For Azure services like AI Search, use their async clients:

from azure.identity.aio import DefaultAzureCredential
from azure.search.documents.aio import SearchClient

r = await self.search_client.search(query_text)
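
Putting those imports together, a minimal sketch of constructing the async client and iterating over results might look like this; the endpoint variable, index name, and document field are placeholders:

import os

search_client = SearchClient(
    endpoint=os.environ["AZURE_SEARCH_ENDPOINT"],
    index_name="my-index",  # placeholder index name
    credential=DefaultAzureCredential(),
)

results = await search_client.search(query_text)
async for doc in results:  # results are paged and iterated asynchronously
    print(doc["content"])  # placeholder field name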

With all outbound calls as asynchronous coroutines, your app can efficiently handle multiple user sessions without idle worker time.
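
Tying the pieces together, a simplified (non-streaming) handler might await the search call, fold the retrieved content into the prompt, and then await the chat call, all without blocking other requests. This sketch reuses the app, search_client, and openai_client defined above and is not the samples' actual implementation:

@app.post("/ask")  # hypothetical route
async def ask_handler():
    question = (await request.get_json())["message"]

    # Retrieve supporting documents, then ground the chat request on them.
    results = await search_client.search(question)
    sources = "\n".join([doc["content"] async for doc in results])

    response = await openai_client.chat.completions.create(
        model=os.environ["AZURE_OPENAI_CHAT_DEPLOYMENT"],
        messages=[
            {"role": "system", "content": f"Answer using only these sources:\n{sources}"},
            {"role": "user", "content": question},
        ],
    )
    return {"answer": response.choices[0].message.content}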

Real-World Sample Applications

Here’s a list of open-source, async-enabled sample applications suitable for various tech stacks:

Repository                    | App Purpose                  | Backend          | Frontend
azure-search-openai-demo      | RAG with AI Search           | Python + Quart   | React
rag-postgres-openai-python    | RAG with PostgreSQL          | Python + FastAPI | React
openai-chat-app-quickstart    | Simple chat (OpenAI models)  | Python + Quart   | JS
openai-chat-backend-fastapi   | Simple chat (OpenAI models)  | Python + FastAPI | JS
deepseek-python               | Chat (AI Foundry models)     | Python + Quart   | JS

Conclusion

To build robust, responsive LLM-powered apps with Azure’s AI services, always leverage asynchronous backend frameworks and async API clients. See the above samples for production-grade architectures.

This post appeared first on “Microsoft Tech Community”.