Replicating OpenAI’s Chat Completions API with Python FastAPI
In the realm of Gen AI, OpenAI stands tall as a pioneer, offering the widely acclaimed GPT-4 model and an accessible API for developers. However, the market’s reliance on OpenAI has led to a need for alternative solutions. Whether due to cost considerations, data privacy concerns, or the desire to work with open-source models, developers are seeking ways to integrate various LLMs into their projects.
To address this demand, I embarked on a weekend project to create a Python FastAPI server that mirrors OpenAI’s Chat Completions API. By doing so, any LLM, whether managed like Anthropic’s Claude or self-hosted, can seamlessly interact with tools designed for the OpenAI ecosystem.
Building an OpenAI-Compatible API
The first step in this endeavor was to model a mock API that emulates the functionality of OpenAI’s Chat Completions API. Using Python and FastAPI, I crafted a simple yet robust solution that could be easily adapted to other programming languages like TypeScript or Go.
The core of the implementation revolves around defining a request model that aligns with OpenAI’s specification. The ChatCompletionRequest model encapsulates the essential parameters: the LLM model to use, the chat messages exchanged so far, the maximum number of tokens to generate, the sampling temperature, and whether the response should be streamed.
from typing import List, Optional

from pydantic import BaseModel


class ChatMessage(BaseModel):
    role: str      # "system", "user", or "assistant"
    content: str   # the message text


class ChatCompletionRequest(BaseModel):
    # Mirrors the core fields of OpenAI's Chat Completions request body.
    model: str = "mock-gpt-model"
    messages: List[ChatMessage]
    max_tokens: Optional[int] = 512
    temperature: Optional[float] = 0.1
    stream: Optional[bool] = False
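With the request model in place, the server also needs an endpoint that accepts it and answers in the shape the OpenAI client expects. The following is a minimal non-streaming sketch rather than the full implementation: the /chat/completions path matches what the OpenAI client calls by default, and the echo-style reply is a stand-in for whatever LLM you eventually wire in.

import time

from fastapi import FastAPI

app = FastAPI(title="OpenAI-compatible API")


@app.post("/chat/completions")
async def chat_completions(request: ChatCompletionRequest):
    # A real server would call an LLM here; the mock just echoes the last message.
    if request.messages:
        resp_content = "As a mock assistant, I can only echo your last message: " + request.messages[-1].content
    else:
        resp_content = "As a mock assistant, I can only echo your last message, but there wasn't one."

    # Shape the reply like an OpenAI chat.completion object so client libraries can parse it.
    return {
        "id": "mock-1337",
        "object": "chat.completion",
        "created": int(time.time()),
        "model": request.model,
        "choices": [
            {
                "index": 0,
                "message": {"role": "assistant", "content": resp_content},
                "finish_reason": "stop",
            }
        ],
    }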
Testing the Implementation
After setting up the server and defining the request model, the next phase was testing. Using the official Python OpenAI client library, pointed at the local server instead of api.openai.com, I verified that requests round-trip correctly: the client serializes the request, the mock server answers in the expected schema, and the client parses the response without complaint.
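As a rough illustration of that test, assuming the app above is running locally on port 8000 (for example via uvicorn main:app --port 8000), the client only needs its base_url overridden; the api_key value is arbitrary because the mock server never checks it.

from openai import OpenAI

# Point the official client at the local mock server instead of api.openai.com.
client = OpenAI(base_url="http://localhost:8000", api_key="not-needed")

chat_completion = client.chat.completions.create(
    model="mock-gpt-model",
    messages=[{"role": "user", "content": "Hello, mock server!"}],
)
print(chat_completion.choices[0].message.content)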
Enhancing Streaming Support
Recognizing that LLM generation can take many seconds, I extended the server’s capabilities to support streaming responses, so users receive generated content incrementally instead of waiting for the full completion. By returning a StreamingResponse when the client sets stream=True, the server can serve clients that expect real-time, token-by-token delivery.
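A minimal sketch of how that branch might look, reusing the hypothetical resp_content and request variables from the handler above: the generator yields Server-Sent Events, each carrying a chat.completion.chunk object, and finishes with the [DONE] sentinel the OpenAI client watches for. Splitting the reply into words simply simulates incremental generation.

import asyncio
import json
import time

from fastapi.responses import StreamingResponse


async def _stream_words(text: str, model: str):
    # Emit one chat.completion.chunk per word, formatted as a Server-Sent Event.
    for i, word in enumerate(text.split()):
        chunk = {
            "id": f"mock-chunk-{i}",
            "object": "chat.completion.chunk",
            "created": int(time.time()),
            "model": model,
            "choices": [{"index": 0, "delta": {"content": word + " "}}],
        }
        yield f"data: {json.dumps(chunk)}\n\n"
        await asyncio.sleep(0.1)  # simulate generation latency
    yield "data: [DONE]\n\n"


# Inside the /chat/completions handler, before building the non-streaming reply:
# if request.stream:
#     return StreamingResponse(_stream_words(resp_content, request.model),
#                              media_type="text/event-stream")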
Conclusion
In a landscape marked by diverse LLM providers and varying API structures, standardization remains a challenge for developers. By abstracting LLMs behind an established API surface like OpenAI’s, we can streamline integration efforts and foster interoperability across platforms.
For the full code implementation and further insights, refer to the GitHub Gist.