Asynchronous and Real-Time Inference Modes

Asynchronous Inference Mode

Asynchronous Inference Mode is a near-real-time option that queues incoming requests and processes them in the background. It is suited to large payloads that arrive as they are generated, or to models with long inference times that do not require sub-second latency. By default, the predict method operates in asynchronous mode: it submits the request, then polls the status endpoint until the result is ready before returning. This makes it ideal for batch processing or tasks where an immediate response is not critical.

Asynchronous Inference Mode Usage

# Queue the request; predict polls the status endpoint until the result is ready
api_response = vps_model_client.predict(model_id=model_id, input_data="Test input", async_mode=True)
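
For illustration, here is a minimal end-to-end sketch. The import, client constructor, and model identifier are hypothetical placeholders (the document does not show how the client is created); substitute your SDK's actual setup. Only the predict call and its parameters come from the examples above.

from vps_sdk import VpsModelClient  # hypothetical import; use your SDK's actual module

# Hypothetical client setup; replace with your SDK's real constructor and arguments.
vps_model_client = VpsModelClient(api_key="YOUR_API_KEY")
model_id = "my-model-id"  # placeholder model identifier

# predict() queues the request, then polls the status endpoint until the
# result is ready, so the call blocks but tolerates long inference times.
api_response = vps_model_client.predict(
    model_id=model_id,
    input_data="Test input",
    async_mode=True,
)
print(api_response)  # inspect the completed inference result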

Real-Time Inference Mode

Real-Time Inference Mode is designed for use cases that require immediate predictions. In this mode, the predict method processes the request synchronously and returns the result without polling the status endpoint. It is ideal for interactive applications where users expect immediate feedback, provided the caller can handle potential timeouts for long-running inferences.

Note

In Real-Time Inference Mode, the request timeout is 29 seconds. If inference takes longer, the caller receives a 504 Gateway Timeout error.

Real-Time Inference Mode Usage

# Process the request synchronously; fails with a 504 if inference exceeds 29 seconds
api_response = vps_model_client.predict(model_id=model_id, input_data="Test input", async_mode=False)
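
Because real-time calls can hit the 29-second limit, it is worth guarding against timeouts. The sketch below assumes the client raises an exception on HTTP errors; the exact exception type and the fallback strategy (retrying in asynchronous mode) are illustrative assumptions, not documented SDK behavior.

try:
    # Synchronous call: returns the result immediately, with no status polling.
    api_response = vps_model_client.predict(
        model_id=model_id,
        input_data="Test input",
        async_mode=False,
    )
except Exception as exc:  # use the SDK's specific timeout/HTTP error type if it defines one
    # A 504 Gateway Timeout here means inference exceeded the 29-second limit.
    # One recovery strategy: retry the same request in asynchronous mode.
    print(f"Real-time inference failed ({exc}); retrying asynchronously")
    api_response = vps_model_client.predict(
        model_id=model_id,
        input_data="Test input",
        async_mode=True,
    )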