Asynchronous Inference Mode is a near-real-time inference option that queues incoming requests and processes them asynchronously. It suits workloads with large payloads or models with long inference times that do not require sub-second latency. By default, the predict method operates in asynchronous mode and polls the status endpoint until the result is ready, which makes it well suited to batch processing and other tasks where an immediate response is not critical.
# Queue the request; the client polls the status endpoint until the result is ready
api_response = vps_model_client.predict(
    model_id=model_id, input_data="Test input", async_mode=True
)
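For illustration, the polling that async_mode=True performs can be sketched as an explicit loop. This is a hypothetical reconstruction, not the client's actual implementation; submit_prediction, get_prediction_status, and get_prediction_result are assumed names standing in for whatever submission and status APIs the client exposes.

import time

# Hypothetical sketch of the polling loop behind async_mode=True.
# submit_prediction / get_prediction_status / get_prediction_result are
# illustrative names, not documented vps_model_client methods.
def predict_async(client, model_id, input_data, poll_interval=2.0, timeout=600.0):
    request_id = client.submit_prediction(model_id=model_id, input_data=input_data)
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = client.get_prediction_status(request_id)
        if status == "COMPLETED":
            return client.get_prediction_result(request_id)
        if status == "FAILED":
            raise RuntimeError(f"Inference request {request_id} failed")
        time.sleep(poll_interval)  # wait before checking the status endpoint again
    raise TimeoutError(f"Inference request {request_id} did not complete within {timeout} seconds")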
Real-Time Inference Mode is designed for use cases that require immediate predictions. In this mode, the predict method processes the request synchronously and returns the result without polling the status endpoint. It is ideal for interactive applications where users expect immediate feedback, provided the caller can handle potential timeouts for long-running inferences.
Note
In Real-Time Inference mode, requests time out after 29 seconds; if inference takes longer, the caller receives a 504 Gateway Timeout error.
# Process the request immediately and return the result without polling
api_response = vps_model_client.predict(
    model_id=model_id, input_data="Test input", async_mode=False
)
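Since long-running inferences exceed the 29-second limit, one practical pattern is to attempt a real-time call and fall back to asynchronous mode on a timeout. The sketch below is a suggestion, not part of the client API; the exact exception raised on a 504 depends on the client implementation, so the except clause here is an assumption to adapt.

def predict_with_fallback(client, model_id, input_data):
    """Try real-time inference first; fall back to async mode on a 504 timeout.

    Assumes the client raises an exception carrying the HTTP status on failure;
    adjust the except clause to the client's actual error type.
    """
    try:
        return client.predict(model_id=model_id, input_data=input_data,
                              async_mode=False)
    except Exception as err:  # e.g. an HTTP error wrapping a 504 Gateway Timeout
        if getattr(err, "status_code", None) == 504 or "504" in str(err):
            # Inference exceeded the 29-second real-time limit; retry in
            # asynchronous mode, which polls until the result is ready.
            return client.predict(model_id=model_id, input_data=input_data,
                                  async_mode=True)
        raise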