OpenAI Says Deployment of Telemetry Service Caused 3-Hour Outage
OpenAI said its deployment of a new telemetry service caused a three-hour outage of all its services Wednesday (Dec. 11).
The company’s ChatGPT, API and Sora services were degraded or unavailable from 3:16 p.m. to 7:38 p.m. Pacific time, OpenAI said in an incident report.
The incident was caused by the new telemetry service overwhelming the Kubernetes control plane and producing cascading failures across the company’s critical systems, according to the report.
“This event was the result of an internal change to roll out new telemetry across our fleet and was not caused by a security incident or a recent launch,” the company said in the report.
OpenAI deployed the new telemetry service to improve reliability, as the service would collect detailed Kubernetes control plane metrics and boost the company’s visibility into the state of its systems, per the report.
Four minutes after the telemetry service was deployed, the outage occurred as it caused the execution of resource-intensive Kubernetes API operations that overwhelmed the Kubernetes API servers and took down the Kubernetes control plane in most of OpenAI’s large clusters, according to the report.
OpenAI detected and identified the issue within minutes and began fixing it, per the report.
The company is implementing and prioritizing several measures to prevent similar incidents, including improved phased rollouts with better monitoring for infrastructure changes, according to the report.
“We apologize for the impact that this incident caused to all of our customers — from ChatGPT users to developers to businesses who rely on OpenAI products,” the report said. “We’ve fallen short of our own expectations.”
OpenAI experienced a three-hour outage of all ChatGPT-related services in June and a short but “major” outage of ChatGPT two days after its high-profile announcement of a new store in November 2023.
The company reported Dec. 4 that ChatGPT now has 300 million weekly active users, that 1 billion user messages are sent on the artificial intelligence chatbot every day and that 1.3 million developers have built on OpenAI in the United States.
It was reported Nov. 30 that OpenAI aims to get to 1 billion users over the next year.
For all PYMNTS AI coverage, subscribe to the daily AI Newsletter.
The post OpenAI Says Deployment of Telemetry Service Caused 3-Hour Outage appeared first on PYMNTS.com.
