A healthcare analytics company received notice on a Tuesday afternoon that their primary AI infrastructure vendor was filing for Chapter 7 bankruptcy. The platform hosted their patient risk stratification models, their clinical trial matching service, and their population health analytics pipeline. The vendor’s service would shut down in ninety days. The company had ninety days to migrate every workload or lose it.
The company’s engineering team had six people. The platform hosted fourteen production models, eight data pipelines, and three customer-facing APIs. The vendor had been selected three years earlier because it offered a managed ML platform that abstracted away infrastructure complexity. The team had never provisioned their own GPU instances, managed their own container orchestration, or configured their own model serving infrastructure. The abstraction that made the vendor attractive now made the migration harder, because the team had no muscle memory for the infrastructure layer.
The lock-in anatomy
The vendor’s platform had three layers of lock-in. The first was infrastructure lock-in. Models were trained and served using the vendor’s proprietary orchestration layer. The team submitted training jobs through the vendor’s SDK. Model artifacts were stored in the vendor’s registry. Inference endpoints were exposed through the vendor’s API. None of these interfaces were portable. The team could export model weights, but the serving configuration — batch sizes, timeout settings, autoscaling rules — was encoded in the vendor’s proprietary format.
The second was data pipeline lock-in. The ETL pipelines were built using the vendor’s visual pipeline designer. The pipeline definitions were stored as JSON documents that described the vendor’s specific operator vocabulary. The operators were wrappers around standard transformations — read from S3, filter rows, join tables — but the wrapping was proprietary. Exporting the pipeline definitions did not produce runnable code. It produced a proprietary configuration that only the vendor’s runtime could execute.
The third was integration lock-in. Customer-facing APIs were routed through the vendor’s API gateway, which handled authentication, rate limiting, and request logging. The company’s API keys, customer configurations, and usage tracking were managed in the vendor’s system. Migrating the APIs meant rebuilding the integration layer from scratch.
This diagram requires JavaScript.
Enable JavaScript in your browser to use this feature.
The migration plan
We designed a ninety-day migration with three parallel workstreams: model serving, data pipelines, and API gateway. Each workstream had a dedicated engineer from the company’s team and a consultant. The total migration team was twelve people.
The model serving workstream had the most time pressure. The fourteen production models needed to be re-deployed on a portable serving infrastructure. We selected a Kubernetes-based serving stack using open-source tooling: KServe for model serving, MLflow for model registry, and Prometheus for monitoring. The critical step was extracting model weights and converting the serving configurations from the vendor’s proprietary format to the open-source format. This took three weeks for the fourteen models, with the most complex model requiring four days of configuration mapping.
The data pipeline workstream had the highest volume of work. The eight pipelines, described in the vendor’s proprietary JSON format, had to be reconstructed as code. We chose to rebuild them as Python scripts orchestrated by Airflow, because Python was the most readable option for the company’s team and Airflow was operationally simple. The team reverse-engineered each pipeline’s logic from the JSON definitions, wrote equivalent Python code, and validated output equivalence against the vendor’s platform. This process took five weeks.
The API gateway workstream had the most customer impact. Three customer-facing APIs needed to maintain their existing endpoints, authentication mechanisms, and response formats during the migration. We deployed a proxy layer that routed requests to either the vendor’s gateway or the new gateway based on a feature flag. This allowed a gradual cutover with the ability to roll back instantly if the new gateway produced errors.
The timeline
Days 1 through 14: assessment and planning. Inventory every model, pipeline, and API. Map each to its dependencies. Identify the migration order based on business criticality and technical complexity.
Days 15 through 45: parallel migration. The three workstreams operated simultaneously. Weekly integration tests verified that models served through the new infrastructure produced outputs matching the vendor’s platform. Pipeline outputs were compared row-by-row against the vendor’s outputs.
Days 46 through 60: integration testing. The three workstreams were connected end-to-end. Customer-facing APIs were pointed at the new model serving infrastructure. Pipelines fed data to the new model registry. End-to-end latency and accuracy were measured against the vendor’s baseline.
Days 61 through 75: staged cutover. One customer-facing API was migrated per week, starting with the lowest-traffic endpoint. Each cutover was monitored for forty-eight hours before proceeding to the next.
Days 76 through 90: cleanup and hardening. The vendor’s platform was used as a read-only fallback while the new infrastructure was hardened. On day 89, the final API was cut over. On day 90, the vendor’s platform went offline.
What we gave up
The vendor’s platform had automated model retraining that the new infrastructure did not replicate during the migration. Model retraining was paused for the duration of the migration. The team accepted model staleness as a cost of the accelerated timeline. Retraining was restored six weeks after the migration completed.
The vendor’s platform had a visual pipeline designer that non-engineers could use. The new Airflow-based pipelines required Python skills that not everyone on the team had. Two business analysts who had maintained pipelines through the vendor’s visual tool could no longer do so. The team hired an additional data engineer to absorb this work.
The vendor’s platform provided managed infrastructure with automatic patching and scaling. The new Kubernetes-based infrastructure required the team to manage upgrades, security patches, and capacity planning. Ongoing operational overhead increased by roughly twenty hours per month.
The lasting change
After the migration, the company adopted a vendor portability policy. Every new vendor engagement required that all model artifacts, pipeline definitions, and integration configurations be exportable in a non-proprietary format. Vendors that stored customer assets in proprietary formats were disqualified regardless of feature advantages.
The policy cost the company access to some best-of-breed platforms that used proprietary abstractions. It also meant that the team maintained more infrastructure knowledge in-house rather than outsourcing it to a managed platform. The team accepted this trade-off. Having lived through a ninety-day emergency migration, they preferred operational overhead to existential vendor risk.
The decision heuristic
If your AI infrastructure is hosted on a proprietary platform, audit your portability today. Export a model. Export a pipeline definition. Export an API configuration. If any of these exports cannot run without the vendor’s runtime, you have lock-in. Quantify the lock-in in weeks: how many weeks would it take to rebuild these assets on open infrastructure? If the answer is more than twelve weeks, you are one vendor bankruptcy away from a crisis. Fix the portability gap before you need it, because the day you need it, you will not have time to do it carefully.