FHIR APIs, HL7 feeds, and billing systems look straightforward in testing. Here are the five failure modes that appear only after go-live, and how to design around them.
Every EHR integration project looks manageable in the scoping phase. You have API documentation, a sandbox environment, and a working prototype in two weeks. Then you go live, and the failure modes that didn't exist in testing start appearing one by one. After building data pipelines for multi-site hospital networks, we have seen the same five problems come up repeatedly. None of them are exotic. All of them are avoidable if you design for them upfront.
Epic, Cerner, and most major EHR vendors publish rate limits in their developer documentation. What they don't document is how those limits behave under real load, specifically how they interact with bulk data exports, patient search queries, and concurrent requests from multiple pipeline workers.
In testing, you're pulling a few hundred records from a sandbox. In production, you're backfilling three years of patient encounters across five clinical sites while the nightly batch job also runs. The rate limiter doesn't care about your timeline.
Design for this upfront: build exponential backoff into every API client, implement request queuing with configurable concurrency limits, and test your pipeline under production-representative load before go-live. Budget for rate-limit headroom: if the limit is 100 requests per minute, design your pipeline to run at 60.
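A minimal sketch of the backoff piece, with illustrative defaults (base delay, factor, retry count are assumptions you would tune per vendor). The transport layer is abstracted behind a callable so the retry policy stays testable; a real client would also add jitter and honour any `Retry-After` header the API returns.

```python
import time


class RateLimited(Exception):
    """Raised by the transport layer when the API returns HTTP 429."""


class BackoffPolicy:
    """Exponential backoff with a cap. `attempt` is zero-based."""

    def __init__(self, base=1.0, factor=2.0, max_delay=60.0, max_retries=5):
        self.base, self.factor = base, factor
        self.max_delay, self.max_retries = max_delay, max_retries

    def delay(self, attempt):
        # base * factor^attempt, capped so retries never wait unboundedly
        return min(self.max_delay, self.base * self.factor ** attempt)


def call_with_backoff(fn, policy, sleep=time.sleep):
    """Run fn(); on RateLimited, sleep per the policy and retry."""
    for attempt in range(policy.max_retries):
        try:
            return fn()
        except RateLimited:
            sleep(policy.delay(attempt))
    raise RuntimeError("rate limit retries exhausted")
```

The `sleep` parameter is injected so the retry behaviour can be exercised in tests without real waits; the same hook is where you would add jitter.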
HIPAA requires PHI to be encrypted in transit. Most engineering teams know this and configure TLS on their pipeline connections. What they miss is the intermediate stops: log files that capture request payloads for debugging, message queues that persist records between pipeline stages, and error notification systems that include record details in alert messages.
A common production failure: an error occurs in the pipeline, the on-call engineer gets paged, and the alert message contains patient identifiers in plain text because the error handler was written during development before PHI controls were in scope.
Audit every path that data can take through your system, including error paths. Implement PHI scrubbing at the pipeline boundary, before data touches any logging or alerting infrastructure, not after. Treat your error handling code with the same scrutiny as your main pipeline.
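One way to enforce scrubbing at the logging boundary is a `logging.Filter` attached to every handler, so redaction happens before any record reaches a log file or alerting hook. The patterns below are hypothetical stand-ins; a real deployment would cover every identifier class present in your data (MRNs, SSNs, names, dates of birth, and so on) and would be reviewed as part of your PHI controls, not bolted on afterwards.

```python
import logging
import re

# Hypothetical patterns for illustration only; real coverage must be
# driven by an audit of the identifiers your records actually contain.
PHI_PATTERNS = [
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[REDACTED-SSN]"),
    (re.compile(r"\bMRN[:\s]*\d+\b", re.IGNORECASE), "[REDACTED-MRN]"),
]


def scrub(text):
    """Replace anything matching a PHI pattern before it is emitted."""
    for pattern, replacement in PHI_PATTERNS:
        text = pattern.sub(replacement, text)
    return text


class PHIScrubFilter(logging.Filter):
    """Attach to handlers so scrubbing happens before emit()."""

    def filter(self, record):
        # Format args into the message first, then scrub the whole string,
        # so raw argument values never reach a handler unscrubbed.
        record.msg = scrub(record.getMessage())
        record.args = None
        return True
```

Attaching the filter to handlers (rather than loggers) means even records from third-party libraries pass through the scrubber before they touch disk or a pager.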
EHR vendors release updates. Hospitals upgrade to new versions. Custom fields get added or renamed. In a well-designed pipeline, an upstream schema change causes a controlled failure with a clear error message. In a poorly designed one, it causes silent data corruption: records that appear to load successfully but are missing fields, or worse, have fields mapped to the wrong columns.
We have seen pipelines running for months that appeared healthy but were producing incorrect clinical data because a source field was renamed in an EHR upgrade and the mapping silently failed over to a null value.
Implement schema validation at the ingestion layer, not just at load time. Use schema registries for HL7 and FHIR message formats. Write tests that verify field-level presence and type, and run them as part of every pipeline execution. Alert on schema drift; don't silently ignore it.
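A minimal sketch of ingestion-time validation. The field names and the hand-rolled expected-schema dict are hypothetical; in practice you would validate against a schema registry or FHIR profiles. The point is the failure behaviour: a missing or retyped field raises immediately rather than quietly loading as null.

```python
# Hypothetical expected schema for illustration; a real pipeline would
# pull this from a schema registry or FHIR profile definitions.
EXPECTED = {
    "patient_id": str,
    "encounter_date": str,
    "provider_npi": str,
}


class SchemaDriftError(Exception):
    """Raised so the pipeline fails loudly instead of loading nulls."""


def validate_record(record, expected=EXPECTED):
    """Check field presence and type for one record; raise on drift."""
    missing = [f for f in expected if f not in record]
    wrong_type = [
        f for f, t in expected.items()
        if f in record and not isinstance(record[f], t)
    ]
    if missing or wrong_type:
        raise SchemaDriftError(f"missing={missing} wrong_type={wrong_type}")
    return record
```

Running this check on every record at ingestion means a renamed source field surfaces as an alertable `SchemaDriftError` on the first batch after the EHR upgrade, not as months of silently null clinical data.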
OAuth 2.0 access tokens expire. In testing, your pipeline runs in minutes and the token is still valid when it finishes. In production, a full historical backfill of a large hospital's patient records can run for hours, and the access token expires partway through.
The pipeline stops. Depending on how it's designed, it either fails noisily with an auth error, or it silently stops processing records and marks the job as complete. The latter is the dangerous case.
Implement token refresh logic in your pipeline clients. Don't just fetch a token at job start and assume it will be valid throughout. Build refresh handling into every API call, with proper error differentiation between auth failures (refresh and retry) and other errors (log and fail).
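A sketch of the refresh pattern, assuming a `fetch` callable that exchanges client credentials for `(access_token, expires_in_seconds)` at the vendor's token endpoint (that callable and the `PermissionError` stand-in for an HTTP 401 are assumptions for illustration). The provider refreshes proactively inside a skew window before expiry, and the call wrapper refreshes once more reactively if the API still rejects the token.

```python
import time


class TokenProvider:
    """Caches an OAuth 2.0 access token, refreshing before it expires."""

    def __init__(self, fetch, clock=time.monotonic, skew=60):
        self._fetch, self._clock, self._skew = fetch, clock, skew
        self._token, self._expires_at = None, 0.0

    def invalidate(self):
        self._token = None

    def token(self):
        # Refresh early (within `skew` seconds of expiry) so a request
        # issued now doesn't carry a token that dies mid-flight.
        if self._token is None or self._clock() >= self._expires_at - self._skew:
            self._token, ttl = self._fetch()
            self._expires_at = self._clock() + ttl
        return self._token


def api_call(do_request, provider):
    """Attach a fresh token per call; on an auth failure, refresh and retry once.

    Auth failures are retried after a refresh; any other exception
    propagates so the job fails loudly rather than completing silently.
    """
    try:
        return do_request(provider.token())
    except PermissionError:  # stand-in for an HTTP 401 response
        provider.invalidate()
        return do_request(provider.token())
```

Building the refresh into every call, rather than fetching once at job start, is what keeps a multi-hour backfill alive across token lifetimes.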
Most hospital EHR systems still run on-premise or in private data centers. Your cloud-based data pipeline connects to them over a VPN or dedicated network link. That link has latency, packet loss, and occasional outages that your sandbox connection (usually a direct API call over the public internet) doesn't simulate.
Long-running queries against on-premise databases time out. Large HL7 message batches get partially transmitted. Connection pool exhaustion occurs during peak clinical hours when the network is congested.
Test your pipeline against a network with realistic latency and packet loss characteristics before production deployment. Configure connection timeouts and retry logic with network conditions in mind, not just API response time. Implement checkpointing for long-running jobs so they can resume from where they left off after a network interruption, rather than restarting from scratch.
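A sketch of file-based checkpointing for a paged backfill, under the assumption that pages arrive with monotonically increasing cursors; function names and the checkpoint file format are illustrative, and a production pipeline might keep the cursor in a database instead. The checkpoint is only committed after a page is fully processed, and the write is atomic so a crash mid-write cannot corrupt it.

```python
import json
import os


def load_checkpoint(path):
    """Return the last committed cursor, or None on a fresh run."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)["cursor"]
    return None


def save_checkpoint(path, cursor):
    """Write via a temp file and rename, so the file is never half-written."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"cursor": cursor}, f)
    os.replace(tmp, path)


def run_backfill(pages, process, path="backfill.ckpt"):
    """Process (cursor, page) pairs in order, resuming past committed work."""
    start = load_checkpoint(path)
    for cursor, page in pages:
        if start is not None and cursor <= start:
            continue  # already committed on a previous run
        process(page)
        save_checkpoint(path, cursor)  # commit only after success
```

Because the checkpoint commits after processing, a network drop mid-page means that page is reprocessed on resume; downstream loads should therefore be idempotent (e.g. upserts keyed on record ID).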
In summary
None of these failure modes are surprising in retrospect. They're all the result of testing under conditions that don't reflect production: sanitised sandbox data, fast network connections, short-running jobs, and happy-path error handling. The gap between a working prototype and a production EHR pipeline is mostly about designing for the failure cases you know will happen, before they happen in front of clinical staff.
Talk to us. We will scope an engagement before any work begins.