How to weather the storm when disaster hits home
With hurricane season approaching and myriad other potential threats looming, health systems and hospitals need to be prepared for the worst.
George Carion, chief technology officer at Cedars-Sinai Health System, shares strategies in this Q&A. Cedars-Sinai was named a top 10 CHIME HealthCare’s Most Wired recipient in 2018 and received a special shout-out for its disaster recovery program.
What makes a strong disaster recovery program?
It is not just one thing; there are lots of things. The foundational piece is ensuring your technology infrastructure is resilient. A hospital can spend a lot of time and effort creating an application disaster recovery (DR) plan when nine times out of 10, it’s infrastructure technology that gets you. You’ll find yourself falling back to a disaster plan because of a hardware problem, but if you don’t have solid infrastructure, your disaster plan will probably fail. Bad infrastructure architecture can defeat the best DR plans.
Lay the groundwork to provide reliable site diversity. Don’t forget services like DNS, DHCP, your directory services and VMware. While not hardware, those things are part of your infrastructure, too.
Next, divide and conquer. Separate your applications into criticality tiers. Find your highest tier applications first—call them Tier 0 or Tier 1. Have discussions with application stakeholders and business units to better understand the criticality, and ensure you’re ranking them appropriately. They’ll be your EHR, PACS, lab systems, your interface engine, your supply chain system and so on. Then, you start working on plans for those; forget about everything else in less critical tiers.
How do you determine highest priorities?
The important work is categorizing your applications into those tiers based on a recovery time objective. For us, it’s Tier 1 (must be available within two hours), Tier 2 (under 24 hours), Tier 3 (under 72 hours) and Tier 4 (more than 72 hours). Depending on the type of disaster, recovering Tier 3 and Tier 4 apps becomes progressively more optional—it just may not be doable. Have conversations with stakeholders about these RTOs. Does the app have to be up without fail in under two hours, or could it be down a little longer?
It is not about whether or not your application is important; it is about the operational impact when it’s down, and the effectiveness of manual downtime procedures.
It is pretty easy to agree on those tiers when you play out a potential, unplanned downtime with your customers. What happens to clinical care, what happens to the hospital, what happens to payroll? When we started our DR program, we had a lot of Tier 1 apps. We learned that many of our clinical apps can be down for longer than two hours. While those apps are still very important, staff can function without direct access to Tier 2 apps for longer than a couple of hours.
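As an illustration of the tiering described above, here is a minimal Python sketch that maps an agreed recovery time objective to a tier. The thresholds come from the interview; the function name, the inclusive handling of the exact boundaries, and the sample applications with their RTOs are assumptions made for the example, not Cedars-Sinai's actual tooling or inventory.

```python
from datetime import timedelta

# Tier ceilings taken from the interview: Tier 1 within 2 hours,
# Tier 2 under 24 hours, Tier 3 under 72 hours, Tier 4 beyond that.
# Treating each ceiling as inclusive is an assumption of this sketch.
TIER_LIMITS = [
    (1, timedelta(hours=2)),
    (2, timedelta(hours=24)),
    (3, timedelta(hours=72)),
]


def tier_for_rto(rto: timedelta) -> int:
    """Return the criticality tier (1-4) for an agreed recovery time objective."""
    if rto <= timedelta(0):
        raise ValueError("RTO must be a positive duration")
    for tier, limit in TIER_LIMITS:
        if rto <= limit:
            return tier
    return 4  # anything slower than 72 hours


# Hypothetical applications and RTOs, for illustration only.
agreed_rtos = {
    "example-ehr": timedelta(hours=1),
    "example-scheduling": timedelta(hours=12),
    "example-reporting": timedelta(hours=96),
}

for app, rto in agreed_rtos.items():
    print(f"{app}: Tier {tier_for_rto(rto)}")
```

Keeping the RTO as the only input reflects the point made above: the tier follows from how long operations can tolerate the outage, not from how important the application feels.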
How do you test this?
In an environment where we have some 750-odd applications, we have 60 Tier 1s. We build plans for all of those Tier 1s in bundles of five. Each bundle is overseen by a project manager, someone assigned to interact with IT staff, assemble and review architectures, and help create the recovery runbooks: essentially, everything needed to fail an application over to another site or another set of hardware systems. If you are missing infrastructure or some other capability to support a recovery plan, create a proposal that will help your leaders and other executive stakeholders understand the costs and risks involved. It’s ultimately a risk vs. expense conversation.
At Cedars-Sinai, we are fortunate to have more than one data center, and our secondary data center is outfitted with enough hardware to support our Tier 1 applications. We run regular recovery drills by failing applications over to our secondary data center. In our drills, at a high level, we’re testing for two things. Using our plan, can we fail over the application to the secondary data center? And, can we perform the recovery work within the recovery time objective?
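The two drill questions above can be recorded as a simple pass/fail check. The following Python sketch is one possible way to do that; the `DrillResult` record, its field names, and the sample drill are hypothetical and are not drawn from Cedars-Sinai's runbooks.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class DrillResult:
    """Outcome of one recovery drill (hypothetical record layout)."""
    app: str
    failed_over: bool         # did the app come up at the secondary site using the plan?
    recovery_time: timedelta  # elapsed time from drill start to a working app
    rto: timedelta            # agreed recovery time objective for the app


def drill_passed(result: DrillResult) -> bool:
    """A drill passes only if failover worked AND it finished within the RTO."""
    return result.failed_over and result.recovery_time <= result.rto


# Hypothetical drill: failover succeeded but took longer than the 2-hour RTO.
slow_drill = DrillResult(
    app="example-lab-system",
    failed_over=True,
    recovery_time=timedelta(hours=3),
    rto=timedelta(hours=2),
)
print(slow_drill.app, "passed" if drill_passed(slow_drill) else "needs follow-up")
```

Tracking both conditions separately makes it clear whether a failed drill points to a gap in the plan itself or to a plan that works but is too slow.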
Are these lessons applicable to someone like a rural hospital?
The big constraint a smaller organization will face is likely financial. You don’t need a large DR team; we don’t have one. We’ve organized this work around our IT managers’ operational responsibilities. People-wise, I think most organizations should be able to do a decent job in this area if they can prioritize appropriately.
Financial resources, however, are another matter. We are fortunate to have a second data center a state away from our primary. If you don’t have one, then there are some cost-effective cloud models you could look at, especially for DR workloads that sit idle when not being used.
Some CIOs talk about the challenges of getting their boards to invest in intangibles like DR and cybersecurity. What are your thoughts?
I don’t think we should look at cybersecurity and disaster recovery as two separate, independent things. They are very similar, in that they both require solid, practical response and recovery plans. A cyber issue can be your disaster. Do the planning and preventative work so you will create opportunities to weather something bad. You have to approach it with similar tools, architectures and program management.
These days cyber is on most people’s radar. Should that be a lever for getting buy-in?
My advice is to not scare your boss or your board about cybersecurity. They are probably already scared. If you put the “breach” aside, cybersecurity is not scary. What is scary is having really important systems go unavailable. Get buy-in through describing cyber and DR as data protection and availability programs: This is building resiliency into critical systems and services that are necessary to operate a health system or hospital. Cyber just has that extra little problem of privacy.
This is a tough challenge for a CTO, especially if their IT organization is perpetually underfunded. IT leaders will need to steer the conversation toward risk vs. cost. It becomes a business math problem. “We should spend X dollars to reduce the likelihood of an extended outage to our EHR.” It creates an opportunity cost problem. But for most of us, doing nothing isn’t an option.
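One rough way to frame the "business math problem" above is to compare the expected yearly cost of an extended outage before and after a proposed investment. The Python sketch below assumes a simple model (probability times duration times hourly cost); the model, the function names, and every number in it are hypothetical placeholders, not figures from the interview.

```python
def expected_annual_loss(outage_probability: float,
                         outage_hours: float,
                         cost_per_hour: float) -> float:
    """Expected yearly loss from an extended outage under a simple model."""
    return outage_probability * outage_hours * cost_per_hour


def net_benefit(loss_before: float, loss_after: float, annual_spend: float) -> float:
    """Risk reduction bought by the investment, minus what it costs per year."""
    return (loss_before - loss_after) - annual_spend


# Hypothetical inputs: replace with your organization's own estimates.
before = expected_annual_loss(outage_probability=0.25, outage_hours=24, cost_per_hour=10_000)
after = expected_annual_loss(outage_probability=0.125, outage_hours=24, cost_per_hour=10_000)
spend = 20_000

print(f"Expected loss before: {before:,.0f}")
print(f"Expected loss after:  {after:,.0f}")
print(f"Net benefit of spend: {net_benefit(before, after, spend):,.0f}")
```

A calculation like this does not settle the decision, but it gives leaders a shared starting point for the risk vs. cost conversation.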