A big data effort to fight Zika and other infectious diseases
The Zika virus is sending a chill down the collective spine of healthcare providers and government agencies. So far, Brazil has confirmed nearly 3,000 cases of pregnant women infected with the virus, and the disease is spreading through the Americas.
Kamran Khan says there’s a singular truth about the spread of infectious diseases: “If you start to analyze the situation when an outbreak occurs, you’re already too late.” Khan, an infectious disease physician and scientist at Toronto-based St. Michael’s Hospital, has spent his career combating the likes of Zika, Ebola, Lassa Fever and other lethal and long-simmering infectious diseases that have reared up in unexpected places and sown public panic and death.
After seeing that dynamic play out in his hometown of Toronto during an infectious disease outbreak in 2003, Khan became the founder and driving force behind BlueDot, a Toronto-based for-profit social enterprise company focused on combining web-based technologies and big data with epidemic expertise to get ahead of the curve on infectious diseases and give public health and government officials time to anticipate when, where and how hard they’ll be hit.
The BioDiaspora platform that fuels BlueDot’s efforts was developed in 2008 at the Li Ka Shing Knowledge Institute at St. Michael’s Hospital. In 2013, the platform was incorporated as a for-profit company, and in 2014, it was rebranded as BlueDot. The company, which received funding from the Li Ka Shing institute and tech investors, has 40 staff members and combines the efforts of infectious disease specialists, data scientists, researchers and computer engineers with reams of real-time data on some 4 billion commercial flight itineraries; human, animal and insect population data; climate data from satellites; and news reports of disease outbreaks.
BlueDot has inked a five-year cooperative agreement to work with the CDC and also received funding from Foreign Affairs, Trade and Development Canada to support WHO’s Ebola response in West Africa and to build capacity among the 10 countries of the Association of Southeast Asian Nations to prepare for infectious disease threats. BlueDot develops research reports and risk map models that are distributed to public health agencies and other health providers to show the potential spread of infectious diseases and the estimated impacts on at-risk populations. It’s also developing web-based and mobile tools that can help first responders and others report on infectious disease cases and help public health agencies deploy resources faster.
BlueDot, along with collaborators from Harvard and Oxford University, was recently in the news for warning about the spread of the Zika virus in Brazil and beyond in a paper published in the Lancet medical journal in January — a month before the World Health Organization classified it as a public health emergency in February and the Centers for Disease Control and Prevention issued travel warnings that same month.
That underscores one of Khan’s biggest concerns—the likelihood that public health agencies and the medical community would be caught flat-footed by the newest illness. “Much of what we do in the infectious disease realm is in academia, and in that realm one of the limitations to fighting emerging threats is that good ideas on how to address them is disseminated through peer-reviewed academic journals, and it takes a year to six months to get that information out,” he says. “That system is not particularly responsive when dealing with an emergency.”
Khan’s big wake-up call about the impact of human mobility on the spread of infectious disease was the 2003 outbreak of severe acute respiratory syndrome (SARS) in his hometown of Toronto. The disease first popped up in Guangdong Province in China in November 2002; a doctor who treated SARS victims there later flew to Hong Kong to attend a wedding and developed symptoms while at the Metropole Hotel. During his stay, he infected 12 other hotel guests, including a 78-year-old woman from Canada who later flew back to Toronto. That one case of SARS led to an outbreak beginning in March 2003 that resulted in 44 people in Canada dying from SARS, approximately 400 becoming ill, and 25,000 Toronto residents placed in quarantine.
Khan felt SARS was a harbinger of a future in which infectious diseases would ride on the back of increased globalization and find new opportunities to spread. In the case of the Zika virus, he notes, Aedes aegypti mosquitoes are not spreading across the globe and infecting people; people traveling domestically and internationally are infecting the mosquitoes and bringing the disease to new areas. Indeed, predicting the spread of the disease requires researchers and data scientists to collect and analyze large amounts of data about insect behavior, climate, travel, demographics, health infrastructure and other factors to get an idea where and when the disease can gain a foothold.
But the data challenge for BlueDot is not the volume or velocity of the data but its variety, says Steve Hockema, BlueDot’s director of data engineering. The datasets being used, which range from real-time satellite feeds of climate data to clinical reports and travel itineraries, had never been combined before. The biggest obstacle has been organizing, curating and cleaning the data so it’s ready to be “taken off the shelf” when needed for different analyses. With that foundation set, researchers can layer on additional datasets specific to an infectious disease. Analyzing Zika, for example, required specialists to include information on the lifecycle of the mosquito and climate data on where and when it would be in Brazil, in addition to the data about human mobility and demography.
To store its data, BlueDot currently has two central repositories—a data warehouse built on a SQL server and a shared disk drive. However, the company recently started transitioning to a big data infrastructure as the volume of data increases and its staff needs more visualization tools to analyze the data, which is heavily skewed toward spatial information, Hockema says.
To that end, it has set up a Spark cluster using Databricks, a cloud-based platform built around the open source Apache Spark engine, and is moving most of its data there. Databricks is used to manage the data clusters from the current warehouse, create a workspace for visualization, and provide a pipeline scheduler for delivering data to those clusters.
Hockema and his team use scripts written in-house to pull data from various sources, clean and transform it, and import it into the data warehouse for ETL to the data clusters. Previously, the team used handwritten SAS code for data import and cleaning.
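As a rough illustration of the pull, clean and transform steps described above, the following Python sketch normalizes a few raw outbreak records before they would be loaded anywhere. It is not BlueDot's code: the field names (`country`, `disease`, `cases`, `reported`), the sample rows and the cleaning rules are all hypothetical.

```python
import csv
import io
from datetime import datetime

# Hypothetical raw feed; column names and values are invented for illustration.
RAW = """country,disease,cases,reported
 brazil ,Zika,120,2016-01-15
Brazil,zika,,2016-01-16
Colombia,Zika,45,15/01/2016
"""

def parse_date(text):
    """Try two common date layouts; return None if neither matches."""
    for fmt in ("%Y-%m-%d", "%d/%m/%Y"):
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None

def clean(rows):
    """Normalize text fields, coerce types, and skip incomplete records."""
    for row in rows:
        cases = row["cases"].strip()
        reported = parse_date(row["reported"])
        if not cases.isdigit() or reported is None:
            continue  # missing case count or unparseable date
        yield {
            "country": row["country"].strip().title(),
            "disease": row["disease"].strip().title(),
            "cases": int(cases),
            "reported": reported,
        }

cleaned = list(clean(csv.DictReader(io.StringIO(RAW))))
```

In this sketch the second row is dropped because its case count is missing, and the two surviving rows end up with consistent capitalization and a single date type, which is the kind of "ready to take off the shelf" state the cleaning step aims for.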
“We have historically used mostly SAS,” Hockema says, adding, “we recently have begun to use R more.” R is the open source statistical programming language that many organizations are adopting because of the rapid release of new statistical packages and techniques by developers committed to open source.
One of the first steps in analyzing its data was understanding human mobility, Khan says. BlueDot licenses private sector data on nearly 4 billion annual flight itineraries, which is run through its infrastructure to provide researchers with a look at how people move a disease across the world. The data is anonymized and doesn’t provide information on citizenship or gender, but does provide critical data on points of origin and connecting flights, Khan says.
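To make the idea of origins and connecting flights concrete, here is a minimal Python sketch that counts travelers by final destination for trips starting in an affected region. The itineraries are invented, the airport sets are assumptions chosen for illustration, and the function name `final_destination_counts` is hypothetical; the licensed data BlueDot uses is far larger and is not reproduced here.

```python
from collections import Counter

# Hypothetical anonymized itineraries: each is the ordered list of airports
# on one traveler's trip (origin, any connections, final destination).
itineraries = [
    ["GIG", "MIA"],
    ["GRU", "PTY", "MIA"],
    ["GRU", "JFK"],
    ["REC", "GRU", "MIA"],
]

def final_destination_counts(trips, origins):
    """Count travelers by final destination for trips that start in `origins`."""
    counts = Counter()
    for legs in trips:
        if len(legs) >= 2 and legs[0] in origins:
            counts[legs[-1]] += 1  # connections are skipped; only the endpoint counts
    return counts

brazil_airports = {"GIG", "GRU", "REC"}  # assumed set of origin airports
volumes = final_destination_counts(itineraries, brazil_airports)
```

Counting final destinations rather than individual flight legs matters here: a traveler who connects through Panama City on the way to Miami contributes to Miami's total, not Panama City's.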
That information is combined with other data streams, including two critical epidemic data feeds from the HealthMap system and the Program for Monitoring Emerging Diseases, or ProMED. HealthMap, from Boston Children’s Hospital, is a freely accessible, automated electronic information system for monitoring, organizing and visualizing reports of global disease outbreaks according to geography, time, and infectious disease agents. ProMED is a publicly available emerging diseases and outbreak reporting system designed to promote communication among the international infectious disease community, including scientists, physicians, veterinarians, epidemiologists, public health professionals and others. The system is maintained by the International Society for Infectious Diseases.
All the incoming data is time-stamped and geocoded, and feeds a web-based geographic information system from ESRI. Researchers at BlueDot can then create visualizations of the spread of disease and use that data to run predictive models to understand when and where further outbreaks could occur, based on information about incubation periods, the number of travelers, insect activity and numerous other variables.
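The way several variables combine into a ranking can be shown with a deliberately simplified Python sketch. This is a toy score, not BlueDot's predictive model: the cities, traveler counts, climate suitability values and the `relative_risk` formula are all invented for illustration, and a real model would also account for incubation periods, seasonality and much more.

```python
# Hypothetical inputs per destination city; every number here is invented.
destinations = {
    "Miami":   {"travelers": 3000, "climate_suitability": 0.9, "vector_present": True},
    "Toronto": {"travelers": 2000, "climate_suitability": 0.1, "vector_present": False},
    "Houston": {"travelers": 1000, "climate_suitability": 0.8, "vector_present": True},
}

def relative_risk(info):
    """Toy score for local mosquito-borne spread.

    Traveler volume is scaled by climate suitability, and the score is
    zero where the mosquito vector is absent.
    """
    if not info["vector_present"]:
        return 0.0
    return info["travelers"] * info["climate_suitability"]

# Rank destinations from highest to lowest toy score.
ranked = sorted(
    destinations,
    key=lambda city: relative_risk(destinations[city]),
    reverse=True,
)
```

Even in this toy version, a city with many incoming travelers but no suitable mosquito ranks below a city with fewer travelers and the right conditions, which is why travel volume alone is not enough to anticipate where a vector-borne disease will take hold.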
In regard to Zika, BlueDot’s risk map and Lancet paper predicted the spread of the disease in Brazil and also warned that the disease would likely spread into the United States via Florida, based on the volume of travel as well as the state having the right climate and “right” mosquitoes for transmission. However, Khan also noted in the Lancet paper that better housing and less stagnant water in Florida, compared with affected Brazilian areas, would likely result in limited transmission in Florida.
Khan says BlueDot is pouring resources into the underlying conundrum of addressing infectious diseases—contextualizing the information to understand the real threats and possibilities of each of the microbes it’s analyzing. “When we talk about infectious disease and epidemics, it sounds like we’re talking about one thing, but we are really talking about very heterogeneous things.
“Zika, for example, is a disease spread by specific mosquitoes, and it’s very dependent on climate and temperature. Ebola is a disease that probably originated via contact between humans and primates; when looking at Middle East Respiratory Syndrome (MERS) it’s relevant to understand the interaction between humans and camels.
“Each of these diseases and the microbes that cause them have a unique life cycle, and you need certain types of data to understand the significance and context around the outbreaks. It’s akin to playing with Legos—if you don’t know how to assemble all those pieces of data, they are not necessarily very meaningful,” says Khan.
For example, while travelers returning to Chicago or Toronto may be infected with the Zika virus, the absence of Aedes aegypti mosquitoes means there’s little to fear about widespread outbreaks of the disease there, although it can also be spread through sexual contact.
Another example from Khan drives home the importance, and complexity, of his work: In 2010, Haiti was devastated by a magnitude 7.0 earthquake that killed more than 160,000 people and left its infrastructure, including the sanitation and water systems, in shambles. A single case of cholera that entered the country led to more than 10,000 deaths.
“Every microbe has a different consequence in different settings,” Khan says. “The biggest challenge is to understand what epidemiologists call the infectious disease triangle, which comes down to understanding the pathogen itself, the population it’s being introduced to and the environment it’s being introduced into. The question is always, ‘What’s going to be the consequence?’”
While big data is helping spur more aggressive responses to infectious disease outbreaks, that question is still difficult to answer. In the case of Zika, health experts are working to help state and local governments in the South and along the Gulf Coast, where conditions favor the mosquitoes that carry the disease, create Zika response plans for the summer. Federal officials say 312 cases have been confirmed in the U.S., but those infected traveled to Zika-afflicted areas or contracted the disease through sexual contact.