Social Media Analytics Tracking of Infectious Diseases

The media's attention on Ebola has highlighted the fact that epidemiologists can be limited not only by the data they capture, but also by the traditional methods of analysis available, especially when trying to function in real time.

Nov 17, 2014 | 6 min read

Walter Boyle

Senior Analytical Consultant, SAS

The recent outbreak of the Ebola virus has focused public opinion on the analytical methods used by institutions such as the Centers for Disease Control and Prevention (CDC) to manage infectious diseases. In a recent opinion piece on CNN's website, Sen. Rob Portman wrote that the CDC should get proactive about Ebola by switching from passive monitoring to active monitoring of the outbreak. The media's attention on Ebola has highlighted the fact that epidemiologists can be limited not only by the data they capture, but also by the traditional methods of analysis available, especially when trying to function in real time.

Generally speaking, there are two forms of data capture (i.e., surveillance) when studying the spread and prevalence of an infectious disease: active and passive tracking. Active tracking is where researchers capture observation subjects from "the wild" (in the US this might be from the shopping mall or a phone list) with targeted instruments in a timely fashion. With passive tracking, researchers must wait until someone chooses to report information (e.g., a patient chooses to go to the doctor), and therefore they're potentially unable to influence what or when information is shared.

For the most part, we are more comfortable with passive tracking, choosing to share this kind of sensitive information as we see fit. I will admit, far more often than not, when someone approaches me with some kind of active tracking method (e.g., a telephone survey) I disengage without a second thought. We're happy to accept the benefit of information captured by others, but don't seem to want to contribute information on our own.

Further, we don't always know what we should actively provide to the health care and provider systems. Becoming drowsy from an allergy medication would be considered an adverse event, but how many of us would simply start taking the medicine at night and not bother calling our doctor? Add to this lapses in memory (do you remember to tell your doctor about the drowsiness when you finally do go see them again?) and selective reporting (we may at times decide which details are important and choose not to share those which we deem irrelevant).

Given all this, it's sometimes a miracle that we can understand the spread and progression of a disease that is epidemic rather than endemic. Yet for the past few years, we've been successfully doing so for the flu (search "social media flu" to see some examples). Social media has taken the conversations you might have only had with your friends in the past and published them for the world to see. We might not get specific details like sputum or discharge, but we get tons of information about the health of ourselves, friends, families and those around us.

"OMG, this guy on the subway just sneezed on me! Yuck!"

"Working from home today, Timmy has a nasty cough."

"Feeling queasy, never eat the food on Aardvark Airlines!"

To protect the innocent, these aren't exact quotes from Twitter, but they offer an example of the information out there. A sudden uptick in tweets mentioning sneezing, coughing, sore throats, sick days and other trigger words can indicate an increase or outbreak of disease. This is something that I would call "active-passive" tracking. The information isn't always specific, but the social media pulpit entices us to share with the world our thoughts and ideas (and a large audience of followers is a status symbol).
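As a rough illustration of what "a sudden uptick" could mean in practice, here is a minimal Python sketch. It counts posts per day that mention a trigger word and flags any day whose count sits well above the previous week's average. The trigger list, the function names, the seven-day baseline and the threshold are all assumptions made up for this example, not a description of any real surveillance system.

```python
from collections import Counter
from datetime import date
from statistics import mean, stdev
from typing import Iterable, List, Tuple

# Hypothetical trigger stems; a real list would be curated and much longer.
TRIGGER_STEMS = ("sneez", "cough", "sore throat", "sick day", "queasy")


def mentions_trigger(text: str) -> bool:
    """Return True if the post contains any trigger stem (case-insensitive)."""
    lowered = text.lower()
    return any(stem in lowered for stem in TRIGGER_STEMS)


def daily_trigger_counts(posts: Iterable[Tuple[date, str]]) -> Counter:
    """Count trigger-word posts per day from (day, text) pairs."""
    counts: Counter = Counter()
    for day, text in posts:
        if mentions_trigger(text):
            counts[day] += 1
    return counts


def flag_upticks(counts: Counter, days: List[date],
                 baseline_days: int = 7, threshold: float = 3.0) -> List[date]:
    """Flag days whose count exceeds the trailing mean by `threshold` deviations.

    `days` must be in chronological order. The deviation is floored at 1.0
    (an arbitrary choice) so a perfectly flat baseline does not flag a
    change of a single post.
    """
    flagged = []
    for i in range(baseline_days, len(days)):
        window = [counts.get(d, 0) for d in days[i - baseline_days:i]]
        limit = mean(window) + threshold * max(stdev(window), 1.0)
        if counts.get(days[i], 0) > limit:
            flagged.append(days[i])
    return flagged
```

Simple substring matching like this would also catch the sarcastic posts, so it is only a first-pass count, not a measure of actual illness.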

The work doesn't end with a distribution of hot-topic tweets, but it can provide a foundation that people trained in advanced analytics can build on. Natural language processing, contextual analysis and sentiment analysis can be used to separate the wheat from the chaff. It's important we don't confuse the sarcastic ("New video game release tomorrow, cough, cough, I feel a sick day coming on") with the serious ("Can this cough get any worse? Another day at home in bed, hope Jerry Springer is more interesting than yesterday.").

Add information like flight schedules to the mix and you can start to understand migration patterns of individuals. Add new trigger words to the search like "flight," "airport" and "trip," and the picture becomes more robust. Did someone tweet about being sick before going on a trip, or after coming back? Add in Yelp and Foursquare, as well: Did several people check in at the same restaurant prior to getting sick? If so, we can look at their social network to see if others they come in contact with are getting sick, or if we need to send a health inspector down for a quick look around.
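The restaurant question can be sketched in a few lines of Python. The example below assumes we already have check-ins as (user, venue, time) records and, for each user, the time of their first sickness-related post; both inputs, the function name, the three-day lookback and the two-person minimum are hypothetical choices for illustration only.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from typing import Dict, Iterable, List, Tuple


def venues_shared_before_illness(
    checkins: Iterable[Tuple[str, str, datetime]],
    sick_reports: Dict[str, datetime],
    lookback: timedelta = timedelta(days=3),
    min_users: int = 2,
) -> Dict[str, List[str]]:
    """Find venues visited by several users shortly before they reported illness.

    checkins:     (user, venue, check-in time) records.
    sick_reports: user -> time of that user's first sickness-related post.
    Returns venue -> sorted users, keeping only venues with at least
    `min_users` users who checked in within `lookback` before their report.
    """
    users_by_venue = defaultdict(set)
    for user, venue, when in checkins:
        reported = sick_reports.get(user)
        if reported is None:
            continue  # this user never posted about being sick
        # Keep only check-ins that fall inside the window before the report.
        if timedelta(0) <= reported - when <= lookback:
            users_by_venue[venue].add(user)
    return {
        venue: sorted(users)
        for venue, users in users_by_venue.items()
        if len(users) >= min_users
    }
```

A venue showing up in the result is a lead worth a closer look, not proof of a source; people who eat at the same place often share other exposures too.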

These types of methods, which are only getting more sophisticated, have been successful with the flu, and there is reason to believe they can also be successful for something like Ebola. The trigger words and phrases may be different, and paranoia and a lack of education may create enough noise to start drowning out the signal, but these are bumps in the road, not dead ends.

These passive signals can provide direction for active methods. Increased extraneous activity in social media (e.g., paranoid noise) may indicate the need for a heightened education campaign in some areas. The value of these interventions can be measured via social media and adjusted in near-real time to optimize the impact and utility. Key signals showing up in new areas or at higher rates may indicate the need for new protocols in hospitals and guidelines for when to stay home from work, school, etc.

Keep in mind that evidence found via social media is but a symptom of symptoms. The greatest value in these sources is not in their validity but in their timeliness. We need not wait for a data collection cycle, doctor's visit or lab assay; we are provided the information as quickly as it takes for someone to figure out how to express it in 140 characters. However, given this, there are a few things to keep in mind as we employ these methods:

1) The value of the information is only as good as the source from which it comes. Some of these open, public sources can be great, but they should be used with care. Ask any college student who cited Wikipedia without double-checking the validity of the information.

2) This is but one source of information and can be augmented by other sources to gain a more complete picture.

3) Information only has value through use. Infographics, heat maps and word clouds may be beautiful, but if they don't provide useful information, they don't provide value.

By harnessing the information available through these nontraditional sources, epidemiologists can expand the traditional definition of passive monitoring to include information that was not directly given to a health care provider, but was instead shared via publicly available social media. The new data sources, combined with the right analytical tools, can vastly improve our monitoring capability and result in a more robust public health surveillance system.
