Big data. On the one hand, it’s information overload. On the other, it’s gold in the quest for faster, smarter tools that can be applied to automation – and just about everything else.
That deluge of data is ripe for artificial intelligence, machine learning and predictive analytics, collectively known as data science, which can find patterns that human beings would otherwise have missed.
Data science uses computational techniques that blend historical knowledge with uncertain elements to provide sophisticated inferences.
Data scientists build complex models that can make predictions in dynamic conditions, recognize speech and faces, transform images, improve drug development, and support investment and business decision-making. UWM data scientists work in disciplines across many of the university’s schools and colleges.
And now, UWM is poised to make a bigger commitment to big data with the establishment of the Northwestern Mutual Data Science Institute. Northwestern Mutual and its foundation will contribute $12.5 million, while UWM and Marquette University will each contribute $11.25 million through in-kind support and fundraising over the next five years. The investment will support endowed professorships, data science faculty positions, research projects and expanded student programming, which will begin this fall.
Here’s a look at some of the UWM scientists already using data science in their work.
Chiang-Ching (Spencer) Huang
Chiang-Ching (Spencer) Huang is using big data to search for a better way to detect and treat cancer.
The ultimate goal of the work is to bring liquid biopsies – tests of bodily fluids, such as blood – into clinical practice. These tests would be less invasive and provide more information than current biopsies.
Scientists already know fragments of tumor DNA are released into the blood. “We want to know how accurate tests are in detecting cancer cells in the blood,” said Huang, an associate professor of biostatistics in the Joseph J. Zilber School of Public Health.
Working with DNA extracted from the blood of healthy patients and those with cancer, Huang uses data mining to sort through “humongous” amounts of genomic data and identify robust biomarkers linked to certain types of cancerous cells.
Being able to detect tumor cells earlier gives doctors an indicator of potential cancers before symptoms appear, and can also help enhance the effectiveness of treatment.
“We measure glucose and cholesterol in routine screenings,” Huang said. “When we are able to find cancers in a routine check in the very early stages, they are much easier to treat.”
Toddlers can easily recognize their parents, says Zeyun Yu (pictured with graduate student Reihaneh Rostami), but teaching a machine to recognize faces is difficult. Yu’s lab applies machine learning, 3D scanning and geometric processing techniques to facial recognition and to medical decision-making tasks.
“We are trying to mimic how the human brain works in both recognition and interpretation,” said Yu, an associate professor of computer science.
In collaboration with the Marshfield Clinic Research Institute and several hospitals, the Yu lab has modeled the shape of tumors in 3D to provide surgeons with information that’s missing in 2D views, such as how the tumor changes over time.
In another project, Yu’s work has the potential to allow doctors to monitor the wound healing process remotely. Patients with a smartphone could scan the wound anywhere, anytime and send images to cloud servers for 3D reconstruction and data analysis. The information extracted about the wound is then sent to a doctor, who can determine proper treatment.
Meteorologists can effectively predict the weather in the short range using computer models that solve the atmospheric equations of motion.
Long-term outlooks are another matter, says Paul Roebber, UWM distinguished professor of atmospheric sciences.
To improve predictions further into the future, forecasters rely on “ensemble” statistical models, which average many different weather models and provide a range of solutions. But getting complete and exact data from dynamic conditions is impossible.
“When we measure the current state of the atmosphere, we are not measuring every point in three-dimensional space,” Roebber says. To interpolate the gaps, he engineers new ways of “diversifying” limited data in the models.
Without the addition of new information, models used in a group tend to agree with one another rather than with the actual weather. In other words, the information in each model is often too similar, and in the absence of more diversity, it’s hard to distinguish the “signal,” or relevant variables, from the “noise,” or irrelevant ones.
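The ensemble idea described above can be illustrated with a minimal Python sketch. It is not Roebber’s method: the temperature values are invented for illustration, and the function names (ensemble_forecast, covers) are hypothetical. It only shows how an ensemble is averaged, how its spread is measured, and why members that are too similar can all miss the observed value.

```python
from statistics import mean, pstdev


def ensemble_forecast(member_forecasts):
    """Return the ensemble mean and spread (population std. dev.)."""
    return mean(member_forecasts), pstdev(member_forecasts)


def covers(member_forecasts, observed):
    """True if the observed value falls within the range of the members."""
    return min(member_forecasts) <= observed <= max(member_forecasts)


# Invented temperature forecasts (degrees C) from five models for one day.
similar_members = [21.0, 21.2, 20.9, 21.1, 21.0]  # members nearly agree
diverse_members = [18.5, 21.0, 23.4, 19.8, 22.3]  # members disagree more

similar_mean, similar_spread = ensemble_forecast(similar_members)
diverse_mean, diverse_spread = ensemble_forecast(diverse_members)

observed = 23.0  # invented "actual weather" for the comparison
print("similar:", similar_mean, similar_spread, covers(similar_members, observed))
print("diverse:", diverse_mean, diverse_spread, covers(diverse_members, observed))
```

In this toy case both ensembles have almost the same mean, but only the more diverse one spans the observed value, which is the kind of gap that added diversity is meant to close.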
Colleen Janczewski gets monthly data deliveries that contain some of the most sensitive information being archived in Wisconsin: partial records of 7,000 children who have been referred to the state’s Child Protective Services (CPS) system. “Improved child safety outcomes is our ultimate goal,” said Janczewski, an assistant professor in the Helen Bader School of Social Welfare.
The state rolled out an “alternative response” intervention for at-risk families in 2011. Janczewski, working under the Institute for Child and Family Well-Being, looks for patterns in the data to evaluate the intervention’s effectiveness in identifying and helping families before a child is placed in state custody.
“Alternative response is a better fit for families that come into contact with CPS because of low- to moderate-risk safety behaviors,” she says. “It’s more likely they will have good relationships with their case workers and benefit from support services.”
The data allows Janczewski to do what individual caseworkers can’t: systematically analyze thousands of cases to find red flags that help identify children most at risk as their home situations destabilize.
Purush Papatla creates models that mine the ocean of data available from social media and from digital services to gain insights into consumer behavior. He wants to know, for example, why and how consumers engage with brands on Facebook, why they retweet brands’ tweets and whether consumer ratings of restaurants affect sales.
What he is discovering is that consumers don’t always react to digital avenues of marketing the way they respond to traditional forms.
Papatla, a professor of marketing, has investigated how marketers can more effectively harness “likes” on Instagram, where more than 95 million images are posted every day. Most retailers have an account but don’t know how to use it to boost sales, he says.
Currently he is studying how traditional service providers, such as taxis and hotels, and new providers, like Uber and Airbnb, can better compete for customers. For the study, he has tapped into massive datasets on taxi and Uber rides – about 1 billion in New York City – and nearly 625,000 Airbnb searches.
Paul Auer doesn’t have a scalpel or a stethoscope, but his work is having an impact on the treatment of diseases.
Auer, an associate professor of biostatistics in UWM’s Joseph J. Zilber School of Public Health, uses mathematics and computers to study millions of genes to try to isolate variations that are linked to heart disease, cancer, sickle cell anemia and other diseases.
This type of big data statistical detective work is becoming an increasingly important part of health care as researchers comb through genetic information and large-scale, long-term compilations of medical records to establish links between genes and disease.
Most recently, he has been part of an international team publishing findings on genetic variations related to a certain type of estrogen receptor-negative breast cancer. He’s also part of a study, recently published in Nature, on protein-altering variants associated with body mass index. Those findings could eventually help lead to better therapies for obesity.
Collaborating with Aurora Health Care, UWM computer scientist Rohit Kate used variables found in hospital records to build a model that could predict which patients admitted for a different health concern were likely to also suffer from acute kidney injury.
Acute kidney injury comes on quickly and without warning. Potentially fatal, the condition is treatable if detected early. That’s where Kate comes in. His model, based on data such as patient demographics, medications, laboratory tests and comorbid conditions, raises the alarm before the condition would otherwise be detected.
The beauty of this, says Kate, is that most diseases and conditions respond to treatment successfully if caught early, so his model could have other medical uses. Already, he has adjusted the model so that it continually updates its predictions as conditions change during a patient’s hospital stay.
Called machine learning, these techniques are effective at predictive analytics, he says, because they allow patterns to emerge from the historical data. Kate also uses them in improving computers’ ability to understand and respond to human languages.
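To make the idea of letting patterns emerge from historical data concrete, here is a small Python sketch of one common technique, logistic regression trained by gradient descent. It is an assumption-laden illustration, not Kate’s model: the patient rows are entirely synthetic, the three features (scaled age, an abnormal-lab flag, a medication flag) are hypothetical stand-ins for the kinds of variables the article mentions, and the function names are made up for this example.

```python
import math


def sigmoid(z):
    """Map any real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(rows, labels, lr=0.1, epochs=2000):
    """Fit weights and a bias with simple per-example gradient descent."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            p = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
            error = p - y  # gradient of the log loss w.r.t. the score
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias


def predict_risk(weights, bias, x):
    """Return the modeled probability of the outcome for one patient."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


# Synthetic "historical records": [scaled_age, abnormal_lab, on_medication].
rows = [
    [0.2, 0, 0], [0.3, 0, 1], [0.5, 0, 0], [0.4, 1, 0],
    [0.7, 1, 1], [0.8, 1, 1], [0.6, 0, 1], [0.9, 1, 0],
]
labels = [0, 0, 0, 0, 1, 1, 0, 1]  # 1 = developed the condition (invented)

weights, bias = train_logistic(rows, labels)

low_risk = predict_risk(weights, bias, [0.2, 0, 0])
high_risk = predict_risk(weights, bias, [0.9, 1, 0])
print("low-risk patient:", low_risk)
print("high-risk patient:", high_risk)
```

Calling predict_risk again whenever a patient’s values change is one simple way to picture a model that keeps updating its prediction during a hospital stay.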
Rebecca Headley Konkel
Wicked problems are the domain of social scientists. Child safety, crime prevention, opioid addiction and trauma-informed care are among the topics that Helen Bader School of Social Welfare faculty address in their teaching and research. “We’re identifying trends and environments that are preventive rather than conducive to these kinds of issues,” explained criminal justice assistant professor Rebecca Headley Konkel.
How she does that work – by analyzing millions of chunks of data provided by the Pennsylvania Department of Corrections – distinguishes her as a computational social scientist. More than 900,000 lines of data offer insight into the lives, criminal records and housing patterns of 15,000 parolees. Looking at structural data systematically offers insight, Konkel said, into where former prisoners go after they’ve been released and their likelihood of recidivism.
“My findings will be used to inform state policies to identify services and resources that allow people to succeed after incarcerations,” said Konkel, who also conducts research for police departments in central Wisconsin.
Developing new clinical interventions to treat diseases is an expensive and time-consuming task. Jake Luo is using data science, artificial intelligence and a team science approach to address the cost, efficiency and patient outcomes in medical research.
He has developed novel tools that leverage patient health records and data from previous clinical studies to improve models that include disease prediction and risk detection.
For example, people who experience dizziness from vestibular disease are often misdiagnosed. Using hospital records data, Luo has constructed models that identify risk in people who haven’t been diagnosed yet.
Luo, an assistant professor of health informatics, also has created an algorithm that identifies the risk of severe adverse effects, including death, in subjects who participate in clinical drug studies. By linking, extracting and structuring complex data, he can deliver intelligent analysis, which yields results such as targets for potential new drugs.
Professor Jin Zhang specializes in information retrieval, conducting research on questions and answers that health consumers post on internet forums. Sorting through this data helps him, and us, better understand how consumers discuss their health concerns online.
“Through data mining, we find that the vocabulary of health consumers talking about health issues, like diabetes, is quite different from that of medical professionals,” Zhang said. “Only 30 percent of the terminology is the same. It’s like they are speaking two different languages about the same disease.”
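A vocabulary comparison like the one Zhang describes can be sketched in a few lines of Python. The article does not say how the 30 percent figure was measured, so this example assumes one simple measure, the share of distinct terms that appear in both vocabularies (the Jaccard index). The term lists are invented for illustration, and shared_fraction is a hypothetical name.

```python
def shared_fraction(consumer_terms, professional_terms):
    """Fraction of all distinct terms that both groups use (Jaccard index)."""
    consumer = {term.lower() for term in consumer_terms}
    professional = {term.lower() for term in professional_terms}
    union = consumer | professional
    if not union:
        return 0.0
    return len(consumer & professional) / len(union)


# Invented example vocabularies about diabetes.
consumer_terms = ["sugar", "insulin", "shots", "diabetes", "high blood sugar"]
professional_terms = ["insulin", "diabetes", "hyperglycemia", "injection", "glucose"]

overlap = shared_fraction(consumer_terms, professional_terms)
print(f"shared terminology: {overlap:.0%}")
```

Here two of the eight distinct terms are shared, so the two toy vocabularies overlap by a quarter.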
Also a specialist in information visualization, Zhang uses the technique as a tool to map connections among abstract items and create a visual structure to help people understand how the pieces of a whole fit together. Internet users moving from web link to web link, for example, are creating a map of their internet activity and preferred sites. Information visualization allows Zhang to extract and then visually represent this information. Companies and individuals use his research to build more efficient websites that are easier for users to find.
Abbas Ourmazd and Peter Schwander
People are biological factories abuzz with nanoscale “machines” – molecules, such as proteins, that carry out the body’s multitudes of functions. Each nanomachine performs a specific job upon receiving appropriate signals, and disease is often the result when something goes awry.
Our understanding of how the body’s nanomachines work has been based on looking at their static atomic structure. In fact, they change their shape as they execute a task, and that information would help advance new treatments for disease – if we could see it.
Now we can. A team of UWM scientists, consisting of Ali Dashti, Ghoncheh Mashayekhi, Peter Schwander and Abbas Ourmazd, has devised a machine-learning technique capable of making movies of biological nanomachines at work.
The algorithm they created uses data culled from millions of random snapshots of these molecules to compile the first atomic-level movies that show them in action. This has opened the floodgates to understanding, and ultimately fixing, the biological workhorses we all depend on.
When patients are diagnosed with illness, they often go to public online health forums and community websites to get their questions answered.
Using the text from these sites, Susan McRoy applies machine learning methods to identify unmet information needs so that health care practitioners can improve communication with their patients.
The information she finds is often a more accurate gauge of needs than clinicians can get from traditional patient surveys or focus groups. Patients who talk to each other online are less likely to withhold opinions and perceptions.
McRoy (pictured with computer science student Catelyn Scholl) has worked on projects with the City of Milwaukee Health Department and the Mayo Clinic to improve computer understanding of language related to cancer treatment and survivorship. Computers can search and find words, but they are poor at understanding the meaning in language.
She has recently begun applying data science to deciphering forum participants’ intent, feelings and perceptions about treatment for chronic pain and the prevention of addiction.
No standard code of ethics dictates how the data science community can scoop up and use the personal information we share online. Michael Zimmer is looking out for us.
“You might have clicked a box five years ago, but does that mean you’ve given anyone free rein to use your online statements any way they want?” asked Zimmer, a professor in the UWM School of Information Studies and a principal investigator on the recently launched national PERVADE project.
PERVADE researchers are reviewing data science courses nationally to see what’s being taught about data science ethics. They’re surveying Twitter and Reddit users about privacy settings and info gathering practices and sharing that feedback (with permission) with the data science research community.
Back in Milwaukee, Zimmer is finishing up the curriculum for UWM’s first data ethics course and updating data science research guidelines. “Innovative data science research can happen,” he said, “but people need to be protected. Working with a large data set, you can sometimes forget there are people attached to that data.”