

The powers and perils of using digital data to understand human behaviour


Computational social scientists have been using data from mobile phones to study the coronavirus pandemic. Credit: Paul Seheult/Eye Ubiquitous/Universal Images Group/Getty

What are the causes of vaccine hesitancy? How can people be encouraged to exercise more? What can governments do to improve the well-being of citizens?

Social scientists researching these questions observe how people behave, record data on those behaviours and then augment this knowledge by interviewing and/or polling those whom they are studying. Carrying out research in this way is a time-consuming and manual process. Moreover, it is difficult to obtain large amounts of data simultaneously.

But now, researchers have access to an unprecedented amount of social data, generated every second by continuous interactions on digital devices or platforms. These include data that trace people’s movements, purchases and online social interactions — which are all proving extraordinarily powerful for research. As a result, work weaving large data analysis with social questions, known as computational social science, has witnessed huge growth in recent years.

During the course of the coronavirus pandemic alone, researchers have been able to access millions of mobile-phone records to study how people’s movement changed during the pandemic and the impact of those changes on how SARS-CoV-2 spread. They have been able to access anonymized credit-card purchase histories to study how people are spending money during the pandemic — information which is then used to understand how COVID-19 is affecting various sectors of the economy.

Using computers to analyse large data sets dates back to the earliest mainframe computers — and has been central to the work of actuaries and national statistics offices, both of which have long been important resources for studies of society and people. But the wealth of real-time and individual-level information is now unparalleled in its power to track trends, make predictions and inform decisions. And its availability puts it in reach of practically every social-science discipline: researchers in fields from psychology to economics and political science can now rely on data to enhance investigations of key societal questions.

Power and responsibility

At the same time, researchers need to remember that gathering and sharing such personal data — practices that are currently largely unregulated — pose many challenges to society. These include risks from increased surveillance, and the danger that people could be reidentified from otherwise anonymized data.

There are also concerns that people whose data are being used have not fully consented to this — and wider worries about the economic monopoly of tech corporations that own the majority of the data. These digital traces tend to be left disproportionately by relatively wealthy people in developed countries, biasing attempts to draw global conclusions. Acknowledging and working with these issues is key to ethical computational social science that promotes real societal progress.

The need to blend expertise in the social sciences with the skills required to collect, clean and analyse large data sets means that computational social science requires teams of researchers who can field a remarkably diverse set of expertise and skills. But with collaborations across disciplines come other challenges.

This week, Nature is publishing a special collection of articles with the objective of bridging the research disciplines and perspectives on doing science that underpin computational social science. We’re highlighting ways in which communities of social, natural and computational scientists can learn to better work together, to complement each other and overcome shared challenges.

Stronger bridges

To begin with, the varied disciplines need to overcome language barriers in which the same terms have different meanings. For example, in many of the social sciences (such as psychology and sociology), ‘prediction’ often refers to a correlation; in the physical sciences (such as physics, computer science and engineering), it usually means a forecast. True transdisciplinary research requires scientists first to learn each other’s languages, and then to develop a shared understanding of terms.

But the divide can run deeper than language, into how to curate, analyse and interpret data to explain a phenomenon. Jake Hofman at Microsoft Research in New York City and colleagues argue that computational social science could most effectively answer research questions by combining complementary approaches. For example, researchers building a numerical forecast on, say, the causes of traffic jams would combine data on traffic flows with insights from drivers on their reasons for taking particular routes.

The results of any study are determined by not only the analytical strategies used, but also the quality of the data — and this becomes particularly delicate when dealing with social data. The vast amounts of available data that make computational social science possible — such as tweets or location data from phones — are usually not gathered for research purposes and so can easily be misinterpreted.

That is why, as David Lazer at Northeastern University in Boston, Massachusetts, and colleagues write, researchers who work with large data sets must resist drawing conclusions from just the trends or patterns seen in the numbers, and should account for factors that could affect a result. To extract real meaning from data, researchers need to ensure that they carefully define the objects of their measurement according to theory, validate them and interpret them appropriately.

The widespread influence of algorithms is another source of potential error, as Claudia Wagner at the Leibniz Institute for the Social Sciences in Mannheim, Germany, and colleagues explain. They note that the algorithms that pervade our societies influence individual and group behaviour in many ways — meaning that any observations describe not just human behaviour, but also the effects of algorithms on how people behave. They argue that the theories that inform social science need to be updated to acknowledge these influences; without these theories and a clear understanding of the impact of algorithms on the available data, researchers will not be able to draw meaningful conclusions.

Yet another complicating factor for computational social science is that large data sets are often the private property of commercial enterprises. Academic scientists need to liaise with corporations to obtain access, and this might introduce even more bias. This is partly because, for companies, data are valuable — and therefore sharing data is a risk to their bottom line. That is among the reasons why firms tend to restrict what they share, as Jathan Sadowski at Monash University in Melbourne, Australia, and colleagues highlight. But in light of the potential of these data to provide societal benefits, companies — together with academic researchers and public bodies — need to collectively engage with these questions and set standards for quality, access and data ownership.

Ways forward

There are ways to obtain data that can be useful and reliable, as Mirta Galesic at the Santa Fe Institute in New Mexico and colleagues describe in an article on ‘human social sensing’. This is the study of how individuals gather information on others in their social networks. For instance, researchers could predict a swing in political opinions by interviewing people and asking them what their friends are talking about. Gathering data about people from other people can help to avoid some of the biases seen in self-reported data, and has the added benefit of generating anonymous data: the researchers never need to know any personal or sensitive details about the people whom they are receiving information about.

Another area ripe for growth lies in the intersection of infectious-disease modelling and behavioural science. As Caroline Buckee of the Harvard T. H. Chan School of Public Health in Boston and colleagues argue, an accurate model of contagion and infection requires researchers to understand the cultures and behaviours of people who have been — or might be — infected. It is hard to predict a disease’s path without considering these and other social aspects of transmission. Structured and widespread collaborations cutting across disciplines are key to achieving this.

The pandemic has shown how lives can be saved when large-scale data sets are harnessed for science. This potential is only starting to be realized as researchers with backgrounds in computer science or applied mathematics join with social scientists. These relationships must deepen and encompass researchers in more fields — such as ethics, responsible research and science and technology studies — to ensure that we avoid known pitfalls and that we use these data in a way that maximizes gained knowledge and minimizes potential harm.

Transdisciplinary co-working is rarely easy, but it is essential for both better decisions and robust outcomes. Nature is committed to fostering this conversation, helping scientists to learn each other’s languages so that researchers can together make more progress on some of societies’ most pressing problems.

Nature 595, 149-150 (2021)


