Over the past three decades, the internet has become an integral part of contemporary societies. Online content is growing at a tremendous rate and changing constantly. Despite this, social scientists have paid little attention to the account of social change that the World Wide Web can provide. This article introduces web archives, which can serve as sources of data that help us draw a picture of the dynamic change of contemporary society and communication. It discusses the problems social scientists face in utilising web archive data and proposes, or at least sketches, solutions to them.
In the first section of the article, we explain the purpose of web archives and their current institutional framework both in the Czech Republic and abroad. In the second section, we discuss issues of accessing web archive data. We distinguish three kinds of access limitations: technological, where the researcher faces large volumes of data and heavy computing requirements; legal; and ethical. Legal limitations fall into those related to copyright and those related to personal data protection. The article emphasises that ethical limitations exist alongside the technological and legal ones, and that the researcher should always keep them in mind and respect them when working with the data. As a partial solution to data access limitations, the article proposes creating and operating an analytical interface through which researchers could obtain aggregate web archive data without direct access to primary data.
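To make the idea of an analytical interface more concrete, the following is a minimal sketch of one query such an interface might answer: how many archived captures per year mention a given term. It is an illustration only, not a description of any existing archive's interface. The function name `term_counts_by_year`, the `(year, text)` record layout and the `min_count` threshold (which hides very small counts) are all assumptions introduced here.

```python
from collections import Counter


def term_counts_by_year(captures, term, min_count=5):
    """Return {year: number of captures containing `term`}.

    `captures` is an iterable of (year, text) pairs held by the archive.
    Only aggregate counts leave the function; the texts themselves do not.
    Years with fewer than `min_count` matches are omitted (an illustrative
    safeguard assumed here, not a requirement stated in the article).
    """
    needle = term.lower()
    counts = Counter(
        year for year, text in captures if needle in text.lower()
    )
    return {year: n for year, n in sorted(counts.items()) if n >= min_count}


# Hypothetical data: the researcher never sees these texts, only the result.
captures = (
    [(2010, "Election news")] * 6
    + [(2011, "election")] * 2
    + [(2011, "weather")] * 10
)
result = term_counts_by_year(captures, "election")
```

In this sketch the researcher submits only the term and receives only the per-year counts, which is the separation between aggregate output and primary data that the proposed interface relies on.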
Finally, the third section of the article deals with the methodological limitations of web archive data. It focuses primarily on the representativeness, incompleteness and heterogeneity of such data. It presents three methods of data collection for web archiving: selective, thematic and full-domain harvesting. In selective harvesting, curators decide which websites to include; from the perspective of quantitative social research, such samples are representative only of the selected sources. Thematic harvesting focuses on specific topics and therefore provides highly representative and complete collections of data, as long as the researcher is interested in one of the topics covered. However, the number of topics is extremely limited. Full-domain harvesting is the most relevant method for social researchers, yet it is partly non-representative and incomplete. As a partial solution to the limited representativeness of full-domain harvests, the authors propose implementing weighted random sampling of web archive data. Currently, however, such a solution is hindered by the absence of suitable sampling frames.
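As an illustration of what weighted random sampling of a full-domain harvest could look like, the sketch below draws a sample without replacement in which each archived page is weighted by the inverse of the number of pages captured from its host, so that very large websites do not dominate the sample. The weighting scheme, the `(host, path)` record layout and the function names are assumptions made for this example; the article does not prescribe them. The sampling itself uses the Efraimidis-Spirakis key method (each item receives the key u^(1/w) for a uniform random u, and the k largest keys are kept).

```python
import heapq
import random
from collections import Counter


def host_weights(records):
    """Weight each (host, path) record by 1 / pages captured from its host."""
    pages_per_host = Counter(host for host, _ in records)
    return [1.0 / pages_per_host[host] for host, _ in records]


def weighted_sample(records, weights, k, seed=0):
    """Draw k records without replacement, proportionally to `weights`.

    Efraimidis-Spirakis method: assign each record the key u ** (1 / w),
    with u uniform on (0, 1], and keep the k records with the largest keys.
    """
    rng = random.Random(seed)
    keyed = []
    for record, w in zip(records, weights):
        if w <= 0:
            continue  # records with non-positive weight are never drawn
        u = 1.0 - rng.random()  # random() is in [0, 1), so u is in (0, 1]
        keyed.append((u ** (1.0 / w), record))
    top = heapq.nlargest(k, keyed, key=lambda pair: pair[0])
    return [record for _, record in top]


# Hypothetical harvest: one large host and two small ones.
records = [("big.example", f"/p{i}") for i in range(98)]
records += [("small.example", "/x"), ("tiny.example", "/y")]
weights = host_weights(records)
sample = weighted_sample(records, weights, k=5, seed=1)
```

Any such scheme still presupposes a list of units to sample from; here the harvest itself plays that role, which is exactly the sampling-frame problem noted above.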
The third section concludes with the issue of interpreting web archive results. An important finding is that when studies are based exclusively on such data, researchers do not know the intentions of the actors who created the content and can only speculate about their motivations. Here, the authors see an opportunity for integrating traditional sociological research with web archive data. Furthermore, the article stresses that observed changes in online content may result not only from changes in actors’ behaviours but also from shifts in the population of internet users, technological innovations and, last but not least, modifications of data collection methodology. It is, therefore, important for web archives to document their data collection efforts carefully and to accompany any analytical interfaces they provide with a precise description of the methods available to researchers.