Where and how to find data definitions

 

Tjerk Timan, TNO – written for the BDVE

 

Image: Wolfgang Stief, Control Data NOS – http://www.cray-cyber.org/ – taken from flickr.com (The Commons), rights-free

Data: what’s all the fuss about?

This post does not aim to (re)produce the countless debates in policy and academic circles surrounding the upcoming GDPR [1]. There are, however, underlying questions about some of the assumptions that regulations such as the GDPR make about what data actually is. Are these assumptions accurate and stable enough to build far-reaching regulation on? Perhaps we need to take a step back and unpack the concept of data a bit. The reason for raising what might seem a purely academic debate is that before (or alongside) regulating something like a single market for data, Europe also needs to develop a clear understanding of data. In an environment in which innovations and focal points related to data shift rapidly (from big data to AI and blockchains, and who knows what’s around the corner…), some basic starting points or building blocks would provide stability, or at least a common language, when we try to regulate and stimulate data-driven innovation.

That data plays a pivotal role in current economic development in Europe seems evident, at least if the outlooks are to become reality. The fuss in data-land seems to revolve around a global race for the quantity of data (the more the better) and for control over the flows of data (who can have access to what), but also around the quality of data workers (who can educate and/or hire the best data scientists or AI experts?). At the level of the data itself, the fuss seems to be about the protection of personal data and about how to protect EU citizens’ digital rights in an environment in which data ‘flows’ across borders and continents all the time. But what exactly counts as personal data in a digital environment, and how data flows should be regulated, is not entirely clear yet…

Can we define it?

Let us then look at the core of the problem: what exactly do we mean by data? When thinking of data flows, one can imagine a stream of 1’s and 0’s traveling from one physical place to another through some kind of data tube or hose [2]. And indeed, such metaphors are in use in popular digital-tech lingo. Think of the notion of the Twitter ‘firehose’ [3], which could be ‘tapped’ into via an API [4], or of streaming on-demand video services. The use of such metaphors in popular language also shapes and influences the policy debate, or at least it should. However, many data metaphors used in regulation are wholly inaccurate at the moment. To even arrive at a useful notion of data flow from a policy perspective, we first need to understand the entity that is flowing, i.e. the data itself. Increased interest in data-driven innovation has prompted new attempts in academic discourse at defining the notion of data. Floridi states that information = data plus meaning. This formula tries to capture that data alone is nothing: it needs context and a purpose, a meaning, for it to become ‘information’. Kitchin & Dodge argue that the term ‘data’ is itself etymologically wrong, and explain that what we are all concerned with should actually be called ‘capta’, a captured representation of reality rather than reality itself.

Horizontal versus vertical data

A more accessible explanation and classification of types of data in the context of the Web and social media stems from Menchen-Trevino. In her 2013 article she makes a distinction between vertical and horizontal data. Horizontal data is the trace data that we all, humans or machines, leave behind simply by being and acting digitally. This can be transaction data, web-browsing data or messenger metadata, for example. Such data are horizontal because they are comparable; they sit on a similar level in terms of information value or explanatory power. Vertical data is the insight and content data that moves from superficial trace data to more in-depth, content-driven data of a higher explanatory power. Where horizontal data is similar data on many subjects (be they things, events or people, e.g. ‘Facebook timestamp data’ or ‘politicians’), vertical data refers to heterogeneous data sources on one subject (e.g. photos, videos and tweets on ‘Hurricane Irma’). Data here is classified in terms of its analytical purpose. The value of this classification is that it leaves room for different sorts of data to enter without the model of what data is falling apart. What we can learn from the above-mentioned scholars is that online and digital culture brings about new types of data, but also new ways of thinking about data. Regulation should connect to how society lives with data. Law-making trajectories of ten years or more will probably not suffice in that respect.
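
To make the distinction a bit more tangible, here is a minimal sketch in Python. It is only an illustration of the horizontal/vertical idea, not something taken from Menchen-Trevino's article: the record layout, the field names and the sample values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    subject: str  # who or what the record is about (hypothetical field)
    kind: str     # sort of data, e.g. "timestamp" or "photo" (hypothetical field)
    value: str    # the captured content itself


# Made-up sample data, for illustration only.
records = [
    Record("user_a", "timestamp", "2017-09-06T10:00"),
    Record("user_b", "timestamp", "2017-09-06T10:05"),
    Record("user_c", "timestamp", "2017-09-06T10:07"),
    Record("hurricane_irma", "photo", "irma_coast.jpg"),
    Record("hurricane_irma", "video", "irma_landfall.mp4"),
    Record("hurricane_irma", "tweet", "Stay safe, everyone"),
]


def horizontal_slice(items, kind):
    """Horizontal: the same kind of data across many subjects."""
    return [r for r in items if r.kind == kind]


def vertical_slice(items, subject):
    """Vertical: heterogeneous kinds of data about one subject."""
    return [r for r in items if r.subject == subject]


if __name__ == "__main__":
    for r in horizontal_slice(records, "timestamp"):
        print("horizontal:", r.subject, r.value)
    for r in vertical_slice(records, "hurricane_irma"):
        print("vertical:", r.kind, r.value)
```

The same collection of records yields both views; which one is used depends on the analytical purpose, which is the point of the classification.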

What is new about digital data?

But is that really true? Aren’t we overstating the newness of the digital and the influence of digitisation? A point that has been made nicely before is that data in themselves, as a collection of facts to infer something from, are of course nothing new. However, the arrival of computers and the Internet also brought along novel types of data that did not exist previously. A while ago, in Internet terms at least, Rogers delineated a Web Epistemology, arguing that natively digital objects such as Tweets or hyperlinks are fundamentally different from the ‘real’ world data that we have digitized. If one looks at a computer interface, the digitized kind is immediately recognizable: we use a lot of metaphors in our daily digital spaces that still adhere to concepts we know and recognize from the real world (see your digital waste-bin, or a folder or file, for instance). However, Web 2.0 has brought about a range of digitally native objects and concepts that do not have a counterpart in the non-digital world [5]. Again, this approach talks about data in terms of its context and purpose, rather than aiming to define it. Yet it does capture novel languages and formats of expression that have become the bread and butter of daily-life media use. In a legal or policy context, however, the above definitions have limited application; written law for the moment hinges on unambiguous definitions.

What does the dictionary say?

In search of a precise and unambiguous definition, let us turn to the dictionary. Merriam-Webster defines data as:

  • Factual information (such as measurements or statistics) used as a basis for reasoning, discussion, or calculation
  • Information output by a sensing device or organ that includes both useful and irrelevant or redundant information and must be processed to be meaningful
  • Information in numerical form that can be digitally transmitted or processed

Here too, the dictionary does not provide us with clear solutions, but it does at least offer some directions. Three different conceptualisations of data are given, of which the last relates most strongly to the computer science perspective on data. The second definition hints at an important aspect of data-driven processes: the fact that databases in themselves are never clean, neat and complete, nor are they always provided with the right metadata and tags about the data. Metadata about when and where something was collected, how and by whom, and whether (and how many) previous owners or users the dataset has had, is often lacking. The first definition seems to support the idea that data in themselves are nothing, or are at least useless as separate ‘things’; rather, data (plural) form the basis for insight, for reasoning, for meaning.
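
As a rough illustration of the kind of metadata that is often lacking, the sketch below models a dataset's provenance record in Python and reports which fields are empty. The class, its field names and the example values are hypothetical and not drawn from any existing metadata standard.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class DatasetMetadata:
    # All field names are assumptions made for this illustration.
    collected_when: Optional[str] = None
    collected_where: Optional[str] = None
    collection_method: Optional[str] = None
    collected_by: Optional[str] = None
    previous_owners: Optional[int] = None


def missing_fields(meta):
    """Return the names of the metadata fields that were never filled in."""
    return [f.name for f in fields(meta) if getattr(meta, f.name) is None]


# A made-up dataset description with only two of the five fields known.
example = DatasetMetadata(
    collected_when="2017-09",
    collected_by="hypothetical sensor operator",
)

if __name__ == "__main__":
    print(missing_fields(example))
    # → ['collected_where', 'collection_method', 'previous_owners']
```

A check like this says nothing about whether the data is meaningful, but it makes visible how much context a dataset has lost before anyone tries to reason with it.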

Data and moving markets

But how to give meaning to data in (economic) practice? In an attempt to make things a bit more concrete, the Swedish chamber of commerce has proposed a categorisation of data in terms of sector or part of society in its publication on the Free Flow of Data (p. 8): Corporate data, End-customer data, Human resources, Merchant data and Technical data. It makes this distinction after reasoning that the free flow of data is an ancillary freedom, supportive of the other four European freedoms, yet not grown-up or clear enough to deserve its own category of freedom (not least because regulations such as the GDPR ‘trump’ some of that freedom). The point is that there are indeed novel, virtual markets, which need data to run them. Moreover, data in itself is becoming a tradable good, with its own marketplaces popping up, also within Europe. Beyond marketplaces where ‘raw’ data is traded, or where datasets are matched with problems or challenges, there is added value to be found in the trade of algorithms or data models as well. Such marketplaces are not only where the economic value of data will be determined; they are also good places for regulators and policymakers to dig for the definitions and meanings attributed to data. In trying to capture and co-shape a Digital Single Market, regulators should more actively learn from data ‘practices’ and from the data innovators on the ground who try to make sense of (and profit from) data. In a next blog post, we will take a look at how start-ups position themselves on data markets, to provide a first glance at what can be learned from data practices.

Notes 

[1] Just see https://duckduckgo.com/?q=gdpr+implementation&atb=v85-3&ia=web&fexp=a to understand how GDPR implementation has spawned an entire industry of GDPR reports and advisors.

[2] See for instance this image, which is over-represented in almost every presentation on data flows, digital highways, etc.: https://cdn4.dualshockers.com/wp-content/uploads/2014/01/digital-tunnel-wallpaper1.jpg. Visual metaphors are extremely effective in shaping ideas on ungraspable concepts within science and technology (see McDermott, R. (2000). Why information technology inspired but cannot deliver knowledge management. In Knowledge and communities (pp. 21-35), or Ruivenkamp, M., & Rip, A. (2011). Entanglement of imaging and imagining of nanotechnology. Nanoethics, 5(2), 185.). However, once a concept has become a recognizable image, it can lead to misinterpretations of that concept (mainly because the representation becomes the presentation; the fact).

[3] See for instance https://thenextweb.com/dd/2015/04/11/twitter-cuts-off-firehose-resellers-as-it-brings-data-access-fully-in-house/ on how Twitter's closing of the data flow (its ‘firehose’) has had an enormous impact on the data ecosystem, from journalism to third-party services dependent on this hose.

[4] API stands for Application Programming Interface, which is a part of a software application that allows for interfacing with other software applications.

[5] Rogers claims this also calls for a new set of research methods that are natively digital (as opposed to, for example, an online survey, which is a virtualized method, meaning that it existed before and we ‘copied’ the concept into the digital realm); see dmi.uva.net