Data standards: What are they and why do they matter? The Analysis

With great pleasure we introduce an analysis of what data standards are and what they mean in practice. You can read the report on the transparencee.org website, where we divided it into two parts: The Introduction and The Analysis. This report is part of our efforts to formulate regional standards for data and data-related processes (law included).

The ultimate goal of the TransparenCEE initiative is to strengthen the civic tech sector in Central and Eastern Europe (CEE). We build foundations for collaboration, in part, by suggesting data standards to be used in joint projects.


Comic available at xkcd.com/1179/ under the Creative Commons Attribution-NonCommercial 2.5 License.

What is a data standard?

Think of a Master’s thesis that you have to write at university. It consists of a title, an abstract, the thesis itself and a bibliography. Oh, and it’s written as part of your curriculum and verified by other people, so you should specify the university, the faculty, the supervisor and the reviewer. You should also record some audit logs: date of creation, date of last modification, date of acceptance by the reviewer (with their opinion) and date of acceptance by the supervisor. You will probably also want a bit of translation for the global community: a title and an abstract in English if the thesis is in some other language.

Now let’s say that you want to create a tool to browse theses on any given subject. You need to gather a substantial number of theses and feed them into a computer. For that to happen you need to transform each thesis into a single data record.

Planning what this data record will look like is called modelling. We model real-life examples into data records. Modelling can drop details that we (arbitrarily) consider unimportant, like whether you published your thesis in paperback or hardcover, or the hair colour of your supervisor (unless you want to analyze this factor). Apart from that, it’s mostly about specifying requirements by making decisions like “should supervisor, reviewer and author be specified by providing given name, family name, academic title and university represented?”, “should the title and abstract be obligatory fields?” or “should dates include just year, month and day, or is the hour necessary as well?” Stakeholder collaboration is essential in this phase, as different contexts need to be grasped. In one country an independent review may not be necessary, while in another you may need two reviewers.
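The modelling decisions described above can be sketched as a record definition. This is only an illustration in Python, not part of any published standard: the class names, the field names and the choice of which fields are obligatory are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class Person:
    """An author, supervisor or reviewer (hypothetical model)."""
    given_name: str
    family_name: str
    academic_title: Optional[str] = None  # modelling decision: optional
    university: Optional[str] = None      # modelling decision: optional


@dataclass
class Thesis:
    """One thesis as a single data record (hypothetical model)."""
    author: Person
    title: str                        # modelling decision: obligatory
    date_of_final_accept: date        # modelling decision: day precision, no hour
    abstract: Optional[str] = None    # modelling decision: optional in this sketch
    supervisor: Optional[Person] = None
    # A list, because one country may require no independent review
    # while another requires two reviewers.
    reviewers: List[Person] = field(default_factory=list)


thesis = Thesis(
    author=Person("Krzysztof", "Madejski"),
    title="Data standards: What are they and why they matter",
    date_of_final_accept=date(2016, 1, 29),
)
```

Details such as the binding of the printed copy simply have no field here, which is how a model "drops" them.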

From modelling to representation and interoperability

Creating data standards is all about interoperability: the ability to exchange standardized data between systems owned by different parties. For that to happen one more step is required: representation, that is, deciding which file formats to use, how to format dates (look at the comic above again), how to store images, etc. In the end, you can land with the same information represented (or “serialized”) in several different file formats. The resulting files carry the same information, and choosing one over another is mostly a matter of preference; if you have the resources, all of them can be used in parallel.
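As a small illustration of why date formatting has to be agreed on, the sketch below reads the same string under two common conventions and then writes each result in the ISO 8601 form (year-month-day). The sample value is made up for the example.

```python
from datetime import datetime

ambiguous = "02/03/2016"  # made-up sample value

# The same characters give two different dates depending on the convention.
day_first = datetime.strptime(ambiguous, "%d/%m/%Y").date()    # 2 March 2016
month_first = datetime.strptime(ambiguous, "%m/%d/%Y").date()  # 3 February 2016

# Written as ISO 8601, the two readings can no longer be confused.
iso_day_first = day_first.isoformat()      # "2016-03-02"
iso_month_first = month_first.isoformat()  # "2016-02-03"
```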

Here are a few examples of the same content represented in some popular formats.

JSON (preferred by scripted solutions)

{
  "author": {"given_name": "Krzysztof", "family_name": "Madejski"},
  "title": "Data standards: What are they and why they matter",
  "date_of_final_accept": "2016-01-29"
}

 

CSV (anyone can view it in a spreadsheet, but embedding objects, e.g. the author within the thesis, is not possible)

author_given_name,author_family_name,title,date_of_final_accept
Krzysztof,Madejski,Data standards: What are they and why they matter,2016-01-29

 

XML (preferred by bigger institutions)

<thesis>
  <author>
    <given_name>Krzysztof</given_name>
    <family_name>Madejski</family_name>
  </author>
  <title>Data standards: What are they and why they matter</title>
  <date_of_final_accept>2016-01-29</date_of_final_accept>
</thesis>

The same content can be represented in any other format for which a so-called “serialization” is defined. These files can then be processed by computers.
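To make the idea of one record with several serializations concrete, here is a minimal sketch using only the Python standard library. The function names are invented for the example, and flattening nested keys with an underscore is just one possible convention for CSV.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

record = {
    "author": {"given_name": "Krzysztof", "family_name": "Madejski"},
    "title": "Data standards: What are they and why they matter",
    "date_of_final_accept": "2016-01-29",
}


def to_json(rec: dict) -> str:
    return json.dumps(rec, ensure_ascii=False)


def flatten(rec: dict) -> dict:
    """CSV has no nesting, so author -> given_name becomes author_given_name."""
    flat = {}
    for key, value in rec.items():
        if isinstance(value, dict):
            for sub_key, sub_value in value.items():
                flat[f"{key}_{sub_key}"] = sub_value
        else:
            flat[key] = value
    return flat


def to_csv(rec: dict) -> str:
    flat = flatten(rec)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(flat), lineterminator="\n")
    writer.writeheader()
    writer.writerow(flat)
    return buffer.getvalue()


def to_xml(rec: dict, root_tag: str = "thesis") -> str:
    def build(parent: ET.Element, data: dict) -> None:
        for key, value in data.items():
            child = ET.SubElement(parent, key)
            if isinstance(value, dict):
                build(child, value)      # nested object becomes nested element
            else:
                child.text = str(value)

    root = ET.Element(root_tag)
    build(root, rec)
    return ET.tostring(root, encoding="unicode")
```

All three functions start from the same `record`, which is the point: the model stays fixed and only the representation changes.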

Standardizing a Standard

Now, what if I create and announce such a standard as “Madejski Thesis Standard 1.0”? Well… most likely no one would care.

The power of the standard comes from the power of all the stakeholders using it. If it’s not really common then it isn’t really a standard.

The comic is published on xkcd.com/927/ under the Creative Commons Attribution-NonCommercial 2.5 License.

There is also one key element to standards: their openness. However, there is no single standard for what constitutes an open standard:

There are a number of definitions of open standards which emphasize different aspects of openness, including the openness of the resulting specification (is it published online? do you have to pay to get it?), the openness of the drafting process (who can propose changes? who decides?), and the ownership of rights to the standard. link

Coming from the internet community, we suggest using the World Wide Web Consortium’s (W3C) definition, which stresses an open process of standards creation, transparency, relevance and royalty-free usage (you don’t have to pay to use it):

[…] we define the following set of requirements that a provider of technical specification must follow to qualify for the adjective Open Standard:

  • transparency (due process is public, and all technical discussions, meeting minutes, are archived and referenceable in decision making)
  • relevance (new standardization is started upon due analysis of the market needs, including requirements phase, e.g. accessibility, multi-linguism)
  • openness (anybody can participate, and everybody does: industry, individual, public, government bodies, academia, on a worldwide scale)
  • impartiality and consensus (guaranteed fairness by the process and the neutral hosting of the W3C organization, with equal weight for each participant)
  • availability (free access to the standard text, both during development and at final stage, translations, and clear IPR rules for implementation, allowing open source development in the case of Internet/Web technologies)
  • maintenance (ongoing process for testing, errata, revision, permanent access)

When civil society invests in standards (mainly by using them, but also by participating in their development), it should at the same time ensure that the community has a voice in these standards. Please go through the definition above as a checklist before using any standard. When working on data that is not yet standardized, we propose that you involve other international stakeholders and create a W3C community group devoted to working on data standardization in a given field.

How to serve the data?

Opening data is quite a costly process. When we’re doing it for a social cause, let’s be sure that apart from creating a tool operating on the data, we also publish the data itself. How do we do it?

One option is data export: you export all the data and publish it online as a file. Anyone can access it just by downloading it. If the data changes, you should set up automatic periodic exports (every month or every day, depending on the data source).

That option works quite well when data is small in size and doesn’t change very often.
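A periodic export can be as simple as the sketch below. The function name, the file naming pattern and the choice of JSON are assumptions made for illustration; scheduling (for example with cron) and publishing the output directory on a web server are left out.

```python
import json
from datetime import date
from pathlib import Path


def export_snapshot(records: list, out_dir: Path, today: date) -> Path:
    """Write all records to a dated JSON file and return its path."""
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"theses-{today.isoformat()}.json"
    path.write_text(
        json.dumps(records, ensure_ascii=False, indent=2),
        encoding="utf-8",
    )
    return path
```

Putting the date in the file name keeps older snapshots available, so people who downloaded the data earlier can see what has changed.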

The second common option is to serve data through so-called APIs. An API is a piece of software with a detailed specification that acts as a socket through which other computer programs can pull the data. Think of it as a power socket: if your hair dryer has a matching plug, you just plug it in and the electricity flows to your hair dryer, much like data flows to your data-propelled computer program or website.

An API is a slightly more complicated option than data export, though if you’ve built your website on any modern web framework, you probably have an API with sufficient basic functionality built in. When you have a vast amount of data that changes quite frequently, it’s more efficient to publish an API than to do data exports.
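For readers who have never seen one, a read-only API can be sketched with nothing but the Python standard library. The URL layout (`/theses` and `/theses/<id>`), the in-memory data store and all names are assumptions made for the example; a real deployment would more likely rely on the API features of its web framework.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Stand-in for a real database (hypothetical data store).
THESES = {
    "1": {
        "author": {"given_name": "Krzysztof", "family_name": "Madejski"},
        "title": "Data standards: What are they and why they matter",
        "date_of_final_accept": "2016-01-29",
    },
}


def route(path: str):
    """Map a request path to (HTTP status, JSON-serialisable body)."""
    parts = [p for p in path.split("/") if p]
    if parts == ["theses"]:
        return 200, list(THESES.values())
    if len(parts) == 2 and parts[0] == "theses" and parts[1] in THESES:
        return 200, THESES[parts[1]]
    return 404, {"error": "not found"}


class ThesisHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        status, body = route(self.path)
        payload = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json; charset=utf-8")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port: int = 8000) -> None:
    """Start the server on localhost; blocks until interrupted."""
    HTTPServer(("127.0.0.1", port), ThesisHandler).serve_forever()
```

Calling `serve()` and then requesting `http://127.0.0.1:8000/theses/1` from another program would return the single record as JSON, without the client having to download the whole data set.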

P.S. APIs should be standardized just like the data they serve. Think of the power adapters you need to take with you when going to the UK from Continental Europe. With software, these adapters cost even more than the plastic ones at Heathrow!
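In software, such an "adapter" is code that every consumer has to write and maintain whenever two publishers expose the same information differently. The sketch below converts a record from an invented "publisher B" layout into the layout used in the thesis examples earlier; the field names `authorName`, `thesisTitle` and `accepted`, and the day.month.year date format, are all made up for illustration.

```python
def adapt_publisher_b(record: dict) -> dict:
    """Convert publisher B's (hypothetical) layout to the example thesis layout."""
    # B stores the full name in one string; split on the first space only.
    given_name, _, family_name = record["authorName"].partition(" ")
    # B writes dates as day.month.year; reorder into ISO 8601.
    day, month, year = record["accepted"].split(".")
    return {
        "author": {"given_name": given_name, "family_name": family_name},
        "title": record["thesisTitle"],
        "date_of_final_accept": f"{year}-{month}-{day}",
    }


converted = adapt_publisher_b({
    "authorName": "Krzysztof Madejski",
    "thesisTitle": "Data standards: What are they and why they matter",
    "accepted": "29.01.2016",
})
```

Even this tiny adapter embeds guesses (how to split a name, which date convention is in use) that a shared standard would make unnecessary.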

The UK power plug adapter by Adafruit Industries published under the Creative Commons Attribution-NonCommercial-ShareAlike 2.0 license.

What aspects do we analyse?

As part of this project, we will analyse and recommend data standards to be used in the tech for transparency field.

You will find them below, divided by their scope – the real-life situations that can be modelled by these standards. For each we will mention:

  • open source tools that work with these standards, so you know what you can deploy to process your data, or what other projects can make use of your data;
  • coverage of the standard: who uses it, where and how. The wider the coverage, the more established the standard;
  • contacts for the people responsible for existing deployments, so you can consult with them;
  • challenges in data modelling in existing projects (e.g. is this parliamentary body closer to a committee, a commission or a board? How was it modelled in countries with a similar parliamentary system?);
  • finally, what kind of data is covered (data types, data classes) and links to specifications.

Krzysztof Madejski