Estonian Information Society Yearbook 2011/2012. Karin Kastehein
• developing a website at opendata.riik.ee (beta) as Estonia’s information gateway for access to and use of open data;
• creating infrastructure for publishing data (repository; beta);
• specifying, in cooperation with open data communities, the preliminary organizational, technical and semantic requirements for open data;
• the cloud solution CKAN was recommended to power the central repository (http://ckan.net);
• the cloud-based Drupal content management system was recommended as the front-end system (http://drupal.org);
• Apache SOLR was recommended as the search engine (http://lucene.apache.org/solr);
• interfaces that support RDF and SPARQL standards were required;
• a LAMP platform was required;
• interoperability with other repositories was required;
• it was assumed that institutions could establish their own repositories, but the central repository had to be capable of harvesting metadata from them;
• it was presumed that institutions could load datasets directly to the central repository.
Front page of the pilot application of the open data website
The pilot version of the central open data site can be found at http://opendata.riik.ee. The site consists of three integrated systems:
• A site for news, manuals, questions and discussions, where manuals and news can be posted, questions raised and open data topics discussed.
• A CKAN-based database of open data links, specifications and key metadata (see http://ckan.org), which can be linked to from the menu item Open Data on the site’s upper menu bar. The following can be retrieved from this database:
1) open data can be searched and downloaded without access restrictions;
2) new open data can be added (registration and user privileges from administrator required).
• A repository for datasets, which is one of the possible places where a government department can save open data.
Technically, preconditions have been created for developing open-data infrastructure. But technical solutions are not enough. It will be necessary to staff and train a team to be capable of administering and developing infrastructure and performing supervision; their activity should also encompass public sector data generators as well as open-data communities that develop services.
How to publish?
In what format? The main principle is that it is much better to publish data in an inconvenient encoding than not to publish them at all on the grounds that the encoding is to be improved at some unspecified future date. Secondly, a dataset that has been published can always be republished later in a new, better encoding.
In the context of open data, we recommend evaluating the user-friendliness of formats and encodings based on the principles of Tim Berners-Lee’s five-star system19, which is described in the previous article. Datasets are best published in formats that can be opened and processed using freeware applications. This includes .odt format document files as well as some of the most common formats of structured data, such as .csv, .json and .xml.
Formats that can be opened and modified by freeware applications are well-suited to re-use.
The use of one-star formats for opening data is to be avoided. On the other hand, publishing data even in such formats is certainly better than not publishing them at all.
Two-star formats are used primarily for data where all that users need is access to the data. Re-use means, above all, viewing and cutting and pasting of data. To ensure development of services, the open data should be presented in a three-, four- or five-star format.
Three-star formats. Three-star data should advisably be in one of the following formats, depending on whichever is more convenient for the data publisher. From the user standpoint, there is not much of a difference between these formats, but it would likely be most convenient to use .json.
• .csv files. The documentation must specify the character encoding, whether a comma or semicolon is used as the delimiter, and whether a period or comma is used as the decimal separator. Files should preferably have a header row listing the names of the fields. The official20 CSV format should certainly be used as the basis, including its nuances regarding quotation marks etc. (see http://en.wikipedia.org/wiki/Comma-separated_values).
• .json files. The same character-encoding requirements apply.
• .xml files.
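The CSV requirements above can be sketched in Python’s standard csv module, which follows the quoting rules of the official CSV format. The dataset, field names and values below are invented purely for illustration:

```python
import csv
import io

# Hypothetical example dataset; field names and values are invented
# purely to illustrate the CSV requirements described above.
rows = [
    {"id": "45321", "name": "Näidis OÜ", "area_km2": "12,5"},
    {"id": "45322", "name": "Test AS", "area_km2": "3,75"},
]

buf = io.StringIO()
# Semicolon as the delimiter, because the values use a comma as the
# decimal separator; the documentation must state both choices, as
# well as the character encoding (e.g. UTF-8) of the file on disk.
writer = csv.DictWriter(
    buf,
    fieldnames=["id", "name", "area_km2"],
    delimiter=";",
    quoting=csv.QUOTE_MINIMAL,  # standard CSV quoting rules
)
writer.writeheader()   # header row listing the names of the fields
writer.writerows(rows)

print(buf.getvalue())
```

The same rows could be serialized to .json with `json.dumps(rows)`; JSON files are UTF-8 by convention, which again should be stated in the documentation.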
Four-star formats. The principles are the same as in the case of three-star data, but the primary difference here is that globally unique identifiers – URIs or uniform resource identifiers – are to be used to identify objects. The use of uniform identifiers makes it much easier to use data across different systems.
To adopt URIs, a dataset prefix is added to each object identifier during data export, for instance http://institution.ee/nameofdataset/objects/, so that the full URI would be http://institution.ee/nameofdataset/objects/45321, where 45321 is the object’s original ID in the dataset. If the IDs are unique only within their own table (which is the most usual situation), the easiest approach is to form URIs during export by substituting the name of the relevant table for “objects”, for instance http://institution.ee/nameofdataset/naturalpersons/.
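The export step above amounts to simple string prefixing. A minimal sketch, assuming the table name is used in place of “objects” because IDs are unique only within their own table (the domain and all names are illustrative):

```python
# Hypothetical dataset prefix; in practice this would be the
# publishing institution's own domain and dataset name.
BASE = "http://institution.ee/nameofdataset"

def object_uri(table: str, object_id: int) -> str:
    """Prefix a table-local ID with the dataset and table name,
    producing a globally unique identifier for the object."""
    return f"{BASE}/{table}/{object_id}"

print(object_uri("naturalpersons", 45321))
# -> http://institution.ee/nameofdataset/naturalpersons/45321
```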
Once objects are presented as URIs, it is advisable, in addition to csv/json/xml, to express the data in RDF form – as entity-attribute-value triplets21.
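As a rough illustration of what entity-attribute-value triplets look like, the sketch below emits one record as N-Triples-style lines, one of RDF’s plain-text syntaxes. All URIs and values are invented for this example:

```python
# A minimal sketch of expressing one record as entity-attribute-value
# triples; the subject URI, property URIs and values are invented.
subject = "http://institution.ee/nameofdataset/naturalpersons/45321"
properties = {
    "http://institution.ee/nameofdataset/naturalpersons/name": "Mari Maasikas",
    "http://institution.ee/nameofdataset/naturalpersons/dob": "1980-02-11",
}

# Each line is one triple: <entity> <attribute> "value" .
triples = [
    f'<{subject}> <{attr}> "{value}" .'
    for attr, value in properties.items()
]
for line in triples:
    print(line)
```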
As data can appear in various syntaxes in RDF, we advise using one of the following two:
• Microdata22, which encodes data into HTML intended to be read by humans: the information is easy for humans to read and at the same time readily machine-readable.
• RDFa23, which is analogous to and has the same objectives as Microdata, but is slightly more complex.
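To give a sense of the Microdata approach, the sketch below generates an HTML fragment in which the same values are both visible to human readers and machine-readable via `itemscope`/`itemtype`/`itemprop` attributes. The vocabulary URL and property names are invented for illustration:

```python
# Hypothetical item type URI; real deployments would point this at
# a published vocabulary.
item_type = "http://institution.ee/nameofdataset/naturalpersons"
props = {"name": "Mari Maasikas", "dob": "1980-02-11"}

lines = [f'<div itemscope itemtype="{item_type}">']
for prop, value in props.items():
    # Each value is ordinary page text for human readers, while the
    # itemprop attribute makes it extractable by machines.
    lines.append(f'  <span itemprop="{prop}">{value}</span>')
lines.append("</div>")
html = "\n".join(lines)
print(html)
```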
Data in Microdata format can always be converted into RDFa with little trouble. The next question after the Microdata/RDFa issue is the selection of field names – the names of object properties. There are two main approaches.
• The simplest way would be to express the table/field-name pairs of the original dataset in the form of URIs, for instance http://www.institution.ee/nameofdataset/nameoftable/nameoffield, an example being http://www.institution.ee/permitrecipients/naturalpersons/dob.
• A slightly more complicated but
19 http://lab.linkeddata.deri.ie/2010/star-scheme-by-example/
20 http://en.wikipedia.org/wiki/Comma-separated_values
21 http://en.wikipedia.org/wiki/Resource_Description_Framework
22 http://en.wikipedia.org/wiki/Microdata_%28HTML%29
23 http://en.wikipedia.org/wiki/RDFa