Wednesday, October 1, 2014

The Metadata Bubble

In an ideal world, scholars deposit their papers in an Open Access repository because they know it will advance their research, support their students, and promote a knowledge-based society. A few disciplinary repositories, like arXiv, have shown that it is possible to sustain a virtuous cycle in which scholars reinforce one another's Open Access habits. In these communities, no authority is needed to compel participation.

Institutional repositories have yet to build similarly broad-based, enthusiastic constituencies. Yet many Open Access advocates believe that the decentralized approach of institutional repositories creates a more scalable system with a better chance of long-term survival. The campaign to enact institutional deposit mandates hopes to jump-start an Open Access virtuous cycle across all scholarly disciplines and all institutions. Such a campaign risks backfiring if scholars experience Open Access as an obligation with few benefits. For long-term success, most scholars must perceive their compelled participation in Open Access as a positive experience.

It is, therefore, crucial that repositories become essential scholarly resources, not dark archives to be opened only in case of emergency. The Open Archives Initiative (OAI) repository design provided what was thought to be the necessary architecture. Unfortunately, we are far from realizing its anticipated potential. The OAI Protocol for Metadata Harvesting (OAI-PMH) allows service providers to harvest metadata in any format, but most repositories provide only minimal Dublin Core metadata, a format in which most fields are optional and several are ambiguous. Very few repositories support Object Reuse and Exchange (OAI-ORE), which enables complex inter-repository services through the exchange of multimedia objects themselves, not just metadata about them. As a result, OAI-enabled services are largely limited to the most elementary kinds of searches, and even these often deliver unsatisfactory results, like metadata-only placeholder records for works restricted by copyright or other considerations.
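
To make the gap concrete, here is a minimal sketch of what a service provider typically gets from a repository: an OAI-PMH ListRecords request for Dublin Core records, every field of which is optional. The code is plain Python; the repository URL is a hypothetical placeholder, but any OAI-PMH endpoint answers the same query.

    import urllib.parse
    import urllib.request
    import xml.etree.ElementTree as ET

    # Hypothetical endpoint; substitute any repository's OAI-PMH base URL.
    BASE_URL = "https://repository.example.edu/oai"
    OAI = "{http://www.openarchives.org/OAI/2.0/}"
    DC = "{http://purl.org/dc/elements/1.1/}"

    # Ask for records in the minimal Dublin Core format (oai_dc).
    query = urllib.parse.urlencode({"verb": "ListRecords",
                                    "metadataPrefix": "oai_dc"})
    with urllib.request.urlopen(BASE_URL + "?" + query) as response:
        tree = ET.parse(response)

    # Every Dublin Core element is optional and repeatable; print which
    # ones each harvested record actually fills in.
    for record in tree.iter(OAI + "record"):
        present = sorted({el.tag[len(DC):]
                          for el in record.iter()
                          if el.tag.startswith(DC)})
        print(present)

Against most repositories, the printed field lists are short, which is exactly why services built on this harvest remain elementary. (A real harvester would also follow OAI-PMH resumption tokens to page through large result sets.)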

In a few years, we will entrust life and limb to self-driving cars. Their programs have mere milliseconds to compute critical decisions based on information that is imprecise, approximate, incomplete, and inconsistent: all maps are outdated by the time they are produced, GPS signals may disappear, radar and lidar signatures are ambiguous, and camera views are obstructed in constantly changing environments. When we can extract so much actionable information from such "dirty" data, it seems quaint to obsess over metadata.

Databases automatically record user interactions. Users fill out forms and effectively crowdsource metadata. Expert systems can extract author information, citations, keywords, DNA sequences, chemical formulas, mathematical equations, and more from any document, in any format and in any language. Other expert systems have growing capabilities to analyze sound, images, and video. Technology is shrinking the pool of problems that require human intervention at the transaction level. The opportunities for human metadata experts to add value are disappearing fast.
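
As a toy illustration of that trend, and not of any particular production system, even a few lines of term-frequency counting can pull candidate keywords out of raw text with no hand-made metadata; real extraction systems rely on trained models and far richer signals. The stopword list and sample text below are invented for the example.

    import re
    from collections import Counter

    # A deliberately tiny stopword list; real systems use far larger ones.
    STOPWORDS = {"the", "a", "an", "of", "and", "or", "in", "on", "to",
                 "is", "are", "for", "that", "this", "with", "but", "than"}

    def extract_keywords(text, k=5):
        # Tokenize, drop stopwords and very short words, rank by frequency.
        words = re.findall(r"[a-z]+", text.lower())
        counts = Counter(w for w in words
                         if w not in STOPWORDS and len(w) > 3)
        return [word for word, _ in counts.most_common(k)]

    sample = ("Metadata harvesting exposes catalog records, but full-text "
              "indexing of the documents themselves extracts richer "
              "signals than catalog metadata alone.")
    print(extract_keywords(sample))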

The metadata approach is obsolete for an even more fundamental reason. Metadata are the digital extension of a catalog-centered, paper-based information system. In this kind of system, today's experts organize today's information so that tomorrow's users may solve tomorrow's problems efficiently. This worked well when technology changed slowly and experts could predict who the future users would be, what kinds of problems they would want to solve, and what kinds of tools they would have at their disposal. These conditions no longer apply.

When digital storage is cheap, why implement expensive selection processes for an archive? When search technology does not care whether information is painstakingly organized or piled in a heap, why spend countless hours organizing and curating content? Why agonize over potential future problems with unreadable file formats? Preserve all the information about current software and standards, and start developing the expert systems to unscramble any historical format. Think of any information-management task. How reasonable is the proposition that this task will still require direct human intervention in two years? In five years? In ten years?

For content, more is more. We must acquire as much content as possible, and store it safely.

For content administration, less is more. Expert systems give us the freedom to do the bare minimum and to make a mess of it. While we must make content useful and enable as many services as possible, it is no longer feasible to accomplish that by designing systems for an anticipated future. Instead, we must create the conditions that attract developers of expert systems. This is remarkably simple: Make the full text and all data available with no strings attached.

Real Open Access.