Friday, December 21, 2007

Current CMA Documentation Available

Coinciding with the availability of Fedora 3.0 Beta 1 this week, the first round of semi-official CMA (formerly called CMDA) documentation is now available: The Fedora Content Model Architecture. As Dan points out, we'll be doing some name changes before it's all said and done, but so far this is the most up-to-date diagram of the supporting object-object relationships:

As implemented, the BDef and BMech objects are basically unchanged. Here's what the new CModel control object looks like:

The DS-COMPOSITE-MODEL datastream specifies the structural requirements of member objects. The dsCompositeModel.xsd schema describes the expected format. For example, here's the DS-COMPOSITE-MODEL of info:fedora/fedora-system:ContentModel:

<dsCompositeModel xmlns="...">
  <dsTypeModel ID="RELS-EXT">
    <form MIME="text/xml"/>
  </dsTypeModel>
  <dsTypeModel ID="DC">
    <form MIME="text/xml"/>
  </dsTypeModel>
  <dsTypeModel ID="DS-COMPOSITE-MODEL">
    <form MIME="text/xml"/>
  </dsTypeModel>
</dsCompositeModel>

Pretty simple. It says that member objects must have at least these datastreams, and that each must be in the form specified. If multiple forms are listed in a single dsTypeModel, the datastream may be in any of those forms.
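To make the rule concrete, here's a minimal sketch of the check in Java. It is not Fedora code: the class name, the method name, and the map-based representation of the model (datastream ID to allowed MIME types) are all assumptions made for illustration. It only encodes the two rules stated above: every listed datastream must be present, and its form must be one of those listed.

```java
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; none of these names come from Fedora.
class CompositeModelCheck {

    /**
     * model:  datastream ID -> MIME types allowed by the dsTypeModel
     * object: datastream ID -> MIME type the object actually has
     */
    static boolean conforms(Map<String, Set<String>> model, Map<String, String> object) {
        for (Map.Entry<String, Set<String>> required : model.entrySet()) {
            String actualMime = object.get(required.getKey());
            if (actualMime == null) {
                return false; // a required datastream is missing
            }
            if (!required.getValue().contains(actualMime)) {
                return false; // present, but not in any of the listed forms
            }
        }
        return true; // datastreams beyond the required ones are allowed
    }

    public static void main(String[] args) {
        Map<String, Set<String>> model = Map.of(
            "RELS-EXT", Set.of("text/xml"),
            "DC", Set.of("text/xml"),
            "DS-COMPOSITE-MODEL", Set.of("text/xml"));
        Map<String, String> object = Map.of(
            "RELS-EXT", "text/xml",
            "DC", "text/xml",
            "DS-COMPOSITE-MODEL", "text/xml",
            "THUMBNAIL", "image/jpeg"); // extra datastream, made up for the example
        System.out.println(conforms(model, object));
    }
}
```

The "at least" wording is why the extra THUMBNAIL datastream does not cause a failure in this sketch.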

Sunday, December 16, 2007

Fedora 3.0 - Where's the Binding Map?

Okay, I'm excited.

After several months of effort, Fedora Commons 3.0 Beta 1 should go live sometime this week. For most Fedora users, this Beta will be their first real exposure to the Content Model Dissemination Architecture, or CMDA. (The name is subject to change before 3.0-final.)

Among other things, the CMDA allows people to attach runtime behaviors to digital objects at a class level. This architectural change has been a long time coming for Fedora, and we've worked hard to get the design right. Dan is working on the official design doc for publication with the software, but here's a simple overview of how it works:

The Fedora-defined CMDA relationships are expressed in RDF in the RELS-EXT datastream of each referring object. As long as all the necessary relationships exist, Fedora will use them to provide the desired behaviors for each data object. By design, the Resource Index does not need to be enabled for this to work.

One question that will inevitably arise for those familiar with Fedora's traditional disseminators is, "Where's the Binding Map?" The short answer is that it no longer exists. For the long answer, continue reading.

Background
To support extensible "views" or "behaviors" on digital objects, prior versions of Fedora required each object to include a special piece of metadata called a disseminator. The disseminator included a reference to a "Behavior Definition" (an object that defines the behaviors), a "Behavior Mechanism" (an object that grounds the behaviors to a specific implementation), and lastly, a "Datastream Binding Map". The binding map's purpose was to map the datastream IDs in the object to specific input requirements of the BMech.

CMDA Implementation of Behaviors

With the CMDA, behavior subscription is now done at the content model level. Among other useful properties, this design allows people to significantly change behaviors for whole classes of objects without making changes to (or visiting) every single one.

Since the content model object would now appear to occupy the role of the old per-object disseminator, if a datastream-to-BMech-input mapping existed, it would go in the content model, right?

Actually, I don't think so. In general, a content model is intended to be a sharable object that survives through time. It a) describes a class of objects by their structure, and b) indicates which operations/behaviors they should have within a repository. In order for it to be as sharable and survivable as possible, the content model must not dictate *how* the operations are to be executed. That's the job of the BMech.

Part of the "how" is deciding which (if any) of the datastreams defined by the content model actually need to be given as input to the code that executes the behavior. At a high level, BMechs are bound to content models, and not vice-versa. The direction of the relation is important. It's the BMech's job to pick apart the content model it works with and decide how it's going to fulfill the contract with the given pattern of data.

Therefore the mapping, if necessary, is really a BMech implementation detail. But if a BMech only isContractor for one content model, then there's really no point to having the extra indirection...just make the part names in the BMech match the datastream IDs and be done with it. That's the simplest approach, and the one that I think will get people "up and running" with the CMDA the quickest.
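As a rough illustration of what "no extra indirection" means, here's a sketch in Java. Everything in it is hypothetical (the class, the method, and the idea of representing an object's datastreams as an ID-to-location map); it is not how Fedora resolves inputs internally. The point is only that when part names equal datastream IDs, resolving a BMech's inputs is a direct lookup with no binding map in between.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not a Fedora API.
class DirectBinding {

    /**
     * partNames:           input part names declared by the BMech
     * datastreamLocations: datastream ID -> location, for one data object
     * Returns part name -> location, treating each part name as a datastream ID.
     */
    static Map<String, String> resolveInputs(List<String> partNames,
                                             Map<String, String> datastreamLocations) {
        Map<String, String> inputs = new LinkedHashMap<>();
        for (String part : partNames) {
            String location = datastreamLocations.get(part);
            if (location == null) {
                // The object lacks a datastream the BMech expects by name.
                throw new IllegalArgumentException("No datastream with ID " + part);
            }
            inputs.put(part, location);
        }
        return inputs;
    }

    public static void main(String[] args) {
        // Datastream IDs and locations below are invented for the example.
        Map<String, String> datastreams = Map.of(
            "DC", "location-of-DC",
            "IMAGE", "location-of-IMAGE");
        System.out.println(resolveInputs(List.of("IMAGE"), datastreams));
    }
}
```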

But, you ask, what if you want to use the same BMech for content models that differ only in their datastream IDs? First, if possible, consider merging those content models. It'll make life easier for you in the long run. If that's impractical or doesn't make sense for your use case, then just create a BMech for each -- one that only differs in the part names used.

For Fedora 3.0b1, what this means in a practical sense is that people who have lots of variance in their datastream IDs will either need to "bring them in line" (which is a very practical thing to do in its own right, for ease of management), or will need to define different content models for them, which use different BMechs, even if they formerly used the same BMechs.

The migration tools (which I'm writing the docs for now) will do the latter automatically, creating Content Models and BMech copies with appropriate IDs. If people want a "cleaner" upgrade, they need to invest some sweat in getting their datastream IDs consistent prior to running the analysis (the first of three phases of migration) so they don't end up with an unmanageable set of BMech copies.

3.0-final and Beyond
Two things absent from the Beta 1 release that should be present in 3.0-final are 1) the ability to assert object-object relationship constraints as part of the formal definition of a content model, and 2) a basic validator that can take a content model and an object that claims to adhere to it, and tell whether the object actually complies.

For 3.0b1, we've kept the "Fedora Object Type" idea around. Viewed through this old lens, there are only four basic kinds of Fedora digital objects. We know that there is some overlap with the "typing" introduced by the CMDA. As the CMDA takes hold, I think the idea of "Fedora Object Type" can be gracefully subsumed by content model.

In future releases, the BMech will also evolve to something more flexible. We know people have got a lot of mileage out of simple web service HTTP GET bindings, but other methods, protocols, and even in-VM code bindings are definitely called for. With the CMDA, we are now in a much better position to do these things.

Another idea that keeps popping up in CMDA discussions is, can an object be its own content model? Or from a slightly different angle: Can a content model play the role of a Data Object, and thus act as a template? Also, what about multiple content models per object? Inheritance?

Blue Skies - CC Licensed - by Sybren Stüvel - http://www.flickr.com/photos/sybrenstuvel/520362534/
These questions hit on design, implementation, and best practices issues, all of which we are now in a much better position to discuss with the release of 3.0b1. I'm looking forward to it.

Tuesday, August 28, 2007

Fedora Commons Launched

For those who haven't heard yet.... this is great news.

Carol also has some pictures from the launch celebration over at NSDL Road Reports.

Here's the text of the official announcement:

FEDORA COMMONS AWARDED $4.9M GRANT TO DEVELOP OPEN-SOURCE SOFTWARE FOR BUILDING COLLABORATIVE INFORMATION COMMUNITIES

(Ithaca, New York, August, 2007) - Fedora Commons announced the award of a four year, $4.9M grant from the Gordon and Betty Moore Foundation to develop the organizational and technical frameworks necessary to effect revolutionary change in how scientists, scholars, museums, libraries, and educators collaborate to produce, share, and preserve their digital intellectual creations. Fedora Commons is a new non-profit organization that will continue the mission of the Fedora Project, the successful open-source software collaboration between Cornell University and the University of Virginia. The Fedora Project evolved from the Flexible Extensible Digital Object Repository Architecture (Fedora) developed by researchers at Cornell Computing and Information Science.

With this funding, Fedora Commons will foster an open community to support the development and deployment of open source software, which facilitates open collaboration and open access to scholarly, scientific, cultural, and educational materials in digital form. The software platform developed by Fedora Commons with Gordon and Betty Moore Foundation funding will support a networked model of intellectual activity, whereby scientists, scholars, teachers, and students will use the Internet to collaboratively create new ideas, and build on, annotate, and refine the ideas of their colleagues worldwide.

With its roots in the Fedora open-source repository system, developed since 2001 with support from the Andrew W. Mellon Foundation, the new software will continue to focus on the integrity and longevity of the intellectual products that underlie this new form of knowledge work. The result will be an open source software platform that both enables collaborative models of information creation and sharing, and provides sustainable repositories to secure the digital materials that constitute our intellectual, scientific, and cultural history.

Recognizing the importance of multiple participants in the development of new technologies to support this vision, the Moore Foundation funding will also support the growth and diversification of the Fedora Community, a global set of partners who will cooperate in software development, application deployment, and community outreach for Fedora Commons. This network of partners will be instrumental for making Fedora Commons a self-sustainable non-profit organization that will support and incubate open-source software projects that focus on new mechanisms for information formation, access, collaboration, and preservation.

According to Sandy Payette, Executive Director of Fedora Commons, "the new Fedora Commons can foster technologies and partnerships that make it possible for academic and scientific communities to publish, share, and archive the results of their own work in a free, open fashion, and make it possible to analyze and use content in novel ways."

"Establishing a sustainable open-source software system that provides the basic infrastructure for on-line communities of scholars will have enduring impact. The unanticipated cross-disciplinary uses of this open platform are the hallmark of this revolutionary infrastructure," said Jim Omura, technology strategist with the Gordon and Betty Moore Foundation.

Payette also noted, "The open-source software that is developed and distributed by Fedora Commons can impact the entire lifecycle of what is often referred to as "e-Research" and "e-Science," including storage of experimental data, analysis of experimental results, peer review, publication of findings, and the reuse of published material for the next generation of scholarly works. We will also continue our work with libraries and museums to facilitate the sharing of digitized collections, making previously locked away material available to wide audiences. Also, building on our attention to digital preservation in the Fedora open-source repository system, Fedora Commons will continue to stress the importance of the sustainability of digital information in applications of our work."

About Fedora Commons
Fedora Commons is a non-profit organization whose purpose is to provide sustainable open-source technologies to help individuals and organizations create, manage, publish, share, and preserve digital content upon which we form our intellectual, scientific, and cultural heritage. Since 2001, with support from the Andrew W. Mellon Foundation, Cornell University and the University of Virginia have collaborated on the Fedora Project which has developed, distributed, and supported innovative open-source repository software that combines content management, web services, and semantic technologies. The Fedora software has been adopted worldwide to support an array of applications including open-access publishing, scholarly communication, digital libraries, e-science, archives, and education.

Fedora Commons will initially be located in the Information Science Building at Cornell University, Ithaca, New York. The Executive Director of Fedora Commons is Sandy Payette, who co-invented the Fedora architecture and led the Cornell arm of the open-source Fedora Project. The Board of Directors of Fedora Commons provides leadership from multiple communities, including open-access publishing, digital libraries, sciences, and humanities. For more information, visit http://www.fedora-commons.org.

About the Gordon and Betty Moore Foundation
The Gordon and Betty Moore Foundation, established in 2000, seeks to advance environmental conservation and cutting-edge scientific research around the world and improve the quality of life in the San Francisco Bay Area. The Foundation's Science Program seeks to make a significant impact on the development of provocative, transformative scientific research, and increase knowledge in emerging fields. For more information, visit http://www.moore.org.

CONTACT:
Fedora Commons: Sandy Payette
(607) 255-9222, payette@cs.cornell.edu
http://www.fedora-commons.org
Gordon and Betty Moore Foundation: Greg Nelson
(415) 561-7427, greg.nelson@moore.org

Thursday, March 29, 2007

FTP ASCII unmangler

FTP text mode is evil.

I made the mistake of transferring several important binary files from OS X to Windows last night, using FTP. Actually, a few mistakes were made along the way. 1) I didn't check that I was in BIN mode first, 2) I didn't verify the integrity of the files after the transfer, and 3) I deleted the sources.

Luckily, it was Unix-to-Windows, which means all #10 octets were replaced with #13#10. First I tried dos2unix with no luck. Then I wrote a program to replace all #13#10 sequences with #10 and crossed my fingers.

It worked. Here it is in all its inefficient glory. Maybe this will help someone else someday. No guarantees, but it's worth a shot if you're desperate.

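import java.io.*;

// Undoes FTP text-mode damage: every #13#10 pair in the input becomes #10.
// Usage: java Unmangle <mangled-input-file> <repaired-output-file>
public class Unmangle
{
    public static void main(String[] args) throws Exception
    {
        // try-with-resources closes both streams even if an I/O error occurs
        try (InputStream in = new FileInputStream(args[0]);
             OutputStream out = new FileOutputStream(args[1]))
        {
            int prev = 0;
            int b = in.read();
            while (b != -1)
            {
                // A #13 is held back until the next byte is known.
                // If that byte is not #10, the #13 was real data: emit it.
                if (prev == 13 && b != 10)
                    out.write(13);
                if (b != 13)
                    out.write(b);
                prev = b;
                b = in.read();
            }
            // A #13 at the very end of the file is still being held back.
            if (prev == 13)
                out.write(13);
        }
    }
}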
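
Since the whole mess started with not verifying the files, here's a small in-memory round-trip check of the same logic. It's a sketch under one assumption, the one stated above: that the only damage done by the transfer was inserting a #13 before every #10. The class and method names are made up for the example, and `unmangle` is just the loop from Unmangle applied to a byte array.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

// Hypothetical helper for checking the unmangling logic without touching files.
class RoundTrip {

    // Simulates the assumed damage: every #10 becomes #13#10.
    static byte[] mangle(byte[] data) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte b : data) {
            if (b == 10) {
                out.write(13);
            }
            out.write(b);
        }
        return out.toByteArray();
    }

    // Same loop as Unmangle: drops the #13 from each #13#10 pair.
    static byte[] unmangle(byte[] data) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int prev = 0;
        for (byte raw : data) {
            int b = raw & 0xff; // treat the byte as unsigned, like InputStream.read()
            if (prev == 13 && b != 10) {
                out.write(13);
            }
            if (b != 13) {
                out.write(b);
            }
            prev = b;
        }
        if (prev == 13) {
            out.write(13); // flush a trailing #13
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Includes a lone #10, a genuine #13#10, doubled #13s, and a trailing #13.
        byte[] original = {1, 10, 13, 10, 13, 13, 2, 13};
        byte[] restored = unmangle(mangle(original));
        System.out.println(Arrays.equals(original, restored)); // prints true
    }
}
```

Under that assumption the repair is exact even for files that already contained #13#10 sequences, because the transfer turns those into #13#13#10 and the loop strips only the #13 directly before each #10.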

Thursday, February 01, 2007

Social Bookmarking == Free Metadata

Metadata is expensive. Librarians aren't the only ones privy to this fact.

I remember the pain we went through in bringing HP's FTP site to the web. The first step was converting the old README files to HTML. Perl made this a snap, but somehow it didn't address the now-more-apparent quality problem. So we slurped it all into a Paradox database and had a big metadata entry party.

Ok, the word "party" might be a stretch. There was technically pizza involved, but it was more of a bribe. It lasted days, and nobody really celebrated 'till it was over.

I distinctly remember the phrase "metadata monkey" entering my vernacular at that point.

I don't have anything against monkeys. Monkeys are cute. Monkeys at keyboards are even cuter. But metadata entry has long been viewed as a thankless job.

Now, sites like del.icio.us have figured out a way to get metadata monkeys to work for free. The incentive? Not pizza, not even bananas: just the ability to store our own descriptions, share those descriptions with others, and access it all from anywhere.

It makes me wonder about the role of the library in creating authoritative, versus curating social, metadata.

Sunday, January 28, 2007

Resources, Representations, Repositories, and RDF

Last week, Carl Lagoze gave an update on the OAI-ORE work at Open Repositories '07. ORE is a new project that intends to specify how heterogeneous repositories can exchange information about the digital objects they hold. Although they're not necessarily going after a new protocol, I still think of it as taking OAI-PMH to the next level. It's not just about metadata anymore.

For me, the most interesting parts of the talk were webarch-related. It all started with the statement (to paraphrase) "we must build on the web architecture". Carl then pointed out how representations are essentially second-class citizens on the web.

That got me thinking. At the most basic level, repositories are all about managing bitstreams (whether they're considered data or metadata). In webarch, bitstreams seem to equate to what they call "representation data". And a representation is defined by how it relates to a resource:
"A representation is data that encodes information about resource state."
So, in w3c-speak, a repository manages representation data. Okay, that's just a terminology change. But what about this statement:
"For robustness, Web architecture promotes independence between an identifier and the state of the identified resource."
That makes a whole lot of sense for the web when you consider how often web pages change. But what does it mean for repositories? How do we manage bitstreams if we can't identify them? The answer must be one of the following:
  • Indirect identification. Identify the associated "resource" in order to work with the bitstream(s).
  • Reification. Elevate the bitstream to a "resource" so we can talk about it.
How about if we want to model the repository as an RDF graph? Well, we know that representations can have metadata in addition to the payload. So in order to do this modeling, we need to reify. Internal to a repository, representation triples might look something like:

representationA represents urn:example:someTextFile
representationA contentType "text/plain"
representationA payloadLocation "/path/to/someTextFile.txt"
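
To show what working with such reified triples could look like, here's a minimal Java sketch. It is only an illustration: the Triple class and the payloadsFor method are invented for the example, the predicate names are the informal ones used above rather than terms from any real vocabulary, and it says nothing about what OAI-ORE will actually specify.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory model of the representation triples above.
class RepresentationGraph {

    static final class Triple {
        final String subject;
        final String predicate;
        final String object;

        Triple(String subject, String predicate, String object) {
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
        }
    }

    // Finds the payload locations of every representation of the given resource.
    static List<String> payloadsFor(List<Triple> graph, String resource) {
        List<String> payloads = new ArrayList<>();
        for (Triple rep : graph) {
            if (rep.predicate.equals("represents") && rep.object.equals(resource)) {
                // rep.subject is a reified representation; look up its payload.
                for (Triple t : graph) {
                    if (t.subject.equals(rep.subject) && t.predicate.equals("payloadLocation")) {
                        payloads.add(t.object);
                    }
                }
            }
        }
        return payloads;
    }

    public static void main(String[] args) {
        List<Triple> graph = List.of(
            new Triple("representationA", "represents", "urn:example:someTextFile"),
            new Triple("representationA", "contentType", "text/plain"),
            new Triple("representationA", "payloadLocation", "/path/to/someTextFile.txt"));
        System.out.println(payloadsFor(graph, "urn:example:someTextFile"));
    }
}
```

Because the representation has its own identifier, the bitstream can be found by starting from the resource, which is the indirect identification option, while the representation itself can still carry metadata such as its content type.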

I think the OAI-ORE work is going to attempt something like the above: a model (and maybe a format?) for expressing resource-representation information in a repository-neutral way. It will be interesting to see what pops out.