Sunday, August 29, 2010

On metadata, Google, and why we still need librarians and libraries

This article in the Chronicle of Higher Education takes Google to task for its inadequate metadata.

What is metadata?  It's a word that I was utterly unfamiliar with up until about 6 months ago.  And my grasp of its meaning is still that of a non-expert.  My definition of "metadata" is:  the descriptive data attached to the electronic records of library materials, be they books, articles, or other documents/items, that allows search engines to find those materials.  (I am sure my colleagues here will correct me when I am wrong--or elaborate on that definition if they find it inadequate.)  In short, if there is bad or incomplete metadata, the best materials for the searching patron may never turn up when they type in subject categories, author names, book titles, or publication dates.

Google's massive book-scanning project inspires a great many alarmist conversations among academics, and I have my own reservations about the Google-ization of academic publishing.  But it seems clear to me that, if Google is going to do this, with the purpose of making nearly all academic publications accessible and searchable, they should at least do it correctly.  This article points to the errors already embedded in Google's metadata, and highlights the potential trouble for scholars if those errors are allowed to persist.

Do you expect to be able to find useful information when you Google things?  How can you expect it, if there are such serious flaws in their methods?  Perhaps this is a hint that Google is not actually the be-all end-all of academic search, but remains just a starting point.

Among the messages I take from the article today is the one that states: Google should be employing actual metadata experts in their books project.  And also that, for now, the need for librarians and their expertise is not going away anytime soon.


  1. One of the serious drawbacks of Google searching lies in the nature of language and meaning (together with a high reliance on what are essentially early 1960s technologies: keyword searching and the use of Boolean operators). While we may have something very specific in mind, the meanings that a particular term may carry can be numerous and contradictory. Finding the proper term to describe the "aboutness" of an item can easily lead to early hair loss (by tearing it out). One has only to think of bugs (entomology) vs. bugs (computer programming) vs. bugs (psychological term for "annoys") vs. bugs (criminology, for eavesdropping devices) vs. bugs (a colloquial term in some circles for being insane). Merely typing in the term `bugs` can turn up a whole range of items, many of which may have nothing to do with what the user really wants.

    This is where the value of librarians comes in. It has to do with pre-coordinated vs. post-coordinated search terms. By constructing controlled vocabularies (pre-coordinated subject terms), librarians (specifically catalogers) help to reduce the ambiguities that language can bring to the searching experience.

    Note that this is particularly important in the "softer" sciences (social sciences and humanities), where the definitions of terms can be very much discipline-specific, as opposed to the hard sciences, where terms are "hard wired". For example, the phrase "multi-processor" has a very specific meaning in computer science. I would dare say that in communications theory or anthropology, the signification would be far more ambiguous (and perhaps mutually contradictory).

    The whole question of "aboutness" (coming up with a term that describes a certain work that adequately covers all of the possible "subjects" of that work) is an extremely complex and difficult subject.

    The problem becomes even more acute when one is searching against a MARC catalog (where access to a staggering proportion of the world's academic materials resides). In this case, we can't rely on brute-force use of terminology to extract needed information (MARC records carry a greater or lesser number of subject-based terms, and the usual record is often limited to three, hardly enough to describe the work it is trying to represent).

    I would therefore hold that a multifaceted approach to searching, including document clustering, probabilistic retrieval, and other advanced information retrieval techniques developed over the past fifty years, would - in conjunction with Google-type searches - bring patrons closer to the ability to retrieve the articles that they need in their research.
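
To make the pre-coordinated vs. post-coordinated distinction above a little more concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the three catalog records, the subject headings, and the function names are invented, and this is not how Google or any real library catalog is implemented.

```python
# Hypothetical mini-catalog: each record has a title plus subject headings
# assigned by a cataloger. All records and headings are invented examples.
CATALOG = [
    {"title": "Bugs of the Eastern Forest",
     "subjects": ["Insects"]},
    {"title": "Squashing Bugs in C Programs",
     "subjects": ["Debugging in computer science"]},
    {"title": "Bugs and Wiretaps: A History",
     "subjects": ["Eavesdropping"]},
]

# A tiny controlled vocabulary: the ambiguous everyday term points the
# searcher to the unambiguous headings chosen in advance by catalogers.
SEE_REFERENCES = {
    "bugs": ["Insects", "Debugging in computer science", "Eavesdropping"],
}


def keyword_search(term):
    """Post-coordinated: match the raw term anywhere in the title."""
    term = term.lower()
    return [r["title"] for r in CATALOG if term in r["title"].lower()]


def subject_search(heading):
    """Pre-coordinated: match only an exact, authorized subject heading."""
    return [r["title"] for r in CATALOG if heading in r["subjects"]]


if __name__ == "__main__":
    print(keyword_search("bugs"))      # every sense of "bugs" comes back
    print(SEE_REFERENCES["bugs"])      # the vocabulary offers the choices
    print(subject_search("Insects"))   # only the entomology title
```

The keyword search returns all three titles because the word matches regardless of sense; the subject search returns only the record a cataloger has filed under the chosen heading.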

  2. It may seem a bit silly and somewhat counterintuitive, but what would you say to developing our "Google Fu"?

    Having grown up in a generation where I have had to derive information from computer systems, instead of having someone pre-coordinate the question, I have found it invaluable to understand how to structure my arguments/questions in terms of keywords/Boolean values.

    I agree with using advanced methodologies such as mutating algorithms, node clustering, etc. But even after a question is evaluated, if the information still resides in an older system, digital or analog, the query must always be deconstructed and input properly.

    Librarians seem very well versed in this within both the analog and proprietary (i.e. paid databases) methods. General search structure and access to free materials seem to be lost sometimes.

    To put it simply, I use Google Scholar / Books because the project has merit and, though the search structure may not be as sound, it is much more familiar and consistent with normal search methods.

    As for meta-data entry, I believe that curation of the string/Boolean/number values will become less relevant as more advances come in comparative data sets based on contextual language.

    You gotta love how you can Engineer anything.. Go Google.

  3. As if librarians provide complete metadata in their own catalog records. People who live in glass houses should not throw stones.

  4. A number of the issues addressed in the article were brought up in a January 2010 conference. Google is bringing into play library-generated metadata, including MARC 21 records (although one Google engineer apparently quipped that the machine-readable part of MARC was "a lie"). Here's the link:

  5. The Go To Hellman blog post on Google Books and metadata (in the January 2010 part of the blog, in case cutting and pasting the link above does not work) is very interesting. I think that another thing that this discussion reveals is that the creation of metadata is a subjective process, and as such reveals a disconnect between the people coming up with metadata categories and content, and the people doing the searches that require metadata.
    Librarians complaining about the quality of Google’s metadata (and how it makes it hard to find things) sound very similar to patrons complaining about how hard library catalogs are to figure out (and how hard it makes it to find things). That disconnect is inherent to the process. Enter: social science. The more we know about what it looks like when people search for particular kinds of information, in theory, the more effectively we can work towards programming searches and creating metadata that will get people where they need to go.

  6. I think that social science is only part of the answer. Much of the problem stems from the fact that, for the most part (at least within MARC-based systems), we are still relying on 1960s technology (Boolean logic) to retrieve information from our library catalogs. That works well with database systems (as opposed to "informationbase" systems) where you are looking for very concrete items (average salary, number of people in a certain department, address lists, etc.).

    The problem is that tools meant to retrieve data (individual pieces of clearly defined data, such as name, date of birth, years of employment, etc.) do not work well in information retrieval systems, where aspects of "aboutness" are FAR less hard-coded. In such cases, it is true that the creation of metadata is by its very nature a subjective process. Different people WILL use different terms in describing what a book or other source of information is "about". It can't be otherwise.

    Such a situation is bound to make the use of library catalogs in particular a difficult task. While I agree that ethnographic research can help us build better systems, there are already established technologies (although not within commercial ILS front-ends) that have been known for forty years or more that can be used in a library catalog to overcome a number of these difficulties.

    I thus see this as an area where ethnographers and information retrieval specialists can work iteratively to develop more usable catalogs that serve patrons' needs rather than produce frustration and a lack of results.
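
As a rough illustration of the gap between Boolean matching and ranked retrieval discussed above, here is a small Python sketch. The comment does not say which established technologies it has in mind; TF-IDF-weighted ranking is used here purely as an assumed example, and the three "documents" and all function names are invented.

```python
import math
from collections import Counter

# Invented sample "catalog": document id -> descriptive text.
DOCS = {
    "A": "history of insects and other bugs in agriculture",
    "B": "debugging software bugs in computer programs",
    "C": "insects of north america a field guide",
}


def tokens(text):
    """Lowercase and split on whitespace (deliberately simplistic)."""
    return text.lower().split()


def boolean_and(query):
    """Boolean AND: a document matches only if it contains every term."""
    wanted = set(tokens(query))
    return sorted(d for d, text in DOCS.items() if wanted <= set(tokens(text)))


def ranked(query):
    """Rank documents by a simple TF-IDF score over the query terms."""
    n = len(DOCS)
    # Document frequency: in how many documents does each term occur?
    df = Counter(t for text in DOCS.values() for t in set(tokens(text)))
    scores = {}
    for doc_id, text in DOCS.items():
        tf = Counter(tokens(text))
        score = sum(tf[t] * math.log(n / df[t])
                    for t in tokens(query) if t in tf)
        if score > 0:
            scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    query = "bugs insects guide"
    print(boolean_and(query))  # no single document has all three terms
    print(ranked(query))       # partial matches, best-scoring first
```

With this query the Boolean search comes back empty because no record contains all three words, while the ranked search still returns every partially matching record, ordered by how rare and how numerous its matching terms are.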


This blog is being used for research purposes, and your comment may be used in discussions and/or publications regarding research on patron work habits in the Atkins Library.
Please keep comments clean and constructive.