A Glamor Make-Over for Data? OpenRefined and Air(table)Brushed

This week, we are working on completing the deliverable inventory for MITH, as well as getting into the contributor guide and the final report. In the process of polishing the inventory, I’ve monkeyed around in OpenRefine, which is indeed a useful tool for quickly crafting and editing a controlled vocabulary and applying it to existing documents. The facets provide an easy-to-parse list of values that sits side by side with the main sheet. It is an interesting contrast to Airtable. Airtable has invested much more in its GUI and so looks slick, but it expects users to follow certain pathways and does not easily allow for deviation without considerable backtracking. OpenRefine is much more flexible and, in some ways, more intuitive: it lets you process the data simply to extract what you want. Still, despite its limitations in data manipulation, Airtable offers a variety of visualizations that are useful for presenting and searching the finished product. The subject tags in particular should be powerful search tools, and possibly useful in nudging future contributors toward conformity with the existing collection. By using both tools, I’m satisfied that the files I’ve processed are pretty close to (or at) their final form to turn over to MITH and LCHP.
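
For what it’s worth, that facet-then-fix workflow is easy to approximate outside OpenRefine too. Here is a minimal Python/pandas sketch of the same idea; the file name, the “Subject” column, and the example spellings are hypothetical stand-ins for our actual inventory, not the real headings.

```python
# A rough pandas approximation of OpenRefine's text facet + edit workflow.
# "lakeland_inventory.csv", the "Subject" column, and the variant spellings
# below are hypothetical placeholders, not our actual inventory.
import pandas as pd

inventory = pd.read_csv("lakeland_inventory.csv")

# A text facet is essentially a count of distinct values: variant spellings
# like "Univ. of Maryland" and "University of MD" end up listed side by side.
print(inventory["Subject"].value_counts())

# Once the facet exposes the variants, collapse them onto the controlled term.
fixes = {
    "Univ. of Maryland": "University of Maryland",
    "University of MD": "University of Maryland",
}
inventory["Subject"] = inventory["Subject"].replace(fixes)
inventory.to_csv("lakeland_inventory_cleaned.csv", index=False)
```

OpenRefine’s clustering does the variant-spotting for you, which is why I still prefer it for this step; the sketch just shows how little magic is actually involved.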

Jenny and I are also preparing to talk to Prof. Sies tomorrow morning. I’d like to talk about how the Omeka site was set up, the function of the Collections that seem to have fallen into disuse, and the instructions received by the students about how to ingest their records into the collection. The Folklife Guide and the eBlackCU manual emphasize the importance of clearly articulating the goals of the project, and of understanding the community that your collection represents. From working with the Omeka collection, it is clear that Prof. Sies’s class has played a foundational role in the creation of the records. It may not do so in the future, but the shape of the repository has been set by these outsiders to the community.

Blind leading the Blind, or How we are writing a contributor guide

We have been spending the week cleaning up the metadata in the inventory and setting up the controlled vocabularies for the Places/Institutions (Authority Records) and Subjects (Authority Records). Looking over the tags that we’ve entered, I definitely see some areas for discussion. While a location like Block 4 lot 12 is a concrete place, tags like “University of Maryland” or “First Baptist Church” are more indeterminate. They are technically institutions with a physical location, but in the context of the records in question they seem to act more as subjects of discussion than as physical bodies. Likewise, many people are named in the Subject (Authority Records) section, in spite of the existence of a People (Authority Records) section. Perhaps it is because these individuals are being discussed, rather than generating the records themselves? We need to meet as a group to discuss how we systematize these categories, especially since we anticipate that these are the major metrics through which users will search for records. But on the whole these are nuts-and-bolts issues that I think we will get through without having to reconceive the project.

I’m a little more nervous about writing the contributor guide. I think we are still very much at the blobby, nascent stage of this part of the project. I have looked at the resources that Dr. Johnston has sent us. The Folklife guide and the eBlackCU project, in particular, have given me some promising ideas about what to recommend for students conducting oral history interviews. However, these ideas are still contingent upon understanding the particular needs and capabilities of LCHP’s members and supporters.

To that end, I have contacted Mary Sies and hope to meet with her soon to understand the process by which these records were generated. It bothers me that we still know nothing about how the edit side of the LCHP Omeka site is set up and who has access to it. If we want to create an improved, streamlined workflow for ingest, shouldn’t we at least have an idea of how it is currently being done? The idea of us writing a contributor guide feels a little like the blind leading the blind. But, after all, archivists ingest records left by deceased or unreachable creators all the time, so perhaps I’m being too rigid in how I’m thinking about this.

The Mystery of the Missing Files

As with the rest of my group, my work on the project this week has been completing the metadata entry for the Omeka collection in our inventory. Like Maya and Lauren, I look forward to working on the controlled vocabulary, which I think will provide good backing for the recommendations that we make to MITH and LCHP. As I entered more and more data, I wondered about the usefulness of broadly defined Subject headings like “Housing” that are so ubiquitous as to apply to at least half of the records. Like Jenny, I think that my entries will benefit from a second scrubbing to impose consistency and eliminate errors. The current state of the Omeka collection definitely illustrates the importance of checking over your work, as well as of creating a consistent protocol for preserving metadata!

Which brings me to my main concern of the week: missing files. I need to check, but out of 746 files, we must have at least 50 missing, which is a big chunk! I think that we need to make a recommendation to LCHP about what to do with ghost records that have metadata…but no data. Eliminating them altogether means that the information contained in their titles will be lost. Furthermore, I think that some of the records can still be tracked down. Many of the missing records from my chunk of the work came from the city of College Park; it is entirely possible that they are still accessible to persistent researchers. Some of the missing oral histories seem to exist as audio cassettes, perhaps in the possession of LCHP members. On the other hand, the metadata don’t seem very reliable if there is no record attached to them; how useful can that be? Some of the missing records not only lack the data, but also lack meaningful metadata. Perhaps a contingent approach can be applied to records on an individual basis. Ultimately, this is a decision for LCHP and MITH, but I do think that we have a responsibility to offer an actionable suggestion.
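
If it helps that discussion, flagging the ghost records is mechanical once the inventory is in a spreadsheet. A quick sketch, assuming hypothetical column names (“File”, “Identifier”, “Title”) rather than our actual headings:

```python
# Pull out "ghost records": rows that carry metadata but point to no file.
# The column names and file names here are hypothetical placeholders.
import pandas as pd

inventory = pd.read_csv("lakeland_inventory.csv")

# Treat an empty or missing "File" cell as a missing file.
no_file = inventory["File"].isna() | (inventory["File"].astype(str).str.strip() == "")
ghosts = inventory[no_file]

print(f"{len(ghosts)} of {len(inventory)} records have metadata but no file")
ghosts[["Identifier", "Title"]].to_csv("missing_files.csv", index=False)
```

A list like that would also make it easier to hand LCHP something concrete to chase down, record by record.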

I had originally been skeptical about preserving the contributor/creator metric, but now it’s sort of become the basis of a kind of “naughty or nice” list in my mind. There were definitely some students who were very thorough (although still not consistent) at entering metadata (Hello Jocelyn Knauf and Gregory McCampbell!), while others were…not. I really want to know who the “wilmer” was who entered all of the City of College Park Urban Renewal Authority appraisal photos. They were prolific, so I can’t call them lazy, but the quality control on those documents was extremely spotty, and most of my missing records come from this contributor.

Puzzling out Lakeland

As Maya noted, right now the whole group is putting in time entering the metadata of Lakeland’s Omeka collection. The metadata coverage is erratic, to say the least, with very spotty coverage of traditional Dublin Core categories and inconsistent terminology. Perhaps the most difficult and frustrating part of data entry is keeping myself from “improving” the data too much. We have agreed as a group to add short descriptions, to correct obvious typos, and to note our additions and edits by using italics. Beyond these minor improvements, I sense that even within my own record keeping there is some drift in the subject and place/institution categories, and I assume that everyone is creating slightly varied wordings for the same concepts or organizations. We tried to minimize differences by using tags within Airtable, but as I create more and more tags, I wonder if I should have used some of these newer ones in earlier records. The next time we meet, I’d like to talk about creating a controlled vocabulary for the Subject (Authority Records) and the Place/Institution (Authority Records). Part of putting together a puzzle is making sure you are working with the pieces from the right set.
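
To make that concrete, what I have in mind is something as simple as a shared lookup table from free-text tags to canonical terms, so drift gets caught at entry time rather than discovered later. A hypothetical sketch (the entries are made up for illustration and would come out of our group discussion):

```python
# A hypothetical controlled-vocabulary lookup: every free-text tag maps to one
# canonical Subject or Place/Institution term. The entries are illustrative only.
CANONICAL = {
    "umd": "University of Maryland",
    "univ. of maryland": "University of Maryland",
    "first baptist": "First Baptist Church",
    "urban renewal": "Urban renewal",
}

def normalize(tag: str) -> str:
    """Return the canonical term for a tag, or raise so the group can discuss it."""
    key = tag.strip().lower()
    if key not in CANONICAL:
        raise ValueError(f"'{tag}' is not in the controlled vocabulary yet")
    return CANONICAL[key]
```

The point is less the code than the discipline: a new tag only enters the vocabulary after we agree on it.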

Still, despite the tedium of data entry and my concerns about consistency, I feel like I am learning a lot about Lakeland and its history. Bit by bit, I am seeing the foundational importance of the railroad, the fraught relationship between the university and Lakelanders, the complex manifestations of racism, and the ravages of urban renewal. Certain families (the Brookses, the Grosses, the Braxtons) and certain individuals (the developer John Kleiner, the assessor John Shank, councilman Leonard Smith) resurface again and again. It makes me nostalgic for my old work as a historian in the archives, to be honest. I would love to see the stories that Mary Sies and her students, and Maxine Gross and her community, weave together with these records.

Day jobs and third shifts

Last week at this time, I was happy just to have secured an initial meeting with our client, the Lakeland Community Heritage Project. This week, as we try to jump into the work of the project, it is clear that we need much more information than we were able to gain in a 90-minute meeting. We haven’t actually been able to get into the Omeka site through which the collection is accessed. We only just got read-only access to the Airtable that MITH has started. We still need to explore and discuss as a group the options for how to create a usable inventory.

During the meeting, I was impressed by the passion and dedication that the members of the working group brought to the project. They really believe in the importance of sustaining community by creating and preserving community archives. Yet this project is still an avocation for everyone involved. From the president of LCHP to the MITH team members to ourselves, everyone has a day job. LCHP, no matter how important in principle, gets lost in more urgent day-to-day demands. So scheduling meetings is like pulling teeth. Email and online chatting through Slack are better, but still dependent on catching people at the right moment. I am noting this not (just) to vent, but to illuminate one of the key challenges of community archives. Largely supported by volunteer labor, community archives often exist only through gifts of time and labor that can be given intermittently and often inconsistently. And we should be grateful! Most of the people involved in LCHP (and, I imagine, most other volunteer-supported community archives) are already working day jobs and second shifts beyond what they give back to the community. Since curation involves providing for the sustainability of repositories and records, we will need to account for this reality in whatever solution we propose. Quick comprehensibility and ease of use are crucial for making effective use of volunteer and amateur labor.


Knowing and doing: Epistemologies of curation (week 5)

I agree with Andy that Yakel and Dallas are not as different in their conceptualization of digital curation as Dallas wants them to be. Both push for a more comprehensive and inclusive view of digital curation (actually, of curation in general) than had previously been in use. Specifically, they emphasize looking beyond preservation (or custodianship) to include actions from all points in the life cycle of data—what Dallas calls the records continuum. I would guess that the main differences between Yakel and Dallas arise from the audiences for which they are writing. Despite Dallas’s scoffing, models can be useful for beginning to understand a topic. As a novice, I appreciate the prescriptive definition Yakel gives. I find it helpful to have a starting point in the succinct and suggestive conceptual model Yakel offers, rather than diving head-first into the complex and ever-expanding web of examples, exceptions, and jumping-off points that Dallas presents. In addition, I would imagine that for existing institutional repositories looking to update and expand their collections’ impact, Yakel’s five-point list of digital curation activities could serve as a framework for an action plan (338).

When it comes to following through with an action plan, however, Dallas’s challenging and inclusive approach may be more useful. By clearly identifying stakeholders beyond established institutions and the professionals who staff them, and by bringing context to the forefront of digital curators’ responsibilities, Dallas allows for more symbiotic, reciprocal (and hopefully more productive) processes of digital curation. Constant redefinition of models can become a tautological dead-end that misses digital curation’s potential to put information to practical use. So I see what Dallas is pointing toward in suggesting a pragmatic embrace of the specificity of particular problems.

Yet this nuance is only helpful in situations where people actually know what they are doing. Assuming competence is very humanistic and lovely, and Dallas does point to studies where passionate amateurs are startlingly good at what they do (432-433). But in our age of “fake news” and propaganda bots, the devaluation of trustworthiness seems dangerous. Projects may not be maintained or followed through for lack of a clear management hierarchy or set of precedents. There must be a happy medium between an authoritarian “taming of the wild frontier” and fractally proliferating “contact zones.”

I am really glad to have read both articles as I start the group project and work on the WikiProject assignment. The clarity of Yakel’s model prepares me to implement the ethos of Dallas’s theory. In a day-to-day sense, Dallas’s approach will probably resonate most often, as the Lakeland Community Heritage Project depends on community participation to provide its collection and to ensure its relevance. But I’m glad that Yakel defined the terms and concepts she did, because otherwise I wouldn’t be able to understand what Dallas’s article meant.


Week 4: Initial impressions of the TRAC article

I am taking on the article on TRAC (Trustworthy Repository Audit and Certification). It’s just a stub of an article, of interest (but not much) to the WikiProject Libraries and WikiProject Digital Preservation working groups. What I find interesting about it is that trustworthiness ought to be of concern but isn’t. Why not? The article is deemed of Low Importance by the WikiProject Libraries group, and unrated beyond “stub” by the WikiProject Digital Preservation group. Despite its stub status, TRAC is still better documented on Wikipedia than other measures of trustworthiness such as the Nestor system (an alternative) or the Trusted Digital Repository (TDR) Checklist (its replacement). It also lacks any named person as a major contributor; instead, it was developed in committee by organizations (OCLC and CRL/NARA). Last, it really needs a section on how TRAC has been implemented and adapted in real-life contexts. There are a number of scholarly articles that explain how different repositories and scholars have received and interpreted TRAC (including this one by class favorites E. Yakel and A. Kriesberg!).

Ironically, the TRAC article itself has a trustworthiness problem. The strongest and most useful part of the article appears to be not its text but an infographic illustrating the “family tree” of standards to which TRAC belongs. It clearly and succinctly lays out the influences on and of TRAC. Yet that is problematic in and of itself, because the graphic was created by a user without any references or sourcing. A further look at the references also shows a shallow pool of sources that leans on institutional blogs as citations. The information doesn’t look wrong per se, but it does not conform to the standards for quality references that Wikipedia lays out.

Coming from a history background, it feels a bit off to rate primary sources, such as institutional websites or blogs describing the process of making the standard, as lesser than secondary sources. However, this article illustrates why building directly off of the raw evidence is problematic: it necessitates analysis that is unvetted (like the TRAC family tree). Citing academic publications potentially insulates Wikipedia from biased analysis and misinformation.

I’m still unsure of the approach I should take in expanding this article. For example, CRL (Center for Research Libraries) has the full checklist on its website; should I cite it? On one hand, it’s literally the most direct piece of information a user could find to learn what TRAC is. On the other hand, doing so seems to run up against Wikipedia’s preference for avoiding institutional websites as sources. Another tack that I am debating is creating pages for Nestor and TDR so that TRAC could be contextualized more fully. However, I don’t know if that goes beyond the desired ambit of the assignment, or if there is a good reason why this hasn’t been done by the invested WikiProject groups. I think I still need to do more to familiarize myself with the topic and the page’s culture before I feel comfortable mucking around with it.

Wiki-editing: A first impression

I was struck by how easy it was to write and edit the Wikipedia entries; like many others in this class, I didn’t have much difficulty with the procedural aspects. That left more time to wonder about the varying quality of the articles. Some subjects were clearly hubs of conversation and active dissemination of reliable information, such as the article on Digital Preservation and the biography of Margaret Hedstrom. Others, such as the community archives article or the Elizabeth Yakel biography, seem disproportionately lacking, given the lively scholarly conversations around them that I have seen. I ended up adding a few citations to the community archives article because it seemed so obviously skimpy, with several assertions that lacked citations. I actually found that the community archives article was so neutral vis-à-vis the uses of community archives for activism and for challenging institutional power structures as to be unhelpful. On a more hopeful note, I still don’t think I saw any obviously misleading or incorrect information, even in the less substantive articles.

One issue that I saw as I compared the meaty articles to the thin ones was the level of connectivity within Wikipedia. The Digital Preservation article was robust because of the active community conversing on its edit page, but also, presumably, because of the network of articles in which it is embedded. It was linked to and from other pages, which gives readers and editors multiple pathways for finding the article. As Andy pointed out, the Digital Preservation WikiProject at least pays lip service to improving related articles. The community archives article, by contrast, was included in the archives category in Wikipedia, but was not linked from other Wikipedia articles that could lead readers there or contextualize the term’s significance. As a Wikipedia user, I love falling down a rabbit-hole of linked associations and finding that my understanding of the subject compounds, rather than just adds up.
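
Out of curiosity, connectivity is something that can actually be measured: the MediaWiki API exposes a “backlinks” list of the pages that link to a given article. A rough sketch of how I might compare the articles I looked at (the titles are as I’ve referred to them here and might need adjusting; this only counts the first batch of up to 500 results, so heavily linked articles would need to follow the API’s continuation token):

```python
# Count incoming links ("backlinks") to a Wikipedia article via the MediaWiki API.
# Only the first batch (up to 500 results) is counted in this sketch; article
# titles are taken from the post above and may need adjusting to exact page names.
import requests

def count_backlinks(title: str) -> int:
    resp = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={
            "action": "query",
            "list": "backlinks",
            "bltitle": title,
            "bllimit": 500,
            "format": "json",
        },
    )
    return len(resp.json()["query"]["backlinks"])

for title in ["Digital preservation", "Community archives"]:
    print(title, count_backlinks(title))
```

Even a crude count like that would put a number on the contrast I noticed between the well-networked articles and the orphaned ones.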