Week 12: Where Metadata Ends and Contributor Guide Begins

It’s been satisfying this week to see our combined controlled vocabulary go from 346 Places/Institutions down to 261 with the elimination of duplicates. I’ve also been looking through the community heritage resources Jesse sent us, and they seem like they’ll be very helpful for my part of the Contributor Guide describing how to input and process future files. Although we haven’t heard back from MITH yet (as Maya pointed out), we did just hear back pretty quickly from Dr. Sies, the American Studies professor whose classes have worked on the Lakeland Community Heritage Project in the past. We’re working on coordinating a meeting with Dr. Sies in the near future to get the missing context we discussed last week. Hopefully next week we’ll be reporting back on a successful meeting!

Controlled Vocabulary and Moving on to the Contributor Guide

This week, I cleaned up my metadata and added my sections to our controlled vocabulary. I tagged each item not in a collection with our “None” tag and copy-and-pasted my Places/Institutions and Subject columns into our Controlled Vocabulary sheet. Today, I deleted blank rows (items without subject tags or locations) and duplicates. Per Suzanne’s suggestion, our next step with the controlled vocabulary is to alphabetize our terms and delete duplicates again.
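
Since we will need to repeat this dedupe-and-alphabetize pass at least once more, here is a minimal sketch of the same steps in Python/pandas, assuming the sheet is exported as a CSV; the file name and the “Term” column are hypothetical:

```python
import pandas as pd

# Minimal sketch: clean a controlled vocabulary exported from our sheet.
# "controlled_vocabulary.csv" and the "Term" column are assumptions.
vocab = pd.read_csv("controlled_vocabulary.csv")

vocab = vocab.dropna(subset=["Term"])           # delete blank rows
vocab["Term"] = vocab["Term"].str.strip()       # trim stray whitespace
vocab = vocab.drop_duplicates(subset=["Term"])  # delete duplicates
vocab = vocab.sort_values("Term")               # alphabetize

vocab.to_csv("controlled_vocabulary_clean.csv", index=False)
```

Comparing lowercased terms (e.g., deduplicating on vocab["Term"].str.lower()) would also catch duplicates that differ only in capitalization.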

We will use this controlled vocabulary as recommended terms for our contributor guide. We have not heard back from MITH and still do not have access to the backend of the Omeka website. Regardless, we need to start writing the contributor guide as there is not much time left in the semester. We initially planned to also write a user guide for the LCHP, but I think it is now a better idea to include it as a suggestion in our Final Project Report and Sustainability Plan.

My task for the contributor guide is to write an overview of the collection. This upcoming week I plan to review everyone’s metadata to get a better sense of the document types, themes, and people in the collection. I will also look at collection overviews for other archives to see what kind of information should be included. I feel that I will enjoy this part of the project more now that the inventory is finally done.

Week 12: Moving Towards Final Projects

I worked this week on our Preservation Plan, pulling pertinent information from our group notes and emails, brainstorming possible digital museum tools, and trying to wrap up our work. Last week I reported some averages, but Andy had some good questions: Are we able to get exact numbers by year, and how accurate could those be given the holes in their archives?

I revisited the inventory list and tallied some very rough year-by-year figures for the number of files (Juli is working on a more precise tally). I created a graph to show visually what CreativeWorks is generating per year so we can look at the situation in another light.
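
As the tallies firm up, the chart only takes a few lines to regenerate. Here is a minimal sketch; the counts below are placeholders, not our actual figures:

```python
import matplotlib.pyplot as plt

# Placeholder year-by-year file counts; swap in the real tallies.
counts = {2013: 800, 2014: 950, 2015: 1100, 2016: 1200, 2017: 2600, 2018: 700}

plt.bar([str(year) for year in counts], list(counts.values()))
plt.xlabel("Year created")
plt.ylabel("Number of files")
plt.title("Files in CreativeWorks archives (rough tally)")
plt.tight_layout()
plt.savefig("cw_files_by_year.png")
```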

[Figure: temporary graph of files in CreativeWorks archives.]

Andy asked why 2017 is such an outlier, which raises some other questions: Is 2018 going to be similar, since this chart only includes 3 months of files? If so, file creation has more than doubled over previous years, so what does this mean for storage and backup issues moving ahead? Or is the broken hard drive housing sets of files that we don’t have access to? Whatever the answers are, I feel like the visuals make a stark statement of what is at stake without a backup and preservation plan in place.

I plan to also create a bar graph for the disk space being used once that information is available.

I am also revisiting the inventory list to make it as useful a document as it can be, while balancing the fact that most of the content likely won’t need to be recalled: various versions of the same document exist, and many of the files lack descriptive names. But at the very least, I think these can be grouped by year and then by parent folder.

Options for the “Digital Museum”

This week I concentrated on documenting several of the activities and processes our team has performed for Joe’s Movement Emporium. In particular, I have focused on providing a step-by-step process for activities that Creative Works staff will have to maintain on their own, such as updating their keyword taxonomy. We have also provided a hard copy of the taxonomy itself, and the folder structure outline that we are suggesting they use for both their in-progress files and their long-term storage.

The inventory analysis continues to take shape, and I’m optimistic that we may already have some data that will illuminate the necessity for Joe’s to put more resources into asset management. I’m still hoping that we can find a way to provide yearly file size totals, but even the file counts that we have already are compelling.

Lastly, I have begun doing a bit of high-level research on possible tools that Creative Works could look at for the “digital museum” goal that they want to work towards. I know this is outside the scope of our project for the semester, but I also see the value in having some idea of what the possible end state might look like. It is a given that we will recommend looking at Omeka; from a cost-benefit standpoint there are few (if any) competitors that offer the same features, ease of use, and flexibility of application. Once again, the sticking point for me has been what a more achievable short-term alternative might look like for them.

Initially I started exploring what capabilities are built into platforms that they already own, or could add on for minimal extra cost, such as OneDrive or Adobe Creative Cloud. In short, neither seems to be a good option. OneDrive doesn’t appear to offer much in the way of a publicly available, searchable, presentation-quality interface; it is primarily a workflow collaboration tool. Adobe does offer some cloud capabilities for sharing images, such as Adobe Portfolio, which comes with Creative Cloud, but Portfolio does not appear to offer searchability. Adobe also has something called Adobe Experience Manager that appears to be an end-to-end workflow and digital asset management system. It also happens to be the most expensive one on the market, with a total implementation cost around $2 million. That’s clearly meant for large enterprises, not Joe’s.

In the end I suspect my teammate Lauren is correct that the best short-term solution will be to encourage Joe’s to look at how they might use readily available social media outlets in a smarter, more organized way. In fact, if Creative Works staff adhere to the metadata practices we have developed for them, these improvements will set the stage for more accurate search results on platforms like YouTube and Flickr.

The Mystery of the Missing Files

As with the rest of my group, my work on the project this week has been to finish entering metadata for the Omeka collection into our inventory. Like Maya and Lauren, I look forward to working on the controlled vocabulary, which I think will provide good backing for the recommendations we make for MITH and LCHP. As I entered more and more data, I wondered about the usefulness of broadly defined Subject headings like “Housing” that are so ubiquitous as to apply to at least half of the records. Like Jenny, I think that my entries will benefit from a second scrubbing to impose consistency and eliminate errors. The current state of the Omeka collection definitely illustrates the importance of checking over your work, as well as creating a consistent protocol for preserving metadata!

Which brings me to my main concern of the week: missing files. I need to check, but out of 746 files, we must have at least 50 missing, which is a big chunk! I think that we need to make a recommendation to LCHP about what to do with ghost records with metadata…but no data. Eliminating them altogether means that information contained in the title will be lost. Furthermore, I think that some of the records can still be tracked down. Many of the missing records from my chunk of work came from the City of College Park; it is entirely possible that they are still accessible to persistent researchers. Some of the missing oral histories seem to exist as audio cassettes, perhaps in the possession of LCHP members. On the other hand, the metadata don’t seem very reliable if there is no record attached to them; how useful can that be? Some of the missing records not only have data missing, but also lack meaningful metadata. Perhaps a contingent approach can be applied to records on an individual basis. Ultimately, this is a decision for LCHP and MITH, but I do think that we have a responsibility to offer an actionable suggestion.
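
To pin down that count, a quick pass over our inventory export would do it. A minimal sketch, assuming the inventory is a CSV and that a hypothetical “File URL” column is empty for ghost records:

```python
import pandas as pd

# Minimal sketch: flag "ghost" records (metadata present, no file attached).
# The file name and the "File URL"/"Title" columns are assumptions.
inventory = pd.read_csv("omeka_inventory.csv")

ghosts = inventory[inventory["File URL"].isna()]
print(f"{len(ghosts)} of {len(inventory)} records have no file attached")

# The titles still carry recoverable information worth preserving.
print(ghosts["Title"].head(10))
```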

I had originally been skeptical about preserving the contributor/creator metric, but now it’s sort of become the basis of a kind of “naughty or nice” list in my mind. There were definitely some students who were very thorough (although still not consistent) at entering metadata (Hello Jocelyn Knauf and Gregory McCampbell!), while others were…not. I really want to know who the “wilmer” was who entered all of the City of College Park Urban Renewal Authority appraisal photos. They were prolific, so I can’t call them lazy, but the quality control on those documents was extremely spotty, and most of my missing records come from this contributor.

Data analysis and standardization

I’ve been spending this week working with the Excel inventory Lauren generated during our previous site visit. The initial analysis was pretty simple, essentially a manual review of the Excel doc based on the easily accessible information. However, we’d like to drill down into the information further in a few ways:
1) We want to present the inventory by year, so CreativeWorks can get some concrete data about the amount of digital content they generate.
2) We want to identify more of the file types, since the initial inventory does not provide all the data we need in each column.
3) It would be helpful to standardize the file sizes, since the inventory combines bytes, KB, MB, GB, etc., into a single column.
4) There is an odd scenario where folders seem to be counted towards the total MB; we suspect that folders and their contents may be contributing duplicative sizes to the total.

Working with Excel and some internet research, I created a MB column that should account for the file size standardization using these kinds of formulas:
– bytes to MB: # / 1048576
– GB to MB: # * 1024
– KB to MB: # / 1024
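
Since these conversions are easy to get backwards, here is the same normalization as a minimal Python sketch, assuming the size column holds strings like “512 KB” or “1.2 GB” (the file and column names are hypothetical):

```python
import pandas as pd

# Multipliers to convert each unit to MB, matching the formulas above.
TO_MB = {"bytes": 1 / 1048576, "KB": 1 / 1024, "MB": 1.0, "GB": 1024.0}

def size_to_mb(size: str) -> float:
    """Convert a string like '512 KB' or '1.2 GB' to megabytes."""
    number, unit = size.split()
    return float(number) * TO_MB[unit]

inventory = pd.read_csv("cw_inventory.csv")
inventory["Size (MB)"] = inventory["Size"].map(size_to_mb)
```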

Next, I used the Text-to-Columns feature to split file extensions out of the file names, to help identify some of the unknown file types. There is also a helpful CODEC column in our inventory that lets me isolate information even further.
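
The same split is quick to script for whenever we rerun the analysis. A sketch, again with hypothetical file and column names:

```python
import pandas as pd

inventory = pd.read_csv("cw_inventory.csv")

# Pull the extension (the text after the final dot) out of each file name.
inventory["Extension"] = (
    inventory["Filename"].str.rsplit(".", n=1).str[-1].str.lower()
)
print(inventory["Extension"].value_counts())
```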

I have been more stymied trying to isolate the information by year. There is the expected difference between file creation and file modification dates, further complicated by the fact that some of these files have metadata that places them in the 1960s… I need to massage this more to see if I can get better results.
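
One approach I may try: derive a best-guess year per file and set aside the obviously implausible ones. A sketch under those assumptions (file and column names hypothetical):

```python
import pandas as pd

inventory = pd.read_csv("cw_inventory.csv")

created = pd.to_datetime(inventory["Created"], errors="coerce")
modified = pd.to_datetime(inventory["Modified"], errors="coerce")

# Prefer the creation date; fall back to the modified date when it is missing.
year = created.fillna(modified).dt.year

# Treat years outside a plausible range (e.g., files "from the 1960s") as bad metadata.
year = year.where(year.between(2000, 2018))

print(year.value_counts().sort_index())
```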

Of course, as I’ve been wrangling this data, I learned more about OpenRefine from this week’s reading, and now I want to give that some additional thought as an option. While my manual processes will get us a one-time analysis that’s helpful for the purposes of the report we will present to CW, a more repeatable method would be infinitely more useful to them.

The Invader Zim Shirt Was Appropriate

As Andy already stated, I popped over to Joe’s this afternoon since I live so close to it and knew that he was going to be there. I have no experience with Adobe Bridge, so I wanted to see him import the keyword taxonomy. We ended up being there for around 1.5 hours and covered such topics as the taxonomy, our proposed folder structure, and our desire for a presentation to Joe’s staff about the progress that has been made, ideas for the future, and how they can help.

One of our goals from the start has been to empower Sierra to make decisions related to past and future digital assets. She is the staff member who works most closely with them, but she has not been there for very long. When we first met her, we could tell that she was feeling overwhelmed by the chaos and that she did not feel empowered to take charge (for example, no one specifically tasked her with cleaning the mess up or directed her towards resources to show her how she might). However, we could also tell that newer files generated since her arrival were in better order than older ones, so she was clearly capable.

It was so nice to feel her genuine excitement as we were wrapping up today. You can tell that her head is above water and that she is enjoying being able to breathe easier (at least before whatever problems this new branch location might bring). It reminded me of being a teacher and the joy that I felt in seeing my students grow, partially due to my help. I like Sierra and I want to see her be happy and successful.

Aside: I have never seen someone react so positively to seeing a document related to file naming conventions (I found this artifact in their files from 2014; it was not produced in-house, but it is evidence that someone was at least thinking of the same kind of work that we are doing).

Wrapping up metadata

I’ve finished entering the metadata for my entries. It wasn’t a particularly difficult process, though it is concerning that most entries have very sparse metadata, and that several entries appear to be missing the actual image (or, in one case, the actual oral history audio recording) itself. In another case, an entry’s title and description seemed to describe the image associated with the previous entry.

I look forward to moving on to the contributor guide. In the meantime, I will go back and review my entries to see if any cleanup needs to be done.

Wrapping Up with Metadata

This week I completed my section of the Omeka inventory. I faced the same issues as Lauren, where a lot of the JPEG files were missing and the item pages were mostly empty. I also saw a good number of instances where the title did not correctly identify the information in the file. For example, the title would suggest the item was about one location when the actual information in the file was about something else. There was even an item with the words “WRONG TITLE?” in the title.

I agree with Lauren that building a controlled vocabulary for the Omeka website would be beneficial for the project. Many of the items share common themes yet are not placed under the same subjects, or have no subjects at all. I also wonder how valuable some of the files are. Some items are just pictures from Google Maps with no dates, subjects, or ties to specific people in the community. It leads me to wonder how selective the LCHP is when adding files to their archive. Maybe I am just missing the value of these images, though.

I plan to go back through my section and clean up my metadata. I still need to add in the adjustments we decided to make, such as tagging items not in collections with “None” under the Collection tab in our spreadsheet. I look forward to building the contributor guide (once we have access to the backend of the Omeka site) and working on something different for the project.

Speaking of Missing JPEGs…

I have now catalogued the metadata in Airtable for my quarter (186) of the 746 Omeka records, in time for our self-imposed deadline of tomorrow. Along the way I have found that several records (for some reason, particularly ones describing JPEGs) have file descriptions, but the file itself is not attached. Hopefully MITH, or someone at least, still has these files on one of the 3 hard drives associated with this project, since they don’t seem to be on the Omeka servers.

These records in particular, unsurprisingly, have especially poor metadata, which is both sparse and rife with spelling and grammatical errors and inconsistent capitalization and punctuation. These were lower-numbered records, so they seem to be among the earliest entered (but among the last I catalogued, because the Omeka pages run backward from the most recently added), and an Omeka learning curve could help explain some of these errors. We’re awaiting access to the admin side of the Omeka site so we can have more context to understand the file and record inputting process.

Fortunately, most of the records are indeed attached to files, with varying levels of metadata that are interesting to analyze. I can definitely see the value of our controlled vocabularies in organizing and standardizing the metadata, and our inventory in helping MITH assess how much overlap there is between their hard drives and the Omeka site and fill in any gaps. But I wish we had more time for this part of the project! Now we prepare to move on to writing the contributor guide, which I don’t think I will enjoy as much as I’ve enjoyed cataloging and learning about the people and history of Lakeland!