Data analysis and standardization

I’ve spent this week working with the Excel inventory Lauren generated during our previous site visit. The initial analysis was fairly simple: essentially a manual review of the Excel doc based on the most easily accessible information. Now we’d like to drill further into the data in a few ways. 1) We want to present the inventory by year, so CreativeWorks can get concrete data about how much digital content they generate. 2) We want to identify more of the file types, since the initial inventory does not provide all the data we need in every column. 3) It would be helpful to standardize the file sizes, since the inventory mixes bytes, KB, MB, and GB in a single column. 4) Folders seem to be counted towards the total MB, and we suspect that folders and their contents are both being counted, which would inflate the total size.

Working with Excel and some internet research, I created an MB column that standardizes the file sizes using formulas like these, where # is the original value and 1 MB is treated as 1,048,576 bytes (binary units).
– bytes to MB: # / 1048576
– GB to MB: # * 1024
– KB to MB: # / 1024
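
For anyone who would rather script this step than build it in Excel, here is a minimal Python sketch of the same conversions. It assumes each size is stored as text like "512 KB" (a number, a space, then a unit); that layout and the function name are assumptions for illustration, not a description of the actual inventory.

```python
# Factors for converting each unit to MB, using the same binary
# conventions as the formulas above (1 MB = 1,048,576 bytes).
TO_MB = {
    "BYTES": 1 / 1048576,
    "KB": 1 / 1024,
    "MB": 1.0,
    "GB": 1024.0,
}


def parse_size_to_mb(text):
    """Convert a string such as '512 KB' to a number of MB.

    Assumes the hypothetical format '<number> <unit>'; anything else
    raises an error so odd rows can be reviewed by hand.
    """
    number, unit = text.strip().split()
    unit = unit.upper()
    if unit in ("B", "BYTE"):
        unit = "BYTES"
    # Remove thousands separators such as '1,024' before converting.
    return float(number.replace(",", "")) * TO_MB[unit]


if __name__ == "__main__":
    for sample in ["1048576 bytes", "512 KB", "2 GB"]:
        print(sample, "->", parse_size_to_mb(sample), "MB")
```

Raising an error on unexpected rows is deliberate: it surfaces anything that does not match the assumed format instead of silently skewing the total.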

Next, I used Excel’s Text-To-Columns feature to split the extensions off the file names, which helps identify some of the unknown file types. There is also a helpful CODEC column in our inventory that lets me narrow things down even further.
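
The extension split can also be sketched in Python. This is only an illustration of the idea, not what I ran on the inventory; the function names are made up, and it assumes the file names are available as a plain list of strings.

```python
from collections import Counter
from pathlib import PurePath


def extension_of(file_name):
    """Return the lowercase extension without the dot, or '' if none."""
    return PurePath(file_name).suffix.lower().lstrip(".")


def count_extensions(file_names):
    """Tally how many files carry each extension."""
    return Counter(extension_of(name) or "(none)" for name in file_names)


if __name__ == "__main__":
    names = ["clip.MOV", "report.final.pdf", "README", "notes.pdf"]
    print(count_extensions(names))
```

Only the last dot counts, so a name like "report.final.pdf" is tallied as pdf, and names with no extension are grouped under "(none)" so they are easy to pull out for a closer look.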

I have been more stymied trying to isolate the information by year. There is the expected difference between file creation and file modified dates, further complicated by the fact that some of these files have metadata that sets them in the 1960s… I need to massage this more to see if I can get better results.
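
One possible way to approach the year problem, sketched in Python. Everything here is an assumption for illustration: the 1980 cutoff for "implausible" dates, the rule of taking the earlier of the two plausible dates, and the row layout of (created, modified, size in MB). A different rule may well suit the real data better.

```python
from collections import defaultdict
from datetime import datetime

# Assumed cutoff: dates before this year are treated as bad metadata.
EARLIEST_PLAUSIBLE_YEAR = 1980


def pick_year(created, modified, latest_year=None):
    """Return the earliest plausible year among the two dates, or None.

    Either date may be None. Years before the cutoff or in the future
    are ignored rather than trusted.
    """
    if latest_year is None:
        latest_year = datetime.now().year
    plausible = [
        d.year
        for d in (created, modified)
        if d is not None and EARLIEST_PLAUSIBLE_YEAR <= d.year <= latest_year
    ]
    return min(plausible) if plausible else None


def total_mb_by_year(rows):
    """Sum size in MB per year; rows are (created, modified, size_mb)."""
    totals = defaultdict(float)
    for created, modified, size_mb in rows:
        year = pick_year(created, modified)
        totals[year if year is not None else "unknown"] += size_mb
    return dict(totals)
```

Files whose dates are both implausible land in an "unknown" bucket, which keeps them visible in the report instead of distorting a particular year.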

Of course, as I’ve been wrangling this data, I learned more about OpenRefine from this week’s reading, and now I want to give it some additional thought as an option. While my manual processes will get us a one-time analysis that’s helpful for the report we will present to CW, a more repeatable method would be far more useful to them.
