Data Management Plan – Lindsay Tran and Mira Hayward

INTRODUCTION

In this project, we created a data management plan (including a data dictionary and data ethics statement) for data derived from Darrell Millner, Carl Abbott, and Cathy Galbraith’s 1995 report Cornerstones of Community: Buildings of Portland’s African American History. At some point, a spreadsheet of resources in the Albina District was created with information derived from the original narrative Cornerstones document. The goal of this project was to evaluate the status of this spreadsheet and create a data management plan that will guide future data entry workers towards reshaping the raw data of the spreadsheet into an accessible, consistently formatted, resource-focused database.  

Within the context of this project, our work facilitates further research into how displacement from Albina has affected the intangible cultural heritage of the Black community. By emphasizing individual contributions to the community’s vitality, the revamped GIS map will make it easier for researchers and community members to draw connections between urban renewal and its ill effects on communities; they will be able to point to specific examples from Albina thanks to the clarity of this map.

DATA MANAGEMENT

We began the process of building out the data dictionary by addressing our unfamiliarity with naming conventions and required fields. We first researched GIS requirements for character length, format (string or integer), and data hierarchy. After familiarizing ourselves with a few examples of data dictionaries–though none specific to the field of public history or historic preservation–we began to sort the existing fields in the Cornerstones dataset into three categories: discard, retain, and combine or change. We then reordered the hierarchy of data to keep related fields together and to make the data easier to scan by eye when a resource is selected on the map.

Discarded Fields

We discarded a number of fields to make the data easier to read and more relevant to what we hypothesized would be common research questions. Deleted fields included “Association,” “Association 2,” and “Association 3” (ASSOC, ASSOC 2, ASSOC 3). In our initial discussion about which fields to keep, we concluded that the information in the Association fields, if present, pertained to the typology of the resource (residential, commercial, church, organization, etc.). This information is already addressed elsewhere, namely in the added field “RSRC_TYPE.” At a second meeting, as described below in the “Further Work” section, we concluded that the Association fields only perpetuated the pattern of centering data around people rather than resources; this constitutes an additional reason for removing these three fields, as it was counterproductive to our goal of recentering the data around the resources themselves.

We also discarded the second tax lot field (TLID 2) as part of our effort to recenter the data around specific sites instead of individuals (place-based versus person-based). Only one tax lot should be affiliated with each site. 

We discarded the fields “X-COORD” and “Y-COORD” (both associated with georeferencing), as the tax lot ID provides sufficient information to pinpoint each site location. In addition, we discovered that each coordinate set was actually linked to multiple resources, which is not helpful for producing an accurate map.
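A check like the following can surface coordinate pairs that are shared by more than one address before the coordinate fields are dropped. This is a minimal sketch, not part of our original workflow: it assumes the spreadsheet has been exported as a list of row dictionaries, and the function name and sample rows are hypothetical.

```python
from collections import defaultdict

def shared_coordinates(rows, x_field="X-COORD", y_field="Y-COORD", addr_field="NEW_ADDR"):
    """Return {(x, y): sorted addresses} for coordinate pairs tied to 2+ addresses."""
    addresses_by_coord = defaultdict(set)
    for row in rows:
        addresses_by_coord[(row[x_field], row[y_field])].add(row[addr_field])
    return {
        coord: sorted(addresses)
        for coord, addresses in addresses_by_coord.items()
        if len(addresses) > 1
    }

# Hypothetical sample rows (placeholders, not taken from the Cornerstones data).
sample_rows = [
    {"X-COORD": "100", "Y-COORD": "200", "NEW_ADDR": "123 N Example Avenue"},
    {"X-COORD": "100", "Y-COORD": "200", "NEW_ADDR": "456 N Sample Street"},
    {"X-COORD": "300", "Y-COORD": "400", "NEW_ADDR": "789 NE Placeholder Street"},
]
clusters = shared_coordinates(sample_rows)
```

Any coordinate pair returned here is attached to several addresses and would stack those resources on a single map point.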

We chose to discard the “RNUM” field because we were unsure what this referred to, and we could find no references in the Cornerstones report to indicate its meaning. We deleted the field “FC,” which referred to the resource’s extant status as of 1998. Finally, we discarded the field “New Address” because it was redundant; this information was already represented in the field “NEW_ADDR.”

Retained Fields

Fields we retained included “Name,” “Year,” “Old Address,” “TLID,” “OCCUPATION,” “NOTES,” “NEW_ADDR,” and “Photos & Files.” The only changes to these fields were capitalizing for consistency and changing the field names to better reflect the content and/or to fit the 10-character limit for field names.

Added, Combined, & Replaced Fields

Fields we added included “1998_DEMO,” “2010_DEMO,” and “DEMO_YEAR” to encompass all demolition dates previously represented in the two fields “FC” and “Field Check.” We replaced the field “Occupation Location” with “EMPLOYER” and “EMPLOYER_2” to more neatly indicate place of employment, and added a “NOTES” field to cover all anecdotal information previously contained in the “OCCUPATION” field that did not easily fall into other categories. We replaced the “REF” field with both “SOURCE” and “SOURCE_YR” to better reflect the content of the fields and to help with further data sorting if desired. In anticipation of further data that could be collected to supplement the existing data, we added the fields “HH_SIZE” (Household Size), “NGHBRHD” (Neighborhood), “DIST_STATUS” (District Status), and “RSRC_TYPE” (Resource Type – e.g. residential or commercial). We also added three fields, “NAME_2,” “OCCU_2,” and “EMPLOYER_2,” to account for an additional individual, like a spouse, who may have also resided at the address.

Renamed Fields

We renamed several retained fields to better reflect the content or to adhere to the character limit for field names. The dataset as we received it included several field names that exceeded the character limit and were cut off, which rendered them incomprehensible.

“Name” was changed to “NAME_1” and “NAME_2”; “REF,” to “SOURCE” and “SOURCE_YR”; “Year,” to “RSRC_YEAR” and “SOURCE_YR”; “Old Address,” to “OLD_ADDR”; “Occupation,” to “OCCUPATION” and “OCCU_2”; “DEMO,” to “DEMO_YEAR”; and “Photos & Files,” to “DOCUMENTS.”
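The one-to-one renames can be applied mechanically and checked against the character limit. The sketch below is illustrative only: it assumes each spreadsheet row is a dictionary keyed by field name, and the helper names and sample values are hypothetical. Fields that were split in two (for example “Name” into “NAME_1” and “NAME_2”) need manual review and are left out of the mapping.

```python
FIELD_NAME_LIMIT = 10  # character limit for field names cited in this plan

# One-to-one renames described above.
RENAMES = {
    "Old Address": "OLD_ADDR",
    "Photos & Files": "DOCUMENTS",
}

def rename_fields(row, renames=RENAMES):
    """Return a copy of row with old field names swapped for the new ones."""
    return {renames.get(name, name): value for name, value in row.items()}

def too_long(names, limit=FIELD_NAME_LIMIT):
    """Return the field names that exceed the character limit."""
    return [name for name in names if len(name) > limit]

# Hypothetical row with placeholder values.
old_row = {"Old Address": "456 Myrtle Street", "Photos & Files": ""}
new_row = rename_fields(old_row)
```

Running `too_long` over the old names flags both of them, while the new names pass the check.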

Hierarchy of Data

We reorganized the field order to reflect four key aspects of each resource: location information, resource information, demolition status, and documents. Location information includes the fields “NEW_ADDR,” “OLD_ADDR,” “RSRC_YEAR,” “RSRC_TYPE,” “NGHBRHD,” “DIST_STATUS,” and “TLID.” Resource information includes “NAME_1,” “OCCUPATION,” “EMPLOYER,” “NAME_2,” “OCCU_2,” “EMPLOYER_2,” “HH_SIZE,” “NOTES,” “SOURCE,” and “SOURCE_YR.” Demolition status includes the fields “1998_DEMO,” “2010_DEMO,” and “DEMO_YEAR.” Additional document information includes the single field “DOCUMENTS.”
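The field order can be written down once and reused whenever an entry is exported. This is a hedged sketch rather than a required step: it assumes entries are dictionaries keyed by field name, and the `reorder` helper and sample values are hypothetical. The demolition year field is spelled DEMO_YEAR here, as in the data dictionary.

```python
FIELD_ORDER = [
    # Location information
    "NEW_ADDR", "OLD_ADDR", "RSRC_YEAR", "RSRC_TYPE", "NGHBRHD", "DIST_STATUS", "TLID",
    # Resource information
    "NAME_1", "OCCUPATION", "EMPLOYER", "NAME_2", "OCCU_2", "EMPLOYER_2",
    "HH_SIZE", "NOTES", "SOURCE", "SOURCE_YR",
    # Demolition status
    "1998_DEMO", "2010_DEMO", "DEMO_YEAR",
    # Documents
    "DOCUMENTS",
]

def reorder(entry, order=FIELD_ORDER):
    """Return entry with keys in the planned order; missing fields become ''."""
    return {name: entry.get(name, "") for name in order}

# Hypothetical entry with fields in arbitrary order.
ordered = reorder({"NAME_1": "John Doe", "NEW_ADDR": "123 NW 1st Avenue"})
```

Note that this helper silently drops any field that is not in `FIELD_ORDER`, so it should only be run after the discarded fields have been reviewed.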

Reorienting Towards Resources

The data collected in the Cornerstones project was organized around individuals. In the original data set, each entry line begins with the name of an individual or a business. Information on the resource is secondary. In order to transform this data set into a tool that is useful for historic preservationists, who are interested in the built environment, we decided to reorient the data towards resources (buildings). Instead of having multiple entries for individuals, and using the database as a way to record information about these individuals over time (which we feel is already done in the Cornerstones narrative report), we centered the resources. This reorientation was done in several ways. First, we moved all the information relating to the resource to the top end of the attribute table. Second, we consolidated multiple entries into individual, resource-centric entries. For example, multiple entries of Jane Doe at Location A should all be listed under Location A in one entry. If Jane Doe is also listed at Location B, this information should be entered in a separate entry for Location B. Finally, we added resource-related fields such as neighborhood (NGHBRHD), historic district status (DIST_STATUS), and resource type (RSRC_TYPE). These new fields will allow researchers using the database to obtain a fuller understanding of the built environment. 
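The consolidation step can be sketched as a grouping operation. The code below is an illustration under stated assumptions, not the procedure we followed by hand: it assumes person-centered rows with “Name,” “Year,” “OCCUPATION,” and “NEW_ADDR” keys, uses the address as the resource key, and stores the merged details in a hypothetical HISTORY list that does not exist in the data dictionary.

```python
from collections import defaultdict

def consolidate(person_rows, address_field="NEW_ADDR"):
    """Fold person-centered rows into one entry per resource (address)."""
    rows_by_address = defaultdict(list)
    for row in person_rows:
        rows_by_address[row[address_field]].append(row)

    resources = []
    for address, rows in rows_by_address.items():
        # Unique names in alphabetical order; the data dictionary assigns
        # NAME_1 to the person whose first name comes first alphabetically.
        names = sorted({row["Name"] for row in rows})
        # Four-digit year strings sort chronologically as text.
        history = sorted((row["Year"], row["OCCUPATION"]) for row in rows)
        resources.append({
            "NEW_ADDR": address,
            "NAME_1": names[0],
            "NAME_2": names[1] if len(names) > 1 else "",
            "HISTORY": history,  # hypothetical holding field, not in the dictionary
        })
    return resources

# Hypothetical rows mirroring the Jane Doe example above.
sample_rows = [
    {"Name": "Jane Doe", "NEW_ADDR": "Location A", "Year": "1950", "OCCUPATION": "Principal"},
    {"Name": "Jane Doe", "NEW_ADDR": "Location A", "Year": "1940", "OCCUPATION": "Teacher"},
    {"Name": "Jane Doe", "NEW_ADDR": "Location B", "Year": "1960", "OCCUPATION": "Retired"},
]
entries = consolidate(sample_rows)
```

The two Location A rows collapse into one entry and Location B stays separate. A resource with more than two associated names is not handled here; only the first two names alphabetically are kept, so those cases would need manual review.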

Further Work

While we’ve made a good start on organizing this data, a number of problems remain for future data managers to tackle. Below, we’ve outlined some of the problems that we could not find satisfactory solutions to, given the limited scope and time of the course. 

First, there are a number of “artifacts” left in the dataset by past researchers that we weren’t able to decipher, such as abbreviations for sources. Future data managers might want to work directly with those involved with the Cornerstones project in order to learn more about the process of translating this data from a narrative document into a spreadsheet. Knowing who did this work, and perhaps being able to speak with that person, could provide insight into what these unknown abbreviations and codes mean. 

Second, we struggled to find a way to satisfactorily relay information about where exactly the information in the new NOTES field (formerly contained in the OCCUPATION field) comes from. Because this database was originally centered on individuals, each new piece of information required the creation of a new entry. In our new data model these multiple entries are combined. While this move allows the dataset to be centered on resources, instead of individuals, and makes for a more intuitive reading experience, it does create some confusion, as multiple pieces of information are combined into one entry. Currently, we ask data entry workers to list the source names (in the SOURCE field) and the source years (in SOURCE_YR) chronologically, with the exception that the source of the information in the OCCUPATION, EMPLOYER, OCCU_2, and EMPLOYER_2 fields should come first, with the remaining information listed chronologically in the NOTES field. This may lead to some confusion, as the information that the source name and source year entries refer to is at present spread out across multiple fields, and it might be hard for a reader to know which piece of information came first, chronologically speaking. We have tried to address that by using the system listed above, but it’s a little clumsy, and future data managers may find a more streamlined way of accomplishing the goal of uniting source, source year, and information.
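One way to make the numbering system easier to read back is to parse the numbered lists and line them up by number. The following is a minimal sketch, assuming the fields are written exactly in the “1. A, 2. B, 3. C” style shown in the data dictionary; the function names and sample values are hypothetical, and free text that itself contains a number followed by a period would need more careful handling.

```python
import re

# Matches "N. text" items in a list such as "1. Teacher, 2. Portland Hotel".
ITEM = re.compile(r"(\d+)\.\s*(.*?)(?=,\s*\d+\.(?:\s|$)|$)")

def parse_numbered(text):
    """Return {number: text} for a field written as '1. A, 2. B, 3. C'."""
    return {int(num): value.strip() for num, value in ITEM.findall(text)}

def link_sources(notes, source, source_yr):
    """Line up each numbered piece of information with its source and year."""
    source_by_n = parse_numbered(source)
    year_by_n = parse_numbered(source_yr)
    return [
        {"n": n, "info": info,
         "source": source_by_n.get(n, ""), "year": year_by_n.get(n, "")}
        for n, info in sorted(parse_numbered(notes).items())
    ]

# Hypothetical field values; item 3 is an oral history, so it has no year.
linked = link_sources(
    notes="1. Teacher, 2. Portland Hotel, 3. Member of a social club",
    source="1. ADV, 2. ADV, 3. OH",
    source_yr="1. 1955, 2. 1955",
)
```

Each item in `linked` now carries its own source code and year, which is the pairing a reader otherwise has to reconstruct by eye across three fields.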

Third, we were not able to find a way to succinctly and accurately capture data on non-occupational associations, such as affiliation with churches, political groups, and social groups. This information is important to understanding the lifeways and networks of the Black community in the Albina district, but the limitations imposed by the GIS framework and the scope of the data left us struggling to represent these associations. Because we wanted to center the data around resources, and not individuals, it felt inaccurate to include an association field, as the data usually did not make it clear whether it was simply the occupants of a resource that were affiliated with a group or whether the resource itself was. Additionally, many individuals were associated with multiple groups, and we couldn’t figure out a way to balance the need to capture this association data with our desire to keep the data from being bloated or overly detailed. 

Fourth, we hope further work will be done on organizing the contents of the NOTES field (containing information from the former OCCUPATION field). We have created a protocol for data entry workers to separate out important strands of this data, such as occupation and employer, but the rest of this information still resides in a disorganized text box. Because the type of information contained in NOTES varies so much, there was not a simple way to categorize it. We encourage future researchers/data managers to consider novel ways to represent this rich, nuanced data in a streamlined form. 

Finally, the data contained in this data set must be geocoded before it is usable. The original data set clustered multiple addresses at the same geographic coordinates, so until the data is geocoded it will not appear accurately on a map. Fortunately, this can be easily done using GIS software, once the correct addresses for each entry have been entered. 

DATA ETHICS

We recognize that data is never objective, and that its collection is informed by the biases, priorities, interests, and resources of the individual or group collecting the data. The Cornerstones data will ideally be accessible in the future to the public. As such, a careful consideration of how this data ought to be used and manipulated constitutes part of our recommendations for managing this dataset.

In constructing our data ethics guidelines, we looked to the PNW Just Futures Institute for Racial and Climate Justice’s (JFI) Research and Data Justice Principles. These principles were constructed by JFI to facilitate “ethical research and dissemination practices delineated in a series of research justice principles that seek to decolonize historically hierarchical university-community relationships.” We found the following principles to be particularly relevant to our work; future users of this data may wish to engage with the full document in order to guide their work. 

Accountability and Community Expertise

When conducting research on the intangible heritage of underrepresented communities, scholarly researchers depend on the contributions and resources of community members to complete their work. As such, community members’ time, energy, and intellectual agency must be accorded full value and respect. The ethical treatment of community expertise can be realized in several ways, including basing project timelines on community members’ prior commitments and schedules; compensating and crediting community members fairly for their contributions, including financial compensation if possible and appropriate; and prioritizing in the data collection process the emic perspective of community members over the etic perspective of academic researchers.

Awareness of Social Location

As students in a predominantly white graduate program, with access to the resources and opportunities provided by a research institution, we are aware that our work on this project is shaped by our social location. In the process of creating this data dictionary, we sought feedback from our classmates and instructors with the hope of incorporating the views of burgeoning historic preservationists, historic preservation professionals, and community members, including people whose identities intersect these categories. Given the limitations of the course, we were not able to engage as fully as we would have liked with the community, and we recognize that the scope, perspective, and impact of our project are necessarily curtailed by the absence of community voices. To make this dataset an accessible, meaningful resource for all, more input from the community is needed. Researchers from the academy should acknowledge and be aware of this limitation.

Digital Project Preservation and Sustainability 

The work done for this project is based on work originally done in the Cornerstones of Community report, which was published as a physical document and later uploaded as a PDF, and on work done by later workers to translate the narrative data of the Cornerstones report into a spreadsheet format. Given the multiple steps of translation, the data was understandably muddled and occasionally inaccessible. In our work for this project we aimed to undo some of the errors made in translation while also safeguarding against future errors by creating clear guidelines for data entry in the data dictionary. 

Additionally, we rely on digital tools to keep this data organized and accessible for future researchers. By using an online database, we can centralize future work done on the database, ensuring consistency. We worked closely with course TA Hannah Mellor to understand the requirements of using data in ArcGIS Online; their guidance helped us to create frameworks to ensure that this data will be usable across a range of platforms.

Research Transforms the Researcher

Recognizing the potential for changes in opinion and perspective over the course of this project is paramount to safeguarding the project’s intellectual integrity. In assessing and organizing the data, researchers come to realizations about the data’s significance and consequences, which can change their methodology over the duration of the project. This ties into the inherent subjectivity of the research process and thus the subjective nature of the data itself. Changes in perspective and priorities over the course of the project are acceptable and in fact desirable, but acknowledgement of these changes needs to be a part of the research process.

DATA DICTIONARY

| Field Name | Field Type | Description | Character Limit | Format | Required? | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| NEW_ADDR | String | New address post-1933 | 255 | Street number, abbreviated street direction, street name with type written out in full (e.g. 123 NW 1st Avenue, not 123 NW 1st Ave) | y | Between 1931-1933, Portland changed its street numbering system. For this approximately three-year interval, addresses gradually switched over to the new system. Be aware that the address listed in this field may date to any year within this interval. To find the new/old address of a site, visit the Renumbering Index (https://efiles.portlandoregon.gov/Record/2685610/). Direction possibilities: NE = Northeast, NW = Northwest, SE = Southeast, SW = Southwest, N = North, S = South |
| OLD_ADDR | String | Address pre-1933, if applicable | 255 | Street number, abbreviated street direction (if applicable), street name (e.g. 456 Myrtle Street) | n | Changed from “Old Addres” |
| RSRC_YEAR | String | Year that resource was constructed | 255 | Four-digit year, e.g. “1955” | n | Can be found in Portland Maps. |
| RSRC_TYPE | String | The purpose that the resource served at the time when the entity/individual listed in NAME_1 occupied the resource. | 255 | Enter one from the following list: Residential, Commercial, Industrial, Infrastructure, Agricultural, Institutional | y | See https://en.wikipedia.org/wiki/List_of_building_types |
| NGHBRHD | String | The Portland neighborhood where resource is located. If the resource was in a neighborhood that no longer exists due to urban renewal projects, enter the name of the neighborhood prior to urban renewal. | 255 | Name of neighborhood, e.g. Boise | n | Do not abbreviate. Can be found in Portland Maps. |
| DIST_STATUS | String | Specifies if resource is in a historic district, other district, or no district at all. | 255 | Enter type or n/a | n | See: https://www.portland.gov/bps/historic-resources/historic-and-conservation-districts |
| TLID | String | Tax lot identification number | 255 | | y | Include for reference for researchers – not narrative, but useful reference – avoids having to look up in a second place |
| NAME_1 | String | Full name of associated individual 1 OR commercial/industrial/institutional/infrastructure/agricultural name associated with the site | 255 | First Name Last Name of individual or name of commercial/community establishment, e.g. John Doe or Mt. Olivet Baptist Church | y | Associated individual 1 is the person listed as residing at the resource whose first name comes first in the alphabet. For example, “Abigail Adams” would be NAME_1, and “John Adams” would be NAME_2. |
| OCCUPATION | String | Job of NAME_1 OR type of commercial/community building | 255 | e.g. Teacher or Restaurant | y | Do not abbreviate |
| EMPLOYER | String | Employer of NAME_1 | 255 | e.g. Portland Hotel | n | Do not abbreviate |
| NAME_2 | String | Full name of associated individual 2, if applicable | 255 | First Name Last Name of individual or name of commercial/community establishment, e.g. John Doe or Mt. Olivet Baptist Church | n | This individual may be a spouse, a family member, a boarder, or a secondary person associated with the organization or business. |
| OCCU_2 | String | Job of NAME_2, if applicable | 255 | e.g. Teacher or Restaurant | n | Do not abbreviate |
| EMPLOYER_2 | String | Employer of NAME_2, if applicable | 255 | e.g. Portland Hotel | n | Do not abbreviate |
| HH_SIZE | Short Integer | # of residents in building at time of residence of NAME_1 | 255 | Integer, e.g. “4” or “15” | n | Can be found in the US Census. |
| NOTES | String | Any material found in the original SOURCE field that has not already been represented in another field goes into “NOTES”. Each piece of information should be preceded by a number that indicates its chronological place, relative to the other pieces of information in the entry. NB: source data on OCCUPATION, EMPLOYER, OCCU_2, and EMPLOYER_2 should always come first. Also included can be any other sources not listed in SOURCE. | 255 | Example: 1. Occupation, 2. Employer, 3. Information A (or Occu_2), 4. Information B (or Employer_2), 5. Information C | n | Make sure that any information having to do with OCCUPATION or EMPLOYER has been recorded in their respective fields. |
| SOURCE | String | Primary source materials for (each piece of) information, listed in chronological order, preceded by the corresponding number (see NOTES field). This information can be derived from the former “REF” field. If there are multiple pieces of information, sources should be listed in chronological order. | 255 | Use key formatting as described in notes section. If source is not included in key, put OTHER, and record source name in the NOTES section. Example: 1. Source of Occupation, 2. Source of Employer, 3. Source of Information A OR Occu_2, 4. Source of Information B OR Employer_2, 5. Source of Information C | n | Changed from “REF.” Key sourced from “Buildings of African American History in Portland Master List,” pg. 113 of “Cornerstones of Community” document. Key: OH – Oral history (information provided by interviews with community residents); KM – The History of Portland’s African American Community, Kimberly Moreland; PP – A Peculiar Paradise, Elizabeth McLagan; ADV – The Advocate (newspaper); OR – The Oregonian (newspaper); OBS – The Portland Observer (newspaper); REP – Royal Esquire Program, 1966; UL – Urban League Literature; NAACP – NAACP Program, 1953; CR – Consumer Review; MA – Black Pioneers of the Northwest, Martha Anderson; OTHER – Includes all acronyms not specified above, as these were not explained in original key; UNKNOWN |
| SOURCE_YR | Short Integer | Year(s) that (each piece of) information was recorded, listed chronologically, preceded by the corresponding number (see NOTES field). | 255 | Full four-digit year, e.g. 1955. If no year known, leave field blank. Example: 1. Source Year of Occupation, 2. Source Year of Employer, 3. Source Year of Information A OR Occu_2, 4. Source Year of Information B OR Employer_2, 5. Source Year of Information C | n | If SOURCE = OH, no year, because year in data refers to year recorded, not year referenced. |
| 1998_DEMO | String | Was the building demolished by 1998? | 255 | Demoed or extant | n | Changed from “FC” |
| 2010_DEMO | String | Was the building demolished by 2010? | 255 | Demoed or extant | n | Changed from “FC_DESC” |
| DEMO_YEAR | Short Integer | Year of demolition, if applicable and known | 255 | Four-digit year, e.g. “2004” | n | Changed from “DEMO” |
| DOCUMENTS | String | No text; only attached photos or other files that further illustrate the significance of the resource | 255 | | n | Changed from “photos & files” |

What should be done about multiple entries? Multiple entries (multiple data lines for the same individual(s)) should be consolidated around individual resources. For example, multiple entries of Jane Doe at Location A should all be listed under Location A in one entry. If Jane Doe is also listed at Location B, this information should be entered in a separate entry for Location B.
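
Several of the rules in the data dictionary above can be checked automatically before an entry is uploaded. The sketch below is one possible approach, not an established tool: it assumes each entry is a dictionary of strings keyed by field name, covers only the required fields, the RSRC_TYPE list, the four-digit year fields, and the two demolition status fields, and uses a hypothetical `validate` function and placeholder values such as the tax lot ID.

```python
import re

REQUIRED = ["NEW_ADDR", "RSRC_TYPE", "TLID", "NAME_1", "OCCUPATION"]
RSRC_TYPES = {"Residential", "Commercial", "Industrial",
              "Infrastructure", "Agricultural", "Institutional"}
YEAR_FIELDS = ["RSRC_YEAR", "DEMO_YEAR"]
DEMO_FIELDS = ["1998_DEMO", "2010_DEMO"]
FOUR_DIGIT_YEAR = re.compile(r"\d{4}")

def validate(entry):
    """Return a list of problems found in one entry (empty list if none)."""
    problems = []
    for field in REQUIRED:
        if not entry.get(field, "").strip():
            problems.append(f"{field}: required field is empty")
    rsrc_type = entry.get("RSRC_TYPE", "")
    if rsrc_type and rsrc_type not in RSRC_TYPES:
        problems.append(f"RSRC_TYPE: '{rsrc_type}' is not in the allowed list")
    for field in YEAR_FIELDS:
        value = entry.get(field, "")
        if value and not FOUR_DIGIT_YEAR.fullmatch(value):
            problems.append(f"{field}: '{value}' is not a four-digit year")
    for field in DEMO_FIELDS:
        value = entry.get(field, "")
        # Compared case-insensitively; the dictionary lists "Demoed or extant".
        if value and value.lower() not in {"demoed", "extant"}:
            problems.append(f"{field}: '{value}' should be Demoed or extant")
    return problems

# Hypothetical entries; "R000000" is a placeholder, not a real tax lot ID.
good_entry = {"NEW_ADDR": "123 NW 1st Avenue", "RSRC_TYPE": "Residential",
              "TLID": "R000000", "NAME_1": "John Doe", "OCCUPATION": "Teacher",
              "RSRC_YEAR": "1955", "1998_DEMO": "extant"}
bad_entry = {"NEW_ADDR": "123 NW 1st Avenue", "RSRC_TYPE": "House", "TLID": "",
             "NAME_1": "John Doe", "OCCUPATION": "Teacher",
             "RSRC_YEAR": "55", "1998_DEMO": "gone"}
good_problems = validate(good_entry)
bad_problems = validate(bad_entry)
```

An empty list means the entry passed these checks; address formatting and the numbered NOTES, SOURCE, and SOURCE_YR lists are not checked here and still need review by a data entry worker.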