University of Oregon
UO Research Data Management Blog

Thoughts on Data and Ethics, and Resources for Psychology

Thoughts on Data Management and Data Ethics

Earlier this year, Brian Westra and I gave a brief presentation on Data Management issues in an annual seminar that the Department of Psychology holds for its first-year graduate students. The seminar’s larger topic was data ethics. One question that came up was how data management and data ethics relate to one another.

This post has two parts:

1) A few points on how data management and ethics relate. It can be useful to think about this topic explicitly because discussions about it can help to guide research and data management decisions.

2) A list of resources on these topics for new graduate students. Some of the links relate specifically to Psychology, but they all apply in principle across disciplines.

Short version: Data ethics is built on data management. Both are more about one’s frame of mind than about any specific tools one chooses to use. Having said that, it’s important to give oneself exposure to ideas and tools around these topics, in order to push one’s own thinking forward. Some useful resources are listed below to help you get started.

Thoughts on Data Management and Ethics

  • Data ethics is built on data management. Ethical questions and potential problems come up based on data management decisions made in the past, from topics as seemingly trivial as where data are stored (on a passwordless thumb drive? In a place where disgruntled employees or Research Assistants can access or abscond with them?) to what data are even kept (it’s hard to re-use something, for good or ill, if you didn’t record it in the first place).

    This doesn’t just apply to ethically problematic topics, though. Things that may look like bad ideas at first (such as the ability of RAs to remove data from the lab) may not be bad in certain situations, just as ideas that seem good at first may come to seem bad later. The larger point is that ethical questions about both legitimate and illegitimate uses of research data need to be considered and addressed as they come up, and that DM decisions can help one to predict which questions are more likely than others to come up during the data lifecycle.

  • Talking about data management and ethics is more than talking about tools. Because data ethics questions arise from data management decisions, discussions about data ethics sometimes require at least a minimum level of technical understanding. Basic technical knowledge can help to answer questions beyond whether certain data should be kept at all, such as which ways of storing and offering access to those data would be acceptable, and it can steer the conversation toward more nuanced topics.

    The take-home message is not that lacking some amount of technical understanding disqualifies one from conversations about data ethics. Rather, making education on these topics easily accessible (here at the UO, for instance, through our own DM workshops, workshops from the College of Arts and Sciences Scientific Programming office, and resources such as the Digital Scholarship Center) is important and necessary, as is taking advantage of those opportunities.

  • Data Management (and, thus, data ethics) is about having a certain frame of mind. Data Management often has to do with thinking through decisions up-front rather than reacting to problems after they arise. The same frame of mind applies to data ethics: even if a particular ethical issue hasn’t come up yet, having good DM practices in place can help one to understand and respond more quickly when new issues do arise.

Resources for New Students (especially in Psychology)


Issues of data management are not going away; indeed, their relevance to individual researchers will likely increase — the White House, for example, recently issued new guidelines requiring Data Management Plans (and encouraging data sharing) for all federal grant-funded research. Below is a list of resources to prompt further thought and discussion among new grad students.

These are listed here with a focus on Psychology; having said that, many of them have relevance beyond the social sciences:

  • A useful summary of tools that are available for graduate students to organize their work (including data), from Kieran Healy, a Sociologist at Duke University.
  • An overview of the new “pre-registration” movement in Psychology: “Pre-registration” is when researchers register their hypotheses and methods before gathering any data. In addition to increasing transparency around research projects, this practice can increase how believable results seem, since it can decrease researchers’ incentives to go “fishing” for results in the data. This practice could also presumably be used to build a culture in which all aspects of a project, from methods to data, are shared.
  • Especially relevant for social scientists, a nice summary of several cases that deal with data management and the de-identification of data:
    • A summary of several cases in which de-identified data were able to be re-identified by other researchers, from the Electronic Privacy Information Center (EPIC)
    • A more nuanced, conceptual reply to (and criticism of) focusing on cases such as those in the summary above from EPIC. A take-home message from these readings is that data can sometimes be re-identified in very creative ways not immediately apparent to researchers. Other sites, such as this page from the American Statistical Association, summarize techniques that can be used in order to share sensitive data. Of special note is that “sensitive data” could, if that information were re-identified, include not only medically-related records, but even answers to survey questions about morality or political affiliations.
  • For students at the UO: Sanjay Srivastava, Professor of Psychology, often includes commentary on data analysis and transparency issues on his blog, The Hardest Science.

Feel free to comment here or email Brian with questions about data management issues. Also take a look at our main website for more resources.

Annotate, Annotate, Annotate

This post is part of a series on future-proofing your work (part 1, part 2). Today’s topic is annotating your work files.

Short version: Write a script to annotate your data files in your preferred data analysis program (SPSS and R are discussed as examples). This will let you save data in an open, future-proofed format, without losing labels and other extra information. Even just adding comments to your data files or writing up and annotating your analysis steps can help you and others in the future to figure out what you were doing. Making a “codebook” can also be a good way to accomplish this.


Annotating your files as insurance for the future…

Our goal today is simple: make it easier to figure out what you were doing with your data, whether one week from now, one month from now, or at any point down the road. Think of it this way: if you were hit by a bus and had to come back to your work much later, or had to have someone else take over for you, would your project come to a screeching halt? Would anyone even know where your data files are, what they represented, or how they were created? If you’re anxiously compiling a mental list of things that you would need to do for anyone else to even find your files, let alone interpret them, read on. I’m going to share a few tips for making this easier with a small amount of effort.

In the examples below, I’ll be using .sav files for SPSS, a statistics program that’s widely used in my home discipline, Psychology. Even if you don’t use SPSS, though, the same principles should hold with any analysis program.

Annotating within Statistics Scripts:

Commenting Code

Following my post on open vs. closed formats, you likely know that data stored in open formats stand a better chance of being usable in the future. When you save a file in an open format, though, sometimes certain types of extra information get lost. The data themselves should be fine, but extra features, such as labels and other “metadata” (information about the data, such as who created the data file, when it was last modified, etc.), sometimes don’t get carried over.

We can get around this while at the same time making things more transparent to future investigators. One way to do this is to save the data in an open format, such as .csv, and then to save a second, auxiliary file alongside the data to carry that additional information.

Here’s an example:


SPSS is a popular data analysis program used in the social sciences and beyond. It also uses a proprietary file format, .sav. Thus, we’ll start here with an SPSS file, and move it into an open format.

SPSS has a data window much like any spreadsheet program. Let’s say that a collaborator has given you some data that look like this:

[Screenshot: Example Data, shown in the SPSS Data View]

Straightforward so far? The data don’t look too complicated: 34 participants (perhaps for a Psychology study), with five variables recorded about each of them. But what does “Rxn1,” the heading of the second variable in the picture, mean? What about “Rxn2”? Does a Grade of “2” mean that the participant is in 2nd grade in school, or that the participant is second-rate in some sport, or something else?

“Ah, but I’ve thought ahead!” your collaborator says smugly. “Look at the ‘Variable’ view in SPSS!” And so we shall:

[Screenshot: Example SPSS Variable View]

SPSS has a “variable” view that shows metadata about each variable. We can see a better description of what each variable comprises in the “Label” column — “Rxn1,” we can now understand, is a participant’s reaction time on some activity before taking a drug. Your collaborator has even noted in the label for the opaquely-named “RxnAfterTx” variable how that variable was calculated. In addition, if we were to click on the “Values” cell for the “Grade” variable, we would see that 1=Freshman, 2=Sophomore, and so on.

It would have been hard to guess these things from only the short variable names in the dataset. If we want to save the dataset as a .csv file (which is a more open format), one of the drawbacks is that the information from this second screen will be lost. In order to get around that, we can create a new plain-text file that contains commands for SPSS. This file can be saved alongside the .csv data file, and can carry all of that extra metadata information.

In SPSS, comments can be added to scripts in three ways, following this guide (quoted here):

COMMENT This is a comment and will not be executed.

* This is a comment and will continue to be a comment until the terminating period.

/* This is a comment and will continue to be a comment until the terminating asterisk-slash */

We can use comments to add explanations to our new file as we add commands to it. This actually allows us to record more than the original SPSS file held: we can now store not only the labels, but also the rationale behind them, in the form of comments.

Now able to write comments, we can write and annotate a data-labeling script for SPSS like this:

/* Here you can make notes on your Analysis Plan, perhaps putting the date and your initials to let others know when you last updated things (JL, March 2014): */

/*
Perhaps here you might want to list your analysis goals, so that it’s clear to future readers:
1 Be able to figure out what I was doing.
2 Get interesting results.
*/

/* An explanation of what the code below is doing for anyone reading in the future: Re-label the variables in the dataset so that they match what’s in the example screenshot from earlier in this post: */

VARIABLE LABELS
Participant_ID "" /* The quotes here just mean that Participant_ID has a blank label */
Rxn1 "Reaction time before taking drug"
Rxn2 "Reaction time after first dose of drug"
Rxn3 "Reaction time after second dose of drug"
RxnAfterTx "Rxn1 subtracted from average of Rxn2 and Rxn3"
Grade "Participant's grade in school". /* The period ends the VARIABLE LABELS command */
EXECUTE. /* Tell SPSS to run the command. */

/* Now let’s add the value labels to the Grade variable: */

VALUE LABELS
Grade /* For the Grade variable, we'll define what each value (1, 2, 3, or 4) means. */
1 'Freshman'
2 'Sophomore'
3 'Junior'
4 'Senior'. /* The period ends the VALUE LABELS command */
EXECUTE. /* Tell SPSS to run the command. */

Now if we import a .csv version of the data file into SPSS and run the script above, SPSS will have all of the information that the .sav version of the file had.

While this is slightly more work than just saving in .sav format, by saving the data and the accompanying script in plain-text formats, we future-proof them for use by other researchers with other software (although researchers running other software won’t be able to directly run the script above, they will have access to your data and will be able to read through your script, allowing them to understand the steps you took). By saving using plain-text formats, we also make it easier to use more powerful tools, such as version control systems (about which I might write a future post).

By using comments in our scripts (whether they’re accompanying specific data files or not), we enable future readers to understand the rationale behind our analytic decisions. Every programming language can be expected to have a way to make comments. SPSS’ is given above. In R, you just need to add # in front of a comment. In MATLAB, either start a comment with % or use %{ and %} to enclose a block of text that you want to comment out. In whatever language you’re using, the documentation on how to add comments will likely be among the easiest to find.
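
For R users, here is a rough equivalent of the SPSS labeling script above, written as a commented plain-text R script. Treat it as a sketch: the filename example_data.csv is hypothetical, and storing variable labels in a "label" attribute is a common convention (used by packages such as haven) rather than something base R requires.

# Import the open-format data file (hypothetical filename):
dat <- read.csv("example_data.csv")

# Turn the numeric Grade codes into a labeled factor, the R equivalent of SPSS value labels:
dat$Grade <- factor(dat$Grade, levels = 1:4,
                    labels = c("Freshman", "Sophomore", "Junior", "Senior"))

# Record what each cryptic column name means:
variable_labels <- c(
  Rxn1 = "Reaction time before taking drug",
  Rxn2 = "Reaction time after first dose of drug",
  Rxn3 = "Reaction time after second dose of drug",
  RxnAfterTx = "Rxn1 subtracted from average of Rxn2 and Rxn3",
  Grade = "Participant's grade in school"
)

# Attach each label to its column as a "label" attribute:
for (v in names(variable_labels)) {
  attr(dat[[v]], "label") <- variable_labels[[v]]
}

As with the SPSS version, the comments carry the rationale, and both the .csv file and this script stay readable in any text editor.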

Other Approaches:

Another, complementary, approach to making data understandable in the future is to create a “codebook.” A codebook is a document that lists every variable in a dataset, as well as every level (“Freshman,” “Sophomore,” etc.) of each variable, and sometimes provides some summary statistics. It gives a printable summary of what every variable represents.

SPSS can generate codebooks automatically, using the CODEBOOK command. R can do the same, using, for example, the memisc package:

install.packages('memisc')
library('memisc')
?codebook # Look at the example given in this help file to see how the memisc package allows adding labels to, and generating codebooks from, R data frames.

We can also write a codebook manually, perhaps adding it as a block comment to the top of the annotation script. It might start with something like this:

/*

Codebook for this Example Dataset:

Variable Name: Participant_ID
Variable Label (or the wording of the survey question, etc.): Participant ID Number

Variable Name: Rxn1
Variable Label: Reaction time before taking drug
Variable Statistics:
Range: 0.18 to 0.89
Mean: .50
Number of Missing Values: 0

(etc.)

*/
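
If you would rather not depend on a package, a few lines of base R can generate a simple codebook along the lines of the manual example above. This is only a sketch: it assumes the "label" attribute convention and the dat data frame from the hypothetical R script earlier in this post.

# A minimal, package-free codebook printer:
print_codebook <- function(df) {
  for (v in names(df)) {
    cat("Variable Name:", v, "\n")
    lab <- attr(df[[v]], "label")
    if (!is.null(lab)) cat("Variable Label:", lab, "\n")
    if (is.numeric(df[[v]])) {
      cat("Range:", min(df[[v]], na.rm = TRUE), "to", max(df[[v]], na.rm = TRUE), "\n")
      cat("Mean:", round(mean(df[[v]], na.rm = TRUE), 2), "\n")
    } else {
      cat("Levels:", paste(levels(as.factor(df[[v]])), collapse = ", "), "\n")
    }
    cat("Number of Missing Values:", sum(is.na(df[[v]])), "\n\n")
  }
}

print_codebook(dat) # Print a codebook for the example data frame from earlier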

Wrapping Up

The use of annotations can save you and your collaborators time in the future by making things clear in the present. In a way, annotating your files (be they data, or code, or summaries of data analysis steps) is a way to accumulate scientific karma. Use these tips to do a favor for future readers, looking forward to a time in the future when you might be treated to the relief of reading over a well-documented file.

Saving Files for the Future

This is the second post in a series on future-proofing your work (part 1, part 3). Today’s topic is making a choice when clicking through the “Save As…” box in whatever program you use to do your work.

It pays to make sure that your files are saved in formats that you’ll be able to open in the future. This applies to data as well as manuscript drafts, especially if you often leave files untouched for months or years (e.g., while they’re being reviewed or are in press, but before anyone’s come along asking to see the data or for you to explain it).

Short version: “Open,” preferably plain-text file formats such as .csv, .txt, .R, etc., are better for long-term storage than “closed” formats such as .doc, .xls, .sav, etc. If in doubt, try to open a file in a plain-text editor such as Notepad or TextEdit — as a rule of thumb, if you can read the contents of the file in a program like that, you’re in good shape.


In general, digital files (from data to manuscript drafts) can be saved in two types of formats: open and closed.

  1. Open formats are those that
    a) can be opened anywhere, anytime, and/or
    b) have clear directions for how to build software to open them. A file saved in an open format could be saved in Excel but then just as easily be opened in Open Office Calc, SPSS, or even just a basic text editor such as Notepad or TextEdit. .txt, .R, .csv — if a file can be opened in a basic text editor (even if it’s hard to read when opened there), or has specifically been built to be openly understood by software developers (as with Open Office .odt, .ods, etc. files), you’re helping to future-proof your work.
  2. Closed or proprietary formats, on the other hand, require that you have a specific program to open them, or else reverse-engineer what that specific program is doing. SPSS .sav files, Photoshop .psd files, and, to some extent, Microsoft Office (.docx, .xlsx, etc.) files, among many others, are like this. How can you know if you’re using a proprietary file format? One rule of thumb is that if you try to open the file in a basic text editor and it looks like gibberish (not even recognizable characters), there’s a good chance that the file is in a closed format. This isn’t always the case, but it’s usually a good way to quickly check (1).
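
If you would like to automate that rule of thumb, here is a rough heuristic in R. It is only a sketch (the filenames are hypothetical, and real file-type detection is more involved): it reads the first bytes of a file and checks whether they look like printable text.

# Automating the "open it in a basic text editor" rule of thumb:
looks_like_plain_text <- function(path, n_bytes = 1000) {
  bytes <- as.integer(readBin(path, what = "raw", n = n_bytes))
  if (length(bytes) == 0) return(TRUE)  # An empty file is trivially "text"
  if (any(bytes == 0)) return(FALSE)    # Null bytes strongly suggest a binary format
  printable <- (bytes >= 32 & bytes <= 126) | bytes %in% c(9, 10, 13)
  mean(printable) > 0.95                # Mostly printable characters? Probably text.
}

looks_like_plain_text("mydata.csv") # Hypothetical files; expect TRUE here...
looks_like_plain_text("mydata.sav") # ...and FALSE here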

For an easy-to-reference table with file format recommendations, see our Data Management page on the topic.

It is important to note that even though R, for example, can read SPSS files, that doesn’t mean that SPSS files are “open.” They’re still closed; they have simply been reverse-engineered by the people who make R. SPSS, as a proprietary program using a proprietary file format, could change the format in its next version and break this reverse-engineering, or require that all users upgrade in order to open any files created in the new version.

So you’ve got a data file, and you’re willing to try out saving it in .csv format instead of .xlsx or .sav or whatever else your local proprietary vendor would suggest. Great! “But,” you say, “Will I lose any information when I re-save my data file? Will I lose labels? Will I lose analysis steps?” This, inquisitive reader, is an excellent question. In some cases, there is a trade-off between convenience now (using closed formats and some of the extra features they carry) vs. convenience later (finding that you can re-open a file that you created years ago with software that’s since upgraded versions or has stopped being developed).

In these types of cases, you could simply save a copy of your files in an open format periodically, and then keep on using the closed format that you’re more familiar with. Even doing something as simple as that could help you in the future.
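
As a concrete illustration, here is a small R sketch of such a periodic conversion. It assumes the haven package (one of several that can read SPSS files) and hypothetical filenames; haven stores each SPSS variable label in a "label" attribute on its column.

install.packages('haven') # One of several packages that can read .sav files
library('haven')

dat <- read_sav("mystudy.sav") # Read the proprietary file
write.csv(dat, "mystudy.csv", row.names = FALSE) # Save an open-format copy of the data

# Write the variable labels to a plain-text file so they aren't lost with the .csv:
labels <- vapply(dat, function(x) {
  lab <- attr(x, "label")
  if (is.null(lab)) "" else lab
}, character(1))
writeLines(paste(names(labels), labels, sep = ": "), "mystudy_labels.txt")

Even this quick conversion, run periodically, means that a future reader without an SPSS license can still open both the data and the labels. If you want to go a step further, however, read on in our next post, which will be published soon…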

Future-Proof as you go: Your future self will thank you

The fog over this beautiful mountainside obscures potential dangers. So it is with confusion and poorly-documented datasets.

This post introduces a series about ensuring that files for your projects will be usable in the future. Part 2 is on saving files for the future. Part 3 is on annotating statistics scripts and using data “codebooks.”

Perhaps everyone in the sciences has been there: the foggy expanse where questions abound and answers are only just at the tip of the tongue, where innovation and discovery are halted and fed to a ravenous beast called Frustration. Yes, as you may recognize from my description, this is the mental space that results from reading the work of a collaborator or, worse yet, your own material, months or years after a project has finished, trying to remember what you possibly could have meant when you saved your data file with the name “Good_One_Look_at_This_2bFINAL_final.sav” among a dozen or more other files with names both similar and dissimilar (“Good_Look_at_This_early_Final_Final.sav”? I can hear, echoing through the fog, “It seemed like a good idea at the time!”).

And it’s not just filenames — this foggy place is also where files saved in old and now-unusable versions of Word or Excel go to die, as well as files that, if you can open them, have such questionably-descriptive variable names as “Sam1”.

Curses of our past selves are the number 1 export of this frustrating, foggy expanse. And so I am writing to remind you: Be kind to your future self, and to the future selves of your collaborators. Future-proof as you go.

What does future-proofing your work mean? We’ll explore a few easy practices in a short series of upcoming posts.

Future-proofing can mean embedding notes in your work to help you and others remember weeks to years down the road what you were doing. It can even extend to making a quick check of what format you’re saving a file in (yes, the choice that you make in the pop-up “Save As…” box in Excel between .csv and .xlsx can make a difference to your future self!). In a series of posts to follow, I’ll be walking through some small changes that you can consider making to your workflow now in order to make things easier for you and your collaborators in the future. Join me there. (Links will be added to the top of this post as new installments become available.)

Elsevier protest — Faculty activism supporting open access

Many faculty members around the world — about 2900 as of this moment — have in the past few days signed a pledge to boycott Elsevier, a major commercial academic publisher. The list of signatories to date is impressive and growing quickly; you can see the current list at “The Cost of Knowledge” website, or use that site to add your own name if you wish.

The signatories have committed not to publish, referee, or do editorial work for Elsevier journals “unless they radically change how they operate.”

Concerns with Elsevier mirror some that many UO librarians also feel — that Elsevier has been at the forefront of rapid increases in journal pricing; has aimed anti-competitive practices such as “bundling” at libraries; monopolizes access to a large segment of the last 90 years of scholarly publishing; has lobbied in favor of bills such as SOPA and the “Research Works Act,” which would repeal the NIH Public Access Policy; and in general maintains a large number of small policies antithetical to author rights and to public access to research.

My own impression is that of all major academic publishers Elsevier is trying the hardest to kill disciplinary and institutional online repositories such as PubMed Central or the arXiv or our own Scholars Bank.

The new pledge, which was launched in a posting a week ago by Mathematics Fields Medalist Timothy Gowers, seems to be setting a new standard for faculty revolt against publishers whose practices are seen as focused on profit to the detriment of improved public access to scientific information.

Interested in more perspective? “Elsevier Publishing Boycott Gathers Steam”  and “As Journal Boycott Grows, Elsevier Defends Its Practices” in this week’s Chronicle of Higher Education are quite good.  Forbes also has an interesting economic analysis at “Elsevier’s Publishing Model Might be About to Go Up in Smoke.”

The UO doesn’t take a position on whether our faculty should sign this pledge. However, the UO Senate does have a formal recommendation (resolution passed in 2008) that all UO faculty authors include an author’s addendum as part of any copyright transfer. Such an addendum would let the original author retain rights that Elsevier clearly doesn’t want academic authors to keep, including the right to reuse your own work or to deposit a copy of the work in an online repository.


[Originally posted by JQ Johnson, Director, Scholarly Communications, UO Libraries]

Panton Fellowships for scientists to promote open data in science

From the Open Knowledge Foundation:

Dear all,

The OKFN is delighted to announce the launch of the Panton Fellowships!

Funded by the Open Society Institute, two Panton Fellowships will be awarded to scientists who actively promote open data in science.

The Fellowships are open to all, and would particularly suit graduate students and early-stage career scientists. Fellows will have the freedom to undertake a range of activities, which should ideally complement their existing work. Panton Fellows may wish to explore solutions for making data open, facilitate discussion, and catalyse the open science community.

Fellows will receive £8k p.a. Prospective applicants should send a CV and covering letter to jobs[@]okfn.org by Friday 24th February.

Full details can be found at http://pantonprinciples.org/panton-fellowships/. You can also see our blog post at http://blog.okfn.org/2012/01/25/panton-fellowships-apply-now/.

Please do feel free to circulate these details to interested individuals and appropriate mailing lists!

Kind regards,
Laura


Laura Newman
Community Coordinator
Open Knowledge Foundation
http://okfn.org/
Skype: lauranewmanonskype

copyright conflicts and online open access

If you follow copyright and intellectual property legislation, you might get the impression that universities and libraries are under attack from the publishing industry. Recently there has been a series of lawsuits and bills designed to strengthen content publishers at the expense of authors and universities.

One awful example is SOPA (and its companion bill in the Senate, PIPA), which attempts to combat online “piracy,” may pass, and, if it does, will have very negative consequences for the stability of the Internet, since it undermines the domain name system on which the Internet depends. Many of us noted and supported the Internet blackout earlier this week, which seems to have had an effect in getting widespread attention for the problem. The UO’s congressional delegation is firmly opposed to PIPA and SOPA, but if you have colleagues at other institutions it would never hurt to alert them to the issues.

Another such bill is H.R. 3699, which would undo the progress that the National Institutes of Health have made in making taxpayer-funded research publicly available. H.R. 3699 has started to generate strong reactions from the academic community. For example, there was an excellent op-ed piece in the New York Times last week by Michael Eisen — see “Research Bought, Then Paid For” (http://www.nytimes.com/2012/01/11/opinion/research-bought-then-paid-for.html#). In stark contrast, a recent OSTP request for information solicits advice that could result in extending the NIH public access mandate to more federal agencies. The UO filed a response discussing some of the benefits of improved public access; you can read it at https://scholarsbank.uoregon.edu/xmlui/handle/1794/11810

Yet a third interesting bill is H.R. 3433 (aka the GRANT Act), which, nominally in the name of public access to information, would in fact have serious consequences for NSF funding: it would mandate public disclosure of the names of reviewers (bye bye blind peer review) and would publish all grant applications (bye bye competitive advantage, as foreign countries jump on the bandwagon based on which NSF grants are funded, before our researchers have a chance to do the research).

I’m happy to report that our Office of Public and Government Affairs (Betsy Boyd) is on top of all of these.  She writes “There are several bills, as JQ noted, that threaten innovation and open access in the name of transparency and open access.”

We live in interesting times.

[Originally posted by JQ Johnson, Director, Scholarly Communications, UO Libraries]

GigaScience — open access journal

In the UO Science Library blog I try to highlight new open access science journals as I hear about them. If one looks relevant to innovations in visualizing, preserving, presenting, or sharing data, I’ll also include it here.

Here’s a recently launched journal that is focused on “big data” studies, and will be waiving the article processing charge for all articles published during the first year.

GigaScience aims to revolutionize data dissemination, organization, understanding, and use. An online open-access open-data journal, we publish ‘big-data’ studies from the entire spectrum of life and biomedical sciences. To achieve our goals, the journal has a novel publication format: one that links standard manuscript publication with an extensive database that hosts all associated data and provides data analysis tools and cloud-computing resources.

Our scope covers not just ‘omic’ type data and the fields of high-throughput biology currently serviced by large public repositories, but also the growing range of more difficult-to-access data, such as imaging, neuroscience, ecology, cohort data, systems biology and other new types of large-scale sharable data.

via GigaScience.

E-Science Reading List

Deb Carver (Dean of UO Libraries), John Conery (Professor, Computer and Information Science), Lynn Stearney (Director of Grants, UO Foundation), Sean Sharp (Research and Instructional Technology, Campus Information Services), and I are participating in the ARL/DLF E-Science Institute.

The following is a list of readings compiled by the institute staff, and organized by topic. It’s pretty comprehensive, and may be helpful if you’re interested in gaining some background in these topics.

————————————————-

Readings Organized by Topic Area

July, 2011

Data deluge

Chicken or the egg? Did e-Science cause the so-called data deluge, or is e-Science a response to this phenomenon? The early e-Science funding initiatives in the U.K. at the beginning of the last decade targeted projects where data infrastructure was integral to managing the massive flow of data from digital instrumentation. This focus anticipated the ever-expanding use of digital instrumentation across research domains and, indeed, within society. The following articles take stock of the data deluge phenomenon. Regardless of whether e-Science has caused or has been a response to this phenomenon, a basic understanding of the massive production of digital data helps focus on the variety of issues related to managing research data.

  • The data deluge and Data, data everywhere – The February 25, 2010 issue of the Economist included an article about the general deluge of data in society (the former link) and a technology pull-out section on overall characteristics of data (the latter link). These are particularly informative articles from the popular press.
  • The end of theory: the data deluge makes scientific method obsolete – This June 23, 2008 story from Wired Magazine by Chris Anderson reflects a Google perspective on making discoveries through huge amounts of data. In other words, with enough data and the right algorithm, you don’t need a scientific model. He suggests a scientific method based on computationally derived patterns from massive data collections that doesn’t require models to test. What makes this method work are petabytes of data.
  • The coming data deluge – This short opinion piece from IEEE Spectrum within Technology introduces a number of words appearing in our language because of the data deluge.  For example, the author makes reference to data scientists.
  • Data – The February 11, 2011 issue of Science was dedicated to the challenges and opportunities arising from the data deluge in research. This is an excellent compendium presenting perspectives on “the increasingly huge influx of research data” from a variety of scientific fields.

Data-driven science

The following trilogy provides a solid introduction to current thinking around data-driven science (or, more generally, data-driven research). The first title is an anthology describing the emergence of data-driven science. The chapter by Jim Gray on e-Science: A Transformed Scientific Method, which was reproduced from a presentation in January 2007, serves as the framework for the other authors, who provide examples of data-driven science in various disciplines. The second title is from the U.S. Interagency Working Group on Digital Data, representing key U.S. agencies involved in scientific research. Working from a set of data principles that they developed, this report outlines a strategic vision around scientific data for U.S. federal agencies. The third title is a report to the European Commission from the High Level Expert Group on Scientific Data. This report provides a useful public statement about the value of scientific data to society and espouses a vision for data in 2030.

  • The Fourth Paradigm: Data-Intensive Scientific Discovery  – This collection of essays from Microsoft Research is a tribute to Jim Gray and his ideas about data-driven science.
  • Harnessing the Power of Digital Data for Science and Society – This document includes a set of principles by which federal scientific agencies should manage the data they produce. There is an excellent appendix on the roles for organizations and individuals.
  • Riding the Wave: How Europe can gain from the rising tide of scientific data – Released in October 2010, this report establishes a strong case for European developments in research data infrastructure over the next several years. The second chapter uses a variety of scenarios that express the value proposition for investing in data infrastructure. The third chapter describes challenges that have to be overcome in building new data infrastructure (which the authors interchangeably call “scientific e-infrastructure”). The fourth and fifth chapters present a vision for 2030 and a call for action, respectively. This 38-page publication is an excellent follow-up to The Fourth Paradigm: Data-Intensive Scientific Discovery, which was released in 2009.
  • Science Magazine, Special Online Collection: Dealing with Data (Feb 11, 2011) – Issue devoted to challenges with scientific research data, introducing many key ideas in different scientific disciplines.

Data Curation

The life cycle management of information is fundamental to understanding digital curation, for it is the stewardship and management of digital objects across the life cycle that determines the activities of digital curation. Similarly, the essence of data curation is defined by the context of the research life cycle (see the class glossary for a definition of the research life cycle). The management of research data spans the research life cycle, consisting of the many activities related to the design, production, manipulation, analysis, and preservation of the data itself and its supporting metadata. The stewardship of research data ensures that responsibilities for all data and metadata activities across the life cycle are assigned, understood, and carried out. It is the combination of the activities of research data management and the responsibilities of data stewardship over the research life cycle that embodies data curation. The following articles introduce data curation and its supporting concepts, beginning with an article by Anna Gold that provides an overview of data curation, tracing the evolution of the concept and its current state of development.

Data Curation

The Data Life Cycle

  • JISC Research Lifecycle diagram – JISC (which historically stood for the Joint Information Systems Committee in the UK, but which is now simply known as JISC) employs a life cycle diagram to describe the support the organization provides to researchers across the stages of the research life cycle. This brief, succinct representation shows two interrelated cycles making up an overall research life cycle: one cycle consists of the stages associated with knowledge management and scholarly communications, while the other has stages making up the research process.
  • Curation Lifecycle Model – The UK Digital Curation Centre provides an online representation of a life cycle model depicting stages in curating and preserving data from a digital records management perspective.
  • The data life cycle is mentioned in some of the above readings, including pages 8 and 9 of Harnessing the Power. The entry for data life cycle in the class glossary also links to an article describing characteristics of the research life cycle model.
  • e-Science and the Life Cycle of Research – by Charles Humphrey, June, 2008. Brief introduction to the research life cycle (also linked from the glossary of key terms and concepts).

Research Data Management

Data Stewardship

Research Libraries, Data and e-Science

Many research libraries have been involved over the past decade and a half in developing digital collections, in producing digital content through digitization projects, and in preserving digital content through institutional repositories. More recently, and in conjunction with the emergence of data-driven science, the inclusion of research data in digital collections has become a focus of many libraries. Some of the following readings explore the retooling that libraries face to incorporate research data into their digital collections. Other readings provide case studies about how some libraries are addressing e-Science and research data. Some of this work can be done within an institution, and several of the case studies present local approaches to building research data collections and providing e-Science data services. However, support for e-Science and research data will increasingly require cross-institutional collaboration among libraries. A typical e-Science project tends to consist of a large research team whose researchers are from different universities, come from a variety of disciplines, and are located in institutions from around the globe. Examples include the teams of physicists working with the Large Hadron Collider and the international teams of scientists conducting research under the banner of the International Polar Year. They work together through shared technology that generates massive volumes of data and supports its storage and processing through a distributed high-speed network. No single research library has the capacity to respond to such large-scale projects, which challenges libraries to find new ways to collaborate around e-Science research data. The infrastructural requirements alone to ingest, manage, preserve, and provide access to large-scale research data are an impetus for libraries to collaborate.

Retooling

  • Retooling Libraries for the Data Challenge  – In this concise article, Dorothea Salo reviews pertinent characteristics of research data, digital libraries and institutional repositories in proposing ways in which libraries can address the data challenge.
  • Long-Lived Digital Data Collections: Enabling Research and Education in the 21st Century – This is a National Science Board 2005 report.
  • Agenda for Developing E-Science in Research Libraries – This November 2007 report contains recommendations about e-Science to the Scholarly Communication Steering Committee, the Public Policies Affecting Research Libraries Steering Committee, and the Research, Teaching, and Learning Steering Committee.
  • To Stand the Test of Time: Long-term Stewardship of Digital Data Sets in Science and Engineering – This 2006 ARL report contains the results of an NSF-funded workshop to compose an agenda for research data infrastructure in science and engineering.
  • Skilling Up to Do Data: Whose Role, Responsibility, Career? – This 2009 IJDC article by Graham Pryor and Martin Donnelly looks at data curation roles and skills in the UK and proposes a framework for skills development in data management.
  • Steps Toward Large-Scale Data Integration in the Sciences: Summary of a Workshop  - This report “summarizes a 2009 National Research Council workshop to identify some of the major challenges that hinder large-scale data integration in the sciences and some of the technologies that could lead to solutions. The workshop examined a collection of scientific research domains, with application experts explaining the issues in their disciplines and current best practices. This approach allowed the participants to gain insights about both commonalities and differences in the data integration challenges facing the various communities. In addition to hearing from research domain experts, the workshop also featured experts working on the cutting edge of techniques for handling data integration problems. This provided participants with insights on the current state of the art. The goals were to identify areas in which the emerging needs of research communities are not being addressed and to point to opportunities for addressing these needs through closer engagement between the affected communities and cutting-edge computer science.”
  • “The Shape of the Scientific Article in the Developing Cyberinfrastructure” – This report by Cliff Lynch discusses how “E-science represents a significant change, or extension, to the conduct and practice of science. This article speculates about how the character of the scientific article is likely to change to support these changes in scholarly work. In addition to changes to the nature of scientific literature that facilitate the documentation and communication of e-science, it’s also important to recognize that active engagement of scientists with their literature has been, and continues to be, itself an integral and essential part of scholarly practice; in the cyberinfrastructure environment, the nature of engagement with, and use of, the scientific literature is becoming more complex and diverse, and taking on novel dimensions.”

Case Studies

  • E-Science and Data Support Services: A Study of ARL Member Institutions – This 2010 ARL report by Soehner, Steeves & Ward reviews the different approaches libraries are taking toward e-Science and data support services. Six institutional case studies are also provided.
  • Data Sharing, Small Science, and Institutional Repositories (post-print) – This 2010 article by Cragin, Palmer, Carlson and Witt in Philosophical Transactions of the Royal Society A contains results of the Data Curation Profiles research project done by UIUC and Purdue on how faculty view and practice data sharing.
  • Librarian Roles in Institutional Repository Data Set Collecting: Outcomes of a Research Library Task Force (access requires subscription) – This 2011 article by Newton, Miller and Bracke in Collection Management describes the Purdue Libraries task force charged with building faculty-produced collections for a data repository prototype. The project developed an inventory, characterized the resources and skills required of the libraries and their data-collecting librarians, and explored the roles and activities of librarians identified during the project.
  • Determining Data Information Literacy Needs: A Study of Students and Research Faculty (access requires subscription) – This 2011 article by Carlson, Fosmire, Miller and Sapp-Nelson in portal: Libraries and the Academy describes how “researchers increasingly need to integrate the disposition, management, and curation of their data into their current workflows. However, it is not yet clear to what extent faculty and students are sufficiently prepared to take on these responsibilities. This paper articulates the need for a data information literacy program (DIL) to prepare students to engage in such an e-research environment. Assessments of faculty interviews and student performance in a geoinformatics course provide complementary sources of information, which are then filtered through the perspective of ACRL’s information literacy competency standards to produce a draft set of outcomes for a data information literacy program.”
  • Data Curation Program Development in U.S. Universities:  The Georgia Institute of Technology Example – This 2011 article by Walters in The International Journal of Digital Curation presents GT’s data curation program development. The main characteristic is a program devoid of top-level mandates and incentives, but rich with independent, “bottom-up” action. The paper addresses program antecedents and context, inter-institutional partnerships that advance the library’s curation program, library organizational developments, partnerships with campus research communities, and a proposed model for curation program development.
  • “Data Services for the Sciences: A Needs Assessment” – This 2010 article by Westra in Ariadne describes scientific research data management as “a fluid and evolving endeavour, reflective of the high rate of change in the information technology landscape, increasing levels of multi-disciplinary research, complex data structures and linkages, advances in data visualisation and analysis, and new tools capable of generating or capturing massive amounts of data. These factors create a complex and challenging environment for managing data, and one in which libraries can have a significant positive role supporting e-science. A needs assessment can help to characterise scientists’ research methods and data management practices, highlighting gaps and barriers, and thereby improve the odds for libraries to plan appropriately and effectively implement services in the local setting.”
  • The Cornell University Library (CUL) Data Working Group (DaWG) report – This 2008 report contains five recommendations from the Data Working Group detailing how the Cornell University Library could engage in data curation.  Included within these recommendations is a set of services that could be provided to researchers and local infrastructure and policies needed to sustain these services.
  • Responding to the Call to Curate: Digital Curation in Practice at Penn State University Libraries (pre-print) – This 2011 article by Hswe, Furlough and Giarlo in The International Journal of Digital Curation presents how Pennsylvania State University Libraries established a Content Stewardship program for the university, describing the planning and staffing needed for its implementation. They specifically address the challenges of starting and sustaining a stewardship services program.