Many of us use spreadsheets to record, manage, and share data, but the ease of working with spreadsheets can mask some important considerations for your data. Below are some of the more common mistakes people make with their tabular data, and how to avoid them.
This material is based on “Formatting Problems” by Christie Bahlai and Aleksandra Pawlik via datacarpentry.org.
- Multiple Tables:
Problem: Researchers commonly use one spreadsheet to organize multiple data tables. While this might be a convenient strategy during data collection, having multiple tables in the same spreadsheet confuses most data management systems. Having more than one table in a spreadsheet may imply false associations between datasets and could cause issues in data preservation in the future.Solution: Do not put more than one table in a single spreadsheet - Multiple Tabs:
Problem: Another logical solution to data organization is using several tabs in the same document to store data. Some researchers like to use separate tabs to distinguish between data characteristics (e.g. date of collection, region of collection, species type, etc.). This may seem appealing, but using this method to organize data puts you at risk for introducing inconsistencies into your collection or failing to see associations between your data.Solution: Instead of adding another tab to your document, consider adding another column to your data table. For example, instead of starting a new tab for every day of data collection, simply create a “Date of Sample” column and keep all your data on the same sheet. - Not Filling in Zeroes:
Problem: When collecting data some researchers like to leave data cells blank when they measure a value of zero. This may seem convenient, especially when you encounter multiple zero values, but many data systems recognize a value of zero as actual data while blank cells are interpreted as the absence of data. Leaving cells blank that should read “zero” can lead to issues with data calculations and preservation.Solution: Do not leave cells that should have a value of zero blank. Just enter the value of “0”. - Using Bad Null Values:
Problem: Using numerical values (9999, -9999, 0) to represent missing or null values can often confuse your data management system.Solution: Consult the image below from White et al, 2013, Nine simple ways to make it easier to (re)use your data. Ideas in Ecology and Evolution. to decide the best way to denote null values in your data. - Using Formatting to Convey Information:
Problem: Some researchers like to use highlighting, bolding, different fonts, or variation is font size to indicate important information about their data. Unfortunately, stylistic formatting is not usually recognized by data programs so any formatting information may get lost over time and reuse.Solution: Instead of adding special formatting features to convey information, create a new field or column and code the data appropriately. For example, if you wanted to indicate whether a sample was collected in March or April you could add a “Month” column and input March or April instead of highlighting the individual data to visually mark the categories. - Using Formatting for Visual Appeal:
Problem: Sure, merging two cells may look good, but merging cells could cause the computer to miss (or falsely identify) associations in the data.Solution: Do not merge cells. Consider re-arranging your data if needed, but do not merge cells. - Including Comments or Units in Cells:
Problem: While organizing their data, researchers may be tempted to include notes or unit measurements in data cells. This may seem practical, but putting extraneous information in your data cells can cause problems for data calculation or analysis software.Solution: Instead of making comments in your data spreadsheets, keep a readme file or some sort of meta data document. To avoid including units in cells, make sure that your headings indicate the unit of measurement for all data in that column. You should not be using multiple units of measurement in the same column. - More Than One Piece of Information in a Cell:
Problem: Sometimes including multiple pieces of information in a single cell makes sense for data organization purposes. For example, including both the city and state in one cell like “Eugene, OR” may seem logical, but including spaces or commas in data cells can cause problems with data analysis software as commas and spaces often indicate special formatting information.Solution: Do not put more than one piece of information in a single cell. Instead create multiple columns to accommodate additional information. - Field Name Problems:
Problem: Choosing heading names that are too long, complicated, have spaces between words, include uncommon abbreviations, or that might not make sense in 6 months may make future data use complicated and frustrating.Solution: Use succinct, clear heading names that explain the field without being too wordy, complex, or nuanced. Example: use a convention such as “Max_temp” over “Maximum Temperature in Degrees Celsius.” - Special Characters in Data:
Problem: Sometimes researchers use Excel as a word processor and include special characters such as symbols for temperature. Excel and other data systems cannot process special characters.Solution: Avoid using special characters when possible. - Inclusion of Metadata in Data Table:
Problem: Including metadata with your data is important, but metadata should not be included in the data spreadsheet. Including non-data information in your data spreadsheet can cause problems with electronic data systems.Solution: Keep a metadata or readme file that corresponds with your data, but keep it separate from your data spreadsheet.