This is a list of best practices and a compilation of the most common mistakes when we collect, store, and deliver data for further analysis and visualization. It’s inspired and broadly based on the excellent paper Tidy Data (http://www.jstatsoft.org/v59/i10) by Hadley Wickham as well as our own experience at Interactive Things while working with clients on data-driven design projects.
- Each variable forms a column.
- Each observation forms a row.
- Each type of observational unit forms a table.
- Column headers are values, not variable names.
- Multiple variables are stored in one column.
- Variables are stored in both rows and columns.
- Multiple types of observational units are stored in the same table.
- A single observational unit is stored in multiple tables.
- Format data standardized and consistent (like dates and currencies).
- Keep table headers standardized and consistent (lowercase, underscored).
- Keep file names standardized and consistent.
- Deliver your spreadsheets as CSV files instead of XLS files.
- Deliver your spreadsheets as UTF-8 formatted.
- Introductory and other unnecessary text.
- Missing column header.
- Multiple column header lines.
- Comments in columns where values should be.
- Formatting on text and background of cells.
- Empty columns and rows for layout purposes.
- Connected cells between spreadsheets.
- Formulas to transform values in cells.
- Auto-formatting of values in cells (like dates and currencies).
- Hidden columns (either show them when necessary or delete them).
- Multiple tables on one spreadsheet.
- Wickham, H. Tidy Data (2014). Journal of Statistical Software, Vol. 59, Issue 10, Sep 2014. http://vita.had.co.nz/papers/tidy-data.pdf
- Tableau Software. Preparing Excel Files for Analysis (05 Dec 2014). http://kb.tableau.com/articles/knowledgebase/preparing-excel-files-analysis