Michigan State University

Processing your data

If you planned well for the data collection phase, those early decisions will likely save you time and effort when you process your data. Data is processed for a number of reasons. Perhaps your raw data contains sensitive information, such as identifiable demographics or other private details. Maybe the data is in a proprietary format and needs to be exported to another format. It could be that your data set cannot answer your research question without extensive computation. Each of these cases involves processing or changing your data in some way. Data processing might include:

  • summarizing data (e.g. word frequency analysis on a large text corpus)
  • aggregating data (e.g. joining images with geospatial coordinates)
  • validating and quality-checking data (e.g. flagging possible outliers or errors)
  • cleaning data (e.g. removing outliers or desensitizing private records)
  • tabulating, analyzing, or mining data (e.g. recasting data into a summarized or altered set)
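
As a rough illustration of two items in the list above, the sketch below summarizes a text by word frequency and flags possible outliers in a list of numbers. The function names, the word pattern, and the default threshold of two standard deviations are assumptions chosen for the example, not recommendations from this guide; your own project may call for different rules.

```python
import re
import statistics
from collections import Counter


def word_frequencies(text):
    """Summarize a text by counting how often each lowercase word appears."""
    # Assumption: a "word" is a run of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


def flag_outliers(values, threshold=2.0):
    """Return the positions of values far from the mean.

    A value is flagged when it lies more than `threshold` sample
    standard deviations from the mean. Needs at least two values.
    """
    mean = statistics.mean(values)
    spread = statistics.stdev(values)
    if spread == 0:
        return []  # all values are identical, so nothing stands out
    return [i for i, v in enumerate(values) if abs(v - mean) / spread > threshold]
```

Note that `flag_outliers` only reports positions and leaves the raw values untouched, so the decision to remove or keep a flagged value can be made, and documented, as a separate cleaning step.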

Plan for it!

An effective data management plan will describe any extensive processing that your data will undergo during research. It is important to indicate how you will document such changes to your data. Often, events that change data can be recorded in metadata fields, but sometimes it is more effective to document changes fully in supplemental text files. A very large research project may use modeling diagrams to communicate these changes to team members. A smaller project may follow a research charter that outlines the processing activities.
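
One possible way to keep a supplemental text file of changes is sketched below. It appends one tab-separated line per processing step to a plain-text log stored next to the data. The function names, the log layout, and the choice of UTC timestamps are hypothetical conventions for illustration; a project may prefer metadata fields or another format entirely.

```python
from datetime import datetime, timezone
from pathlib import Path


def format_log_entry(step, description, files, when=None):
    """Build one tab-separated line describing a change made to the data."""
    # Use the current time unless a specific time is supplied, then express it in UTC.
    when = (when or datetime.now(timezone.utc)).astimezone(timezone.utc)
    timestamp = when.strftime("%Y-%m-%dT%H:%M:%SZ")
    return "\t".join([timestamp, step, description, ", ".join(files)])


def record_change(log_path, step, description, files):
    """Append a log line to a plain-text file kept alongside the data."""
    with Path(log_path).open("a", encoding="utf-8") as log:
        log.write(format_log_entry(step, description, files) + "\n")
```

A call such as `record_change("processing_log.txt", "clean", "Removed 3 duplicate rows", ["survey.csv"])` would add one line to the log; the file name and the step label here are placeholders.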

Workflow Documentation

Methods:

Methodology documentation may take the form of a text document and should record the personnel, methods, transactions, and other particulars of your research process. A data management plan does not need this level of detail, but it should indicate how the researchers will document their workflow. Documenting each step is good practice and can save time, because workflow descriptions often find their way into the methods section of research papers. Keep this file in the same directory as your data.

Processing:

Processing documentation should be more granular than your methodology documentation, and may include code commentary, example scripts, or example inputs and outputs. A common way to document data processing is to comment the code in-line and to include those scripts as a component of your project documentation. A data management plan should indicate how code or scripts will be described in the project documentation.
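
To show what in-line commentary with an example input and output might look like, here is a minimal sketch of a script that desensitizes private records. The field names, the `participant_id` key, and the salted-hash approach are all assumptions made for the example. Replacing identifiers with a hash is a form of pseudonymization and may not be sufficient protection for every data set.

```python
import hashlib

# Hypothetical column names; adjust these to match your own data set.
IDENTIFYING_FIELDS = ("name", "email")


def desensitize(record, salt):
    """Return a copy of `record` with identifying fields replaced by a pseudonym.

    Example input:  {"name": "Ada", "email": "ada@example.org", "age": 36}
    Example output: {"age": 36, "participant_id": "<12 hex characters>"}
    """
    # Step 1: build a stable pseudonym, so the same person receives the
    # same ID in every file processed with the same salt.
    identity = "|".join(str(record.get(field, "")) for field in IDENTIFYING_FIELDS)
    digest = hashlib.sha256((salt + identity).encode("utf-8")).hexdigest()

    # Step 2: drop the identifying fields and keep everything else unchanged.
    cleaned = {key: value for key, value in record.items() if key not in IDENTIFYING_FIELDS}

    # Step 3: attach the shortened pseudonym in place of the removed fields.
    cleaned["participant_id"] = digest[:12]
    return cleaned
```

The comments describe why each step exists, not only what the line does, which is the level of detail that tends to be useful when the script is later read as part of the project documentation.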