Collect the data
Once you have a working concept of the type and size of data you will produce, you can plan for its collection. At this stage it is important to make decisions about your methodology and workflow, data format and fidelity, and project documentation and metadata, because as you collect data you must fix it in a digital format. Research projects use various formats for different purposes, so this original format should be your highest-quality sample. A common misconception is that the only acceptable archival data is "raw data"; oftentimes raw data needs further processing to put it into its most useful state. For example, some instruments produce data in a proprietary format that would be of no use to another researcher without the appropriate software to analyze it. It is therefore advised that you collect (and later, archive) data in a format that is non-proprietary, unencrypted, uncompressed, open standard, and commonly used. These original data can then be transformed into appropriate formats for analysis and use.
Plan for it!
It is important to plan for the data collection phase to ensure that your research project has collected the highest quality data and that your data can be reinterpreted from project documentation.
Standards
File Naming Standards:
By using file naming conventions you can avoid much of the confusion that occurs when managing a large file set. File naming conventions avoid ambiguity by producing human-readable file names. This practice is essential to a versioning system, where different versions of files are stored simultaneously. Use a directory structure that groups data by project name or grant number, with sub-directories that group data by event or process. Use a file naming convention that provides human-readable context, a date, and a version number. Indicate these conventions in your data management plan, and stick to them! Below are a few examples of one way you might name your files and directories.
- Directory Structure Examples: /[Project]/[Grant Number]/[Event]/[Date]
- /text_analysis/DH_10011_11/encoding/20110117
- /Euphausia_superba/SEM/20110117
- /grand_rapids/20110117
- File Naming Examples: [description]_[instrument]_[location]_[YYYYMMDD].[ext]
- borges_collation_20110117.xml
- krillMicrograph_backscatter_58C5392071359673_20110117.tif
- cityData_20110118.xls
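A convention like the one above can also be enforced in code. Below is a minimal Python sketch; the function names and the validation pattern are illustrative, not part of any standard tool:

```python
import datetime
import re

def make_filename(description, date, ext, instrument=None, location=None):
    """Build a name of the form
    [description]_[instrument]_[location]_[YYYYMMDD].[ext],
    omitting any optional parts that are not supplied."""
    parts = [description]
    if instrument:
        parts.append(instrument)
    if location:
        parts.append(location)
    parts.append(date.strftime("%Y%m%d"))  # standard YYYYMMDD date
    return "_".join(parts) + "." + ext

def is_valid(filename):
    """Check that a name is underscore-delimited and ends in an
    8-digit date before its extension."""
    pattern = r"^[A-Za-z0-9]+(_[A-Za-z0-9]+)*_\d{8}\.\w+$"
    return re.match(pattern, filename) is not None

name = make_filename("borges_collation", datetime.date(2011, 1, 17), "xml")
```

A check like `is_valid` can run as part of an ingest script so that nonconforming names (such as "final version.xml") are caught at collection time rather than at archiving time.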
For more examples and some great tips, check out these sections of the University of Wisconsin and University of Minnesota's data management guides.
Format Standards:
When choosing a format for your data you need to consider later lifecycle activities. What format will you use to analyze your data? What format will you use to preserve your data? At the data collection stage it is important to choose a format that is uncompressed and flexible enough to enable the widest use. Flexibility comes from choosing a standard format that is widely used either in your domain or at large. At the archival stage, it is also important that your file format is well documented, unencrypted, and non-proprietary. Of course, a file format that meets all of these requirements is one that can be used during both collection and archiving.
- Non-proprietary
- Open, documented standard
- Common usage by research community
- Standard representation (ASCII, Unicode)
- Unencrypted
- Uncompressed
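As a concrete sketch (the file name and values are illustrative), tabular data can be fixed in a format that meets every criterion above using nothing but the Python standard library, by writing plain UTF-8 CSV rather than a proprietary spreadsheet format:

```python
import csv

rows = [
    {"site": "grand_rapids", "date": "20110117", "count": 42},
    {"site": "grand_rapids", "date": "20110118", "count": 37},
]

# CSV is non-proprietary, openly documented, unencrypted,
# uncompressed, and readable with nothing more than a text editor.
with open("city_data_20110118.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["site", "date", "count"])
    writer.writeheader()
    writer.writerows(rows)
```

The same data saved as .xls would require specific software to open; the CSV copy will still be interpretable decades from now.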
For more examples check out this section of the MIT data management guide.
Metadata Standards:
Data is a representation of information that can be reinterpreted by experts (machine or human); data that cannot be reinterpreted is much less useful. Many disciplines have standardized the description of data so that it can be interpreted however disparate the actual collections may be. The biggest help you will find from your field is standards for data formats (data standards) and metadata descriptions (metadata standards). In some cases it is more efficient to write your own metadata guidelines, and in other cases a formal metadata standard might be considered overkill: a simple text document that describes the title, creator, contributor, and rights of your data might be sufficient for your project.
- General Standards
- Dublin Core (generic)
- PBCore (public broadcasting & media)
- DataCite
- Science Metadata Standards
- Darwin Core (life sciences)
- FGDC Federal Geographic Data Committee standard (geosciences)
- EML Ecological Metadata Language (ecology)
- Social Science Metadata Standards
- Data Documentation Initiative (numeric data)
- Humanities Metadata Standards
- CIDOC CRM (cultural heritage data)
- TEI: Text Encoding Initiative (textual data)
- CDWA: Categories for the Description of Works of Art (works of art and material culture)
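Even the simple text-document approach mentioned above can follow a standard vocabulary. The sketch below writes a minimal record whose field names come from the Dublin Core element set; the record itself and its values are purely illustrative:

```python
# A minimal Dublin Core-style record, kept as a plain-text "readme"
# that travels alongside the data it describes. Field names follow
# the Dublin Core element set; the values are illustrative.
record = {
    "Title": "Krill micrograph backscatter series",
    "Creator": "Example Researcher",
    "Contributor": "Example Lab",
    "Date": "20110117",
    "Format": "image/tiff",
    "Rights": "CC BY 4.0",
}

lines = [f"{field}: {value}" for field, value in record.items()]
metadata_text = "\n".join(lines)
```

Because the field names are standardized, another researcher (or a harvesting tool) can map this record into a repository without guessing what each line means.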
Quality
Data Fidelity:
After deciding what kind of data (type), how much data (size), and what form to fix your data in (format), it is a good idea to decide the target quality of your data (fidelity). If you are using scientific instruments or collecting numeric data, you might only need to worry about sample rate: how much data is collected per unit of time. You may think of sample rate as a component of digital audio, and you would be right; sample rates and bit rates are a common description of audio fidelity. Images, video, and many other forms of data have a similar metric that can be adjusted to produce a differing quality of representation.

You will also remember that one of the recommendations for choosing a data format is that it be uncompressed. Compression is a method for making file sizes smaller. Some forms of compression are considered "lossy" and will actually remove information from your data through quantization. Other forms are considered "lossless" and retain the fidelity of information by using a technique called coding. Many common compressed formats employ both techniques (quantization and coding), which is why it is recommended to avoid compressed formats. It is a good idea to become familiar with the fidelity metric of your data type and format, and to indicate the target quality in your data management plan.
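The difference between the two techniques can be demonstrated in a few lines of Python (a sketch; the sample values are illustrative):

```python
import zlib

samples = [0.1234, 0.5678, 0.9012, 0.3456]
raw = repr(samples).encode("utf-8")

# Lossless compression (coding): the original bytes come back exactly.
packed = zlib.compress(raw)
assert zlib.decompress(packed) == raw

# Lossy "compression" (quantization): rounding shrinks the data but
# discards precision that can never be recovered.
quantized = [round(s, 1) for s in samples]
```

Once `samples` has been quantized there is no operation that restores the original values, which is exactly why archival copies should avoid lossy formats.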
Metadata Quality:
Similarly, using a metadata standard is only one part of making data re-interpretable. The second component is ensuring a target quality for your description. This can be accomplished by using controlled vocabularies or other standards to represent your metadata. For example, it is always advisable to use a standard representation of date (YYYYMMDD) and version (e.g. 0001, not "final version" or "ver 2"). Most metadata fields can be standardized by writing supplemental guidelines for your project, or by adopting a formalized standard.
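These two representations are easy to generate consistently in code. A minimal sketch (the helper name is illustrative):

```python
import datetime

def stamp(date, version):
    """Render a date as YYYYMMDD and a version as a zero-padded
    number, e.g. 'v0002' rather than 'final version' or 'ver 2'."""
    return date.strftime("%Y%m%d"), f"v{version:04d}"

day, ver = stamp(datetime.date(2011, 1, 17), 2)
```

Zero-padding matters because it makes alphabetical order agree with numeric order: "v0002" sorts before "v0010" in any file browser, whereas "ver 2" and "ver 10" do not.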
Short Term Storage
Data collection must account for a short-term storage architecture. During this stage, many research projects use a multi-tiered storage approach, such as a combination of shared file space, personal file space, and external hard drives. It is important to keep one or more redundant copies of your data; as the LOCKSS acronym goes, "Lots Of Copies Keep Stuff Safe." Here are your likely options for short-term storage at MSU:
- Internal hard drive (generally 50-500GB)
- External hard drive (up to 3TB)
- Departmental File Space (varies)
- HPCC high performance computing (50GB and up)
NONE of the above storage options should be considered long-term storage. For more information on long-term storage at MSU proceed to the Data Archiving stage.
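Redundant copies only keep stuff safe if you can verify that they still match. A common way to do this is to compare checksums of the working copy and the backup; the sketch below uses Python's hashlib (the file paths in the comment are illustrative):

```python
import hashlib

def sha256_of(path):
    """Stream a file through SHA-256 in 1 MB chunks so even large
    files can be checked without loading them into memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# Compare the working copy against a backup copy (paths illustrative):
# if sha256_of("data/borges_collation_20110117.xml") != \
#         sha256_of("/mnt/backup/borges_collation_20110117.xml"):
#     print("WARNING: backup copy does not match the original")
```

Recording these checksums alongside your data also pays off at the archiving stage, where they serve as fixity information.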