Research Data Management (RDM): Data types and formats

Descriptions of common format types

These are examples of different format types that you can use as you organize datasets in your project. 


 

Format

Description

Archive

Archive formats are similar to spreadsheets, but contain more than one row format. Typically they have one format to indicate cruise information, another for station information, and another for the actual measured parameter values.

Auxiliary

Auxiliary formats are usually small files that contain instructions (or other information) that an executable program needs in order to use the data file. They are much less comprehensive than true metadata files, which can play a similar role.

Compression

These formats are used for efficient storage or transmission of data, using a variety of compression algorithms in software programs that range from open-source to commercial.

Document

The data are contained in formats usually concerned with digital documents, including proprietary formats (e.g. DOC) or elaborately formatted ASCII text.

Hard Copy

Data on paper, including all types of journals, logbooks, periodicals, etc.

Markup Language

A markup language is an artificial language using a set of annotations to text that give instructions regarding how text is to be displayed.

Message

Highly specified, formal code sequences for reporting weather and surface marine observations.

Metadata

A metadata standard is a common set of terms and definitions that describe data.

Raster and Grid

In the earth sciences, a gridded data file is usually thought of as a set of numbers making up a rectilinear array (i.e. rows and columns) of parameter values, and the raster is sometimes thought of as a visualization of the grid. Both are essential inputs to geographic information systems.

Relational Database

The formats used by Relational Database Management Systems, universally binary and completely invisible to the user.

Self-Describing

These formats contain extensive internal metadata, which provides user systems with all the information needed for both use and discovery. Station data, grids and rasters can be accommodated in these formats.

Spreadsheet

An array of rows and columns, each cell containing either alphanumeric text or numeric values. The columns in the spreadsheet, usually labeled in the first row, contain separate types of information; the rows contain all the separate types of information associated with a single entity, such as an oceanographic station. All rows in a true spreadsheet have exactly the same format.

Vector 

Files containing digital representations of geometric forms, such as points, lines, curves, and shapes or polygons, all of which are based on mathematical equations and are used to represent images in computer graphics. An essential input to geographic information systems.


The content included in this section was obtained from: Reed, G. (2015). Research data organization and standards. Ocean Teacher Global Academy: Research Data Management Course [Moodle platform]. Retrieved from http://classroom.oceanteacher.org/login/index.php
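
The Spreadsheet description above says that all rows in a true spreadsheet have exactly the same format. As a minimal sketch of how that property could be checked for a delimited text file, the Python below compares the number of fields in each row with the number in the header row. The function name `find_irregular_rows` and the sample station data are hypothetical and are not part of the course material; the check covers field counts only, not data types.

```python
import csv
import io


def find_irregular_rows(csv_text, delimiter=","):
    """Return (line_number, field_count) for rows whose width differs from the header."""
    reader = csv.reader(io.StringIO(csv_text), delimiter=delimiter)
    header = next(reader, None)
    if header is None:
        return []  # empty input: nothing to check
    expected = len(header)
    problems = []
    for row in reader:
        if len(row) != expected:
            # reader.line_num is the number of source lines read so far
            problems.append((reader.line_num, len(row)))
    return problems


# Hypothetical sample: the last row is missing its temperature value.
sample = (
    "station,latitude,longitude,temperature\n"
    "A1,-33.9,151.2,18.4\n"
    "A2,-34.1,151.3\n"
)
print(find_irregular_rows(sample))
```

An empty result suggests that the file has a single, consistent row format. Note that this sketch also reports blank lines, because the `csv` reader returns them as rows with zero fields.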

Types of Data

To a greater or lesser degree, many disciplines work with several types of data. Common types include:

  • Experimental results
  • Observational data
  • Simulation data
  • Field notes
  • Images (graphs, scans)
  • Digitized photos
  • Born-digital documents
  • Quantitative data (e.g., survey data)
  • Data from historical archives
  • Samples
  • Objects (bones, algae)
  • Social media data (e.g., Twitter statistics)

Format Types

Using appropriate formats is essential in order to be able to access a file's content. Due to technological advances, different formats are constantly being developed. On many occasions, older formats or versions can become obsolete due to lack of use or support, to the point where current programs will not be able to render the content. Unfortunately, a lot of research data is lost in this way. When choosing a format to save your files in, check that it has acceptance within the community of your discipline, that it is widely used and well supported, and that there is some commitment to maintaining it for several years to come. If there are more advanced versions in existence, you can always migrate your data to those as a safety measure.

For more information on choosing file formats, see the following video from the 3:09 mark.

Lecture by Greg Reed (AODC, Australia) given during the research data management training course, IODE, 16-20 November 2015

File extensions for different formats

When selecting a file type or format, think about long-term access to the data. To lengthen the life of your files, use formats designed for this purpose. The following formats are recommended over proprietary formats. 

File type    Use                   Do not use
Text         PDF, TXT, XML, RTF    MS Word
Numeric      CSV, Tab Delimited    MS Excel
Video        MPEG-4                Quicktime
Image        TIFF, JPEG2000        GIF, JPG
Audio        WAV                   mp3
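
The table above can also be applied programmatically, for example to flag files in a project folder that may need migrating. The following Python is a rough sketch under stated assumptions: the mapping from format names to file extensions (such as `.tsv` for tab-delimited text or `.jp2` for JPEG2000) is our own guess and is not given in the table, and the function name `classify_format` is hypothetical.

```python
from pathlib import Path

# Extension spellings are assumptions; the table lists format names only.
RECOMMENDED = {
    ".pdf", ".txt", ".xml", ".rtf",   # text
    ".csv", ".tsv",                   # numeric (comma- or tab-delimited)
    ".mp4",                           # video (MPEG-4)
    ".tif", ".tiff", ".jp2",          # image (TIFF, JPEG2000)
    ".wav",                           # audio
}
DISCOURAGED = {
    ".doc", ".docx",                  # MS Word
    ".xls", ".xlsx",                  # MS Excel
    ".mov",                           # Quicktime
    ".gif", ".jpg", ".jpeg",          # GIF, JPG
    ".mp3",                           # mp3
}


def classify_format(filename):
    """Return 'use', 'do not use', or 'unknown' based on the file extension."""
    ext = Path(filename).suffix.lower()
    if ext in RECOMMENDED:
        return "use"
    if ext in DISCOURAGED:
        return "do not use"
    return "unknown"


for name in ["survey.csv", "report.DOCX", "grid.nc"]:
    print(name, "->", classify_format(name))
```

An extension is only a hint about the real format, so a result of "unknown" or "use" still deserves a manual check against the resources described in the next section.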

Tools for identifying appropriate formats

These are two resources that can help you make an informed decision about what formats to use in order for your data to be accessible in the future. Both collect data on different formats, the programs they are tied to, different versions, their probabilities of surviving long-term, and when they may become obsolete.

The United States National Archives released a set of tables showing recommended and acceptable formats for long-term preservation. You will find links to the formats' specifications and related standards. 


Best practices suggest that the chosen format be: 

  • Non-proprietary, in other words, open-source
  • Documented according to market standards
  • Widely accepted by the researcher community  
  • Non-encrypted 
  • Uncompressed