In this lesson you learn how to discover, assess, and cite an open dataset. You start by exploring repositories and learning about the issues and considerations involved in searching for datasets. You then learn how to determine whether a dataset is suitable for your use by learning what to review in its documentation, license, and file format. The lesson wraps up with a discussion of the importance of citing datasets and how to read and follow citation instructions.
After completing this lesson, you should be able to:
Open data isn’t always simple to use in your research. Sometimes there are multiple versions of the same dataset, so learning how to discover, assess, and use open data will save you time.
As an example, look at the monthly average carbon dioxide data from Mauna Loa Observatory in Hawaii. This is a foundational dataset for climate change. Not only is it one of the first observational datasets that clearly showed anthropogenic impacts on the Earth’s atmosphere, it constitutes the longest record of direct measurements of carbon dioxide in the atmosphere. These observations were started by C. David Keeling of the Scripps Institution of Oceanography in March of 1958 at a facility of the National Oceanic and Atmospheric Administration [Keeling, 1976].
If you want to make this figure yourself, or use the data for some other purpose, first you will want to find the data. If you search for this dataset, or any data, chances are that you will find a number of different sources. How do you decide which data to use?
If you start with Google and search for “Mauna Loa carbon dioxide data” you will find a lot of results. Here are just some of them:
How do you decide which one to use? In this lesson we will cover how to find open data, assess its relevance, and use it.
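Once you have chosen a source, a first practical step is often to read the file and confirm that it contains what you expect. The following is a minimal sketch in Python, assuming the monthly data comes as comma-separated text with comment lines that start with `#` and a header row that includes `year`, `month`, and `average` columns. That layout, the function name `read_monthly_co2`, and the sample rows are assumptions for illustration; check the provider's documentation for the actual format.

```python
import csv
import io


def read_monthly_co2(text):
    """Parse CSV text into (year, month, average) tuples.

    Assumed layout: optional comment lines starting with '#', then a
    header row that includes 'year', 'month', and 'average' columns.
    """
    # Drop blank lines and '#' comment lines before handing the rest to csv.
    data_lines = [
        line for line in text.splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    ]
    reader = csv.DictReader(io.StringIO("\n".join(data_lines)), skipinitialspace=True)
    records = []
    for row in reader:
        records.append((int(row["year"]), int(row["month"]), float(row["average"])))
    return records


# Placeholder rows in the assumed layout. These are NOT real measurements.
SAMPLE = """\
# Placeholder file for illustration only
year,month,average
2000,1,1.0
2000,2,2.0
"""

records = read_monthly_co2(SAMPLE)
print(records)  # [(2000, 1, 1.0), (2000, 2, 2.0)]
```

Reading a few rows this way and comparing them with the dataset's documentation is a quick check that you downloaded the version you intended.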
Open data can be discovered by accessing data repositories, search portals, and publications. A wide variety of these resources are available. A key step is identifying the appropriate search terms for your application. Learning community-specific nomenclature and standards can accelerate your search.
There are multiple pathways to find research data, and you should be practiced in all of them.
What is the first and best way to find research data? Ask your community, including your research advisor, colleagues, team members, and people online. Knowing where to find reliable, good data is as much a skill and art as any lab technique. You learn this skill set by working with professionals in your field. There is no one source, no one method.
Image source: NASA, Dominic Hart 2023
Datasets are often attached to scholarly publications in the form of supplementary material. Publication search engines help you discover relevant publications, which can then lead you to the data behind them.
Data can also be found utilizing a wide variety of search portals including:
Generic data search portals enable discovery of a wide variety of data. Not built for specific disciplines, they serve a broader audience. This type of search portal collects and makes data findable. They are not sources of scientific data. These are aggregation services that emphasize quantity, not necessarily quality. This is where citizen scientists often go to find data, and it’s a great way for non-professionals to get involved in science. Examples include:
Discipline-specific data search portals enable the discovery of specific types of data. They generally are tailored to meet their community’s needs. Examples include:
National and international data search portals enable discovery of data produced by or funded by national and international organizations. Examples include:
A common way to share and find open data is through data repositories. Many repositories host open data with persistent identifiers, clear licenses and citation guidelines, and standard metadata.
Note that some of our example search portals are also repositories, but not all of them are. Some search portals are simply catalogs of information about the data, rather than storage locations for the data themselves.
General repositories are not designed for a specific community and are accessible to everyone. Examples include:
See the Generalist Repository Comparison Chart, a tool for additional repositories and guidance. Dataverse has also published a comparative review of eight data repositories.
Domain-specific repositories are specialized repositories (typically for specific data subject matter) that provide support and information on required standards for metadata and more. Some examples are:
Institutional repositories are supported by many universities and organizations for research data and software management, to aid their researchers with compliance requirements.
National repositories aggregate data and make it available to the public. Data stored in these repositories are often produced by the government. Examples include:
Match the repository type to the correct definition.
General repositories | Designed for all communities and are accessible to everyone |
Domain-specific repositories | Repositories that are typically for specific data subject matters |
Institutional repositories | Repositories supported by universities and organizations |
National repositories | Repositories funded by the government |
Using open data for your project is contingent on a number of factors including quality of data, access and reuse conditions, data findability, and more. A few essential elements that enable you to assess the relevance and usability of datasets include (adapted from the GODAN Action Open Data course):
Practical Questions
Technical Questions
Social Questions
Many of these questions may be answered by viewing a dataset’s documentation and metadata, as well as its format and license, all of which will be discussed further in the next lesson, “Making Data Open”.
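Some of these checks can be partly automated when a repository exposes metadata in machine-readable form. The following is a minimal sketch, assuming a metadata record is available as a Python dictionary. The field names in `REQUIRED_FIELDS`, the function `missing_reuse_info`, and the example record are all hypothetical; real repositories use their own metadata schemas.

```python
# Hypothetical field names; real repositories use their own metadata schemas.
REQUIRED_FIELDS = ("title", "license", "identifier", "file_format", "documentation")


def missing_reuse_info(metadata):
    """Return the required fields that are absent or blank in a metadata record."""
    missing = []
    for field in REQUIRED_FIELDS:
        value = metadata.get(field)
        if value is None or not str(value).strip():
            missing.append(field)
    return missing


# Hypothetical record, for illustration only.
record = {
    "title": "Example dataset",
    "license": "CC-BY-4.0",
    "identifier": "",
    "file_format": "CSV",
}
print(missing_reuse_info(record))  # ['identifier', 'documentation']
```

A missing field does not by itself rule out a dataset, but it flags a question to resolve (for example, by contacting the data provider) before you rely on the data.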
Acknowledgements and citations help foster a culture of sharing data without fear of ideas or recognition being stolen. If researchers can trust that their work will be cited and used to further the development of science, the idea of making data open is more appealing and mutually beneficial. Use of standard citation practices is recommended to ensure due credit is given.
Data citations also aid transparency about how data are being used. By citing data, original authors and new researchers can easily track how the data are being used to answer different questions.
Many datasets and repositories explain how they’d prefer to be cited. The citation information often includes:
This is an example of a simple CITATION.cff file. Source: GitHub
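To give a sense of what such a file contains, the following is a minimal sketch in Python that writes CITATION.cff text for a dataset. It is not the GitHub example referred to above. The keys used (`cff-version`, `message`, `type`, `title`, `authors`, `doi`, `version`, `date-released`) come from the Citation File Format 1.2.0 schema; the function name `cff_text` and every value in the usage example, including the DOI, are placeholders.

```python
import json


def cff_text(title, authors, doi, date_released, version):
    """Build minimal CITATION.cff text for a dataset.

    authors: list of (family_name, given_name) tuples.
    date_released: a date string in YYYY-MM-DD form.
    """
    # JSON string quoting is also valid YAML double-quoted style.
    q = json.dumps
    lines = [
        "cff-version: 1.2.0",
        "message: " + q("If you use this dataset, please cite it as below."),
        "type: dataset",
        "title: " + q(title),
        "authors:",
    ]
    for family, given in authors:
        lines.append("  - family-names: " + q(family))
        lines.append("    given-names: " + q(given))
    lines += [
        "doi: " + q(doi),
        "version: " + q(version),
        "date-released: " + date_released,
    ]
    return "\n".join(lines) + "\n"


# Placeholder values for illustration; the DOI below is not a real identifier.
print(cff_text(
    title="Example Dataset",
    authors=[("Doe", "Jane")],
    doi="10.0000/example",
    date_released="2024-01-01",
    version="1.0",
))
```

Quoting the version keeps a value such as `1.0` from being read as a number by YAML parsers.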
Most datasets require (at a minimum) that you list the data’s producers, name of the archive hosting the data, dataset name, dataset date, and DOI when citing data.
Example from a NASA Distributed Active Archive Center (DAAC)
Matthew Rodell and Hiroko Kato Beaudoing, NASA/GSFC/HSL (08.16.2007), GLDAS CLM Land Surface Model L4 3 Hourly 1.0 x 1.0 degree Subsetted, version 001, Greenbelt, Maryland, USA: Goddard Earth Sciences Data and Information Services Center (GES DISC), Accessed on July 12th, 2018 at doi:10.5067/83NO2QDLG6M0
Example from NASA Planetary Data System (PDS)
Justin N. Maki. (2004). MER 1 MARS MICROSCOPIC IMAGER RADIOMETRIC RDR OPS V1.0 [Data set]. NASA Planetary Data System. https://doi.org/10.17189/1520416
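The two examples above write the DOI in different forms (`doi:...` and `https://doi.org/...`). A DOI can be turned into a link by appending it to `https://doi.org/`. The following is a minimal sketch of normalizing either form to a resolver URL; the function name `doi_to_url` is hypothetical and the validity check is deliberately loose.

```python
def doi_to_url(doi):
    """Normalize 'doi:10.x/y', a doi.org URL, or a bare '10.x/y' to a resolver URL."""
    value = doi.strip()
    lowered = value.lower()
    prefixes = (
        "https://doi.org/",
        "http://doi.org/",
        "https://dx.doi.org/",
        "http://dx.doi.org/",
        "doi:",
    )
    for prefix in prefixes:
        if lowered.startswith(prefix):
            value = value[len(prefix):].strip()
            break
    # Loose sanity check only: DOIs start with '10.' and contain a '/'.
    if not value.startswith("10.") or "/" not in value:
        raise ValueError(f"not a DOI: {doi!r}")
    return "https://doi.org/" + value


print(doi_to_url("doi:10.5067/83NO2QDLG6M0"))
# https://doi.org/10.5067/83NO2QDLG6M0
print(doi_to_url("https://doi.org/10.17189/1520416"))
# https://doi.org/10.17189/1520416
```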
The following are the key takeaways from this lesson:
Answer the following questions to test what you have learned so far.
Question 1 of 3: Which of the following methods can be used for data discovery?
Question 2 of 3: Which of the following is/are questions to consider when assessing if a dataset can be used?
Question 3 of 3: What information is commonly found in a citation file?