Our paleobiological research is more data-driven than that of many other paleontological institutions across continental Europe. We use big data to address key questions about the evolution and extinction of organisms and communities on Earth. We are interested in how large-scale macroevolutionary and macroecological patterns over time are shaped by origination and extinction processes, and how these are linked to global changes. We make use of several databases that compile dated fossils of organisms across various spatio-temporal scales. Our primary source of data is the Paleobiology Database, which is one of the largest compilations of global fossil data, with more than 1.2 million occurrence entries. However, because of the growing size of this database, assessing data quality prior to conducting analyses is becoming increasingly difficult.

The Challenge

The goal of this challenge is to develop a framework to clean the data in the largest database of fossils spanning the entire Phanerozoic eon: the Paleobiology Database (PBDB). The PBDB is the basis of most quantitative analyses of diversification and extinction. The data compiled within the PBDB are contributed by hundreds of researchers across the world and derive mostly from the published literature. Occurrence records from this and similar databases are an indispensable source in macroevolutionary and macroecological research. Issues with data quality can, however, diminish their usefulness. Manual cleaning is time-consuming, error-prone, difficult to reproduce, and reliant on expert knowledge, which makes it impractical for datasets of this size. We therefore seek an automated solution to clean the taxonomic and temporal assignments of occurrence data, which would help us and the entire scientific community to minimize errors in our analyses. Your innovative tool should help paleontologists scan occurrence datasets and flag dating and/or taxonomic imprecisions in a standardised and reproducible way.

Common errors in The Paleobiology Database

Close inspection of the PBDB data has revealed that it is riddled with taxonomic errors and incorrectly reported temporal ranges. To start with, the taxonomic literature of many organismal groups is littered with inconsistencies and misidentifications (Curry & Humphries 2016). This could be due to a variety of factors, such as the lack of a standard rule for determining the boundaries between morphologically similar species, differing species concepts, poorly preserved fossil specimens in which some morphological characters cannot be identified, or a lack of training in identifying the fossil species that need to be documented.

One such example is the genus Lingula, a common extant inarticulate brachiopod. The genus was originally described in 1797 on the basis of extant organisms. Later, fossil specimens of linguliform morphology dated to the Mesozoic and Paleozoic were also assigned to Lingula because their shells resemble those of the extant specimens, which led Darwin to name Lingula a living fossil.

This status is now rejected, as it has been shown that this shell shape corresponds to a burrowing lifestyle that occurs in different brachiopod lineages with different and evolving internal structures (Emig 2003, 2008). Many of these fossil linguliform taxa have now been reassigned to other genera (Emig 2008). This anachronism has been known for more than two decades, but it has not been addressed in the database or in the numerous publications that use these data. Similar problems are also present in other taxonomic groups such as bivalves and corals (see the table below). One way of checking for such errors is to use expert opinions such as those that informed the famous compendium of the late Jack Sepkoski (2002).

Some examples of known errors in the PBDB data

| Group | Taxon | Stratigraphic range according to PBDB | Estimated true range according to Sepkoski (2002) |
| --- | --- | --- | --- |
| Brachiopod | Lingula | Cambrian to Modern | Eocene to Modern |
| Bivalve | Ostrea | Permian to Modern | Cretaceous to Modern |
| Coral | Thecosmilia | Triassic to Eocene | Jurassic to Cretaceous |

Your starting point: The dataset

The dataset from the PBDB is the basis for the challenge. Data can be downloaded from:

You may also access these data using the R package `chronosphere` along with the compiled data from Sepkoski (2002) using the following R code: access-pbdb-data.r

Your aim

In this challenge, you will need to develop a script that can automatically flag taxonomic and stratigraphic inconsistencies in the provided datasets. To build this algorithm, you will have access to the data in the PBDB.

One way to identify inaccurate or imprecise information in fossil data is to (a) compare the data to Sepkoski’s compendium (Sepkoski 2002), which will also be provided to you, and (b) investigate the ages provided in the PBDB to flag any unexpectedly large range or unexpected age.
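The two checks described above could be sketched in base R roughly as follows. This is only a minimal illustration under several assumptions: the occurrence columns `genus`, `max_ma` and `min_ma` mimic a PBDB download, the reference columns `fad_ma` and `lad_ma` (first and last appearance in Ma) are hypothetical names for the compendium ranges, the function names and thresholds are invented for this example, and all ages are rough made-up values rather than real records.

```r
# Toy occurrence table (illustrative ages in Ma, not real PBDB records).
occ <- data.frame(
  genus  = c("Lingula", "Lingula", "Lingula", "Ostrea", "Ostrea"),
  max_ma = c(500, 40, 5, 140, 20),
  min_ma = c(490, 35, 0, 130, 15)
)

# Toy reference ranges; fad_ma / lad_ma are hypothetical column names.
ref <- data.frame(
  genus  = c("Lingula", "Ostrea"),
  fad_ma = c(56, 145),
  lad_ma = c(0, 0)
)

# (a) Compare the observed range of each genus with the reference range.
flag_ranges <- function(occ, ref, tol_ma = 5) {
  # Observed range: oldest max_ma and youngest min_ma per genus.
  obs <- merge(
    aggregate(max_ma ~ genus, data = occ, FUN = max),
    aggregate(min_ma ~ genus, data = occ, FUN = min),
    by = "genus"
  )
  names(obs) <- c("genus", "obs_fad_ma", "obs_lad_ma")
  out <- merge(obs, ref, by = "genus", all.x = TRUE)
  # How far the observed range extends beyond the reference, in Myr.
  out$excess_old_ma   <- pmax(out$obs_fad_ma - out$fad_ma, 0)
  out$excess_young_ma <- pmax(out$lad_ma - out$obs_lad_ma, 0)
  # Genera without a reference entry are left unflagged.
  out$flag <- !is.na(out$fad_ma) &
    (out$excess_old_ma > tol_ma | out$excess_young_ma > tol_ma)
  out
}

# (b) Flag occurrences that are far in time from every other occurrence
# of the same genus. Quadratic in the number of rows, so only suitable
# for small subsets as written.
flag_isolated <- function(occ, gap_ma = 200) {
  mid <- (occ$max_ma + occ$min_ma) / 2
  nearest <- vapply(seq_along(mid), function(i) {
    same <- occ$genus == occ$genus[i]
    same[i] <- FALSE
    if (!any(same)) return(NA_real_)
    min(abs(mid[same] - mid[i]))
  }, numeric(1))
  occ$nearest_gap_ma <- nearest
  occ$isolated <- !is.na(nearest) & nearest > gap_ma
  occ
}

res <- flag_ranges(occ, ref)
iso <- flag_isolated(occ)
print(res[, c("genus", "excess_old_ma", "flag")])
print(iso[, c("genus", "max_ma", "isolated")])
```

In this toy example, the genus with a single very old occurrence is flagged by both checks, while the genus whose occurrences lie within its reference range is not. The tolerance `tol_ma` and the gap threshold `gap_ma` are arbitrary choices that would need tuning against real data.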

Most commonly, ranges in the PBDB are longer than in Sepkoski’s compendium when the taxa are so-called wastebasket taxa, which are characterized by many records in the PBDB and a lack of diagnostic characters (Plotnick and Wagner 2006). However, there is also a biological underpinning to commonness in the fossil record (Plotnick and Wagner 2018).
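One possible heuristic for shortlisting wastebasket candidates is to combine the two signals mentioned above: an unusually high number of records and a range that is much longer than the reference range. The sketch below assumes a per-genus summary table with hypothetical columns `n_occ`, `obs_duration_ma` and `ref_duration_ma`; the genus names, values and thresholds are invented for illustration. It cannot assess diagnostic characters, so it only suggests taxa for expert review.

```r
# Toy per-genus summary (all values are made up for illustration).
tab <- data.frame(
  genus           = c("GenusA", "GenusB", "GenusC", "GenusD", "GenusE"),
  n_occ           = c(1200, 15, 40, 900, 8),
  obs_duration_ma = c(500, 30, 60, 100, 12),
  ref_duration_ma = c(56, 28, 55, 95, 12)
)

flag_wastebasket <- function(range_tab, n_quantile = 0.75, min_excess_ma = 10) {
  # "Many records": at or above the chosen quantile of occurrence counts.
  cutoff <- quantile(range_tab$n_occ, n_quantile, names = FALSE)
  many <- range_tab$n_occ >= cutoff
  # "Too long": observed duration exceeds the reference by a margin.
  long <- (range_tab$obs_duration_ma - range_tab$ref_duration_ma) > min_excess_ma
  # Treat missing reference durations as "not flagged".
  range_tab$wastebasket_candidate <- (many & long) %in% TRUE
  range_tab
}

tab <- flag_wastebasket(tab)
print(tab[, c("genus", "n_occ", "wastebasket_candidate")])
```

Requiring both conditions means that a genus with many records but a range close to the reference (GenusD in the toy table) is not flagged, which is in line with the point that commonness can be biologically real.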

What to submit?

Please provide your source code by sending it to: The R language is preferred, as it is the programming language most commonly used by paleontologists. Other languages and even non-programming solutions (e.g. plausibility tests) are also allowed. Make sure the code is well documented, and please use a consistent style throughout to ensure readability.

Your award

The team that submits the best solution to our challenge will be invited for a week-long stay here at FAU in Erlangen, Bavaria, Germany, in late summer / early fall 2021. We will cover your travel expenses and offer you an interesting program in which you can learn more about our research and about the region. This will include a visit to the medieval city of Nuremberg as well as a hiking trip to the unique nature reserve of the Franconian Alb.

As a special prize, up to three members of the winning team will be able to participate in the Summer Science School 2021 in Erlangen without having to submit an application for the Summer School.

If travelling is not possible due to COVID-19 restrictions, your team will be awarded 2,500 €, but of course we hope that it will be possible to welcome you here at FAU soon.


Emig C (2003) Proof that Lingula (Brachiopoda) is not a living-fossil, and emended diagnoses of the Family Lingulidae. Carnets de Géologie 1:1-8
Emig C (2008) On the history of the names Lingula, anatina, and on the confusion of the forms assigned them among the Brachiopoda. Carnets de Géologie 8:1-13
Plotnick R, Wagner P (2006) Round up the usual suspects: common genera in the fossil record and the nature of wastebasket taxa. Paleobiology 32:126-146
Plotnick R, Wagner P (2018) The greatest hits of all time: the histories of dominant genera in the fossil record. Paleobiology 44:368-384
Sepkoski J (2002) A compendium of fossil marine animal genera. Bulletins of American Paleontology 363:1-563
