About the Database


What is the IsoGenie Database?

The IsoGenie database is a Neo4j-powered graph database designed to store nearly all the data generated by the IsoGenie project. This includes text-based data, such as temperature, pH, and other measurements, as well as temporal and spatial data. If data cannot be stored natively in the graph database itself, links referring to the data files are stored in its place.


Who built it, and how?

First things first. The "database" encompasses several components: 1) the Neo4j graph database, which stores text-based information and serves as the basic storage structure, and 2) the web server, which is built using a number of different tools. These two aspects come together to form the IsoGenieDB. The website, other front-end components, and most of the back-end components were built using the tools displayed on the left. The database was built nearly end-to-end by Benjamin "Ben" Bolduc, a postdoc working with Virginia Rich and Matthew Sullivan (both IsoGenie PIs). The Ohio State University provided the HTTP server, a happy home to store the website and database, and had the patience to assist someone who didn't know a line of JavaScript, HTML, CSS, Node.js, etc. Any questions about the layout, the organization of data, the underlying graph database, or the website itself can be directed towards him. Data questions should be directed towards the appropriate data provider.


How much data, and what kinds of data can be stored?

If you check out this "recent" article, it turns out the capacity is effectively unlimited. Actual usage is therefore limited by hardware, meaning whatever resources we can purchase. Data is duplicated at various data-processing stages, from the raw data that can be downloaded at the raw data downloads page, to the data uploaded to the Neo4j database.

Nearly any type of text-like data can be stored natively in the database. Images and other non-text data can have links referring to the actual data locations on the web server, allowing users to still access/download the data and see it in the context of other data. This includes coring information, omics data (meta-genomics, -transcriptomics, -proteomics), terrestrial geochemistry, vegetation "ground cover," and satellite and drone imagery (the latter two stored as links). This data can be temporal, whether 1 sample/second or 10,000 samples/second. The effective limitation is filtering/aggregating the data so that it is human-comprehensible. For example, let's assume that autochambers take 1 measurement every minute for 5 years straight; that's ~2.628 million measurements. When you query the database, you don't necessarily want to grab all 2+ million nodes, so instead [depending on the observed sample frequencies] the minute measurements are combined into hours or days, reducing the number of nodes to 1,825 (days) or 43,800 (hours). Users can then effectively sort through and filter much larger spans of data more quickly, while still being able to retrieve all the data associated with whatever aggregation level was applied.
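
As a rough illustration of that kind of pre-aggregation, here is a minimal sketch (using pandas) that collapses minute-level readings into daily summaries before import. The file name and column names ("autochamber.csv", "timestamp", "ch4_flux") are hypothetical placeholders, not the actual IsoGenie schema.

    # Sketch: pre-aggregating minute-level autochamber readings into daily rows
    # before they become nodes. File/column names are hypothetical placeholders.
    import pandas as pd

    readings = pd.read_csv("autochamber.csv", parse_dates=["timestamp"])
    readings = readings.set_index("timestamp")

    # ~2.628 million minute-level rows collapse to 1,825 daily rows over 5 years
    daily = readings["ch4_flux"].resample("D").agg(["mean", "min", "max", "count"])

    # Each daily row becomes a single node; the per-minute values remain
    # retrievable from the raw data downloads if the aggregate isn't fine enough.
    daily.to_csv("autochamber_daily.csv")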


How do you deal with site synonyms?

For those who aren't aware, site synonyms are the "unique" terms each lab uses that all mean the same thing. One great example is the use of Sphagnum and Bog: one describes the dominant vegetation, the other the habitat classification. Over the years some labs have switched and others have been consistent, but one thing remains the same: there's no universal naming scheme for each "plot of land."

To address this issue, a unified naming scheme was adopted with input from members of multiple labs. Whenever data is imported into the database, the site names or their acronyms/abbreviations are passed through a naming dictionary that matches the names against all known variations. The dictionary also contains habitat types, vegetation, and other abbreviations, which are then associated with the data import. This allows the original data providers to search for their own data using their own nomenclature, while simultaneously allowing other members to use "their" naming scheme to find the same samples.
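
A minimal sketch of how such a lookup can behave is shown below; the dictionary entries and the function name are illustrative, not the actual IsoGenie naming dictionary.

    # Sketch of a site-synonym lookup; the entries shown are illustrative only.
    SITE_SYNONYMS = {
        "sphagnum": "Bog",   # dominant vegetation -> habitat classification
        "sphag": "Bog",
        "bog": "Bog",
    }

    def canonical_site(name: str) -> str:
        """Map a lab-specific site name or abbreviation to the unified name."""
        key = name.strip().lower()
        if key not in SITE_SYNONYMS:
            raise KeyError(f"Unknown site name: {name!r}")
        return SITE_SYNONYMS[key]

    # Data imported as "Sphagnum" and data imported as "Bog" both attach to
    # the same canonical site, so either lab can find the same samples.
    assert canonical_site("Sphagnum") == canonical_site("bog") == "Bog"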


How do we use this database in our publications?

The IsoGenieDB is low-maintenance (see the other question) and is intended to survive through IsoGenie3. As such, references to the website will be maintained. In addition to DOIs being associated with the collections of data they belong to, we are exploring the auto-creation of DOIs (alongside a DOI-generating service) that would allow IsoGenie members to "publish" their data, with a DOI linked to a collection of datapoints in the database.


Can we submit data?

While we would like to eventually expand the database to include non-IsoGenie member data, we're currently focused on creating a fully comprehensive database for all our members and aren't accepting non-IsoGenie data quite yet.


What does querying mean?

Querying generally means using a query language (Cypher, in Neo4j's case) to directly query the underlying graph database. As one can imagine, highly sophisticated querying can not only run a query, but also parse and/or retrieve additional information based on the initial data returned. Instead of forcing users to learn a new query language (one in which even the most advanced syntax can only return "limited" amounts of data at a time), we've taken a hybrid approach. Some data is pre-returned to the website for quick filtering. At other times, Python is used in the background to fetch results, translating plain-language search terms into the query language used by the Neo4j database. This is especially useful during iterative querying, where prior results affect future queries. For simple queries, i.e. those that involve 1 or 2 data types, nearly all data can quickly be filtered via the querying interface.
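
To give a sense of what that background translation looks like, here is a minimal sketch using the official Neo4j Python driver. The connection details, the Sample node label, and the property names (habitat, depth_cm) are assumptions made for illustration, not the actual IsoGenieDB schema.

    # Sketch: translating a plain-language request ("bog samples above 30 cm")
    # into a parameterized Cypher query. Labels/properties are assumptions.
    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

    def find_samples(habitat, max_depth_cm):
        query = (
            "MATCH (s:Sample) "
            "WHERE s.habitat = $habitat AND s.depth_cm <= $max_depth "
            "RETURN s.name AS name, s.depth_cm AS depth_cm"
        )
        with driver.session() as session:
            result = session.run(query, habitat=habitat, max_depth=max_depth_cm)
            return [record.data() for record in result]

    results = find_samples(habitat="Bog", max_depth_cm=30)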

The major limitation to this method is that there's no easy way of foreseeing what data users will frequently request, or how the returned data should be arranged to be most useful for the end-user. If you find yourself wanting to run the same query repeatedly, simply varying a few search parameters, that can be easily automated (see the sketch below). If you want to query/filter data based on numerous parameters, or iteratively retrieve data, get subsets, or run GPS localization and other geospatial analyses, you'll want to check out the "network analytic queries" for more info.
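
For instance, sweeping the hypothetical find_samples() sketch from above over a handful of habitat names (illustrative only) is just a short loop:

    # Sketch: automating a repeated query by varying one search parameter.
    for habitat in ("Bog", "Fen", "Palsa"):
        for row in find_samples(habitat=habitat, max_depth_cm=50):
            print(habitat, row["name"], row["depth_cm"])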


How can I explore spatial relationships in my data?

The IsoGenieDB is spatially-aware in the sense that it contains standardized GPS coordinate information that can be acted upon by other tools/software. This can be seen through the map interface, where GPS information is pulled from the database alongside other site/core-specific information and rendered on a geospatial coordinate system (think map pins on Google Earth).

In a way this means that querying based on GPS information (through the "querying" page) is limited to text-based matches. The map interface is another matter. Since GPS information can be retrieved, the sophistication of map queries is limited only by current coding skills and/or available "plugins" designed to work with the Google-based mapping software that powers the map interface. For example, overlaying images is simply a plugin that can be installed into the map, as are distance calculations between points on the map and drawing a shape to select all the points within it; even adding "walking directions" is a plugin plus a little coding to get it working with our data.
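
As an illustration of one such calculation, the sketch below computes the great-circle (haversine) distance between two GPS points, the kind of operation a distance plugin performs; the coordinates are placeholders, not actual IsoGenie site locations.

    # Sketch: point-to-point distance of the kind a map plugin computes.
    from math import radians, sin, cos, asin, sqrt

    def haversine_m(lat1, lon1, lat2, lon2):
        """Great-circle distance in metres between two GPS coordinates."""
        earth_radius_m = 6371000
        dlat = radians(lat2 - lat1)
        dlon = radians(lon2 - lon1)
        a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
        return 2 * earth_radius_m * asin(sqrt(a))

    # Distance between two hypothetical core locations (placeholder coordinates):
    print(haversine_m(68.3530, 19.0470, 68.3545, 19.0500))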

The ultimate goal is a full-featured map interface that combines filtering on site characteristics with summarizing data based on any geospatial selection. It's not science fiction; it's in the alpha stage of development!


Below is a sample graph representation of select datatypes and how they can connect in a database.