Search engine vs Database in BI - 2

Structure is value.

Aug 26, 2011

Without structure, the Eiffel Tower would have been nothing but a pile of metal.

Note: this article was published in 2011. We find it still relevant today.

As we saw earlier, data structure is essential for BI.

But not all the data in the world is structured. Scanned documents and emails, for instance, are not. A search engine has far better tools for processing natural language than a database does, and it is also very good at speeding up full-text search. In general, search engines are great at extracting information from unstructured data.

When data is unstructured at the source, what kind of processing can we apply to structure it?
What is the typical data architecture of a BI application over unstructured data?

The example: a list of recipes

Let’s say you have a large dataset of recipes. One entry in this dataset can look like this:

Title: Burger
Course category: entree
Complexity to prepare: 1 out of 5
Time to prepare: 10 min
Description: “a delicious burger home made with a juicy steak, toasted bread, crispy lettuce. Can be served with ketchup and love.”
Instructions: “Toast the bread first. Then cook in a frying pan the meat. Salt at the end of the cooking.”
Source: “allrecipes.com”, courtesy of Pirate Johnny ©

Although a recipe tries to be unambiguous for human readers, to today’s computers it is unstructured!

Now you want to be able to run statistics over a large set of recipes. The essential factual part of this recipe is the description, which is unstructured.

Thesaurus

From a data perspective, one can see a recipe as a collection, an assemblage of ingredients. A “burger” is composed of “bread”, “steak”, “cheese”, “lettuce” and “ketchup” for instance.

The simplest way to model this situation is to use a tagging mechanism. So we use a Full Text Indexer to process the description and extract the tags. Of course, we need a reference of what is an ingredient and what is not. In our burger, “love” is a noun but not an ingredient, even if some French people would say so.

The list of recognizable words is called a thesaurus. With the help of our Full Text Indexer and the reference thesaurus, each recipe now has a list of ingredient tags associated with it. Now I’m able to count the ten most used ingredients, or find the ingredient most often used together with steak…
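As a rough illustration of the tagging step, here is a minimal Python sketch. It stands in for a real Full Text Indexer with a naive word split, so it has no stemming or plural handling. The thesaurus contents, the function names and the two extra recipes are assumptions made up for the example; only the burger description comes from the entry above.

```python
import re
from collections import Counter

# Hypothetical thesaurus: the words we accept as ingredients.
THESAURUS = {"steak", "bread", "lettuce", "ketchup", "cheese", "salt"}

def extract_tags(description, thesaurus=THESAURUS):
    """Return the set of thesaurus words found in a free-text description."""
    words = re.findall(r"[a-z]+", description.lower())
    return {word for word in words if word in thesaurus}

def top_ingredients(recipes, n=10):
    """Count in how many recipes each ingredient tag appears."""
    counts = Counter()
    for description in recipes.values():
        counts.update(extract_tags(description))
    return counts.most_common(n)

def top_cooccurring(recipes, ingredient, n=1):
    """Find the tags most often seen in recipes that contain `ingredient`."""
    counts = Counter()
    for description in recipes.values():
        tags = extract_tags(description)
        if ingredient in tags:
            counts.update(tags - {ingredient})
    return counts.most_common(n)

RECIPES = {
    "Burger": (
        "a delicious burger home made with a juicy steak, toasted bread, "
        "crispy lettuce. Can be served with ketchup and love."
    ),
    # The two entries below are invented sample data.
    "Steak sandwich": "Grilled steak on bread with melted cheese.",
    "Green salad": "Crispy lettuce with a pinch of salt.",
}
```

With this sample data, `extract_tags(RECIPES["Burger"])` keeps steak, bread, lettuce and ketchup and drops “love”, and `top_cooccurring(RECIPES, "steak")` returns bread as the ingredient most often paired with steak.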

But is using a Full Text Indexer with a simple thesaurus sufficient?

More advanced tools for structuring data

A common situation is a synonym or a related term: a “steak” is in fact a “beef steak”. When we want to build indicators for all the beef recipes, we need to count a recipe with steak as a recipe containing beef. The tags are connected to one another, and we must model this in our thesaurus.

What really brings out the value of the data is the organization of the tags. This organization requires hierarchies (steak is a specialization of beef, which is a specialization of meat), segmentation (alcohol-free, alcoholic), and collections (jalapeños, tacos and burritos belong to the Mexican recipes family).
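One possible way to model synonyms and hierarchies is sketched below in Python. The dictionaries `SYNONYMS` and `BROADER` and the helper names are hypothetical; a real thesaurus would be much larger and might live in the search engine configuration rather than in application code.

```python
# Hypothetical thesaurus relations, for illustration only.
SYNONYMS = {"beef steak": "steak"}            # variant -> canonical tag
BROADER = {"steak": "beef", "beef": "meat"}   # tag -> more general tag

def expand(tags):
    """Return the canonical tags plus every broader term above them."""
    expanded = set()
    for tag in tags:
        current = SYNONYMS.get(tag, tag)
        # Walk up the hierarchy until there is no broader term left.
        while current is not None and current not in expanded:
            expanded.add(current)
            current = BROADER.get(current)
    return expanded

def count_recipes_with(tagged_recipes, tag):
    """Count recipes whose expanded tag set contains `tag`."""
    return sum(1 for tags in tagged_recipes.values() if tag in expand(tags))

# Invented sample data: recipe title -> tags extracted from its description.
TAGGED = {
    "Burger": {"steak", "bread", "lettuce", "ketchup"},
    "Steak sandwich": {"beef steak", "bread"},
    "Green salad": {"lettuce"},
}
```

Here `count_recipes_with(TAGGED, "beef")` counts both the burger and the steak sandwich, even though neither is tagged with “beef” directly.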

On the one hand, some of these structures are supported out of the box by search engines. The rest need to be handled in the application layer.

On the other hand, only a few database engines can manage even a basic thesaurus. So here, combining a search engine with a database really makes sense. The search engine helps structure the data upstream, then the SQL database stores the data for report generation.

Search engines and databases combined

We will conclude by sharing the conceptual schema behind most of our mixed-approach applications. The goal is to get the best of both worlds.

First we load the data into a search engine in unstructured form, and use a typical discovery app to find patterns and processing steps that qualify the information as much as possible. We then load the data, via a smarter and more complete text analysis process, into a SQL database. This database is perfectly suited to producing decision-making reports.
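To make the last step concrete, here is a minimal sketch of the SQL side using Python’s built-in `sqlite3` module. It assumes the text analysis has already produced a set of tags per recipe; the table names, column names and sample data are hypothetical, and SQLite merely stands in for whatever SQL database the reports run on.

```python
import sqlite3

def load_recipes(conn, tagged_recipes):
    """Store recipes and their extracted tags in two simple tables."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS recipe ("
        "id INTEGER PRIMARY KEY, title TEXT NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS recipe_tag ("
        "recipe_id INTEGER NOT NULL REFERENCES recipe(id), "
        "tag TEXT NOT NULL)"
    )
    for title, tags in tagged_recipes.items():
        cursor = conn.execute("INSERT INTO recipe (title) VALUES (?)", (title,))
        conn.executemany(
            "INSERT INTO recipe_tag (recipe_id, tag) VALUES (?, ?)",
            [(cursor.lastrowid, tag) for tag in sorted(tags)],
        )
    conn.commit()

def top_tags(conn, n=10):
    """Report query: the n most used tags, ties broken alphabetically."""
    return conn.execute(
        "SELECT tag, COUNT(*) AS uses FROM recipe_tag "
        "GROUP BY tag ORDER BY uses DESC, tag ASC LIMIT ?",
        (n,),
    ).fetchall()

# Invented sample output of the text analysis step.
TAGGED = {
    "Burger": {"steak", "bread", "lettuce", "ketchup"},
    "Steak sandwich": {"steak", "bread", "cheese"},
}
```

Once the tags sit in a relational table, the “ten most used ingredients” indicator becomes a plain `GROUP BY` query, for example `top_tags(sqlite3.connect(":memory:"))` after calling `load_recipes` on the same connection.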

Originally published at inovia.fr on August 26, 2011.