Foundbite's Data Model: Relational DB vs. NoSQL on Cloudant

By Mike Breslin

Two weeks ago, we published a blog about Foundbite, which is a great app that recently launched in the Windows Phone Store. In this blog, we thought it would be useful to walk through a data design and querying exercise for an application like Foundbite. We’ll do this exercise for a relational DB as well as Cloudant to highlight the differences. We’ll start by showing what the Foundbite schema would look like in a relational database like MySQL, and then show its counterpart using the flexible schema of Cloudant’s JSON document store. We’ll then go through the process of adding additional data fields in both, as well as comparing a few queries in each database.

Foundbite combines pictures with the ambient sound happening at the time to allow users to share truly authentic experiences. Foundbite was built using Cloudant as its back-end data layer, and data for each Foundbite includes timestamps of when it was taken and uploaded, along with images, audio, description, and location information. The developers of Foundbite plan to periodically add data entities to enhance the user experience.

Contact us about Cloudant services and training

I recently joined Cloudant as director of services, and we’ve started offering formal training and consulting services to customers who would like to learn the Cloudant data layer in more detail. Making the transition between relational databases and Cloudant NoSQL is a common question, so we hope this hypothetical exercise provides some helpful context. While the JSON below is taken directly from Foundbite, the UML diagrams are not. They are, however, our own take on one possible relational model for the application (suggestions welcome below in the comments).

I’m a die-hard Boston Red Sox fan, and one lucky enough to celebrate in Fenway Park last month as they clinched the World Series. I don't have a Windows Phone so I couldn't use Foundbite at the game, but let’s use the example of me posting a few pictures of Fenway Park -- which would have included audio of the raucous crowd noise that ensued, had I been using the app -- when my beloved Sox won the final game.

Building a Relational Model

To start, foundbites will need to contain the following information: timestamp when uploaded, location coordinates, description, timestamp when taken, total length in ticks of the audio, one or more images, an audio file, and a username. In a relational schema, one would typically normalize the data and connect various tables with primary and foreign keys to obtain an efficient data model. You might end up with entity-relationship diagram that looked something like the following:

Note that there would probably be a users table in this scenario as well, but we’ve eliminated that for simplicity sake. Having already launched the application, the creators of Foundbite plan to continue adding data fields to enable new features. So their schema will be changing as incremental updates are made. New fields include reverse geo-location data like country, state, and city, along with hashtags and Foursquare venue IDs. In order to accomplish this in a relational database, you would have to run a few CREATE TABLE and ALTER TABLE statements, the latter of which might prove to be a bit tricky if your application were to remain in production. The end result might look something like this with the additions highlighted in bold:

Building a Document in Cloudant

Now lets transition over to Cloudant and walk through how one would store Foundbite data in a document database. Cloudant allows for “flexible schema,” meaning developers can use any schema design they choose and each document is not required to follow a specific format. Cloudant stores JSON documents, which are somewhat analogous to rows in a relational database, and they can contain field types like numbers, strings, dates, Booleans, sub-structures, arrays, and attachments (video, images, or any other file type). The initial Foundbite data would likely be stored in Cloudant in a document that looked like this:

{
    "_id": "foundbitedoc1",
    “schema”: “entry”,
    “timeStamp": 1379447630,
    “userName": "soxfan",
    “geoLocation": {
        “coordinates": [42.34493, -71.093232],
        “type": "Point"
    },
    “description": "Red Sox just clinched the World Series in Boston",
    “timeTaken": 1379447530,
    “imagesUploaded": [”http://foundbite.blob.core.windows.net/pics/fenway1.png", ”http://foundbite.blob.core.windows.net/pis/fenway2.png"],
    “audioUploaded": {
        “file": ”http://foundbite.blob.core.windows.net/audio/fb-audio1.mp3",
        “length": 23
    }
}

In this case, all of the required information for the Foundbite is stored in a single document. We add an arbitrary field called "schema": to denote that this is a Foundbite entry to separate it from other document types that could exist in our Cloudant database. Note that this isn’t necessary, but it is a good best-practice. We are leveraging the GeoJSON standard for location coordinates, and are referencing to audio and image files stored in Windows Azure Blob Storage. You could alternatively use binary document attachments to store the audio and images directly attached to the documents in Cloudant itself.

Now we want to show what it would take to add more data to the foundbites. In Cloudant, you can start by adding the data to new Foundbite documents immediately without any schema changes, database downtime or table locks. Previous Foundbites can be updated as needed or left without the additional fields. The updated doc would end up looking like the JSON below, with new fields added for “reverseGeoLocation”:, “fourSquareVenueID":, and “hashTags"::

{
    "_id": "foundbitedoc2",
    “schema”: “entry”,
    ”timeStamp": 1379447630,
    “userName": "soxfan",
    “geoLocation": {
        “coordinates": [42.34493, -71.093232],
        “type": "Point”,
        “reverseGeoLocation”: {
            “country”: “United States”,
            “countyState”: “Massachusetts”,
            “cityDistrict”: “Boston”
         }
    },
    “description": "Red Sox just clinched the World Series in Boston",
    “timeTaken": 1379447530,
    “imagesUploaded": [”http://foundbite.blob.core.windows.net/pics/fenway1.png", ”http://foundbite.blob.core.windows.net/pis/fenway2.png"],
    “audioUploaded": {
        “file": ”http://foundbite.blob.core.windows.net/audio/fb-audio1.mp3",
        “length": 23
    },
    “fourSquareVenueID": "40eb3d00f964a520250a1fe3",
    “hashTags": ["#boston", "#redsox"]
}

Schema flexibility is one of the primary reasons that developers choose a document store database like Cloudant. Now let’s move on and compare some typical queries that one would run in a relational DB and compare those to their counterparts in Cloudant.

Feed of Foundbites by User

One of the primary functions of the Foundbite app is the ability to see a feed of all the posts by a given user, ordered by the time the foundbite was taken. Let’s first see what this would look like using SQL against the relational DB example above. We would build a SQL statement that joins the foundbites table and the audio table to be able to gather the time taken. We would add a filter on a specific username in the WHERE clause and use ORDER BY to sort by the time taken. Let’s take a look:

SELECT
    foundbites.id,
    audio.timetaken
FROM
    foundbites,
    audio
WHERE
    foundbites.audio_id=audio.id AND
    foundbites.username=’soxfan’
ORDER BY
    audio.timetaken

Now if you wanted to get all the supporting information about the foundbite like the location and audio and images files, then you would add additional joins to the relevant tables and end up with a query something like the following:

SELECT
     foundbites.id,
     foundbites.timestamp,
     foundbites.description,
     locations.city_district,
     audio.path,
     audio.length,
     audio.timetaken,
     images.path,
     foursquares.venue
FROM
     foundbites,
     locations,
     audio,
     images,
     foursquares
WHERE
     foundbites.audio_id=audio.id AND
     foundbites.location_id=locations.id AND
     images.foundbite_id=foundbites.id AND
     foundbites.foursquare_id=foursquares.id AND
     foundbites.timestamp > ‘2013/10/20 00:00:00’ AND
     foundbites.timestamp < ‘2013/11/01 23:00:00’ AND
     foundbites.username=‘soxfan’
ORDER BY
     audio.timetaken

In Cloudant, typically you would use a Secondary Index with MapReduce to query your data ordered by a user-defined key. For more ad-hoc queries, one could use the built-in Lucene Search as well, but we’ll get to that later. To build the Secondary Index, we would create a view (in this case called feed_by_user) in a design doc that includes both the username and the time taken in the key of the index:

views: {
   feed_by_user: {
	   map: function(doc) {
		   if(doc.schema == 'entry') {
			emit([doc.userName, doc.timeTaken], null)
		   }
	   }
   }
}

We could then query the view with Cloudant’s HTTP API with the appropriate filters to get back all the documents for a specific username and time taken:

https://mydb.cloudant.com/foundbite/_design/app/_view/feed_by_user?startkey=["soxfan",1379000000]&endkey=["soxfan",1379085276]

Note that the index is already sorted by the key (in this case [<userName>, <timeTaken>]) so there is no need to do additional sorting. You could also add &include_docs=true to the query to get back all of the information in each Foundbite document as well.

Search on Foundbite Description

Another important feature of Foundbite is the ability to search on the foundbite descriptions, for example to search for all foundbites with “Boston” in the description, as in the home of the World Series Champion Boston Red Sox. In the relational database, we could query to pull back all the foundbite information that contains the word “Boston”. To do this in MySQL, you would probably leverage its full-text searching feature. That would involve altering the tables with searchable fields by running a command like this: ALTER TABLE foundbites ADD FULLTEXT(description). You can then query the description field using the WHERE MATCHAGAINST syntax as follows:

SELECT
     foundbites.id,
     foundbites.timestamp,
     foundbites.description,
     locations.city_district,
     audio.path,
     audio.length,
     audio.timetaken,
     images.path,
     foursquares.venue
FROM
     foundbites,
     locations,
     audio,
     images,
     foursquares
WHERE
     foundbites.audio_id=audio.id AND
     foundbites.location_id=locations.id AND
     images.foundbite_id=foundbites.id AND
     foundbites.foursquare_id=foursquares.id AND
     MATCH(foundbites.description) AGAINST ('Boston');

For large data sets, even the built-in full-text searching feature of MySQL won’t scale adequately. You would more likely install and configure a third-party search engine to ensure speedy queries. Search engines like Apache Solr, Elasticsearch and Sphinx are commonly used alongside a relational database like MySQL. As an example, adding Apache Solr would involve installing an external JVM, downloading the Solr binaries and configuring the JVM, connecting the JVM to your MySQL database, building a mapping of data between MySQL schema to Solr, and importing the data from MySQL into Solr.

Now let’s move over to Cloudant to see how one would search for foundbites matching a certain description. Cloudant includes ad-hoc search, which is built upon Lucene, and allows users a highly efficient way to search document contents. In order to use Cloudant Search, you would set up an index in a design document on the searchable fields (in this case description) as follows:

indexes: {
   descriptions: {
	   index: function(doc) {
		   index(default, doc.description)
		   }
	   }
   }
}

You can then query to find a list of all the foundbites with “Boston” in the description:

https://mydb.cloudant.com/foundbite/_design/app/_search/descriptions?q=description:Boston

Cloudant’s customers commonly use search indexes to run ad-hoc queries across large datasets in a very efficient and quick manner, and enjoy the benefit of not requiring any third-party components to leverage the feature with their data.

Conclusion

The above is just a simple example of using Cloudant versus a relational data store, but hopefully it will help shed light on the high-level differences to data design and querying between the two different types of databases.

If you have any questions or need assistance, contact us at #cloudant on IRC or email support@cloudant.com. If you are on a relational data store and require assistance migrating to Cloudant, check out our training classes and data migration service at https://cloudant.com/product/training-services/.

Getting More Help

If you need help getting started with Cloudant, visit the Cloudant Developer Resources Site (https://cloudant.com/for-developers/) or contact us for assistance:

Sign Up for Updates!

Recent Posts