Real Time Analytics using MongoDB
MongoDB is a top database choice for application development. Developers choose it for its flexible data model and its inherent scalability as a NoSQL database, features that let development teams iterate and pivot quickly and efficiently. MongoDB was not originally developed for high-performance analytics, yet analytics is now a vital part of modern data applications. Developers have therefore built ingenious solutions for running real-time analytical queries on data stored in MongoDB, using either in-house solutions or third-party products.
Using MongoDB's upsert and $inc features, we can solve this problem efficiently. When an app server renders a page, it can send one or more updates to the database to update statistics.
We can do this efficiently for a few reasons. First, we send a single message to the server for the update. The message is an "upsert": if the object exists, the counters are incremented; if it does not, the object is created. Second, we do not wait for a response; we simply send the operation and immediately return to the work at hand. Since the data is simply page counters, we do not need to wait and see whether the operation completes (we wouldn't report such an error to our website user anyway). Third, the special $inc operator lets us efficiently update an existing object without requiring a much more expensive query/modify/update sequence.
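As a minimal sketch of this pattern using PyMongo (the page_stats collection and the daily.views/total.views counter fields are illustrative assumptions, not from the original):

    from pymongo import MongoClient
    from pymongo.write_concern import WriteConcern

    client = MongoClient("mongodb://localhost:27017")
    db = client["analytics"]

    # Write concern w=0 means the driver does not wait for an acknowledgement,
    # matching the "send the operation and return immediately" behaviour above.
    page_stats = db.get_collection("page_stats", write_concern=WriteConcern(w=0))

    def record_page_view(url: str) -> None:
        # One document per page; $inc increments the counters in place, and
        # upsert=True creates the document on the first hit for that page.
        page_stats.update_one(
            {"_id": url},
            {"$inc": {"daily.views": 1, "total.views": 1}},
            upsert=True,
        )

    record_page_view("/home")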
There are two main methods to perform analytics using MongoDB:
1. Replicating a MongoDB database into a SQL database:
Replicating data into a SQL database allows users to continue using MongoDB as their production database while analyzing the data in a relational format. SQL can be used on this relational version of the MongoDB data, which makes it easy to access and maintain the data and to combine data from multiple tables using indexes for insightful analysis.
SQL brings a lot of convenience when working with lengthy aggregations and complex joins. However, data replication is not as easy as it sounds: it requires an ETL job, which can be complicated because data must be moved from a NoSQL environment into a SQL environment.
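As a rough, hedged sketch of such an ETL job (the orders collection, its fields, and the SQLite target are assumptions used purely for illustration):

    import sqlite3
    from pymongo import MongoClient

    orders = MongoClient("mongodb://localhost:27017")["shop"]["orders"]
    sql = sqlite3.connect("analytics.db")
    sql.execute(
        "CREATE TABLE IF NOT EXISTS orders (id TEXT PRIMARY KEY, customer TEXT, total REAL)"
    )

    # Extract the documents, transform (flatten) them into rows, and load them into SQL.
    rows = [
        (str(doc["_id"]), doc.get("customer"), doc.get("total", 0.0))
        for doc in orders.find({}, {"customer": 1, "total": 1})
    ]
    sql.executemany("INSERT OR REPLACE INTO orders VALUES (?, ?, ?)", rows)
    sql.commit()
    sql.close()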
2. Data Virtualization:
Data Virtualization is another method that can be used for real-time analytics on MongoDB, and it is the ideal way to counter the limitations of replicating databases. Various tools provide an interactive, user-friendly interface; they connect to MongoDB easily and allow users to query or manipulate the data stored there. Users can then build visualizations and perform real-time analysis in just a few clicks using dashboards and customer-facing reports. The advantage here is that no additional hardware or tedious ETL jobs are required to analyze the data.
One such tool is Apache Spark. MongoDB supports this popular framework, which is loved by data scientists, engineers, and analysts. MongoDB provides powerful large-scale analytics features that allow users to perform analytics within the platform, converting data into visualizations, along with a parallel query execution engine to boost performance. MongoDB also offers a SQL-based BI connector that allows users to explore their MongoDB data with business intelligence tools such as Microsoft Power BI.
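A hedged sketch of reading MongoDB data into Spark is shown below. The connection URI, database and collection names, and the country field are assumptions, and the exact option names depend on the MongoDB Spark Connector version in use (this sketch assumes a 10.x connector already on the Spark classpath):

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("mongodb-analytics")
        # Connector 10.x option; older connector versions use spark.mongodb.input.uri.
        .config("spark.mongodb.read.connection.uri", "mongodb://localhost:27017")
        .getOrCreate()
    )

    views = (
        spark.read.format("mongodb")          # older connector versions use "mongo"
        .option("database", "analytics")
        .option("collection", "page_stats")
        .load()
    )

    # The aggregation runs in parallel across the Spark executors.
    views.groupBy("country").count().show()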
Advantages:
1. Powerful Analytics: MongoDB supports real-time analytics over a wide variety of data. It allows analytics to be performed on secondary data and even on text searches, and it integrates strongly with the aggregation framework and the MapReduce paradigm (see the aggregation sketch after this list).
2. Speed: As a document-oriented database, MongoDB lets us query data quickly. Its rich indexing capabilities allow it to perform significantly faster than a relational database for many workloads.
3. Easy Setup: MongoDB can be set up easily on any system.
4. Scalability: NoSQL databases are built to scale. MongoDB's sharding capability allows it to distribute data across multiple servers, giving it virtually unlimited growth capacity and higher throughput than a single relational database.
5. Data Adaptability: A NoSQL system like MongoDB supports a wide variety of data, such as text and geospatial data. Its ultra-flexible data model makes it easy to incorporate new data and to make adjustments for better performance.
6. Real-Time: With MongoDB, users can analyze data of any structure within the database and get real-time results without costly data warehouse loads.
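As an illustration of the aggregation framework mentioned in point 1, a small PyMongo pipeline might look like the following (the events collection and its fields are assumptions):

    from pymongo import MongoClient

    events = MongoClient("mongodb://localhost:27017")["analytics"]["events"]

    pipeline = [
        {"$match": {"type": "page_view"}},                   # keep only page-view events
        {"$group": {"_id": "$page", "views": {"$sum": 1}}},  # count views per page
        {"$sort": {"views": -1}},                            # most-viewed pages first
        {"$limit": 10},
    ]

    for doc in events.aggregate(pipeline):
        print(doc["_id"], doc["views"])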
Disadvantages:
1. No Support for Joins: MongoDB does not support joins the way a relational database does. Joins have to be implemented in application code (for example in Java or Python), which makes queries more complex and can hurt performance (see the join sketch after this list).
2. Memory Constraints: MongoDB can use memory inefficiently. Every document stores its own key names alongside the values, so it suffers from duplication and higher storage overhead.
3. No Referential Integrity: Referential integrity means defined and validated relationships between different pieces of data; it keeps information consistent and adds another layer of validation underneath the programmatic one. MongoDB does not enforce it.
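As a sketch of the application-side join mentioned in point 1 (collection and field names such as orders, customers, and customer_id are assumptions):

    from pymongo import MongoClient

    db = MongoClient("mongodb://localhost:27017")["shop"]

    # Two separate queries whose results are combined in application code.
    orders = list(db.orders.find({"status": "open"}))
    customer_ids = list({o["customer_id"] for o in orders})
    customers = {c["_id"]: c for c in db.customers.find({"_id": {"$in": customer_ids}})}

    for order in orders:
        # Attach the matching customer document to each order in memory.
        order["customer"] = customers.get(order["customer_id"])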
Using MongoDB with Relational Databases:
Relational databases have been around for decades, and programmers have built countless applications, web-based and otherwise, on top of them. If the problem domain is relational, then using an RDBMS is an obvious choice: real-world entities are mapped to tables, and the relationships among the entities are maintained in further tables. But some parts of the problem domain may not fit a relational data model well, and we may need a data store that supports a flexible schema. In such scenarios we can use a document-oriented data store such as MongoDB. The application code then has separate modules for accessing and manipulating the data in the RDBMS and the data in MongoDB.
Potential use cases:
1. Storing results of aggregation queries:
The results of expensive aggregation queries (count, group by, and so on) can be stored in a MongoDB database. This allows the application to fetch the result quickly from MongoDB without performing the same query again, until the result becomes stale (at which point the query is re-run and the result is stored again). Since the schema of a MongoDB collection is flexible, we don't need to know anything about the structure of the result data beforehand; the rows returned by the aggregation query can simply be stored as BSON (Binary JSON) documents.
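A minimal sketch of this caching pattern, assuming a hypothetical sales table in the relational database and a query_cache collection in MongoDB:

    import sqlite3
    from datetime import datetime, timedelta, timezone
    from pymongo import MongoClient

    cache = MongoClient("mongodb://localhost:27017", tz_aware=True)["reports"]["query_cache"]
    sql = sqlite3.connect("app.db")

    MAX_AGE = timedelta(minutes=15)

    def sales_by_region():
        cached = cache.find_one({"_id": "sales_by_region"})
        if cached and datetime.now(timezone.utc) - cached["created_at"] < MAX_AGE:
            return cached["rows"]  # cached result is still fresh; reuse it

        # Re-run the expensive SQL aggregation and store the rows in MongoDB.
        rows = [
            {"region": region, "total": total}
            for region, total in sql.execute(
                "SELECT region, SUM(amount) FROM sales GROUP BY region"
            )
        ]
        cache.replace_one(
            {"_id": "sales_by_region"},
            {
                "_id": "sales_by_region",
                "created_at": datetime.now(timezone.utc),
                "rows": rows,
            },
            upsert=True,
        )
        return rows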
2. Data Archiving:
As the volume of data grows, queries and other operations on a relational table take increasingly more time. One solution is to partition the data into two tables: an online table, which contains the working dataset, and an archive table that holds the old data. The size of the online table remains more or less constant, while the archive table keeps growing. The drawback of this approach is that when the schema of the online table changes, we have to apply the same changes to the archive table, which is a very slow operation because of the volume of data. Also, if we drop one or more columns from the online table, we have to drop them from the archive table too, losing old data that might have been valuable. To get around this problem, we can use a MongoDB collection as the archive.
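A rough sketch of this archiving approach, assuming a hypothetical orders table with a created_at column and a MongoDB archive collection:

    import sqlite3
    from pymongo import MongoClient

    sql = sqlite3.connect("app.db")
    sql.row_factory = sqlite3.Row
    archive = MongoClient("mongodb://localhost:27017")["archive"]["orders"]

    CUTOFF = "2020-01-01"

    old_rows = sql.execute(
        "SELECT * FROM orders WHERE created_at < ?", (CUTOFF,)
    ).fetchall()

    if old_rows:
        # Each row becomes a document; later schema changes to the online table
        # do not require touching documents that were already archived.
        archive.insert_many([dict(row) for row in old_rows])
        sql.execute("DELETE FROM orders WHERE created_at < ?", (CUTOFF,))
        sql.commit()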
3. Logging:
We can use MongoDB for logging events in an application. A relational database could serve the same purpose, but the insert operations on the log table add overhead that slows the application's response. We could also try simple file-based logging, but then we would have to write our own regular-expression-powered log parsing code to analyze the log data and extract information from it.
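A minimal logging sketch is shown below; the capped collection keeps the log at a fixed size, and the collection name, size, and fields are assumptions:

    from datetime import datetime, timezone
    from pymongo import MongoClient
    from pymongo.errors import CollectionInvalid

    db = MongoClient("mongodb://localhost:27017")["app"]

    try:
        # A capped collection overwrites its oldest entries once the size limit is hit.
        db.create_collection("event_log", capped=True, size=50 * 1024 * 1024)
    except CollectionInvalid:
        pass  # collection already exists

    def log_event(level: str, message: str, **context) -> None:
        db.event_log.insert_one({
            "ts": datetime.now(timezone.utc),
            "level": level,
            "message": message,
            "context": context,
        })

    log_event("INFO", "user logged in", user_id=42)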
4. Storing entity metadata:
The application you build maps the entities of the domain into tables. The entities could be physical, real-world objects (users, products, and so on) or virtual ones (blog posts, categories, and tags). You determine what pieces of information need to be stored for each of these entities, and then you design the database schema and define the table structures. When some entities need extra, loosely structured metadata that does not fit those fixed columns, that metadata can be kept in MongoDB.
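A small sketch of keeping such flexible metadata in MongoDB, with hypothetical collection and field names:

    from pymongo import MongoClient

    metadata = MongoClient("mongodb://localhost:27017")["catalog"]["product_metadata"]

    # Different products can carry completely different attributes without any
    # schema change; the relational table keeps only the fixed columns.
    metadata.insert_one({"product_id": 101, "colour": "red", "sizes": ["S", "M", "L"]})
    metadata.insert_one({"product_id": 202, "voltage": "230V", "warranty_years": 2})

    print(metadata.find_one({"product_id": 101}))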
Defining the relational model:
The relational data model provides conceptual tools for designing the schema of a relational database. It describes the data, the relationships between the data, the data semantics, and the constraints on the data in the relational database.
The relational model expresses the data and the relationships among the data in the form of tables. It is popular for its simplicity and for its ability to hide low-level implementation details from database developers and users. The relational model expresses the database as a set of relations. Each relation has columns and rows, formally called attributes and tuples respectively, and each tuple in a relation represents a real-world entity or relationship.
1. The database is a set of related relations.
2. Each relation has a name which indicates what type of tuples the relation holds. For example, a relation named student indicates that it contains student entities.
3. Each relation has a set of attributes which represent the different types of values.
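As a small illustration of these terms, the following SQLite snippet shows a relation named student with three attributes and two tuples (the relation name, attributes, and values are assumptions for illustration):

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE student ("
        "  roll_no INTEGER PRIMARY KEY,"  # attribute
        "  name    TEXT,"                 # attribute
        "  dept    TEXT"                  # attribute
        ")"
    )

    # Each inserted row is a tuple representing one real-world student entity.
    db.executemany(
        "INSERT INTO student VALUES (?, ?, ?)",
        [(1, "Asha", "CS"), (2, "Ravi", "EE")],
    )

    for row in db.execute("SELECT * FROM student"):
        print(row)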