In my previous blog, How to use MongoDB Data Modelling to Improve Throughput Operations, we discussed the two major data modelling relationship approaches: embedding and referencing. The scalability of MongoDB depends heavily on its architecture and, more specifically, on data modelling. When designing a NoSQL database, the main consideration is to keep documents schema-less and the number of collections small, for the purpose of easy maintenance. To ensure good data integrity, validating data against some defined rules before storage is encouraged. The database architecture and design should be normalized and decomposed into multiple small collections as a way of avoiding data repetition, improving data integrity and simplifying retrieval patterns. With this in place, you are able to improve the consistency, atomicity, durability and integrity of your database.
Data modelling is not an afterthought in the application development phase but an initial consideration, since many application facets are actually realised during the data modelling stage. In this article we are going to discuss which factors need to be considered during data modelling and see how they affect the performance of a database in general.
Many times you will need to deploy your database as a cluster as one way of increasing data availability. With a well-designed data model you can distribute operations across a sharded cluster more effectively and so reduce the number of operations aimed at a single mongod instance. The major factors to consider in data modelling include:
- Scalability
- Atomicity
- Performance and Data usage
- Sharding
- Indexing
- Storage optimization
- Document structure and growth
- Data Lifecycle
1. Scalability
Scalability is the ability of a database to cope with an increase in application workload driven by increased traffic. Most applications expect their number of users to grow. When a single database instance serves too many users, performance does not always meet expectations. As a database manager, you therefore need to design the database so that collections and data entities are modelled on both the present and future demands of the application. The database structure should lend itself to easy replication and sharding. With more shards, write operations are distributed among them, so a data update is performed within the shard containing that data rather than by searching a single large data set.
2. Atomicity
This refers to the success or failure of an operation as a single unit. For example, an operation made up of several steps, such as a fetch followed by a sort, is treated as one unit: if the sort step is not handled properly, the operation as a whole fails rather than proceeding to the next stage.
Atomic transactions are series of operations that are neither divisible nor reducible, hence they succeed as a single entity or fail as a single operation. MongoDB versions before 4.0 support atomic write operations at the single-document level only. From version 4.0, one can also implement multi-document transactions. A data model that favours atomic operations tends to perform well in terms of latency. Latency is simply the time between sending an operation request and receiving a response from the database. To be succinct, it is easier to update data that is embedded in a single document than data that is referenced.
For example, consider the data set below:
{
childId : "535523",
studentName : "James Karanja",
parentPhone : 704251068,
age : 12,
settings : {
location : "Embassy",
address : "420 01",
bus : "KAZ 450G",
distance : "4"
}
}
If we want to update the age by increasing it by 1 and change the location to London we could do:
db.getCollection('students').update({childId: "535523"}, {$set: {'settings.location': 'London'}, $inc: {age: 1}})
If, for example, the $set operation fails, the $inc operation is not applied either, and the whole operation fails.
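The all-or-nothing behaviour of a single-document update can be sketched in plain JavaScript. This is a simplified in-memory model, not MongoDB's implementation: the helper name applyUpdateAtomically is hypothetical, and it only understands $set and $inc with dotted paths.

```javascript
// Simplified in-memory model of all-or-nothing update semantics.
// NOT MongoDB's implementation; only $set and $inc are supported.
function applyUpdateAtomically(doc, update) {
  // Work on a deep copy so the original is untouched if any step throws.
  const draft = JSON.parse(JSON.stringify(doc));

  // Walk a dotted path such as "settings.location" and return the
  // parent object together with the final key.
  const resolve = (target, path) => {
    const keys = path.split(".");
    const last = keys.pop();
    let node = target;
    for (const key of keys) {
      if (typeof node[key] !== "object" || node[key] === null) node[key] = {};
      node = node[key];
    }
    return [node, last];
  };

  for (const [path, value] of Object.entries(update.$set || {})) {
    const [node, key] = resolve(draft, path);
    node[key] = value;
  }
  for (const [path, amount] of Object.entries(update.$inc || {})) {
    const [node, key] = resolve(draft, path);
    if (node[key] !== undefined && typeof node[key] !== "number") {
      throw new Error(`Cannot $inc non-numeric field "${path}"`);
    }
    node[key] = (node[key] || 0) + amount;
  }
  return draft; // Only returned when every operator succeeded.
}

const student = {
  childId: "535523",
  age: 12,
  settings: { location: "Embassy" },
};

const updated = applyUpdateAtomically(student, {
  $set: { "settings.location": "London" },
  $inc: { age: 1 },
});
console.log(updated.age, updated.settings.location); // 13 London
```

If any operator throws, the caller never receives a partially modified document, which is the property the embedded model relies on.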
On the other hand, let’s consider referenced data such that there are two collections, one for students and the other for settings.
Student collection
{
childId : "535523",
studentName : "James Karanja",
parentPhone : 704251068,
age : 12
}
Settings collection
{
childId : "535523",
location : "Embassy",
address : "420 01",
bus : "KAZ 450G",
distance : "4"
}
In this case you can update the age and location values with separate write operations, i.e.
db.getCollection('students').update({childId: "535523"}, {$inc: {age: 1}})
db.getCollection('settings').update({childId: "535523"}, {$set: {location: 'London'}})
If one of the operations fails, it does not necessarily affect the other since they are carried out as different entities.
Transactions for Multiple Documents
With MongoDB version 4.0, you can now carry out multi-document transactions on replica sets. This allows a group of operations across several documents, collections and databases to be applied as a single unit. When a transaction is committed, all of its changes are saved, whereas if something goes wrong and the transaction fails, the changes made so far are discarded and the transaction is aborted. None of the changes are visible outside the transaction while it is in progress; they only become visible once the transaction is fully committed.
Although you can update multiple documents in a single transaction, this comes with the setback of reduced performance compared to single-document writes. Besides, this approach is only supported on the WiredTiger storage engine, which is a disadvantage for deployments using the In-Memory and MMAPv1 storage engines.
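As a rough sketch, the two writes from the referenced example could be wrapped in one transaction using the mongo shell session methods. The function name updateStudentAndSettings and the database name "school" are assumptions made for illustration, and the collections are assumed to exist already.

```javascript
// Sketch for the mongo shell. `session` is expected to be the object
// returned by db.getMongo().startSession(). The database name "school"
// is hypothetical.
function updateStudentAndSettings(session, childId) {
  const school = session.getDatabase("school");
  session.startTransaction();
  try {
    school
      .getCollection("students")
      .updateOne({ childId: childId }, { $inc: { age: 1 } });
    school
      .getCollection("settings")
      .updateOne({ childId: childId }, { $set: { location: "London" } });
  } catch (error) {
    session.abortTransaction(); // Neither write is kept.
    return false;
  }
  session.commitTransaction(); // Both writes become visible together.
  return true;
}

// Possible usage in the shell:
//   const session = db.getMongo().startSession();
//   updateStudentAndSettings(session, "535523");
//   session.endSession();
```

A failure of either write aborts the whole transaction, so the two collections cannot end up out of step with each other.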
3. Performance and Data Usage
Applications are designed differently to meet different purposes. Some, such as weather news applications, serve only current data. Depending on the structure of an application, one should design a correspondingly optimal database to serve the required use case. For example, if an application fetches only the most recent data from the database, a capped collection is often the best option. A capped collection supports high-throughput operations in the manner of a buffer: when the allocated space is used up, the oldest documents are overwritten, and documents can be fetched in the order they were inserted. Because retrieval follows insertion order, no index is needed for it, and the absence of index overhead also improves write throughput. The data in a capped collection is usually small enough to be held in RAM for some time. Such temporal data is served from the cache, and since it is read far more often than it is written, read operations are quite fast. However, capped collections come with some disadvantages: you cannot delete a document without dropping the whole collection, an update that changes the size of a document will fail, and it is not possible to shard a capped collection.
Different facets are integrated into the data model of a database depending on usage needs. For example, reporting applications tend to be more read intensive, hence the design should aim to improve read throughput.
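The buffer-like behaviour can be illustrated with a toy model in plain JavaScript. The class name CappedBuffer is hypothetical, and the sketch limits the collection by document count only, whereas a real capped collection (created in the shell with db.createCollection and the capped and size options) is limited by size in bytes.

```javascript
// Toy model of a capped collection limited by document count.
// MongoDB caps by size in bytes (and optionally by count); this sketch
// uses a count only, to keep the idea visible.
class CappedBuffer {
  constructor(maxDocuments) {
    this.maxDocuments = maxDocuments;
    this.documents = [];
  }

  insert(doc) {
    this.documents.push(doc);
    // Once the limit is exceeded, the oldest document is discarded.
    if (this.documents.length > this.maxDocuments) {
      this.documents.shift();
    }
  }

  // Documents come back in insertion order without any index.
  find() {
    return this.documents.slice();
  }
}

const readings = new CappedBuffer(3);
for (const temperature of [21, 22, 23, 24]) {
  readings.insert({ temperature });
}
console.log(readings.find().map((doc) => doc.temperature)); // [ 22, 23, 24 ]
```

The first reading is gone after the fourth insert, and nothing had to be deleted explicitly.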
4. Sharding
Sharding improves performance through horizontal scaling, since the read and write workloads are distributed among the cluster members. Deploying a cluster of shards partitions each sharded collection into smaller subsets of documents, distributed according to a shard key. You should select an appropriate shard key that supports query isolation, so that a query can be served by a single shard, besides increasing the write capacity. A good choice generally involves a field which is present in all the documents within the targeted collection. Sharding also increases storage capacity: as the data grows, more shards can be added, each holding a subset of the cluster's data.
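To make the role of the shard key concrete, here is a small sketch in plain JavaScript that routes documents to shards by hashing the key. The function shardFor and its string hash are stand-ins chosen for illustration; MongoDB's real hashing and chunk ranges work differently.

```javascript
// Illustrative only: a simple string hash standing in for a hashed
// shard key. MongoDB's actual hashing and chunk management differ.
function shardFor(shardKeyValue, shardCount) {
  let hash = 0;
  for (const char of String(shardKeyValue)) {
    // Keep the running value an unsigned 32-bit integer.
    hash = (hash * 31 + char.charCodeAt(0)) >>> 0;
  }
  return hash % shardCount;
}

const shardCount = 3;
const shards = Array.from({ length: shardCount }, () => []);

// Distribute twelve hypothetical student documents by childId.
for (let id = 535520; id < 535532; id++) {
  const doc = { childId: String(id) };
  shards[shardFor(doc.childId, shardCount)].push(doc);
}

// A query that includes the shard key only has to visit one shard.
const targetShard = shardFor("535523", shardCount);
const found = shards[targetShard].find((doc) => doc.childId === "535523");
console.log(targetShard, found);
```

Because the shard can be computed from the key alone, both the write and the later lookup touch a single shard instead of the whole data set.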
5. Indexing
Indexing is one of the best approaches for improving the read workload, especially where the indexed fields occur in all the documents. Each index adds some overhead to write operations, because it has to be updated along with the data. When indexing, one should also consider that each index requires at least 8KB of data space. Further, an active index consumes some disk space and memory, hence it should be tracked for capacity planning.
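The trade-off behind an index can be sketched with a toy in-memory collection. The class IndexedCollection is hypothetical and uses a Map, whereas real MongoDB indexes are B-tree structures; the second student record is made up for the example.

```javascript
// Toy single-field index: a Map from field value to matching documents.
// Real MongoDB indexes are B-trees; this only shows the trade-off.
class IndexedCollection {
  constructor(indexedField) {
    this.indexedField = indexedField;
    this.documents = [];
    this.index = new Map();
  }

  insert(doc) {
    this.documents.push(doc);
    // Extra work on every write: the index must be kept in sync.
    const key = doc[this.indexedField];
    if (!this.index.has(key)) this.index.set(key, []);
    this.index.get(key).push(doc);
  }

  // Without an index, every document has to be examined.
  findByScan(value) {
    return this.documents.filter((doc) => doc[this.indexedField] === value);
  }

  // With an index, the matching documents are looked up directly.
  findByIndex(value) {
    return this.index.get(value) || [];
  }
}

const students = new IndexedCollection("childId");
students.insert({ childId: "535523", studentName: "James Karanja" });
students.insert({ childId: "535524", studentName: "Hypothetical Student" });

console.log(students.findByIndex("535523")[0].studentName); // James Karanja
```

Both lookups return the same documents; the index simply trades extra memory and extra work per insert for a faster read.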
6. Storage Optimization
Many small documents within a collection will tend to take more space than a few documents with embedded sub-documents. When modelling, one should therefore group related data before storage. With fewer documents, a database operation can be performed with fewer queries, hence reduced random disk access, and there will be fewer key entries in the corresponding index. The considerations in this case are therefore: use embedding to have fewer documents, which in turn reduces the per-document overhead. Use shorter field names when documents are small and contain few fields, since that is where field names make up a significant share of the document overhead. Shorter field names do reduce expressiveness, however. For example,
{ Lname : "Briston", score : 5.9 }
saves 9 bytes per document compared with
{ last_name : "Briston", high_score: 5.9 }
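The 9-byte figure can be checked with a few lines of JavaScript, assuming a Node.js environment for Buffer. BSON stores each field name as a null-terminated string, so for ASCII names every character removed saves one byte per document. The helper fieldNameBytes is hypothetical.

```javascript
// Sum the UTF-8 byte length of all top-level field names in a document.
// Requires Node.js for the Buffer global.
function fieldNameBytes(doc) {
  return Object.keys(doc).reduce(
    (total, name) => total + Buffer.byteLength(name, "utf8"),
    0
  );
}

const verbose = { last_name: "Briston", high_score: 5.9 };
const compact = { Lname: "Briston", score: 5.9 };

// last_name (9) + high_score (10) = 19; Lname (5) + score (5) = 10.
console.log(fieldNameBytes(verbose) - fieldNameBytes(compact)); // 9
```

Across millions of small documents this adds up, but for large documents the saving is marginal next to the loss in readability.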
Use the _id field explicitly. By default, MongoDB clients add an _id field to each document and assign it a unique 12-byte ObjectId. In addition, the _id field is indexed. If the documents are small, this accounts for a significant share of the space used across a large number of documents. For storage optimization, you can specify the value of the _id field explicitly when inserting documents into a collection. However, ensure the value is unique, because it serves as the primary key for documents in the collection.
7. Document Structure and Growth
Document growth happens as a result of push operations, where subdocuments are pushed into an array field, or when new fields are added to an existing document. Document growth has some setbacks: for a capped collection, an update that alters the size of a document will fail. For the MMAPv1 storage engine, versions before 3.0 relocate the document on disk if it outgrows its allocated space. From version 3.0 onwards, however, Power of 2 Sized Allocations reduce the chances of such re-allocations and allow effective reuse of the freed record space. If you expect your data to grow, you may want to refactor your data model to use references between data in distinct documents rather than a denormalized data model. To avoid document growth, you can also consider a pre-allocation strategy.
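The idea behind power-of-2 allocation can be sketched as follows. This is a deliberately simplified model: the function allocatedRecordSize is hypothetical, the 32-byte minimum is an assumption, and the record header and the special handling of very large documents are ignored.

```javascript
// Simplified model: round the document size up to the next power of 2.
// The 32-byte starting size is an assumption for this sketch.
function allocatedRecordSize(documentBytes) {
  let size = 32;
  while (size < documentBytes) {
    size *= 2;
  }
  return size;
}

const documentBytes = 300;
const allocated = allocatedRecordSize(documentBytes);

// The gap between the two numbers is room for the document to grow
// in place before it would have to be moved.
console.log(allocated, allocated - documentBytes); // 512 212
```

The padding is what lets a document absorb a few pushed array elements without being relocated on disk.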
8. Data Lifecycle
For an application that uses only recently inserted documents, consider using a capped collection, whose features have been discussed above.
You may also set the Time to Live feature for your collection. This is quite applicable to access tokens in the password reset feature of an application.
Time To Live (TTL)
This is a collection setting, configured through an index, that makes it possible for mongod to automatically remove data after a specified duration. It is commonly applied to machine generated event data, logs and session information which need to persist only for a limited period of time.
Example:
db.log_events.createIndex( { "createdAt": 1 }, { expireAfterSeconds: 3600 } )
We have created an index on the createdAt field and specified an expireAfterSeconds value of 3600, which is 1 hour after the time of creation. Now if we insert a document like:
db.log_events.insert( {
"createdAt": new Date(),
"logEvent": 2,
"logMessage": "This message was recorded."
} )
This document will be deleted about 1 hour after the time of insertion.
You can also set a specific clock time at which you want the document to be deleted. To do so, first create an index, i.e.
db.log_events.createIndex( { "expireAt": 1 }, { expireAfterSeconds: 0 } )
Now we can insert a document and specify the time when it should be deleted.
db.log_events.insert( {
"expireAt": new Date('December 12, 2018 18:00:00'),
"logEvent": 2,
"logMessage": "Success!"
} )
This document will be deleted automatically once the expireAt value is older than the number of seconds specified in expireAfterSeconds, i.e. 0 in this case, meaning it expires at the expireAt time itself.
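The expiry rule used by both TTL examples can be written down as a small check in plain JavaScript. The function isExpired is a hypothetical helper that mirrors the rule, not the server's code, and the dates are made up for the example. Note that mongod removes expired documents with a periodic background task, so the actual deletion can lag slightly behind the expiry time.

```javascript
// A document is eligible for deletion once the indexed date plus
// expireAfterSeconds is in the past. Non-date values never expire.
function isExpired(doc, field, expireAfterSeconds, now) {
  const value = doc[field];
  if (!(value instanceof Date)) return false;
  return value.getTime() + expireAfterSeconds * 1000 <= now.getTime();
}

const logEvent = {
  createdAt: new Date("2018-12-12T17:00:00Z"),
  logEvent: 2,
};

// Half an hour after creation, with a one-hour TTL: still kept.
console.log(isExpired(logEvent, "createdAt", 3600, new Date("2018-12-12T17:30:00Z"))); // false
// One hour after creation: eligible for deletion.
console.log(isExpired(logEvent, "createdAt", 3600, new Date("2018-12-12T18:00:00Z"))); // true
```

With expireAfterSeconds set to 0, the same check reduces to comparing the stored date with the current time, which is how the expireAt pattern works.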
Conclusion
Data modelling is a substantial undertaking in any application design and is key to improving database performance. Before inserting data into your database, consider the application's needs and which data model patterns are best to implement. Besides, important facets of an application cannot be realised until a proper data model is implemented.