
MySQL to MongoDB - An Admin Cheat Sheet


Most software applications today rely on some form of dynamic data storage that the application itself will refer back to. That data lives in a database, and databases fall into two broad categories: relational and non-relational DBMS.

Your choice between the two will depend largely on your data structure, the amount of data involved, and your requirements for database performance and scalability.

Relational DBMS store data in tables as rows and are queried with Structured Query Language (SQL), making them a good choice for applications involving many transactions. They include MySQL, SQLite, and PostgreSQL.

NoSQL DBMS such as MongoDB, on the other hand, are document-oriented: data is stored in collections as documents. This gives greater storage flexibility for large data sets and hence an advantage in scalability.

In this blog we assume you already have some knowledge of either MongoDB or MySQL and would like to see how the two correlate in terms of querying and database structure.

Below is a cheat sheet to further familiarize yourself with how MySQL queries translate to MongoDB.

MySQL to MongoDB Cheat Sheet - Terms

MySQL Term | MongoDB Term | Explanation
Table | Collection | The storage container for data that tends to be similar in the contained objects.
Row | Document | Defines a single object entity in the table for MySQL and in the collection for MongoDB.
Column | Field | Every stored item has properties defined by different values and data types. In MongoDB, documents in the same collection may have different fields from each other. In MySQL, every row must be defined with the same columns as the existing ones.
Primary key | Primary key | Every stored object is identified with a unique field value. In MongoDB the _id field is set automatically, whereas in MySQL you can define your own primary key, which is incremented as you create new rows.
Table joins | Embedding and linking documents | Connects data associated with an object in one collection/table to data in another collection/table.
where | $match | Selecting data that matches criteria.
group | $group | Grouping data according to some criteria.
drop | $unset | Removing a column/field from a row/document.
set | $set | Setting the value of an existing column/field to a new value.
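To see how the where/$match and group/$group pairs from the table translate in practice, here is a minimal sketch in the mongo shell. The users collection and its Age and Gender fields are illustrative and simply mirror the examples used later in this cheat sheet.

// Equivalent of: SELECT Gender, COUNT(*) AS total FROM users WHERE Age > 20 GROUP BY Gender
db.users.aggregate([
    { $match: { Age: { $gt: 20 } } },
    { $group: { _id: "$Gender", total: { $sum: 1 } } }
])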

Schema Statements

MySQL Table Statements vs. MongoDB Collection Statements, with Explanations

In MySQL, the database and tables are created explicitly, either through an admin client such as phpMyAdmin or within a script, i.e.:

Creating a Database

CREATE DATABASE database_name

Creating a table

CREATE TABLE users (
    id MEDIUMINT NOT NULL
        AUTO_INCREMENT,
    UserId Varchar(30),
    Age INT,
    Gender char(1),
    Name VarChar(222),
    PRIMARY KEY (id)
)

The database can be created implicitly or explicitly. Implicitly, the database and collection are created during the first document insert, and an _id field is automatically added to that document:

db.users.insert( {
    UserId: "user1",
    Age: 55,
    Name: "Berry Hellington",
    Gender: "F",
 } )

You can also create the collection explicitly by running this command in the mongo shell:

db.createCollection("users")

In MySQL, you have to specify the columns of the table you are creating, as well as set some validation rules, such as the data type and length allowed in a specific column. In MongoDB, you are not required to define the fields each document should hold, nor any validation rules for those fields.

However, in MongoDB you can set validation rules using the JSON schema validator to enforce data integrity and consistency.
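As a minimal sketch (the required fields, types and bounds below are illustrative, not part of the original example), a validator could be attached to the users collection like this:

db.runCommand({
    collMod: "users",
    validator: {
        $jsonSchema: {
            bsonType: "object",
            required: [ "UserId", "Age" ],
            properties: {
                UserId: { bsonType: "string" },
                Age: { bsonType: "number", minimum: 0 }
            }
        }
    },
    validationLevel: "moderate"   // validate inserts and updates to already-valid documents
})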

Dropping a table

DROP TABLE users
db.users.drop()

These are the statements for deleting a table in MySQL and a collection in MongoDB.

Adding a new column called join_date

ALTER TABLE users ADD join_date DATETIME

Removing the join_date column if already defined

ALTER TABLE users DROP COLUMN join_date

Adding a new field called join_date

db.users.updateMany({}, { $set: { join_date: new Date() } })

This will update all documents in the collection to have the join date as the current date.

Removing the join_date field if already defined

db.users.updateMany({}, { $unset: { join_date: "" } })

This will remove the join_date field from all the collection documents.

Altering the structure of the schema by either adding or dropping a column/field.

Since the MongoDB architecture does not strictly enforce document structure, documents may have fields different from each other.

Creating an index with the UserId column ascending and Age descending

CREATE INDEX idx_UserId_asc_Age_desc
ON users(UserId ASC, Age DESC)

Creating an index involving the UserId and Age fields.

db.users.createIndex( { UserId: 1, Age: -1 } )

Indices are generally created to facilitate the querying process.

INSERT INTO users(UserId,
                  Age,
                  Gender)
VALUES ("user1",
        25,
        "M")
db.users.insert( {
       UserId: "bcd001",
       Age: 25,
       Gender: "M",
     Name: "Berry Hellington",
} )

Inserting new records.

DELETE FROM users
WHERE Age = 25
db.users.deleteMany( { Age: 25 } )

Deleting records from the table/collection whose age is equal to 25.

DELETE FROM users
db.users.deleteMany({})

Deleting all records from the table/collection.

SELECT * FROM users
db.users.find()

Returns all records from the users table/collection with all columns/fields.

SELECT id, Age, Gender FROM users
db.users.find(
   { },
   { Age: 1, Gender: 1 }
)

Returns all records from the users table/collection with Age, Gender and primary key columns/fields.

SELECT  Age, Gender FROM users
db.users.find(
   { },
 { Age: 1, Gender: 1,_id: 0}
)

Returns all records from the users table/collection with Age and Gender columns/fields. The primary key is omitted.

SELECT * FROM users WHERE Gender = "M"
db.users.find({ Gender: "M"})

Returns all records from the users table/collection whose Gender value is set to M.

SELECT Gender FROM users WHERE Age = 25
db.users.find({ Age: 25}, { _id: 0, Gender: 1})

Returns all records from the users table/collection with only the Gender value but whose Age value is equal to 25.

SELECT * FROM users WHERE Age = 25 AND Gender = 'F'
db.users.find({ Age: 25, Gender: "F"})

Returns all records from the users table/collection whose Gender value is set to F and Age is 25.

SELECT * FROM users WHERE  Age != 25
db.users.find({ Age:{$ne: 25}})

Returns all records from the users table/collection whose Age value is not equal to 25.

SELECT * FROM users WHERE Age = 25 OR Gender = 'F'
db.users.find({ $or: [ { Age: 25 }, { Gender: "F" } ] })

Returns all records from the users table/collection whose Gender value is set to F or Age is 25.

SELECT * FROM users WHERE Age > 25
db.users.find({ Age:{$gt: 25}})

Returns all records from the users table/collection whose Age value is greater than 25.

SELECT * FROM users WHERE Age <= 25
db.users.find({ Age:{$lte: 25}})

Returns all records from the users table/collection whose Age value is less than or equal to 25.

SELECT Name FROM users WHERE Name like "He%"
db.users.find(
  { Name: /^He/ }
)

Returns all records from the users table/collection whose Name value starts with "He".

SELECT * FROM users WHERE Gender = 'F' ORDER BY id ASC
db.users.find( { Gender: "F" } ).sort( { $natural: 1 } )

Returns all records from the users table/collection whose Gender value is set to F and sorts this result in the ascending order of the id column in case of MySQL and time inserted in the case of MongoDB.

SELECT * FROM users WHERE Gender = 'F' ORDER BY id DESC
db.users.find( { Gender: "F" } ).sort( { $natural: -1 } )

Returns all records from the users table/collection whose Gender value is set to F and sorts this result in the descending order of the id column in case of MySQL and time inserted in the case of MongoDB.

SELECT COUNT(*) FROM users
db.users.count()

or

db.users.find().count()

Counts all records in the users table/collection.

SELECT COUNT(Name) FROM users
db.users.count({Name:{ $exists: true }})

or

db.users.find({Name:{ $exists: true }}).count()

Counts all records in the users table/collection who happen to have a value for the Name property.

SELECT * FROM users LIMIT 1
db.users.findOne()

or

db.users.find().limit(1)

Returns the first record in the users table/collection.

SELECT * FROM users WHERE Gender = 'F' LIMIT 1
db.users.find( { Gender: "F" } ).limit(1)

Returns the first record in the users table/collection that happens to have Gender value equal to F.

SELECT * FROM users LIMIT 5 OFFSET 10
db.users.find().limit(5).skip(10)

Returns five records from the users table/collection after skipping the first ten records.

UPDATE users SET Age = 26 WHERE Age > 25
db.users.updateMany(
  { Age: { $gt: 25 } },
  { $set: { Age: 26 } }
)

This sets the Age of all records in the users table/collection whose Age is greater than 25 to 26.

UPDATE users SET Age = Age + 1
db.users.updateMany(
  {},
  { $inc: { Age: 1 } }
)

This increases the Age of all records in the users table/collection by 1.

UPDATE users SET Age = Age - 1
WHERE id = 1
db.users.updateOne(
  { _id: 1 },
  { $inc: { Age: -1 } }
)

This decrements the Age of the record whose id/_id is 1 by 1.

To manage MySQL and/or MongoDB centrally and from a single point, visit: https://severalnines.com/product/clustercontrol.


Considerations for Administering MongoDB


Below is an excerpt from our whitepaper “MongoDB Management and Automation with ClusterControl” which can be downloaded for free.


Considerations for Administering MongoDB

Built-in Redundancy

A key feature of MongoDB is its built-in redundancy, in the form of replication. If you have two or more data nodes, they can be configured as a replica set, in which all data written to the Primary node is replicated in near real time to the secondary nodes, ensuring multiple copies of the data.

MongoDB Replica Set

In the case of Primary failover, the remaining nodes in the replica set conduct an election and promote the winner to be Primary, a process that typically takes 2-3 seconds, and writes to the replica set can resume. MongoDB also uses a journal for faster, safer writes to the server or replica set, and also employs a “write concern” method through which the level of write redundancy is configured.

To manually deploy a replica set, the high-level steps are as follows:

  1. Allocate a single physical or virtual host for each database node, and install the MongoDB command line client on your desktop. For a redundant replica set configuration, a minimum of three nodes are required, at least two of which will be data nodes. One node in the replica set may be configured as an arbiter: this is a mongod process configured only to make up a quorum by providing a vote in the election of a Primary when required. Data is not replicated to arbiter processes.
  2. Install MongoDB on each node. Some Linux distributions include MongoDB Community Edition, but be aware that these may not include the latest versions. MongoDB Enterprise is available only by download from MongoDB’s website. Similar functionality to MongoDB Enterprise is also available via Percona Server for MongoDB, a drop-in replacement for MongoDB Enterprise or Community Edition.
  3. Configure the individual mongod.conf configuration files for your replica set, using the “replication parameter”. If you will use a key file for security, configure this now also. Note that using key file security also enables role-based authentication, so you will also need to add users and roles to use the servers. Restart the mongod process on each server.
  4. Ensure connectivity between nodes. You must ensure that MongoDB replica set nodes can communicate with each other on port 27017, and also that your client(s) can connect to each of the replica set nodes on the same port.
  5. Using the MongoDB command line client, connect to one of the servers, and run rs.initiate() to initialise your replica set, followed by rs.add() for each additional node. rs.conf() can be used to view the configuration (see the sketch below).
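As a minimal sketch of step 5, assuming three hosts named node1/node2/node3 (illustrative names) whose mongod.conf files all set the same replSetName, the shell session could look like this:

// connect to one of the nodes with the mongo shell, then:
rs.initiate({
    _id: "rs0",                                    // must match replSetName in mongod.conf
    members: [
        { _id: 0, host: "node1.example.com:27017" },
        { _id: 1, host: "node2.example.com:27017" },
        { _id: 2, host: "node3.example.com:27017" }
    ]
})
rs.conf()     // review the resulting configuration
rs.status()   // check that one member becomes PRIMARY and the others SECONDARY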

While these steps are not as complex as deploying and configuring a MongoDB sharded cluster, or sharding a relational database, they can be onerous and prone to error, especially in larger environments.

Scalability

MongoDB is frequently referred to as “web scale” database software, due to its capacity for scaling horizontally. Like relational databases, it is possible to scale MongoDB vertically, simply by upgrading the physical host on which it resides with more CPU cores, more RAM, faster disks, or even increased bus speed. Vertical scaling has its limits however, both in terms of cost-benefit ratio and diminishing returns, and of technical limitation. To address this, MongoDB has an “auto-sharding” feature, which allows databases to be split across many hosts (or replica sets, for redundancy). While sharding is also possible on relational platforms, unless it was designed for at database inception, it requires major schema and application redesign, as well as client application redesign, making this a tedious, time-consuming, and error-prone process.

MongoDB sharding works by introducing a router process, through which clients connect to the sharded cluster, and configuration servers, which store the cluster metadata: the location in the cluster of each document. When a client submits a query to the router process, it first refers to the config servers to obtain the locations of the documents, and then obtains the query results directly from the individual servers or replica sets (shards). Sharding is carried out on a per-collection basis.

A critically important parameter here, for performance purposes, is the “shard key”, an indexed field or compound field that exists in each document in a collection. It is this that defines the write distribution across shards of a collection. As such, a poorly-chosen shard key can have a very detrimental effect on performance. For example, a purely time-series based shard key may result in all writes going to a single node for extended periods of time. However, a hashed shard key, while evenly distributing writes across shards, may impact read performance as a result set is retrieved from many nodes.
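For orientation, a sketch of what this looks like from a mongos router follows; the database, collection and field names here are illustrative, not taken from the text above.

sh.enableSharding("appdb")
// hashed shard key: spreads writes evenly across shards, at the cost of range queries
sh.shardCollection("appdb.events", { deviceId: "hashed" })
// ranged compound shard key: keeps related documents together, good for targeted queries
sh.shardCollection("appdb.orders", { customerId: 1, orderDate: 1 })
sh.status()   // shows the shards, the sharded collections and their chunk distribution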

MongoDB Sharded Cluster

Arbiters

A MongoDB arbiter is a mongod process that has been configured not to act as a data node, but to provide only the function of voting when a replica set Primary is to be elected, to break ties and guard against a split vote. An arbiter may not become Primary, as it does not hold a copy of the data or accept writes. While it is possible to have more than one arbiter in a replica set, it is generally not recommended.
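Adding an arbiter to an existing replica set is a one-liner from the mongo shell; the hostname below is illustrative.

rs.addArb("arbiter1.example.com:27017")
rs.status()   // the new member should report stateStr: "ARBITER"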

MongoDB elections and the arbiter process

Delayed Replica Set Members

Delayed replica set members add an additional level of redundancy, maintaining a state that is a fixed number of seconds behind the Primary. As delayed members are a “rolling backup” or a running “historical” snapshot of the data set, they can help to recover from various types of human error.

Delayed members are “hidden” replica set members, invisible to client applications, and so cannot be queried directly. They also may not become Primary during normal operations, and must be reconfigured manually in the case that they are to be used to recover from error.
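A sketch of configuring an existing member as a delayed, hidden secondary follows; the member index and the one-hour delay are illustrative values.

cfg = rs.conf()
cfg.members[2].priority = 0        // a delayed member may never become Primary
cfg.members[2].hidden = true       // invisible to client applications
cfg.members[2].slaveDelay = 3600   // stay one hour behind the Primary (secondaryDelaySecs in MongoDB 5.0+)
rs.reconfig(cfg)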

MongoDB Delayed Secondary node

Backups

Backing up a replica set or sharded cluster is carried out via the “mongodump” command line utility. When used with the --oplog parameter, this creates a dump of the database that includes an oplog, to create a point-in-time snapshot of the state of a mongod instance. Using mongorestore with the --oplogReplay parameter, you can then fully restore the data state at the time the backup completed, avoiding inconsistency.

For more advanced backup requirements, a third party tool called “mongodb-consistent-backup” - also command line based - is available that provides fully consistent backups of sharded clusters, a complex procedure, given that sharded databases are distributed across multiple hosts.

Monitoring

There are a number of commercial tools, both official and unofficial, available on the market for monitoring MongoDB. These tools, in general, are single product management utilities, focusing on MongoDB exclusively. Many focus only on certain specific aspects, such as collection management in an existing MongoDB architecture, or on backups, or on deployment. Without proper planning, this can lead to a situation where a proliferation of additional tools must be deployed and managed in your environment.

The command line tools provided with MongoDB, “mongotop” and “mongostat”, can provide a detailed view of your environment's performance and can be used to diagnose issues. In addition, MongoDB's “mongo” command line client can also run “rs.status()” - or in a sharded cluster “sh.status()” - to view the status of replica sets or clusters and their member hosts. The “db.stats()” command returns a document that addresses storage use and data volumes, and there are equivalents for collections, as well as other calls to access many internal metrics.
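A few of these calls, runnable from the mongo shell, are sketched below; the users collection is just an illustrative target for the per-collection statistics.

rs.status()                      // replica set members, their states and optimes
sh.status()                      // sharded cluster overview, run against a mongos
db.stats()                       // storage use and data volumes for the current database
db.users.stats()                 // the per-collection equivalent
db.serverStatus().connections    // one of many internal metric documents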

Synopsis

This has been a brief synopsis of considerations for administering MongoDB. Even at such a high level though, it should immediately be obvious that while it is possible to administer a replica set or sharded cluster from the command line using available tools, this does not scale in an environment with many replica sets or with a large production sharded cluster. In medium to large environments comprising many hosts and databases, it quickly becomes unfeasible to manage everything with command line tools and scripts. While internal tools and scripts can be developed to deploy and maintain the environment, this adds the burden of managing new development, revision control systems, and processes. A simple upgrade of a database server may become a complex process if tooling changes are required to support new database server versions.

But without internal tools and scripts, how do we automate and manage MongoDB clusters? Download the whitepaper to learn how!

Real Time Data Streaming with MongoDB Change Streams


Recently, MongoDB released a new feature starting from version 3.6, Change Streams. This gives you instantaneous access to your data changes, helping you stay up to date with them. In today's world, everyone wants instant notifications rather than getting them hours or minutes later. For some applications, it's critical to push real time notifications to all subscribed users for each and every update. MongoDB made this process really easy by introducing this feature. In this article, we will learn about MongoDB change streams and their applications with some examples.

Defining Change Streams

Change streams are nothing but a real time stream of any changes that occur in a database, a collection, or even an entire deployment. For example, whenever any operation (insert, update or delete) occurs in a specific collection, MongoDB triggers a change event with all the data which has been modified.

You can define change streams on any collection, just like any other normal aggregation operation, using the $changeStream operator and the watch() method. You can also define a change stream using the MongoCollection.watch() method.

Example

db.myCollection.watch()

Change Streams Features

  • Filtering Changes

    You can filter the changes to get event notifications for some targeted data only.

    Example:

    pipeline = [
       {
         $match: { "fullDocument.name": "Bob" }
       } ];
    changeStream = collection.watch(pipeline);

    This code makes sure that you get updates only for documents whose name equals "Bob". In this way, you can write any pipeline to filter the change stream.

  • Resuming Change Streams

    This feature ensures that there is no data loss in case of failures. Each response in the stream contains a resume token which can be used to restart the stream from a specific point. For transient network failures, the MongoDB driver will try to re-establish the connection with the subscribers using the most recent resume token. In the case of a complete application failure, however, the resume token should be persisted by the client so that the stream can be resumed later (a sketch of this appears after this list).

  • Ordered Change Streams

    MongoDB uses a global logical clock to order all the change stream events across all the replicas and shards of any cluster, so the receiver will always receive the notifications in the same order the commands were applied to the database.

  • Events with full documents

    MongoDB returns only part of the matching documents by default. But you can modify the change stream configuration to receive the full document. To do so, pass { fullDocument: "updateLookup" } to the watch method.
    Example:

    collection = db.collection("myColl")
    changeStream = collection.watch([], { fullDocument: "updateLookup" })
  • Durability

    Change streams will only notify for data that has been committed to a majority of the replica set members. This makes sure that events are generated from majority-committed data, ensuring message durability.

  • Security/Access Control

    Change streams are very secure. Users can create change streams only on the collections on which they have read permissions. You can create change streams based on user roles.
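To illustrate the resume behaviour described above, here is a minimal sketch in the style of the Node.js example later in this article; persisting the token somewhere durable is left out, and the variable names and the handleEvent callback are made up.

let lastResumeToken;
const changeStream = collection.watch(pipeline);
changeStream.on("change", event => {
    lastResumeToken = event._id;   // persist this token somewhere durable
    handleEvent(event);            // hypothetical application callback
});

// after a restart, continue from where the previous stream stopped
const resumedStream = collection.watch(pipeline, { resumeAfter: lastResumeToken });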


Example of Change Streams

In this example, we will create a change stream on the Stocks collection to get notified when any stock price goes above a certain threshold.

  • Setup the cluster

    To use change streams, we have to create a replica set first. Run the following command to create a single node replica set, then connect to it with the mongo shell and run rs.initiate() to initialise it.

    mongod --dbpath ./data --replSet "rs0"
  • Insert some records in the Stocks collection

    var docs = [
     { ticker: "AAPL", price: 210 },
     { ticker: "AAPL", price: 260 },
     { ticker: "AAPL", price: 245 },
     { ticker: "AAPL", price: 255 },
     { ticker: "AAPL", price: 270 }
    ];
    db.Stocks.insert(docs)
  • Setup node environment and install dependencies

    mkdir mongo-proj && cd mongo-proj
    npm init -y
    npm install mongodb --save
  • Subscribe for the changes

    Create one index.js file and put the following code in it.

    const mongo = require("mongodb").MongoClient;
    mongo.connect("mongodb://localhost:27017/?replicaSet=rs0").then(client => {
     console.log("Connected to MongoDB server");
     // Select DB and Collection
     const db = client.db("mydb");
     const collection = db.collection("Stocks");
     pipeline = [
       {
         $match: { "fullDocument.price": { $gte: 250 } }
       }
     ];
     // Define change stream
     const changeStream = collection.watch(pipeline);
     // start listen to changes
     changeStream.on("change", function(event) {
       console.log(JSON.stringify(event));
     });
    });

    Now run this file:

    node index.js
  • Insert a new record in db to receive an update

    db.Stocks.insert({ ticker: "AAPL", price: 280 })

    Now check your console, you will receive an update from MongoDB.
    Example response:

    {
      "_id": {
        "_data": "825C5D51F70000000129295A1004E83608EE8F1B4FBABDCEE73D5BF31FC946645F696400645C5D51F73ACA83479B48DE6E0004"
      },
      "operationType": "insert",
      "clusterTime": "6655565945622233089",
      "fullDocument": {
        "_id": "5c5d51f73aca83479b48de6e",
        "ticker": "AAPL",
        "price": 280
      },
      "ns": { "db": "mydb", "coll": "Stocks" },
      "documentKey": { "_id": "5c5d51f73aca83479b48de6e" }
    }

Here the operationType field indicates the kind of change; you can match on it in your pipeline to listen for different types of changes in a collection. Its possible values are:

  • insert
  • replace (except changes to the unique _id)
  • update
  • delete
  • invalidate (when the cursor becomes invalid, for example when the collection is dropped or renamed)

Other Modes of Changes Streams

You can open change streams against a whole database or an entire deployment in the same way as against a collection. This capability was added in MongoDB version 4.0. Here are the commands to open a change stream against a database and against a deployment:

Against DB: db.watch()
Against deployment: Mongo.watch()

Conclusion

MongoDB Change Streams simplify integration between frontend and backend in a realtime, seamless manner. This feature can help you use MongoDB for a pub/sub model, so you don't need to manage Kafka or RabbitMQ deployments anymore. If your application requires real time information, you should check out this feature of MongoDB. I hope this post gets you started with MongoDB change streams.

MongoDB vs MySQL NoSQL - Why Mongo is Better


There are many database management systems (DBMS) to choose from, ranging from relational to non-relational. In past years, relational DBMS were more dominant, but with recent data structure trends non-relational DBMS are becoming more popular. The obvious choices for a relational DBMS are MySQL, PostgreSQL and MS SQL. On the other hand, MongoDB, a non-relational DBMS, has risen in popularity mainly due to its ability to handle large sets of data. Every selection has its pros and cons, and your choice will mainly be determined by your application's needs, since the two serve different niches. However, in this article we are going to discuss the pros of using MongoDB over MySQL.

Pros of Using MongoDB Over MySQL

  1. Speed and performance
  2. High Availability and Cloud Computing
  3. Schema Flexibility
  4. Need to grow bigger
  5. Embedding feature
  6. Security Model
  7. Location-based data
  8. Rich query language support

Speed and Performance

This is one of the major benefits of using MongoDB over MySQL, especially when a large set of unstructured data is involved. MongoDB by default favours a high insert rate over transaction safety. If you need to save a lot of data at once, MySQL obliges you to insert it row by row, whereas MongoDB's insertMany() function lets you safely perform multiple inserts in one call. Observing some of the querying behaviours of the two, we can summarize the different operation requests for 1 million documents in the illustration below.
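A minimal sketch of such a bulk insert is shown below; the documents themselves are made up for illustration.

db.users.insertMany([
    { UserId: "user2", Age: 30, Gender: "F", Name: "Alice Doe" },
    { UserId: "user3", Age: 41, Gender: "M", Name: "John Roe" },
    { UserId: "user4", Age: 28, Gender: "F", Name: "Jane Poe" }
])   // one round trip instead of three separate inserts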

In the case of updating which is a write operation, MongoDB takes 0.002 seconds to update all student emails whereas MySQL takes 0.2491s to execute the same task.

From the illustration, we can conclude that MongoDB takes far less time than MySQL for the same operations. MongoDB is structured so that documents are the basis of storage, which promotes huge query and data storage capability. This implies that performance depends on two key values: design and scale-out. On the other hand, MySQL stores data in individual tables, so at some point one has to look up the entire table before doing a write operation.

High Availability and Cloud Computing

For unstable environments, MongoDB provides a better handling technique than MySQL: it takes very little time for the active secondary nodes to elect a new primary node, making administration at the point of failure easy. Besides, thanks to comprehensive secondary indexes and native replication, creating a backup of a MongoDB database is quite easy compared to MySQL.

In a nutshell, setting up a set of servers that can act as master and slaves is easier and faster in MongoDB than in MySQL. Besides, recovery from a cluster failure is instant, automatic and safe. For MySQL, there is no clear official solution for providing failover between master and slave in the event of a failure.

Cloud-based storage solutions require data to be spread smoothly across various servers to scale up. MongoDB can load a high volume of data compared to MySQL, and with built-in sharding it is easy to partition and spread data across multiple servers as a way of utilizing the cost-saving merits of cloud-based storage.

Schema Flexibility

MongoDB is schemaless, so different documents in the same collection may have the same or different fields from each other. This means there is no restriction on document structure for every insert or update, so changes to the data model won't have much impact. Of course, there are scenarios that can lead one to use an undefined schema, for example if you are de-normalizing a database schema or when your database is growing but your schema is unstable. MongoDB therefore allows you to add various types of data as your needs change.

On the other hand, MySQL is table oriented, whereby each row must have the same columns as the other rows. Adding a new column requires running an ALTER operation, which is quite expensive in terms of performance as it has to lock up the entire table. This is especially the case when the table grows beyond 10GB; MongoDB does not have this issue.

With a flexible schema, it is easy to develop and maintain cleaner code. Besides, MongoDB provides the option of using a JSON validator in case you want to ensure some data integrity and consistency for your collection, so you can do some validation before inserting or updating a document.

The Need to Grow Bigger

Scaling databases is not an easy undertaking; with MySQL it may result in degraded performance once the 5-10GB per table size is surpassed. With MongoDB this is not an issue, since you can partition and shard the database with the built-in sharding feature. Once a shard key is specified and sharding is enabled, data is partitioned evenly according to the shard key, and if a new shard is added, there is automatic rebalancing. Sharding basically allows horizontal scaling, which is difficult to implement in MySQL. Besides, MongoDB has built-in replication, whereby replica sets create multiple copies of the data, and each member of this set has a role as either primary or secondary at any point in the process.

Reads and writes are done on the primary and then replicated to the secondaries. With this merit in place, in case of data inconsistency or instance failure, a new member may be voted in to act as primary.

Embedding Feature

Unlike MySQL, where you cannot embed data in a field, MongoDB offers a better embedding technique for related data. As much as you can do a JOIN for tables in MySQL, you may end up having so many tables, some of them unnecessary, especially if they don't involve many fields. In the case of MongoDB you can decide to embed data into a field for related data, or reference it from another collection if you expect the document to grow beyond the BSON document size limit in the future.

For example, if we have user data for which we want to capture addresses and some other information, in the case of MongoDB we can simply have a structure like:

{
    id:1,
    name:'George Bush',
    gender: 'Male',
    age:45,
    address:{
        City: 'New York',
        Street: 'Florida',
        Zip_code: 1342243
    }
}

But in the case of MySQL we would have to make two tables, with an id reference between them, i.e.:

Users details table

id | name | gender | age
1 | George Bush | Male | 45

User address table

id | City | Street | Zip_code
1 | New York | Florida | 1342243

In MySQL you end up with many tables, which can be hectic to deal with, especially when scaling is involved. As much as one can do a table join in a single query when fetching this data in MySQL, the latency is considerably higher compared to MongoDB, and this is one of the reasons the performance of MongoDB outdoes that of MySQL.


Security Model

Database administration (DBA) work is quite essential in MySQL but less necessary in the case of MongoDB. This means you need a DBA to modify a schema in MySQL when an application changes. In MongoDB, on the other hand, schema modification can be done without a DBA, since it is great for class persistence and a class can simply be serialized to JSON and stored. However, this works best if you don't expect the data to grow big; otherwise you will need to follow some best practices to avoid pitfalls.

Location Based Data

In order to improve throughput, especially for read operations, MongoDB provides built-in special functions that find relevant data from specific locations accurately, hence speeding up the process. This is not possible in the case of MySQL.
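As a sketch of what this looks like in practice (the places collection and its coordinates are purely illustrative), a 2dsphere index plus a $near query finds documents close to a given point:

db.places.createIndex({ location: "2dsphere" })
db.places.insert({
    name: "Cafe Example",
    location: { type: "Point", coordinates: [ -73.97, 40.77 ] }   // [longitude, latitude]
})
// find places within 500 metres of a given point
db.places.find({
    location: {
        $near: {
            $geometry: { type: "Point", coordinates: [ -73.98, 40.76 ] },
            $maxDistance: 500
        }
    }
})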

Rich Query Language Support

Out of personal interest as a MongoDB enthusiast, I was attracted by the flexibility of MongoDB's query language. With the aggregation framework in the later versions and the MapReduce feature, you can optimize result data to suit your own specifications. As much as MySQL also offers operations such as grouping, sorting and many more, MongoDB is quite extensive, especially with embedded data structures. Further, as mentioned earlier, queries are returned with lower latency in the aggregation framework than when a JOIN has to be done in MySQL. For instance, MongoDB offers an easy way of modifying a schema using the $set and $unset operations for embedded documents, whereas in MySQL one has to run an ALTER command on the table in which the field exists, which is quite expensive in terms of performance.

Conclusion

Regarding the merits discussed above, and as much as database selection absolutely depends on application design, MongoDB offers a lot of flexibility along different lines. If you are looking for better performance, are dealing with complex data and hence want no restrictions on schema design, expect the database to grow, and want a rich query language, I would recommend going for MongoDB.

Best Practices for Running MongoDB in a Cluster


Deploying a clustered database is one thing, but maintaining your DBMS while in a cluster so that it consistently serves your applications can be a large undertaking. You should check the status of the database often, especially the most crucial metrics, in order to get a clue of what to upgrade or alter as a way of preventing any bottlenecks before they emerge.

There are a lot of considerations one should take into account with MongoDB; especially since its installation and running are quite easy, the chances of neglecting basic database management practices are high.

Many times, developers fail to take into account future growth and increased usage of the database, which consequently results in the application crashing or in data with integrity and consistency issues.

In this article we are going to discuss some of the best practices one should employ in a MongoDB cluster for efficient performance of your applications. Some of the factors one should consider include...

  1. Upgrading to latest version
  2. Appropriate storage engine
  3. Hardware resources allocation
  4. Replication and sharding
  5. Never change server configuration file
  6. Good Security Strategy

Upgrading to Latest Version

I have worked with MongoDB since versions before 3.2 and, to be honest, things were not easy at that time. With great developments, fixed bugs and newly introduced features, I advise you to always upgrade your database to the latest version. For instance, the introduction of the aggregation framework had a better performance impact than relying on the already existing Map-Reduce concept. With the latest version 4.0, one now has the capability to use the multi-document transactions feature, which generally improves throughput. Version 4.0 also adds some new type conversion operators such as $toInt, $toString, $trim and $toBool. These operators will greatly help in the validation of data and hence create some sense of data consistency. When upgrading, please refer to the docs so that you avoid making slight mistakes that may escalate into errors.
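A small sketch of these conversion operators in an aggregation pipeline follows; the orders collection and its fields are made up for illustration.

db.orders.aggregate([
    {
        $project: {
            price: { $toInt: "$price" },              // "25" (string) becomes 25 (int)
            reference: { $toString: "$_id" },
            label: { $trim: { input: "$label" } },    // strips surrounding whitespace
            active: { $toBool: "$active" }
        }
    }
])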

Choose an Appropriate Storage Engine

MongoDB currently supports three storage engines: WiredTiger, In-Memory and MMAPv1. Each of these storage engines has merits and limitations compared to the others, but your choice will depend on your application specification and the core functionality of the engine. Personally I prefer the WiredTiger storage engine, and I would recommend it for anyone who is not sure which one to use. The WiredTiger storage engine is well suited for most workloads and provides a document-level concurrency model, checkpointing and compression.

Some of the considerations regarding the selection of a storage engine depend on these aspects:

  1. Transactions and atomicity: data provided during an insert or update is committed only when all conditions and stages in the application have been executed successfully. Operations are therefore bundled together in an immutable unit. With this in place, multi-document transactions can be supported, as seen in the latest version of MongoDB for the WiredTiger storage engine.
  2. Locking type: this is a control strategy on access to or update of information. During the lock duration, no other operation can change the data of the selected object until the current operation has been executed. Consequently, queries are affected at this time, so it is important to monitor them and reduce the bulk of the locking mechanism by ensuring you select the most appropriate storage engine for your data.
  3. Indexing: storage engines in MongoDB provide different indexing strategies depending on the data types you are storing. The efficiency of that data structure should fit well with your workload, and one can determine this by considering every extra index as having some performance overhead. Write-optimized data structures have lower overhead for every index in a high-insert application environment than non-write-optimized data structures. This becomes a major setback where a large number of indexes is involved and an inappropriate storage engine has been selected. Therefore, choosing an appropriate storage engine can have a dramatic impact.

Hardware Resources Allocation

As new users sign into your application, the database grows with time and new shards will be introduced. However, you cannot rely on the hardware resources you established during the deployment stage. There will be a corresponding increase in workload, and hence more processing resources such as CPU and RAM will be required to support your large data clusters. This is often referred to as capacity planning in MongoDB. The best practices around capacity planning include:

  • Monitor your database constantly and adjust in accordance with expectations. As mentioned before, an increase in the number of users will trigger more queries and hence an increased workload, especially if you employ indexes. You may start experiencing these impacts on the application end when it starts recording a change in the percentage of writes versus reads over time. You will therefore need to re-configure your hardware in order to address this issue. Use mongoperf and the MMS tool to detect changes in system performance parameters.
  • Document all your performance requirements upfront. When you encounter the same problem later, you will at least have a point of reference, which will save you some time. Your recording should cover the size of data you want to store, analysis of queries in terms of latency, and how much data you would like to access at a given time. In a production environment you need to determine how many requests you are going to handle per second and, lastly, how much latency you are going to tolerate.
  • Stage a proof of concept. Perform a schema/index design, comprehend the query patterns and then refine your estimate of the working set size. Record this configuration as a point of reference for testing with successive revisions of the application.
  • Do your tests with a real workload. After the proof of concept stage, deploy only after carrying out substantial testing with real world data and performance requirements.

Replication and Sharding

These are the two major concepts of ensuring High Availability of data and increased horizontal scalability respectively in MongoDB cluster.

Sharding basically partitions data across servers into small portions known as shards. Balancing of data across shards is automatic, and shards can be added or removed without necessarily taking the database offline.

Replication, on the other hand, maintains multiple redundant copies of the data for high availability. It is an in-built feature in MongoDB and works across wide area networks without the need for specialized networks. For a cluster setup, I recommend you have at least two mongos instances, three config servers and one shard, and ensure connectivity between the machines involved in the sharded cluster. Use a DNS name rather than IPs in the configuration.

For production environments use a replica set with at least 3 members and remember to populate more configuration variables like oplog size.

When starting your mongod instances for your members use the same keyfile.

Some of the considerations for your shard key include:

  • Key and value are immutable
  • Always consider using indexes in a sharded collection
  • Update driver command should contain a shard key
  • Unique constraints to be maintained by the shard key.
  • A shard key cannot contain special index types and must not exceed 512 bytes.

Never Change Server Configuration File

After doing your first deployment, it is advisable not to change a lot of parameters in the configuration file, otherwise you may land in trouble, especially with shards. The weakest link with sharding is the config servers; this is to say, all the related mongod instances have to be running in order for sharding to work.

Good Security Strategy

MongoDB has been vulnerable to external attacks in past years, hence it is an important undertaking for your database to have some security protocols in place. Besides running the processes on different ports, you should employ at least one of the five different ways of securing MongoDB databases. You can consider platforms such as MongoDB Atlas, which secure databases by default through encryption of the data both in-transit and at-rest. You can also use strategies like TLS/SSL for all incoming and outgoing connections.

Conclusion

Controlling a MongoDB cluster is not an easy task and involves a lot of workarounds. Databases grow as a result of more users, hence an increased workload. One therefore has a mandate to ensure the performance of the DBMS is in line with this increased number of users. The best practices go beyond increasing hardware resources and include applying some MongoDB concepts such as sharding, replication and indexing. However, many of the inconveniences that may arise are well addressed by upgrading your MongoDB version. More often than not, the latest versions have bugs fixed and new feature requests integrated, with almost no negative impact from upgrading, even across major revision numbers.

Understanding MongoDB Indexes


Among the tasks involved in database management is improving performance by employing different strategies. Indexing is one of the tips that improve throughput operations by facilitating data access to query requests. It does so by minimizing the number of disk access required when a query is processed. Failure to use indexes in MongoDB will force the database to perform a full collection scan, that is, scan through all the documents in the collection in order to select documents that match an issued query statement. Obviously, this will take a lot of time especially if there are so many documents involved. In a nutshell, indexes support efficient execution of queries.

MongoDB Indexes

Since we expect to store many documents in a MongoDB collection, we need to find a way to store a small portion of data for each document in a different partition for easy traversing by use of indexes. An index will store a specific field value or fields and then sort this data in order of the value of that field. With this ordering, efficient query matching and range-based query operations are supported. Indexes are defined at the collection level and they are supported by any field or embedded field of the documents in the collection.

When you create a document, MongoDB by default assigns an _id field if not specified and makes this a unique index for that document. Basically, this is to prevent inserting the same document more than once into that collection. In addition, for a sharded cluster, it is advisable to use this _id field as part of the shard key selection; otherwise there must be some uniqueness of data in the _id field in order to avoid errors.

Creating an Index for a Collection

Assuming you have inserted some data in your collection and you want to assign a field to be an index, you can use the createIndex method to achieve this, i.e.

Let’s say you have this json data:

{
    _id: 1,
    Name: "Sepp Maier",
    Country: "Germany"
}

We can make the Name field a descending index by:

db.collection.createIndex({Name: -1})

This method creates the index only if an index with the same specification does not already exist.

Types of Indexes in MongoDB

MongoDB involves different types of data hence different types of indexes are derived to support these data types and queries.

  1. Single Field

    Using a single field of a document one can make the field an index in an ascending or descending manner just like the example above. Besides, you can create an index on an embedded document as a whole, for example:

    {
        _id: "xyz",
        Contact: {
            email: "example@gmail.com",
            phone: "+420 78342823" },
        Name: "Sergio"
    }

    Contact field is an embedded document hence we can make it an ascending index with the command:

    db.collection.createIndex({ Contact: 1})

    In a query we can fetch the document like:

    db.collection.find({
        Contact: { email: "example@gmail.com",
        phone: "+420 78342823" }
    })

    A best practice is to create the index in the background, especially when a large amount of data is involved, since the application still needs to access the data while the index is being built.

  2. Compound Index

    Compound indexes are often used to facilitate the sort operation within a query and support queries that match on multiple fields. The syntax for creating a compound index is:

    db.collection.createIndex( { <field0>: <type>, <field1>: <type1>, ... } )

    Creating a compound index for the sample data below

    { 
        _id: “1”,
        Name: “Tom”,
        Age: 24,
        Score:”80”
    }
    db.collection.createIndex({ Age: 1, Score:-1})

    Considerations:

    • A compound index can include at most 32 fields.
    • The value given for each field defines the direction of the index, i.e. 1 is ascending and -1 is descending.
    • Don't create compound indexes that include a hashed index type.
    • The order of the fields listed in a compound index is important. The sorting will be done in accordance with the order of the fields.
  3. Multikey Index

    At some point, you may have fields that store array content. When these fields are indexed, separate index entries for every element are created. This helps a query to select documents that contain arrays by matching on an element or elements of those arrays. This is done automatically by MongoDB, so there is no need to explicitly specify the multikey type. From version 3.4, MongoDB tracks which indexed fields cause an index to be a multikey index. With this tracking, the database query engine is allowed to use tighter index bounds.

    Limitations of Multikey Index

    • Only one array field can be used in a compound multikey index. For example, given the document below, you cannot create the compound multikey index { nums: 1, scores: 1 } because both nums and scores are arrays:
      { _id: 1, nums: [ 1, 2 ], scores: [ 30, 60 ]}
    • If the compound multikey index already exists, you cannot insert a document that violates this restriction. Documents such as the following are fine, because at most one of the indexed fields is an array:
      { _id: 1, nums: 1, scores: [ 30, 60 ]}
      { _id: 2, nums: [ 1, 2 ], scores: 30}
      After creating the compound multikey index { nums: 1, scores: 1 }, an attempt to insert a document where both the nums and scores fields are arrays will fail.
  4. Text Indexes

    Text indexes are often used to improve search queries for a string in a collection. They do not store language-specific stop words (e.g. "the", "a", "or"). A collection can have at most one text index. To create a text index:

    db.collection.createIndex({ Name: "text" })

    You can also index multiple fields, i.e.:

    db.collection.createIndex({
        Name: "text",
        place: "text"
    })

    A compound index can include a text index key in combination with the ascending/descending index key but:

    • All text index keys must appear adjacently in the index specification document when creating a compound text index.
    • No other special index types, such as multikey index fields, can be part of a compound text index.
    • To perform a $text search, the query predicate must include equality match conditions on the preceding keys.
  5. Hashed Indexes

    Sharding is one of the techniques used in MongoDB to improve horizontal scaling, and it often relies on a hash-based concept through hashed indexes. These indexes give a more random distribution of values along their range, but they only support equality matches and cannot support range-based queries (see the sketch below).
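A short sketch tying the last two index types together is shown below; the articles collection and its fields are illustrative.

// text index and a $text search against it
db.articles.createIndex({ title: "text", body: "text" })
db.articles.find({ $text: { $search: "mongodb indexes" } })

// hashed index: useful as a hashed shard key, supports equality matches but not ranges
db.users.createIndex({ UserId: "hashed" })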

Overall Operational Considerations for Indexes

  • Each index requires at least 8kB of data space.
  • When active, each index will consume some disk space and memory. This is significant when tracked in capacity planning.
  • For a high read-to-write ratio collection, additional indexes improve performance and do not affect un-indexed read operations.

Limitations of Using Indexes

  • Adding an index has some negative performance impact for write operations especially for collections with the high write-to-read ratio. Indexes will be expensive in that each insert must also update any index.
  • MongoDB will not create, update an index or insert into an indexed collection if the index entry for an existing document exceeds the index key limit.
  • For existing sharded collections, chunk migration will fail if the chunk has a document that contains an indexed field that has an index entry that exceeds the index key limit.

Conclusion

There are many ways of improving MongoDB performance, indexing being one of them. Indexing facilitates query operations by reducing the latency with which data is retrieved, since it minimizes the number of documents that need to be scanned. However, there are some considerations one needs to make before deciding to use a specific type of index. Collections with a high read-to-write ratio tend to benefit from indexes more than collections with high write-to-read ratios.

MongoDB Schema Planning Tips


One of the most advertised features of MongoDB is its ability to be “schemaless”. This means that MongoDB does not impose any schema on the documents stored inside a collection. Normally, MongoDB stores documents in a JSON-like format, so each document can have a different structure. This is beneficial in the initial stages of development, but in the later stages you may want to enforce some schema validation while inserting new documents, for better performance and scalability. In short, “schemaless” doesn't mean you don't need to design your schema. In this article, I will discuss some general tips for planning your MongoDB schema.

Figuring out the best schema design which suits your application may become tedious sometimes. Here are some points which you can consider while designing your schema.

Avoid Growing Documents

If your schema allows documents to grow in size continuously, then you should take steps to avoid this because it can lead to degradation of DB and disk IO performance. MongoDB allows a maximum of 16MB per document. If your document size grows towards 16 MB over a period of time, it is a sign of bad schema design and can sometimes lead to failing queries. You can use document buckets or document pre-allocation techniques to avoid this situation. In case your application needs to store documents larger than 16 MB, you can consider using the MongoDB GridFS API.

Avoid Updating Whole Documents

If you update a whole document, MongoDB will rewrite the whole document elsewhere in memory. This can drastically degrade the write performance of your database. Instead of updating the whole document, you can use field modifiers to update only specific fields in the documents. This will trigger an in-place update in memory, hence improved performance.
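A minimal sketch of a field-modifier update follows; the collection, filter and field names are illustrative.

// touch only the fields that changed instead of replacing the whole document
db.users.updateOne(
    { _id: 1 },
    { $set: { "address.city": "Berlin" }, $inc: { loginCount: 1 } }
)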

Try to Avoid Application-Level Joins

As we all know, MongoDB doesn’t support server level joins. Therefore, we have to get all the data from DB and then perform join at the application level. If you are retrieving data from multiple collections and joining a large amount of data, you have to call DB several times to get all the necessary data. This will obviously require more time as it involves the network. As a solution for this scenario, if your application heavily relies on joins then denormalizing schema makes more sense. You can use embedded documents to get all the required data in a single query call.

Use Proper Indexing

When searching or aggregating, one often sorts data. Even if you apply the sort in the last stage of a pipeline, you still need an index to cover the sort. If no index on the sorting field is available, MongoDB is forced to sort without an index. There is a memory limit of 32MB on the total size of all documents involved in the sort operation; if MongoDB hits that limit, it may either produce an error or return an empty set.
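A sketch of covering a sort with an index is shown below, assuming an illustrative reports collection with a createdAt field.

db.reports.createIndex({ createdAt: -1 })
db.reports.find({}).sort({ createdAt: -1 }).limit(10)
// explain() should show the index being used rather than an in-memory SORT stage
db.reports.find({}).sort({ createdAt: -1 }).limit(10).explain("executionStats")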

Having discussed adding indexes, it is also important not to add unnecessary indexes. For each index you add to the database, you have to update all of these indexes while updating documents in the collection, which can degrade database performance. Also, each index occupies some space and memory as well, so a large number of indexes can lead to storage-related problems.

One more way to optimize the use of indexes is overriding the default _id field. The only purpose of this field is keeping one unique field per document. If your data contains a timestamp or another unique id field, then you can override the _id field and save one extra index.


Read v/s Write Ratio

Designing a schema for any application hugely depends on whether the application is read-heavy or write-heavy. For example, if you are building a dashboard to display time series data, then you should design your schema in a way that maximizes the write throughput. If your application is E-commerce based, then most of the operations will be read operations, as most users will be going through all the products and browsing various catalogs. In such cases, you should use a denormalized schema to reduce the number of calls to the DB for getting relevant data.

BSON Data Types

Make sure that you define BSON data types for all fields properly while designing your schema, because when you change the data type of any field, MongoDB will rewrite the whole document in a new memory space. For example, if you try to store (int)0 in place of a (float)0.0 field, MongoDB rewrites the whole document at a new address due to the change in BSON data type.
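In the mongo shell, the BSON type can be pinned explicitly; here is a small sketch with an illustrative metrics collection.

db.metrics.insert({
    count: NumberInt(0),             // 32-bit integer
    total: NumberLong(0),            // 64-bit integer
    ratio: 0.0,                      // double, the shell default for numbers
    amount: NumberDecimal("0.00")    // decimal128
})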

Conclusion

In a nutshell, it is wise to design schema for your Mongo Database as it will only improve the performance of your application. Starting from version 3.2, MongoDB started supporting document validation where you can define which fields are required to insert a new document. From version 3.6, MongoDB introduced a more elegant way of enforcing schema validation using JSON Schema Validation. Using this validation method, you can enforce data type checking along with required field checking. You can use the above approaches to check whether all documents are using the same type of schema or not.

How to Deploy Open Source Databases - New Whitepaper


We’re happy to announce that our new whitepaper How to Deploy Open Source Databases is now available to download for free!

Choosing which DB engine to use among all the options we have today is not an easy task. And that is just the beginning. After deciding which engine to use, you need to learn about it and actually deploy it to play with it. We plan to help you with that second step, and show you how to install, configure and secure some of the most popular open source DB engines.

In this whitepaper we are going to explore the top open source databases and how to deploy each technology using proven methodologies that are battle-tested.

Topics included in this whitepaper are …

  • An Overview of Popular Open Source Databases
    • Percona
    • MariaDB
    • Oracle MySQL
    • MongoDB
    • PostgreSQL
  • How to Deploy Open Source Databases
    • Percona Server for MySQL
    • Oracle MySQL Community Server
      • Group Replication
    • MariaDB
      • MariaDB Cluster Configuration
    • Percona XtraDB Cluster
    • NDB Cluster
    • MongoDB
    • Percona Server for MongoDB
    • PostgreSQL
  • How to Deploy Open Source Databases by Using ClusterControl
    • Deploy
    • Scaling
    • Load Balancing
    • Management   

Download the whitepaper today!


About ClusterControl

ClusterControl is the all-inclusive open source database management system for users with mixed environments that removes the need for multiple management tools. ClusterControl provides advanced deployment, management, monitoring, and scaling functionality to get your MySQL, MongoDB, and PostgreSQL databases up and running using proven methodologies that you can depend on to work. At the core of ClusterControl is its automation functionality that lets you automate many of the database tasks you have to perform regularly, like deploying new databases, adding and scaling new nodes, running backups and upgrades, and more.

To learn more about ClusterControl click here.

About Severalnines

Severalnines provides automation and management software for database clusters. We help companies deploy their databases in any environment, and manage all operational aspects to achieve high-scale availability.

Severalnines' products are used by developers and administrators of all skill levels to provide the full 'deploy, manage, monitor, scale' database cycle, thus freeing them from the complexity and learning curves that are typically associated with highly available database clusters. Severalnines is often called the "anti-startup" as it is entirely self-funded by its founders. The company has enabled over 32,000 deployments to date via its popular product ClusterControl, and currently counts BT, Orange, Cisco, CNRS, Technicolor, AVG, Ping Identity and Paytrail as customers. Severalnines is a private company headquartered in Stockholm, Sweden with offices in Singapore, Japan and the United States. To see who is using Severalnines today, visit https://www.severalnines.com/company.


Benchmarking MongoDB - Driving NoSQL Performance


Database systems are crucial components in the life cycle of any successfully running application. Every organization that uses them therefore has the mandate to ensure smooth performance of these DBMSs through consistent monitoring and handling of minor setbacks before they escalate into enormous complications that may result in application downtime or slow performance.

You may ask: how can you tell if the database is going to have an issue while it is still working normally? Well, that is what we are going to discuss in this article, and we term it benchmarking. Benchmarking is basically running a set of queries against some test data with a given resource provision to determine whether these parameters meet the expected performance level.

MongoDB does not prescribe a standard benchmarking methodology, so we need to resort to testing queries on our own hardware. As much as you may get impressive figures from the benchmark process, you need to be cautious, as the picture may be different when running your database with real queries.

The idea behind benchmarking is to get a general idea of how different configuration options affect performance, how you can tweak some of these configurations to get maximum performance, and to estimate the cost of improving the implementation. Besides, applications grow with time in terms of users and, probably, the amount of data that is to be served, hence the need to do some capacity planning ahead of time. After realizing a rising trend in data, you need to do some benchmarking on how you will meet the requirements of this fast-growing data.

Considerations in Benchmarking MongoDB

  1. Select workloads that are a typical representation of today's modern applications. Modern applications are becoming more complex every day, and this is transmitted down to the data structures. That is to say, data representation has also changed with time, for example from storing simple fields to objects and arrays. It is not easy to work with this data using default or rather sub-standard database configurations, as it may escalate into issues like poor latency and poor throughput for operations involving the complex data. When running a benchmark you should therefore use data which is a clear representation of your application.
  2. Double check on writes. Always ensure that all data writes were done in a manner that allowed no data loss. This improves data integrity by ensuring the data is consistent, and is most applicable in the production environment.
  3. Employ data volumes that are representative of "big data" datasets, which will certainly exceed the RAM capacity of an individual node. When the test workload is large, it will help you predict the future behavior of your database and hence start capacity planning early enough.

Methodology

Our benchmark test will involve some big location data which can be downloaded from here, and we will be using the Robo 3T software to manipulate our data and collect the information we need. The file has more than 500 documents, which is quite enough for our test. We are using MongoDB version 4.0 on an Ubuntu Linux 12.04 Intel Xeon-SandyBridge E3-1270-Quadcore 3.4GHz dedicated server with 32GB RAM, a Western Digital WD Caviar RE4 1TB spinning disk and a Smart XceedIOPS 256GB SSD. We inserted the first 500 documents.

We ran the insert commands below

db.getCollection('location').insertMany([<document1>, <document2>…<document500>],{w:0})
db.getCollection('location').insertMany([<document1>, <document2>…<document500>],{w:1})

Write Concern

Write concern describes the level of acknowledgment requested from MongoDB for write operations, in this case against a standalone MongoDB. For a high-throughput operation, if this value is set low then the write calls return quickly, thus reducing the latency of the request. On the other hand, if the value is set high, then the write calls are slower and consequently increase the query latency. A simple explanation is that when the value is low you are accepting the possibility of losing some writes in the event of a mongod crash, network error or other system failure, and the limitation is that you won't be sure whether these writes were successful. On the other hand, if the write concern is high, errors are reported and the writes are acknowledged. An acknowledgment is simply a receipt that the server accepted the write for processing.
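
The write concern can also be set per operation rather than only on an insertMany batch. A hedged sketch against the same location collection (the document content here is made up):

// Fire-and-forget: no acknowledgment is requested from the server
db.getCollection('location').insertOne({ city: "Oslo" }, { writeConcern: { w: 0 } })

// Acknowledged write: the server confirms it accepted the write before returning
db.getCollection('location').insertOne({ city: "Oslo" }, { writeConcern: { w: 1 } })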

When the write concern is set high
When the write concern is set low

In our test, with the write concern set low, the query executed in a minimum of 0.013ms and a maximum of 0.017ms. In this case, the basic acknowledgment of writes is disabled, but one can still get information regarding socket exceptions and any network errors that may have been triggered.

When the write concern is set high, it takes almost double the time to return, with the execution time being 0.027ms min and 0.031ms max. The acknowledgment in this case is guaranteed, but it does not guarantee that the write has reached the disk journal; there is still a window (up to the 100ms journal commit interval) during which the write could be lost if the server crashes before the journal is flushed to disk.

Journaling

This is a technique for ensuring no data loss by providing durability in the event of failure. It is achieved through write-ahead logging to on-disk journal files. It is most effective when the write concern is set high.

For a spinning disk, the execution time with journaling enabled is a bit higher; in our test it was about 0.251ms for the same operation above.

The execution time for an SSD, however, is a bit lower for the same command. In our test it was about 0.207ms, but depending on the nature of the data, an SSD can sometimes be 3 times faster than a spinning disk.

When journaling is enabled, writes are confirmed to have been made to the journal, hence ensuring data durability. Consequently, the write operation will survive a mongod shutdown.

For a high-throughput operation, you can halve query times by setting w=0. Otherwise, if you need to be sure that data has been recorded, or rather will be recoverable after a failure, then you need to set w=1.


Replication

Acknowledgment of a write concern can be required from more than one node, that is the primary and some secondaries within a replica set. This is controlled by the integer value given to the write parameter. For example, if w = 3, mongod must ensure that the query receives an acknowledgment from the primary node and 2 secondaries. If you try to set a value greater than one on a node that is not part of a replica set, it will throw an error because the host must be replicated.

Replication comes with a latency penalty, in that the execution time increases. For the simple query above, if w=3 the average execution time increases to 270ms. Driving factors for this are the spread in response times between nodes caused by network latency, the communication overhead between the 3 nodes and congestion. Besides, all three nodes wait for each other to finish before returning the result. In a production deployment, you will therefore not want to involve too many nodes if you want to improve performance. MongoDB is responsible for selecting which nodes are to be acknowledged unless there is a specification in the configuration file using tags.
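
A hedged example of requesting acknowledgment from three replica set members, with a timeout so the call does not block indefinitely if a secondary is lagging (the document content here is made up):

// Wait for the primary and two secondaries, but give up after 5 seconds
db.getCollection('location').insertOne(
    { city: "Oslo" },
    { writeConcern: { w: 3, wtimeout: 5000 } }
)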

Spinning Disk vs Solid State Disk

As mentioned above, an SSD is quite a bit faster than a spinning disk, depending on the data involved. Sometimes it can be 3 times faster, hence worth paying for if need be. However, it will be more expensive to use an SSD, especially when dealing with vast data. MongoDB has the merit of supporting storage of databases in directories which can be mounted, hence the chance to use an SSD. Employing an SSD and enabling journaling is a great optimization.

Conclusion

The experiment confirmed that a disabled write concern results in reduced query execution time at the expense of a risk of data loss. On the other hand, when the write concern is enabled, the execution time is almost 2 times longer than when it is disabled, but there is assurance that data won't be lost. Besides, we were able to confirm that an SSD is faster than a spinning disk. However, to ensure data durability in the event of a system failure, it is advisable to enable the write concern. When enabling the write concern for a replica set, don't set the number too large, as it may result in degraded performance from the application end.

An Overview of WiredTiger Storage Engine for MongoDB


Every database system has a structured component which is responsible for maintaining how data is stored and served both in memory and on disk. This is often referred to as a storage engine. More often, when evaluating the architecture of operational databases, developers take into account first-hand factors such as data modeling, reduced latency, improved throughput, data consistency, ease of scalability and fault tolerance. In spite of that, one needs detailed, advanced knowledge of the underlying storage engine to tune it so that it delivers on the highlighted factors efficiently.

A simple cycle of an application to db system is illustrated below...

Example of common application architecture

WiredTiger Storage Engine

MongoDB supports mainly 3 storage engines, whose performance differs according to the specific workload. The storage engines are:

  1. WiredTiger Storage Engine
  2. In-Memory Storage Engine
  3. MMAPv1 Storage Engine

The WiredTiger storage engine has both configurations of a B-Tree Based Engine and a Log Structured Merge Tree Based Engine.

B-Tree Based Engine

This is one of the oldest storage engine designs, from which other, more sophisticated setups are derived. It is a self-balancing tree data structure that keeps data sorted and enables searches, sequential access, insertions and deletions in logarithmic time. It is row-based storage, such that each row is considered a single record in the database.

Merits of a B-Tree Storage Engine

  • High throughput and low latency reads. B-Trees have a tendency to grow wide and shallow, so very few nodes are traversed.
  • Keeps keys in sorted order for sequential traversing and indexes are balanced with a recursive algorithm.
  • The interior storage nodes are always kept at least half full which in general reduces wastage.
  • Easy to handle a large number of insertions and deletions within a short duration.
  • Hierarchical indexing is employed with the aim of reducing disk reads.
  • Speeds up insertions and deletions through usage of partially full blocks.

Limitations of a B-Tree Storage Engine

  • Poor write performance due to the need to maintain a well-ordered data structure with random writes. Random writes are more expensive than sequential writes to storage.
  • Read-modify-write penalty of an entire block even for a minor update to a row in the block.

Log Structured Merge Tree Based Engine

Because of the poor write performance of the B-Tree based engine, developers had to come up with a way for DBMSs to cope with larger datasets. The Log Structured Merge Tree Based Engine (LSM Tree) was hence established to improve performance for indexed access to files with high write volume over an extended period. In this case, random writes at the first stage of cascading memory are turned into sequential writes at the disk-based first component.

Merits of a LSM Tree Storage Engine

  • Ability to do fast sequential writes enhances quick handling of large fast-growing data.
  • Well suited for tiered storage hence giving organizations a better selection in terms of cost and performance. Flash-based SSDs provide great performance in this case.
  • Better compression and storage efficiency, hence saving on storage space and allowing the storage to be used nearly to capacity.
  • Data is always available for query immediately.
  • Insertions are very fast.

Limitations of an LSM Tree Storage Engine

LSM trees consume more memory than B-Trees during read operations, due to read and space amplification. However, some approaches such as bloom filters have mitigated this effect in practice by reducing the number of files to be checked during a point query.

The WiredTiger technology was designed to employ the advantages of both B-Tree and LSM structures, making it a sophisticated storage engine and the best fit for MongoDB. It is actually MongoDB's default storage engine.


WiredTiger Storage Engine Architecture

As mentioned above, it involves the concepts of the two basic storage engines, the B-Tree and LSM Tree engines, and it is a multiversion concurrency control (MVCC) storage engine. The merits of the two combined enable the system to see a snapshot of the database at the time it accesses a collection. Checkpoints are established such that a consistent view of data is recorded to disk between checkpoints. In case of a crash between checkpoints, it is easy to recover from these checkpoints, or, even if there are no checkpoints for the data, it can be recovered from the disk journal files.

WiredTiger makes extensive use of cache rather than disk to keep latency low. The storage engine relies heavily on the OS page cache, through which compressed data can be fetched without involving the disk. Besides, the least recently used data is cleared from RAM, preserving more space for the cache.

The B-Tree storage concept offers highly efficient reads and good write performance with low CPU utilization. It also has a document-level locking implementation that enables highly concurrent workloads, and this concurrency consequently lets the server take advantage of many-core CPUs. In general, all these enhance the scalability of the database.

The enterprise edition supports on-disk encryption for the WiredTiger storage engine which is a feature that greatly improves data security.

WiredTiger storage engine enables a write-ahead logging which ensures an automatic crash recovery and makes writes durable.
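
Most of these options are exposed through the storage section of the mongod configuration file. A minimal sketch; the path and the cache size are only illustrative and should be tuned to your own workload:

# mongod.conf (excerpt)
storage:
  dbPath: /var/lib/mongodb
  journal:
    enabled: true
  engine: wiredTiger
  wiredTiger:
    engineConfig:
      cacheSizeGB: 4
      journalCompressor: snappy
    collectionConfig:
      blockCompressor: snappy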

Advantages of the WiredTiger Storage Engine

  • Efficient storage due to multiple compression technologies such as snappy, zlib and prefix compression.
  • It is highly scalable with concurrent reads and writes. This in the end improves on the throughput and general database performance.
  • Assure data durability with write-ahead log and usage of checkpoints.
  • Optimal memory usage. The WiredTiger uses both the internal cache and file system cache.
  • With the filesystem cache, MongoDB can easily use the free memory that is not used by the WiredTiger cache.

WiredTiger Storage Engine Setbacks

Difficulties in updating data: the concurrency scheme prevents in-place updates, so updating a field value in a document rewrites the entire document.

Conclusion

The WiredTiger storage engine integrates concepts from two major storage engine designs, the B-Tree and the LSM tree, to achieve optimal performance. Weighing the advantages of both and collectively employing them makes WiredTiger a general-purpose storage engine, and for this reason it is the default in current versions of MongoDB. This implies that if you don't have a strong reason to avoid it, then it is the best choice for your data. The storage engine choice, however, still relies heavily on your data use case, or rather on the cases where WiredTiger cannot meet your expectations.

Deployment and Maintenance of MongoDB Using Ansible


Technology changes day by day, and modern applications need to make serious adjustments in order to fulfill the fast delivery expectations of their organizations. Unfortunately, this makes them more complex, more sophisticated, and harder to maintain.

In terms of database management, the data structures inside MongoDB change in accordance with application needs over time, and changing them can be quite expensive (or rather risky).

In the long run, we need an efficient database that is configured easily and ensures competent software delivery. Achieving all this manually comes with a number of setbacks, such as:

  1. Complicated coordination among team members.
  2. High chances of task repetition
  3. Susceptible to a lot of human mistakes and errors
  4. Difficulty in overcoming complexity
  5. Reduced collaboration and job dissatisfaction
  6. Time-consuming
  7. Poor accountability and compliance

The difficulties of database administration are mainly centered on:

  • Deploying
  • Maintaining
  • Upgrading

Automating these tasks can reduce the operational overhead by up to 95%.

Achieving this can take a lot of time and manual effort in MongoDB. To ensure success you will need a simple system with which all the setbacks listed above can be addressed from a single platform in a timely manner, that is to say, a somewhat automated system. There are quite a number of options, but in this article we are going to discuss utilizing Ansible.

What is Ansible

Ansible is simply a universal language that unravels the mystery of how work is done. In other words, it is an IT orchestration engine that automates application deployment and configuration management, and orchestrates more advanced IT tasks such as zero-downtime rolling updates and continuous deployments.

Machines can easily be managed in an agent-less manner with a greater focus on security and reliability through the use of a language designed around “auditability” by humans.

While deploying MongoDB may not be that difficult, maintenance, backup, and monitoring become increasing concerns as time goes by. In addition, it is not that easy when you are new to database management. With Ansible, developers can deploy and configure applications with ease; it also allows for swift delivery to any hosting platform.

Since Ansible is not part of the database cluster system, it can be installed on any remote computer and a configuration made against your database host. Please check the installation guide to know which version is suitable for your operating system.

Ansible, by default, connects to a database host through an SSH protocol.

Ansible Playbooks

Playbooks are the templates where Ansible code is written; they direct Ansible on what to execute, in a to-do-list manner. They are written in YAML ("YAML Ain't Markup Language") format. Each contains step-by-step operations, run sequentially, that are to be carried out on particular machines. Their structure consists of one or more Plays. A Play is basically a code block that maps a set of instructions to a particular host.

Commonly Used YAML Tags in Ansible

  1. name

    This is the tag that defines the name of the Ansible playbook. It is advisable to set a name that precisely defines what it will be doing.

  2. hosts

    This defines a host group or list of hosts against which the defined tasks are to be run. It is a mandatory tag which tells Ansible on which hosts to run the listed tasks. Since tasks can be performed on multiple machines, either the same or remote machines, one can define a group of hosts in this tag.

  3. vars

    Like any other programming language, you will need variables. With this tag, you can define variables that you will be using in your playbook.

  4. tasks

    This tag will enable you to list the set of tasks to be executed. Tasks are the actions one needs to perform. The name field of a task essentially provides help text for the user when debugging the playbook. Each task is linked internally to a piece of code called a module, and any arguments that are to be used by the module are passed through the tasks tag.

A simple playbook structure looks something like this...

---
- name: install and configure DB
  hosts: testServer
  become: yes

  vars:
    mongoDB_Port: 27017

  tasks:
  - name: Install the mongodb
    yum: <code to install the DB>

  - name: Ensure the installed service is enabled and running
    service:
      name: <your service name>
      state: started
      enabled: yes

Writing a Simple Playbook to Install and Start MongoDB

  1. Enabling Root SSH Access

    Some setups of managed nodes may prevent you from logging in as the root user, hence the need to define a playbook to resolve this. We will create a playbook enable-root-access.yml that looks like this:

    ---
    - hosts: ansible-test
      remote_user: ubuntu
      tasks:
        - name: Enable root login
          shell: sudo cp ~/.ssh/authorized_keys /root/.ssh/

    When you run the command

    $ ansible-playbook -i inventory.txt -c ssh enable-root-access.yml

    You should see something like

    PLAY [ansible-test] ***********************************************************
    GATHERING FACTS ***************************************************************
    TASK: [Enable root login] *****************************************************
    PLAY RECAP ********************************************************************
  2. Selecting hosts and users in mongodbInstall.yaml

    ---
    - hosts: ansible-test
      remote_user: root
      become: yes
  3. Adding tasks to be executed

    Tasks are executed sequentially, so we need to outline them in a sequential manner i.e.

    1. apt_key to add repository keys. The MongoDB public GPG key needs to be imported first
      - name: Import the public key used by the package management system
          apt_key: keyserver=hkp://keyserver.ubuntu.com:80 id=7F0CEB10 state=present
    2. Adding MongoDB apt_repository
      - name: Add MongoDB repository
        apt_repository: repo='deb http://downloads-distro.mongodb.org/repo/ubuntu-upstart dist 10gen' state=present
    3. Installing packages and starting mongod, then reload the local package database
      - name: install mongodb
        apt: pkg=mongodb-org state=latest update_cache=yes
        notify:
        - start mongodb
    4. Managing services, using handler to start and restart services
      handlers:
        - name: start mongodb
          service: name=mongod state=started

The general playbook code should look like this

---
- hosts: ansible-test
  remote_user: root
  become: yes
  tasks:
  - name: Import the public key used by the package management system
    apt_key: keyserver=hkp://keyserver.ubuntu.com:80 id=7F0CEB10 state=present
  - name: Add MongoDB repository
    apt_repository: repo='deb http://downloads-distro.mongodb.org/repo/ubuntu-upstart dist 10gen' state=present
  - name: install mongodb
    apt: pkg=mongodb-org state=latest update_cache=yes
    notify:
    - start mongodb
  handlers:
    - name: start mongodb
      service: name=mongod state=started

We can then run this file with ansible using the command

ansible-playbook -i inventory.txt -c ssh mongodbInstall.yaml
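
The inventory.txt file passed with -i simply lists the managed hosts. Given the two hosts that appear in the output below, it might look like this:

[ansible-test]
12.20.3.105
12.20.3.106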

If the playbook has been successfully executed you should see this in your terminal

PLAY [ansible-test] ***********************************************************

GATHERING FACTS ***************************************************************
ok: [12.20.3.105]
ok: [12.20.3.106]

TASK: [Import the public key used by the package management system] ***********
changed: [12.20.3.105]
changed: [12.20.3.106]

TASK: [Add MongoDB repository] ************************************************
changed: [12.20.3.105]
changed: [12.20.3.106]

TASK: [install mongodb] *******************************************************
changed: [12.20.3.105]
changed: [12.20.3.106]

NOTIFIED: [start mongodb] *****************************************************
ok: [12.20.3.106]
ok: [12.20.3.105]

PLAY RECAP ********************************************************************
12.20.3.105                : ok=5    changed=3    unreachable=0    failed=0
12.20.3.106                : ok=5    changed=3    unreachable=0    failed=0

If you now run mongo, you will be dropped into the mongo shell:

MongoDB shell version v4.0.3
connecting to: mongodb://127.0.0.1:27017
Implicit session: session { "id" : UUID("07c88442-0352-4b23-8938-fdf6ac66f253") }
MongoDB server version: 4.0.3
Welcome to the MongoDB shell.
For interactive help, type "help".
For more comprehensive documentation, see
    http://docs.mongodb.org/
Questions? Try the support group
    http://groups.google.com/group/mongodb-user

Conclusion

Ansible is a simple open source IT engine that automates application deployment, service orchestration, and cloud provisioning.

It works by connecting to the database nodes and pushing out defined instructions known as modules, executing them over SSH by default and then removing them when finished. It doesn't run any daemons or servers, hence it can be run from any remote machine. In the next tutorial, we are going to discuss how to maintain a MongoDB replica set in the cloud using Ansible.

Why You Should Still Be Using the MMAPv1 Storage Engine for MongoDB


While this storage engine was deprecated as of MongoDB version 4.0, there are some important features in it. MMAPv1 is the original storage engine in MongoDB and is based on memory-mapped files. Only the 64-bit Intel architecture (x86_64) supports this storage engine.

MMAPv1 drives excellent performance at workloads with...

  • Large updates
  • High volume reads
  • High volume inserts
  • High utilization of system memory

MMAPv1 Architecture

MMAPv1 is a B-tree based system which delegates many functions, such as storage interaction and memory management, to the operating system.

It was MongoDB's default storage engine for versions earlier than 3.2, until the introduction of the WiredTiger storage engine. Its name comes from the fact that it uses memory-mapped files to access data: it directly loads and modifies file contents, which are mapped into virtual memory via the mmap() syscall.
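
On versions where MMAPv1 still ships (up to 4.0), it has to be selected explicitly at startup, since WiredTiger is the default. A sketch, assuming a hypothetical data path:

mongod --storageEngine mmapv1 --dbpath /data/db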

All records are contiguously located on disk, and if a document becomes larger than its allocated record size, MongoDB allocates a new record. For MMAPv1 this is advantageous for sequential data access but at the same time a limitation, as it comes with a time cost since all the document's index entries need to be updated, and it may result in storage fragmentation.

The basic architecture of the MMAPv1 storage engine is shown below.

As mentioned above, if a document size surpasses the allocated record size, it results in a reallocation, which is not a good thing. To avoid this, the MMAPv1 engine uses a Power of 2 Sized Allocation so that every document is stored in a record that contains the document itself plus some extra space known as padding. The padding allows for document growth resulting from updates and reduces the chances of reallocations. Otherwise, if reallocations occur, you may end up with storage fragmentation. Padding trades additional space for efficiency, hence reducing fragmentation. For workloads with high volumes of inserts, updates or deletes, the power of 2 allocation is preferred, whereas exact-fit allocation is ideal for collections that do not involve any update or delete workloads.

Power of 2 Sized Allocation

For smooth document growth, this strategy is employed in the MMAPv1 storage engine. Every record has a size in bytes that is a power of 2 (i.e. 32, 64, 128, 256, 512 ... up to 2MB), with 2MB being the default upper limit; any document that surpasses this has its allocation rounded up to the nearest multiple of 2MB. For example, a 200KB document will be placed in a 256KB record, and the 56KB of extra space is available for any additional growth. This enables documents to grow in place instead of triggering the reallocation the system would otherwise need to make when documents reach the limit of their available space.
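
From the mongo shell you can check how large a given document actually is, which helps reason about which record size it will land in (the collection here is hypothetical):

// Returns the document's BSON size in bytes; e.g. a 96-byte document fits a 128-byte record
Object.bsonsize(db.users.findOne())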

Merits of Power 2 Sized Allocations

  1. Reuse of freed records to reduce fragmentation: with this concept, records are quantized to fixed sizes, so a new document can fit into the allocated space freed by an earlier document deletion or relocation.
  2. Reduced document moves: as mentioned before, by default inserts and updates that make a document larger than its allocated record result in the document being moved and its index entries being updated. However, when there is enough space for growth within the record, the document will not be moved, hence fewer index updates.

Memory Usage

In the MMAPv1 storage engine, all free memory on the machine is used as cache. Optimal performance is achieved when the working set fits into memory. Besides, every 60 seconds MMAPv1 flushes data changes to disk; this interval can be changed so that flushing is done more frequently. Because all free memory is used as cache, don't be shocked that system resource monitoring tools indicate that MongoDB uses a lot of memory, since this usage is dynamic.

Merits of MMAPv1 Storage Engine

  1. Reduced on-disk fragmentation when using the pre-allocation strategy.
  2. Very efficient reads when the working set has been configured to fit into memory.
  3. In-place updates, i.e. individual field updates do not require the whole document to be rewritten, hence improving updates of large documents when there are few concurrent writers.
  4. With a low number of concurrent writers, write performance can be improved further by flushing data to disk more frequently.
  5. Collection-level locking for write operations. The locking scheme is one of the most important factors in database performance; with collection-level locking only one client can write to a given collection at a time, but writes to different collections can proceed concurrently, so operations flow more quickly than if the storage engine serialized them at the database level.

Limitations of the MMAPv1 Storage Engine

  1. High space utilization. MMAPv1 lacks a compression strategy for the file system and hence over-allocates record space.
  2. Collection access is restricted for multiple clients when a write operation is in progress. MMAPv1 uses a collection-level locking strategy, which means two or more clients cannot write to the same collection at the same time, and a write blocks all reads on that collection. This leads to coarse concurrency that makes it hard to scale the MMAPv1 engine.
  3. A system crash may potentially result in data loss if the journaling option is not enabled. However, even if it is enabled, the window of potential loss is small, and journaling will at least save you from a large data loss scenario.
  4. Inefficient storage utilization. When using the pre-allocation strategy, some documents occupy more space on disk than the data itself requires.
  5. If the working set size exceeds the allocated memory, performance drops to a large extent. Besides, significant document growth after initial storage may trigger additional I/O and hence cause performance issues.

Comparing MMAPv1 and WiredTiger Storage Engines

Key feature by feature, the two engines compare as follows:

  • CPU performance. MMAPv1: adding more CPU cores unfortunately does not boost performance. WiredTiger: performance improves with multicore systems.
  • Encryption. MMAPv1: due to the memory-mapped files being used, it does not support any encryption. WiredTiger: encryption for data in transit and at rest is available in MongoDB Enterprise and Beta installations.
  • Scalability. MMAPv1: concurrent writes constrained by collection-level locking make it impossible to scale out. WiredTiger: good scale-out potential since the lowest locking level is the document itself.
  • Tuning. MMAPv1: very little chance of tuning this storage engine. WiredTiger: plenty of tuning can be done around variables such as cache size, checkpoint intervals and read/write tickets.
  • Data compression. MMAPv1: no data compression, hence more space may be used. WiredTiger: snappy and zlib compression methods are available, hence documents may occupy less space than in MMAPv1.
  • Atomic transactions. MMAPv1: only applicable to a single document. WiredTiger: as of version 4.0, atomic transactions on multiple documents are supported.
  • Memory. MMAPv1: all free memory on the machine is used as its cache. WiredTiger: both the filesystem cache and an internal cache are utilized.
  • Updates. MMAPv1: supports in-place updates, hence excels at workloads with heavy-volume inserts, reads and in-place updates. WiredTiger: does not support in-place updates; the whole document has to be rewritten.

Conclusion

When it comes to storage engine selection for a database, many people don't know which one to choose. The choice normally depends on the workload it will be subjected to. On a general gauge, MMAPv1 would be a poor choice, and that's why MongoDB made so many advancements to the WiredTiger option. However, MMAPv1 may still outdo other storage engines in some use cases, for example where you only need read workloads, or need to store many separate collections with large documents in which 1 or 2 fields are frequently updated.

Maintaining MongoDB Replica Sets in the Cloud Using Ansible


Replication has been widely applied in database systems for ensuring high availability of data through redundancy. It is basically a strategy of keeping a copy of the same data on different running servers, possibly on different machines, so that in case of failure of the main server, another one can take over and continue serving.

A replica set is a group of MongoDB instances that maintain the same set of data. Replica sets are the basis of production deployments. Replication is advantageous in that data is always available from a different server in case the main server fails. Besides, it improves read throughput by allowing clients to send read requests to different servers and get the response from the nearest one.

A replica set consists of several data-bearing nodes, which could be hosted on different machines, and optionally an arbiter node. One of these data-bearing nodes is labeled the primary while the others are secondary nodes. The primary node receives all write operations and then replicates the data to the other nodes after the write operation has been completed and the change recorded in the oplog.

An arbiter is an additional instance that does not maintain a data set but provides a quorum in a replica set by responding to heartbeat and election requests from other replica set members. It thus reduces the cost of maintaining a replica set compared to a fully functional replica set member with a data set.

Automatic Failover

A primary node may fail due to reasons such as a power outage or network disconnection, and thereby become unable to communicate with the other members. If the communication is cut off for more than the configured electionTimeoutMillis period, one of the secondaries calls for an election to nominate itself as the new primary. If the election completes successfully, the cluster continues with normal operation. During this period, no write operations can be carried out. However, read queries can be configured to continue as normal on the secondaries while the primary is offline.

For an optimal replication process, the median time before the cluster elects a new primary should not exceed 12 seconds with default replication configuration settings. This may be affected by factors such as network latency, which may extend the time, hence one should consider the cluster's architecture to ensure this time does not grow too high.

The value of electionTimeoutMillis can be lowered from the default 10000 (10 seconds) so that a failed primary is detected faster. However, this may trigger elections frequently, even for minor factors such as temporary network latency and even though the primary node is healthy, which leads to issues such as rollbacks of write operations.
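
The setting lives in the replica set configuration and can be changed from the mongo shell. A hedged sketch, lowering the value only as an illustration:

// Reduce the election timeout from the default 10000 ms to 5000 ms
cfg = rs.conf()
cfg.settings.electionTimeoutMillis = 5000
rs.reconfig(cfg)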

Ansible for Replica Sets

As mentioned, a replica set may have members on different host machines, making it more complex to maintain the cluster. We need a single platform from which this replica set can be maintained with ease. Ansible is one of the tools that provide a better overview for configuring and managing a replica set. If you are new to Ansible, please have a quick recap from this article to understand the basics, such as creating a playbook.

Configuration Parameters

  • arbiter_at_index: this defines the position of the arbiter in the replica set members list. Remember, an arbiter does not hold data like the other members and cannot be used as the primary node; it is only there to create a quorum during an election. For example, if you have an even number of members, it is good to add an arbiter so that, if the votes are equal, it adds one vote to produce a winning member. The value assigned should be an integer.
  • chaining_allowed: this takes a boolean value and defines whether secondary members may replicate from other secondary members (chaining_allowed = true) or only from the primary (chaining_allowed = false). The default value is true.
  • election_timeout_secs: by default the value is 10000 (takes an integer value). It is the time in milliseconds for detecting that the primary node is unreachable, or rather not communicating with the other members, and hence for triggering an election. A value of around 12 seconds is a reasonable starting point. If set too high, it will take longer to detect a primary failure and hence longer to hold an election; since writes are blocked during this period, you may lose a lot of data. On the other hand, if it is set too low, elections will be triggered frequently even when the situation is not serious and the primary is still reachable; as a result you will have many rollbacks of write operations, which may at some point lead to poor data integrity or inconsistency.
  • heartbeat_timeout_secs: replica set members need to communicate with each other before an election by sending a signal referred to as a heartbeat, and the members then need to respond to this signal within a specific period, which by default is set to 10 seconds. heartbeat_timeout_secs is the number of seconds the replica set members wait for a successful heartbeat from each other; if a member does not respond, it is marked as inaccessible. Note that this is applicable only for protocol version 0. The value is an integer.
  • login_host: this is the host that houses the login database. By default for MongoDB it is localhost.
  • login_database: the default is admin, and it is where the login credentials are stored. (Takes a string value.)
  • login_user: the username with which the authentication should be done. (Takes a string value.)
  • login_password: the password to authenticate the user with. (Takes a string value.)
  • login_port: the MongoDB port of the host to log in to. (Takes an integer value.)
  • members: defines a list of replica set members. It is a string separated by comma or a yaml list i.e. mongodb0:27017,mongodb2:27018,mongodb3:27019… If there is no port number, then the 27017 is assumed.
  • protocol_version: takes an integer that defines the version of the replication process. Either 0 or 1
  • replica_set: this is a string value that defines the name of the replica set.
  • ssl: boolean value that defines whether to use SSL connection when connecting to the database or not.
  • ssl_certs_reqs: this specifies whether a certificate is required from the other side of the connection, and whether it will be validated if provided. The choices are CERT_NONE, CERT_OPTIONAL and CERT_REQUIRED. The default is CERT_REQUIRED.
  • validate: takes a boolean value that defines whether to do any basic validation on the provided replica set config. The default value is true.

Creating a MongoDB Replica Set Using Ansible

Here is a simple example of tasks for setting up a replica set in ansible. Let’s call this file tasks.yaml

# Create a replicaset called 'replica0' with the 3 provided members
- name: Ensure replicaset replica0 exists
  mongodb_replicaset:
    login_host: localhost
    login_user: admin
    login_password: root
    replica_set: replica0
    arbiter_at_index: 2
    election_timeout_secs: 12000
    members:
    - mongodb1:27017
    - mongodb2:27018
    - mongodb3:27019
  when: groups.mongod.index(inventory_hostname) == 0

# Create two single-node replicasets on the localhost for testing
- name: Ensure replicaset replica0 exists
  mongodb_replicaset:
    login_host: localhost
    login_port: 3001
    login_user: admin
    login_password: root
    login_database: admin
    replica_set: replica0
    members: localhost:3001
    validate: yes

- name: Ensure replicaset replica1 exists
  mongodb_replicaset:
    login_host: localhost
    login_port: 3002
    login_user: admin
    login_password: secret
    login_database: admin
    replica_set: replica1
    members: localhost:3002
    validate: yes

In our playbook we can call the tasks like this:

---
- hosts: ansible-test
  remote_user: root
  become: yes
  tasks:
  - include: tasks.yaml

If you run this playbook, ansible-playbook -i inventory.txt -c ssh mongodbCreateReplicaSet.yaml, you will get a response indicating whether the replica set has been created or not. If the key mongodb_replicaset is returned with a value of success, along with a description of the replica set that has been created, then you are good to go.

Conclusion

In MongoDB it is generally tedious to configure a replica set whose mongod instances may be hosted on different machines. However, Ansible provides a simple platform for doing the same by just defining a few parameters as discussed above. Replication is one of the processes that ensure continuous application operation, hence it should be well configured by setting up multiple members in the production world. An arbiter is used to create a quorum during the election process, hence it should be included in the configuration file by defining its position.

An Overview of MongoDB Schema Validation


Everyone knows that MongoDB is schemaless, so why is it necessary to perform schema validation? It is easy and fast to develop an application against MongoDB's schema-less behavior and use it as a proof of concept. But once the application moves to production and becomes stable and mature, there is no need to change the schema frequently, and it is not advisable either. At this point, it is very important to enforce some schema validation in your database to avoid unwanted data being inserted that could break your application. This becomes much more important when data is being inserted from multiple sources into the same database.

Schema validation allows you to define the specific structure of documents in each collection. If anyone tries to insert some documents which don't match with the defined schema, MongoDB can reject this kind of operation or give warnings according to the type of validation action.

MongoDB provides two ways to validate your schema: document validation and JSON Schema validation. JSON Schema validation is the extended version of document validation, so let's start with document validation.

Document Validation

Most developers who have worked with relational databases know the importance of the predictability of data models or schemas. Therefore, MongoDB introduced document validation in version 3.2. Let's see how to add validation rules to MongoDB collections.

Suppose, you have a collection of users which have the following types of documents.

{
    "name": "Alex",
    "email": "alex@gmail.com",
    "mobile": "123-456-7890"
} 

And, following are the validations which we want to check while adding new documents in users collection:

  • name, email fields are mandatory
  • mobile numbers should follow specific structure: xxx-xxx-xxxx

To add this validation, we can use the “validator” construct while creating a new collection. Run the following query in Mongo shell,

db.createCollection("users", {
  validator: {
        $and: [
            {
                "name": {$type: "string", $exists: true}
            },
            {
                "mobile": {$type: "string", $regex: /^[0-9]{3}-[0-9]{3}-[0-9]{4}$/}
            },
            {
                "email": {$type: "string", $exists: true}
            }
        ]
    }
})

You should see the following output:

{ "ok" : 1 }

Now, if you try to add any new document without following the validation rules then mongo will throw a validation error. Try to run the following insert queries.

Query:1

db.users.insert({
    "name": "akash"
})

Output:

WriteResult({
    "nInserted" : 0,
    "writeError" : {
        "code" : 121,
        "errmsg" : "Document failed validation"
    }
})

Query:2

db.users.insert({
    "name": "akash",
    "email": "akash@gmail.com",
    "mobile": "123-456-7890"
})

Output:

WriteResult({ "nInserted" : 1 })

However, there are some restrictions with the document validation approach, such as the fact that one can add any number of new key-value pairs to a document and insert it into the collection; this can't be prevented by document validation. Consider the following example.

db.users.insert({
    "name": "akash",
    "email": "akash@gmail.com",
    "mobile": "123-456-7890",
    "gender": "Male"
})

Output:

WriteResult({ "nInserted" : 1 })

Apart from this, document validation only checks the values. Suppose you try to add a document with "nmae" (a typo) as a key instead of "name"; mongo will consider it a new field and the document will be inserted into the DB. These things should be avoided when you are working with a production database. To support all this, MongoDB introduced the "jsonSchema" operator with the "validator" construct in version 3.6. Let's see how to add the same validation rules as above and avoid adding new or misspelled fields.


jsonSchema Validation

Run the following command in mongo shell to add the validation rules using "jsonSchema" operator.

db.runCommand(
  {
    "collMod": "users_temp",
    "validator": {
      "$jsonSchema": {
        "bsonType": "object",
        "additionalProperties": false,
        "required": [
          "name",
          "email"
        ],
        "properties": {
          "_id": {},
          "name": {
            "bsonType": "string"
          },
          "email": {
            "bsonType": "string"
          },
          "mobile": {
            "bsonType": "string",
            "pattern": "^[0-9]{3}-[0-9]{3}-[0-9]{4}$"
          }
        }
      }
    }
  })

Let's see now, what happens when we try to insert the following document.

db.users.insert({
    "name": "akash",
    "email": "akash@gmail.com",
    "mobile": "123-456-7890",
    "gender": "Male"
})

It will throw an error as we haven't defined gender field in the "jsonSchema".

WriteResult({
    "nInserted" : 0,
    "writeError" : {
        "code" : 121,
        "errmsg" : "Document failed validation"
    }
})

Same way, if you have typos in any field names, mongo will throw the same error.

The schema defined above is the same as the one we used for document validation. Additionally, we added the "additionalProperties" field to avoid typos in field names and the addition of new fields in documents. It allows only the fields that are defined under the "properties" field. Here is an overview of some properties which we can use under the "jsonSchema" operator.

  • bsonType: array | object | string | boolean | number | null
  • required: an array of all mandatory fields
  • enum: an array of only possible values for any field
  • minimum: minimum value of the field
  • maximum: maximum value of the field
  • minLength: minimum length of the field
  • maxLength: maximum length of the field
  • properties: a collection of valid JSON schemas
  • additionalProperties: stops us from adding any other fields than mentioned under properties field
  • title: title for any field.
  • description: short description for any field.

Apart from schema validation, the "jsonSchema" operator can also be used in find queries and in the $match stage of an aggregation pipeline, as shown below.
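
For example, a find query can use $jsonSchema to report documents that do not conform to the schema. A small sketch against the users collection used above:

// List documents that violate the schema, e.g. those missing the mandatory email field
db.users.find({ $nor: [ { $jsonSchema: { required: ["name", "email"] } } ] })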

Conclusion

Document/schema validation is not required or desirable in every situation, but generally it's a good practice to add it to your database, as it will increase the productivity of developers who work with your database. They will know what kind of response to expect from the database, since there won't be any random data.

In this article, we learned about the importance of schema validation in MongoDB and how to add validations at document level using document validation and "jsonSchema" operator.

Deploying & Configuring MongoDB Shards with Ansible


Database systems work better when there is a distributed workload among a number of running instances, or rather when data is partitioned in a manageable way. MongoDB utilizes sharding, whereby data in a given database is grouped according to a shard key. Sharding enhances horizontal scaling, which consequently results in better performance and increased reliability. In general, MongoDB offers both horizontal and vertical scaling, as opposed to SQL DBMSs, for example MySQL, which primarily promote vertical scaling.

MongoDB has a looser consistency model whereby a document in a collection may have an additional key that would be absent from other documents in the same collection.

Sharding

Sharding is basically partitioning data into separate chunks and then assigning ranges of chunks to different shard servers. A shard key, which is often a field present in all the documents of the database to be sharded, is used to group the data. Sharding works hand in hand with replication to improve read throughput by ensuring a distributed workload among a number of servers rather than depending on a single server. Besides, replication ensures copies of the written data are available.

Let's say we have 120 docs in a collection; this data can be sharded such that we have 3 replica sets, each holding 40 docs, as depicted in the configuration setup below. If two clients send requests, one to fetch a document at index 35 and the other for a document at index 92, the request is received by the query router (a mongos process), which in turn contacts the configuration node that keeps a record of how the ranges of chunks are distributed among the shards. When the specified document identity is found, it is then fetched from the associated shard. In the example above, the first client's document will be fetched from Shard A and, for client B, the document will be fetched from Shard C. In general, there will be a distributed workload, which is what we define as horizontal scaling.

For the given shards, if the size of a collection in a shard exceeds the chunk_size, the collection will be split and balanced across the shards automatically using the defined shard key. In the deployment setup for the example below, we will need 3 replica sets, each with a primary and some secondaries. The primary nodes act as the sharding servers too.
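
Once the cluster is running, sharding is enabled per database and then per collection from a mongos router. A minimal sketch with hypothetical database, collection and shard key names:

// Run against a mongos
sh.enableSharding("mydb")
sh.shardCollection("mydb.users", { userId: 1 })
sh.status()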

The minimum recommended configuration for a MongoDB production deployment will be at least three shard servers each with a replica set. For best performance, the mongos servers are deployed on separate servers while the configuration nodes are integrated with the shards.

Deploying MongoDB Shards with Ansible

Configuring the shards and replica sets of a cluster separately is a cumbersome undertaking, hence we turn to simple tools like Ansible to achieve the required results with a lot more ease. Playbooks are used to write the required configurations and tasks that the Ansible software will be executing.

The systematic playbook process should be:

  1. Install the mongo base packages (the non-server packages: pymongo and the command line interface)
  2. Install the mongodb server. Follow this guide to get started.
  3. Set up the mongod instances and their corresponding replica sets.
  4. Configure and set up the config servers
  5. Configure and set up the mongos routing service.
  6. Add the shards to your cluster.

The top-level playbook should look like this

- name: install mongo base packages
  include: mongod.yml
  tags:
  - mongod

- name: configure config server
  include: configServer.yml
  when: inventory_hostname in groups['mongoc-servers']
  tags:
  - cs

- name: configure mongos server
  include: configMongos.yml
  when: inventory_hostname in groups['mongos-servers']
  tags:
  - mongos

- name: add shards
  include: addShards.yml
  when: inventory_hostname in groups['mongos-servers']
  tags:
  - mongos
  - shards

We can save the file above as mongodbCluster.yml.


A simple mongod.yml file will look like this:

---
- hosts: ansible-test
  remote_user: root
  become: yes
  tasks:
  - name: Import the public key used by the package management system
    apt_key: keyserver=hkp://keyserver.ubuntu.com:80 id=7F0CEB10 state=present
  - name: Add MongoDB repository
    apt_repository: repo='deb http://downloads-distro.mongodb.org/repo/ubuntu-upstart dist 10gen' state=present
  - name: install mongodb
    apt: pkg=mongodb-org state=latest update_cache=yes
    notify:
    - start mongodb
  handlers:
    - name: start mongodb
      service: name=mongod state=started

In addition to the general parameters required in the deployment of a replica set, we need these two more in order to add the shards.

  • shard: null by default. This is the shard connection string, which should be in the format <replica_set>/host:port, for example replica0/siteurl1.com:27017.
  • state: present by default, which dictates that the shard should exist; set it to absent to remove the shard.

After deploying a replica set as explained in this blog, you can proceed to add the shards.

# add a replicaset shard named replica0 with a member running on port 27017 on mongodb0.example.net
- mongodb_shard:
    login_user: admin
    login_password: root
    shard: "replica0/mongodb1.example.net:27017"
    state: present

# add a standalone mongod shard running on port 27018 of mongodb2.example.net
- mongodb_shard:
    login_user: admin
    login_password: root
    shard: "mongodb2.example.net:27018"
    state: present

# Single node shard running on localhost
- name: Ensure shard replica0 exists
  mongodb_shard:
    login_user: admin
    login_password: root
    shard: "replica0/localhost:3001"
    state: present

# Single node shard running on localhost
- name: Ensure shard replica0 exists
  mongodb_shard:
    login_user: admin
    login_password: root
    shard: "replica0/localhost:3002"
    state: present

After setting up all these configurations we run the playbook with the command

ansible-playbook -i hosts mongodbCluster.yml

Once the playbook completes, we can log into any of the mongos servers and issue the command sh.status(). If the output is something like below, the shards have been deployed. You can also check the mongodb_shard tasks in the Ansible output to confirm they reported success.

mongos> sh.status()
    --- Sharding Status --- 
      sharding version: { "_id" : 1, "version" : 3 }
      shards:
        {  "_id" : "shardA",  "host" : "locahhost1/web2:2017,locahhost3:2017" }
        {  "_id" : "shardB",  "host" : "locahhost3/web2:2018,locahhost3:2019" }
{  "_id" : "shardC",  "host" : "locahhost3/web2:2019,locahhost3:2019" }

    databases:
        {  "_id" : "admin",  "partitioned" : false,  "primary" : "config" }

To remove a shard called replica0

- mongodb_shard:
    login_user: admin
    login_password: root
    shard: replica0
    state: absent

Conclusion

Ansible has played a major role in making the deployment process easy, since we only need to define the tasks that need to be executed. Imagine, for example, that you had 40 replica set members and needed to add shards to each. Going the manual way would take you ages and is prone to a lot of human error. With Ansible you just define these tasks in a simple file called a playbook, and Ansible will take care of them when the file is executed.


A Guide to MongoDB Deployment & Maintenance Using Puppet: Part 1


Database clustering often involves configuring and maintaining a number of servers and instances, all with a collective purpose. By this we mean you can have different database servers at different hosts which are serving the same data.

For example, let’s say you have servers A, B, C, and D and you decide to install MongoDB on each, but later realize there is a newer version you should have used. When you have a large number of servers and you need to update the MongoDB version, configuring them manually (one by one) has a lot of setbacks. These setbacks can include taking too long to reconfigure (hence your site will have a long downtime) or making your DB prone to configuration errors.

Besides, there are always repetitive tasks you would like to have executed automatically, instead of going through the same steps over and over every time you want to make similar changes. As technology advances, we also need to learn about new modules that can help us boost cluster performance.

In simple terms, we need an automation system that can ease all the mentioned undertakings. Puppet is one of the most preferred software systems for achieving this since:

  • It is easy and fast to configure and deploy a MongoDB cluster.
  • Repetitive tasks can easily be automated so that they are executed automatically later.
  • The whole cluster infrastructure can be collectively managed from a single platform.
  • Easy provisioning of new nodes in cloud, hybrid or physical environments.
  • Orchestrate changes and events across a cluster of nodes.
  • Discover resources within minutes that can help you perform different tasks easily.
  • Scales well from 1 to 200k nodes.
  • Supported by a number of platforms.

What is Puppet?

Puppet is a language used to get a machine to a desired state, or rather an engine that interprets and applies a set of defined instructions to a target system. Like Ansible, Puppet is also a configuration management tool used to automate and execute database cluster tasks. However, it is more advanced and well established, considering that it is the oldest of these tools, and it has plenty of newly integrated features that make it more sophisticated than the others. One of the major reasons I personally prefer Puppet is the capability it gives me to configure a large number of nodes connected together with load balancers, network devices or firewalls. Puppet is often used in large enterprises with complex environments.

How Puppet Works

Puppet uses an idempotency technique that helps it manage a machine from the time of its creation and throughout its lifecycle, even as the configuration changes. The core advantage of this is that the machine is updated over a number of years rather than being rebuilt from scratch multiple times. In case of an update, Puppet checks the current status of the target machine and applies changes only when there is a specific change in the configuration.

Idempotency

The idempotency workflow is shown below:

The Puppet master collects details regarding the current state of the target machine and compares it to the machine level configuration details and then returns the details which are sent to the conversion layer.

The conversion layer compares the retrieved configuration with the newly defined configuration details and then creates a catalog which is sent to the target Puppet agents, in this case, the target nodes for which the changes are to be applied.

The configuration changes are then applied to the system to transform it to a desired state. After the changes have been implemented, the Puppet agent sends a report back to the Puppet master which is documented to define the new state of the system as the supplied catalog.

Puppet Basic Components

  1. Puppet Resources

    These are the key modelling components of a particular machine whose descriptions will get the machine to a desired state.

  2. Providers

    Providers are particular resources used to add packages to the system e.g. yum and apt-get. There are default providers but one can add more when in need of some packages.

  3. Manifest

    This is a collection of resources that are defined either in a function or a class coupled together to configure a target system.

    The structure should be

    resource_type { 'title':
      attribute => value,
    }

    For example, to install MongoDB we can have a manifest file called Mongodb.pp with the following contents:

    package { 'mongodb':
      ensure => installed,
    }
  4. Modules

    This is the key building block of Puppet which is basically a collection of resources, templates and files. They can be distributed in any operating system hence can be used multiple times with the same configuration.

  5. Templates

    Templates are used to define customized content and variable input. They use the Ruby syntax, i.e. if you want to define a port to listen to:

    Listen <%= @Port_number %>

    Port_number variable in this case is defined in the manifest that references this template.

  6. Static Files

    These are general files that may be required to perform specific tasks. They are located in the files directory of any module.

Puppet Installation

For the purpose of learning, we are going to install and configure Puppet in a virtual machine which we will create on our local machine. First of all you will need to install VirtualBox and Vagrant. After installing, open a new terminal, create a Puppet directory (probably on your desktop) and run the commands $ vagrant init bento/ubuntu-16.04 and $ vagrant up. This will create and boot a virtual machine. Then we can log into this machine with the command $ vagrant ssh.

If you get a screen like the one below then your machine is up and running.

Otherwise if you are on a server machine you can ignore this step and proceed from adding the puppet package like below.

Add the puppet package with the command

$ wget https://apt.puppetlabs.com/puppet5-release-xenial.deb

And then unpack the package and install with

$ sudo dpkg -i puppet5-release-xenial.deb

We need to update our repositories so we run

$ sudo apt-get update

Install the puppet-agent by running

$ sudo apt-get install puppet-agent

After the installation is complete we can confirm by checking the version. You might need to log out of your virtual machine in order for Puppet path to be added to the environment then run $ puppet --version or if you have not logged out run $ /opt/puppetlabs/bin/puppet --version. If you get a version number like 5.5.14 then the installation was successful.

After installing MongoDB using the Mongodb.pp we created above, we can simply write some tasks to set up a products database and also add a user to this database.

‘mongodb_database’ is used to create and manage databases within MongoDB:

mongodb_database { 'products':
  ensure => present,
  tries  => 10,
}

‘mongodb_user’ can be used to create and manage users within a MongoDB database.

To add a user to the ‘products’ database

mongodb_user { 'userprod':
  username      => 'prodUser',
  ensure        => present,
  password_hash => mongodb_password('prodUser', 'passProdser'),
  database      => 'products',
  roles         => ['readWrite', 'dbAdmin'],
  tries         => 10,
}
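
Assuming the two resources above have been applied successfully, a quick way to confirm the result from the mongo shell is shown below; the names mirror the example and should be adjusted to your own setup.

use products
db.getUser("prodUser")   // should return the user document with the readWrite and dbAdmin roles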

Conclusion

In this blog we have learned what Puppet is, the merits associated with it, and its working architecture. Puppet is a bit more complex than other management tools (such as Chef and Ansible), but it has a lot of modules that can be used to resolve issues around database management. In the next part, we are going to discuss how to connect remote machines so that they can be reconfigured using the defined manifest files.

A Guide to MongoDB Deployment & Maintenance Using Puppet: Part 2


In the previous blog, we showed you how to set up a machine with Puppet and then install and configure MongoDB. Since we are going to configure a number of nodes, or rather machines, we need a Puppet master. In our case though, we will create a git repository where we will push our manifests and apply them to our machines.

To create a local git repository, first select the path you want to use, i.e. /opt/. Then create the repository directory by running $ sudo mkdir repository. Give the vagrant user permission to change the contents of this directory by issuing the command $ sudo chown vagrant:vagrant repository. To initialize this directory as a git repository, issue the command $ cd repository and then run $ git init --bare --shared. If you navigate to this directory you should now see something like:

vagrant@puppet:/vagrant/repository$ ls -l
total 12
-rw-rw-r-- 1 vagrant vagrant  23 Jul 15 07:46 HEAD
drwxr-xr-x 1 vagrant vagrant  64 Jul 15 07:46 branches
-rw-rw-r-- 1 vagrant vagrant 145 Jul 15 07:46 config
-rw-rw-r-- 1 vagrant vagrant  73 Jul 15 07:46 description
drwxr-xr-x 1 vagrant vagrant 352 Jul 15 07:46 hooks
drwxr-xr-x 1 vagrant vagrant  96 Jul 15 07:46 info
drwxr-xr-x 1 vagrant vagrant 128 Jul 15 07:46 objects
drwxr-xr-x 1 vagrant vagrant 128 Jul 15 07:46 refs
-rw-r--r-- 1 vagrant vagrant   0 Jul 1 15:58 test.pp

This is the basic structure of a git repository, and the options --bare and --shared will enable us to push and pull files from the directory.

We need to set up a system that will enable communication between the involved machines and this remote master server. The system in this case will be referred to as a daemon. The daemon will accept requests from remote hosts to either pull files from or push files to this repository. To do so, issue the command $ git daemon --reuseaddr --base-path=/opt/ --export-all --enable=receive-pack

However, good practice is to create a service file so that we can run this in the background. We therefore define the service by issuing the command $ sudo vim /etc/systemd/system/gitd.service and populating the new file with these contents:

[Unit]
Description=Git Repo Server Daemon

[Service]
ExecStart=/usr/bin/git daemon --reuseaddr --base-path=/opt/ --export-all --enable=receive-pack

[Install]
WantedBy=getty.target
DefaultInstance=ttyl

Save the file and exit by pressing <Esc>, then type :x and then press <Enter>. To start the server run the command $ systemctl start gitd. For the authentication, use the password we set, in this case vagrant. You should be presented with something like this:

vagrant@puppet:/opt/repository$ systemctl start gitd
==== AUTHENTICATING FOR org.freedesktop.systemd1.manage-units ===
Authentication is required to start 'gitd.service'.
Authenticating as: vagrant,,, (vagrant)
Password: 
==== AUTHENTICATION COMPLETE ===

To check if the service is running, run $ ps -ef | grep git and you will get:

vagrant@puppet:/opt/repository$ ps -ef | grep git
root      1726 1  0 07:48 ?     00:00:00 /usr/bin/git daemon --reuseaddr --base-path=/opt/ --export-all --enable=receive-pack
root      1728 1726  0 07:48 ?     00:00:00 git-daemon --reuseaddr --base-path=/opt/ --export-all --enable=receive-pack
vagrant   1731 1700  0 07:48 pts/0    00:00:00 grep --color=auto git

Now if we run $ git clone git://198.168.1.100/repository (remember to replace the IP address with your machine’s network IP) in the root directory, you will get a newly created repository folder. Remember to configure your credentials by uncommenting the email and name in the config file. Run $ git config --global --edit to access this file.

This repository will act as our central server for all the manifests and variables.

Setting Up the Environment

We now need to set up the environment from which we will configure the nodes. First, switch to the vagrant directory and clone the repository we just created with the same command as above.

Remove the manifest directory in the vagrant folder by running $ rm -r manifest/

Make a new production folder with $ mkdir production and clone the same repository we created above with $ git clone git://198.168.1.100/repository . (don’t forget the dot at the end)

Copy the contents of the puppetlabs production environment into this production folder by issuing $ cp -pr /etc/puppetlabs/code/environments/production/* . Your production directory should now look like this:

vagrant@puppet:/vagrant/production$ ls -l
total 8
drwxr-xr-x 1 vagrant vagrant  64 Apr 26 18:50 data
-rw-r--r-- 1 vagrant vagrant 865 Apr 26 18:50 environment.conf
-rw-r--r-- 1 vagrant vagrant 518 Apr 26 18:50 hiera.yaml
drwxr-xr-x 1 vagrant vagrant  96 Jul 2 10:45 manifests
drwxr-xr-x 1 vagrant vagrant  64 Apr 26 18:50 modules
-rw-r--r-- 1 vagrant vagrant   0 Jul 1 16:13 test.pp

We need to push these changes to the root repository so we run 

$ git add * && git commit -m "adding production default files" && git push

To test if the git configuration is working, we can delete the contents of the directory /etc/puppetlabs/code/environments/production/ by running $ sudo rm -r * in this directory and then pull the files from the master repository as the root user, i.e. $ git clone git://198.168.1.100/repository . (don’t forget the dot at the end). Only directories with contents are pulled in this case, so you might miss the manifests and modules folders. These operations can be carried out on all machines involved, whether the Puppet master or a client machine. So our tasks will be pulling the changes from the main server and applying them using the manifests.

Execution Manifest

This is the script we are going to write to help us pull changes and apply them automatically to our other nodes. You are not limited to the production environment; you can add as many environments as needed and then dictate which one Puppet should search. In the root production/manifests directory we will create the execution manifest as puppet_exec.pp and populate it with the following contents:

 file { "This script will be pulling and applying the puppet manifests":

path => '/usr/local/bin/exec-puppet',

content => 'cd /etc/puppetlabs/code/environments/production/ && git pull; /opt/puppetlabs/bin/puppet apply manifests/'

mode => "0755"

}

cron {'exec-puppet':

command => '/usr/local/bin/exec-puppet',

hour => '*',

minute => '*/15'

}

The file resource describes the script that will execute the puppet manifests. We set an appropriate path for the file we are creating and populate it with the commands that are to be issued when it is executed.

The commands are executed in order: we first navigate to the production environment, pull the repository changes and then apply them to the machine.

We supply the manifests directory to each node, from which it can select the manifest directed to it for application.

The cron resource also sets how often the execution file is run; in this case every 15 minutes, that is, 4 times per hour.

To apply this to our current machine, $ cd /vagrant/production. Add everything to git by running $ git add *, then $ git commit -m "add the cron configurations" and lastly $ git push. Now navigate to /etc/puppetlabs/code/environments/production/ and run $ sudo git pull

Now if we check the manifests folder in this directory, you should see the puppet_exec.pp created as we had just defined. 

Now run $ sudo puppet apply manifests/ and check that the file exec-puppet has been created with $ cat /usr/local/bin/exec-puppet

The contents of this file should be 

cd /etc/puppetlabs/code/environments/production/ && git pull; /opt/puppetlabs/bin/puppet apply manifests/

At this point we have seen how we can pull and push changes to our master machine which should be applied to all the other nodes. If we run $ sudo crontab -l, we can see the cron entry that was created, along with some important warnings about the exec-puppet file:

# HEADER: This file was autogenerated at 2019-07-02 11:50:56 +0000 by puppet.
# HEADER: While it can still be managed manually, it is definitely not recommended.
# HEADER: Note particularly that the comments starting with 'Puppet Name' should
# HEADER: not be deleted, as doing so could cause duplicate cron jobs.
# Puppet Name: exec-puppet
*/15 * * * * /usr/local/bin/exec-puppet

Configuring the Machines

Let’s say our Vagrantfile looks like this:

Vagrant.configure("2") do |config|

  config.vm.define "puppet" do |puppet|

   puppet.vm.box = "bento/ubuntu-16.04"

   #puppet.vm.hostname = "puppet"

   #puppet.vm.network "private_network", ip: "192.168.1.10"

  end

  config.vm.define "db" do |db|

    db.vm.box = "bento/ubuntu-16.04"

  end

end

In this case we have the puppet machine, where we have been doing our configurations, and then the db machine. Now we want to automate the db machine so that whenever it is started, it already has Puppet installed and the cron file available to pull the manifests and apply them accordingly. You will need to restructure the contents of the db machine definition to be as follows:

config.vm.define "db" do |db|

    db.vm.box = "bento/ubuntu-16.04"

    vm.provision "shell", inline: <<-SHELL

      cd /temp

      wget  https://apt.puppetlabs.com/puppet5-release-xenial.deb

      dpkg -i puppet5-release-xenial.deb

      apt-get update

      apt-get install -y puppet-agent

      apt-get install -y git

      rm -rf /etc/puppetlabs/code/environments/production/*

      cd /etc/puppetlabs/code/environments/production/

      git clone git://198.168.1.100/repository .

      /opt/puppetlabs/bin/puppet apply /etc/puppetlabs/code/environments/production/manifests/puppet_exec.pp

    SHELL

  End

Up to this stage, the structure of your puppet directory should be something like this

If you now bring up the db machine with the command $ vagrant up db, some of the resources will be installed and the script we just defined can be found in the production/manifests directory. However, it is advisable to use the Puppet master, which is constrained to only 10 nodes for the free version; otherwise you will need to subscribe to a plan. The Puppet master offers more features, such as distributing manifests to multiple nodes, reporting logs and more control over the nodes.

Mongodb Puppet Module

This module is used for the installation of MongoDB, managing the mongod server installation, configuring the mongod daemon, and managing the Ops Manager setup as well as the MongoDB-mms daemon.

Conclusion

In the next blog we will show you how to deploy a MongoDB Replica Set and Shards using Puppet.

 

An Overview of MongoDB Atlas: Part One


The cloud computing approach addresses some of the challenges associated with running data processing systems. Data-driven companies are pushing out rapid business transformation with cloud services, and many see cloud services as a substantial enhancement in automation, reliability, and on-demand scaling over the traditional infrastructure models which came before. The on-demand nature of the Software-as-a-Service (SaaS) paradigm means organizations can buy what they need, when they need it. Of course, the cost and cost-effectiveness aspects are crucial, but not the only ones.

When designing system architectures, we are always looking for systems that fit the right number of users at the right level of performance for each. We want to avoid performance issues and bottlenecks, and if those issues happen, we want a system which adapts to the changing demand.

We also want things faster. The agile development process is getting more and more popular, mainly because it accelerates the delivery of initial business value and (through a process of continuous planning and feedback) can ensure that the ROI is maximized.

Lastly, we want a reduction in complexity. A key feature of MongoDB is its built-in redundancy. If you have two or more data nodes, they can be configured as a replica set or mongodb shards. Without proper automation in place, it can be a recurring task for several teams (network, storage, OS, etc.). Cloud automation can help you to reduce dependencies between the various groups in your organization. For example, you may not need to involve the network team when you create a new database system. 

Cloud automation not only saves time and money but also makes your organization more competitive in a challenging market.

In this blog, we will take a look at Atlas, the solution from MongoDB that tries to address all of these problems.

Getting Started with MongoDB Atlas

To start with MongoDB Atlas go to https://cloud.mongodb.com. In the registration form, you need to provide bare minimum information like email, company, country, and mobile number. 

MongoDB Atlas

MongoDB Atlas does an excellent job in infrastructure provisioning, setup. The whole process uses a dynamic web interface that walks you through various deployment options. It's easy, intuitive and doesn't require specialized knowledge.

Creating a New Cluster in MongoDB Atlas

After the first login, you will be asked to build your first cluster in one of the three most significant clouds. Atlas works with Amazon AWS, Google Cloud, and Microsoft Azure. Based on your choice, you can pick the preferred data center location. To increase availability, you can set Multi-Region, Workload Isolation, or various Replication options. Each Atlas project supports up to 25 clusters, but after contacting support you should be able to host more.

M0 Cluster in MongoDB Atlas

You need to select the appropriate size of the server, coupled with IO and storage capacity. In this article, we will use the free version. It is free to start with MongoDB Atlas for prototyping, early development or to learn. The credit card is not needed, so you don't need to bother about hidden costs. The free edition called M0 Sandbox is limited to:

  • 512MB storage
  • vCPU shared
  • RAM shared
  • 100 max connections
  • There is a limit of one M0 cluster per project.

For dedicated clusters, MongoDB Atlas is billed hourly based on how much you use. The rate depends on a number of factors, most importantly the size and number of servers you use. The price starts at $0.08/hr (M10: 2GB RAM, 10GB storage, 1 vCPU) and goes up to an M700 (768GB RAM, 4096GB storage, 96 vCPUs) from $33.26/hr. Obviously, you would need to include other cost factors, for example the cost of backups.

According to MongoDB calculations, running an AWS 3-node replica set of M40s 24/7 for one month using the included 80GB of standard block storage would cost you around $947.

The basic setup works with replication. If you need sharding, the M30 instance type is the minimum (8GB RAM, 40GB storage, 2 vCPUs, priced from $0.54/hr).

MongoDB Atlas Network Access Initial Setup

One of the first steps after cluster creation is to set up the IP whitelist. To enable access from everywhere you can add a whitelist entry of 0.0.0.0/0, but this is not recommended. If you don’t know your IP address, Atlas will help you identify it.

MongoDB Atlas, Network Access

To keep your connection more secure you can also set up a network peering connection. This feature is not available for M0, M2, and M5 clusters. Network peering allows connectivity between the MongoDB Atlas VPC and your cloud provider's VPC. Peered VPC networks communicate in a private address space, so traffic doesn't traverse the public internet.

To start working with your new cluster, create an initial user in the Database Access tab. MongoDB uses the Salted Challenge Response Authentication Mechanism (SCRAM), a security mechanism based on SHA-256 that verifies user credentials against the user’s name, password and authentication database.

MongoDB Atlas, Add new user

Migration of Existing MongoDB Cluster to MongoDB Atlas

It is also possible to migrate your existing on-prem cluster to MongoDB Atlas. This is done via a dedicated service called the Live Migration Service. The Atlas Live Migration process streams data through a MongoDB-controlled application server.

Live migration works by keeping a cluster in MongoDB Atlas in sync with your source database. During this process, your application can continue to read and write from your source database. Since the process watches for incoming changes, everything is replicated and the migration can be done online. You decide when to change the application connection settings and perform the cutover. To make the process less error prone, Atlas provides a Validate option which checks whitelist IP access, SSL configuration, CA, etc.

What’s important here is the service is free of charge.

If you don't need online migration, you can also use mongoimport. Use a mongo shell of at least version 3.2.7 and always use SSL. You can get test data from here.

​mongoimport --host TestCluster-shard-0/testcluster-shard-*****.azure.mongodb.net:27017,testcluster-shard-****.azure.mongodb.net:27017,testcluster-shard-******.azure.mongodb.net:27017 --ssl --username admin --authenticationDatabase admin  --type JSON --file city_inspections.json

2019-08-15T21:53:09.921+0200 WARNING: ignoring unsupported URI parameter 'replicaset'

2019-08-15T21:53:09.922+0200 no collection specified

2019-08-15T21:53:09.922+0200 using filename 'city_inspections' as collection

Enter password:



2019-08-15T21:53:14.288+0200 connected to: mongodb://testcluster-shard-*****.azure.mongodb.net:27017,testcluster-shard-*****.azure.mongodb.net:27017,testcluster-shard-*****.azure.mongodb.net:27017/?replicaSet=TestCluster-shard-0

2019-08-15T21:53:17.289+0200 [........................] test.city_inspections 589KB/23.2MB (2.5%)

2019-08-15T21:53:20.290+0200 [#.......................] test.city_inspections 1.43MB/23.2MB (6.2%)

2019-08-15T21:53:23.292+0200 [##......................] test.city_inspections 2.01MB/23.2MB (8.6%)

...

2019-08-15T21:55:09.140+0200 [########################] test.city_inspections 23.2MB/23.2MB (100.0%)

2019-08-15T21:55:09.140+0200 81047 document(s) imported successfully. 0 document(s) failed to import.

To check data, login with mongo shell.

mongo "mongodb+srv://testcluster-*****.azure.mongodb.net/test" --username admin

MongoDB shell version v4.2.0

Enter password:

connecting to: mongodb://testcluster-shard-00-00-*****.azure.mongodb.net:27017,testcluster-shard-00-02-*****.azure.mongodb.net:27017,testcluster-shard-00-01-*****.azure.mongodb.net:27017/test?authSource=admin&compressors=disabled&gssapiServiceName=mongodb&replicaSet=TestCluster-shard-0&ssl=true

2019-08-15T22:15:58.068+0200 I  NETWORK [js] Starting new replica set monitor for TestCluster-shard-0/testcluster-shard-00-00-*****.azure.mongodb.net:27017,testcluster-shard-00-02-*****.azure.mongodb.net:27017,testcluster-shard-00-01-*****.azure.mongodb.net:27017

2019-08-15T22:15:58.069+0200 I  CONNPOOL [ReplicaSetMonitor-TaskExecutor] Connecting to testcluster-shard-00-01-*****.azure.mongodb.net:27017

2019-08-15T22:15:58.070+0200 I  CONNPOOL [ReplicaSetMonitor-TaskExecutor] Connecting to testcluster-shard-00-00-*****.azure.mongodb.net:27017

2019-08-15T22:15:58.070+0200 I  CONNPOOL [ReplicaSetMonitor-TaskExecutor] Connecting to testcluster-shard-00-02-*****.azure.mongodb.net:27017

2019-08-15T22:15:58.801+0200 I  NETWORK [ReplicaSetMonitor-TaskExecutor] Confirmed replica set for TestCluster-shard-0 is TestCluster-shard-0/testcluster-shard-00-00-*****.azure.mongodb.net:27017,testcluster-shard-00-01-*****.azure.mongodb.net:27017,testcluster-shard-00-02-*****.azure.mongodb.net:27017

Implicit session: session { "id" : UUID("6a5d1ee6-064b-4ba8-881a-71aa4aef4983") }

MongoDB server version: 4.0.12

WARNING: shell and server versions do not match

MongoDB Enterprise TestCluster-shard-0:PRIMARY> show collections;

city_inspections

MongoDB Enterprise TestCluster-shard-0:PRIMARY> db.city_inspections.find();

{ "_id" : ObjectId("56d61033a378eccde8a83557"), "id" : "10284-2015-ENFO", "certificate_number" : 9287088, "business_name" : "VYACHESLAV KANDZHANOV", "date" : "Feb 25 2015", "result" : "No Violation Issued", "sector" : "Misc Non-Food Retail - 817", "address" : { "city" : "NEW YORK", "zip" : 10030, "street" : "FREDRCK D BLVD", "number" : 2655 } }

{ "_id" : ObjectId("56d61033a378eccde8a83559"), "id" : "10302-2015-ENFO", "certificate_number" : 9287089, "business_name" : "NYC CANDY STORE SHOP CORP", "date" : "Feb 25 2015", "result" : "No Violation Issued", "sector" : "Cigarette Retail Dealer - 127", "address" : { "city" : "NEW YORK", "zip" : 10030, "street" : "FREDRCK D BLVD", "number" : 2653 } }

...

{ "_id" : ObjectId("56d61033a378eccde8a8355e"), "id" : "10391-2015-ENFO", "certificate_number" : 3019415, "business_name" : "WILFREDO DELIVERY SERVICE INC", "date" : "Feb 26 2015", "result" : "Fail", "sector" : "Fuel Oil Dealer - 814", "address" : { "city" : "WADING RIVER", "zip" : 11792, "street" : "WADING RIVER MANOR RD", "number" : 1607 } }

Type "it" for more

MongoDB Enterprise TestCluster-shard-0:PRIMARY>

Conclusion

That’s all for part one. In the next article, we are going to cover monitoring, backups, day to day administration and MongoDB’s new service for building Data Lakes. Stay tuned!

An Overview of MongoDB Atlas: Part Two


In the first part of the blog “An Overview of MongoDB Atlas,” we looked at getting started with MongoDB Atlas, the initial setup and migration of an existing MongoDB Cluster to MongoDB Atlas. In this part we are going to continue to explore several management elements required for every MongoDB production system, such as security and business continuity. 

Database Security in MongoDB Atlas

Security always comes first. While it is important for all databases, for MongoDB it has a special meaning. In mid 2017 the internet was full of news regarding ransomware attacks which specifically targeted vulnerabilities in MongoDB systems. Hackers were hijacking MongoDB instances and asking for a ransom in exchange for the return of the stored data. There were warnings. Prior to these ransomware attacks bloggers and experts wrote about how many production instances were found to be vulnerable. It stirred up vibrant discussion around MongoDB security for a long time after.

We are now in 2019 and MongoDB is getting even more popular. The new major version (4.0) was recently released, and we have seen increased stability in MongoDB Atlas. But what has been done to increase security for NoSQL databases in the cloud?

The ransomware and constant press must have had an impact on MongoDB, as we can clearly see that security is now at the center of the MongoDB ecosystem. MongoDB Atlas is no exception, as it now comes with built-in security controls for production data processing needs and many enterprise security features out of the box. The default approach (which caused the vulnerability) from the older versions is gone, and the database is now secured by default (network access, CRUD authorizations, etc.). It also comes with features you would expect to have in a modern production environment (auditing, temporary user access, etc.).

But it doesn’t stop there. Since Atlas is an online solution, you can now use integrations with third parties, like LDAP authentication, or modern MongoDB internet services like MongoDB Charts. MongoDB Atlas is built atop Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP), which also offer high-security measures of their own. This combination ensures MongoDB Atlas security standards are what we would expect. Let’s take a quick look at some of these key features.

MongoDB Atlas & Network Security

MongoDB Atlas builds clusters on top of your existing cloud infrastructure. When one chooses AWS, the customer data is stored in MongoDB Atlas systems. These systems are single-tenant, dedicated, AWS EC2 virtual servers which are created solely for an Atlas Customer. Amazon AWS data centers are compliant with several physical security and information security standards, but since we need an open network, it can raise concerns.

MongoDB Atlas dedicated clusters are deployed in a Virtual Private Cloud (VPC) with dedicated firewalls. Access must be granted by an IP whitelist or through VPC Peering. By default all access is disabled.

MongoDB requires the following network ports for Atlas...

  • 27016 for shards
  • 27015 for the BI connector
  • 27017 for server
  • If LDAP is enabled, MongoDB requires LDAP network port 636 on the customer side to be open to 0.0.0.0 (entire Internet) traffic.

The network ports cannot be changed and TLS cannot be disabled. Access can also be isolated by IP whitelist. 

MongoDB Atlas Add Whitelist Entry

Additionally you can choose to access MongoDB Atlas via Bastion hosts. Bastion hosts are configured to require SSH keys (not passwords). They also require multi-factor authentication, and users must additionally be approved by senior management for backend access. 

MongoDB Atlas Role-Based Access Management

You can configure advanced, role-based access rules to control which users (and teams) can access, manipulate, and/or delete data in your databases. By default there are no users created so you will be prompted to create one.

MongoDB Atlas allows administrators to define permissions for a user or application as well as what data can be accessed when querying MongoDB. MongoDB Atlas provides the ability to provision users with roles specific to a project or database, making it possible to realize a separation of duties between different entities accessing and managing the data. The process is simple and fully interactive.

To create a new user go to the Security tab on the left side and choose between MongoDB users and MongoDB roles. 

MongoDB Atlas Add a New User

MongoDB Roles

MongoDB Atlas Add Custom Role

End-to-End Database Encryption in MongoDB Atlas

All the MongoDB Atlas data in transit is encrypted using Transport Layer Security (TLS). You have the flexibility to configure the minimum TLS protocol version. Encryption for data-at-rest is automated using encrypted storage volumes.

You can also integrate your existing security practices and processes with MongoDB Atlas to provide additional control over how you secure your environment. 

For the MongoDB Atlas Cluster itself, authentication is automatically enabled by default via SCRAM to ensure a secure system out of the box.

With Encryption Key Management you can bring your own encryption keys to your dedicated clusters for an additional layer of encryption on the database files, including backup snapshots.

MongoDB Atlas Encryption Key

Auditing in MongoDB Atlas

Granular database auditing answers detailed questions about system activity for deployments with multiple users by tracking all the commands run against the database. Auditing in MongoDB is only available in MongoDB Enterprise. You can write audit events to the console, to the syslog, to a JSON file, or to a BSON file. You configure the audit option using the --auditDestination qualifier. For example, to send audit events as JSON events to syslog use...

mongod --dbpath data/db --auditDestination syslog

MongoDB maintains a centralized log management system for collection, storage, and analysis of log data for production environments. This information can be used for health monitoring, troubleshooting, and for security purposes. Alerts are configured in the system in order to notify SREs of any operational concerns.

MongoDB Atlas Activity Feed

MongoDB Atlas LDAP Integration

User authentication and authorization against MongoDB Atlas clusters can be managed via a customer’s Lightweight Directory Access Protocol (LDAP) server over TLS. A single LDAP configuration applies to all database clusters within an Atlas project. LDAP servers are used to simplify access control and make permissions management more granular. 

For customers running their LDAP server in an AWS Virtual Private Cloud (VPC), a peering connection is recommended between that environment and the VPC containing their Atlas databases.

MongoDB Atlas LDAP Integration

MongoDB Business Continuity and Disaster Recovery

MongoDB Atlas creates and configures dedicated clusters on infrastructure provided by AWS, Azure and/or Google GCP. Data availability is subject to the infrastructure provider service Business Continuity Plans (BCP) and Disaster Recovery (DR) processes. MongoDB Atlas infrastructure service providers hold a number of certifications and audit reports for these controls. 

Database Backups in MongoDB Atlas

MongoDB Atlas backs up data, typically only seconds behind an operational system. MongoDB Atlas ensures continuous backup of replica sets, consistent, cluster-wide snapshots of sharded clusters, and point-in-time recovery. This fully-managed backup service uses Amazon S3 in the region nearest to the customer's database deployment.

Backup data is protected using server-side encryption. Amazon S3 encrypts backed up data at the object level as it writes it to disks in its data centers and decrypts it for you when you restore it. All keys are fully managed by AWS.

Atlas clusters deployed in Amazon Web Services and Microsoft Azure can take advantage of cloud provider snapshots which use the native snapshot capabilities of the underlying cloud provider. Backups are stored in the same cloud region as the corresponding cluster. For multi-region clusters, snapshots are stored in the cluster’s preferred region. 

Atlas offers the following methods to back up your data...

Continuous Database Backups

Continuous backups are available on M10+ clusters running server versions lower than 4.2. This is an older method of performing MongoDB backups. Atlas uses incremental snapshots to continuously back up your data. Continuous backup snapshots are typically just a few seconds behind the operational system. Atlas ensures point-in-time backup of replica sets and consistent, cluster-wide snapshots of sharded clusters on its own, using S3 for storage.

Full-Copy Snapshots

Atlas uses the native snapshot capabilities of your cloud provider to support full-copy snapshots and localized snapshot storage.

MongoDB Atlas Data Lake

Using Atlas Data Lake to ingest your S3 data into Atlas clusters allows you to quickly query data stored in your AWS S3 buckets using the Mongo Shell, MongoDB Compass, and any MongoDB driver.

When you create a Data Lake, you will grant Atlas read-only access to S3 buckets in your AWS account and create a data configuration file that maps data from your S3 buckets to your MongoDB databases and collections. Atlas supports using any M10+ cluster, including Global Clusters, to connect to Data Lakes in the same project.

MongoDB Atlas Data Lake

At the time of writing this blog, the following formats are supported:

  • Avro
  • Parquet
  • JSON
  • JSON/Gzipped
  • BSON
  • CSV (requires header row)
  • TSV (requires header row)

Conclusion

That’s all for now, I hope you enjoyed my two part overview of MongoDB Atlas. Remember that ClusterControl also provides end-to-end management of MongoDB Clusters as well and is a great, lower-cost alternative to MongoDB Atlas which can also be deployed in the cloud.

The Basics of Deploying a MongoDB Replica Set and Shards Using Puppet


Database systems perform best when they are integrated with well defined approaches that facilitate both read and write throughput. MongoDB went the extra mile by embracing replication and sharding with the aim of enabling horizontal as well as vertical scaling, as opposed to relational DBMSs, where the same concepts mainly enhance vertical scaling.

Sharding ensures distribution of load among the members of the database cluster so that read operations are carried out with low latency. Without sharding, the capacity of a single database server holding a large data set and handling high-throughput operations can be technically challenged, and the server may fail if the necessary measures are not taken into account. For example, if the rate of queries is very high, the CPU capacity of the server will be overwhelmed.
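
As an illustration, once a collection has been sharded you can inspect how its data and load are spread from the mongo shell; the collection name here is just a placeholder.

db.orders.getShardDistribution()   // prints per-shard data size, document count and estimated chunk count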

Replication on the other hand is a concept whereby different database servers host the same data. It ensures high availability of data besides enhancing data integrity. Take the example of a high performing social media application: if the main serving database system fails, for instance during a power blackout, we should have another system serving the same data. A good replica set should have at least three members, an arbiter where needed, and an optimal electionTimeoutMillis. In replication, we have a master/primary node where all the write operations are made and then recorded in an oplog. From the oplog, all the changes are applied to the other members, which in this case are referred to as secondary nodes or slaves. If the primary node does not communicate within electionTimeoutMillis, the other nodes are signaled to hold an election. The electionTimeoutMillis should be set neither too high nor too low: too high and the system will be down for a long time (hence losing a lot of data), too low and even temporary network latency may trigger frequent elections and data inconsistency. An arbiter is used to break a tie when electing a new master but does not carry any data like the other members.
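
For context, this is roughly what the equivalent manual setup looks like in the mongo shell before we automate it with Puppet; the hostnames and the timeout value are placeholders.

rs.initiate({
  _id: "rs0",
  members: [
    { _id: 0, host: "host0:27017" },
    { _id: 1, host: "host1:27017" },
    { _id: 2, host: "host2:27017" }
  ],
  settings: { electionTimeoutMillis: 10000 }   // 10 seconds is the default; tune with the trade-offs above in mind
})
rs.addArb("host3:27017")                       // add a voting arbiter that stores no data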

Why Use Puppet to Deploy a MongoDB Replica Set

More often, sharding is used hand in hand with replication. The process of configuring and maintaining a replica set is not easy due to:

  1. High chances of human error
  2. Incapability to carry out repetitive tasks automatically
  3. Time consuming especially when a large number of members is involved
  4. Possibility of work dissatisfaction
  5. Overwhelming complexity that may emerge.

In order to overcome the outlined setbacks, we settle on an automated system like Puppet, which has plenty of resources to help us work with ease.

In our previous blog, we learnt the process of installing and configuring MongoDB with Puppet. However, it is important to understand the basic resources of Puppet since we will be using them to configure our replica set and shards. In case you missed it, this is the manifest file for installing and running MongoDB on the machine you created:

package { 'mongodb':
  ensure => 'installed',
}

service { 'mongodb':
  ensure => 'running',
  enable => true,
}

So we can put the content above in a file called runMongoDB.pp and run it with the command 

$ sudo puppet apply runMongoDB.pp

Using the 'mongodb' module and its functions, we can set up our replica set with the corresponding parameters for each mongodb resource.

MongoDB Connection

We need to establish a mongodb connection between a node and the mongodb server. The main aim of this is to prevent configuration changes from being applied if the mongodb server cannot be reached, but it can potentially be used for other purposes like database monitoring. We use the mongodb_conn_validator resource:

mongodb_conn_validator { 'mongodb_validator':
  ensure   => present,
  server   => '127.0.0.1:27017',
  timeout  => 40,
  tcp_port => 27017,
}

name: in this case the name mongodb_validator defines the identity of the resource. It can also be considered as a connection string.

server: this can be a string or an array of strings containing the DNS names/IP addresses of the server where mongodb should be running.

timeout: this is the maximum number of seconds the validator should wait before deciding that mongodb is not running.

tcp_port: this is a provider for the resource which validates the mongodb connection by attempting a connection to the mongodb server on the given TCP port.

Creating the Database

mongodb_database { 'databaseName':
  ensure => present,
  tries  => 10,
}

This resource takes the following parameters:

name: in this case the name databaseName defines the name of the database we are creating, which could also have been declared as name => 'databaseName'.

tries: this defines the maximum number of two-second tries to wait for MongoDB startup.

Creating MongoDB User

The module mongodb_user enables one to create and manage users for a given database in the puppet module.

mongodb_user { 'userprod':
  username      => 'prodUser',
  ensure        => present,
  password_hash => mongodb_password('prodUser', 'passProdser'),
  database      => 'prodUser',
  roles         => ['readWrite', 'dbAdmin'],
  tries         => 10,
}

Properties

username: defines the name of the user.

password_hash: this is the password hash of the user. The function mongodb_password() available on MongoDB 3.0 and later is used for creating the hash.

roles: this defines the roles that the user is allowed to execute on the target database.

password: this is the plain user password text.

database: defines the user’s target database.

Creating a Replica Set

We use the module mongodb_replset to create a replica set.

mongodb_replset { 'replicaset1':
  ensure          => present,
  arbiter         => 'host0:27017',
  members         => ['host0:27017', 'host1:27017', 'host2:27017', 'host3:27017'],
  initialize_host => 'host1:27017',
}

name: defines the name of the replica set.

members: an array of members the replica set will  hold.

initialize_host: host to be used in initialization of the replica set

arbiter: defines the replica set member that will be used as an arbiter.
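
Once Puppet has applied this resource, the state of the replica set can be verified from the mongo shell on any member; this is a verification step, not part of the module itself.

rs.conf()     // shows the members and settings that were configured
rs.status()   // shows which member is PRIMARY and which are SECONDARY or ARBITER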

Creating a MongoDB Shard

mongodb_shard { 'shard1':
  ensure  => present,
  members => ['shard1/host1:27017', 'shard1/host2:27017', 'shard1/host3:27017'],
  keys    => 'price',
}

name: defines the name of the shard.

members: this is the array of members the shard will hold.

keys: define the key to be used in the sharding or an array of keys that can be used to create a compound shard key.
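
After the shard resource has been applied, the cluster membership can be confirmed from a mongos; again, this is just a verification step and not part of the Puppet module.

sh.status()                          // summary of shards, databases and chunk distribution
db.adminCommand({ listShards: 1 })   // raw list of the shards registered with the cluster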
