Translate

Saturday, April 20, 2013

Why MongoDB doesn’t need Transactions and Constraints

Eye brows may be raised when we say MongoDB doesn’t support transactions and constraints. First let us look at why we need transactions and constraints in relational databases.

Let us think about a simple blog system in relational databases.

image

So in relational, one transaction need to be saved in three tables. So there can be a scenario where post will get updated and tags will not be updated.  To avoid this, you can run all the statements in a transaction.

Also, there can be another scenario where user can insert into tags without inserting into posts table. So the constraints are used to avoid such situations.

How are we saving this data in MongoDB.

image

So in MongoDB, tags and comments are embedded to the posts documents so that above scenarios will not exist. 

18 comments:

  1. MongoDB LinkedIN group
    First, the example is too simple. Rarely in a real life system you can fit everything in a collection.
    Secondly, let's make clear that transactions and constraints are desirable.
    Most NOSQL solutions don't implement them because they are difficult and would affect other characteristics of the DB like performance, replication, ... which are the reasons why people choose their solutions in the first place.
    MongoDB could easily implement transactions in a non-replicated, non-sharded environment, but 10gen choose not to. Their position is that no feature should be available only in a restricted mode. The target audience for MongoDB is large datasets that are replicated.

    By Daniel Coupal

    ReplyDelete
  2. MongoDB LinkedIN group


    Dinesh has a valid point. I don't think he was arguing whether transactions are good are bad (overlooking his obviously controversial title) - simply that for his example, he was able to use MongoDB and implement the requirements without any loss in data fidelity. In many cases, it might be able to eliminate the need for multi-table transactions but still support the exact same features, but you might need to re-examine your data model.

    Compared to most of the NoSQL solutions available, MongoDB probably comes the closest to having "transactions"... findAndUpdate is an atomic operation (as well as insert, and $set, etc.), and is consistent. This means that a simple form of transactions can be implemented in MongoDB. "MongoDB in Action" by Manning press shows how an e-commerce system can be implemented in MongoDB. (You could even implement a two-phase commit if you wanted to). However, I doubt that makes much sense since it's easier to do that sort of thing with a database that supports multi-table transactions if that's what you need.

    I agree with Daniel that NoSQL is really about large datasets. As the dataset grows larger, the need for denormalizing also grows, which is what NoSQL is all (mostly) about. But I've been surprised about how versatile NoSQL (MongoDB especially) is and have used it with good results in many applications that didn't immediately appear to be NoSQLish (I made up a new word :) )

    By Thomas O'Rourke

    ReplyDelete
  3. The question isn't whether MongoDB needs Transaction. MongoDB is just another data storage tool. The usefulness of transactions depends on your application and the ways its designed. We could easily transfer the example given to a RDBMS and implement it without the need for transactions.

    In my many years of working with databases (RDBMS and non-RDBMS) I'd like to be able to turn all sorts of features like transactionality on and off depending what I'm working on at that moment.

    In fact I feel the traditional DB sellers are missing an opportunity here. There are times where its useful to relax transactionality, locking, deadlock detection, referential integrity, etc as they're already being handled by the application.

    Likewise with NOSQL there are times when we don't need the features and there are times when we do. NOSQL isnt any different from a db - we have data, methods of representing it and storage tools - the similarities far out way the differences.

    ReplyDelete
  4. Tim Hawkins • You can easily mix mongo and mysql in the same design, we have 2.5 million skus in our ecomm system, with a mix of both.

    Opensky have a great description of something simular, also hybrid.
    http://www.slideshare.net/jwage/opensky-infrastructure

    ReplyDelete
  5. Aaron Rosenbaum • I disagree, even in Dinesh's simple model, that you can do without transactions.
    What if the primary author changes his username? This is supported in most blog tools. How do you make that happen and not lose records? And, of course, when you change system configuration, you also need ACID. MongoDB keeps its config separate from MongoDB's main datastore in a system that does use two-phase commit, for example.
    Data normalization, in some cases, can decrease the need for ACID...if keys exist in denormalized tuples (user names, security amps, etc), you are going to need ACID multi-record transactions to update those tuples. I think you would be hard pressed to implement any schema without some notion of keys...really hard to find data without them...

    ReplyDelete
  6. Tim Hawkins • Db.posts.update({author:"oldname"}, {$set: {author:"newname"}})

    These type of events (changing username) are quite rare, and dont happen at scale, so they can be delt with at the application layer.

    ReplyDelete
  7. Aaron Rosenbaum • But then the application layer needs to deal with rollback, commit, etc....Just saying "application layer" doesn't reduce the level of effort necessary to implement a TP monitor.

    ReplyDelete
  8. Tim Hawkins • Is very rare, would probably constitute only a small fraction of a percent of the total traffic.

    ReplyDelete
  9. Tim Hawkins •
    Rollback, transactions, ..... On a blog?

    99% of the worlds blogs run on mysql, and of those the vast majority have myisam storage engines under them, its not even remotely transactional or acid complient

    ReplyDelete
  10. Aaron Rosenbaum •
    Many of those have no High Availability, cannot sustain a single data center disaster and I'm sure many don't even have proper backup plans. That fact doesn't mean a database shouldn't have backup capability.

    ReplyDelete
  11. Aaron Rosenbaum •
    I absolutely agree - you shouldn't use begin/end/commit on single row writes and reads...I think the authors standpoint was that they were unnecessary, not seldom used.

    ReplyDelete
  12. Tim Hawkins •
    I think its that they are unnecessary because they are so seldom used. Part of the mongo philosophy is that you don't need functionality that is essentially designed for accounting and banking, in a database system designed for web applications, the use cases for transactions in most web based systems are so small that there is no point in supporting them in a product aimed at websites etc. even in the case of systems like eCommerce that you would think needs transactions, they are often not required. Most retailers cant track their stock so accurately anyway, thats why cycle counting was invented.

    In the rare cases where transactional capabilities are required, then hybrid systems are a perfectly acceptable solution.

    ReplyDelete
  13. Thomas O'Rourke •
    What are transactions? A simple definition is "All or nothing", i.e. if you have to update 3 different tables, if anything fails, don't update any of them - that's what this discussion is about.

    So, in NoSQL one way to design this is to say, instead of having 3 tables and relations, what if I chunk 3 table's data all into one table. That way I will get my all or nothing, because all the changes all get inserted or they don't! This is Denormalization - In general Denormalization reduces the need for transactions, whereas Normalization increases the need. (Aaron had those backwards I think?)...

    What if you really need 3 tables because you can't chunk all the data together. Then a second common NoSQL strategy is to (as Tim pointed out), handle failure. For example, if you try to insert a blog and a tag at the same time, and the blog insert succeeds, but the tag fails, you can always go back and delete the blog. Or how about this, why fail inserting a blog simply because a tag fails (Maybe because of some invalid characters or something)? Insert the blog, and tell the user the tag failed. If the user finds that unacceptable, then let them remove the blog - they own it anyways. You can add a "publish button" if you want to wait for everything to be correct before it goes live (most systems do this anyways). Many real world systems often have to deal with some amount of uncertainty anyways (many can't).

    If neither of these techniques works - well, then use a transactional database. Or create a hybrid system. Don't choose a transactional database engine simply because it's the default, or because the idea of having some amount of de-normalization seems non-optimal at first glance.

    Now all the relational db people are thinking - Why go through all the trouble of implementing in NoSQL? Because there are big gains in terms of scalability and performance (and simplicity) - and many times it's actually easier. As far as high availability, data-center disaster, etc.. It's insanely simple to set up a replica set in MongoDb which will handle automatic failover, etc., without any loss of data - this is one of main benefits of NoSQL systems.

    (As Tim pointed out) Blogs tend to be primarily insert only and have strong ownership - which means it's pretty easy to implement them with NoSQL techniques even if you have multiple tables. A lot of db developers have a light-bulb moment when they realize that a lot of what they were implementing doesn't require transactions (which doesn't mean that a lot of what they do *does* require transactions).

    ReplyDelete
  14. Tim Hawkins •
    If you checkout my blog, ( http://tshawkins.blogspot.com ), you can grab a copy of a presentation i gave only last week on mongo, and the example we used was the 3 table blog vs single collection.

    ReplyDelete
  15. Aaron Rosenbaum •
    My comment about denormalization vs. normalization was wrt key/broad changes. If data is duplicated - not normalized - then on insert, single write/no transactions. But on update, you need to change every duplicate of the data - you need transactions.

    In a normalized system, you change a users name in one place and all related data to that users ID can pickup the new name. In a denormalized system where every record stores the users name, you might need a transaction to keep the name consistent across all records. For a single insert of a new record with a new name, in relational you would need multi-record transactions, in a NoSQL system, you would only need a single insert.

    I would take issue that Storing data relationally vs. Denormalized tuples has anything in particular to do with whether or not multi-statement transactions are necessary.

    ReplyDelete
  16. Tim Hawkins •
    I think you have it the wrong way around, as Thomas pointed out denormalisation further reduces the need for transactions, its not that denormalisation is done to remove transactions. Denormalisation is normally done to improve performance, and gather all the run tine data into single documents, so that the number of sub-selects is reduced, and techniques like sharding can be efficiently performed. Its just fortunate that it also eliminates the need for transactions.

    Incidentally the same drivers also reduce the need for referential integrity, as most documents become self contained, which leads to interesting things like reducing the need for point in time snapshots for backups. If you are willing to accept that a backup is not a point in time, and the objects state could be spread across several hours, then non blocking backup schemes become feasible. Point in time largely is needed to support data-systems that have a lot of cross referenced data.

    ReplyDelete
  17. Thomas O'Rourke •
    I see your point about update - this is true, but it can be handled correctly in NoSQL.

    A lot of times we might just say a user can't change their username. For example, if you had stored a shipping address with an order, you don't want to change that anyways, because it is history - just like a user name in a blog. I don't want people changing their user name also because all the search engines will now be wrong and they are external systems. Try changing your linkedin username for example.

    Then again if I really need to change the user name (doesn't happen very often for obvious reasons), you can always either keep a history of all the previous user names (which is useful and probably necessary if you allow users to change their username in a blog), or I can handle it by using atomic and consistent properties of mongo inserts/updates. I would update a record that says "user name is changing from a to b". While this is present no other change is possible. And once I am satisfied that the users old name is changed in the three tables, then I will remove the flag. This also has the advantage of working with remote databases where transactions might not be possible (for example I need to make REST call to some external service which is really common in web systems).

    It's common in NoSQL to have lots of background processes running that are doing things to keep the data consistent. Often in a queue implemented in the database itself.

    When you do a findAndUpdate in mongo you can be sure it completes successfully and that it was atomic and consistent. You can select the level of certainty you want, all the way from "i don't care" to "propagated to at least n replicas and written to journal".

    ReplyDelete
  18. Thomas O'Rourke •
    Just adding to my previous comment since the time to edit expired (yet another example of a NoSQL strategy)...

    A common way is actually have a relation, for example store the _id of the user in the blog entry. But also store the username also so that you can display it without looking up the user in the related table. If the username changes, you can propagate it to all the locations where it is 'cached'. Of course you can do this same thing in a relational database. But why do I want to lock thousands of rows (or tables) to update something that it's ok if it takes a few minutes to propagate. (People might see my old username for a few second/minutes, what does it matter they've been looking at it for years...)

    Of course, there is no way out of a double failure - i.e. the change fails somewhere, and we unable to do the undo the already completed changes for some unexpected reason. This might leave one table changed, and the other not. But if I can successfully change my username in the users table, then it should be possible to change it in all the blog comments, otherwise something is really broken and needs fixing. That's life in the NoSQL world...

    ReplyDelete