BSON - Diving Deep
- Rishaab
- Jul 6, 2021
Updated: Jul 16
BSON stands for Binary JSON, a serialization format for binary-encoding JSON-like documents. BSON was developed at MongoDB in 2009 and is primarily used internally for storing documents. BSON also supports data types that JSON does not natively support (for example dates and raw binary data), which gives the MongoDB query language a rich set of data types to work with.
BSON is similar to Google's Protocol Buffers, which also serializes and deserializes data. Unlike Protocol Buffers, BSON does not use an IDL (interface definition language) to describe the data; this offers greater flexibility at the cost of being less space-efficient.
Installation
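A standalone BSON package can be installed on its own and does not require MongoDB to be installed. For illustration purposes, we will use Python and install the bson package with pip using the command below. Note that this standalone package is distinct from the bson module bundled with PyMongo, which exposes a different API, so the two should not be installed side by side.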
pip install bson
Open the Python shell and import the package using the following command:
import bson
We can now dump the BSON representation of a JSON-like object using the command below:
bson.dumps(<JSON object>)
Representation
A BSON document is essentially an ordered list of key-value pairs with additional information such as type and size. Each key-value pair is called a BSON Element. Binary numeric data is stored in little-endian format.
A BSON document is represented as,
<total size (int32)> <BSON Element>* <NULL byte>
where the first field is the total size of the BSON document in bytes, stored as an int32. It is followed by 0 or more BSON Element(s) and finally a NULL byte as a terminator.
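The framing described above can be checked with nothing but Python's standard struct module. The sketch below is only an illustration; the helper name split_document is hypothetical and not part of any BSON library.

```python
import struct

def split_document(data: bytes):
    """Split a BSON document into (declared_size, element_bytes)."""
    # The first four bytes are a little-endian int32 holding the total size.
    (size,) = struct.unpack_from("<i", data, 0)
    if size != len(data):
        raise ValueError("declared size does not match the buffer length")
    if data[-1] != 0:
        raise ValueError("document must end with a NULL byte")
    # Everything between the size field and the trailing NULL is the element list.
    return size, data[4:-1]

size, elements = split_document(b"\x05\x00\x00\x00\x00")
print(size, elements)  # 5 b''
```

The smallest valid document is therefore 5 bytes long: four bytes of size plus the terminating NULL byte.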
A BSON Element is represented as,
<type (1 byte)> <field name (C-style string)> <value>
where the first field is the data type of the Value field, followed by the field name represented as a C-style (NULL-terminated) string, and then the value, whose layout depends on the data type and which may be empty for some types. The value itself can also be a BSON document, in which case the same structure is followed recursively.
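To make the element layout concrete, here is a minimal hand-rolled encoder for a single UTF-8 string element, assuming only the layout described above. The function names encode_string_element and wrap_document are hypothetical helpers written for this illustration; they are not the API of the bson package.

```python
import struct

def encode_string_element(name: str, value: str) -> bytes:
    """Encode one UTF-8 string element: type, field name, value size, value."""
    name_bytes = name.encode("utf-8") + b"\x00"    # C-style field name
    value_bytes = value.encode("utf-8") + b"\x00"  # string value with terminator
    return (
        b"\x02"                                    # type 0x02: UTF-8 string
        + name_bytes
        + struct.pack("<i", len(value_bytes))      # size includes the terminator
        + value_bytes
    )

def wrap_document(elements: bytes) -> bytes:
    """Put the int32 total size in front and the NULL byte at the end."""
    total = 4 + len(elements) + 1
    return struct.pack("<i", total) + elements + b"\x00"

doc = wrap_document(encode_string_element("key", "value"))
print(doc)  # b'\x14\x00\x00\x00\x02key\x00\x06\x00\x00\x00value\x00\x00'
```

Note that the string's size field counts the trailing NULL byte of the value, which is why "value" (5 characters) is stored with a size of 6.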
Now that we have a high-level understanding of the format, let's look at some hex representations to understand it better. We will use bson.dumps for this.
Note: The examples below are best viewed on a computer.
Example 1 - Empty document
>>> bson.dumps({})
b'\x05\x00\x00\x00\x00'
Since there is no data in the document, the BSON contains only the size field and the terminating NULL byte.
Below is the breakdown of the output,
Total size (5 bytes) \x05\x00\x00\x00
NULL byte \x00
Example 2 - Single BSON Element
>>> bson.dumps({"key": "value"})
b'\x14\x00\x00\x00\x02key\x00\x06\x00\x00\x00value\x00\x00'
The breakdown is as follows,
Total size (20 bytes) \x14\x00\x00\x00
Type (UTF-8 string) \x02
Field name key\x00
Value size (6 bytes) \x06\x00\x00\x00
Value value\x00
NULL byte \x00
Example 3 - Nested BSON documents
>>> bson.dumps({"key1": {"key2": "value2"}})
b'!\x00\x00\x00\x03key1\x00\x16\x00\x00\x00\x02key2\x00\x07\x00\x00\x00value2\x00\x00\x00'
NOTE: The first byte is shown as `!` because Python prints bytes in the printable ASCII range as characters. `!` is ASCII code 0x21, which is 33 in decimal, the total size of the document.
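A quick way to confirm this reading of the first four bytes, using only built-in Python functions:

```python
raw = b"!\x00\x00\x00"

# The printable character and the hex escape are the same byte.
print(raw == b"\x21\x00\x00\x00")     # True
print(hex(ord("!")))                  # 0x21

# Interpreting the four bytes as a little-endian integer gives the size.
print(int.from_bytes(raw, "little"))  # 33
```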
The breakdown is as follows,
Total size (33 bytes) !\x00\x00\x00
Type (Embedded document) \x03
Field name key1\x00
Value (Embedded document)
    Total size (22 bytes) \x16\x00\x00\x00
    Type (UTF-8 string) \x02
    Field name key2\x00
    Value size (7 bytes) \x07\x00\x00\x00
    Value value2\x00
    NULL byte \x00
NULL byte \x00
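The recursive structure of nested documents can be sketched as a small decoder. This is a minimal illustration that handles only UTF-8 strings (\x02) and embedded documents (\x03); decode_document is a hypothetical helper, not a function of the bson package.

```python
import struct

def decode_document(data: bytes) -> dict:
    """Decode a BSON document containing only strings and embedded documents."""
    (size,) = struct.unpack_from("<i", data, 0)
    if size != len(data) or data[-1] != 0:
        raise ValueError("malformed document")
    result = {}
    pos = 4
    end = size - 1  # index of the trailing NULL byte
    while pos < end:
        type_byte = data[pos]
        pos += 1
        name_end = data.index(b"\x00", pos)       # field name is NULL-terminated
        name = data[pos:name_end].decode("utf-8")
        pos = name_end + 1
        if type_byte == 0x02:                     # UTF-8 string
            (length,) = struct.unpack_from("<i", data, pos)
            pos += 4
            result[name] = data[pos:pos + length - 1].decode("utf-8")
            pos += length
        elif type_byte == 0x03:                   # embedded document: recurse
            (inner_size,) = struct.unpack_from("<i", data, pos)
            result[name] = decode_document(data[pos:pos + inner_size])
            pos += inner_size
        else:
            raise ValueError(f"unsupported type 0x{type_byte:02x}")
    return result

nested = (b"!\x00\x00\x00\x03key1\x00\x16\x00\x00\x00"
          b"\x02key2\x00\x07\x00\x00\x00value2\x00\x00\x00")
print(decode_document(nested))  # {'key1': {'key2': 'value2'}}
```

The embedded document carries its own size (22 bytes) and its own terminating NULL byte, so the same function can be applied to it unchanged.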
Example 4 - Single BSON Element Array
Arrays are interesting in that each element of the array is internally translated to a key-value pair whose key is the element's index written as a string, i.e. the array [1, 2] is represented as {"0": 1, "1": 2}, and this is stored as a new embedded BSON document with the array type (\x04).
Let's see a sample
>>> bson.dumps({"key": ["value1", "value2"]})
b'+\x00\x00\x00\x04key\x00!\x00\x00\x00\x020\x00\x07\x00\x00\x00value1\x00\x021\x00\x07\x00\x00\x00value2\x00\x00\x00'
Following is the breakdown,
Total size (43 bytes) +\x00\x00\x00
Type (Array) \x04
Field name key\x00
Value (Array)
    Total size (33 bytes) !\x00\x00\x00
    Element 1
        Type (UTF-8 string) \x02
        Field (Index 0) 0\x00
        Value size (7 bytes) \x07\x00\x00\x00
        Value value1\x00
    Element 2
        Type (UTF-8 string) \x02
        Field (Index 1) 1\x00
        Value size (7 bytes) \x07\x00\x00\x00
        Value value2\x00
    NULL byte \x00
NULL byte \x00
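The index-as-key translation can be reproduced by hand. The sketch below builds the same bytes for an array of strings using only the struct module; all three function names are hypothetical helpers for this illustration and do not come from the bson package.

```python
import struct

def encode_string_element(name: str, value: str) -> bytes:
    """Type 0x02, NULL-terminated name, int32 size, NULL-terminated value."""
    value_bytes = value.encode("utf-8") + b"\x00"
    return (b"\x02" + name.encode("utf-8") + b"\x00"
            + struct.pack("<i", len(value_bytes)) + value_bytes)

def wrap_document(elements: bytes) -> bytes:
    """Add the int32 total size in front and the NULL byte at the end."""
    return struct.pack("<i", 4 + len(elements) + 1) + elements + b"\x00"

def encode_string_array_element(name: str, values: list) -> bytes:
    """Type 0x04 element whose value is a document keyed by "0", "1", ..."""
    items = b"".join(
        encode_string_element(str(index), value)
        for index, value in enumerate(values)
    )
    return b"\x04" + name.encode("utf-8") + b"\x00" + wrap_document(items)

doc = wrap_document(encode_string_array_element("key", ["value1", "value2"]))
print(doc)
print(len(doc))  # 43
```

The only difference from an embedded document is the type byte (\x04 instead of \x03) and the convention that field names are the decimal indexes in order.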
Drivers and Server
MongoDB client drivers are responsible for serializing objects to BSON and deserializing them back. The MongoDB server also does this internally and provides classes to access the fields within BSON objects.
The BSONObj class in the MongoDB server codebase provides a C++ representation of BSON objects. A BSONObj can be traversed using different types of iterators. Each iterator visits the BSON Elements in order, and each element exposes its field name and value.
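The idea of walking a document element by element can be sketched in Python as a generator. This is only an analogy for illustration: it is not the C++ BSONObj API, the name iter_elements is hypothetical, and it handles just strings (\x02), embedded documents (\x03) and arrays (\x04).

```python
import struct

def iter_elements(data: bytes):
    """Yield (type_byte, field_name, raw_value) for each top-level element."""
    (size,) = struct.unpack_from("<i", data, 0)
    pos, end = 4, size - 1              # skip the size field, stop at the final NULL
    while pos < end:
        type_byte = data[pos]
        name_end = data.index(b"\x00", pos + 1)
        name = data[pos + 1:name_end].decode("utf-8")
        pos = name_end + 1
        (length,) = struct.unpack_from("<i", data, pos)
        if type_byte == 0x02:
            value_len = 4 + length      # int32 prefix plus the string bytes
        elif type_byte in (0x03, 0x04):
            value_len = length          # the size already covers the sub-document
        else:
            raise ValueError(f"unsupported type 0x{type_byte:02x}")
        yield type_byte, name, data[pos:pos + value_len]
        pos += value_len

doc = (b"+\x00\x00\x00\x04key\x00!\x00\x00\x00"
       b"\x020\x00\x07\x00\x00\x00value1\x00"
       b"\x021\x00\x07\x00\x00\x00value2\x00\x00\x00")

for type_byte, name, raw in iter_elements(doc):
    print(hex(type_byte), name, len(raw))        # 0x4 key 33
    # The raw value of an array is itself a document, so it can be walked too.
    for inner_type, index, _ in iter_elements(raw):
        print(hex(inner_type), index)
```

Because each element only needs its type and size to be skipped, a traversal like this never has to decode values it is not interested in.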
The codebase for the same is available here,
I hope you find this useful. If you like this article, please subscribe to my blog.