
BSON - Diving Deep

  • Writer: Rishaab
  • Jul 6, 2021
  • 3 min read

Updated: Jul 16


BSON (Binary JSON) is a binary serialization format for JSON-like documents. It was developed at MongoDB in 2009 and is primarily used internally for storing documents. BSON also supports data types that JSON does not support natively, which gives the MongoDB query language a richer set of data types.


BSON is similar to Google's Protocol Buffers, which also serializes and deserializes data. Unlike Protocol Buffers, BSON does not use an interface definition language (IDL) to describe the data, so it offers greater flexibility at the cost of being less space-efficient.



Installation

BSON can be used on its own and does not require MongoDB to be installed. For illustration purposes, we will use Python and install the bson package with pip using the command below.

pip install bson

Open the Python shell and import the package using the following command,

import bson

We can now dump the BSON representation of a JSON-like object using the command below,

bson.dumps(<JSON object>)

Representation


A BSON document is essentially an ordered list of key-value pairs with additional information such as type and size. Each key-value pair is called a BSON Element. The binary data is stored in little-endian format.


A BSON document is represented as,

[ Total size (int32) ] [ BSON Element(s), 0 or more ] [ NULL byte (\x00) ]

where the first field is the total size of the BSON document in bytes, stored as an int32 (4 bytes) and counting the size field and the trailing NULL byte themselves. It is followed by 0 or more BSON Element(s) and finally a NULL byte as a terminator.
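The outer framing described above can be checked with a few lines of standard-library Python. This is only a sketch: `check_envelope` is a hypothetical helper name, not part of the bson package, and it looks only at the size prefix and the trailing NULL byte, not at the elements in between.

```python
import struct


def check_envelope(data: bytes) -> int:
    """Validate the outer framing of a BSON document and return its declared size."""
    if len(data) < 5:
        raise ValueError("a BSON document is at least 5 bytes long")
    # "<i" reads a little-endian signed 32-bit integer.
    (size,) = struct.unpack_from("<i", data, 0)
    if size != len(data):
        raise ValueError(f"declared size {size} != actual length {len(data)}")
    if data[-1] != 0:
        raise ValueError("missing trailing NULL byte")
    return size


# The smallest possible document: 4 size bytes plus the NULL byte.
print(check_envelope(b"\x05\x00\x00\x00\x00"))  # 5
```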


A BSON Element is represented as,

[ Type (1 byte) ] [ Field name (C-style string) ] [ Value (optional) ]

where the first field is the data type of the Value field, followed by the field name represented as a C-style (null-terminated) string and then the optional value, whose layout depends on the data type. The value can itself be a BSON document, in which case the same structure is followed.
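To make the element layout concrete, here is a minimal sketch that walks the elements of one document using only the `struct` module. The name `iter_elements` and the three type constants are assumptions made for this illustration, and only string, embedded-document and array values are handled; every other type raises an error.

```python
import struct

STRING, DOCUMENT, ARRAY = 0x02, 0x03, 0x04


def iter_elements(doc: bytes):
    """Yield (type, field_name, raw_value) for each element of one BSON document."""
    (size,) = struct.unpack_from("<i", doc, 0)
    pos = 4          # skip the int32 size prefix
    end = size - 1   # index of the document's trailing NULL byte
    while pos < end:
        elem_type = doc[pos]
        pos += 1
        # The field name is a C-style string: read up to the next NULL byte.
        name_end = doc.index(b"\x00", pos)
        name = doc[pos:name_end].decode("utf-8")
        pos = name_end + 1
        (length,) = struct.unpack_from("<i", doc, pos)
        if elem_type == STRING:
            # The int32 counts the string bytes plus their NULL terminator.
            value = doc[pos + 4 : pos + 4 + length]
            pos += 4 + length
        elif elem_type in (DOCUMENT, ARRAY):
            # The int32 is the nested document's total size, itself included.
            value = doc[pos : pos + length]
            pos += length
        else:
            raise NotImplementedError(f"type 0x{elem_type:02x} is not handled in this sketch")
        yield elem_type, name, value


sample = b"\x14\x00\x00\x00\x02key\x00\x06\x00\x00\x00value\x00\x00"
print(list(iter_elements(sample)))  # [(2, 'key', b'value\x00')]
```

For a nested value, the raw bytes yielded for the element are themselves a complete document, so the same function can be called on them again.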



Now that we have a high-level understanding of the format, let's look at some hex representations to understand it better. We will use bson.dumps for this.


Note: The examples below are best viewed on a computer.



Example 1 - Empty document

>>> bson.dumps({})
b'\x05\x00\x00\x00\x00'

Since there is no data in the document, the BSON contains only the size field and the terminating NULL byte.


Below is the breakup of the output

Total size (5 bytes) 		\x05\x00\x00\x00 
NULL byte  					\x00


Example 2 - Single BSON Element

>>> bson.dumps({"key": "value"})
b'\x14\x00\x00\x00\x02key\x00\x06\x00\x00\x00value\x00\x00'

The breakup is as follows,

Total size (20 bytes) 		\x14\x00\x00\x00
Type (UTF-8 string) 			\x02
Field name 					key\x00
Value size (6 bytes)			    \x06\x00\x00\x00
Value							value\x00	
NULL byte 					\x00
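The same 20 bytes can be assembled by hand, which is a useful way to confirm the breakup above. The helpers `cstring`, `string_element` and `document` are hypothetical names written for this sketch; they are not functions of the bson package.

```python
import struct


def cstring(text: str) -> bytes:
    """Encode text as a C-style (NULL-terminated) UTF-8 string."""
    return text.encode("utf-8") + b"\x00"


def string_element(name: str, value: str) -> bytes:
    """Type byte 0x02, field name, int32 value size, then the value itself."""
    payload = cstring(value)
    return b"\x02" + cstring(name) + struct.pack("<i", len(payload)) + payload


def document(*elements: bytes) -> bytes:
    """Wrap elements with the int32 total size and the trailing NULL byte."""
    body = b"".join(elements) + b"\x00"
    return struct.pack("<i", 4 + len(body)) + body


encoded = document(string_element("key", "value"))
print(encoded == b"\x14\x00\x00\x00\x02key\x00\x06\x00\x00\x00value\x00\x00")  # True
```

Note that the value size (6) counts the five characters of "value" plus its NULL terminator, while the total size (20) counts its own four bytes as well.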


Example 3 - Nested BSON documents

>>> bson.dumps({"key1": {"key2": "value2"}})
b'!\x00\x00\x00\x03key1\x00\x16\x00\x00\x00\x02key2\x00\x07\x00\x00\x00value2\x00\x00\x00'

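NOTE: The first byte is shown as `!` because Python prints bytes that map to printable ASCII characters as those characters. The ASCII code of `!` is 0x21, which is 33 in decimal, and that is the total size.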

The breakup is as follows,

Total size (33 bytes)			!\x00\x00\x00		
Type (Embedded document) 		\x03
Field name 						key1\x00
Value (Embedded document)
	Total size (22 bytes)				\x16\x00\x00\x00
	Type (UTF-8 string)				\x02	
	Field name 						key2\x00
	Value size (7 bytes)					\x07\x00\x00\x00
	Value								value2\x00
	NULL byte						\x00
NULL byte						\x00
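The two size fields in this example can be read back directly from the bytes. This is a small sketch using `struct` from the standard library; the offset 10 is computed by hand from the layout (4 size bytes, 1 type byte and the 5 bytes of "key1" with its NULL terminator).

```python
import struct

data = (b"!\x00\x00\x00\x03key1\x00\x16\x00\x00\x00"
        b"\x02key2\x00\x07\x00\x00\x00value2\x00\x00\x00")

# Outer document size: the first four bytes, little-endian.
outer_size = struct.unpack_from("<i", data, 0)[0]

# The embedded document starts at offset 4 + 1 + 5 = 10.
inner_size = struct.unpack_from("<i", data, 10)[0]

print(hex(ord("!")), outer_size, inner_size)  # 0x21 33 22
```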


Example 4 - Single BSON Element Array

Arrays are interesting in the sense that internally each element of the array gets translated to a key-value pair, where the key is the index of the element written as a string, i.e. the array [1, 2] will be represented as {"0": 1, "1": 2}, and this is stored as a new embedded BSON document.


Let's see a sample

>>> bson.dumps({"key": ["value1", "value2"]})
b'+\x00\x00\x00\x04key\x00!\x00\x00\x00\x020\x00\x07\x00\x00\x00value1\x00\x021\x00\x07\x00\x00\x00value2\x00\x00\x00'

Following is the breakup,

Total size (43 bytes)			+\x00\x00\x00
Type (Array)					\x04
Field name						key\x00
Value (Array)
	Total size (33 bytes)			!\x00\x00\x00
		Element 1
			Type						\x02
			Field (Index 0)				0\x00
			Value size (7 bytes)				\x07\x00\x00\x00
			Value							value1\x00
		Element 2
			Type						\x02
			Field (Index 1)				1\x00
			Value size (7 bytes)				\x07\x00\x00\x00
			Value							value2\x00
	NULL byte						\x00
NULL byte						\x00
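The array encoding can be reproduced by hand to confirm that an array is just an embedded document whose keys are the decimal indexes. The helper names below (`cstring`, `string_element`, `array_element`, `document`) are hypothetical and written only for this sketch, which handles arrays of strings and nothing else.

```python
import struct


def cstring(text: str) -> bytes:
    """Encode text as a C-style (NULL-terminated) UTF-8 string."""
    return text.encode("utf-8") + b"\x00"


def string_element(name: str, value: str) -> bytes:
    payload = cstring(value)
    return b"\x02" + cstring(name) + struct.pack("<i", len(payload)) + payload


def document(elements: list) -> bytes:
    body = b"".join(elements) + b"\x00"
    return struct.pack("<i", 4 + len(body)) + body


def array_element(name: str, values: list) -> bytes:
    # Each array item becomes an element keyed by its index: "0", "1", ...
    items = [string_element(str(i), v) for i, v in enumerate(values)]
    # Type byte 0x04 marks the value as an array; the value is a document.
    return b"\x04" + cstring(name) + document(items)


encoded = document([array_element("key", ["value1", "value2"])])
expected = (b"+\x00\x00\x00\x04key\x00!\x00\x00\x00"
            b"\x020\x00\x07\x00\x00\x00value1\x00"
            b"\x021\x00\x07\x00\x00\x00value2\x00\x00\x00")
print(encoded == expected, len(encoded))  # True 43
```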


Drivers and Server

MongoDB client drivers are responsible for serializing objects to BSON and deserializing them back. The MongoDB server also does this internally and provides classes to access the fields within BSON objects.


The BSONObj class in the MongoDB server codebase provides a C++ representation of BSON objects. A BSONObj can be traversed using different types of iterators. An iterator visits each BSON Element in turn, and each element contains a field name and its value.


The codebase for the same is available here,

I hope you find this useful. If you like this article, please subscribe to my blog.

©2023 by Rishab Joshi.