How to read a MongoDB into Pandas DataFrame

MongoDB collections consists of binary JSON objects, the reading of which in Python is well covered here. However, I did not find a starightforward way to read the JSON objects into DataFrames, so here is one way I had found to complete the task.

Notice, that Pandas Series object behaves very similar to a Python Dict. However, the index of Series, unlike the keys of a dictionary, preserve the order. This implies that we can go through MongoDB, iterate over every JSON, and record the values by sequentially creating hierarchical Series index of tuples.


The idea is that, if, for instance, if we have dictionary-list object like this:

from pandas import DataFrame
{1 : [{'keyword1': { 'conversions1': 2,  'cost1': 1}}],
 2 : [{'keyword1': { 'value': 3 },
       'keyword2': { 'nan': 4 }}],
 3 : [{'keyword2': {'value': 5}}] }

then we can write a loop that converts it into a "dictionary" (albeit with complex ordered keys) like this:

d = {(1, 'keyword1', 'conversions1'): 2,
     (1, 'keyword1', 'cost1'): 1,
     (2, 'keyword1', 'value'): 3,
     (2, 'keyword2', nan): 4,
     (3, 'keyword2', 'value'): 5}

and then use unstack() method to create a DataFrame. In reality we don't have a dictionary with controllable order of keys, so we just define list of tuples, and a list of values:

tuples = [(1, 'keyword1', 'conversions1'),
          (1, 'keyword1', 'cost1'),
          (2, 'keyword1', 'value'),
          (2, 'keyword2', nan),
          (3, 'keyword2', 'value')]
values = [1, 2, 3, 4, 5]
names = ['dates', 'keywords', 'attributes']
s = DataFrame(values, index=pd.MultiIndex.from_tuples(tuples, names=names))[0]

Then .unstack the desired slices:

df = s.unstack(['keywords','attributes'])
df1 = s.ix[:,'keyword1'].unstack('attributes')
df2 = s.ix[:,:,'value'].unstack('keywords')

By the way, from Pandas DataFrame then you can then easily get R data.frame:

import pandas.rpy.common as com
r_df = com.convert_to_r_dataframe(df)

Here is the full example code.


Here is a real example from my application:

from pymongo import MongoClient
from bson.objectid import ObjectId
import pandas as pd
# Connecting to database
c = MongoClient()
# Querying the database
query = {} #"account_id" : "XXX-YYY-ZZZZ"}
cursor = c.adwords['keywords'].find(query) #.limit(3)
source = list(cursor)
tuples = []
values = []
names = ['Date', 'AccountId', 'CampaignId', 'AdGroupId', 'KeywordId', 'Attribute']
# To see all attributes: source[0]['keyword_list'][0].keys()
for o, observation in enumerate(source): # source[:2]
    for k, keyword in enumerate(observation['keyword_list']):
        for a, attr in enumerate(keyword):
            if attr not in names:
                        keyword['Campaign'], \
                        keyword['AdGroup'], \
                        keyword['KeywordId'], \
s = pd.DataFrame(values, index=pd.MultiIndex.from_tuples(tuples, names=names))[0]