Converting Multi-Nested Dictionaries to a pandas DataFrame Using Data Manipulation

Converting a List of Multi-Nested Dictionaries to a pandas DataFrame

As data engineers and analysts, we often encounter complex data structures that require careful manipulation before being converted into a suitable format for analysis or visualization. In this article, we will explore the process of converting a list of multi-nested dictionaries to a pandas DataFrame.

Understanding the Problem

The problem at hand involves a list of nested dictionaries, where each dictionary represents a game with statistics about the teams involved. The goal is to convert this data into a pandas DataFrame that can be easily analyzed or visualized.

Here’s an example of what the data might look like:

game_stats = [
    {
        'id': 401282099,
        'teams': [
            {'conference': 'SEC', 'homeAway': 'away', 'points': 21, 'school': 'LSU', 'stats': [
                {'category': 'rushingTDs', 'stat': '2'},
                {'category': 'passingTDs', 'stat': '1'},
                {'category': 'kickingPoints', 'stat': '3'},
                {'category': 'fumblesRecovered', 'stat': '0'},
                {'category': 'firstDowns', 'stat': '22'}
            ]}
        ],
        'conference': 'SEC',
        'homeAway': 'home',
        'points': 42,
        'school': 'Kentucky'
    }
]

Exploring the Solution

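We can do this conversion in four steps with pandas: flatten the teams list with pd.json_normalize(), split each stats list into rows with explode(), expand each stat dictionary into its own columns, and pivot the stat categories into columns.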

First, let’s import pandas, the only library we need:

import pandas as pd

Step 1: Converting the List of Dictionaries to a DataFrame

We can use pd.json_normalize() to flatten our list of dictionaries into a pandas DataFrame. Here we pass it three arguments:

  • data: the list of dictionaries
  • record_path: the key that holds the nested list of records (teams)
  • meta: the top-level field to copy onto every record (id)

Here’s how we can do it:

df = pd.json_normalize(game_stats, record_path='teams', meta='id')

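This produces one row per entry in the teams list, with that team's conference, homeAway, points, and school as columns, plus the game's id. Top-level fields that are not named in meta (in the sample, the conference, homeAway, points, and school keys that sit next to id) are not carried over. The stats column, however, still holds a list of dictionaries in each row, so we need to unpack it.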
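To make the roles of record_path and meta concrete, here is a minimal sketch on a made-up miniature of the game data. The sample list and its values are hypothetical and exist only for illustration; the call itself mirrors the one used in Step 1.

```python
import pandas as pd

# Hypothetical miniature of the game data: one game with two teams.
sample = [
    {
        'id': 1,
        'teams': [
            {'school': 'A', 'points': 7,
             'stats': [{'category': 'firstDowns', 'stat': '10'}]},
            {'school': 'B', 'points': 3,
             'stats': [{'category': 'firstDowns', 'stat': '8'}]},
        ],
    }
]

# record_path selects the nested list whose items become rows;
# meta names the top-level field copied onto each of those rows.
flat = pd.json_normalize(sample, record_path='teams', meta='id')

print(flat.columns.tolist())
print(len(flat))                    # one row per team
print(type(flat.loc[0, 'stats']))   # still a Python list at this point
```

Each team becomes a row and id is repeated on both, while the nested stats list is left untouched for the next step.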

Step 2: Exploding the List of Dictionaries

To achieve this, we can use the explode() function:

df = df.explode('stats')

This creates a new row for each dictionary in the stats list. The other columns, including id, are repeated on every new row, and each value in stats is now a single dictionary rather than a list.
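One detail of explode() matters for the later steps: every new row keeps the index label of the row it came from. The sketch below shows this on a hypothetical one-row frame (the toy data is an assumption, not part of the game data), along with the ignore_index=True option that renumbers the rows instead.

```python
import pandas as pd

# Hypothetical one-row frame with a list of two stat dictionaries.
toy = pd.DataFrame({
    'school': ['A'],
    'stats': [[{'category': 'x', 'stat': '1'},
               {'category': 'y', 'stat': '2'}]],
})

# Default behaviour: both new rows keep the original label 0.
exploded = toy.explode('stats')
print(exploded.index.tolist())    # [0, 0]

# ignore_index=True renumbers the rows 0, 1, ...
renumbered = toy.explode('stats', ignore_index=True)
print(renumbered.index.tolist())  # [0, 1]
```

Duplicate labels are harmless on their own, but they get in the way of index-aligned operations such as pd.concat(axis=1), so it helps to reset or renumber the index before combining frames.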

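Step 3: Expanding the Stat Dictionaries into Columns

Each value in the stats column is now a dictionary with category and stat keys. Next, we turn those dictionaries into columns of their own and attach them to the rest of the DataFrame. Because explode() repeats the original row label on every new row, we reset the index first so that the two pieces line up row by row:

stats = pd.json_normalize(df.pop('stats').tolist())
df = pd.concat([df.reset_index(drop=True), stats], axis=1)

This replaces the stats column with two new columns, category and stat.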
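The stat values in the sample data are strings ('2', '22', and so on), so they need converting before any arithmetic. A minimal sketch using pd.to_numeric follows; the small long_df frame is a hypothetical stand-in for the category and stat columns, and errors='coerce' is one possible choice that turns any unparseable text into NaN.

```python
import pandas as pd

# Hypothetical stand-in for the category/stat columns.
long_df = pd.DataFrame({
    'category': ['rushingTDs', 'firstDowns'],
    'stat': ['2', '22'],
})

# Convert the string values to numbers; unparseable text would become NaN.
long_df['stat'] = pd.to_numeric(long_df['stat'], errors='coerce')

print(long_df['stat'].sum())  # 24
```

Whether to convert before or after pivoting is a matter of preference; converting first means the pivoted columns are numeric from the start.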

Step 4: Pivoting the DataFrame

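Finally, we’ll pivot our DataFrame so that each stat category becomes its own column. The id, school, and points columns identify each team's row, the category values become the new column names, and stat supplies the cell values. Because stat holds strings, we pass aggfunc='first' instead of relying on the default mean:

df = df.pivot_table(index=['id', 'school', 'points'], columns='category', values='stat', aggfunc='first').reset_index()

This yields one row per team, with one column for each category in the stats list alongside the id, school, and points columns.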
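To see the long-to-wide reshaping in isolation, here is a small sketch on hypothetical data with two teams and two categories. The frame and its values are invented for illustration; pivot_table with aggfunc='first' is used because each (team, category) pair has exactly one string value.

```python
import pandas as pd

# Hypothetical long-format data: one row per (team, category) pair.
long_df = pd.DataFrame({
    'id': [1, 1, 1, 1],
    'school': ['A', 'A', 'B', 'B'],
    'category': ['firstDowns', 'rushingTDs', 'firstDowns', 'rushingTDs'],
    'stat': ['10', '2', '8', '1'],
})

# Each distinct category becomes a column; 'first' keeps the single value as-is.
wide = long_df.pivot_table(
    index=['id', 'school'],
    columns='category',
    values='stat',
    aggfunc='first',
).reset_index()

# Drop the leftover 'category' label on the column axis for a cleaner header.
wide.columns.name = None

print(wide.columns.tolist())  # ['id', 'school', 'firstDowns', 'rushingTDs']
print(wide.shape)             # (2, 4)
```

If a (team, category) pair could occur more than once, 'first' would silently keep only one value, so a different aggregation may be more appropriate in that case.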

Putting It All Together

Here’s the complete code snippet:

import pandas as pd

game_stats = [
    {
        'id': 401282099,
        'teams': [
            {'conference': 'SEC', 'homeAway': 'away', 'points': 21, 'school': 'LSU', 'stats': [
                {'category': 'rushingTDs', 'stat': '2'},
                {'category': 'passingTDs', 'stat': '1'},
                {'category': 'kickingPoints', 'stat': '3'},
                {'category': 'fumblesRecovered', 'stat': '0'},
                {'category': 'firstDowns', 'stat': '22'}
            ]}
        ],
        'conference': 'SEC',
        'homeAway': 'home',
        'points': 42,
        'school': 'Kentucky'
    }
]

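# Step 1: one row per team, with the game id copied onto each row
df = pd.json_normalize(game_stats, record_path='teams', meta='id')

# Step 2: one row per entry in each team's stats list
df = df.explode('stats')

# Step 3: expand each stat dictionary into category and stat columns
stats = pd.json_normalize(df.pop('stats').tolist())
df = pd.concat([df.reset_index(drop=True), stats], axis=1)

# Step 4: one column per category, one row per team
df = df.pivot_table(index=['id', 'school', 'points'], columns='category', values='stat', aggfunc='first').reset_index()

print(df)

Output (exact spacing and line wrapping depend on your pandas display settings):

category         id  school  points  firstDowns  fumblesRecovered  kickingPoints  passingTDs  rushingTDs
0         401282099     LSU      21          22                 0              3           1           2

Only LSU appears because it is the only entry in the teams list of the sample data.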

This is our final transformed DataFrame, ready for analysis or visualization.
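As an alternative to Steps 1 through 3, pd.json_normalize() can walk both nesting levels in a single call by giving record_path as a list of keys and naming the team-level fields in meta. The sketch below uses a trimmed copy of the sample data (only two stats, and only the keys the call reads); it is one possible shortcut, not a replacement for understanding the individual steps.

```python
import pandas as pd

# Trimmed copy of the sample data: one game, one team, two stats.
game_stats = [
    {
        'id': 401282099,
        'teams': [
            {'school': 'LSU', 'points': 21, 'stats': [
                {'category': 'rushingTDs', 'stat': '2'},
                {'category': 'firstDowns', 'stat': '22'},
            ]},
        ],
    }
]

# record_path descends teams -> stats; nested meta paths pull fields
# from the team level, and 'id' comes from the game level.
long_df = pd.json_normalize(
    game_stats,
    record_path=['teams', 'stats'],
    meta=['id', ['teams', 'school'], ['teams', 'points']],
)

print(long_df.columns.tolist())
print(len(long_df))  # one row per stat entry
```

The team-level meta columns come back with dotted names (teams.school, teams.points), so they may need renaming before the pivot in Step 4.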


Last modified on 2024-09-07