Business Intelligence Exercise

Deliverables

  • Write a SQL script to create and populate data in a table to report performance of the conversion funnel.
  • Make a recommendation of how tod include “search” as part of the funnel.

Funnel Definition

Use of search is important in understanding the performance of the conversion funnel, but how to consider it as part of the funnel is the question.

Step Name Event Name Step number
HP Visits home_page 1
Accessed Store Page store_page 2
Accessed Checkout Page checkout_page 3
Completed and Order successful_checkout_page 4


Table Name = user_events

The data for a subset of US users for a 1 week time range:

Field Name Description
event_time timestamp without timezone
user_id string (hash)
event_type string
platform string
country string
region string
device_id string (hash)
initial_referring_domain string

In [1]:
import time
import mysql.connector
from datetime import date

import numpy as np
import pandas as pd

from BI_config import (USR, PASS, HST, DB)

mydb = mysql.connector.connect(user=USR, password=PASS,
                              host=HST,
                              database=DB)
curA = mydb.cursor(buffered=True, dictionary = True)
In [2]:
file = "business_intelligence_data.csv"
df = pd.read_csv(file)

print(df.info())
print(df.head())
print(df.nunique())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34843 entries, 0 to 34842
Data columns (total 8 columns):
event_time                  34843 non-null object
user_id                     34843 non-null object
event_type                  34843 non-null object
platform                    34843 non-null object
country                     34843 non-null object
region                      34725 non-null object
device_id                   34843 non-null object
initial_referring_domain    16230 non-null object
dtypes: object(8)
memory usage: 2.1+ MB
None
            event_time                           user_id           event_type  \
0  2018-01-10 00:00:09  e60a96af98eb7fc5d33e731f7ce2a6a2            home_page   
1  2018-01-10 00:00:19  ce1e97cf6c1dcf91ad09e0599f46b852            home_page   
2  2018-01-10 00:00:21  ce1e97cf6c1dcf91ad09e0599f46b852            home_page   
3  2018-01-10 00:00:22  e60a96af98eb7fc5d33e731f7ce2a6a2  store_ordering_page   
4  2018-01-10 00:00:24  554d97b7404288de4492bd852278e6da            home_page   

  platform        country      region                         device_id  \
0      iOS  United States  California  767552034c51c3802b525334772e1f2e   
1      Web  United States  California  04478521debfe3597d73e282bbbebed6   
2      Web  United States  California  04478521debfe3597d73e282bbbebed6   
3      iOS  United States  California  767552034c51c3802b525334772e1f2e   
4      iOS  United States    New York  dd3cba046870e4d70aa40efbbc8f2015   

  initial_referring_domain  
0                      NaN  
1                   google  
2                   google  
3                      NaN  
4                      NaN  
event_time                  32495
user_id                      3095
event_type                      5
platform                        3
country                         1
region                         35
device_id                    3283
initial_referring_domain        4
dtype: int64
In [3]:
#new funnel model based on search_event
funnel_2 = {'home_page':1,
          'store_ordering_page':2,
            'search_event':3,
          'checkout_page':4,
          'checkout_success':5}

funnel_df = pd.DataFrame.from_dict(funnel_2, orient='index')
funnel_df.reset_index(inplace=True)
funnel_df.columns=['event_type', 'funnel_step']

df = pd.merge(df, funnel_df, on='event_type')

Session Based Variable Generation

In order to be able to drill down on useer behavior in the funnel, I created the following session based variables:

  • session_id - unique session Id
  • event_duration - length of time spent on specific event
  • searched - dummy variable if session includes search event
  • ordered - dummy variable if session results in successful order event
In [4]:
df = df.astype({'event_time':'datetime64', 'event_type':'category', 'platform':'category'})
df.sort_values(['user_id', 'event_time'], inplace=True)
df.reset_index(drop=True, inplace=True)

df['session_id'] = 0
df['event_duration'] = 0
df['event_duration'] = df['event_duration'].astype('timedelta64[ns]')
df['searched'] = 0
df['ordered'] = 0


for i in range(len(df)-1):
    df.iloc[i, df.columns.get_loc('session_id')] = i
    if df.iloc[i, df.columns.get_loc('user_id')] == df.iloc[i-1, df.columns.get_loc('user_id')] and \
        df.iloc[i, df.columns.get_loc('device_id')] == df.iloc[i-1, df.columns.get_loc('device_id')] and \
        df.iloc[i-1, df.columns.get_loc('event_type')] != 'checkout_success':
        df.iloc[i, df.columns.get_loc('session_id')] = df.iloc[i-1, df.columns.get_loc('session_id')]
        df.iloc[i, df.columns.get_loc('event_duration')] = df.iloc[i, df.columns.get_loc('event_time')]\
        - df.iloc[i-1, df.columns.get_loc('event_time')]
    if df.iloc[i, df.columns.get_loc('event_type')] == 'checkout_success':
        df.iloc[i, df.columns.get_loc('ordered')] = 1
    if df.iloc[i, df.columns.get_loc('event_type')] == 'search_event':
        df.iloc[i, df.columns.get_loc('searched')] = 1
        
In [5]:
session_df = df.groupby(['user_id', 'session_id', 'event_time', 'region', 'platform', 
                         'initial_referring_domain', 'event_type', 'funnel_step']).agg({'event_duration':'sum'})
searched_ordered_df = df.groupby(['session_id']).agg({'searched':'sum', 'ordered':'sum'})
session_df = session_df.join(searched_ordered_df, on='session_id')

#print(session_df.head(20))
In [6]:
from sqlalchemy import create_engine
engine_str = ("mysql://%s:%s@%s/%s" % (USR, PASS, HST, DB))
engine = create_engine(engine_str)

session_df.to_sql('session_funnel', con=engine, if_exists='append')