This notebook attempts to reproduce the two tables found in Braverman and Shaffer's 2010 paper on behavioural markers for high-risk internet gambling. To get started, download the data titled '*How Do Gamblers Start Gambling: Identifying Behavioural Markers for High-risk Internet Gambling*' through the link below - you'll need the text files under 'Raw Dataset 2' and 'Analytic Dataset';

File names: **RawDataSet2_DailyAggregation.txt** and **AnalyticDataSet_HighRisk.txt**

The data description above implies that RawDataSet2 contains each player's betting data for the full duration of the study, when it appears to include at most 31 days of betting data per player. This means the AnalyticDataSet cannot be faithfully reproduced from the raw data alone, as the analytic data includes full-duration behavioural measures (see final cell).

The `trajectory` measure calculated here disagrees with the analytic data set; specifically, it shows more extreme values for the gradient of the stakes. The reason for this is described below.

With the data downloaded, the first step is to import *gamba*, run the cell below to get started;

```
import gamba as gb
```

With *gamba* ready, we load both the analytic and raw data sets from the link above - the goal is to recreate the analytic data set from the raw data;

```
raw_data = gb.data.read_csv('RawDataSet2_DailyAggregation.txt', delimiter='\t', parse_dates=['TimeDATE'])
analytic_data = gb.data.read_csv('AnalyticDataSet_HighRisk.txt', delimiter='\t')
print('raw data loaded:', len(raw_data))
print('analytic data loaded:', len(analytic_data))
```

At this point, the data can be prepared for use in the gamba library. This can be done with the purpose-built `prepare_braverman_data` method in the `gamba.data` module;

```
all_player_bets = gb.data.prepare_braverman_data('RawDataSet2_DailyAggregation.txt')
```

Now for the start of the study's replication - we begin by calculating the measures reported in the paper, which include **intensity**, **frequency**, **variability**, **trajectory**, **sum of stakes**, **total number of bets**, **average bet size**, **duration of account betting**, and the **net loss incurred** for each player. These are all included in the `calculate_braverman_measures` method in the `gamba.measures` module;

```
measures = gb.measures.calculate_braverman_measures(all_player_bets) # this method saves them to a file called 'gamba_braverman_measures.csv'
measures.sort_values('player_id', inplace=True) # let's sort them by ID and display the first 3;
display(measures.head(3))
```

As a sanity check, we can display the original measures calculated for the three players above (after renaming the columns to more intuitive ones);

```
players = measures['player_id'].values[:3] # get only the first 3 values (those above)
display(analytic_data.head(3))
analytic_data['average_bet_size'] = analytic_data['p2sumstake'] / analytic_data['p2sumbet']
original_analysis = analytic_data[['UserID','p2bpd1m','p2totalactivedays1m','p2stakeSD1m','p2stakeSlope1m','p2sumstake','p2sumbet','average_bet_size','p2intvday','p2net']].copy() # copy to avoid pandas SettingWithCopy warnings below
original_analysis.columns = ['player_id','intensity','frequency','variability','trajectory','sum_of_stakes','total_num_bets','average_bet_size','duration','net_loss']
original_analysis.sort_values('player_id', inplace=True) # after renaming the columns, sort by player ID (as above)
display(original_analysis.head(3))
```
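To quantify how far the two tables diverge, the computed and original measures can be aligned on player ID and subtracted column by column. A minimal sketch with pandas on toy stand-in tables (the values and the single compared column are hypothetical, not taken from the study's data):

```python
import pandas as pd

# Toy stand-ins for the two measures tables (hypothetical values)
computed = pd.DataFrame({'player_id': [1, 2], 'total_num_bets': [100, 250]})
original = pd.DataFrame({'player_id': [1, 2], 'total_num_bets': [120, 300]})

# Align the tables on player_id, keeping both versions of each measure
comparison = computed.merge(original, on='player_id', suffixes=('_raw', '_analytic'))

# Per-player difference: positive values mean the raw-data estimate is lower
comparison['bets_diff'] = comparison['total_num_bets_analytic'] - comparison['total_num_bets_raw']
print(comparison)
```

The same merge-and-subtract pattern extends to every shared measure column, and summarising `bets_diff` (e.g. with `.describe()`) shows whether the discrepancy is systematic or isolated.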

This is a little puzzling: some of the measures align, yet others such as `total_num_bets` and `duration` appear to be underestimates compared to the original analysis, and the `trajectory` measure also appears more extreme. To find out what's causing this difference, we can explore the duration of the data in the raw data set;

```
raw_data = gb.data.read_csv('RawDataSet2_DailyAggregation.txt', delimiter='\t', parse_dates=['TimeDATE'])
all_player_ids = set(list(raw_data['UserID']))
max_duration = 0
for player_id in all_player_ids:
    player_bets = raw_data[raw_data['UserID'] == player_id].copy()
    player_bets.rename(columns={'TimeDATE':'bet_time'}, inplace=True)
    duration = gb.measures.duration(player_bets)
    if duration > max_duration:
        max_duration = duration
print('unique players found:', len(all_player_ids))
print('maximum duration found:', max_duration)
```
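The per-player loop above can also be expressed as a single pandas groupby, which avoids filtering the frame once per player. A minimal sketch on toy data in the same shape as the raw data set (the dates and IDs are hypothetical), assuming duration is counted as the inclusive number of days between a player's first and last recorded bet:

```python
import pandas as pd

# Toy daily-aggregate data shaped like RawDataSet2 (hypothetical values)
raw = pd.DataFrame({
    'UserID': [1, 1, 2],
    'TimeDATE': pd.to_datetime(['2000-05-01', '2000-05-31', '2000-05-10']),
})

# Inclusive day span per player: (last bet - first bet) + 1
spans = raw.groupby('UserID')['TimeDATE'].agg(lambda s: (s.max() - s.min()).days + 1)

print(spans.max())  # longest betting window across all players
```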

The raw data contains at most 31 days of betting data per player, so the analytic data set cannot be *completely* reproduced from the raw data alone; the original analytic data will therefore be taken forward rather than an exactly replicated data set.
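This 31-day ceiling also suggests why the computed `trajectory` values look more extreme: a linear fit over a short window swings more with day-to-day noise than a fit over a player's full history. A minimal sketch of a trajectory-style gradient, assuming it is the slope of a least-squares line through daily stakes (the day and stake values here are toy numbers, not study data):

```python
import numpy as np

# Toy daily total stakes for one player over a 5-day window (hypothetical values)
days = np.array([1, 2, 3, 4, 5])
stakes = np.array([10.0, 12.0, 15.0, 11.0, 20.0])

# Trajectory as the gradient of an ordinary least-squares line fit (degree 1)
slope, intercept = np.polyfit(days, stakes, 1)
print(round(slope, 2))  # stake units gained per day
```

Fitting the same kind of line over a longer, flatter tail of betting activity would pull the gradient towards zero, which matches the direction of the discrepancy seen above.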

This means that as we cannot compute the measures exactly, the next best thing is to verify the clustering described in the paper. We can do this using the `k_means` function from gamba's machine learning module;

This next cell aims to recreate the k-means method described on page 3 of the paper, under the heading *Statistical analysis*;

```
standardised_measures_table = gb.measures.standardise_measures_table(original_analysis)
clustered_data = gb.machine_learning.k_means(standardised_measures_table, clusters=4, data_only=True)
gb.machine_learning.describe_clusters(clustered_data)
```

The random initialisation of the k-means algorithm means that it is unlikely to exactly reproduce any previous k-means clusters on the data. This is a problem for *exact* replication, but we can be sure that the algorithm is being applied and that clusters are being identified based on the descriptions above.

Note: this can be mitigated in future by seeding probabilistic algorithms!
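To illustrate what seeding buys you, here is a minimal sketch using scikit-learn's `KMeans` (not gamba's wrapper) on toy standardised measures: fixing `random_state` makes the fitted labels identical across runs.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy standardised measures: two obvious groups (hypothetical values)
X = np.array([[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9]])

# Fixing random_state pins the centroid initialisation, so the
# clustering is reproducible run to run
labels_a = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X).labels_
labels_b = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X).labels_

print((labels_a == labels_b).all())  # identical labels across runs
```

Note that cluster *numbering* is still arbitrary (cluster 0 in one study may be cluster 2 in another), so replication checks should compare cluster memberships or centroid profiles rather than raw label values.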

What to do with this information is the next question; as the attempt above to rebuild the analytic data from the raw data showed discrepancies, it is impossible to exactly replicate this particular study (as with other probabilistic analyses).