Joseph Ryan Glover

Random stuff.

Arbitraging Playdium

This afternoon the family went to the Playdium in Mississauga. For those not in the know, the Playdium is a big arcade, go-cart, batting cage and redemption game complex that takes your money and offers entertainment in return. I had a good time at Playdium this afternoon because my son and I discovered how to arbitrage a particular redemption game and secure some decent prizes. While we didn’t push our insights to the limit, in the following analysis I’m going to demonstrate how it would be possible to take home a Minecraft-themed Xbox One S with several hours’ worth of work and for under a hundred bucks.

First, do you know what a redemption game is? They are the games that you find at places like Dave and Buster’s (which Playdium very closely resembles) where you put credits into the game, perform an action and are rewarded with tickets. You get enough tickets and you can cash them out for prizes, usually really lousy prizes that end in disappointment. But not today.

Second, the game we arbitraged is a redemption game themed around the Madagascar license. It involves four stations positioned around the game with a central chute that spouts little plastic balls. Each player has a net and is supposed to catch as many little balls in the net as possible in a specified time frame. The important thing to understand is that the nets have very little range of motion and the movement of the balls is practically random. There is very little skill involved in playing; in some tests on the machine I found that even if you don’t touch the net at all, you still collect around 20 balls through chance alone. At Playdium you get one ticket for every ball collected and, given that the game costs 5 credits to play, that’s a fairly decent ticket return on credit spent.

Here’s an extremely long video of a fellow extolling the virtues of the machine.

So far this sounds like a typical money-eating redemption game. You pay credits, you get some tickets. But the Madagascar game has a strange wrinkle: if you beat the current high score you earn 1000 bonus tickets.

A thousand tickets is quite a payoff, so you’d think it would be rare and difficult to achieve. Not so: my son and I combined for nine instances of the 1000-ticket payout in about an hour’s play.

Look at the screenshot below and you’ll see an example of the high score for the machine: it’s 39, which is a much larger number of balls than I was ever able to gather. However, and this is the kicker, every time someone plays the game and loses, that number drops by one. Since there are four positions on the game, often you have four players working away, failing to beat the high score, and the high score drops by four. I found that if I played when the score was in the 30-32 point range, I could gather enough balls to beat it and score the 1000 tickets. Best of all, when you win, the high score resets … to 40. And then you do it all over again.

With the heart of the arbitrage detailed, let’s break down the numbers to reveal how to profit from this woefully calibrated machine.

To start, the machine costs 5 credits to play. What is a credit? They are an internal Playdium currency purchased with real dollars. For my visit I purchased 175 credits for $20 Canadian dollars. That means that a credit is worth about 11.4 cents and the Madagascar game costs 5 x 11.4 ≈ 57 cents to play.

Next, what is it that we are aiming to buy with our tickets? The most expensive item in the redemption store when I visited was a Minecraft-themed Xbox One S and it was priced at 70,000 tickets. Amazon.ca, at the time of this writing, was listing this Xbox for approximately $400 Canadian dollars, so that tells us how much it would cost to just go out and buy the machine.

Consider, for a moment, if we were just playing the Madagascar game straight, getting 1 ticket for every ball we captured and never achieving the 1000 ticket payout, which should be rare. Of the games I played, I averaged around 30 balls, which means I averaged 30 tickets. Since the game cost 5 credits to play, and I purchased 175 credits, I would be able to play the game 35 times for my $20. During those 35 plays, if I earned the average number of tickets per play, I would earn 1050 tickets. If I wanted 70,000 tickets to get the Xbox I’d need to burn through 175 credits about 67 times, and buying that many credits at the $20/175 rate would cost me roughly $1340 Canadian dollars. Since $1340 is much larger than $400, this is an impractical way to secure an Xbox.

Let’s consider an alternate scenario, one where I only win the 1000 ticket jackpot every time I play. If I was going to score 1000 tickets at a time I would only need to play the game 70 times. Playing the game 70 times would cost 70*5 = 350 credits which, at the $20/175 rate, is $40 Canadian dollars. Since $40 is much smaller than $400, this is a very good way to secure an Xbox.

There is a third strategy to consider, and that’s playing the game straight but every tenth game, when the high score is down to a manageable level, securing the 1000 ticket payoff. Playing this way would, over the course of 10 games, net you 1000 + 9*30 = 1270 tickets. This would cost 50 credits. To earn 70,000 tickets in this manner would require 2756 credits, which at the 11.4 cents/credit rate would cost you about $315 Canadian dollars. Still cheaper than the actual Xbox, but not by that much.
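If you want to check the arithmetic on all three strategies at once, here is a quick Python sketch using just the numbers from the scenarios above:

```python
# Cost to reach 70,000 tickets under the three strategies described above.
CREDIT_COST = 20 / 175          # dollars per credit (~11.4 cents)
PLAY_COST = 5 * CREDIT_COST     # ~57 cents per play
TARGET = 70_000                 # ticket price of the Minecraft Xbox One S

def dollars_to_target(tickets_per_play):
    """Dollar cost to accumulate the ticket target at a given average payout per play."""
    return (TARGET / tickets_per_play) * PLAY_COST

print(dollars_to_target(30))         # play it straight at ~30 tickets/play: ~$1,333
print(dollars_to_target(1000))       # jackpot every play: $40
print(dollars_to_target(1270 / 10))  # jackpot every tenth play: ~$315
```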

The key to keeping one’s costs down is to let others play the game when the jackpot is not winnable and to only step into a player position when the high score is achievable. I had a confirmation of this strategy when, as the score was getting lower, an employee of Playdium sidled up to the machine, furtively glanced about, and swiped in to play at a high score of 32. This guy knew what was up and I knew I wasn’t going to let him scoop my jackpot. I swiped in and captured 34 balls, snagging the prize.

One disadvantage of the waiting-for-other-players strategy is that it has a time cost. It takes time for enough players to come through and lower the high score to a manageable level. After I developed my strategy I found that I was waiting about 5 minutes on average for the score to drop. If I had adopted the “only play when you can win the jackpot” strategy from the get-go I calculate that I would have had to wait 5*70 = 350 minutes to play through 70 jackpots. That’s 5.8 hours, which is an awfully long time to be seen creeping around a game at an arcade.

Of course, this is an ideal outcome; it may take more than 70 plays to get 70 jackpots (hopefully not that many more) and that’s going to add time and dollars to the endeavor. Thankfully, Playdium tickets are not perishable and you can store them on your card for when you return another time. This suggests a strategy of making several visits, with shorter time frames, and racking up tickets across multiple days. This might also help in case Playdium has their own version of a pit boss, who might notice that the Madagascar game is paying out an absurd number of tickets.

One thing this analysis doesn’t factor in is the value of your time. If your time is worth $100/hour then waiting around for 6 hours to secure a $400 videogame console is dumb. That’s why I’m leaking this secret to the world: I won’t be putting this strategy to use, but I did enjoy finding this arbitrage opportunity and cashing out some not-Xbox prizes for my kids. Hopefully some time-rich, cash-poor individuals looking for an Xbox will read this and capitalize. Or maybe Playdium will hire me to help balance their games better. Good luck!

 

Sentiment Analysis in Sebastian Barry’s Novel A Long Long Way

I’m kicking around ideas for my thesis in Irish literature and one avenue I explored was examining Sebastian Barry’s 2005 novel A Long Long Way using some techniques from the digital humanities. For those who may be unfamiliar, the digital humanities is a newer academic discipline that seeks to apply data analysis tools to traditional humanities disciplines, such as English literature, to see what they can see. There are a lot of different techniques that fall under the umbrella of the digital humanities but I’m interested in what’s known as sentiment analysis, a kind of quantitative text analysis that seeks to measure the emotional tenor of a text. Barry’s novel is about the experience of the Irish during world war one and though I read the book eight years ago, the thing that has stuck with me is the emotional highs and lows that Barry sends his protagonist Willie Dunne through. People often say that something is an “emotional rollercoaster” but I’ve only ever felt that A Long Long Way really embodied that characterization. In my mind, this makes the novel a perfect test case for sentiment analysis because if it works the way it is advertised, the computer should be able to identify the same emotional highs and lows I did as a human reader.

In sentiment analysis, there seem to be two different schools of thought. The first school takes a machine learning approach to train an algorithm to determine the sentiment of a corpus of documents (usually tweets) while the second approach takes a database of words with scored sentiment and then calculates the sentiment of a text based on these words. I took the second approach because I’m not sure what value there is in having a machine guess about sentiment when it can simply be calculated.

Googling around the internet there are a variety of techniques and approaches to performing sentiment analysis. Matthew Jockers of the University of Nebraska-Lincoln is doing interesting work looking at the emotional trajectories of novels and he has produced a lot of good R code that can be downloaded and played with. Jockers’ approach seems to be to parse the text of a novel, in sequence, and run it through a sentiment function that attaches positive or negative scores to words with “emotional valence”. Emotional valence is defined differently by different researchers, but Jockers uses the AFINN sentiment database in part of his research and this database contains a list of words scored between -5 and 5 based on how negatively or positively emotional they are. In practice, this means that profanities such as bastard or prick are given a -5 while adjectives such as breathtaking and superb are rated a 5. The database I downloaded contained nearly 2500 English words scored by emotional valence and I used the database to score Barry’s novel.

There are a lot of tutorials around the web about how to perform sentiment analysis in Python or R and I spent some time studying them and concluded that most of these tutorials were doing more work than I thought necessary. Beyond the machine learning approach (which I rejected), even Jockers was doing some elaborate Fast Fourier Transform work to analyze his texts and this seemed excessive. After some research I decided that my approach would be dead simple:

  • I’d break the text into individual words
  • I would score each word using the AFINN database
  • I would calculate a 5000-word rolling sum of the word scores
  • I would plot that result and see if it said anything interesting

The only strange bit in this approach is the 5000-word rolling sum. Using AFINN, words can only have a score between -5 and 5 and the reality is that the majority of Barry’s 93000+ words would be given a score of 0 since they aren’t in AFINN’s database. Plotting each word seemed like it would result in a meaningless series of squiggles around the axis. I reasoned that, if an author is trying to use emotional language to move the reader, they are likely going to be building across sentences, paragraphs and pages in a general emotional direction. For that reason it makes sense to sum a lot of the scores together to see if the emotional valence of the text is rising or falling.

The good part about taking a simple approach is that all of the work can be done in Excel. I had one worksheet with every word in the novel on its own row and I had a second worksheet with all the AFINN words and their scores. Then I used a VLOOKUP to assign a score to each of the novel’s words based on the AFINN score. This screenshot demonstrates how simple the setup is:

After that I calculated the 5000-word rolling sum (in column D) using just the SUM formula and then charted it using a standard Excel line chart.
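If you would rather script it than spreadsheet it, the same pipeline is only a few lines of pandas. This is a rough sketch under my own assumptions about the inputs: a plain-text copy of the novel and the tab-separated AFINN word list (both file names are placeholders).

```python
import re
import pandas as pd

# Assumed inputs: a plain-text novel and the tab-separated AFINN list ("word<TAB>score").
afinn = pd.read_csv("AFINN-111.txt", sep="\t", names=["word", "score"])
scores = dict(zip(afinn["word"], afinn["score"]))

with open("a_long_long_way.txt", encoding="utf-8") as f:
    words = re.findall(r"[a-z']+", f.read().lower())

# Score every word; words not in AFINN get 0.
valence = pd.Series([scores.get(w, 0) for w in words])

# 5000-word rolling sum of emotional valence (a trailing window here;
# the mid-point variant is shown further down).
rolling = valence.rolling(5000).sum()
rolling.plot()  # requires matplotlib; this is the line chart discussed below
```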

Visually, the chart of the emotional valence of the novel confirms the rollercoaster feeling I had when I read it. To be fair, I reckon a lot of novels follow this kind of emotional back and forth, and Jockers’ research certainly indicates that they do, but it’s nice to see that the computer can identify that the text of the novel is loaded with positive and negative emotion just as a human reader can. That alone seems like a victory.

A few things about the chart:

  • The horizontal axis is labelled “Narrative Time” and this is a term I took from Jockers. It’s the words of the novel, in order, from start to finish. Barry’s novel is 93486 words long so the horizontal axis can also be called word count. It helps find a position in the novel.
  • The title of the chart says “Mid-Point Rolling Sum”. It’s a mid-point because when I was calculating the rolling sum I decided not to sum the scores for the last 5000 words at a particular location in the novel but rather the last 2500 and the next 2500 that bracketed that location. The choice of midpoint is strictly based on interpreting the text. If I use the last 5000 words, all of the emotional text lies before the high or low point, i.e. the emotional scene that caused the peak or trough is coming to an end. If I use the midpoint then I’m right in the middle of the high or low emotional action. I prefer the latter for pulling quotes (see the one-line variant after this list).
  • While the novel has emotional highs and lows it spends a lot more time below the zero line than above it. I think this is to be expected for a novel about the butchery and madness of world war one.
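In pandas terms, the mid-point version is just the centred flavour of the same rolling window used in the sketch above:

```python
# Mid-point rolling sum: roughly 2500 words behind and 2500 ahead of each
# position in the novel, instead of the trailing 5000.
rolling_mid = valence.rolling(5000, center=True).sum()
```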

In the chart above I highlighted several alternating peaks and troughs. Using the word number I was able to pull out some choice quotes from the novel at those locations.

Word 15021, Page 49 in the 2005 Penguin paperback edition of the novel.

Willie Dunne is the protagonist of the novel and he’s a world war one soldier in the trenches. This first low point is a scene where the Germans use poison gas for the first time against the allies.

He awoke to a yellow world. His first thought was that he was dead. It was the small hours of the morning and there were still torches and lights being used. Long lines of men were going back along the road, with weird faces, their right hand on the right shoulder of the man in front, about forty men in a chain sometimes. He thought horribly of the Revelation of St John and wondered if by chance and lack he had reached the unknown date of the end of the living world.

Word 21829, Page 71

Willie returns home on furlough and his sisters give him a bath and feed him cheese and bread. He also meets with his father who is very proud of him.

‘Annie, Annie!’ she called. ‘Annie will help us. Don’t you worry, Willie, we’ll get you clean all right.’

And she rolled up her black sleeves and went down the back stairs for the zinc tub that was stowed in its place under the lower landing. She nearly collided with Annie in the door.

‘Annie, dear, you will need to boil up some bathwater promptly and we will be washing Willie Dunne immediately.’

‘Willie, Willie,’ said Annie. She rushed forward and was about to put her arms around him.

‘Don’t touch me, Annie, I’m all lousy and God knows what.’

‘Well, we had better clean you off before Dolly gets back from school, because you won’t be able to hold her off!’

Word 34879, Page 114

Willie, in the trenches, kills a German soldier in hand-to-hand combat as he watches his fellow soldiers fall to the poison of an exploded gas shell. The dead German will stay in Willie’s thoughts for the rest of his life.

A grey monster in a mask came leaping into their trench. He looked enormous. Whether he was or not Willie did not know, but he looked as big as a horse. He stood over Willie and all Willie could think of were Vikings, wild Vikings sacking an Irish town. It must have been a picture from a school book. He had never seen a German soldier before so close up. Once he saw three dejected German prisoners, poor maggots of men with heads bowed, being escorted to some prison camp through the reserve area. They had looked so sad and small no one even thought to mock them. They engendered silence to see them. But this man was not like them. He put his two hands on Willie’s shoulders and for a moment Willie thought he was going to rip off the gas mask and instinctively he put his hands up to hold it on. For some reason, without himself actually registering it, he had got the funny tomahawk into his left hand and when he raised the hand the spike at the top of the short stick horribly drove into the underchin of the German.

Word 39879, Page 129 

Willie and his unit take baths in an abandoned bathhouse that has miraculously survived bombing and shelling. They luxuriate in the baths, the camaraderie and the singing of one of their fellows.

The soldier struck his first note and passionately gave a ballad from the days of the Crimea. It was very lonesome, tender, and bloody. There was a young girl in it, and a soldier, and a death. The listeners were stilled because in the song there was a melody that brought from their own memories coloured hints and living sparks of the past. The past was a valued thing but it was also dangerous to them in the toxic wastelands of the war. It needed a box of safety round it, and this small room for concerts was as good as they had found.

Word 53486, Page 167 

Willie has assaulted a fellow soldier (and friend) who is retelling a story of a rape in which the soldier participated.

‘All right, all right.’ But he didn’t seem able to say it for another few moments yet. Then he nodded his punched face. ‘She died of what had happened to her. She was bleeding all those hours. She was not treated right. She was fucking torn to pieces, wasn’t she? And she died. And we tried to save her.’

‘You think so?’

‘It’s just a story, Willie, a story of the war.’

‘You can keep your story, Pete. You can keep it.’

And Willie lay back trembling on his bed.

Word 61932, Page 194

Willie attends a boxing event put on for the soldiers. The fight is violent but well received by the male audience; a cathartic, atavistic event.

Now the bell went again and it seemed Miko Cuddy was in a fever to finish the fight, no doubt at the prompting of his seconds — the very same name as the men in an old-fashioned duel, Father Buckley noted — who had probably measured the big Ulsterman with mental measuring tape and had fearfully taken in the long reach and the thick muscle of the arms.

So Miko Cuddy came forth like a veritable whirligig, like a windmill on the flat white plain of the ring, whirling, whirling his arms, and before he could do much damage, William Beatty came at him like a ballet dancer, side-stepping and jigging and bouncing and finessing every punch, like a man inspired by the very poetry and possibility of movement, and curled in another punch to the very same ear he had caught in the first moment of the fight, and Willie Dunne swore afterwards that he seemed to feel that very punch himself on his own ear, and O’Hara did point out that in his excitement he had indeed landed a gentle box there, but only a shadow of the real thing.

Word 74001, Page 232

Willie and some soldiers are discussing what it means to be a victor in a war like world war one. They conclude that there is no meaningful definition for the word in their conflict.

‘They might be calling that a sort of victory, i’n’t that it?’

‘Some fucking victory,’ said Willie Dunne.

‘Some fucking war,’ said Timmy Weekes.

‘And so say all of us,’ said Willie Dunne.

And that was strong talk. And that was all right for a while. But as that strange silence that could descend on you, even among your companions, descended on Willie Dunne, all ease and that tincture of happiness like the sweet juice in an orange left his brain. It began to throb with that all too familiar throbbing. A dash of grog might take that away. An ill thought, a curse, or a good sleep might also.

Word 81416, Page 253

Willie is visiting the family of the deceased Captain Pasley, a man he respected. It is a peak in the emotional valence but it is not above zero. It is solemn and resigned.

‘Ah yes,’ said the rector, and Willie by old experience knew how the rector’s brain was whirring, registering the name that would be unlikely to be a Protestant one, though the first name maybe betokened a certain deference to the powers that be. But, to give the man his due, his tone didn’t alter. His own name was written in gold lettering just behind him as it happened, on the black notice that said the name of the church and the rector-in-charge. ‘Well, my friend, you will find them at the top of the hill. I’ll bid you good day and God bless you.’

‘Thank you, Rector.’

‘Thank you, William, for taking the time to talk to me.’

Willie felt curiously heartened by the words of the rector. In fact, he was close to weeping as he trod on up to the house among the trees.

Word 87716, Page 275

The Catholic Irish were promised Home Rule by the British after the war, and many young men volunteered for this reason. The 16th Division, made up mostly of Irish volunteers, is confirmed as nearly wiped out in this section. What they died for would not be honoured.

There was no point saying anything about it. Something had come to an end before even the war was over. Poor Father Buckley. The aspirations of poor men were annulled for ever. Any fella that had come out in the expectation of Home Rule could rest assured his efforts and his sacrifice were useless. For all that his father would think of it, Willie thought that was very sad. Very fucking sad. And very mysterious.

What is fascinating about most of these passages is that they don’t themselves contain a lot of emotional language but they are representative of very emotional scenes in the novel. The sorrow that haunts the novel, especially the second half, is very delicate and while it can be felt by an attuned reader it can also, seemingly, be detected by the computer. Over spans of thousands of words Barry is able to change the emotional drift of his novel through word selection without beating the reader over the head with it and for that reason I think this kind of analysis is useful. If nothing else it makes me appreciate, even more, Barry’s story-telling craftsmanship.

Quantifying Streakiness in NBA 3-Point Shooters

In basketball parlance, a streaky shooter is a player whose shooting effectiveness is inconsistent: some games they are hitting nearly all their shots and other games they are stone cold. There’s a lot of talk around the Internet about streaky shooters and a lot of top 10 and top 15 lists are made (pity J.R. Smith). Despite the volume of listicles published I didn’t find a lot of quantitative analysis to back up what the authors feel are the league’s streakiest shooters and I aim to rectify that. In this post I lay out a method for quantifying player streakiness.

Since this is a first attempt at a new analytical method I am going to keep my focus narrow. I am going to only look at the 2015-2016 season, I am only going to look at the streakiness of 3-point shots and I am only going to look at players with at least 100 3-point attempts in the season. Lest you think this is too limiting, rest assured that even with these limits 212 players have made the cut.

Other analysts have looked at streakiness in the past and their attempts involved analyzing the standard deviation of a player’s per game shooting percentage. This makes sense if you want to study streakiness at a game level; players with a larger standard deviation have large swings in performance from game to game (i.e. are streakier). My issue with this approach is that it fails to account for longer hot and cold streaks that span multiple games. My proposed solution examines a player’s shooting performance across a season as a continuous, ordered stream of made or missed shots. This approach allows for sustained periods of hot or cold shooting to be identified and also allows me to normalize performance so that players can be compared.

Before I get into the analysis I need to comment on my interpretation of streakiness. If a player goes 7-for-8 from 3 in a game we would agree that he had a hot hand that game, that he was on a streak. That determination is independent of where in the sequence of 8 attempts the player got their miss. In other words, scoring three, missing one, and scoring four is not somehow worse than missing the first shot and then rattling off seven straight. They both result in 7-for-8. What this means is that my streakiness measure has some “forgiveness” built in; you can miss some shots during a hot streak and still be considered hot. Conversely, you can score one in the midst of brick city and still be considered stone cold. Streakiness therefore needs to be a historic measure that looks at a player’s recent past to determine whether they are hot-or-not.

Keeping the above caveat in mind, I propose the following method to structure the shooting data. For each player with more than 100 3-point attempts in the 2015-2016 season, extract each of the 3-point attempts and chronologically order them from the start of the season to the end. For every make assign a 1 and for every miss assign a 0. After the player’s 10th 3-point shot of the season, start calculating a running 10-shot sum that continues to the end of the season. The running 10-shot sum is the “history” that accounts for the “forgiveness” I mentioned above and it allows an analyst to look at a player and ask “how’s he doing?” after any 3-point shot. Maybe the player is 3-for-10 over their last ten shots, maybe they’re 9-for-10. Whatever they are, this score out of 10 allows us to gauge streakiness across a season after any 3-point attempt (after the first 10).
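As a rough sketch of that bookkeeping (the shot-log format is my own assumption: a chronologically ordered list of 1s and 0s per player):

```python
import pandas as pd

def running_ten_shot_sum(makes):
    """makes: a player's 3-point attempts in order, 1 for a make, 0 for a miss.

    Returns the running 10-shot sum, defined from the 10th attempt onward."""
    return pd.Series(makes).rolling(10).sum().dropna().astype(int)

# e.g. a player who hits 3 of his first 10 attempts and then cools off
shots = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0]
print(running_ten_shot_sum(shots).tolist())  # [3, 2, 2, 2]
```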

Side note: why 10? Because humans have 10 fingers and 10 toes and we’re biologically predisposed to the decimal system? Also, no one in the data set had more than 10 3-point makes in a row in 2015-2016 and, in fact, only one person had exactly 10 makes in a row (take a bow, Austin Rivers).

 

Some Worked Examples

It is possible (and fruitful) to chart the running 10-shot sums by player. The chart below is for Aaron Gordon and the signal on the chart reveals his recent historic performance after each shot of 2015-2016 season.

A couple of things to notice:

The chart runs out to 133 but Aaron Gordon had 142 3-point attempts in 2015-2016. Yes, that’s true, but remember we lose the first 9 shots of the season since the idea of a running 10-shot sum doesn’t make sense for a point in the season with fewer than 10 previous shots. Every player will have 9 fewer points on these graphs compared to their total 3-point attempts in the season.

According to his chart, Aaron Gordon seems pretty consistent in his 3-point shooting; throughout the entire season he never goes above 5-for-10 and only drops to 0-for-10 once near the end of the season. Most of the time he’s hovering at the 3-for-10 mark. Yes! Thank you for noticing that! There is a reason that I picked Aaron Gordon for this example and that’s because he is the least streaky player in the data set. He is, in other words, Mr. Consistency (more on this very important point in a bit).

Do you want to see someone who is, arguably, the opposite of Aaron Gordon? Here’s Sasha Vujacic:

A thing you might notice about Vujacic (other than that his line is orange) is that his trend is all over the place. He flirts around at 5-for-10, then plunges down for a period in the 0-for-10 area, then back up to 6-for-10, then down again, then up again (to a period of 8-for-10!) then down, then finishes the season strong. Phew, his signal is a little … volatile.

I bet you can guess why I showed you Vujacic. Yes, it’s because my metric says he is the streakiest player in the data set. He has stretches of the season where his shooting is bad and then good and then bad again. He is the poster child for inconsistency even though he had a season 3-point percentage of 36.3% (more on this later).

I think these two examples point to a truth about streakiness (and I am going to bold it for emphasis): to be streaky is to be inconsistent and the benefit of the running 10-shot sum is that it gives us a new measure to gauge consistency. Because we’re not concerned about game shooting percentages we can put all players on an equal playing field (i.e. last ten shots), regardless of how many shots they take per game, and then measure a player’s performance across all eleven possibilities (i.e. 0-for-10 through 10-for-10) to get a sense of how consistent they are.

For example, if someone was to see Aaron Gordon nail a three and they turned to the coach and asked “how’s he doing?” it would not be unreasonable to say, sight unseen of the stats, “he’s 3 for his last 10”. Why? Because Gordon’s season chart indicates that he spends a lot of the season at 3-for-10. Conversely, if someone watched Vujacic nail a three and asked the same question the coach might just shrug since Vujacic could be anywhere along his trajectory, either in a downturn or an upswing, 1-for-10 or 7-for-10. Gordon, then, is the picture of consistency while Vujacic is the picture of inconsistency. Gordon is not streaky, he’s a steady 3-for-10 guy which means that steady equals consistent and hence streaky equals inconsistent.

Boiling Streakiness Down to a Number

This is all well and good but how do we measure streakiness (i.e. inconsistency) for everybody? The answer: histograms!

Looking at Aaron Gordon’s running 10-shot sum chart we (or rather Excel) can count off how often he sits at each position. Adding them all up and making a histogram produces the following chart:

Look at that sharp, defined peak at 3-for-10! Nearly 47% of the time Gordon is in the 3-for-10 state for 3-point shots. This is exactly why the coach can guess, sight unseen, that he is 3-for-10, since he spends nearly half the time there. Of course, he spends time in other states, but they are all strongly adjacent, 2-for-10 and 4-for-10, with crisp fall-offs in either direction. The man is consistent!

Now here’s Vujacic’s histogram.

Yikes! That’s a bit of a mess. It doesn’t even look like a normal curve, just a strange collection of zig zags. All we can say for sure after Vujacic nails a shot is that he is unlikely to be 10-for-10 (only the running-10-shot-sum-God Austin Rivers gets that courtesy). Vujacic is all over the place.

I know that I still haven’t shown you how to numerically quantify who is streaky and who isn’t; I just gave you another set of graphs. But seeing the histograms (especially Vujacic’s) helps conceptualize the next part of the discussion. I said above that Vujacic’s chart doesn’t look like a normal curve and, if you squint and turn your head a bit, Gordon’s does. This is important because thinking of the histogram in terms of a probability distribution (like the normal distribution) is key to figuring out who is streaky and who isn’t. Specifically, we want to quantify who is most inconsistent and name them the streakiest player in the league (that matches the criteria).

To get to that point, let’s briefly discuss the uniform distribution:

As you can see, the uniform distribution is super exciting. It is, in mathematical terms, a straight line that doesn’t change. But, if you use your imagination and think of it in terms of a basketball player shooting 3-point shots in the 2015-2016 season, you can imagine a player of perfect, unfettered chaos who is equally likely to be 0-for-10 as he is to be 10-for-10. This player is an Elder God of streakiness: every time they sink a shot it is equally likely that they are at any spot on the running 10-shot sum spectrum.

But! This chthonic nightmare has a use because we can use him to compare against every one of our players and assess how close to madness they are. This is done by summing the absolute difference between a player’s value at each position and that of the uniform distribution. I call this number the residual. What the residual ultimately quantifies is how different the player’s histogram is from the uniform distribution. The smaller the number, the closer they are to the chaos. A score of 0 would indicate a complete match.
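Continuing the earlier sketch, the histogram and residual calculation amount to something like this (again, the shot-log format is my assumption):

```python
import numpy as np
import pandas as pd

def residual(makes):
    """Sum of absolute differences between a player's running 10-shot sum
    distribution and the uniform distribution over the 11 possible states.
    Lower = closer to uniform chaos = streakier."""
    sums = pd.Series(makes).rolling(10).sum().dropna().astype(int)
    # share of the season spent in each state, 0-for-10 through 10-for-10
    hist = sums.value_counts(normalize=True).reindex(range(11), fill_value=0)
    return float(np.abs(hist - 1 / 11).sum())

# The post reports ~1.27 for Gordon (very peaked) and ~0.66 for Vujacic (near-uniform).
```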

Here’s a visual of Gordon’s comparison.

Gordon has the maximum difference from the uniform distribution because he has a sky high peak at 3-for-10 and zeroes from 6-for-10 onward. His residual score is 1.2741.

Now here’s a visual of Vujacic’s comparison.

Not for nothing does Vujacic have the streakiest signal in the data set! His signal most closely matches the madness of the uniform distribution and his residual score is 0.6554.

So finally we have a number. All we need to do is calculate this absolute difference score, sort it from low to high and find the streakiest players in the data set. Conversely, we can sort the table from high to low and find the most consistent players in the data set.

Here’s a table that lists the top-20 for both:

Some interesting results here. First off, Kawhi is streaky? According to the numbers, yes. Korver, Carter and Iguodala are tossed around as streaky shooters, fine. What the chart doesn’t show is that Draymond is number 24, Jimmy Butler is 28, Steph Curry is 29 and Harden is 31 (yes, Steph is considered slightly streakier than Harden!). And the Westbeast is not considered streaky! He’s in position 156! What’s the world coming to?

Here are Curry’s and Harden’s seasonal charts:

Both Curry and Harden have, obviously, way more attempts per season than Gordon or Vujacic and consequently they have a lot more downturns and upswings. While Curry never hits the 0-for-10 bottom during the season, Harden does it several times. However, Curry also has a lot of severe, whiplash-like changes between 1-for-10 and 7-for-10 that is ultimately accounting for his streakiness measure.

You can see that both of their histograms are much closer to Gordon’s than Vujacic’s and, while they look nearly identical, the residual calculation reveals that Curry (0.8326) is just slightly more inconsistent than Harden (0.8345). What it means is that the calculation is picking up on subtle differences that the human eye can’t pick out (thanks computers!).

Wrapping Up

I did have a notion that perhaps my methodology was biased against players with more 3-point attempts; perhaps more attempts meant more opportunities to go awry and enter the realm of streakiness. To investigate that I did a bog-standard correlation assessment between each player’s residual and the number of attempts they had in the season and found a value of -0.126, which means they are essentially uncorrelated. I also decided to test the correlation between the residual and a player’s seasonal 3-point percentage. I found a correlation of -0.374, which is a fairly weak negative correlation, but it does suggest that players who are more inconsistent (i.e. streaky) have higher 3-point percentages, which is kind of counterintuitive (but the correlation is small, so let’s not read too much into it).
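Those checks are one-liners once the per-player summary table is loaded; a sketch with placeholder column names:

```python
import pandas as pd

# One row per qualifying player; these column names are placeholders,
# not necessarily the ones in the attached spreadsheet.
players = pd.read_csv("streakiness.csv")
print(players["residual"].corr(players["attempts"]))      # reported as -0.126
print(players["residual"].corr(players["three_pt_pct"]))  # reported as -0.374
```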

I’m honestly not sure what to make of this analysis. The reasoning seems sound to me but some of the results seem a little daffy. I really like the idea of running out the entire season of 3-point attempts as a sequence of 10-shot histories to quantify the streakiness and visual inspection of the charts seems to suggest that the players who are being flagged as streaky have swings in their performance across the season. I’ve attached the spreadsheet of data I used to produce this analysis at the bottom of this post so if anyone wants to have a look, run the analysis and poke holes in my efforts, I’d be glad to hear about the results. Have a good one.

The Excel spreadsheet with data is available here.

Daring Players to Shoot

In basketball you often hear about a defender “daring” an offensive player to shoot. This dare can be identified by the distance the defender gives a shooter to take their shot. The reasoning is that if a defender does not consider the offensive player to be much of a shooting threat they will play off them, giving the shooter space to launch a shot, confident that it will be a miss. In the following analysis I use SportVu data from the 2014-15 NBA season (the last full year of SportVu data available) to look at every three point shot taken and the distance to the nearest defender for each shot. By plotting the three point percentage versus average defender distance for every player with at least 82 three point attempts I reveal the truth behind “daring to shoot”.

To begin, I needed every three point shot from the 2014-15 NBA season. I downloaded the data from NBAsavant.com using their shot tracker web tool. There seems to be a download limit of 50,000 records so I had to download two separate batches to get all of the 2014-15 season, which had 55,116 three point attempts. By selecting a zone of “three pointer”, defining two game dates that break the season into shorter chunks and by grouping the data by both player and date, I was able to obtain CSV files for the entire season. The CSV file I created is available here.

With the data in hand I then opted to use Python to manipulate the file. I used Python for this part of the analysis to get some practice with Pandas and specifically Pandas pivot tables. The following gist details each of the steps I went through to condition and prepare the data for analysis.

The script opens the CSV file for three point shots and creates two Pandas pivot tables. The first pivots player name against the shots_made_flag, which is either a 1 for a made shot or a 0 for a miss. By both summing and counting the shots_made_flag I can get both the number of made shots and the number of attempts; this is the code on line 9. The dataframe produced by the pivot has lousy column labels so the code on line 12 renames them. Also, for the subsequent 3 point % calculation I explicitly converted the columns to numeric types on line 15. Finally, I calculated the 3 point % by using a simple function and calling it on the dataframe using the apply() method on line 21.

On line 25 I created a second Pandas pivot table, this one pivoting on name and calculating the average defender distance for each three point attempt. Notice that I didn’t specify the aggregating function for this pivot table like I did for the first one and that’s because the default pivot aggregation is mean.

On line 28 I use the Pandas concat() function to fuse the two dataframes together. Since both are based on player name this has the delightful effect of joining the two pivots by player.

I mentioned above that I only wanted to look at players with at least 82 three point attempts on the season. While this may not directly translate to 1 attempt per regular season game I needed some cut off to throw out the low end and this seemed like a reasonable number. Line 31 of the code filters out those with fewer than 82 attempts.

Line 34 dumps the parsed and conditioned dataframe back out to CSV as I’m going to use R for the visualization (I know, I know, bad form to mix, but I want to practice both Python and R).
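The gist itself does the real work; purely as a sketch of the steps described above (the defender-distance column name is my assumption, the others follow the text), the pipeline looks roughly like this:

```python
import pandas as pd

shots = pd.read_csv("threes_2014_15.csv")  # placeholder file name

# Pivot player name against shots_made_flag, both summing (makes) and counting (attempts).
made = pd.pivot_table(shots, index="name", values="shots_made_flag",
                      aggfunc=["sum", "count"])
made.columns = [func for func, _ in made.columns]  # flatten the multi-level labels
made = made.rename(columns={"sum": "makes", "count": "attempts"}).apply(pd.to_numeric)
made["three_pt_pct"] = made["makes"] / made["attempts"]

# Second pivot: average defender distance per player (mean is the default aggfunc).
dist = pd.pivot_table(shots, index="name", values="defender_distance")

# concat joins the two pivots on player name, then keep players with 82+ attempts.
combined = pd.concat([made, dist], axis=1)
combined = combined[combined["attempts"] >= 82]
combined.to_csv("three_point_defense.csv")
```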

For the second part of the analysis I wrote this short R snippet to create a scatter plot of the data:

This code creates a scatter plot of 3 point % versus the mean defender distance for every three point shooter with more than 82 attempts on the season.

The plot has points for every one of the 228 players that meet the criteria as well as a linear model (the blue line) with a 95% interval (the shaded region) created from the data points. What the chart tells us is … a whole lot of nothing.

The mean average defender distance across all players in the data set is 6.04 feet and the linear model is practically a horizontal line through the 6 foot line. Players with exceptional 3 point percentages like Luke Babbitt @ 51.8% on 114 shots are given an average distance of 7.67 feet, but then Lance Stephenson @ 17.1% on 105 shots is given 6.04 feet. Kobe shot 29.3% on 184 attempts but defenders were in his face at 3.98 feet and Durant @ 40.3% on 159 shots was also crowded at 4.86 feet. Looking at the linear model with its R-squared of 0.00002646 indicates that there is essentially no relationship between these two variables. In other words, a player’s 3 point %, a strong indicator of their three point prowess and reputation, does not influence how closely a defender guards them when they are in position to launch a three. What these specific numbers, and the chart overall, seem to suggest is that “daring to shoot” is not really a thing.

Comparing Event Counts with the Poisson Means Test

An exceedingly common task for crime analysts is to count the number of events that occur in one time period (e.g. day, week, month, YTD) and compare them to a prior time period. The change between these two numbers is often the source of much strategy and project-making in law enforcement. What does it mean that we had more motor vehicle collisions this week compared to last? What should we do about the increased number of mental health apprehensions YTD? What are we doing right to make robberies go down this month? These types of questions, and the actions that are taken from the discussions they generate, are the daily bread of law enforcement. Unfortunately, one of the things that is seldom done is to check if the change in the event counts is statistically significant. That is, is there really a difference between these numbers or can we just chalk it up to randomness?

This is where the Poisson Means Test comes into play. It is a statistical test that can be used to determine if the difference between two counts is statistically significant or not. In prior posts I have written about statistical significance being about deciding between two hypotheses. The first (or null) hypothesis says that there is no difference between the two sets of events. The second hypothesis (the exciting one) says that, yes, these numbers are different enough that we would only expect to see this level of difference by chance about 1 time in 20 (when dealing with a 95% confidence interval). In this case, what we want to be able to do is pick a confidence interval (I’m sticking with 95%) and look at two different event counts (let’s say robberies YTD this year and last) and say that ‘yes, something is meaningfully different about these two numbers’, according to the statistical test. And we want to be able to do this in Excel.

But before we get into Excel a little background on Poisson. A Poisson process is a process that generates independent events over a period of time. Because the events are independent the time between consecutive events is considered random. It’s a perfect process for modelling things like criminal occurrences because notionally they are unrelated and we can’t predict with specificity when the phone is next going to ring. I’m not going to get into any math in this post as it’s not super important for understanding how to use the test so just know that Poisson is about counting the number of times something happens in a time period.

In order to compare two sets of events we need to define a couple of terms. The first set of events has a total count C1 that occurred over time intervals T1. The count C1 is the total number of events but there is some flexibility in defining the time intervals and that flexibility has to do with the other event count you want to compare. In our example I have the number of robberies in Jan-Feb-Mar of 2015 and 2014. I could say that T1=1 quarter, or I could say T1=3 months or I could say that T1=90 days; it really doesn’t matter because T2, the time for the events counted in C2, is exactly the same, since we’re doing a YTD comparison between two years. However, not all use cases are like that; it is very possible to compare Jan-Feb-Mar 2015 to April 2015, in which case you’d want T1 to be 3 (months) and T2 to be 1 (month). Basically, you pick the largest common time period you have between the two data sets and use that as the basis for T.

Enough chit-chat, an example will make things clearer.

poisson_1

I am sure this looks familiar to everyone: two years’ worth of crime data, the difference in the events and the percent increase. Can you just hear the operations inspector demanding to know why robberies are up 21% and the detectives scrambling for answers? The problem is that we don’t know if that change is because of anything real or is just due to the randomness of the data. But we can use the Poisson Means Test to help us figure it out and in this example, using the nomenclature above, we have C1=44, T1=1 (because it’s one quarter), C2=56 and T2=1. What we need to do next is send those numbers through a function to tell us whether the difference is statistically significant or not. I sure hope Excel has a function for that … but it does not. However, thanks to some talented programmers I found a solution.

I first learned about the Poisson Means Test from the blog of Evan Miller, a programmer and statistician who has developed a great piece of statistical software for Mac named Wizard. Evan has a blog post on the subject of event counting where he explicitly tells readers to never make decisions on plus/minus differences without first checking the Poisson Means Test. To help facilitate this he has created a great online tool for checking two event counts to see if there is a meaningful difference between the counts.

The online tool works great but I need to be able to run these kinds of tests in bulk in Excel (where all crime data resides) and I wanted to see if I could make Excel do the test. Following Evan’s notes in his blog post and peeking under the hood at his tool’s javascript I discovered that most of the heavy lifting for the tool is performed by a javascript library known as jstat. Jstat provides numerous special functions including the, ahem, ‘Inverse of the Lower Regularized Incomplete Gamma Function’. I took the time to figure out what that means so you don’t have to and what I learned is that Excel doesn’t offer a function to do it. But, Excel does offer VBA, and the jstat source code is available on GitHub, so I ported all the javascript functions I needed into VBA user defined functions and made them available through the following GIST.

Now, I don’t expect you to understand this code because I don’t understand a lot of it. I know what it is doing but there is a lot of code in here that I suspect comes from some flavor of a Numerical Recipes book. The point is, we don’t need to know how it works, the geniuses at jstat already figured it out, I just need to make sure that it works in VBA and I didn’t mess it up during the translation. I’m pretty confident in my coding after a lot of testing and you can test it if you want by trying Evan’s tool and my Excel macro together and seeing if you get the same results (you will). You can also, if you are adventurous, try out the R library zipfR, which has a function named Rgamma.inv, which does the same thing.

So how do we use the VBA to do something useful? Check the screen shot below where I have added a fifth column to my table named P-value. This is your usual statistical p-value that, for a 95% confidence interval, needs to have a value less than 0.05 in order to indicate statistical significance. As you can see in the image, I have called the custom function PoissonPValue from the VBA code and passed in the four values I identified above: C1, T1, C2, T2. The output of the function call is 0.271, which means that, statistically speaking, there is no difference between 44 and 56, so that 21% increase is nothing to get worked up about (certainly not something to change strategy over).

poisson_2
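If you want to sanity-check the VBA from outside Excel, the standard exact version of this comparison can also be run as a conditional binomial test: given C1 + C2 total events, C1 should be Binomial(C1 + C2, T1/(T1+T2)) under the null hypothesis. A quick Python sketch with the example numbers (this is not a port of the jstat code; scipy.special also exposes gammaincinv if you would rather go that route):

```python
from scipy.stats import binomtest

# C1 = 44 robberies over T1 = 1 quarter, C2 = 56 robberies over T2 = 1 quarter.
c1, t1, c2, t2 = 44, 1, 56, 1
result = binomtest(c1, n=c1 + c2, p=t1 / (t1 + t2), alternative="two-sided")
print(result.pvalue)  # ~0.27, in line with the 0.271 from PoissonPValue above
```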

I know that this is a hard truth to swallow because it runs counter to the thinking of most law enforcement personnel. They want to believe that if a number goes up then something is to blame and that something can be handled to make the number go back down. However, from personal experience I can attest that many of the week-to-week and month-to-month differences are not statistically significant. The reality is, randomness happens, and unless we want to be chasing our tails running after statistical ghosts, we would be well advised to concentrate our efforts on differences that the stats indicate are real.

To wrap up, the Poisson Means Test is an incredibly valuable tool that you should put in your toolbox. It’s not just for your weekly/monthly/annual crime report either. You can use it to test for meaningful differences between any two sets of events. So, for example, say someone ran a project over the last two months and wants to know if you can see an impact in the numbers. If you test the events from a period before against the 2-month project period you can see if there is a stat-sig difference between the counts. Now, that’s not to say that the project is responsible but at least you can demonstrate statistically there is a difference. Anecdotally, I have found new applications for the test every week and I like it because it adds a further layer of rigour to our analysis and I think that’s important. I hope you do too.

Revisiting Aoristic Analysis

Attention users of aoristic.policeanalyst.com. I have retired the web tool and replaced it with a downloadable Excel spreadsheet. Discussion and reasoning below.

Back in 2012 I wrote a post about Aoristic Analysis where I introduced a web-based tool for conducting aoristic analysis. The reason I built the tool as a web page was that the code allowed me to split events down to the second so that they could be properly apportioned across time units. The drawback of making a web-based tool is that law enforcement analysts, typically a suspicious bunch, never liked the idea of submitting their data (which is only dates, by the way, and completely context-less) to a web server. Fair enough, we’re a paranoid bunch.

I recently had an email discussion with someone who was using my tool but wished they could have some software to do the work locally. I resolved to create an Excel version of the tool that would work in seconds like the web site. I have completed that work and the result is now available for download here: Aoristic_v3.xlsm.

The spreadsheet is a macro-enabled XLSM and that’s because I use VBA to perform the aoristic analysis. There are three custom functions – AoristicHourOfDay, AoristicDayOfWeek and AoristicMonthOfYear – and all the code in the spreadsheet can be seen in the following GitHub GIST:

Understanding the Code

If you open the XLSM file you will notice that there are worksheets with example usage of each of the functions. Since all three functions follow a similar approach I will discuss the logic behind AoristicHourOfDay used on the ‘Hour of Day’ sheet.

aoristic_v1

The ‘Hour of Day’ sheet has in column B the start time for an event and in column C the end time for the event. Following that there are 24 columns, one for each hour of the day, where the AoristicHourOfDay function is called. Looking at the embedded code above you can see on line 23 the function declaration for AoristicHourOfDay and that it accepts a start and end time as well as the particular hour of interest for calculation.

I will now go through the function line by line to explain the reasoning behind it:

Line 24, 25: The function begins by determining how many hours and how many seconds have elapsed between the start and end time.

Line 27-30: A check to make sure that the end time is not before the start time. If it is, an error is raised and the function exits.

Line 32-35: Next, the amount of one ‘unit’ is determined by dividing 1 by the number of seconds. Because this code is interested in apportioning an event down to the second, it is necessary to be able to assign units down to the second. For example, say an event was 2 hours long (7200 seconds) then every second is worth 1/7200 of the total event.

Line 37: A variable captures the starting hour based on the event start time.

Line 39-41: The code checks to see if the total number of hours for the event is 1 and the hour passed into the function is the start hour. If both of these conditions are met then the hour should be assigned the entire event (i.e. a value of 1) and the function can safely exit. This is the trivial case.

Line 43-68: If these conditions are not met then the more elaborate assignment method is used. The aoristic variable for this hour is initialized to 0 and a variable for the current hour is initialized to the starting hour for the event. Next, a for loop is initiated that loops from 1 to the total number of hours calculated for the event.

There is a lot going on in this for loop so I will break it down, section by section.

Line 45: Within the loop the code first checks to see if the current hour equals the hour passed into the function. If the hours match it means that we need to assign some portion of the event to the hour.

Line 47-51: Next, the code checks to see if the current hour is equal to the start hour and if it matches the code determines the remaining seconds in the hour and assigns that to the share variable. The reason this is done is because, for the start hour, we only want the balance of the hour assigned and to determine the balance of the hour we need to subtract the current number of minutes and seconds from 3600, the number of seconds in an hour.

Line 53-55: Next the code checks to see if the loop counter is greater than 1 and less than the total number of hours. If this is the case the hour is given 3600 seconds. The reasoning here is that if the hour matches and the event started at an earlier hour and ends at a later hour then the event existed across this particular hour and should therefore be assigned a full 3600 seconds.

Line 57-61: The final if statement checks to see if the loop counter is equal to the total number of hours. This is the same as asking if it matches the last hour and if it does we include only those seconds in the part of the hour before the event ended. This is kind of like the reverse of the reasoning for the first if statement.

Line 62: After each of the if statements are complete the variable for tracking the aoristic total for the hour is incremented by adding to it the value of the share multiplied by the unit calculated above.

Line 64-67: Outside the if statement the current hour variable is incremented and reset to 0 if it passes the 24 hour mark.

One thing people may be curious about is why I did this in a loop instead of just checking with the if statements, and that has to do with the fact that an event can span days and therefore the same hour could be hit again and again. The loop allows for this calculation in a way that the if statements on their own could not.
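If the VBA is hard to follow, here is the same apportioning idea as a short Python sketch (not a line-for-line port of the GIST above):

```python
from datetime import datetime, timedelta

def aoristic_hour_of_day(start, end):
    """Apportion an event across the 24 hours of the day.

    Returns a list of 24 shares that sum to 1.0, one per hour of day."""
    if end < start:
        raise ValueError("end time is before start time")
    total_seconds = (end - start).total_seconds()
    shares = [0.0] * 24
    if total_seconds == 0:
        shares[start.hour] = 1.0  # zero-length event: all weight to its hour
        return shares
    unit = 1.0 / total_seconds    # weight of one second of the event
    cursor = start
    while cursor < end:
        # end of the clock hour the cursor is currently in
        next_hour = cursor.replace(minute=0, second=0, microsecond=0) + timedelta(hours=1)
        segment_end = min(next_hour, end)
        shares[cursor.hour] += (segment_end - cursor).total_seconds() * unit
        cursor = segment_end
    return shares

# e.g. an event running from 22:30 to 01:30 the next day
shares = aoristic_hour_of_day(datetime(2015, 1, 1, 22, 30), datetime(2015, 1, 2, 1, 30))
print(shares[22], shares[23], shares[0], shares[1])  # ~0.167 0.333 0.333 ~0.167
print(sum(shares))                                   # 1.0
```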

Wrapping Up

The sample spreadsheet has a lot of example scenarios that illustrate different edge cases for the code. The key check on each of them is that the sum total of the values for all 24 hours should sum to 1. Excel also makes it easy to calculate percentages by hand and compare them to what the code has spit out. Having said that, while I have tested it to my satisfaction there is always the possibility for error. If you come across any odd behaviour or incorrect results please let me know and I will fix it and put a new version up.

I think this Excel method is superior to the web based tool as it allows the same level of detail as the web page plus it allows analysts to look directly under the hood and see how the work is being done. Best of all, you can download the spreadsheet and run everything locally and not have to worry about submitting sensitive data over the web.

One final tip: if, instead of the share of the event, you want to know the total time spent in each hour, create a new column that contains the event duration in seconds and multiply it by the portion in each of the hour columns. This new table will then show how much time was spent in each hour for each event.

Medians and Excel Pivot Tables

One of the issues with pivot tables in Excel 2010 (and earlier versions) is that it isn’t possible to calculate median using the Value Field Settings. In this post I am going to first extol the virtues of the median (again) and then demonstrate a formula for calculating the median in a cross tabulation.

Medians Again?

Why do we care about the median and why can’t we just use the mean? The mean is the proper measure of central tendency when what we want to average is normally distributed. In those situations, adding everything up and dividing by the number of elements will give the value right at the peak of the distribution. However, when our distribution of values is heavily skewed the mean is less representative than the median. In those cases it makes more sense to take the middle value (i.e. the median) of all the values when they are ordered from smallest to largest.

The classic example of using the median is household income. There are a lot more small income households than large income households and those big households can really distort the mean. For example, let’s say that we have 7 households with incomes (in $10,000s) of 1, 1.4, 1.8, 2, 2.1, 2.4 and 13.2. The mean of those numbers is approximately 3.4 while the median is 2. Clearly 2 is a better representation of the centre of the data than the mean, which has been dragged higher by the one large income.
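As a quick check in Excel, =AVERAGE(1,1.4,1.8,2,2.1,2.4,13.2) returns roughly 3.4 while =MEDIAN(1,1.4,1.8,2,2.1,2.4,13.2) returns 2.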

A similar issue occurs with any data set that is zero based. For example, say you want to do an analysis to determine how long offenders wait between their arrest and when they are fingerprinted. Many services schedule identification appointments days or even weeks after the arrest and it might be interesting to see what the average time is between arrest and appointment. The shortest possible wait is 0 days (the fingerprinting is done on the day of the arrest), and in general one would expect a shorter wait rather than a longer one. While there may be some 28-day waits, we don’t want these outliers to unduly skew our average, so we choose to take the median rather than the mean.

An Example

Let’s take this example further by considering a spreadsheet where we have 1000 rows of arrest times, fingerprint times and the duration in days between them.

median_1

The easiest way to determine the median would be to use the formula =MEDIAN(D2:D1000) but that would only give us the median for the entire data set. When analysing our data with pivot tables we are often interested in cross tabulating based on a particular element of the data, and in this example we want to see whether the wait time is affected by the day of the week of the arrest.

To begin the analysis we first insert a new pivot table and use it to determine average (by which Excel means mean) wait time for each day of the week.

median_2

Have a look at these mean values, particularly Saturday. What is up with that? I’ll tell you: there are errors in the data set. Someone has keyed the dates in incorrectly and there are outliers that are horribly skewing the results. This is the other benefit of working with the median: medians resist the impact of outrageous outliers in a way that the mean doesn’t. We could go back in and do some data cleansing but let’s leave it for now so we can see how the median will handle it.

If you look at the list of options in the Value Field Settings you’ll see that Median is not an option. Since pivot tables don’t provide the option for determining the median we need to do some copying and pasting and use a formula to make it happen. Whenever I am investigating the median I first copy-and-paste-values the results of the mean pivot table to a new worksheet so that I can use the labels in my function calls. The new worksheet looks like this.

median_3

To calculate the median for each of the days of the week it is necessary to pull only those rows of data in the Data worksheet that match the particular day. Once I have those rows I use the MEDIAN function to calculate the specific median for that day of the week. To do this I use the formula in cell D4 below:

median_4

It is important to notice that the formula in the screen shot is surrounded by curly braces. You don’t type the braces in because Excel adds them when you create what’s known as an Array Formula. Array formulas are different than regular Excel formulas because they work with a whole series of data values rather than a single data value. To tell Excel that you intend a formula to be an array formula you must press CTRL SHIFT ENTER (all three keys at once, like CTRL ALT DEL) at the end, rather than just hitting ENTER. You’ll know that you did it right because Excel will add the curly braces.

In this particular array formula what we’re doing is using an IF formula to check every one of the Day of Week values in column A of the Data worksheet to see if it equals Sunday, the value in B4. If the value does equal Sunday we return the wait time from column D for that row. The IF statement will check every row and return a range of values (all of the ones for Sunday) to the MEDIAN function, which in turn calculates the median wait time for Sunday. The formula can then be filled down for each of the days just like you would with a regular formula.
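In case the screen shot is hard to read, the formula looks something like this (the Data sheet name and the A2:A1000 and D2:D1000 ranges are assumptions based on the description above, so adjust them to your own layout):

{=MEDIAN(IF(Data!$A$2:$A$1000=B4,Data!$D$2:$D$1000))}

Remember that you don’t type the braces; Excel adds them when the formula is committed with CTRL SHIFT ENTER.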

median_5

You can see in the screen grab that we have median values for each of the days of the week. Clearly the outlier that impacted Saturday has been handled by using the median, but also note that the other days (except Monday) now have significantly smaller values. This is further evidence that the median is the correct choice for this data, as using the mean, even when the data is clean, would lead to erroneous overestimates.

Conclusion

The bad news is that Excel can’t calculate medians in a pivot table. The good news is that with a simple formula and the labels produced by the pivot table it is possible to calculate the median for different cross tabulations. It does require a little experimentation with array formulas but this particular example is pretty tame and serves as a decent introduction to this very powerful technique.

 

Police Analytics in HQ Magazine

I wrote a Police Analytics article for the most recent issue of the OACP’s H.Q. Magazine. I tried to communicate the value of data to police services and how my Service has approached the challenge.

This article was originally published in the WINTER 2014-15 issue of H.Q. Magazine, the official publication of the Ontario Association of Chiefs of Police.

At the beginning of 2014, Halton Regional Police Service (HRPS) created a new unit with a mandate to exploit police data assets to support evidence-based decision making. Combining scientists, programmers and police experts, the new unit serves as a clearinghouse for quantitative analysis and report development for customers throughout the Service. From frontline platoons and investigative bureaus to administrative units and the executive branch, this unit, named Police Analytics, has changed the way analytic insights are generated and delivered to in-house customers.

Historically, police services have spent considerable resources gathering data, first in the form of paper documents and subsequently in the form of records management systems. The primary function of this data has been to document activity and support investigations, but as these data stores have grown in size and complexity, the possibility of conducting large-scale analysis to produce efficiencies and insights has been recognized. The challenge is that the data is not in a format conducive to analysis, but if that challenge can be overcome, tremendous value can be realized.

Business Intelligence and Police Analytics

Recognizing this opportunity, HRPS chose to invest in a Business Intelligence (BI) software system to better exploit these data assets. Business Intelligence is a combination of hardware and software designed to transform existing data resources into a form that is better suited to reporting and analysis. The work is highly technical but the end result is a single database that contains the entirety of a Service’s data. The HRPS IT department, working closely with our BI vendor, spent over 18 months creating a BI database tuned to the Service’s needs and the final result was unparalleled access to our information. But data alone is not enough; you also need skilled analysts who can intelligently and imaginatively utilize that data to produce results, and those skilled analysts work in our new Police Analytics unit.

Police Analytics was envisioned as a different kind of law enforcement analysis; it isn’t crime analysis or intelligence analysis but is instead a data-centric role that provides quantitative analytical products to all levels of the Service. We realized early on that we wanted individuals with a background in math, engineering or the sciences so that they would be capable of performing complex statistical work. Further, the new analysts needed to be highly technical so that they would be comfortable working with databases and writing software to perform their analysis. This combination of skill sets echoes many of the talents of programmers and developers in the world of tech companies and so that was the model we adopted. To get the best people we decided to hire expertise from outside law enforcement and in so doing we created a tech start up inside the Police.

Challenges for a New Unit

Like any start up, there have been growing pains. Our initial conversations focused on where in the organization to position this new unit and while, from a technical perspective, the unit could logically fit in IT, from an analytical perspective it was decided that the unit should fall under the Chief’s Staff. The access provided by working directly for the executive allows the analysts to have a more direct line to senior command—better to communicate analytical findings and field requests—and working alongside other executive units such as planning, audits and policy meant that the analysts could develop a holistic understanding of how data flows throughout the organization. The placement of a highly technical unit outside of the traditional IT infrastructure was a novel undertaking and providing the needed access for police analysts to do their work meant that policies and practices had to be adapted. Consensus was reached through negotiation and collaboration between departments and we were able to ensure data integrity and successfully address security concerns.

The next challenge was one of branding. We had constructed a high-functioning unit that produced useful analysis but we needed the Service to know about it. To address that issue we started an internal campaign advertising ourselves throughout the Service as the source for statistics and analytical products. We positioned ourselves as a resource for Service members of any rank to get data and advice to support their projects and we emphasized the value of having a solid quantitative foundation to produce successful project outcomes.

Evidence-based Decision Making

Our outreach efforts focused on promoting a culture of data-driven, evidence-based decision making and we encouraged Service members to think about data collection and how subtly adjusting business practices could lead to better data, which in turn would lead to better analysis. As an example, our computer systems allow officers to push a button every time they change activity, but some officers had gotten in the habit of not pushing the button and this led to data gaps. To address this issue we communicated to officers how consistently updating their status led to much more accurate activity reporting that better captured all of the work they performed throughout the day. When officers discovered the benefits of pushing the buttons, namely credit where credit is due, they modified their behaviour and adopted a data-driven mentality.

We’ve worked hard to change behaviours and clean up our data sets and we’ve started to see rewards for those efforts. One of our successes has been a project that studied call volume and peak period staffing. Calls for service fluctuate throughout the day but our officers work 12-hour shifts, and that leads to peaks and valleys in officer busy-ness. By accurately capturing officer status changes we obtain an extremely detailed view of what patrol officers are doing throughout their shifts, and with this level of detail it is possible to broadly categorize officer time as either ‘busy’ or ‘available’. For our analysis, we aggregated hundreds of thousands of time-stamped officer activities to construct daily and hourly profiles of when our officers have the heaviest work load, and using that data senior commanders are able to construct additional shifts to move officers to busy times. The end result is a reduction in officer busy-ness during peak periods and a levelling of the overall work load across day and night shifts. Because the system is data driven, we are able to measure the impact of the shift changes and quantitatively demonstrate our success.

Increased Data Awareness

Beyond analytical work the police analytics unit also studies organizational business practices to identify “pain points”. If a business process is cumbersome or time consuming we brainstorm as a unit how to rectify the problem. This practice has led to the development of a number of reporting tools that repackage existing data assets into more useable formats. A few examples include summary reports of field contacts for front line officers, absence management reports for human resources, and occurrence mapping tools for crime analysis. The point of these reports is not that they contain analysis but that they take existing data that is stored in isolation and synthesize it into an easily read report. Where before an individual tasked with reviewing these entities may have spent hours clicking through the RMS, they can now see everything in one report.

Perhaps our biggest success in this vein is the introduction of our officer activity report that allows Service members to review their activity. Our RMS captures officer activities such as arrests, charges, and tickets and our new reporting tools allow supervisors to review monthly summaries of this information for their units in a web-based, on-demand format. This reporting tool offers many improvements over the old, self-reported monthlies including accuracy, standardization and time savings. This tool has eliminated the need for members to waste time collecting and collating data that has already been captured in our databases and has resulted in a greater awareness, both for members and for command, of officer activity.

Lessons Learned and the Future

With our success establishing the police analytics unit HRPS has learned a number of lessons that may be instructive to other Services looking to create a data-focused unit:

  • You need buy-in from the highest levels of the organization. When attempting to create a data-driven culture of analysis that message needs to be embraced and communicated through management.
  • Success is built around people and not technology. Simply buying a piece of software does not solve your problems; you need a small core of dedicated experts who believe in the mission to find success.
  • Success takes time and progress will be slow. It is not possible to successfully influence the culture of a large organization quickly and that’s doubly true when the changes are related to a complex matter such as data.
  • Change is an iterative process. Only when you start looking will you see how your processes need improvement and once those improvements are in place you’ll need to gather more data and you’ll likely see more necessary changes.
  • The unit needs access to senior management. The police analytics unit needs the latitude to communicate proposed changes to senior officers and receive approvals quickly.

HRPS is proud of what our police analytics unit has accomplished in a short time. We have made meaningful contributions to improving the efficiency and effectiveness of the Service and have successfully promoted the value of data-driven decision making in the organization. We also spent 2014 promoting our vision for police analytics to other Services in Canada as well as at conferences in North America and Europe, where our ideas have been enthusiastically received. In 2015 we plan to ramp up our software development and update our web-based tools so that officers in the field can access information on their tablets and phones. The volume of data captured by the police is going to keep growing and specialized systems and analysts are needed to extract value from those assets. We believe Police Analytics is the future of law enforcement analysis and we encourage everyone to consider how it might benefit their organization.

 

Prioritizing Using Exponential Decay

I had an officer ask the other day if I could identify the top 10 subjects in our records management system who were involved in mental health occurrences. On the surface this is a straightforward request, identify the people with the most activity, but it gets trickier when you factor in the passage of time. For example, is someone who had 3 incidents over 6 months ago more of a concern than someone who had 2 incidents in the last 8 weeks? Clearly recency should factor into how a person is prioritized but history should also weigh in. To address questions like this we can use an exponential decay equation in Excel to preferentially weight recent incidents while still giving some weight to the past. In this post I will outline how to apply the method.

The first step is to obtain some data. Depending on the nature of the incident I will set an arbitrary historical cut-off; in this example we’ll set it at a year. This means that we need to extract every mental health occurrence over the last year along with the date of the occurrence and a unique person identifier.

exponential_1

For each occurrence we then calculate the age of the incident in days by subtracting the date of the occurrence from today. This is done using Excel’s NOW() function to get the current date and the INT() function to trim off the time, since we are working strictly with days. The following screen shot shows the formula in the formula bar.

exponential_2
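If the screen shot is hard to make out, the formula is along the lines of =INT(NOW()-B2), where B2 is assumed to hold the occurrence date; the actual column in your data may differ.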

Next we want to create a column for the exponential decay formula. Most people are familiar with the idea of exponential decay: it describes a quantity that decreases at a rate proportional to its current value. In practice that means the value drops off quickly at first and then more slowly as it shrinks. The chart below shows a graph of exponential decay for a number of different constants. As can be seen in the chart, we can tune how fast the values drop off by adjusting the constant.

640px-plot-exponential-decay-svg

In Excel, the formula for exponential decay is =EXP(-constant*time), where we specify the constant and where time, in our case, is the age of the incident. Since we’re dealing with days and we only have a year of data we can decide how much we want past incidents to influence a person’s priority. For example, if we say that a mental health incident that happened today is worth 1 point, how much is an incident that happened 6 months ago worth? If we say that it’s worth 0.5 points then we can calculate our constant as -1*LN(0.5)/180, where LN() is Excel’s natural logarithm function, 0.5 is the point value we want a 6-month-old incident to have and 180 is the (approximate) number of days in 6 months. This calculation results in a value of 0.003851, which we can use as our constant.

exponential_3
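You can sanity-check the constant right in a worksheet cell: =-LN(0.5)/180 returns 0.003851 (to four significant figures), and plugging it back in with =EXP(-0.003851*180) returns almost exactly 0.5, the point value we wanted a six-month-old incident to have.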

Back in our data worksheet we can finally create our new column that calculates the decay score for each incident based on its age. We use the exponential decay formula =EXP(-1*0.003851*age) to calculate the decay score for each incident, as in the following screen shot. Note that the decay value will never be more than 1 and, although it shrinks as the incident ages, it never quite reaches 0.

exponential_4

Now that we have a decay score for each occurrence we need to create a pivot table of all the data. Highlight everything and select ‘PivotTable’ from the ‘Insert’ tab. This will open a modal window and you can just hit the ‘OK’ button to create your pivot worksheet.

exponential_5

On the pivot table worksheet drag the Identifier field to the Row Labels box and the Decay Score field to the Values box. Click on ‘Count of Decay Score’ in the Values box and select the ‘Value Field Settings’ option at the bottom of the pop-up window. Change the selected option from Count to Sum and click ‘OK’.

exponential_6

Almost done. Back in your pivot table right-click the Sum of Decay Score column and sort ‘Largest to Smallest’. This will sort the persons by their aggregated decay scores and the people at the top are the highest priority individuals.

exponential_7

This method works because the exponential function embodies the principles we were originally trying to capture: it gives more weight (a higher score) to recent incidents but it also gives some score to historic incidents as well. For every occurrence a person is involved with they have the potential to add up to 1 point to their total score (if the incident happened just today) but as the incidents age they influence the total score less. In our example above, the person with the 3 incidents over six months ago has, at most, a score of 1.5 (since 0.5 points is the 6-month mark) while the person with 2 incidents in the last 8 weeks is likely to have a score a little less than 2. So the system scores these two people comparably, balancing the influence of recency and history. The best part is that we can tune how much weight we give to the two factors by adjusting the exponential decay constant. If we decide that an incident 6 months old is only worth 0.1 point rather than 0.5, which greatly privileges recency over history, we can adjust the constant and rescore everyone. On the other hand, if we wanted to make a 6-month-old incident worth 0.9 points, we could recalculate the scores and privilege history over recency.
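For reference, keeping the six-month anchor at 180 days, the 0.1-point version of the constant works out to -1*LN(0.1)/180 ≈ 0.0128, while the 0.9-point version works out to -1*LN(0.9)/180 ≈ 0.000585.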

I like this method of prioritizing because it’s simple to implement in Excel (no plug-ins!), it’s logically justifiable and it allows the impact of recency and history to vary according to the expertise of the analyst. I hope you find it useful and good luck prioritizing!

Analysis of Ontario Primary School Class Sizes Part Two: Examining Split Classes

800px-FVES_Classroom

To practice my skills as a data analyst I decided to look into some of Ontario’s Open Data data sets. The primary school class size data set interested me because I have children in the primary grades and the topic is often in the news. 

Summary of Analytical Findings

  • Public school split classes have increased from 47% of all classes in 2007-08 to 54% of all classes in 2013-14 while Catholic school split classes have increased from 27% in 2007-08 to 48% in 2013-14.
  • Counting only classes with a JK or K student, Public school split classes have increased from 73% in 2007-08 to 89% in 2013-14 while Catholic school split classes have increased from 26% in 2007-08 to 74% in 2013-14.
  • In 2007-08 30% of Grade 1, 2 and 3 classes had 18 or fewer students while in 2013-14 that number is down to 23%.
  • Grade 1, 2 and 3 mean class size has grown slightly from 19.1 students per class in 2007-08 to 19.3 students per class in 2013-14. The standard deviation for class size has shrunk from 1.99 students in 2007-08 to 1.75 students in 2013-14, which indicates that both large classes (>20 students) and small classes (<20 students) are becoming less common.
  • On average, since the 2007-08 school year, 9% of public school board split classes with 19 or more students have been split with 4 or fewer students from a different grade. The Catholic school board has averaged 11% over the same period.

Introduction to the Analysis

As discussed in Part One of this analysis, I have a child in the primary grades in Ontario and I am interested in how the Ontario Government is working to enact its 2003 class size reduction initiative (PDF) which sought to have 20 or fewer students in 90% of all primary classes and 23 or fewer students in every primary class by the 2008-09 school year. While part one of my analysis focused on examining claims made by the government regarding class size, part two focuses on split classes, or classes that contain students from more than one grade.

A First Look at Split Classes

The following chart illustrates the percentage of split classes in the public and Catholic school boards between the 2007-08 and 2013-14 school years. The percentage is calculated by dividing the number of split classes each year by the total number of classes for the year, for each school board. The analysis includes every class in the Open Data data set, not just the Grade 1, 2 and 3 classes used in Part One of the analysis.

overall_splits

The chart shows a gradual increase in splits for the public board and a marked increase for the Catholic board. Considering the rise of all day kindergarten I decided to look at only JK and K split classes. This resulted in the following chart:

jk and k split classes

The chart reveals that since 2007-08 the Catholic school board has embraced split JK and K classes, and this shift accounts for the majority of their 21 percentage point gain in overall split classes over that period.

When we exclude JK and K classes from the splits we see a gentler increase, one that has actually fallen back in 2013-14 from earlier highs.

non jk and k split classes

The Impact of Splits

Based on the results from Part One of the analysis, the Ministry has been almost entirely successful at eliminating large classes, and it seems that creating more split classes is one way they achieved this result. To their credit, they acknowledge this point on their FAQ page.

combined_v0

To my reasoning, if split classes are able to help reduce the size of large classes then that means there must be excess capacity available to absorb those students (i.e. space to make a split). This thought led me to examine the trend in the class size distribution between 2007-08 and 2013-14 and I discovered that small classes used to be more prevalent than they are today. For example, in 2007-08 30% of Grade 1, 2 and 3 classes had 18 or fewer students while in 2013-14 that number is down to 23%. In other words, small classes are becoming less common.

Another example of this tendency can be seen in the following table which lists the mean class size and standard deviation (both in units of students) for every Grade 1, 2 and 3 class in each year.

stdev

The standard deviation indicates that over time the spread in class sizes has shrunk as more classes achieve the desired 19 and 20 student class size. What this reveals is that while large classes are getting smaller, small classes are getting larger as both move towards the desired standard.

The Personal Angle

The whole topic of class sizes and splits piques my curiosity because my son’s grade one class was a split that consisted of 15 grade 1s and 4 grade 2s. A look at the Open Data data file for my son’s school reveals that there was another grade 2 class with 19 grade 2 students and a third class, also a split, with 9 grade 2s and 10 grade 3s. It is noteworthy that all three classes have 19 students and therefore meet the “20 or fewer” guarantee of the Ministry of Education, but in order to make it happen four grade 2s had to be housed away from their cohort in a grade 1 class. If those four students were distributed to the other two grade 2 classes their class numbers would have been raised to 21 and the classes would have failed the 20 or fewer test. My question is, was the cost of creating a grade 1/2 split and separating the four grade 2s from their grademates really less than having two Grade 2 classes with one extra student?

I performed one last piece of analysis to try and discover how common asymmetric split classes like my son’s Grade 1/2 split are. By looking only at split classes in Grades 1, 2, 3 and 4+, and then only at splits that have a class size of 19 students or larger and a split size of 4 or fewer students, I discovered that, on average since the 2007-08 school year, 9% of all public school board splits, or approximately 475 classes a year, have had this level of split asymmetry. The Catholic school board average is higher, with 11% of all their splits being highly asymmetrical, but due to their lower number of classes it amounts to only about 212 classes a year.

The Ministry published a PDF titled “An Introduction to Combined Classes” that speaks about the benefits of split classes and the reality that even single grades have a spectrum of students with different abilities. I don’t disagree with the points made but my larger concern is about having my child taken out of his cohort, as the 4 grade 2s mentioned above were, for an entire school year. My child will be with his grademates for years, they will be his childhood friends as he grows into his teen years, and I don’t want him to lose a year of social integration so that a school board can meet the Ministry’s class size numbers. This isn’t just my concern either: today was the first day of school, and not a single Grade 2 parent waiting to find out which class their child was entering wanted them to be in the Grade 1/2 split. The very idea carried a stigma with both parents and children.