Joseph Ryan Glover

Random stuff.

The Art of Filtering Duplicates in Excel

Every day analysts export data from myriad sources in an effort to consolidate raw data into usable intelligence. To make good decisions the data needs to be as accurate as possible, and one aspect of accuracy is ensuring that our information contains no duplicate data. When working with only a couple dozen rows of data, sussing out duplicates is easy, but when you have thousands of rows visual inspection just doesn’t cut it and it becomes necessary to employ some automated techniques. In this blog post I’m going to detail some of the Excel tools and strategies available for handling duplicate data.

The Basic Example
The simplest example is the straight removal of duplicate rows. Below is a stripped down data set with 3 columns and 5 rows of data.

duplicates_0

Clearly the fourth and fifth rows are duplicates. To filter this data we’ll use Excel’s built-in ‘Remove Duplicates’ tool, which can be found on the ‘Data’ ribbon in the ‘Data Tools’ section (see screen grab below).

duplicates_1

To use the tool, first highlight the data in the spreadsheet and then click the ‘Remove Duplicates’ button; this will call up the ‘Remove Duplicates’ window. There are some options here but in the simple case the default values are sufficient. Click the ‘OK’ button and Excel will delete the duplicate rows, leaving only unique rows.

duplicates_2

Filtering on Columns
In the screen shot of the ‘Remove Duplicates’ window you can see that each of the data columns has a checkbox beside it. These checkboxes allow you to search for duplicates using only the checked columns as the matching criteria. This may seem a bit esoteric so an example will help demonstrate how useful filtering on certain columns is.

In the screen shot below is a data set that contains some occurrence numbers and addresses connected with the occurrences.

duplicates_3

Imagine that from this data set we want only the unique occurrence numbers and we don’t care about the street addresses. To get just the numbers, again highlight all the columns and click the ‘Remove Duplicates’ button. But this time, when the ‘Remove Duplicates’ window comes up, uncheck the ‘Address’ and ‘Type’ checkboxes and click ‘OK’.

duplicates_4

The result is that Excel will filter out three rows and leave two, one for each of the occurrence numbers in the data set.

duplicates_5

Filtering Sequenced Data
One question you may have about filtering by specific columns is how does Excel decide what rows to keep and what rows to delete? It’s actually pretty straightforward: Excel keeps the first row it finds and dumps the rest and we can exploit this behaviour to achieve a helpful result.

Have a look at the data set below that lists three occurrences and a series of disposition shifts for each occurrence.

duplicates_6

You’ll notice that over time the disposition of an occurrence shifts as new information comes in. For example, many occurrences start as a ‘Default’ but shift to Report To Follow (RTF) or Unfounded (UNF) as the incident evolves. This is important because when calculating crime stats it’s crucial to know the final disposition for an occurrence so that it can be properly reported (especially unfoundeds).

To get the last recorded disposition we can first sort the data by ‘Occurrence’ ascending and then by ‘DateTime’ descending. If we then highlight the data and use ‘Remove Duplicates’ (with only the ‘Occurrence’ column checked) we can exploit Excel’s programming to remove all rows except the first one for each occurrence number. This, very conveniently, leaves us with only the final dispositions for each occurrence number thanks to our clever sorting of the data.

duplicates_7

Rolling Your Own
All of the above examples use Excel’s built-in ‘Remove Duplicates’ tool but it is possible to manually identify duplicates. Why would you want to do this? Primarily because Excel deletes duplicate rows and doesn’t give you the opportunity to review them. Let’s look at an example using the data from the Filtering Sequenced Data example.

The first thing to do is sort the data based on the duplicate column. While it is possible to use multiple columns (which requires sorting on each of them) this example will just filter on the ‘Occurrence’ column. Look at the screen shot below and you’ll notice that I added a new column and named it ‘Dupe’. Also notice that in the first cell under ‘Dupe’ (cell B2 in this example) I put a 0, while in the subsequent cells I used Excel’s IF function to check the current row’s occurrence number against the previous one. As can be seen in the formula bar of the screen shot, the formula in cell B3 is =IF(A3=A2,1,0).

duplicates_8

What this function is doing is examining the occurrence number in cell A3 and determining whether it is the same as the occurrence number in cell A2. Recall that the IF function evaluates statements (in this case whether A3=A2) and if the statement is TRUE it returns the second input (1 in this case) and if it is FALSE it returns the third input (0 in this case). With the statement in place in B3 I then filled the function down to B6 and Excel intelligently populated each cell of ‘Dupe’ as in the screen shot above.

You’ll notice that the IF functions have accurately identified which rows are duplicates by putting a 1 in the ‘Dupe’ column. Just like with Excel’s ‘Remove Duplicates’ tool, only subsequent records are flagged as duplicates. This means that if you sort the data by both ‘Occurrence’ and ‘DateTime’ you can obtain the same results as in the Filtering Sequenced Data section.

One thing to note, though: if you re-sort your data the ‘Dupe’ column will break. This is because the IF function always checks the current row against the previous one, and if the occurrence numbers are not in the proper order the results won’t make any sense. To lock the results in you should copy the ‘Dupe’ column and paste it back as values. This will replace the function calls with their respective 1s and 0s and now you can sort the data just like you otherwise would. I occasionally use this method to first review the duplicates and then sort on ‘Dupe’ so that I can delete all the rows with a 1. The end result is a list of unique rows.
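
As a variation on the same idea (a sketch, not something covered above): the COUNTIF function can build a flag that doesn’t care what order the rows are in. The expanding range below counts how many times the current occurrence number has already appeared in the rows above it, so exactly one row per occurrence number gets a 0 no matter how the data is sorted (though for the sequenced-data trick you would still sort by ‘DateTime’ descending first so that the unflagged row is the latest one).

Cell B2, filled down to the bottom of the data (occurrence numbers assumed to start in A2):

=IF(COUNTIF($A$2:A2,A2)>1,1,0)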

Wrap Up
Filtering duplicates is a fact of analytical life and it’s important to be proficient at culling excess data so that stats aren’t double counted. I hope this post was helpful in identifying the various ways to get a handle on dupes.

Excel’s VLOOKUP Function for Crime and Intelligence Analysts

VLOOKUP is one of those functions that causes a lot of grief for analysts. I think it’s because the function is kind of complex to implement and most people don’t use it often enough to commit to memory exactly how it works. In this blog post I am going to explain exactly how to use VLOOKUP with the aid of two analyst-orientated examples.

An Introductory Example

At its core VLOOKUP (the V stands for vertical, by the way) is used for looking up one value based on another … vertically. To demonstrate what I mean have a look at the screen shot below, in which a list of Uniform Crime Reporting (UCR) codes (column A) is listed side-by-side with the description of the crime (column B).

vlookup_1

Together column A and column B constitute a “lookup table” because you can use a known value in column A to find the corresponding value in column B (and vice versa). But why do we need lookup tables? We need them because it’s likely that when you export a list of crime stats from your database you’ll see the codes rather than the descriptions. And while the UCR code is shorter and easier to manage, there will eventually come a time when you want to replace the code “1110” with the friendlier “Murder 1st Degree” (maybe when you are preparing a final report to send around) and that’s when you use VLOOKUP to look up the description based on the code.

The key thing to remember about VLOOKUP is that when you’re using it you always have two different sets of data: the first is the data you are working on and the second is the lookup table. The screen shot below shows a sheet with both sets of data; the working data is on the left and the lookup table is on the right in grey. Notice that the data being worked on has an occurrence number and a UCR code but no Crime.

vlookup_2

But how to use it? Have a look at the above screenshot again, specifically cell C2, and note that the function in that cell is =VLOOKUP(B2, $D$1:$E$16,2,FALSE). Let’s break that function down into pieces.

  • For the first input to VLOOKUP, we’re instructing Excel to use the value in cell B2 (the known UCR code) to find the description to put into cell C2.
  • For the second input we’re telling VLOOKUP to look for the value of B2 in the cell range $D$1:$E$16, which is the entire range occupied by the lookup table. (As an aside, you’ll notice that the range values all have dollar signs in front of each letter and number. This is because when we fill this function down we don’t want Excel to act ‘intelligently’ and start shifting the references. Using the dollar sign locks the reference so that no matter how we fill it, it will always point to the same set of cells.)
  • For the third input we use the value 2. Notice that our lookup table has only two columns (D and E) in this example, one that is the code and one that is the crime. Since we already know the code (from column B) what we want is the description, and since the description is in the second of the two columns in the lookup table we specify that we want the result to come from column 2.
  • The final input to the function is ‘FALSE’. By entering FALSE we are telling Excel that we want an exact match. That means that if Excel can’t find UCR code 1110 it’s going to return #N/A. There are situations where you might not want an exact match but they are less common, and sticking with FALSE will get you what you want the vast majority of the time.

With the function in place it’s just a matter of filling it down to populate all of the cells in column C with the crime description based on the UCR code. If you click on any of the other cells in column C, as I did in the example below, you’ll see that the lookup table, locked as it is by the dollar sign ($) notation, always references the same range of cells. I bring this up again because the VLOOKUP mistake I see most often is that people don’t lock the range and get weird results when they fill the formula down. Make sure to use those dollar signs!

vlookup_3

A More Advanced Example

While the example above is the most common usage of VLOOKUP, I frequently use it for another purpose: to compare two lists of things (e.g. occurrence numbers, suspect names) to determine if the lists are the same and to highlight the elements that are different. The following screen shot demonstrates how I lay out my Excel file to perform this work.

vlookup_4

The two columns of occurrence numbers in columns A and C are almost the same; in fact they differ by only 3 values. To identify those three values we can use VLOOKUP to look up one set of occurrence numbers against the other and see what happens. The above screen shot shows the function in cell B5 to be =VLOOKUP(A5, $C$2:$C$20, 1, FALSE). Just like before we are checking each value against a table of values, but this time it just so happens that the lookup table only has one column (instead of the two above) and that’s why we have a “1” as the third input.

If you fill the function down, the occurrence numbers get filled in and, where VLOOKUP can’t find a match for the cell in column A in column C, Excel outputs #N/A. That last sentence may seem overly precise but it is meant to illustrate an important point: the VLOOKUP detailed above is only checking the values in A against the lookup table and therefore will only catch those values in A that aren’t in C. To get the full picture you also need to perform the reverse lookup, that is, catch the values that are in C but not in A. To do that you can put the following function call =VLOOKUP(C2, $A$2:$A$20, 1, FALSE) in cell D2 and fill down. Now, just like before, we’ll see matching occurrence numbers if the values are in both columns and #N/As for those values that are in C but not in A. Taken together these two VLOOKUPs will help identify how the two lists differ.

Getting Even More Advanced

One more thing: I really don’t like Excel’s blaring #N/A error value. It’s ugly. Also, I don’t know about you, but I don’t really care when the occurrence numbers match; I’m more interested in when they don’t match and I’d prefer not to see anything for the positive matches. Let’s see what we can do about fixing this up.

To accomplish this, we’ll need to use the VLOOKUP function in conjunction with another Excel function, ISERROR, that we haven’t talked about. I’m not going to pretend this isn’t getting complicated, because it is, but I’m doing it anyway. In cell B2 I replace the old VLOOKUP function call with this new one:

=IF(ISERROR(VLOOKUP(A2, $C$2:$C$20, 1, FALSE)), "DIFF", "")

I know that looks intimidating but let’s break it down. The part about VLOOKUP we already know, it’s the same as above, but what we’re doing now is feeding the output of VLOOKUP to the function ISERROR. ISERROR’s only purpose is to return a TRUE or FALSE based on whether an error occurs within the VLOOKUP and, as we know, three errors occur, one for each of those #N/As. For each of those three #N/As the ISERROR function will return TRUE and for all the rest it will return FALSE. This behaviour is very handy because the IF function checks for TRUE or FALSE when trying to figure out what to do. In this case, if it gets a TRUE from ISERROR (that is, it’s true that there was an error) then it prints the word “DIFF” but if it gets a FALSE, it doesn’t print anything because there was no error.

I know that’s a brainful but have a look at the screen shot below and you can see that each of the cells that was previously an #N/A is now a “DIFF”, because those are the ones that caused a VLOOKUP error that in turn caused ISERROR to be TRUE which caused the IF function to print the value for TRUE. All the rest, the ones without VLOOKUP errors, display nothing, just as I wanted.

vlookup_5

Phew, that’s a lot of Excel-fu for a post that started out being just about VLOOKUP. I hope the post (or at least the first two parts) gives you the confidence to use VLOOKUP more often in your daily work. And I hope that the third part, as intimidating as it looks (but it’s really not, just follow the logic of the TRUEs and FALSEs), will give you ideas about how you can use the function to accomplish more complex tasks. Good luck!

Understanding the Difference Between Mean and Median

As analysts we are often interested in averages. But an average is a tricky thing because the word average actually covers several related concepts. In this blog post I am going to discuss two different concepts of average – the Mean and the Median – and why it’s crucial to understand their differences.

The Mean is what we typically mean when we use the term average. The Mean is calculated by adding everything up and dividing by the number of items. The resulting value is supposed to represent the centre of all the values. People intuitively understand the Mean because a lifetime of being exposed to the concept of the bell curve (or the normal distribution as it is called in math circles) has taught them that the big peak in the middle of the curve is the average.

The Median is less well known but just as easy to conceptualize. Consider all of the items in your collection and sort them from smallest to largest. The Median is then the value of the item that sits in the middle of the sorted list. In other words, it’s the value of the item that splits the collection in half with an equal number of items above and below it.

When dealing with data that follows the bell curve, that is data that produces a symmetric and unimodal (single peak) distribution, the Mean and the Median have the same value and we don’t have any problems. The issue is that a lot of the data we’re likely to encounter in our day-to-day work—such as call response times, time spent on patrol, man-hours spent on an occurrence—is not accurately represented by a symmetric bell curve but instead follows a skewed distribution that is weighted more to one side than the other.

But what does this mean for averages? It means that for data that is positively skewed (the bulk of the data is on the left side of the chart) the Mean is going to be larger than the Median, and vice versa for negatively skewed data. The problem arises when the difference between the Mean and Median gets large enough to meaningfully change the stat that you are reporting.

For example, the number of man-hours spent on an occurrence is the kind of data that is going to be positively skewed. This is because of the nature of the data: 0 is the lowest value that can exist but the upper limit is not bounded. If you create a histogram of the data you’re likely to find a whole lot of occurrences with a small to medium amount of time being spent (the peak on the left) but also a few occurrences with a lot of time being spent (the long tail on the right). I’ve created the following simplified data set below to illustrate the scenario.

You can see in the screen shot that the occurrence that took 24 hours is clearly an extreme example but it has a significant impact on the Mean. If a Commander came to you and asked for the average number of man-hours spent on occurrences do you feel that the Mean of 8.8 would be representative, considering that only one occurrence actually took longer than that? Clearly the Median value of 5 hours is more representative of man-hours and is a more accurate average.

This result comes about because the way the Mean is calculated makes it sensitive to the inclusion of extreme values, while the Median, which concerns itself only with the middle value of the data set, is relatively indifferent to them and is therefore more likely to provide an appropriate number that is stable in the face of outliers.

The screen shot also conveniently illustrates the function calls used to calculate both the Mean and Median in Excel. Frustratingly, Microsoft has opted to use the function name AVERAGE for the Mean, further cementing the erroneous idea that the two are synonyms. Thankfully the Median is a straightforward call to the MEDIAN function.

You’ll notice a third function in the list: SKEW. I discussed positive and negative skewness above, and the SKEW function is useful for coming to grips with the direction and magnitude of a data set’s skewness. In the occurrence man-hours example I state that the data is positively skewed, and that’s reflected by the positive (greater than 0) value returned by the SKEW function, which indicates that the Mean will likely be larger than the Median. Conversely, if SKEW returns a value that is less than 0 you’ll know that the data is negatively skewed and that the Mean will likely be smaller than the Median. Finally, if the value is close to 0 your data set likely follows the bell curve, or normal distribution, and the Mean and Median will be nearly the same.
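
To make those function calls concrete, here is a minimal hypothetical set of man-hour values (not the data from the screen shot, but a set that reproduces the same figures), assumed to sit in cells A1:A5: 3, 4, 5, 8 and 24.

=AVERAGE(A1:A5) returns 8.8, the Mean, dragged upward by the single 24-hour occurrence
=MEDIAN(A1:A5) returns 5, the Median, which is untouched by how extreme that last value is
=SKEW(A1:A5) returns roughly 2, a positive value confirming the long right-hand tail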

The key takeaway from this post is that it is important not to just blindly take the average (by which I mean the Mean) of a stat without first considering how the data is distributed. So much law enforcement data is skewed one way or the other that it makes sense to take a minute and run the MEDIAN and SKEW functions in Excel to make sure you’re reporting the most appropriate number to your superiors.

IALEIA SWOC November Excel Workshop

I just got back from the November Excel Workshop put on by the South Western Ontario Chapter of the International Association of Law Enforcement Intelligence Analysts. I attended both to present and to learn and I had a great time doing both. I want to give a heartfelt thank you to Mick, Manny and Hazel for inviting me to present on Excel topics for law enforcement analysis and to all the analysts who took time out from their schedules to come and make the profession stronger. I hope we can do it again in the future.

At the workshop I delivered two presentations and I produced a PDF document of the material for each. The first, 30 Excel Functions All Analysts Should Know, is an introductory level document detailing useful Excel functions with examples and screen shots. The second PDF is titled Advanced_Functions_Add_Ins_and_VBA and is a mix of intermediate to advanced level Excel topics that specifically details how to use Excel’s FREQUENCY function, Microsoft’s Analysis ToolPak for t-Tests and how to create a simple User Defined Function in Excel. In the coming weeks I will expand on the exercises in the PDFs as blog posts and also blog about those sections that didn’t make the cut for the workshop.

Both lessons also had some supporting Excel files for performing the exercises: 30_Functions_Exercises and Advanced_Functions. Please note, all this data is fictional and is in no way sensitive. In the Advanced Functions workshop I make use of an Add-In named PatternUI.xlam and it can be downloaded from here: http://blogs.office.com/b/microsoft-excel/archive/2007/11/16/chart-pattern-fills.aspx

Once again, thanks to everyone that came out today for making the workshop a success.

Year-over-year Crime Stat Reporting with ISO Week Date

Does your organization report a weekly crime stat? Is part of that report a measure of how the week compares to the same week the year before? If so, it’s important that you understand how ISO week dates work so that your report offers an accurate comparison between this year and prior years. In this post I am going to first discuss what an ISO week date is and then I am going to explain how it helps create better crime stat reports.

First, what is an ISO week date? Hopefully you’re familiar with the International Organization for Standardization (ISO). They publish a lot of standards and ISO 8601 is the standard that deals with a calendar system that gives each week of the year a number. For example, this article was written on October 21, 2012, which, in ISO week date notation, is written 2012 W42 7. Breaking the date down: the first number is the year, the second number is the week of the year (42 in this case) and the third number is the day of the week (the 7th day is a Sunday because the system states that Monday is the first day of the week). Most of the time a year has 52 weeks; occasionally it has 53, depending on how the days at the start and end of the year fall. According to Wikipedia the system is most often used in government and business for keeping fiscal years.

This is pretty straightforward to understand but the tricky part comes when determining the first week of the year, as the first week introduces slight discrepancies between the ISO system and the traditional Gregorian calendar that people are used to. To wit: ISO 8601 defines the first week as the week containing the year’s first Thursday. Using 2012 as an example, the first Thursday of 2012 was January 5th and, recalling that the ISO week begins on Monday, this means that the first day of the ISO year, that is 2012 W01 1, was January 2nd, 2012. But what about January 1st, 2012? It was actually the last part of week 52 of 2011, or 2011 W52 7. Yes, I know, it’s weird, but I think its utility outweighs its weirdness.
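
As an aside, recent versions of Excel can do the week numbering for you; a quick sketch, assuming a date sits in cell A2 (ISOWEEKNUM arrived in Excel 2013, and the return-type 21 flavour of WEEKNUM in Excel 2010):

=ISOWEEKNUM(A2) returns the ISO 8601 week number of the date in A2
=WEEKNUM(A2,21) does the same thing in slightly older versions
=WEEKDAY(A2,2) returns the ISO day of the week, with Monday as 1 and Sunday as 7

For example, =ISOWEEKNUM(DATE(2012,1,1)) returns 52, matching the point above about January 1st, 2012 belonging to week 52 of 2011.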

If you’ve made it this far in the post you’re probably thinking: “This seems overly complicated, why should this concern someone reporting crime stats?” A valid question. Consider a weekly crime report that covers October 8th through October 14th from 0000 hours to midnight. That’s a whole week and conveniently it is also week 41 of 2012. Now let’s say that for comparison our theoretical crime report also tabulates the stats for October 8th through October 14th for 2011. Is there a problem with this? Does it matter that the 2011 numbers run from October 8th (a Saturday in 2011) to October 14th (a Friday in 2011) instead of Monday to Sunday like it does in 2012? Is it enough that we capture one of each weekday in a weekly crime report? Is it relevant that the Saturday from 2011 is part of ISO week 40 while the Saturday from 2012 is from week 41?

I think this last point is key. Calls for service have a definite seasonal trend, which means that for any particular day of the week, Saturdays for example, the calls for service will vary according to the week of the year. This means that, historically speaking, the number of calls for service on the Saturday in week 41 is likely going to be consistently different than the number of calls for service on the Saturday in week 40, and the same goes for every other day of the week. Basically, days are not created equal and that means that if you want to compare apples to apples for crime reports you should really compare equivalent time periods. The easiest way to do this is, you guessed it, by using ISO week dates and comparing week 41 of 2012 to week 41 of 2011. In our example, compare October 8th through October 14th, 2012 to October 10th through October 16th, 2011.

But is this reasonable? Why is the week the unit of analysis here? In non-leap years October 8th through 14th is always the 281st through 287th day of the year; isn’t there a consistency in that which gets thrown out if we adopt the weekly approach? Perhaps, but I don’t think day-of-year consistency is more useful than the weekly kind. Most analysts recognize that calls for service have a distinct weekly pattern—there are more calls for service on Friday and Saturday than on other days of the week—that is imposed upon the larger seasonal trend. By adopting the ISO week date system we can align reporting with this natural frequency found in the crime data and, by locking into the frequency, we can exploit it to make comparisons easier.

So what’s the bottom line? Because calls for service show a seasonal trend you should endeavor to compare the same weeks when performing a year over year analysis. And because calls for service show a weekly trend it makes sense to lock your reporting to a standard week that makes comparisons easy and straightforward.  The best way to accomplish both of these goals is to adopt the established ISO week date system for weekly crime reports.

Digging in the Past

On the occasion of my wedding to my wife in October 2004 my late father’s cousin Barbara gave me an envelope that was cryptically addressed “For the future”. Inside the envelope was a Pedigree Chart (courtesy of The Church of Jesus Christ of Latter-day Saints, naturally) on which Barbara had filled in the family tree on my paternal grandfather’s side going back several generations. This was a true gift because, due to my grandfather’s assassination at the hands of the IRA in 1976, my father and my grandmother spoke very little if at all about that side of the family. My wife and I marveled at the genealogy of it all and then stored the letter safely with the rest of our wedding cards.

I didn’t think much about Barbara’s gift until today, October 17, 2012, when a chance encounter with a co-worker who hails from Northern Ireland got me talking and thinking about my heritage. This co-worker was quite keen and he dug up several web pages that I had found in the past but also offered some new ones, such as the 1901/1911 Irish Census and the Wikipedia page for the Mayors of Londonderry which clearly notes that Gerald Stanley Glover, my great uncle, was Mayor from 1950 to 1952 and again from 1961 to 1963. I knew he was Mayor at one point, but it’s nice to see the specifics.

When I got home from work my curiosity was fired and I decided to use Barbara’s gift and the Internet to see what I could find. I first checked the 1901/1911 census for William Glover, the father of my grandfather, Joseph Ballantine Glover. Now, grandpa was born in 1916, so he wouldn’t be on the census, but William was born in 1855 and passed away in 1920, so he should be there, and he was. The details are a little scant, but I can confirm it’s him because he’s listed as married to Mary E. Glover or, as she is detailed on my genealogy form, Mary Elizabeth Ballantine. I searched a bit more for my great-great-grandfather John Glover, born 1818 and died 1907 (so he should have been caught in the 1901 census) but to no avail.

I decided then to research the Ballantine side of the tree, a name my father, brother and grandfather all share. I knew that the Ballantines owned a timber company in Londonderry, and that my grandfather was the director of the timber yard (as confirmed by this somewhat angry article near the bottom).

Having confirmed what I knew I moved up the tree to see if I could find out more about Joseph Ballantine, Mary Elizabeth’s father. I first looked for him on the 1901 census as he lived 1834 to 1905, but to no avail. I then Googled his name and found this web page that details the 1907 Belfast/Ulster Street Directory. It lists several entries for Ballantine, Joseph, Ltd. on Strand Road: once under Steam Saw Mills, once under Timber Merchants and once under Turners.

So this is interesting, but Joseph Ballantine died in 1905, so if his company is going strong in 1907 and is in an Ulster business directory (and was still going strong in the 1970s when grandpa worked there) it must be being run by someone. William Glover is listed as a farmer and since I was unsure if Mary Elizabeth had any brothers I went back to the Irish Census to dig a bit more. I mentioned how earlier I searched for Joseph Ballantine to no avail. That’s not exactly true, I found a Joseph Ballantine but at the time of the census in 1911 he was only 38, far too young to be the Joseph Ballantine born in 1834. But perhaps he was a son and therefore the brother of Mary Elizabeth. Here’s his census listing:

Interestingly, if you click the “Show all information” checkbox on the census it expands to reveal that his occupation was “Timber Merchant”. That pretty much clinched it for me and now I assume that Joseph Ballantine (junior, for lack of a better term) took over Joseph Ballintine Ltd. from Joseph Ballintine the elder or, and this will be explained in the next few paragraphs, it might have been solely his company and not his father’s at all.

What makes me say that? Well, during my Googling for Joseph Ballantine I kept having entries returned from the Dictionary of Irish Architects 1720-1940 so I clicked on the link and found this:

It’s interesting, but perhaps it’s a different Joseph Ballantine. More Googling of the name led me to this page of “Marriages recorded in the Town of Strabane and parish of Camus-juxta-Mourne extracted from the Derry Journal, Londonderry Standard and Londonderry Sentinel 1860-69”. The page is long but if you search for “Ballentine” you’ll find this snippet:

So there he is, Joseph Ballentine, identified as an architect, marrying Anne Wilimina, daughter of Mr Robert Simpson, also an architect. I should mention that Barbara’s gift lists both Anne Wilimina Simpson and Robert Simpson in the appropriate spots. What the gift doesn’t have is their wedding date and for the record, that February 25th was in 1862.

So wonderful, my great great grandfather was an architect. That’s pretty cool. Did he build anything that is still around? I’m so glad you asked. This lovely brochure describes The Apprentice Boys Memorial Hall in Londonderry and explicitly notes that the builders of the original structure included Joseph Ballantine:

Here is a copy of the brochure from my server in case the other one ever disappears: mem-museum-booklet

And here is a nice Youtube video of the hall:
The Apprentice Boys Memorial Hall in Londonderry

So my thinking is that if Joseph Ballantine senior was an architect he either started a timber company to complement his work or encouraged his son to do so. I don’t think I have enough details to make a decision either way on it but it’s clear that when William Glover married Mary Elizabeth he also married into a timber business and their son (my grandfather) decided to make the family business his career.

And, that’s all I have for one night. I hope this has been entertaining and educational and I hope that the great, all-knowing Google may find it, index it and make it available for future Glovers to peruse.

Bonus facts:

Joseph Ballantine Glover was the president of the Londonderry Chamber of Commerce from 1967 to 1968, and it’s listed on their history page.

On the page he is listed as an FCA, which likely means he was a Fellow of the Chartered Accountants of Ireland or the Northern Ireland equivalent.

While I can’t confirm that it is definitely him, Joseph Ballentine of Londonderry is listed as a member since 1888 in the 1890 rolls of the Royal Society of Antiquaries of Ireland. The link is a Google Book link and the publication is the Journal of the Royal Society of Antiquaries of Ireland. Seeing as how he was an architect it makes sense that he would like antiquities.

 

The genealogy chart from Barbara has the spelling of Ballantine as Ballintine. When I type Ballintine into Google it corrects me. The Irish Census has it as Ballantine but the marriage note for Joseph Ballantine has him down as Ballentine. The only thing I know for sure is that Word thinks that every variation is a mis-spelling.

Cumulative Distribution Function Charts in Excel

Most analysts know what a histogram is: it’s a chart where each piece of data is put into one of several bins and then the bin counts are plotted in order to gain insight into how the data is distributed. Histograms have an intuitive appeal because it’s easy to see how the data is distributed just by looking at the chart. Of course, they aren’t perfect. The whole binning process is pretty subjective, and depending on how the bins are selected the results can look quite different, which can be deceptive. If only there was an alternative that could address these shortcomings…

Well, there is: histograms have an underutilized sibling known as the Cumulative Distribution Function (CDF) chart, which complements and, I think, improves on the histogram by doing away with some of the histogram’s problems and offering a useful innovation of its own: the ability to read off the chart what percentage of the data falls between two points, something that is not possible with a histogram. In this post I will discuss why I think you should make the CDF part of your analytical toolbox and also demonstrate how easy it is to create a CDF in Excel.

Convince me

First, two example charts that describe my dataset: a completely made up list of 30,000 calls-for-service and how long they last from initial call to the closing off of the incident. The first chart is a standard histogram and the second is the corresponding CDF. For both charts the horizontal axis is the same: it’s hours. For the histogram the vertical axis is a count of how many calls are in each bin while the vertical axis of the CDF is a measure of percentage (why percentage will become clear in a bit).

cdf_1

So the histogram is the same old boring story. We have a roughly bell-shaped distribution that is positively skewed (fatter on the left than the right, with a tail stretching out to the right). Looks like most calls take about five hours and only a few go above 12 hours. Yawn.

cdf_2

Ok, the CDF looks great. But what’s it good for? What can it tell us that the histogram can’t? Remember that I mentioned you can use a CDF to determine what percentage of the data falls between two points? Let’s explore that now.

Suppose I want to know how many calls take less than 5 hours (that is, between 0 and 5 hours). I can read that right off the graph by going across the horizontal axis until I reach 5 and then going up until I hit the trend line. I then read across to the vertical axis and see that the value is about 60%. What this means is that 60% of all calls take 5 hours or less. That’s a much firmer number than ‘most calls are around 5 hours’. Similarly, what if the question was how many calls take between 5 and 10 hours? Well, first read off the 10 hour value, which is about 99%, and we know the 5 hour value is 60%, so just take the difference, 99-60, to arrive at about 39% of all calls taking between 5 and 10 hours. Let’s see you do that with a histogram (hint: you can’t do it).
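
If you want the exact figures rather than eyeballed ones, the same percentages can be pulled straight from the raw numbers; a small sketch, assuming the call durations live in column A:

=COUNTIF(A:A,"<=5")/COUNT(A:A) gives the fraction of calls lasting 5 hours or less
=COUNTIF(A:A,"<=10")/COUNT(A:A)-COUNTIF(A:A,"<=5")/COUNT(A:A) gives the fraction taking between 5 and 10 hours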

I’m convinced. How do I make one in Excel?

Creating a CDF is very straightforward and the following five steps describe the entire process. I’ve included a quick Excel screen grab to illustrate what the first few rows of the data will look like when you’re done (it’ll make more sense once you’ve worked through the steps).

cdf_3

Step One: Start with your data in Column A in Excel sorted from smallest value to largest. My data had a lot of zero values, so that is why the column starts off with a series of zeros.

Step Two: In Column C (yes I intentionally skipped B) put the number 1 in C1, 2 in C2, 3 in C3, etc.  Here’s a useful hint to speed this up: use =C1+1 for C2 and fill down.

Step Three: In B1 enter how many data points you have. I had 30,000 so I dropped that number in there.

Step Four: For B2 use the formula =C2/$B$1 and fill down. The idea is that you want to divide the value in Column C by the total count. When you get to the bottom of your data the last value should be 1.

Step Five: Now that the data has been prepped, highlight all of columns A and B and insert a new Scatter chart. What you’ll get is a monotonically increasing (i.e. always going up) trend line. You can edit the styling so that the graph is a line rather than points and make whatever other changes make it look pretty. I recommend fixing the vertical axis maximum at 1 (since the data can never go above 100%) and the horizontal axis minimum at 0, if that’s appropriate. Voila, now you have a CDF chart.
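
Pulling the steps together, the cell contents look something like this (a consolidation sketch of the layout above, with the sorted data running from A1 down to A30000):

Cell C1: 1
Cell C2: =C1+1, filled down to the last row of data
Cell B1: 30000, the total number of data points
Cell B2: =C2/$B$1, filled down; the value in the last row works out to 1, or 100%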

But how do CDFs address the histogram’s shortcomings? Did you notice any bins during the calculation? Nope. CDFs use the value of every data point and therefore do away with binning and the loss of information binning introduces. And because there are no bins the shape of the CDF can’t change like it can in a histogram so that source of subjectivity is neatly side-stepped. And you want to know what else is great? It’s easy to compare two data sets with CDFs by simply plotting them on the same chart. Comparing two histograms on the same plot is not particularly enlightening besides noting how the peaks compare.

In conclusion, CDFs are like grilled cheese: great and easy. And like grilled cheese everyone should try one at lunch (or whenever you next do a frequency analysis). The CDF is a great complement to traditional histogram analysis and will hopefully become a valuable tool in your graphic analysis toolbox.

An olive branch to histograms

Making a histogram is actually more work than creating a CDF. But if you really want to create one you can either use Microsoft’s Analysis ToolPak (what happened to the ‘c’ MS?) or perform some more elaborate Excel JuJu. Good luck.

Being Careful with Averages

In crime analysis we are used to calculating averages and when I say average I mean the “mean”—e.g. add up ten numbers and divide by ten—that kind of average. But in crime analysis we are also frequently interested in examining rates or ratios. For example: how many hours of patrol time are spent responding to calls. When we combine the two there is the potential to be tripped up since, if we are talking ratios, we need to be very careful about how we calculate an average. Depending on the approach we take we’ll get different numbers.

An example will help illustrate the problem.

Let’s say that your CAD system keeps track of how many hours an officer spends on patrol and, of that time, how much of it is spent responding to calls. The following table illustrates a small sample of this kind of data.

mean_1

From the data in the table I calculated a ‘busy-ness’ ratio for each officer; this is Call Time divided by Patrol Time. From those values I in turn calculated the average Busy-ness, which is 0.51, or 51% busy. But wait, let’s add up all the Patrol Time (78.6 hours) and all the Call Time (40.6 hours) and find the ratio. (40.6/78.6)=0.52 or 52% busy. These numbers are different because of the mathematical principle that says that the ratio of averages (0.52) does not necessarily equal the average of ratios (0.51). But which one is ‘correct’? Well, they both are, and the one you go with all depends on what question you are trying to answer.

For the ratio of averages (0.52) we are asking a question about the entire police service: “how much patrol time did we spend on calls?” For this question 52% is the correct answer because it takes into account the total time spent on patrol and the total time spent on calls. If an officer spends 3 hours on patrol or 12 hours, all of that is taken into account in the final number.

The average of ratios (0.51) is trickier, as it answers a question about individual officers going on patrol: “how much of their patrol can an officer expect to spend on calls?” The difference arises because the average of ratios gives every shift equal weight in the mean, regardless of how long the shift was, and that is a fine assumption when the question is about what a typical shift looks like. In other words, the length of the shift doesn’t matter, only how much of the shift was spent on calls.
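
In Excel terms the two calculations look something like this (a sketch, assuming Patrol Time in B2:B9, Call Time in C2:C9 and the per-shift Busy-ness ratio in D2:D9, where D2 holds =C2/B2 and so on; the actual layout in the screen shots may differ):

=AVERAGE(D2:D9) gives the average of ratios, the 0.51 figure
=SUM(C2:C9)/SUM(B2:B9) gives the ratio of sums (equivalently the ratio of averages), the 0.52 figure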

But why am I telling you this since the difference seems to be pretty small? It’s because there are situations where the difference can be significant. For example, look at the modified table below where by adjusting the patrol times I was able to increase the difference between the averages to 4%. Making a 4% mistake on something like officer busy-ness can be the difference between hiring more officers and not. That’s not a mistake I would want hanging around my neck.

mean_2

The takeaway from this post is to be mindful when taking averages because so many of the numbers crime analysts look at are ratios. Whether we’re looking at officer busy-ness or per capita car thefts or tickets issued per traffic blitz, it is important to understand the question that the average is supposed to answer.

Bonus commentary for pedants
But wait, you might be saying, how is the 0.52 number a ratio of averages? No average was even calculated; we just summed the numbers and divided the Patrol Time by the Call Time. That’s a ratio of sums. Yes, that’s true, but if you had wanted to be a real stickler you could have found the average Patrol Time (78.6/8 = 9.825) and the Call Time (40.6/8 = 5.075) and divided those numbers and arrived at the same answer (5.075/9.825) = 0.52. It works this way because the divisor in both the averages is the same (it’s 8 representing the 8 shifts included in the calculation) and therefore it cancels out when the ratio is taken.

Testing the Significance of Day-of-Week Analysis with the Chi-Square Statistical Test

A common weapon in the crime analyst’s arsenal is the day-of-week analysis. One problem with the technique is that many people accept the results uncritically and assume that if, for example, there have been several more incidents on Friday than on any other day, then Friday is the best day for a focused action. The problem is that day-of-week analysis is seldom accompanied by a confidence interval or evaluated to find out if the results are statistically significant. Without performing the statistical follow-up it’s impossible to know whether the results actually represent anything other than randomness.

In order to evaluate statistical significance for something like a day-of-week analysis we need to use the Chi-Square Statistical Test. The chi-square test is what’s known as a non-parametric test, which is a fancy way of saying that it will work with nominal data (that is, categories) and that the hypothesis being tested does not require a normal distribution or any variance assumptions (a bonus for testing things like a day-of-week breakdown). The test works by performing a goodness-of-fit analysis between the observed data and an expected distribution. The point is to determine if the differences between the observed and expected are caused by something other than chance.

To make things more concrete let me introduce the actual data I will be evaluating. The following image lists some (fictional) motor vehicle collision (MVC) counts over the course of several weeks. Looking at the data with an uncritical eye might lead some to assume that Fridays are the worst day for collisions and that we should do a traffic blitz on that day. The purpose of the exercise is to see if that’s a valid assumption.

screen-shot-2012-06-22-at-11-17-08-pm

So this is our “observed data” but what about the “expected distribution”? In order to answer we need to identify another property of the statistical significance test: the null hypothesis. In any statistical test we need two hypotheses: the research hypothesis and the null hypothesis. The research hypothesis is what we’re hoping to show, in this case, that the day of week means something for MVCs. The null hypothesis is the opposite and in our example it means that the day of the week doesn’t factor into when collisions take place. Since the null hypothesis states that the day doesn’t matter then we should expect the same number of collisions each day. If the total number of collisions is 140 then the expected value for each day is 140/7 = 20. With this new info have a look at the following image that lists both the observed and theoretical distributions.

chi_1

The chi-square calculation is pretty straightforward: take the difference between each observed value and its expected value, square it, and then divide by the expected value. So for Sunday the value would be (14-20)^2/20 = 1.8, for Monday it’s (16-20)^2/20 = 0.8, and so on for each of the 7 days. Then you sum up each of the calculated values to arrive at a total score (8.45 in this case), as in the image below.

chi_4
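
If you want Excel to do this arithmetic for you, the per-day contributions are a one-line formula; this is just a sketch, assuming the observed counts sit in B2:B8 and the expected value of 20 sits in C2:C8 (the layout in the screen shots may differ).

Cell D2, filled down to D8 (each day’s contribution, e.g. 1.8 for Sunday): =(B2-C2)^2/C2
Cell D9 (the total chi-square score, 8.45 for this data): =SUM(D2:D8)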

The next step is to check the score against what’s known as a chi-square critical value table, which is a pre-computed table of critical values for different confidence levels and degrees of freedom. The tables are all over the Internet but I used this one for this example. I wrote above about confidence intervals, and I’m going to pick the 95% level just because, which means that my alpha value (look at the table) is 0.05 (i.e. 1 – 0.95). The second value is degrees of freedom, which is the number of categories being tested minus one. For this example, 7 days in the week, minus one, means we have 6 degrees of freedom (df). Knowing these two pieces of information we can look up the chi-square critical value. Go down to the row for df=6 and across to the column for alpha=0.05 and the number is 12.592.
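
As an alternative to a printed table, Excel can also produce the critical value and the p-value directly. A sketch, using the same assumed layout as above; note these are the Excel 2010-and-later function names, and older versions use CHIINV and CHITEST instead:

=CHISQ.INV.RT(0.05,6) returns the critical value for alpha = 0.05 and 6 degrees of freedom, roughly 12.592
=CHISQ.TEST(B2:B8,C2:C8) returns the p-value for the goodness-of-fit test; a p-value above 0.05 is equivalent to the score falling below the critical value, meaning the null hypothesis cannot be rejected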

12.592 is greater than 8.45. That means we cannot reject the null hypothesis. That means we can’t say that the variation in observed MVCs across the days of the week is due to anything other than chance. It means that we can’t say that any particular day is better than another for a traffic blitz because, statistically, the observed data could have happened by chance. This is a troubling truth if we’ve been telling officers to focus on Fridays.

So this post is kind of a bummer if you’re used to directing officers based on day-of-week analysis. A lot of officers don’t want to hear about statistical significance (well, actually, none, I don’t know any who would want to hear about it) but the hard truth is that math is telling us that what we think is a pattern (e.g. more MVCs on Fridays) is often not distinguishable from chance. On the bright side, when we do get to reject the null hypothesis we are able to objectively demonstrate that, in fact, yes, something is going on that warrants attention. And of course the chi-square test is applicable to all sorts of categorical data sets, many of which demonstrate statistical significance, so hopefully you can add it to your toolbox for use in the future.

Important Bonus Math:
There are a few caveats when working with chi-square. First, it’s a data-hungry method: the more values you have, the better your test will be. Second, it is skewed toward rejecting the null hypothesis. I don’t know exactly why, that’s what a smart stats guy wrote, but it means that if you are forced to accept the null hypothesis you’re really in bad shape. Third, don’t run the test if you have an expected distribution with fewer than 5 in each of the cells. This means don’t let the expected value be less than 5 for any day of the week; that’s too little data (see the first caveat) and your results won’t be good.

The Poor Man’s Proximity Tool: The Haversine Formula

The haversine formula is a method for finding the great-circle distance between a pair of latitude-longitude coordinates. In crime analysis I use the formula for finding incidents that have taken place within a particular distance, for example within 500 meters, of a location.

My approach uses Excel to calculate the haversine distance between a reference location and the latitude-longitude of every other location in my spreadsheet. I then sort the results to find the locations that are within the distance that I’m interested in. I’ve worked the following example in Excel to illustrate the method.

Imagine that you have a list of 1000 occurrences with the latitudes and longitudes of each of the occurrence locations. Now also imagine that you have a reference location and that you want to determine how many of the 1000 incidents occurred within 500 meters of the reference location. I set my spreadsheet up like this:

haversine_1

Note that in column B I have entered all of the latitudes and in column C I have entered all of the longitudes. I have also entered my reference latitude and longitude in cells E2 and E3 so that they can be referenced by the haversine formula. Also note that both latitudes and longitudes are in decimal degrees as opposed to degrees-minutes-seconds.

In column D I enter the Excel-ified version of the haversine formula which I have reproduced below. It gets a bit complex because Excel likes to deal with radians as opposed to degrees when using the COS and SIN functions so it is necessary to use Excel’s RADIANS function to convert the latitudes and longitudes. The final multiplication by 6371 represents the Earth’s radius in kilometres.

haversine_2
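
For those who would rather copy and paste than squint at an image, the formula in D2 looks something like this (a sketch of the same calculation, assuming the layout above: incident latitude in B2, incident longitude in C2, reference latitude in E2, reference longitude in E3; the 6371 multiplier is the Earth’s radius in kilometres, so the result comes out in kilometres):

=2*6371*ASIN(SQRT(SIN(RADIANS(B2-$E$2)/2)^2+COS(RADIANS($E$2))*COS(RADIANS(B2))*SIN(RADIANS(C2-$E$3)/2)^2))

Fill it down for every incident and column D holds the separation distance for each row.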

A note for pedants: I realize that the Earth is not a perfect sphere, which means the calculated distances will be somewhat off if you are dealing with large distances. But for the distances we are concerned with—tens of kilometres—the impact is negligible and can be ignored.

By filling column D with the expression I calculate the separation distance between the reference point and each incident. Now, if the distance column is sorted from smallest to largest I can easily see the incidents that occurred within 500 meters (or 0.5 kilometres) of the reference location.

haversine_3

I find this method useful for determining the proximity of all sorts of things. For example, if the reference coordinates are for an intersection the method can be used to find all motor vehicle collisions that occurred within 100 meters. Or say there was a break and enter, a proximity search can be done against field contacts that occurred within 1 kilometre of the occurrence address. Really anything that has a recorded latitude and longitude can be used. Best of all, it doesn’t require a GIS, just Excel and the formula.