Jason Punyon


I’ve been meditating lately. I started in December, and I haven’t done it every day, but I enjoy it. The topic of this morning’s meditation was gratitude. It led me through feeling grateful for different things. Something someone did for me. Something someone I don’t know did. Something from nature. Something I did. Something small. Something big.

My task for the week was to deploy an A/B test of a new job recommendation algorithm. Our weekly meeting was at 1PM so it was high time I got down to it. I had to train a logistic regression model in R, extract the coefficients/port them to our C# library, and release an update. I didn’t know how to do any of that R work or ML stuff three years ago. I was grateful.

Then I had to update the consuming code. I worked on this codebase before I moved to the Data Team. You never step in the same river twice, but I know this river. I’ve never been around anything I’ve built for more than a few years, getting to see it grow and change, work and not work. I was grateful.

I stepped in it and released a decent-sized bug to dev: for unregistered users the site would just error out. It hadn’t happened locally, so I had to work from the stack traces and I got to THINK REALLY HARD (it’s an older technique, but it checks out). I used to worry I was going to get bit by bugs. I stopped worrying; I know I’m going to get bit by bugs, but bug bites just aren’t that bad anymore. I was grateful.

This took a while, and I told everyone on the team that I had really loused things up and asked if they would be so kind as to not push to prod while I figured out what was up. Gervasio, one of the devs on the team, showed me a neat way to block the builds so that even if someone ran the build anyway it wouldn’t deploy. I was grateful.

I finally worked out the hitch in the giddyup, and with about ten minutes to spare before the meeting my test was galloping free in the wild. I trained a model, touched two languages, three codebases, fucked up the release, and fixed it all in a matter of hours. I was grateful.

This mundane cromulence. Why even mark it with words? A few months back we had to let a lot of people go. It was a thing I was scared of for a long time before that and ashamed of since. Gratitude is a piece of the puzzle I’ve been missing. Remembering that I contributed to building the experience I had developing today and feeling grateful for being a part of that lifted me up. Some things we set in motion years ago didn’t work out like we thought and that is sad, but some have so exceeded my expectations. I haven’t been grateful about work in a long time and it was nice to be there today. I want to be there tomorrow.

Fun With Random Numbers: More Random Projection

Last time we learned about a method of dimensionality reduction called random projection. We showed that using random projection, the number of dimensions required to preserve the distance between the points in a set is dependent only upon the number of points in the set and the maximum acceptable distortion set by the user. Surprisingly it does not depend on the original number of dimensions. The proof that random projections work is hard to understand, but the method is very simple to implement in just a few steps. To reduce the dimensionality of a set of m points in an N dimensional space:

  1. Get your data into an m × N matrix D.
  2. Choose your maximum allowable percent distortion ε.
  3. Calculate the number of dimensions for your reduced space, n = ln(m) / ε².
  4. Sample the entries of an N × n matrix R from the standard normal distribution and divide them by √n.
  5. Multiply D by R.
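The steps above can be sketched in a few lines. Here’s a minimal standalone version (in Python/NumPy rather than the post’s R; the 500 × 10,000 data matrix is just random test data, not anything meaningful):

```python
import numpy as np

def random_project(D, epsilon):
    """Project m points in N dimensions down to n = ln(m) / epsilon^2 dimensions."""
    m, N = D.shape
    n = int(np.ceil(np.log(m) / epsilon ** 2))       # step 3: target dimensionality
    rng = np.random.default_rng(0)
    R = rng.standard_normal((N, n)) / np.sqrt(n)     # step 4: random Gaussian matrix
    return D @ R                                     # step 5: multiply

# 500 points in a 10,000-dimensional space, max distortion 30%
D = np.random.default_rng(1).standard_normal((500, 10_000))
projected = random_project(D, epsilon=0.3)
print(projected.shape)  # (500, 70)
```

Note that n came out as ceil(ln(500) / 0.3²) = 70 no matter what N was.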

Today we’re going to add a couple steps to this method and find another surprising result.

Let’s color the world

Let’s get some data. We’re going to work with the cities15000 dataset from geonames.org. Let’s download it, clean it up, and take a look…

library(dplyr)    # %>%, transmute, bind_cols
library(ggplot2)  # plotting
library(tidyr)    # crossing()
library(viridis)  # scale_color_viridis()

if (!file.exists("~/cities15000.zip")) {
  download.file("http://download.geonames.org/export/dump/cities15000.zip", "~/cities15000.zip")
}
unzip("~/cities15000.zip", exdir = "~") # the read below expects ~/cities15000.txt

cities = readr::read_tsv("~/cities15000.txt", 
                         col_names = FALSE, 
                         col_types = "ccccccccccccccccccc", 
                         quote = "") %>% 
  transmute(Name = X2,
            Latitude = as.numeric(X5),
            Longitude = as.numeric(X6))

cities %>% 
  ggplot(aes(Longitude, Latitude)) + 
  geom_point(size = 0.25) +
  labs(title="Cities of the World, Population >= 15,000")


Lovely. Let’s get our data in order. We’ve got two dimensional points, and we’re going to randomly project them. Now this is weird. We’ve been talking about dimensionality reduction for the past post and a half, but this time we’re going to increase the dimensionality of our data using random projection. Don’t worry, I promise everything’s gonna be fine. Let’s get our projection on.

m = nrow(cities) #Number of points
N = 2            #Original Dimensionality
n = 32           #New Dimensionality

D = cities %>% select(Longitude, Latitude) %>% as.matrix()
R = matrix(rnorm(N * n, 0, 1), N, n) / sqrt(n)

projected = D %*% R

Everything is the same as before. We got the data in the data matrix, we sampled the random matrix, we multiplied. We’ve projected our 2 dimensional data up to 32 dimensions. Let’s take a look at one of these points in the projected space…

##  [1]  -8.5298772   5.2487328   6.4803452   0.7202221  -4.5681409
##  [6]  -3.5437398  -1.1376153   2.1322984 -13.9543986   9.5068033
## [11]  11.9501116   4.1434071  -6.2996788  -2.8503986   3.0405665
## [16]  -6.8101166   2.4264599  -0.4621604 -13.5103668  -7.4556873
## [21]  11.0385507   3.5353682  -3.1107269   0.5312211   6.7301643
## [26]   5.3124139   2.1059562   1.6779630   2.7099141  13.2124557
## [31]   3.0985446  -2.9377113

OK, so each row in the projected matrix has 32 numbers. The next step in the process is to turn each row of 32 numbers into 32 bits using their signs. Positive numbers will map to 1s and negative numbers will map to 0s.

projected_bits = matrix(as.numeric(projected >= 0), m, n) ##Compare with zero to get the sign of each number...
projected_bits[1,] ##And now we have 1s and 0s
##  [1] 0 1 1 1 0 0 0 1 0 1 1 1 0 0 1 0 1 0 0 0 1 1 0 1 1 1 1 1 1 1 1 0

Our projected data is now a collection of 32-bit strings, which I’m going to call hashes. Let’s visualize the data by coloring the points on the original map by their projected hashes.

cities_with_hashes = cities %>% bind_cols(Hash = apply(projected_bits, 1, paste0, collapse=""))
cities_with_hashes %>% 
  ggplot(aes(Longitude, Latitude, color=Hash)) +
  geom_point(size = 0.5) + 
  theme(legend.position = "none") + 
  scale_color_manual(palette = jason_colors) +
  labs(title="Cities of the World, Population >= 15,000",
       subtitle="Colored by projected hash")


Well well well. It looks like there’s some radial symmetry going on. It’s a little hard to see though. Let’s fill in all the space to make it a bit more apparent…

all_space = crossing(Longitude = seq(-180, 180, .5),
                     Latitude = seq(-90, 90, .5)) %>%
  mutate(Id = row_number())
all_space_projected = all_space %>% select(Longitude, Latitude) %>% as.matrix() %*% R

all_space_projected_bits = all_space_projected >= 0
all_space_with_hashes = all_space %>% bind_cols(Hash = apply(matrix(as.numeric(all_space_projected_bits), nrow(all_space), n), 1, paste0, collapse=""))

all_space_with_hashes %>%
  ggplot(aes(Longitude, Latitude, color=Hash)) +
  geom_point(size = 0.5) + 
  theme(legend.position = "none") + 
  scale_color_manual(palette = jason_colors) +
  labs(title="Space of the World",
       subtitle="Colored by projected hash")


My my my. So our modified random projection scheme has organized 2d space into these spoke-like regions, and all the points in each spoke have the same hash (note: regions that appear to share a color are false duplicates; the color palette we’re using is only so large). That’s rather surprising.

Brace Yourself

What’s more surprising is that spokes next to each other have hashes that are closer to each other than spokes that are farther apart. What do we mean by close? We’ll use the Hamming distance, which is just a fancy way of saying “count the bits where two hashes differ”. To calculate the Hamming distance between two hashes we’ll sum the bits of the XOR of the two hashes.

new_york_index = all_space %>% filter(Longitude == -74, Latitude == 41) %>% .$Id
new_york_bits = all_space_projected_bits[new_york_index,]

distances = apply(all_space_projected_bits, 1, function(row) { sum(xor(new_york_bits, row)) })

We picked New York City as our reference point, and calculated the Hamming distance to all other points in the space…let’s plot.

all_space %>% 
  bind_cols(HammingDistanceToNewYork = distances) %>%
  ggplot(aes(Longitude, Latitude, color=HammingDistanceToNewYork)) + 
  geom_point() +
  scale_color_viridis() + 
  geom_point(aes(x=-74, y=41), color="red") +
  annotate("text", -74, 50, label="New York City", color="red", size=7) +
  theme(legend.position = "none")


Holy moly. What’s going on here? Black means that point has a low Hamming distance to New York, while yellow means that point has a high Hamming distance to New York. Our random projection method has transformed our original 2d coordinates into a locality-sensitive hash. Points that are close to each other in real space have hashes that are similar to each other in Hamming space. Let’s look at another city, Rio de Janeiro…


Still works. Pretty amazing.
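If you want to convince yourself without the maps, here’s a tiny self-contained sketch of the same idea (in Python/NumPy rather than the post’s R; the city coordinates are approximate and the projection matrix is freshly sampled, so the specific bit patterns will differ from the post’s):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 32
R = rng.standard_normal((2, n)) / np.sqrt(n)  # random 2 -> 32 projection

# (Longitude, Latitude), approximate coordinates
new_york = np.array([-74.0, 40.7])
newark   = np.array([-74.2, 40.7])   # right next door
rio      = np.array([-43.2, -22.9])  # an ocean away

def hash_bits(point):
    # sign bits of the projected point: positive -> 1, negative -> 0
    return (point @ R >= 0).astype(int)

def hamming(a, b):
    return int(np.sum(a != b))

d_near = hamming(hash_bits(new_york), hash_bits(newark))
d_far  = hamming(hash_bits(new_york), hash_bits(rio))
assert d_near < d_far  # nearby cities share more hash bits than distant ones
```

Nearby points make a small angle as seen from the origin, so few of the random hyperplanes separate them, and few sign bits flip.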

Let’s fix it!

What do you mean, fix it!? It is, amazingly, not broken in the first place! Well, having space organized into spokes is nice and all, but that’s not really how we think of closeness as humans. Can we alter the method to change the shape of the regions so that the version of closeness the hashes see is more like the closeness that we think of as humans? The answer is yes.

Now I don’t have any real theoretical motivation for what I did here, I just thought really hard and, luckily, I ended up where I wanted to be. The thing I observed is that all the spokes emanate from the origin because that’s the point by which all the other points in the space are defined. The Latitude of a point is defined in reference to the origin, which has Latitude 0. Likewise for the longitude. My thought was to extend the definition of each point in the space to be in reference to not just the origin, but another point as well. That means instead of running the random projection on just a 2d point, I need to come up with some more dimensions. So in addition to the original Latitude and Longitude of the points, I’m going to add Longitude+180, Longitude-180, Latitude+90, and Latitude-90. If we rerun the random projection and plot the hashes of each point…

N = 6
n = 32
R = matrix(rnorm(N * n, 0, 1), N, n) / sqrt(n)

all_space = crossing(Longitude = seq(-180, 180, .5),
                     Latitude = seq(-90, 90, .5)) %>%
  mutate(Id = row_number(), 
         Lon2 = Longitude + 180, Lat2 = Latitude+90,
         Lon3 = Longitude - 180, Lat3 = Latitude-90)
all_space_projected = all_space %>% select(Longitude, Latitude, Lon2, Lat2, Lon3, Lat3) %>% as.matrix() %*% R

all_space_projected_bits = all_space_projected >= 0
all_space_with_hashes = all_space %>% bind_cols(Hash = apply(matrix(as.numeric(all_space_projected_bits), nrow(all_space), n), 1, paste0, collapse=""))

all_space_with_hashes %>%
  ggplot(aes(Longitude, Latitude, color=Hash)) +
  geom_point(size = 0.5) + 
  theme(legend.position = "none") + 
  scale_color_manual(palette = jason_colors) +
  labs(title="Space of the World",
       subtitle="Colored by projected hash")


OK, so that’s kinda big and blocky. Let’s up the number of dimensions in the projected space from 32 to 128…


OK, a bit less blocky…1024?


Alright alright alright. Now we changed the method…did we lose the locality sensitivity? If we do the hamming-distance-to-new-york dance again…

new_york_index = all_space %>% filter(Longitude == -74, Latitude == 41) %>% .$Id
new_york_bits = all_space_projected_bits[new_york_index,]

distances = apply(all_space_projected_bits, 1, function(row) { sum(xor(new_york_bits, row)) })
bound = all_space %>% 
  bind_cols(HammingDistanceToNewYork = distances)

bound %>%
  ggplot(aes(Longitude, Latitude, color=HammingDistanceToNewYork)) + 
  geom_point() +
  scale_color_viridis() + 
  geom_point(aes(x=-74, y=41), color="red") +
  annotate("text", -74, 50, label="New York City", color="red", size=7) +
  theme(legend.position = "none")


We find out that no, we haven’t. Dark points are closer to New York, brighter ones are farther away. I wanted to see what it would look like to walk New York’s nearest neighbors in order in the hash space so I made this video.

I remain flabbergasted by what you can do with random projection. That it works at all is surprising, and that my handwavey modifications didn’t break it is doubly so.

Fun With Random Numbers: Random Projection

So, there’s this bit of math called the Johnson-Lindenstrauss lemma. It makes a fairly fantastic claim. Here it is in math-speak from the original paper…

Fantastic, right? What does it mean in slightly more lay speak?

The Fantastic Claim

Let’s say you have a set of 1000 points in a 10,000 dimensional space. These points could be the brightness of pixels from 100x100 grayscale images. Or maybe they’re the term counts of the 10,000 most frequent terms in a document from a corpus of documents. Whatever they are, each point has 10,000 dimensions.

10,000 dimensions is hard to work with, so you’re interested in reducing the dimensionality of your data. The goal of such a reduction would be to make the points easier to work with (have fewer than 10,000 dimensions), but preserve the relevant structure so you can treat the reduced dimensional data as if it were the high dimensional data (analyze it, machine learn it, whatever).

One piece of the structure you might be interested in preserving is the distance between the points. What does it mean to preserve distance? Let’s pick a percentage called ε and say that we want all distances between points calculated in the low dimensional space to be no more than ε% different from the distances between the same points calculated in the original high dimensional space. In pseudo-math you might say that good dimensionality reductions lead to distances that obey D_lowdim = D_highdim ± ε% for all pairs of points.

So now we can get to the JL lemma’s fantastic claim.

The JL lemma says that the number of dimensions you need in your lower dimensional space is completely independent of the number of dimensions in the original space. Your original data could be 10,000-dimensional or 10,000,000,000-dimensional, and the number of dimensions you need in the lower dimensional space in order to preserve distance depends only on the number of points and our boundary ε. The formula for the number of dimensions you need is:

n_lowdim = ln(Number of Points) / ε²
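To put concrete numbers on that: for 1,000 points at ε = 10%, a quick check of the arithmetic in Python gives

```python
import math

points = 1000   # m, the number of points
epsilon = 0.1   # 10% maximum distortion
n = math.log(points) / epsilon ** 2
print(math.ceil(n))  # 691
```

The formula alone gives 691; the R snippet later in the post takes the ceiling and adds one more dimension, which is where its 692 comes from. Either way, the original dimensionality never appears in the calculation.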

OK, let’s stop and take a breath. This result is absolutely-positively-100%-certified-grade-A-what-the-hell-is-up-with-the-universe to me. It’s beautiful and it makes my brain hurt. When I first thought about it I couldn’t do anything but stand up and pace around for a while. Take a minute or two to do the same.

The Fantastic Method

Welcome back.

Brace yourself. Even more fantastic than the claim that such dimensionality reductions exist and that they don’t depend on the dimensionality of the original space, is the method of doing such a reduction. Here’s the code…

N = 10000 #The Number of dimensions in the original space
n = 1000  #The number of dimensions in the low dimensional space

random_matrix = matrix(rnorm(N * n, 0, 1), N, n) / sqrt(n) #The magical random matrix.
X = GetData() #returns a nPoints x N matrix of your data points

X_projected = X %*% random_matrix

It’s…just random numbers. To do one of these good dimensionality reductions, you just sample an N × n matrix from the standard normal distribution and multiply. IT’S JUST RANDOM NUMBERS. Probably time for another walk and think.


Welcome back, again. At this point I was at a total loss. We only need ln(NumberOfPoints) / ε² dimensions to have a good projection, and doing the projection is just sampling a random matrix and a matrix-multiply? Like, for real? This is a true fact? I tried to understand the proofs here but they’re a tad inscrutable (and not just to me). It happens. I’m an engineer with some math background, not a mathematician. The next best thing is to just try it, right? The claim is so simple to test, and the method so simple to implement, that if I just run the code enough times I can quiet down my brain’s utter disbelief and maybe this can become a tool in my toolbox.

OK, let’s do it.

N = 10000 #The Number of dimensions in the original space
m = 1000  #The number of points in the dataset
epsilon = .1

X = matrix(rnorm(m * N, 100, 10), m, N) #Let's just pick some normal random data.
distX = dist(X) %>% as.matrix() #This is the distance matrix in the original space.

n = as.integer(ceiling(log(m) / epsilon ^ 2)) + 1 #The number of dimensions in the projected space.

random_matrix = matrix(rnorm(N * n, 0, 1), N, n) / sqrt(n) #The magical random matrix
random_matrix = matrix(rnorm(N * n, 0, 1), N, n) / sqrt(n) #See it's totally random, really there's nothing special about it.
random_matrix = matrix(rnorm(N * n, 0, 1), N, n) / sqrt(n) #Look, the deck is really shuffled. Completely random. No tricks.

X_projected = X %*% random_matrix #Project the points into the lower dimension.
distX_projected = dist(X_projected) %>% as.matrix() #The distance matrix in the projected space.

distances = data.frame(Distance = as.numeric(distX),
                       ProjectedDistance = as.numeric(distX_projected))

We’ve projected the original 10,000 dimensions down to 692 dimensions. We expect that all the distances are distorted by less than 10%. Let’s look at a histogram of the distortion of the distances…

distances %>%
  filter(Distance > 0.0) %>% #Ignore the distance from a point to itself.
  mutate(Distortion = (ProjectedDistance - Distance) / Distance) %>% 
  sample_frac(.1) %>%
  ggplot(aes(Distortion)) +
  geom_histogram(bins=40) + 
  geom_vline(xintercept = epsilon, color = "red", lty = 2) +
  geom_vline(xintercept = -epsilon, color = "red", lty = 2) + 
  labs(title = "Distribution of Distortion")


And we see that they’re basically all in range. Fantastic.

Trip Report: ICML 2015

One of the awesome benefits of working at Stack Overflow is the Conferences and Education policy. You get a ballpark budget ($3500) and time (3 days) each year to update your skills and knowledge. In return for all this conferencing and educationing you write up a summary of what you learned and your experience at the conference.

This year I went to the International Conference on Machine Learning in Lille, France. ICML is an academic conference broken up into three parts: 1 day of tutorials, 3 days of “The Main Conference”, and 2 days of Workshops.

TL;DR: I loved the tutorials, the keynotes, and the workshops. The paper sessions were hit-or-miss but I might’ve been doing it wrong (sorry, I’m new to the job). It was a great conference, I’m gonna go again next year.

The Tutorials

The tutorials were my favorite part of the conference. I wish the whole conference were tutorials like the ones I went to on this first day. There were three two-hour-long sessions that started at the surface and tried to dive somewhat deeply into their respective topics.

Natural Language Understanding: Foundations and State-of-the-Art

This was my favorite tutorial. It was given by Percy Liang, a computer science professor and natural language researcher at Stanford. One of the presenter’s goals was to seed the minds of ML theorists and practitioners with the essential phenomena and sharp corners of natural language in hopes of motivating new ideas. The amount of information he related in such a short time was impressive and there’s no way I can do it justice in a paragraph. Read through the slides to get an idea. There’s video coming in September.

Advances in Structured Prediction

The next tutorial was a two-parter on Structured Prediction from Hal Daume III and John Langford. I liked this tutorial because I knew the least about structured prediction going in, and I felt like I had at least a tenuous grasp on it by the end. Structured prediction is about answering the question “What can you do if you’re going to give up the independence assumption and model a joint probability distribution?” The tutorial went through what structured prediction is and the existing methods, then talked about their method (Learning to Search), empirical results, and performance, and advertised their fast/flexible implementation in Vowpal Wabbit. slides

Modern Convex Optimization Methods for Large-scale Empirical Risk Minimization

The final tutorial of the day was another two-parter on convex optimization. Machine learning is frequently formulated as minimizing (optimizing) a loss function. Certain loss functions have mathematical properties that make them efficiently optimizable and we call those functions “convex”. Two subsets of methods within convex optimization are: “primal” methods (gradient descent) and “dual” methods (coordinate descent). The first part of the talk covered primal methods and the second part of the talk covered the dual methods and concerns at-scale like minibatching, parallelization, and distributed implementations.

I liked this talk because it was mathy and about performance characteristics on big problems, and we really dig that kind of thing at Stack Overflow. Practically, though, this was the tutorial furthest from my day-to-day work. It’s nice to know that people are attacking problems of that size, but ours just aren’t big enough to warrant some of these techniques. part 1 slides part 2 slides

The Main Conference

Each day of the main conference had a keynote address and then 3-4 sessions of 3-5 short talks on submitted papers organized into different tracks. There were also poster sessions where you could meet and greet the researchers and dive into their paper with them in more depth.


Susan Murphy’s “Learning Treatment Policies in Mobile Health” was an interesting tale at the corner of healthcare funding, clinical trial design, machine learning, and mobile development. Smartphones might be able to provide effective, low-cost, personalized, just-in-time treatment for things we don’t fund particularly well in the US like drug addiction treatment. How to even study the problem is a can-of-worms factory. Confounding factors abound when you’re vying for a subject’s attention via their phone in the real world instead of in an in-patient treatment setting. Trying to learn effective treatment policies through the noise is a crazy-ambitious goal. I was really impressed by their devotion to the ethics of the research. Things like refusing to have the phone light up/deliver messages when it was moving above a certain speed. No point in trying to help the person’s drug addiction if messaging them causes them to crash their car.

Jon Kleinberg’s “Social Phenomena in Global Networks” talked about how we’re recording social activity more granularly than ever and insights that can be gleaned from looking at a person’s network neighborhood in a social graph and the interactions the person has with their network over time. He discussed the different features to extract from the graph in order to solve the problem of identifying a person’s romantic partner.

Two High-Stakes Challenges in Machine Learning

Both of those were super interesting but they were eclipsed by Leon Bottou’s talk on two high-stakes challenges in machine learning. The first challenge is that machine learning components are fundamentally different from the things we’re used to in software engineering. ML components aren’t rigid like sorting routines, they’re squishy. The guarantees they provide are weak. If the training set data distribution is different from the data distribution in situ for any reason, all your guarantees go out the window. Your component just ambles around drunkenly in an alien world grasping for vestiges of the only reality it knows.

“AHA!” you think. “I ain’t no fool. We won’t fall victim to the data distribution drift problem with statically trained models because we won’t use ‘em! We’ll learn online!” I, for one, greatly admire the bold leadership it takes to suggest that we actually fight fire with fire. Some would say “Maybe we’ve got enough learning,” but you say “Maybe we haven’t got enough learning enough!” I like you.

The problem with learning online is that you’ve created an even shakier foundation for your engineering division, because feedback loops are everywhere. If your organization has more than one online-learning ML algorithm in a particular pipeline, you’ve got a system where one piece can learn from another piece that is learning from the first piece. If feedback loops can manifest with just two online algorithms in one pipeline, imagine what happens when you’re trying to learn online in many places across a large engineering organization.

The second challenge is that the experimental paradigm that has driven machine learning forward over the last few decades is reaching its limits. Machine learning has a single experimental paradigm: split up the data into a test set and training/validation sets, train on the training/validation sets, and test final performance on the test set. Machine learning is an outlier among experimental sciences in that it has a single paradigm. Other sciences have to work much harder to design their experiments and interpret their results.

One of the problems with this single paradigm and its focus on datasets is that large datasets are biased. Training and testing on biased datasets leads to unrealistic expectations of performance. The image below pretty much tells the story. When classifying cars, models trained on one of the large datasets perform worse when tested on the other large datasets almost universally. If cars were represented in these datasets in an unbiased way you’d expect much smaller differences in performance.

And while we’re at it, we aren’t even learning the right things if we want to achieve the heights of our increased ambitions. Our solution to classification is statistical classification. But it’s hard to find a purely statistical basis in which to ground some problems, like the one brought up in this slide…

These insights were really eye-opening for me coming from a young engineering organization that’s only recently become more rigorous about adding ML components to our stack.


Paper sessions were scheduled around the keynotes. Each session had a topic, some topics spanned multiple sessions over the three days. The sessions all went similarly, a presenter would present their paper in ~15 minutes, then there was time for a few questions, then…next presenter.

This ended up being pretty grueling by the end of the third day. There wasn’t really time to just think about any of the things that were presented to you. Trying to put together the-problem-you-didn’t-even-know-was-a-problem-3-minutes-ago and the research-at-the-edge-of-human-understanding that was just presented as the solution to that problem was a difficult process to repeat 3 times an hour, 6 hours a day, for 3 days. That’s not to mention synthesizing the information from the day’s keynote, and the poster sessions. These were long days. Jet lag wasn’t helping either.

At any time there were 6 paper sessions going on, so my strategy was to just pick the topic that I was most interested in. This was a bit of a high variance strategy as the sessions had a hit-or-miss quality to them. Other people used a more informed strategy of picking specific papers/presenters they knew to be good and then chose their sessions based on that. There’s no “Map of the Machine Learning Stars” to consult though…my lack of familiarity with a lot of the field precludes that kind of strategy. Maybe when I’ve been around a bit longer I’ll have a better handle on that kind of thing.

The Workshops

The final piece of the conference was the workshops. These were little mini-conferences organized by groups representing subfields of machine learning. I was happy to attend the 2-day workshop on the latest in Deep Learning. Invited speakers gave hour-long talks on their research and took questions. The first day covered supervised learning, the second unsupervised and reinforcement learning.

The highlight of the two days was Jason Weston’s presentation on the bAbI tasks. Leon Bottou mentioned this work in his keynote as an example of a way to evaluate machine learning systems that wasn’t just another instance of the normal training/validation/test set paradigm. The context for the work is in Question Answering. The computer is presented a small story and asked a simple question like this…

Mary moved to the bathroom.
John went to the hallway.
Where is Mary?

You can classify the questions asked by the kind of knowledge necessary to answer them. The example above is of task #1 “Basic Factoid QA with a Single Supporting Fact”. Here’s an example of task #14 “Time manipulation”…

In the afternoon Julie went to the park.
Yesterday Julie was at school.
Julie went to the cinema this evening.
Where did Julie go after the park?

There are 18 more of these tasks that all test a different aspect of the machine’s “intelligence”. Counting, Negation, Deduction, Positional reasoning are some of these aspects. The bAbI tasks are interesting because they attempt to probe a specific ability of the program, much like a unit test. Cataloging the abilities of an ML system in this way was something I haven’t come across before. I’m really drawn to the idea.

Wrap Up

ICML 2015 was a great conference. It reminded me of being in the physics research community and rekindled my latent love for academia. I’ve been away from it for a long time. ICML 2016 is in NYC next year which is a lot closer to home. I’ll definitely be going.

Providence: Failure Is Always an Option

This post is part of a series on the Providence project at Stack Exchange. The first post can be found here.

The last five blog posts have been a highlight reel of Providence’s successes. Don’t be fooled, though, the road to Providence was long and winding. Let’s balance out the highlight reel with a look at some of the bumps in the road.

I said the road was long. Let’s quantify that. The first commit to the Providence repo happened on February 13, 2014 and Providence officially shipped with Targeted Job Ads for Stack Overflow on January 27, 2015. We worked on Providence for about a year.

You read that right; we worked on Providence for about a year. I’m telling you that this system…

…took a year to build. It’s kind of impressive, actually. The only new things on that diagram are the Daily Process, the Read API and a couple of redis instances. This certainly isn’t a big system by any means. We only sling around a few GB a day.

So…What Happened?

We failed slowly while solving the wrong problems trying to build a system that was too big on speculatively chosen technologies.

Too Big

When we started building Providence, we started with the idea that we were gonna keep data about persons for the most recent 365 days. Here’s a graph I made today:

That’s showing the distribution of people we saw on February 11, 2015 by the last time we saw them. As you go further back in time, the percentage of people who were last seen on that day drops off precipitously (that’s a log scale). Keeping a year of data around makes absolutely no sense if you have that graph. The potential benefit (a hit on a person we last saw long ago) is far outweighed by the operational cost of a system that can keep that much data around. We didn’t have that graph when we sized the system; we just picked a number out of the sky. Our choice of a year was an order of magnitude too high.

Speculative Technology Decisions

The problem with keeping around a year of data is that…that’s a lot of data. We decided that SQL Server, the data store we had, was inadequate based on the fact that we were going to have to do key-value lookups on hundreds of millions of keys. We never really questioned that assertion; it just seemed to go without saying.

This led us to a speculative technology decision. We “needed” a new datastore, so the task was to evaluate the ones that seemed promising. I spent about 2 weeks learning Cassandra, did some tiny proofs of concept, and without evaluating other options (aside from reading feature lists) decided that it was the way to go.

The terribleness of this decision would be borne out over the ensuing months, but far more terrible was this way of making decisions. Based on my most cursory of investigations of one technology I, with what can only be described as froth, was able to convince Stack Exchange Engineering and Site Reliability we needed to build a cluster of machines to run a distributed data store that I had all of two weeks of experience with. And we did it.

While I certainly appreciate the confidence, this was madness.

Another decision we made, apropos of nothing, was that we needed to update predictions for people every 15 minutes. This led us to decide on a Windows Service architecture, which wasn’t really our forte. In addition to that, we were also pretty frothy about using C#’s async/await as TheWay™ to do things, and we had some experience there but not a bunch.

Solving the wrong problems

Early on we spent a good deal of time on the offline machine learning aspect of Providence. This was one of the few things we got right. Even if we had all our other problems nailed down, Providence still wouldn’t be anything if the underlying models didn’t work. We knew the models had to work and that they were the hardest part of the problem at the beginning, so that’s what we worked on.

The Developer Kinds model was finished with offline testing in June. The next step was to get it tested in the real world. That next step didn’t happen for four months. The linchpin of the system sat on the shelf untested for four months. Why?

Some of the fail of speculative technology decisions is that you don’t realize exactly how much you’re giving up by striking out in a new direction. We put years of work into our regular web application architecture and it shows. Migrations, builds, deployment, client libraries, all that incidental complexity is handled just the way we like. Multiple SREs and devs know how most systems work and can troubleshoot independently. I wouldn’t go so far as to say it’s finely honed, but it’s definitely honed.

It was folly to think that things would go so smoothly with new-and-shiny-everything. The Windows Service + async/await + new data store equation always added up to more incidental work. We needed to make deploying Windows Services work. Cassandra needed migrations. Our regular perf tools like MiniProfiler don’t work right with async/await, so we needed to learn about flowing execution context. If Cassandra performance wasn’t what we needed it to be we were in a whole new world of pain, stuck debugging a distributed data store we had little experience with.

And then it happened.

The datacenter screwed up and power to our systems was cut unceremoniously. When we came back up Cassandra was in an odd state. We tried repairing, and anything else we thought would fix it but ultimately got nowhere. After a few days we found a bug report that exhibited similar behavior. It’d been filed a few weeks earlier, but there was no repro case. The ultimate cause wasn’t found until a month after the original bug report was filed.

This was the nail in the coffin for Cassandra. The fact that we got bitten by a bug in someone else’s software wasn’t hard to swallow. Bugs happen. It was the fact that, had we been in production, we’d have eaten a mystery outage for a month before someone was able to come up with an answer. It just proved how not ready we were with it, and it made us uncomfortable. So we moved on.

Speculative Technology Decisions Redux

So what do you do when a bad technology choice bites you in the ass after months of work? You decide to use an existing technology in a completely unproven way and see if that works any better, of course.

We still “knew” our main tools for storing data weren’t going to work. But we’d also just been bitten and we wanted to use something closer to us, not as crazy, something we had more experience with. So we chose elasticsearch.

Solving The Wrong Problems Redux

And lo, we repeated all the same wrong problem solving with elasticsearch. There was a smidge less work because elasticsearch was already part of our infrastructure. Our operational issues ended up being just as bad though. We were auditing the system to figure out why our data wasn’t as we expected it and rediscovered a more-than-a-year-old bug: HTTP pipelining didn’t work. We turned pipelining off, and while elasticsearch acted correctly we saw a marked performance drop. We tried to optimize for another few weeks but ultimately threw in the towel.

Failing slowly

Bad planning, bad tech decisions, and a propensity for sinking weeks on things only incidental to the problem all add up to excruciatingly…slow…failure. Failing is only kind of bad. Failing slowly, particularly as slowly as we were failing, and having absolutely nothing to show for it is so much worse. We spent months on datastores without thinking to ourselves “Wow, this hurts a lot. Maybe we shouldn’t do this.” We spent time rewriting client libraries and validating our implementations. We held off throwing in the towel until way too late way too often.

At this point, now mid September 2014, we finally realized we needed to get failing fast on these models in the real world or we weren’t going to have anything to ship in January. We gave up on everything but the models, and focused on how we could get them tested as quickly as possible. We dropped back to the simplest system we could think of that would get us going: Windows task scheduler invoked console applications that write data to files on disk that are loaded into redis. We gave up on the ill-chosen “a year of data” and “updates every 15 minutes” constraints, and backed off to 6 weeks of data updated daily.

Within two weeks we had real-world tests going and we got incredibly lucky. Results of the model tests were nearly universally positive.

So what’s different now?

Engineering made some changes in the way we do things to try to keep Providence’s lost year from happening again. Overall we’re much more prickly about technology choices now. Near the end of last year we started doing Requests For Comment (RFCs) to help our decision making process.

RFCs start with a problem statement, a list of people (or placeholders for people) representative of teams that are interested in the problem, and a proposed solution. RFCs publicize the problem within the company, help you to gauge interest, and solicit feedback. They are public artifacts (we just use Google Docs) that help surface the “how” of decisions, not just the “what”. It’s still early going, but I like them a lot.

Kevin and I have essentially become allergic to big projects. We attempt to practice “What can get done by Friday?” driven development. Keeping things small precludes a whole class of errors like “We need a new datastore”, ‘cause that ain’t gettin’ done by Friday. It’s hard to sink a week on something you weren’t supposed to be working on when all you have is a week.

Providence: Architecture and Performance

This post is part of a series on the Providence project at Stack Exchange. The first post can be found here.

We’ve talked about how we’re trying to understand our users better at Stack Exchange and seen just how big an impact it’s had on our pilot project, the Careers Job Ads. Let’s take a look at the architecture of the system.


This is the easy part, so let’s just get it out of the way. Providence has 2 workhorse servers (where we crunch data and make predictions), and 3 redis servers, configured like this:

Workhorse Servers

  • Dell R620
  • Dual Intel E5-2697v2 (12 Physical Cores @2.7GHz, 30MB Cache, 3/3/3/3/3/3/3/4/5/6/7/8 Turbo)
  • 384GB RAM (24x 16GB @ 1600MHz)
  • 2x 2.5” 500GB 7200rpm (RAID 1 - OS)
  • 2x 2.5” 512GB Samsung 840 Pro (RAID 1 - Data)

Redis Servers

  • Dell R720XD
  • Dual Intel E5-2650v2 (8 Physical Cores @2.6GHz, 10MB Cache, 5/5/5/5/5/6/7/8 Turbo)
  • 384GB RAM (24x 16GB @ 1600MHz)
  • 2x 2.5” 500GB 7200rpm (RAID 1 - OS, in Flex Bays)
  • 4x 2.5” 512GB Samsung 840 Pro (RAID 10 - Data)

If you’re a hardware aficionado you might notice that’s a weird configuration for a redis box. More on that (and the rest of our failures) in my next post.


Providence has 3 main parts: The Web Apps/HAProxy, The Daily Process, and the Read API.

The Web Apps and HAProxy


Providence’s architecture starts with the problem of identity. In order to make predictions about people we have to know who’s who. We handle identity in the Stack Exchange web application layer. When a person comes to Stack Exchange they’re either logged in to an account or not. If they’re logged in we use their Stack Exchange AccountId to identify them. If they’re not logged in, we set their “Providence Cookie” to a random GUID with a far future expiration and we use that to identify the person. We don’t do panopticlicky evercookies or anything, because ew. We add the person’s identifier to a header in every response the web application generates.


HAProxy sits in front of all our web apps and strips the identity header off of the response before we return it to the user. HAProxy also produces a log entry for every request/response pair that includes things like the date, request URI and query string, and various timing information. Because we added the person’s identity to the response headers in the web application layer, the log also includes our identifier for the person (the AccountId or the Providence Cookie).

We take all these log entries and put them into our HAProxyLogs database, which has a table for each day. So for each day we have a record of who came to the sites and what URIs they hit. This is the data that we crunch in the Daily Process.

The Daily Process

At the end of each broadcast day we update Providence’s predictions. The Daily Process is a console application written in C# that gets run by the Windows Task Scheduler at 12:05AM UTC every morning. There are 6 parts to the Daily Process.

Identity Maintenance

Whenever a person logs in to Stack Exchange they will have 2 identifiers in the HAProxyLogs that belong to them: the ProvidenceCookie GUID (from when they were logged out) and their AccountId (from after they logged in). We need to make sure that we know these two identifiers map to the same person. To do that we have the web app output both identifiers to the HAProxyLogs whenever someone successfully logs in. Then we run over the logs, pull out those rows, and produce the Identifier Merge Map. This tells us which Providence Cookie GUIDs go with which AccountIds. We store the Identifier Merge Map in protocol buffers format on disk.
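The merge map itself is conceptually tiny. Here’s an illustrative sketch in Python (the real code is C# and serializes the map with protocol buffers; the names are mine, not the system’s):

```python
# Build the Identifier Merge Map from the (cookie GUID, AccountId) pairs the
# web app writes to the HAProxyLogs on every successful login.

def build_merge_map(login_rows):
    """login_rows: iterable of (providence_cookie_guid, account_id) pairs
    pulled from the day's logs. Returns cookie GUID -> AccountId."""
    merge_map = {}
    for cookie_guid, account_id in login_rows:
        merge_map[cookie_guid] = account_id
    return merge_map

def canonical_id(identifier, merge_map):
    """Resolve any identifier to the AccountId when we know the mapping,
    otherwise keep using the anonymous cookie GUID."""
    return merge_map.get(identifier, identifier)
```
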

Identity maintenance takes about 20 minutes of the Daily Process. The merge map takes up about 4GB on disk.

Tag Views Maintenance

The second step of the Daily Process is to maintain Tag View data for each person. A person’s Tag View data is, for each tag, the number of times that person has viewed a question with that tag. The first thing we do is load up the Identifier Merge Map and merge all the existing Tag View data to account for people who logged in. Then we run over the day’s HAProxyLogs table pulling out requests to our Questions/Show route. We regex the question id out of the request URI, lookup that question’s tags and add them to the person’s Tag View data.

We keep all Tag View data in one of our redis instances. It’s stored in hashes keyed by a person’s identifier. The field names of the hashes are integers we assign to each tag, the values are the number of times the person has viewed a question with the associated tag.

For each person we also maintain when their Tag View data was last updated, once in each person’s TagViews hash, and once again in a redis sorted set called “UpdatedDates” (more on this in a moment).
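In sketch form, the per-request update looks something like this (illustrative Python with dicts standing in for the redis hashes and sorted set; the route regex and names are assumptions, and the real code is C# against redis):

```python
# Tag Views maintenance: regex the question id out of a Questions/Show
# request, look up the question's tags, bump the person's counters, and
# remember that this person was updated today.
import re
from collections import defaultdict

QUESTION_ROUTE = re.compile(r"^/questions/(\d+)")

tag_views = defaultdict(lambda: defaultdict(int))  # person -> tag id -> view count
updated_dates = {}                                 # person -> last updated ("UpdatedDates")

def record_view(person_id, request_uri, question_tags, tag_ids, day):
    m = QUESTION_ROUTE.match(request_uri)
    if not m:
        return
    question_id = int(m.group(1))
    for tag in question_tags(question_id):       # look up the question's tags
        tag_views[person_id][tag_ids[tag]] += 1  # HINCRBY equivalent
    updated_dates[person_id] = day               # ZADD equivalent
```
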

Tag Views maintenance takes about 20 minutes of the Daily Process. The Tag View data set + UpdatedDates fits in about 80GB of memory, and 35GB when saved to disk in redis’ RDB format.

Making Predictions

The third step of the Daily Process is to update our predictions. Our models are static, and the only input is a person’s Tag View data. That means we only need to update predictions for people whose Tag View data was updated. This is where we use the UpdatedDates sorted set as it tells us exactly who was updated on a particular day. If we didn’t have it, we’d have to walk the entire redis keyspace (670MM keys at one point) to find the data to update. This way we only get the people we need to do work on.

We walk over these people, pull out their Tag View data and run it through each of our predictors: Major, Web, Mobile, and Other Developer Kinds, and Tech Stacks. We dump the predictions we’ve made onto disk in dated folders in protocol buffers format. We keep them on disk for safety. If we hose the data in redis for some reason and we don’t notice for a few days we can rebuild redis as it should be from the files on disk.
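The “only touch who changed” trick is the whole point of UpdatedDates. A minimal Python sketch of the pass (illustrative; the real code is C# and dumps protocol buffers to dated folders):

```python
# Daily prediction pass: score only the people whose Tag View data changed
# on `day`, instead of walking the entire keyspace.

def update_predictions(updated_dates, tag_views, predictors, day):
    """predictors: name -> function(tag view features) -> prediction."""
    todays_people = [p for p, d in updated_dates.items() if d == day]
    predictions = {}
    for person in todays_people:
        features = tag_views[person]
        predictions[person] = {name: predict(features)
                               for name, predict in predictors.items()}
    return predictions
```
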

Making predictions takes 5 or 6 hours of the Daily Process. The predictions take up about 1.5GB on disk per day.

Loading Predictions Into Redis

The fourth step of the Daily Process is to load all the predictions we just made into redis. We deserialize the predictions from disk and load them into our other redis server. It’s a similar setup to the Tag View data. There’s a hash for each person keyed by identifier, the fields of the hash are integers that map to particular features (1 = MajorDevKind.Web, 2 = MajorDevKind.Mobile, etc), and the values are the predictions (just doubles) for each feature.

Loading predictions takes about 10 minutes of the Daily Process. The predictions dataset takes up 85GB in memory and 73GB on disk in redis’ RDB format.

Six Week Cull

Providence’s SLA is split by whether you’re anonymous or not. We keep and maintain predictions for any anonymous person we’ve seen in the last 6 weeks (the vast vast majority of people). We keep and maintain predictions for logged in people (the vast vast minority of people) indefinitely as of 9/1/2014. So that means once an anonymous person hasn’t been updated in 6 weeks we’re free to delete them, so we do. We use the UpdatedDates set again here to keep us from walking hundreds of millions of keys to evict the 1% or so that expire.
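The cull boils down to a range query on UpdatedDates plus a pair of deletes. A rough Python sketch (illustrative; in redis this is a ZRANGEBYSCORE over the sorted set rather than a scan, and the anonymous check is mine):

```python
# Six Week Cull: drop anonymous people whose Tag View data hasn't been
# updated in the last 6 weeks.
from datetime import date, timedelta

def six_week_cull(updated_dates, tag_views, is_anonymous, today):
    cutoff = today - timedelta(weeks=6)
    stale = [p for p, last in updated_dates.items()
             if is_anonymous(p) and last < cutoff]
    for person in stale:
        del updated_dates[person]    # ZREM equivalent
        tag_views.pop(person, None)  # DEL equivalent
    return stale
```
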

The Six Week Cull runs in about 10 minutes.


The last part of the Daily Process is getting all the data out to Oregon in case we have an outage in New York. We backup our two redis instances, rsync the backups to Oregon, and restore them to two Oregon redis instances.

The time it takes to do Backup/Restore to Oregon varies based on how well our connection to Oregon is doing and how much it’s being used by other stuff. We notify ourselves at 5 hours if it hasn’t completed, just to make sure someone looks at it, and we’ve tripped that notification a few times. Most nights it takes a couple hours.

The Read API

Step four in the Daily Process is to load all our shiny new predictions into redis. We put them in redis so that they can be queried by the Read API. The Read API is an ASP.NET MVC app that turns the predictions we have in redis hashes into JSON to be consumed by other projects at Stack Exchange. It runs on our existing service box infrastructure.

The ad server project calls into the Read API to get Providence data to make better decisions about which jobs to serve an incoming person. That means the Read API needs to get the data out as quickly as possible so the ad renders as quickly as possible. If it takes 50ms to pull data from Providence we’ve just eaten the entire ad server’s response time budget. To make this fast, the Read API takes advantage of the serious performance work that’s gone into our redis library StackExchange.Redis, and Kevin’s JSON serializer Jil.


The Workhorses

Excluding backups, the daily process takes about 7 hours a day, the bottleneck of which is cranking out predictions. One workhorse box works and then sits idle for the rest of the day. The second workhorse box sits completely idle, so we have plenty of room to grow CPU-hours wise.


With redis we care about watching CPU load and memory. We can see there are two humps in CPU load each day. The first hump is the merging of Tag View data, the second is when redis is saving backups to disk. Other than that, the redii sit essentially idle, only being bothered to do reads for the Read API. We aren’t even using the third redis box right now. After we implemented the Six Week Cull, the Tag Views and Predictions data sets fit comfortably in RAM on one box, so we may combine the two we’re using down to just one.

The Read API

The big thing to watch in the Read API is latency. The Read API has a pretty decent 95th-percentile response time since launch, but we did have some perf stuff to work out. The Read API is inextricably tied to how well the redii are doing; if we’re cranking the TagViews redis core too hard, the Read API will start timing out because it can’t get the data it needs.

At peak the Read API serves about 200 requests/s.

Failure is Always an Option

That’s it for the architecture and performance of Providence. Up next we’ll talk about our failures in designing and building Providence.

Providence: Testing and Results

This post is part of a series on the Providence project at Stack Exchange. The first post can be found here.

The Providence project was motivated by our desire to better understand our users at Stack Exchange. So we think we’ve figured out what kind of developers come to our sites, and what technologies they’re using. Then we figured out a way to combine all our features into the Value Function. How did we test these features?

After each feature worked in the clean room (it passed muster on the training, validation, and test sets we’d labeled) we had to drop it into some real-world mud to try and move a needle. Please attempt to contain your surprise as I tell you we chose to use ads as our mud puddle, and click-through rate as our needle.

The particular ads we wanted to use were the Careers Job Ads that run in the sidebar of Stack Overflow. Last we left Careers Job Ads, we knew that geography was by far the most important feature we could use to select jobs to show you in the ads. We’ve done some visual updates since then, but the job selection algorithm has remained unchanged.

We had a number of features in the pipeline at this point that were untested in real-world conditions. Major Dev Kinds, Minor Dev Kinds, Tech Stacks and Job Tags were the individual features we had to test. We did those first and then the final test was for the Value Function that combined each of those individual features and geography. Each test had these phases:

Labeling Party: The Jobening

We had a model for users, but we weren’t going to model jobs at the same time because then we wouldn’t know whether we were testing the user model or the job model. We also needed the cleanest labels on jobs that we could get which meant good ol’ fashioned home grown fresh squeezed human labels. Gotta give a big shout out to the developers at Stack Exchange for giving us their time to label each data set.

Pre-analysis and Experiment Design

This is a thing we’re getting more and more into here. We’ve wasted a lot of cycles on tests and ideas that would’ve been avoidable if we’d just looked at the data ahead of time. We needed to know approximately how many impressions we were going to get in the test groups so we would know how long we’d have to wait for the A/B results to converge. If it was going to take months for the results to converge for one of these experiments, we’d be better off just doing a different experiment. We had to know when we were going to have to give up controlling for important factors like geography in order to get enough data points in a bucket.
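The “how long until the A/B results converge” question is really a sample-size question. Here’s the flavor of that pre-analysis as a rough two-proportion power calculation in Python (an illustration of the idea, not necessarily the exact arithmetic we ran; the z-values correspond to ~95% confidence and ~80% power):

```python
# Approximate impressions per test bucket needed to detect a relative CTR
# lift, using the standard two-proportion sample size formula.

def impressions_needed(base_ctr, lift, z_alpha=1.96, z_beta=0.84):
    p1 = base_ctr
    p2 = base_ctr * (1 + lift)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return (z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2

def days_to_converge(base_ctr, lift, impressions_per_day):
    return impressions_needed(base_ctr, lift) / impressions_per_day
```

Plugging in a small baseline CTR and a modest expected lift tells you quickly whether a bucket will converge in days or months, which is exactly the go/no-go call we needed to make before running each experiment.
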

Turning it up

Once we knew how long we’d have to wait, we did some easing in. We’d run the test at 1% for a day to work out any bugs, monitor the early impression distributions and click-through rates to make sure they weren’t getting completely hosed and that we were also measuring correctly. Then we’d slowly ramp it up during the second day to 100% of usable impressions.


Post Analysis

At this point we’d have click-through rates that had supposedly converged, but it wasn’t enough that CTR just went up. We’d also have to make sure that gains on one subgroup weren’t at an unreasonable expense to the other groups. We called these checks “Degeneracy Passes”.

By the end, we were able to get through about a test per week. By the luck o’ the data scientist we even got some decent results.


The Value Function performs pretty admirably in comparison to the geography based targeting we were using before. We saw ~27% lift.

Looking at CTR improvement by geography, we found that jobs closer to the user improved less, confirming our previous experiments that geography is pretty important. Inside 60 miles we saw ~17% lift picking jobs using Value Function. Outside 60 miles we saw even larger gains.

Looking at jobs by some popular dev kinds we saw the following gains:

It wasn’t all kitten whispers and tickle fights. Desktop OSX jobs lost about 6% of their clicks, but we only had 9 of those jobs in the test. Similarly, SharePoint Integration jobs lost 100% of their clicks in the test, but there were only 3 of those jobs. I guess we’ll just have to live with those terrible blemishes for now.

Looking at jobs with a smattering of popular tags we saw these gains:

Node really stands out there (I think we just got lucky), but we were pretty happy with all those gains. We didn’t find any losers among tags used on more than 50 jobs.

Next up: The Architecture of Providence

A Wild Anomaly Appears! Part 2: The Anomaling

After all the rave reviews of my last post I knew you were just on the edge of your seat waiting to hear more about my little unsupervised text anomaly detector.

So, we’ve got some working ML! Our job’s done right?

‘Round these parts, it ain’t shipped ‘til it’s fast and it’s got its own chat bot. We spend all day in chat and there’s a cast of characters we’ve come to know, and love, and hate.


Pinbot helps us not take each other’s database migrations. You can say “taking ###” and Pinbot removes the last taken migration from this room’s pins and replaces it with this one. So we always know what migration we’re up to, even if someone hasn’t pushed it yet. Also pictured: Me calling myself a dumbass in the git logs. Which gets broadcast by our TeamCity bot. Someone starred it.

Hair on Fire

Hair on fire bot helps Careers keep making money. Hair on fire pops up every now and again to tell us a seriously critical error that affects the bottom line has happened on the site. If someone is buying something or on their way to buying something and gets an exception Hair On Fire dumps the details directly to chat so someone can get to work immediately.


Here’s another little Careers helper. We have a policy that any advertising images that go up on Stack Overflow must be reviewed by a real live person. When display advertising is purchased from our ad sales/ad operations teams, they do it. When we introduced Company Page Ads we automated the process of getting images on Stack Overflow and no one had to talk to a sales person. This begat LogoBot who reminds us to do our job and make sure no one’s putting up animated gifs or other such tawdriness.

Malfunctioning Eddie

Malfunctioning Eddie’s…got problems.

Anomaly Bot

Which brings me to the Anomaly bot. We need to make sure that all these anomalous jobs I’m detecting get in front of the sales support team. They are the human layer of detectors I alluded to in my last post who used to have to check over every single job posted to the board.

There it is. Anomaly bot. Where does it lead us puny humans?

Welcome to the super secret admin area of Careers. At the top of the page we have a list of the jobs that were posted today. There are 3 columns, the anomaly score (which is based solely on title), the job title, and a few buttons to mark the job anomalous or not. The second section of the page is for all jobs currently on the board.

I’m hoping the heatmap pops out at you. It runs from Red (pinned to the all-time most anomalous job) to Green (pinned to the all-time most middle-of-the-road job ever). The jobs posted today are light orange at worst, so that’s pretty good! On the “all jobs” list there’s a bunch of red that we need to review.

Just to give a reference, here was the first version sans heatmap.

So much more information in that tiny extra bit of color. If you want to make simple heatmaps it’s really easy to throw together some javascript that uses the power of HSL.
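The trick is just a linear map from score onto the HSL hue wheel, where 0° is red and 120° is green. The post did this in a few lines of javascript; here’s the same idea sketched in Python (names and ranges are illustrative):

```python
# Map an anomaly score onto a CSS hsl() color: the all-time most anomalous
# score pins to red (hue 0), the most middle-of-the-road pins to green (120).

def heat_color(score, worst, best):
    t = (score - worst) / (best - worst)  # 0.0 at worst, 1.0 at best
    t = min(1.0, max(0.0, t))             # clamp scores outside the range
    hue = round(120 * t)
    return f"hsl({hue}, 100%, 50%)"
```
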

What’s Next?

We’re gonna let this marinate for a while to actually test my hypothesis that we only have to look at the top 10% of jobs by anomaly score. The sales support team’s gonna clear the backlog in the “all jobs” section of the report, then use the tool for a little while and then we’ll have the data we need to actually set the threshold. Once we do that the Anomaly bot can get a little smarter. Right now Anomaly bot just shows up every three hours with that same dumb message. Maybe it’ll only show up when there are jobs above our human-trained threshold (modulo a safety factor). Maybe we’ll change it to pop up as soon as an anomalous job gets posted on the board.

Here, have some code

If you want to use the very code we’re using right now to score the job titles, it’s up on NuGet, and the source is on GitHub.

Got experience solving problems like this one? Wanna work at a place like Stack Exchange? Head on over to our completely middle-of-the-road job listing and get in touch with us.

A Wild Anomaly Appears!

So, I’m working on the new Data Team at Stack Exchange now. Truth is we have no idea what we’re doing (WANNA JOIN US?). But every now and then we come across something that works a little too well and wonder why we haven’t heard about it before.

We run a niche job board for programmers that has about 2900 jobs on it this morning. Quality has been pretty easy to maintain. We have a great spam filter called “The $350 Price Tag”. Then we have some humans that look over the jobs that get posted looking for problems. Overall the system works well, but at 2900 jobs a month that means someone has to look through about 150 jobs every working day. They’re looking for a needle in a haystack as most (>95%) of the jobs posted are perfectly appropriate for the board, so there’s a lot of “wasted” time spent looking for ghosts that aren’t there. And it’s pretty boring to boot. I’m sure that person would rather do other things with their time.

It’d be nice if we had an automated way of dealing with this. We have no idea what we’re doing, so we frequently just reach into our decidedly meager bag of tricks, pull one out, and try it on a problem. I’d done that a few times before on this problem, trying Naive Bayes or Regularized Logistic Regression, but had gotten nowhere. There are a lot of different ways a job can be appropriate for the board and there are a lot of different ways a job could be not appropriate for the board which makes coming up with a representative training set difficult.

Last week while taking another whack at the problem I Googled “Text Anomaly” and came across David Guthrie’s 186-page Ph.D. thesis, Unsupervised Detection of Anomalous Text. There’s a lot there, but the novel idea was simple enough (and it worked in his experiments and mine) that I’m surprised I haven’t heard about it until now.

Distance to the Textual Complement

Say you have a bunch of documents. You pull one out and want to determine how anomalous it is with respect to all the others. Here’s what you do:

  1. Choose some features to characterize the document in question.
  2. Convert the document to its feature representation.
  3. Treat all the other documents as one giant document and convert that to its feature representation.
  4. Calculate the distance between the two.

Do this for every document and sort the results descending by the distance calculated in step 4. The documents at the top of the list are the “most anomalous”.

That’s it. Pretty simple to understand and implement. There are two choices to make: which features, and which distance metric to use.

Obscurity of Vocabulary, I choose you!

In any machine learning problem you have to come up with features to characterize the set of things you’re trying to work on. This thesis is chock full of features, 166 of them broken up into a few different categories. This part of the paper was a goldmine for me (remember, I have no idea what I’m doing). The text features I knew about before this were word counts, frequencies, tf-idf and maybe getting a little into part of speech tags. The kinds of features he talks about are stuff I never would’ve come up with on my own. If you’re doing similar work and are similarly lost, take a look there for some good examples of feature engineering.

The set of features that stood out to me the most were the ones in a section called “Obscurity of Vocabulary Usage”. The idea was to look at a giant reference corpus and rank the words in the corpus descending by frequency. Then you make lists of the top 1K, 5K, 10K, 50K, etc. words. Then you characterize a document by calculating the percentages of the document’s words that fall into each bucket.

Manhattan Distance, I choose you!

Guthrie pits a bunch of distance metrics against each other, and for the Distance to the Textual Complement method the Manhattan distance got the blue ribbon, so I used that.


When I’d looked through the jobs before, I could pretty much tell by their titles whether they were busted or not, so my documents were just the job titles. There isn’t really a good reference corpus from which to build the Top-N word lists, so I just used the job titles themselves. I tried a couple different sets of Ns but ended up on 100, 300, 1000, 3000, and 10000 (really ~7,000, as that’s the number of unique terms in all job titles).
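Putting the whole pipeline together is only a few dozen lines. Here’s an illustrative Python sketch (the shipped version is the C# library on NuGet; the function names here are mine): rank the vocabulary by frequency across all titles, featurize each title as the fraction of its words falling in each top-N list, and score it by Manhattan distance to the features of everything else concatenated.

```python
# Distance to the Textual Complement over job titles, with Obscurity of
# Vocabulary features and Manhattan distance.
from collections import Counter

NS = [100, 300, 1000, 3000, 10000]  # the top-N cutoffs from the post

def rank_words(docs):
    """word -> frequency rank (0 = most common) across all documents."""
    counts = Counter(w for doc in docs for w in doc)
    return {w: i for i, (w, _) in enumerate(counts.most_common())}

def features(words, rank, ns):
    """Fraction of the document's words falling in each top-N list."""
    if not words:
        return [0.0] * len(ns)
    return [sum(1 for w in words if rank.get(w, 10**9) < n) / len(words)
            for n in ns]

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def anomaly_scores(titles, ns=NS):
    docs = [t.lower().split() for t in titles]
    rank = rank_words(docs)
    scores = []
    for i, doc in enumerate(docs):
        # The "textual complement": every other document as one giant document.
        complement = [w for j, d in enumerate(docs) if j != i for w in d]
        scores.append(manhattan(features(doc, rank, ns),
                                features(complement, rank, ns)))
    return scores
```

Sort the titles descending by score and the weird ones float to the top. (With a corpus this small you’d use smaller Ns than the defaults, since nearly every word of a few titles lands in the top 100.)
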


Here’s the sort of all the jobs that were on the board yesterday.

Basically everything on the right is good and has low scores (distances).

Most of the jobs on the left have something anomalous in their titles. Let’s group up the anomalies by the reasons they’re broken and look over some of them.

Stuff that just ain’t right

These jobs just don’t belong on the board. We need to follow up with these people and issue refunds.

  1. Supervisor Commercial Administration Fox Networks Group
  2. Sales Executive Risk North Americas
  3. Associate Portfolio Manager Top Down Research
  4. Senior Actuarial Pre-Sales Consultant
  5. Ad Operations Coordinator
  6. NA Customer Service Representative for Sungard Energy
  7. Manager, Curation

Just Terrible Titles

These jobs would belong, but the titles chosen were pretty bad. Misspellings, too many abbreviations, etc. We need to follow up with our customers about these titles to improve them.

  1. Javascript Devlopers Gibraltar
  2. Sr Sys Admin Tech Oper Aux Svcs
  3. VR/AR Developer

Duplicate Information

Anywhere you see the title for a job on Stack Overflow or Careers we also show you things like the location of the job, whether it’s remote or not, and the company’s name. These titles duplicate that information. We need to follow up with our customers to improve them.

  1. Delivery Manager, Trade Me, New Zealand
  2. Visualization Developer, Calgary
  3. Technical Expert Hyderabad
  4. Technical Expert Pune
  5. Technical Expert Chennai
  6. Technical Expert Gurgaon
  7. New York Solutions Architect
  8. Sr Fullstack Eng needed for Cargurus We reach over 10MM unique visitors monthly
  9. Sr. Tester, Sky News, West London
  10. Chief Information Officer/CIO Audible.com
  11. Machine Learning Engineer Part Time Remote Working Available

What about the false positives?

A number of false positives are produced (this is just a sample):

  1. Computer Vision Scientist
  2. Mac OSX Guru
  3. Angularjs + .NET + You
  4. Android Developer 100% Boredom Free Zone
  5. Java Developer 100% Boredom Free Zone
  6. DevOps Engineer - Winner, Hottest DC Startups!!! $10M Series A
  7. Jr. Engineer at Drizly

Some of these (Computer Vision, Mac OSX) are just infrequently found on our board. Some of these people are trying to be unique (and are successful, by this analysis) so that their listing stands out.

Guthrie goes into some detail about this in the paper's section on precision and recall. His conclusion is that this kind of anomaly detection is particularly suited to situations where a human layer of detectors is the last line of defense and you want to reduce the work they have to do. An exhaustive look through the scores shows that every job we need to follow up on falls in the top 10% when sorted descending by anomaly score. Setting the threshold there should cut our humans' workload by 90%, making them happier and less bored, and improving the quality of the job board.
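That thresholding step is simple enough to sketch. This is a hypothetical helper of my own (not from the paper): rank jobs by score descending and hand only the top fraction to the humans.

```python
def flag_for_review(scores, fraction=0.10):
    # Keep only the top `fraction` of jobs, ranked by anomaly score
    # (descending), for a human to look at. Everything else skips review.
    cutoff = max(1, int(len(scores) * fraction))
    ranked = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
    return set(ranked[:cutoff])
```

With `fraction=0.10`, the humans see one job in ten instead of all of them, and (by the analysis above) still see every job that actually needs a follow-up.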

So You Want a Zillion Developers…

I work at Stack Overflow on Careers 2.0. In addition to our job board we have a candidate database where you can search for developers to hire. Our candidate database has 124K+ developers in it right now.

Customers frequently gawk at this number because they’ve looked at other products in the dev hiring space that offer millions of candidates in their databases. Sourcing.io claims to have “over 4 million developers” in their database. Gild offers “Over 6 Million Developers”. Entelo will give you access to “18+ million candidates indexed from 20+ social sites.”

Yeah man, your numbers stink

Hey. That hurts.

Let's put those numbers in perspective. The vast majority of the developers "in" these other databases don't even know the databases exist. The devs never signed up to be listed or even indicated that they were looking for work. There isn't even a way to opt out. These databases are built by scraping APIs and data dumps from sites developers actually care about, like Stack Overflow and GitHub.

On the other hand the only people you’ll find in the Careers 2.0 database are ones who made the affirmative choice to be found. They start by getting an invitation to create a profile. They build out a profile with their employment and education history, open source projects, books they read, peer reviewed answers on Stack Overflow, and so on. Then they can choose to be listed as either an active candidate (they’re in the market for a job right now) or a passive candidate (they’re happy where they are but are willing to hear your offer). After a candidate gets hired they can delist themselves from the database so future employers don’t waste any time on them.

So the difference between us and them is that we give you a smaller number of candidates who are already interested in job offers and they give you a giant database filled with hope and built by skeez.

We have some data from Careers that tells us hope is not a recruiting strategy.

Our Experiment

Careers 2.0 experimented with the “index a bunch of people who don’t know they’re being indexed” model to see if it could possibly work. We created what we called “mini-profiles” which consisted exclusively of already public information available on Stack Overflow. We would add mini-profiles to the database if the Stack Overflow user provided a location in their profile and had a minimum number of answers with a minimum score. We showed these mini-profiles along with our “real” candidates in search results. If an employer wanted to contact one of the people behind a mini-profile Careers 2.0 would send them an e-mail asking if they want to open up a conversation with the employer. If the candidate wanted to continue they could authorize us to share their contact information with the employer and they’d start working on their love connection.

Our Results

We track response rates to employer messages to look out for bad actors and generally make sure the messaging system is healthy. A candidate can respond to a message interested/not interested or they can fail to respond at all. Response rate is defined as Messages Responded To / Messages Sent. When we compared the response rates of messages to mini-profiles to the response rates of messages to “real” profiles the results were not good for mini-profiles. Messages to “real” profiles were 6.5x more likely to get a response than messages to mini-profiles. That was the last and only straw for mini-profiles. We retired the experiment earlier this year.
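The response-rate math itself is just the ratio defined above. A quick sketch, with invented message counts purely for illustration (the 6.5x figure is the real comparison; these totals are not our data):

```python
def response_rate(responded, sent):
    # Response rate = Messages Responded To / Messages Sent.
    return responded / sent if sent else 0.0

# Hypothetical counts, chosen only to illustrate the 6.5x gap.
real_profiles = response_rate(130, 500)  # messages to opted-in profiles
mini_profiles = response_rate(20, 500)   # messages to scraped mini-profiles
```

Whatever the absolute numbers, the ratio is what killed mini-profiles.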

So what about the zillions of programmers?

All those services I named at the beginning of this post do what we did in our experiment, just more extensively, by including devs from more places online. I have to believe that the response rates from their unqualified leads are similar to the ones we found in our experiment. I suppose technically the response rates from randodevs on GitHub or Bitbucket could be higher than those of randodevs on Stack Overflow, thus invalidating our conclusion, but anecdotal evidence from our customers about those other services suggests not.

"Wait a sec there Jason," you're thinking, "if their databases are at least 6.5x larger than yours, I'll still get more responses to my messages, right?" Absolutely! That's called spam. You are totally allowed to go down the path of the spammer, but let me hip you to the two problems there. The first problem with your plan is that devs hate recruiting spam more than they hate PHP, and they hate PHP a lot. The word will get out that you're wasting everyone's time. People will write about it. The second problem is that spam is supposed to be cheap. This isn't cheap. In this case you'll have to spend at least 6.5x the time wading through these zillions of devs, identifying the ones that meet your hiring criteria, messaging them, and waiting for responses. So not only are you wasting their time, you're wasting yours.

We aren’t going to build this business off hope and spam and wasting people’s time. If a smaller database is the price, so be it.