Jason Punyon

Providence: Failure Is Always an Option

This post is part of a series on the Providence project at Stack Exchange. The first post can be found here.

The last five blog posts have been a highlight reel of Providence’s successes. Don’t be fooled, though: the road to Providence was long and winding. Let’s balance out the highlight reel with a look at some of the bumps in the road.

I said the road was long. Let’s quantify that. The first commit to the Providence repo happened on February 13, 2014 and Providence officially shipped with Targeted Job Ads for Stack Overflow on January 27, 2015. We worked on Providence for about a year.

You read that right; we worked on Providence for about a year. I’m telling you that this system…

…took a year to build. It’s kind of impressive, actually. The only new things on that diagram are the Daily Process, the Read API and a couple of redis instances. This certainly isn’t a big system by any means. We only sling around a few GB a day.

So…What Happened?

We failed slowly while solving the wrong problems trying to build a system that was too big on speculatively chosen technologies.

Too Big

When we started building Providence, we started with the idea that we were gonna keep data about persons for the most recent 365 days. Here’s a graph I made today:

That’s showing the distribution of people we saw on February 11, 2015 by the last time we saw them. As you go further back in time the percentage of people who were last seen on that day drops off precipitously (that’s a log scale). Keeping a year of data around makes absolutely no sense if you have that graph. The potential benefit (a hit on an old person) is far outweighed by the operational cost of a system that could keep that much data around. We didn’t have that graph when we sized the system; we just picked a number out of the sky. Our choice of a year was an order of magnitude too high.

Speculative Technology Decisions

The problem with keeping around a year of data is that…that’s a lot of data. We decided that SQL Server, the data store we had, was inadequate based on the fact that we were going to have to do key-value lookups on hundreds of millions of keys. We never really questioned that assertion or anything; it just seemed to go without saying.

This led us to a speculative technology decision. We “needed” a new datastore, so the task was to evaluate the ones that seemed promising. I spent about 2 weeks learning Cassandra, did some tiny proofs of concept, and without evaluating other options (aside from reading feature lists) decided that it was the way to go.

The terribleness of this decision would be borne out over the ensuing months, but far more terrible was this way of making decisions. Based on my most cursory of investigations of one technology I, with what can only be described as froth, was able to convince Stack Exchange Engineering and Site Reliability we needed to build a cluster of machines to run a distributed data store that I had all of two weeks of experience with. And we did it.

While I certainly appreciate the confidence, this was madness.

Another decision we made, apropos of nothing, was that we needed to update predictions for people every 15 minutes. This led us to decide on a Windows Service architecture, which wasn’t really our forte. In addition to that, we also were pretty frothy about using C#’s async/await as TheWay™ to do things, and we had some experience there but not a bunch.

Solving the wrong problems

Early on we spent a good deal of time on the offline machine learning aspect of Providence. This was one of the few things we got right. Even if we had all our other problems nailed down, Providence still wouldn’t be anything if the underlying models didn’t work. We knew the models had to work and that they were the hardest part of the problem at the beginning, so that’s what we worked on.

The Developer Kinds model was finished with offline testing in June. The next step was to get it tested in the real world. That next step didn’t happen for four months. The linchpin of the system sat on the shelf untested for four months. Why?

Some of the fail of speculative technology decisions is that you don’t realize exactly how much you’re giving up by striking out in a new direction. We put years of work into our regular web application architecture and it shows. Migrations, builds, deployment, client libraries, all that incidental complexity is handled just the way we like. Multiple SREs and devs know how most systems work and can troubleshoot independently. I wouldn’t go so far as to say it’s finely honed, but it’s definitely honed.

It was folly to think that things would go so smoothly with new-and-shiny-everything. The Windows Service + async/await + new data store equation always added up to more incidental work. We needed to make deploying Windows Services work. Cassandra needed migrations. Our regular perf tools like MiniProfiler don’t work right with async/await, so we needed to learn about flowing execution context. If Cassandra performance wasn’t what we needed it to be we were in a whole new world of pain, stuck debugging a distributed data store we had little experience with.

And then it happened.

The datacenter screwed up and power to our systems was cut unceremoniously. When we came back up, Cassandra was in an odd state. We tried repairs and anything else we thought would fix it, but ultimately got nowhere. After a few days we found a bug report that exhibited similar behavior. It’d been filed a few weeks earlier, but there was no repro case. The ultimate cause wasn’t found until a month after the original bug report was filed.

This was the nail in the coffin for Cassandra. The fact that we got bitten by a bug in someone else’s software wasn’t hard to swallow. Bugs happen. It was the fact that, had we been in production, we’d have eaten a mystery outage for a month before someone was able to come up with an answer. It just proved how not ready we were with it, and it made us uncomfortable. So we moved on.

Speculative Technology Decisions Redux

So what do you do when a bad technology choice bites you in the ass after months of work? You decide to use an existing technology in a completely unproven way and see if that works any better, of course.

We still “knew” our main tools for storing data weren’t going to work. But we’d also just been bitten and we wanted to use something closer to us, not as crazy, something we had more experience with. So we chose elasticsearch.

Solving The Wrong Problems Redux

And lo, we repeated all the same wrong problem solving with elasticsearch. There was a smidge less work because elasticsearch was already part of our infrastructure. Our operational issues ended up being just as bad, though. We were auditing the system to figure out why our data wasn’t as we expected it and rediscovered a more-than-a-year-old bug: HTTP pipelining didn’t work. We turned pipelining off, and while elasticsearch acted correctly we saw a marked performance drop. We tried to optimize for another few weeks but ultimately threw in the towel.

Failing slowly

Bad planning, bad tech decisions, and a propensity for sinking weeks on things only incidental to the problem all add up to excruciatingly…slow…failure. Failing is only kind of bad. Failing slowly, particularly as slowly as we were failing, and having absolutely nothing to show for it is so much worse. We spent months on datastores without thinking to ourselves “Wow, this hurts a lot. Maybe we shouldn’t do this.” We spent time rewriting client libraries and validating our implementations. We held off throwing in the towel until way too late, way too often.

At this point, now mid September 2014, we finally realized we needed to get failing fast on these models in the real world or we weren’t going to have anything to ship in January. We gave up on everything but the models, and focused on how we could get them tested as quickly as possible. We dropped back to the simplest system we could think of that would get us going: Windows task scheduler invoked console applications that write data to files on disk that are loaded into redis. We gave up on the ill-chosen “a year of data” and “updates every 15 minutes” constraints, and backed off to 6 weeks of data updated daily.

Within two weeks we had real-world tests going and we got incredibly lucky. Results of the model tests were nearly universally positive.

So what’s different now?

Engineering made some changes in the way we do things to try to keep Providence’s lost year from happening again. Overall we’re much more prickly about technology choices now. Near the end of last year we started doing Requests For Comment (RFCs) to help our decision making process.

RFCs start with a problem statement, a list of people (or placeholders for people) representative of teams that are interested in the problem, and a proposed solution. RFCs publicize the problem within the company, help you to gauge interest, and solicit feedback. They are public artifacts (we just use Google Docs) that help surface the “how” of decisions, not just the “what”. It’s still early going, but I like them a lot.

Kevin and I have essentially become allergic to big projects. We attempt to practice “What can get done by Friday?” driven development. Keeping things small precludes a whole class of errors like “We need a new datastore”, ‘cause that ain’t gettin’ done by Friday. It’s hard to sink a week on something you weren’t supposed to be working on when all you have is a week.

Providence: Architecture and Performance

This post is part of a series on the Providence project at Stack Exchange. The first post can be found here.

We’ve talked about how we’re trying to understand our users better at Stack Exchange and seen just how big an impact it’s had on our pilot project, the Careers Job Ads. Let’s take a look at the architecture of the system.

Hardware

This is the easy part, so let’s just get it out of the way. Providence has 2 workhorse servers (where we crunch data and make predictions), and 3 redis servers, configured like this:

Workhorse Servers

  • Dell R620
  • Dual Intel E5-2697v2 (12 Physical Cores @2.7GHz, 30MB Cache, 3/3/3/3/3/3/3/4/5/6/7/8 Turbo)
  • 384GB RAM (24x 16GB @ 1600MHz)
  • 2x 2.5” 500GB 7200rpm (RAID 1 - OS)
  • 2x 2.5” 512GB Samsung 840 Pro (RAID 1 - Data)

Redis Servers

  • Dell R720XD
  • Dual Intel E5-2650v2 (8 Physical Cores @2.6GHz, 10MB Cache, 5/5/5/5/5/6/7/8 Turbo)
  • 384GB RAM (24x 16GB @ 1600MHz)
  • 2x 2.5” 500GB 7200rpm (RAID 1 - OS, in Flex Bays)
  • 4x 2.5” 512GB Samsung 840 Pro (RAID 10 - Data)

If you’re a hardware aficionado you might notice that’s a weird configuration for a redis box. More on that (and the rest of our failures) in my next post.

Software

Providence has 3 main parts: The Web Apps/HAProxy, The Daily Process, and the Read API.

The Web Apps and HAProxy

Identity

Providence’s architecture starts with the problem of identity. In order to make predictions about people we have to know who’s who. We handle identity in the Stack Exchange web application layer. When a person comes to Stack Exchange they’re either logged in to an account or not. If they’re logged in we use their Stack Exchange AccountId to identify them. If they’re not logged in, we set their “Providence Cookie” to a random GUID with a far future expiration and we use that to identify the person. We don’t do panopticlicky evercookies or anything, because ew. We add the person’s identifier to a header in every response the web application generates.
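Roughly, the web-layer side of that looks like the sketch below. This is illustrative only: the cookie name is the one mentioned above, but the header name, the “a-”/“c-” identifier prefixes, and the ten-year expiration are assumptions, and the real code obviously lives inside our web application rather than a standalone helper.

using System;
using System.Web;

public static class ProvidenceIdentity
{
    public static string GetOrCreateIdentifier(HttpContextBase context, int? accountId)
    {
        if (accountId.HasValue)
            return "a-" + accountId.Value;               // logged in: identify by AccountId

        var cookie = context.Request.Cookies["ProvidenceCookie"];
        if (cookie == null || string.IsNullOrEmpty(cookie.Value))
        {
            cookie = new HttpCookie("ProvidenceCookie", Guid.NewGuid().ToString("N"))
            {
                Expires = DateTime.UtcNow.AddYears(10)   // "far future" expiration
            };
            context.Response.Cookies.Add(cookie);
        }
        return "c-" + cookie.Value;                      // anonymous: identify by the GUID
    }

    public static void StampResponse(HttpContextBase context, string identifier)
    {
        // HAProxy strips this header off before the response reaches the user.
        context.Response.AppendHeader("X-Providence-Identity", identifier);
    }
}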

HAProxy

HAProxy sits in front of all our web apps and strips the identity header off of the response before we return it to the user. HAProxy also produces a log entry for every request/response pair that includes things like the date, request URI and query string, and various timing information. Because we added the person’s identity to the response headers in the web application layer, the log also includes our identifier for the person (the AccountId or the Providence Cookie).

We take all these log entries and put them into our HAProxyLogs database, which has a table for each day. So for each day we have a record of who came to the sites and what URIs they hit. This is the data that we crunch in the Daily Process.

The Daily Process

At the end of each broadcast day we update Providence’s predictions. The Daily Process is a console application written in C# that gets run by the Windows Task Scheduler at 12:05AM UTC every morning. There are 6 parts to the Daily Process.

Identity Maintenance

Whenever a person logs in to Stack Exchange they will have 2 identifiers in the HAProxyLogs that belong to them: the ProvidenceCookie GUID (from when they were logged out) and their AccountId (from after they logged in). We need to make sure that we know these two identifiers map to the same person. To do that we have the web app output both identifiers to the HAProxyLogs whenever someone successfully logs in. Then we run over the logs, pull out those rows, and produce the Identifier Merge Map. This tells us which Providence Cookie GUIDs go with which AccountIds. We store the Identifier Merge Map in protocol buffers format on disk.

Identity maintenance takes about 20 minutes of the Daily Process. The merge map takes up about 4GB on disk.
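Here’s a rough sketch of what that step amounts to. The LoginRow shape, the dictionary layout, and the use of protobuf-net are my assumptions; the post only says the map is stored in protocol buffers format on disk.

using System;
using System.Collections.Generic;
using System.IO;
using ProtoBuf;

// Hypothetical shape of a "someone logged in" row pulled from the HAProxyLogs.
public class LoginRow
{
    public Guid ProvidenceCookie { get; set; }
    public int AccountId { get; set; }
}

[ProtoContract]
public class IdentifierMergeMap
{
    // Providence Cookie GUID (as a string) -> AccountId
    [ProtoMember(1)]
    public Dictionary<string, int> CookieToAccount { get; set; } = new Dictionary<string, int>();
}

public static class IdentityMaintenance
{
    public static void BuildAndSave(IEnumerable<LoginRow> logins, string path)
    {
        var map = new IdentifierMergeMap();
        foreach (var row in logins)
            map.CookieToAccount[row.ProvidenceCookie.ToString("N")] = row.AccountId;

        using (var stream = File.Create(path))
            Serializer.Serialize(stream, map);
    }
}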

Tag Views Maintenance

The second step of the Daily Process is to maintain Tag View data for each person. A person’s Tag View data is, for each tag, the number of times that person has viewed a question with that tag. The first thing we do is load up the Identifier Merge Map and merge all the existing Tag View data to account for people who logged in. Then we run over the day’s HAProxyLogs table pulling out requests to our Questions/Show route. We regex the question id out of the request URI, look up that question’s tags, and add them to the person’s Tag View data.

We keep all Tag View data in one of our redis instances. It’s stored in hashes keyed by a person’s identifier. The field names of the hashes are integers we assign to each tag; the values are the number of times the person has viewed a question with the associated tag.

For each person we also maintain when their Tag View data was last updated, once in each person’s TagViews hash, and once again in a redis sorted set called “UpdatedDates” (more on this in a moment).

Tag Views maintenance takes about 20 minutes of the Daily Process. The Tag View data set + UpdatedDates fits in about 80GB of memory, and 35GB when saved to disk in redis’ RDB format.
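Put together, the per-request bookkeeping looks something like the sketch below (using StackExchange.Redis). The key names, the route regex, and the tag-id lookup are assumptions for illustration; only the hash-of-tag-counts layout and the UpdatedDates sorted set come from the description above.

using System;
using System.Text.RegularExpressions;
using StackExchange.Redis;

public class TagViewsMaintenance
{
    // Matches /questions/{id}/... ; the real Questions/Show route pattern may differ.
    static readonly Regex QuestionsShow = new Regex(@"^/questions/(\d+)", RegexOptions.Compiled);

    readonly IDatabase _db;
    readonly Func<int, int[]> _tagIdsForQuestion;   // question id -> integer tag ids (hypothetical lookup)

    public TagViewsMaintenance(IDatabase db, Func<int, int[]> tagIdsForQuestion)
    {
        _db = db;
        _tagIdsForQuestion = tagIdsForQuestion;
    }

    public void Process(string personIdentifier, string requestUri, DateTime day)
    {
        var match = QuestionsShow.Match(requestUri);
        if (!match.Success) return;

        var questionId = int.Parse(match.Groups[1].Value);
        var key = (RedisKey)("TagViews:" + personIdentifier);    // one hash per person (key name assumed)

        foreach (var tagId in _tagIdsForQuestion(questionId))
            _db.HashIncrement(key, tagId);                       // field = tag id, value = view count

        // Record when this person was last touched, both on the hash and in the
        // UpdatedDates sorted set, so later steps don't have to walk the whole keyspace.
        _db.HashSet(key, "LastUpdated", day.ToString("yyyy-MM-dd"));
        _db.SortedSetAdd("UpdatedDates", personIdentifier, day.Date.ToOADate());
    }
}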

Making Predictions

The third step of the Daily Process is to update our predictions. Our models are static, and the only input is a person’s Tag View data. That means we only need to update predictions for people whose Tag View data was updated. This is where we use the UpdatedDates sorted set as it tells us exactly who was updated on a particular day. If we didn’t have it, we’d have to walk the entire redis keyspace (670MM keys at one point) to find the data to update. This way we only get the people we need to do work on.

We walk over these people, pull out their Tag View data and run it through each of our predictors: Major, Web, Mobile, and Other Developer Kinds, and Tech Stacks. We dump the predictions we’ve made onto disk in dated folders in protocol buffers format. We keep them on disk for safety. If we hose the data in redis for some reason and we don’t notice for a few days we can rebuild redis as it should be from the files on disk.

Making predictions takes 5 or 6 hours of the Daily Process. The predictions take up about 1.5GB on disk per day.
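In sketch form, this step is “read UpdatedDates for the day, pull each person’s hash, predict, write a dated file.” The predictor callback and the one-file-per-person layout below are simplifications of my own; the real output layout isn’t described beyond “dated folders in protocol buffers format.”

using System;
using System.IO;
using StackExchange.Redis;

public class MakePredictions
{
    readonly IDatabase _tagViews;

    public MakePredictions(IDatabase tagViews) { _tagViews = tagViews; }

    public void Run(DateTime day, Func<HashEntry[], byte[]> predictAndSerialize, string outputRoot)
    {
        // Only the people whose Tag View data changed today, courtesy of UpdatedDates.
        var updatedToday = _tagViews.SortedSetRangeByScore(
            "UpdatedDates", day.Date.ToOADate(), day.Date.AddDays(1).ToOADate());

        var folder = Path.Combine(outputRoot, day.ToString("yyyy-MM-dd"));
        Directory.CreateDirectory(folder);

        foreach (var person in updatedToday)
        {
            var tagViews = _tagViews.HashGetAll("TagViews:" + person);   // key name assumed, as above
            var serialized = predictAndSerialize(tagViews);              // run the predictors, protobuf the result
            File.WriteAllBytes(Path.Combine(folder, person + ".pb"), serialized);
        }
    }
}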

Loading Predictions Into Redis

The fourth step of the Daily Process is to load all the predictions we just made into redis. We deserialize the predictions from disk and load them into our other redis server. It’s a similar setup to the Tag View data. There’s a hash for each person keyed by identifier, the fields of the hash are integers that map to particular features (1 = MajorDevKind.Web, 2 = MajorDevKind.Mobile, etc), and the values are the predictions (just doubles) for each feature.

Loading predictions takes about 10 minutes of the Daily Process. The predictions dataset takes up 85GB in memory and 73GB on disk in redis’ RDB format.
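The layout is easy to picture as code. A minimal sketch, with the key name assumed and the feature-id mapping taken from the example above:

using System.Collections.Generic;
using StackExchange.Redis;

public static class LoadPredictions
{
    // Feature ids follow the example above: 1 = MajorDevKind.Web, 2 = MajorDevKind.Mobile, ...
    public static void Load(IDatabase predictionsDb, string personIdentifier,
                            IReadOnlyDictionary<int, double> predictionsByFeatureId)
    {
        var entries = new List<HashEntry>(predictionsByFeatureId.Count);
        foreach (var pair in predictionsByFeatureId)
            entries.Add(new HashEntry(pair.Key, pair.Value));    // field = feature id, value = prediction

        predictionsDb.HashSet("Predictions:" + personIdentifier, entries.ToArray());   // key name assumed
    }
}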

Six Week Cull

Providence’s SLA is split by whether you’re anonymous or not. We keep and maintain predictions for any anonymous person we’ve seen in the last 6 weeks (the vast, vast majority of people). We keep and maintain predictions for logged-in people (the vast, vast minority of people) indefinitely as of 9/1/2014. That means once an anonymous person hasn’t been updated in 6 weeks we’re free to delete them, so we do. We use the UpdatedDates set again here to keep us from walking hundreds of millions of keys to evict the 1% or so that expire.

The Six Week Cull runs in about 10 minutes.
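A sketch of the cull, again leaning on UpdatedDates. How anonymous people are actually distinguished from logged-in ones isn’t described here, so the prefix check is an assumption carried over from the earlier sketches, and both data sets are shown against one IDatabase for brevity even though they live on separate redis instances.

using System;
using StackExchange.Redis;

public static class SixWeekCull
{
    public static void Run(IDatabase db, DateTime today)
    {
        var cutoff = today.Date.AddDays(-42).ToOADate();

        // Everyone whose last update is older than six weeks.
        foreach (var person in db.SortedSetRangeByScore("UpdatedDates", double.NegativeInfinity, cutoff))
        {
            var id = (string)person;
            if (!id.StartsWith("c-")) continue;                  // assumed marker for anonymous (cookie) ids

            db.KeyDelete((RedisKey)("TagViews:" + id));          // key names assumed, as above
            db.KeyDelete((RedisKey)("Predictions:" + id));
            db.SortedSetRemove("UpdatedDates", person);
        }
    }
}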

Backup/Restore

The last part of the Daily Process is getting all the data out to Oregon in case we have an outage in New York. We backup our two redis instances, rsync the backups to Oregon, and restore them to two Oregon redis instances.

The time it takes to do Backup/Restore to Oregon varies based on how well our connection to Oregon is doing and how much it’s being used by other stuff. We notify ourselves if it hasn’t completed after 5 hours, just to make sure someone looks at it, and we’ve tripped that notification a few times. Most nights it takes a couple hours.

The Read API

Step four in the Daily Process is to load all our shiny new predictions into redis. We put them in redis so that they can be queried by the Read API. The Read API is an ASP.NET MVC app that turns the predictions we have in redis hashes into JSON to be consumed by other projects at Stack Exchange. It runs on our existing service box infrastructure.

The ad server project calls into the Read API to get Providence data to make better decisions about which jobs to serve an incoming person. That means the Read API needs to get the data out as quickly as possible so the ad renders as quickly as possible. If it takes 50ms to pull data from Providence, we’ve just eaten the entire ad server’s response time budget. To make this fast, the Read API takes advantage of the serious performance work that’s gone into our redis library StackExchange.Redis, and Kevin’s JSON serializer Jil.
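A stripped-down version of such an endpoint might look like this. The route, key name, connection handling, and feature-name mapping are all assumptions for illustration; the real Read API runs on our existing service box infrastructure.

using System.Collections.Generic;
using System.Web.Mvc;
using Jil;
using StackExchange.Redis;

public class PredictionsController : Controller
{
    static readonly ConnectionMultiplexer Redis = ConnectionMultiplexer.Connect("localhost");   // connection string assumed

    public ContentResult Get(string identifier)
    {
        var db = Redis.GetDatabase();
        var entries = db.HashGetAll("Predictions:" + identifier);   // key name assumed

        var result = new Dictionary<string, double>();
        foreach (var entry in entries)
            result[FeatureName((int)entry.Name)] = (double)entry.Value;

        // Jil keeps serialization overhead low, which matters for the ad server's latency budget.
        return Content(JSON.Serialize(result), "application/json");
    }

    static string FeatureName(int id)
    {
        // 1 = MajorDevKind.Web, 2 = MajorDevKind.Mobile, ... (mapping from the example above)
        switch (id)
        {
            case 1: return "MajorDevKind.Web";
            case 2: return "MajorDevKind.Mobile";
            default: return "Feature" + id;
        }
    }
}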

Performance

The Workhorses

Excluding backups, the Daily Process takes about 7 hours a day, the bottleneck of which is cranking out predictions. One workhorse box works and then sits idle for the rest of the day. The second workhorse box sits completely idle, so we have plenty of room to grow CPU-hours wise.

Redis

With redis we care about watching CPU load and memory. We can see there are two humps in CPU load each day. The first hump is the merging of Tag View data; the second is when redis is saving backups to disk. Other than that, the redii sit essentially idle, only being bothered to do reads for the Read API. We aren’t even using the third redis box right now. After we implemented the Six Week Cull, the Tag Views and Predictions data sets fit comfortably in RAM on one box, so we may combine the two we’re using down to just one.

The Read API

The big thing to watch in the Read API is latency. The Read API has had a pretty decent 95th-percentile response time since launch, but we did have some perf stuff to work out. The Read API is inextricably tied to how well the redii are doing: if we’re cranking the TagViews redis core too hard, the Read API will start timing out because it can’t get the data it needs.

At peak the Read API serves about 200 requests/s.

Failure is Always an Option

That’s it for the architecture and performance of Providence. Up next we’ll talk about our failures in designing and building Providence.

Providence: Testing and Results

This post is part of a series on the Providence project at Stack Exchange. The first post can be found here.

The Providence project was motivated by our desire to better understand our users at Stack Exchange. So we think we’ve figured out what kind of developers come to our sites, and what technologies they’re using. Then we figured out a way to combine all our features into the Value Function. How did we test these features?

After each feature worked in the clean room (it passed muster on the training, validation, and test sets we’d labeled) we had to drop it into some real-world mud to try and move a needle. Please attempt to contain your surprise as I tell you we chose to use ads as our mud puddle, and click-through rate as our needle.

The particular ads we wanted to use were the Careers Job Ads that run in the sidebar of Stack Overflow. Last we left Careers Job Ads, we knew that geography was by far the most important feature we could use to select jobs to show you in the ads. We’ve done some visual updates since then, but the job selection algorithm has remained unchanged.

We had a number of features in the pipeline at this point that were untested in real-world conditions. Major Dev Kinds, Minor Dev Kinds, Tech Stacks and Job Tags were the individual features we had to test. We did those first and then the final test was for the Value Function that combined each of those individual features and geography. Each test had these phases:

Labeling Party: The Jobening

We had a model for users, but we weren’t going to model jobs at the same time because then we wouldn’t know whether we were testing the user model or the job model. We also needed the cleanest labels on jobs that we could get which meant good ol’ fashioned home grown fresh squeezed human labels. Gotta give a big shout out to the developers at Stack Exchange for giving us their time to label each data set.

Pre-analysis and Experiment Design

This is a thing we’re getting more and more into here. We’ve wasted a lot of cycles on tests and ideas that would’ve been avoidable if we’d just looked at the data ahead of time. We needed to know approximately how many impressions we were going to get in the test groups so we would know how long we’d have to wait for the A/B results to converge. If it was going to take months for the results to converge for one of these experiments, we’d be better off just doing a different experiment. We had to know when we were going to have to give up controlling for important factors like geography in order to get enough data points in a bucket.
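For the curious, the arithmetic behind “how long until convergence” is the usual two-proportion sample-size estimate. The numbers in the comments below are made up; this is just the shape of the pre-analysis, not the actual calculation we ran.

using System;

public static class ExperimentSizing
{
    // Rough impressions needed per bucket to detect a lift from baseline CTR p1 to p2
    // at 5% significance with 80% power (the z values are hard-coded for those choices).
    public static long ImpressionsPerGroup(double p1, double p2)
    {
        const double zAlpha = 1.96, zBeta = 0.84;
        var pBar = (p1 + p2) / 2;
        var numerator = Math.Pow(
            zAlpha * Math.Sqrt(2 * pBar * (1 - pBar)) +
            zBeta * Math.Sqrt(p1 * (1 - p1) + p2 * (1 - p2)), 2);
        return (long)Math.Ceiling(numerator / Math.Pow(p2 - p1, 2));
    }
}

// Example (made-up numbers): a 0.5% baseline CTR and a hoped-for 15% relative lift.
// Divide the per-bucket result by the bucket's daily impressions to get days to wait.
// var impressions = ExperimentSizing.ImpressionsPerGroup(0.005, 0.00575);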

Turning it up

Once we knew how long we’d have to wait, we did some easing in. We’d run the test at 1% for a day to work out any bugs, monitor the early impression distributions and click-through rates to make sure they weren’t getting completely hosed and that we were also measuring correctly. Then we’d slowly ramp it up during the second day to 100% of usable impressions.

Wait

Post Analysis

At this point we’d have click-through rates that had supposedly converged, but it wasn’t enough that CTR just went up. We’d also have to make sure that gains on one subgroup weren’t at an unreasonable expense to the other groups. We called these checks “Degeneracy Passes”.

By the end, we were able to get through about a test per week. By the luck o’ the data scientist we even got some decent results.

Results

The Value Function performs pretty admirably in comparison to the geography based targeting we were using before. We saw ~27% lift.

Looking at CTR improvement by geography, we found that jobs closer to the user improved less, confirming our previous experiments that geography is pretty important. Inside 60 miles we saw ~17% lift picking jobs using Value Function. Outside 60 miles we saw even larger gains.

Looking at jobs by some popular dev kinds we saw the following gains:

It wasn’t all kitten whispers and tickle fights. Desktop OSX jobs lost about 6% of their clicks, but we only had 9 of those jobs in the test. Similarly, SharePoint Integration jobs lost 100% of their clicks in the test, but there were only 3 of those jobs. I guess we’ll just have to live with those terrible blemishes for now.

Looking at jobs with a smattering of popular tags we saw these gains:

Node really stands out there (I think we just got lucky), but we were pretty happy with all those gains. We didn’t find any losers among tags used on more than 50 jobs.

Next up: The Architecture of Providence

A Wild Anomaly Appears! Part 2: The Anomaling

After all the rave reviews of my last post I knew you were just on the edge of your seat waiting to hear more about my little unsupervised text anomaly detector.

So, we’ve got some working ML! Our job’s done right?

‘Round these parts, it ain’t shipped ‘til it’s fast and it’s got its own chat bot. We spend all day in chat and there’s a cast of characters we’ve come to know, and love, and hate.

Pinbot

Pinbot helps us not take each other’s database migrations. You can say “taking ###” and Pinbot removes the last taken migration from this room’s pins and replaces it with this one. So we always know what migration we’re up to, even if someone hasn’t pushed it yet. Also pictured: Me calling myself a dumbass in the git logs. Which gets broadcast by our TeamCity bot. Someone starred it.

Hair on Fire

Hair on fire bot helps Careers keep making money. Hair on fire pops up every now and again to tell us a seriously critical error that affects the bottom line has happened on the site. If someone is buying something or on their way to buying something and gets an exception Hair On Fire dumps the details directly to chat so someone can get to work immediately.

LogoBot

Here’s another little Careers helper. We have a policy that any advertising images that go up on Stack Overflow must be reviewed by a real live person. When display advertising is purchased from our ad sales/ad operations teams, they handle that review. When we introduced Company Page Ads we automated the process of getting images on Stack Overflow and no one had to talk to a sales person. This begat LogoBot, who reminds us to do our job and make sure no one’s putting up animated gifs or other such tawdriness.

Malfunctioning Eddie

Malfunctioning Eddie’s…got problems.

Anomaly Bot

Which brings me to the Anomaly bot. We need to make sure that all these anomalous jobs I’m detecting get in front of the sales support team. They are the human layer of detectors I alluded to in my last post who used to have to check over every single job posted to the board.

There it is. Anomaly bot. Where does it lead us puny humans?

Welcome to the super secret admin area of Careers. At the top of the page we have a list of the jobs that were posted today. There are 3 columns: the anomaly score (which is based solely on title), the job title, and a few buttons to mark the job anomalous or not. The second section of the page is for all jobs currently on the board.

I’m hoping the heatmap pops out at you. It runs from Red (pinned to the all-time most anomalous job) to Green (pinned to the all-time most middle-of-the-road job ever). The jobs posted today are light orange at worst, so that’s pretty good! On the “all jobs” list there’s a bunch of red that we need to review.

Just to give a reference, here was the first version sans heatmap.

So much more information in that tiny extra bit of color. If you want to make simple heatmaps it’s really easy to throw together some javascript that uses the power of HSL.
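The original did this in a few lines of javascript; here’s the same idea sketched in C# to stay consistent with the other code in these posts. The normalization against the all-time min and max scores is an assumption based on the “pinned to the all-time most anomalous/middle-of-the-road job” description above.

using System;

public static class Heatmap
{
    // 120 = green (most middle-of-the-road), 0 = red (most anomalous).
    public static string Color(double score, double minScore, double maxScore)
    {
        var t = (score - minScore) / (maxScore - minScore);   // normalize to 0..1, 1 = most anomalous
        var hue = (int)Math.Round(120 * (1 - t));             // interpolate hue from green down to red
        return $"hsl({hue}, 100%, 50%)";                      // drop straight into a style attribute
    }
}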

What’s Next?

We’re gonna let this marinate for a while to actually test my hypothesis that we only have to look at the top 10% of jobs by anomaly score. The sales support team’s gonna clear the backlog in the “all jobs” section of the report, then use the tool for a little while, and then we’ll have the data we need to actually set the threshold. Once we do that, the Anomaly bot can get a little smarter. Right now Anomaly bot just shows up every three hours with that same dumb message. Maybe it’ll only show up when there are jobs above our human-trained threshold (modulo a safety factor). Maybe we’ll change it to pop up as soon as an anomalous job gets posted on the board.

Here, have some code

If you want to use the very code we’re using right now to score the job titles, it’s up on NuGet, and the source is on GitHub.

Got experience solving problems like this one? Wanna work at a place like Stack Exchange? Head on over to our completely middle-of-the-road job listing and get in touch with us.

A Wild Anomaly Appears!

So, I’m working on the new Data Team at Stack Exchange now. Truth is we have no idea what we’re doing (WANNA JOIN US?). But every now and then we come across something that works a little too well and wonder why we haven’t heard about it before.

We run a niche job board for programmers that has about 2900 jobs on it this morning. Quality has been pretty easy to maintain. We have a great spam filter called “The $350 Price Tag”. Then we have some humans that look over the jobs that get posted looking for problems. Overall the system works well, but at 2900 jobs a month that means someone has to look through about 150 jobs every working day. They’re looking for a needle in a haystack as most (>95%) of the jobs posted are perfectly appropriate for the board, so there’s a lot of “wasted” time spent looking for ghosts that aren’t there. And it’s pretty boring to boot. I’m sure that person would rather do other things with their time.

It’d be nice if we had an automated way of dealing with this. We have no idea what we’re doing, so we frequently just reach into our decidedly meager bag of tricks, pull one out, and try it on a problem. I’d done that a few times before on this problem, trying Naive Bayes or Regularized Logistic Regression, but had gotten nowhere. There are a lot of different ways a job can be appropriate for the board and there are a lot of different ways a job could be not appropriate for the board which makes coming up with a representative training set difficult.

Last week while taking another whack at the problem I Googled “Text Anomaly” and came across David Guthrie’s 186-page Ph.D. thesis, Unsupervised Detection of Anomalous Text. There’s a lot there, but the novel idea was simple enough (and it worked in his experiments and mine) that I’m surprised I haven’t heard about it until now.

Distance to the Textual Complement

Say you have a bunch of documents. You pull one out and want to determine how anomalous it is with respect to all the others. Here’s what you do:

  1. Choose some features to characterize the document in question.
  2. Convert the document to its feature representation.
  3. Treat all the other documents as one giant document and convert that to its feature representation.
  4. Calculate the distance between the two.

Do this for every document and sort the results descending by the distance calculated in step 4. The documents at the top of the list are the “most anomalous”.

That’s it. Pretty simple to understand and implement. There are two choices to make: which features, and which distance metric to use.

Obscurity of Vocabulary, I choose you!

In any machine learning problem you have to come up with features to characterize the set of things you’re trying to work on. This thesis is chock full of features, 166 of them broken up into a few different categories. This part of the paper was a goldmine for me (remember, I have no idea what I’m doing). The text features I knew about before this were word counts, frequencies, tf-idf and maybe getting a little into part of speech tags. The kinds of features he talks about are stuff I never would’ve come up with on my own. If you’re doing similar work and are similarly lost, take a look there for some good examples of feature engineering.

The set of features that stood out to me the most were the ones in a section called “Obscurity of Vocabulary Usage”. The idea was to look at a giant reference corpus and rank the words in the corpus descending by frequency. Then you make lists of the top 1K, 5K, 10K, 50K, etc. words. Then you characterize a document by calculating the percentages of the document’s words that fall into each bucket.

Manhattan Distance, I choose you!

Guthrie pits a bunch of distance metrics against each other, and for the Distance to the Textual Complement method the Manhattan distance got the blue ribbon, so I used that.

Setup

When I’d looked through the jobs before, I could pretty much tell by their titles whether they were busted or not, so my documents were just the job titles. There isn’t really a good reference corpus from which to build the Top-N word lists, so I just used the job titles themselves. I tried a couple of different sets of Ns but ended up on 100, 300, 1000, 3000, and 10000 (really ~7,000, as that’s the number of unique terms in all job titles).
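Putting the feature choice and the distance metric together, the whole scorer is only a screenful of code. Here’s a minimal sketch: whitespace-ish tokenization, disjoint rank buckets instead of the thesis’s cumulative top-N lists, and the job titles doing double duty as the reference corpus, as described above. The feature set in Guthrie’s thesis is much richer than this.

using System;
using System.Collections.Generic;
using System.Linq;

public static class TitleAnomalyScorer
{
    static readonly int[] Buckets = { 100, 300, 1000, 3000, 10000 };

    // Score each title by the Manhattan distance between its vocabulary-obscurity
    // features and the features of all the other titles treated as one big document.
    public static IEnumerable<(string Title, double Score)> Score(IReadOnlyList<string> titles)
    {
        var tokenized = titles.Select(Tokenize).ToList();

        // Rank words by frequency across the whole corpus (here, the titles themselves).
        var rank = tokenized.SelectMany(t => t)
                            .GroupBy(w => w)
                            .OrderByDescending(g => g.Count())
                            .Select((g, i) => new { g.Key, Rank = i + 1 })
                            .ToDictionary(x => x.Key, x => x.Rank);

        var totals = new int[Buckets.Length];            // bucket counts for the whole corpus
        var perDoc = new int[titles.Count][];
        for (var d = 0; d < titles.Count; d++)
        {
            perDoc[d] = new int[Buckets.Length];
            foreach (var word in tokenized[d])
            {
                var b = Array.FindIndex(Buckets, size => rank[word] <= size);
                if (b < 0) continue;
                perDoc[d][b]++;
                totals[b]++;
            }
        }

        var totalTokens = tokenized.Sum(t => t.Count);
        for (var d = 0; d < titles.Count; d++)
        {
            var docTokens = tokenized[d].Count;
            var complementTokens = totalTokens - docTokens;
            double distance = 0;
            for (var b = 0; b < Buckets.Length; b++)
            {
                var docPct = docTokens == 0 ? 0 : (double)perDoc[d][b] / docTokens;
                var complementPct = complementTokens == 0 ? 0 : (double)(totals[b] - perDoc[d][b]) / complementTokens;
                distance += Math.Abs(docPct - complementPct);   // Manhattan distance, feature by feature
            }
            yield return (titles[d], distance);
        }
    }

    static List<string> Tokenize(string title) =>
        title.ToLowerInvariant()
             .Split(new[] { ' ', ',', '/', '+', '.' }, StringSplitOptions.RemoveEmptyEntries)
             .ToList();
}

Sort the output descending by score and the weird titles bubble to the top.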

Results

Here’s the sort of all the jobs that were on the board yesterday.

Basically everything on the right is good and has low scores (distances).

Most of the jobs on the left have something anomalous in their titles. Let’s group up the anomalies by the reasons they’re broken and look over some of them.

Stuff that just ain’t right

These jobs just don’t belong on the board. We need to follow up with these people and issue refunds.

  1. Supervisor Commercial Administration Fox Networks Group
  2. Sales Executive Risk North Americas
  3. Associate Portfolio Manager Top Down Research
  4. Senior Actuarial Pre-Sales Consultant
  5. Ad Operations Coordinator
  6. NA Customer Service Representative for Sungard Energy
  7. Manager, Curation

Just Terrible Titles

These jobs would belong, but the titles chosen were pretty bad. Misspellings, too many abbreviations, etc. We need to follow up with our customers about these titles to improve them.

  1. Javascript Devlopers Gibraltar
  2. Sr Sys Admin Tech Oper Aux Svcs
  3. VR/AR Developer

Duplicate Information

Anywhere you see the title for a job on Stack Overflow or Careers we also show you things like the location of the job, whether it’s remote or not, and the company’s name. These titles duplicate that information. We need to follow up with our customers to improve them.

  1. Delivery Manager, Trade Me, New Zealand
  2. Visualization Developer, Calgary
  3. Technical Expert Hyderabad
  4. Technical Expert Pune
  5. Technical Expert Chennai
  6. Technical Expert Gurgaon
  7. New York Solutions Architect
  8. Sr Fullstack Eng needed for Cargurus We reach over 10MM unique visitors monthly
  9. Sr. Tester, Sky News, West London
  10. Chief Information Officer/CIO Audible.com
  11. Machine Learning Engineer Part Time Remote Working Available

What about the false positives?

A number of false positives are produced (this is just a sample):

  1. Computer Vision Scientist
  2. Mac OSX Guru
  3. Angularjs + .NET + You
  4. Android Developer 100% Boredom Free Zone
  5. Java Developer 100% Boredom Free Zone
  6. DevOps Engineer - Winner, Hottest DC Startups!!! $10M Series A
  7. Jr. Engineer at Drizly

Some of these (Computer Vision, Mac OSX) are just infrequently found on our board. Some of these people are trying to be unique (and are successful, by this analysis) so that their listing stands out.

Guthrie goes into a bit of detail about this in a section on precision and recall in the paper. His conclusion is that this kind of anomaly detection is particularly suited to when you have a human layer of detectors as the last line of defense and want to reduce the work they have to do. An exhaustive exploration of the scores finds that all of the jobs we need to follow up on are in the top 10% when ordered descending by their anomaly scores. Setting that threshold should cut the job our humans have to do by 90%, making them happier and less bored, and improving the quality of the job board.

So You Want a Zillion Developers…

I work at Stack Overflow on Careers 2.0. In addition to our job board we have a candidate database where you can search for developers to hire. Our candidate database has 124K+ developers in it right now.

Customers frequently gawk at this number because they’ve looked at other products in the dev hiring space that offer millions of candidates in their databases. Sourcing.io claims to have “over 4 million developers” in their database. Gild offers “Over 6 Million Developers”. Entelo will give you access to “18+ million candidates indexed from 20+ social sites.”

Yeah man, your numbers stink

Hey. That hurts.

Let’s put those numbers in perspective. The vast majority of the developers “in” these other databases don’t even know they exist. The devs never signed up to be listed or even indicated that they were looking for work. There isn’t even a way to opt out. These databases are built by scraping APIs and data dumps from sites developers actually care about like Stack Overflow and GitHub.

On the other hand the only people you’ll find in the Careers 2.0 database are ones who made the affirmative choice to be found. They start by getting an invitation to create a profile. They build out a profile with their employment and education history, open source projects, books they read, peer reviewed answers on Stack Overflow, and so on. Then they can choose to be listed as either an active candidate (they’re in the market for a job right now) or a passive candidate (they’re happy where they are but are willing to hear your offer). After a candidate gets hired they can delist themselves from the database so future employers don’t waste any time on them.

So the difference between us and them is that we give you a smaller number of candidates who are already interested in job offers and they give you a giant database filled with hope and built by skeez.

We have some data from Careers that tells us hope is not a recruiting strategy.

Our Experiment

Careers 2.0 experimented with the “index a bunch of people who don’t know they’re being indexed” model to see if it could possibly work. We created what we called “mini-profiles” which consisted exclusively of already public information available on Stack Overflow. We would add mini-profiles to the database if the Stack Overflow user provided a location in their profile and had a minimum number of answers with a minimum score. We showed these mini-profiles along with our “real” candidates in search results. If an employer wanted to contact one of the people behind a mini-profile Careers 2.0 would send them an e-mail asking if they want to open up a conversation with the employer. If the candidate wanted to continue they could authorize us to share their contact information with the employer and they’d start working on their love connection.

Our Results

We track response rates to employer messages to look out for bad actors and generally make sure the messaging system is healthy. A candidate can respond to a message interested/not interested or they can fail to respond at all. Response rate is defined as Messages Responded To / Messages Sent. When we compared the response rates of messages to mini-profiles to the response rates of messages to “real” profiles the results were not good for mini-profiles. Messages to “real” profiles were 6.5x more likely to get a response than messages to mini-profiles. That was the last and only straw for mini-profiles. We retired the experiment earlier this year.

So what about the zillions of programmers?

All those services I named at the beginning of this post do what we did in our experiment, just a little more extensively by including devs from more places online. I have to believe that the response rates from their unqualified leads are similar to the ones we found in our experiment. I suppose technically the response rates from randodevs on GitHub or Bitbucket could be higher than that of randodevs on Stack Overflow thus invalidating our conclusion, but anecdotal evidence from our customers about those other services suggests not.

“Wait a sec there Jason,” you’re thinking, “if their databases are at least 6.5x larger than yours I’ll still get more responses to my messages right?” Absolutely! That’s called spam. You are totally allowed to go down the path of the spammer, but let me hip you to the two problems there. The first problem with your plan is that devs hate recruiting spam more than they hate PHP, and they hate PHP a lot. The word will get out that you’re wasting everyone’s time. People will write about it. The second problem is that spam is supposed to be cheap. This isn’t cheap. In this case you’ll have to spend at least 6.5x the time wading through these zillions of devs identifying the ones that meet your hiring criteria, messaging them, and waiting for responses. So not only are you wasting their time, you’re wasting yours.

We aren’t going to build this business off hope and spam and wasting people’s time. If a smaller database is the price, so be it.

Commuting: A Perverse Incentive at Stack Exchange

So, we just went through comp review season here at the Stack Exchange. This is pretty much the only time of year we talk about money, because that’s the way we want it. We pay people enough to be happy and then shut up about it. You’ll probably only ever hear stuff about comp from me around September each year because that’s the only time it’s on my mind. The system works, and I’m generally happy about my financial situation, but we have a comp policy about remote work that subjects me to a bit of a perverse incentive when it comes to commuting.

The policy is that if you work out of the New York office, you get a 10% pay increase relative to what you’d be making if you worked remote. The reason for this has always been a little cloudy to me. I’ve heard cost of living adjustment. I’ve heard we want to incentivize people to be in the office because of “accidental” innovation from pick-up meetings and conversations in the hall. Regardless of the reason, that’s the policy.

I live in Stamford, CT and have been commuting to the New York Office 3 days a week (down from 5) since my daughter Elle was born in December. My total commute time averages just under 3 hours a day (10 min from my house to the Metro North, 55 minutes to Grand Central, 20 minutes from Grand Central down to the Office). So I end up commuting about 36 hours per month (down from 60).

On the Metro North getting a seat means cramming in next to one or two other people in side-by-side seats leaving little elbow room for typing (or living, FSM forbid they’re overweight), sitting in the seats that face each other and knee-knock with people who are drawn from a population with a mean height of 7 feet, or sitting on the floor in a vestibule near the doors. Some days the Metro North crawls because apparently they didn’t design this surface rail line to deal with even the slightest amount of rain. The subway is the subway, you get what you get. This commute stinks and it’d be my default position to forgo it.

Here’s where the perversion comes in. Let’s say I make $120K a year (I’m using this number because the math works out simply) out of the New York Office and decide to go remote. Every month I’ll make $1K less and get 36 hours of my life back. So Stack Exchange thinks my commute is worth $27.78 an hour. 4x minimum wage for no productive output is nice work if you can get it.

When done right, remote work makes people extremely productive. Private office? Check. Flexible hours? Check. Short commute? Check. I’ll let you in on a secret: most of our remote developers work longer hours than our in-office devs. It’s not required, and probably won’t always be the case, but when going to work is as simple as walking upstairs (pants optional, but recommended) people just tend to put in more hours and work more productively.

Going remote means a large portion of the 36 hours a month I spend commuting would go back to productive work (I won’t lie, some of it will be spent enjoying time with my daughter) so Stack Exchange is better off. I’d be happier because I get to skip the dreadful commute and work instead so I’d be better off. But I don’t make nearly enough that I can just drop 10% of my pay and not feel it.

Fun With RNGs: Calculating π

So, calculating π is a fun pastime for people it seems. There are many ways to do it, but this one is mine. It’s 12 lines of code, it wastes a lot of electricity and it takes forever to converge.

using System;
using System.Linq;

public double EstimatePi(int numberOfTrials)
{
  var r = new Random();

  // Sample random points in the unit square; the fraction that lands inside
  // the unit circle approaches pi/4, so 4 times the average estimates pi.
  return 4 * Enumerable.Range(1, numberOfTrials)
                       .Select(o => {
                                      var x = r.NextDouble();
                                      var y = r.NextDouble();
                                      return Math.Pow(x, 2) + Math.Pow(y, 2) < 1 ? 1 : 0;
                                    })
                       .Average();
}

What’s going on here? First we initialize our random number generator. Then for 1 to the number of trials we specify in the argument we do the following:

  1. Generate two random numbers between 0 and 1. We use one for the X coordinate and one for the Y coordinate of a point.
  2. We test if the point (X,Y) is inside the unit circle by using the formula for a circle (x² + y² = r²).
  3. If the point (X,Y) is inside the circle we return a 1 otherwise a zero.

Then we take the average of all those zeros and ones and multiply it by a magic number, 4. The average estimates the ratio of the quarter circle’s area (π/4) to the unit square’s area (1), because the points we generate all fall in the upper right quadrant of the xy-plane. Multiplying by 4 recovers π.

How bad is it? Here’s some output:

    Number Of Trials       Estimate of Pi
        10                  3.6
        100                 3.24
        1000                3.156
        10000               3.1856
        100000              3.14064
        1000000             3.139544
        10000000            3.1426372
        100000000           3.14183268
        1000000000          3.141593 (Took 2:23 to complete)

Things That, Were I to Unfortunately Wake Up Tomorrow as a Recruiter, I Would Never Do

I would never send e-mails that make potential candidates for a position think I’m not effective at finding potential candidates for a position. Giving candidates that impression just makes them think I stink at everything else too.

Subject: Barbara Nelson in Search of Javascript Expertise

Do you mean the Barbara Nelson?

Hello from Barbara!

What a great salutation! Not. Save that one for your next family newsletter.

I saw your profile either on github or on stackoverflow

Really? WOW! It sounds like you did a lot of research on me and moreover you’re the kind of go-getter who keeps the relevant information she needs at her fingertips at all times.

I am looking for several strong JavaScript Object-Oriented Engineers (not “just” web developers). These three openings have been especially challenging to fill…

Well alright let me click through and see what these jobs are about. Oh…no company names? The third one is really a C++ job? And you say you’re having trouble filling these positions?

Some JavaScript opportunities I am helping to fill are at solid funded start-ups, some are at start-ups already acquired by a well-known global company with solid benefits. We can make your relocation to the beautiful Bay Area happen if there’s a good fit.

That’s good I guess…I’m not really that interested in moving to the Bay Area.

Those who are interested in a brief discussion on the phone: please send a resume or an online profile that reflects your experience, a good time to talk, and a good phone number, and we’ll schedule a quick call.

Those who sent this e-mail should learn how to address the recipient directly and singularly instead of giving the impression that this is just another useless e-mail blast from a contingency recruiter.

If you never want to hear about career opportunities from me again, just let me know; reply and say so.

By the way you almost whited that out, I’d almost think you didn’t want me to actually do that.

I love referrals.

I love how I almost don’t even get the feeling you’re trying to get me to do your job for you.

Wat?

Contrast

So let’s look at an e-mail with a similar goal.

Subject: Facebook Engineering

Do you mean the Facebook? Let’s not be unfair to poor Barbara. Her subject line is much harder to get right than this one.

I hope all is well. I had the pleasure of stumbling upon your information online and saw that you have been working on some pretty neat stuff with Stack Overflow and various companies (it wasn’t disclosed on your resume) plus you have an awesome academic background from SUNY Geneseo to complement it.

This is much better than what Barbara had to say about me. Minimally Jeremy has read my public Careers 2.0 profile and noted my current position and where I went to school. He also called out the fact that I don’t list the companies I’ve worked at before on my profile (mainly so I can write about my experiences there when I want to without anyone getting bent out of shape). This e-mail is about me. It’s not a cattle call.

I am currently helping grow our engineering team in the NYC office and would love to chat with you about what you’ve been up to and perhaps put us on your radar; if nothing else we can have a friendly conversation. Let me know what works for you and we can schedule a time at your convenience. If this isn’t the right time, I completely understand and we can stay in touch based on your schedule – no rush. I look forward to hearing from you.

Great tone. Sounds like a human. He tells me what he’s after while being accommodating and not pushy. He makes me believe that if I respond, he’s going to respond back. Jeremy could’ve broken some of this down into paragraphs to make it less WALLOFTEXT, but other than that it was a decent recruiting e-mail.

Get Your Redis On on Windows

TL;DR: Want a virtual machine running redis in however long it takes you to download 400MB + a little completely automated install time? Follow the instructions here and you’ll be on your way.

Well, it only took a year of having this blog for me to write up something even remotely technical. But here you are, and here I am…so let’s just tone down the celebration a little bit and get on with it already.

So…it’s hard running a dev environment sometimes. We at the Stack Exchange will use anything to get the job done, but on the programmer’s side we’re mainly a windows shop. One piece of software we’ve come to know and love is Redis though. We love it so much we’ve got antirez on speed dial. It’s really the greatest.

Here’s where it isn’t quite the greatest though (for us): it’s really meant to run on Linux. Some people have made mostly working windows builds in the past that were good enough for dev’ing on but had weird behavior when it came to background operations. They’re great and I appreciate the work they d(o|id), but they fall behind when redis bumps stable versions (it’s behind 1 stable version right now leaving out features like the Lua scripting engine). Microsoft went through the rigamarole of patching redis so that it will run on windows, but that patch isn’t getting merged to master…ever.

So what’s a girl to do? When I’ve been on a team of one and had this kind of problem I thought to myself, “Self! Get VMWare on here, spin up a one off VM with ubuntu and just run it there! Problem Solved!” and many internal high-fives were had. But when you’re on a team of 6 (the Careers team, plug: we’re hiring) that doesn’t really scale well. So what are my choices? Let’s go to the big board of options:

  • Just tell my teammates “Hey, spend a couple hours spinning up your own VM and hope the one you have and the one I have match up and behave exactly the same”. (HINT: No)
  • Check in a 10 gig VM into source control and push so the other members of the team can run it too? (HINT: NO. That’s an example of what we call the “I quit” check-in.)

So how do you solve this problem?

Enter the Hobo

So it turns out a bunch of other people have this problem too (WEIRD RITE?). A smart dude decided to solve it and created Vagrant. Vagrant is a super simple yet powerful way to create and manage a reproducible virtual dev environment. Check in a couple kB of config and you get a virtual machine (or a multiple machine environment) your whole team can run. Vagrant wraps around VirtualBox for its virtual machines, and it’s not just for Windows. It runs on Linux and Mac too. Let’s run it down.

Getting it Installed

Follow the startup guide here. It’s basically install VirtualBox and install Vagrant.

Creating a machine

To create the machine, the first thing we do is create the Vagrantfile. Don’t worry…it isn’t a drifter fetish. It’s just a config file that outlines how your virtual machine is set up. It’s also just a bit of ruby. Here’s the one we’re using:

File /Users/jpunyon/code/octopress/source/downloads/code/get-your-redis-on-on-windows/Vagrantfile could not be found

So first we tell Vagrant which box to use. A box is essentially a map from a key to a file. Box names can be anything you want; in this case I just have a name telling me that it’s ubuntu’s latest 64-bit release.

Next we have a url that points to a file. As it says in the comment there, this url points to a box file that will be downloaded if the box with the name in config.vm.box doesn’t exist. This is nice because it means I send the Vagrantfile, and when my teammate runs it, Vagrant will go fetch everything it needs to create the virtual machine. Brilliant. A bunch of base boxes can be found at Vagrantbox.es. They have many different guest operating systems and versions and such to use. Very cool.

Next we have some port forwarding settings. Vagrant takes care of setting up the network for you; you just have to tell it what you need. So I’m just forwarding from port 6379 on my host machine to port 6379 on the guest machine (the default port on which redis runs).

Next I customize the vm to have a gig of memory instead of whatever the base box has by default.

So that’s it for the configuration of the box. The last line runs a provisioner, which will set up the box once it’s running. There are a number of provisioners to choose from, including Puppet, Chef, and shell. This was the gotcha for me when I was doing it the first time. The docs list the provisioners in this order…

So I spent an hour or two trying to grok the chef and puppet docs and ended up getting frustrated. Those systems have a bunch of abstractions in them which probably make them great for doing sys-adminny type stuff but in my head I was screaming “AAAARRRGH. JUST LET ME RUN A FUCKING SHELL SCRIPT!”. Of course I go back to the vagrant docs after that, look an inch or two down and feel like an idiot. I do wish the bullet points there went in order of increasing complexity though.

Long story short, the provisioner just executes the specified shell script on the guest box after it boots up.

Shellack It

So once the machine boots up what do we want it to do? Well, this:

File /Users/jpunyon/code/octopress/source/downloads/code/get-your-redis-on-on-windows/init.sh could not be found

So first we make the directories where redis will live. Then we go to the top level one, download and extract the code for the version of redis we’re interested in and build it. Then we copy the resulting executables to their final home in /opt/redis/bin.

Next we copy an init.d script to where it needs to be, then we copy the redis configuration to where it lives. Add a redis user, start redis and we’re all finished.

You might be asking “How did that init.d script and redis configuration get into the vagrant directory on the guest box?”

The way you run vagrant is by going to the directory where the Vagrantfile lives and typing vagrant up. That starts the whole ball rolling. When vagrant starts up your VM, it automatically shares the directory where the Vagrantfile is with the guest box at /vagrant. It’s a magical default behavior.

So that’s pretty much it

Well, for now anyway. Vagrant can be used to set up multiple machine environments (which I might do next to test out an elasticsearch cluster for Careers). It has many more bells and whistles to keep your virtual dev environment running lean and mean. I’ve been super impressed with just how easy it is to work with (total home grown code to get my redis VM up was 31 lines, 15 of which were the shell script for installing redis) and, bonus, everyone on my team thinks I’m a hero. It’s that magical.

+1 Vagrant…+1.

Appendix

This is the init.d script I used which I cribbed from Ian Lewis.

File /Users/jpunyon/code/octopress/source/downloads/code/get-your-redis-on-on-windows/redis.init.d could not be found