qotm – Puzzled Pint

tl;dr People mostly like the star difficulty rating system, so we’re keeping it.

You might have noticed that we’ve been experimenting with adding difficulty ratings to puzzles. I brought this idea to Puzzled Pint HQ last year because some teams in Austin kept asking which puzzles were the “easy” ones. Looking further into this, it became apparent that those teams generally had mixed experience levels and wanted to give the easier puzzles to their more novice solvers to attempt first.

Our First Two Trials

We tested this in Austin in October as an A/B test, adding “Easy”, “Medium”, or “Hard” to the top of each puzzle. After the event, I asked each team whether they liked having the rating system. Obviously, the teams that normally asked about difficulty liked it, but nearly all teams gave really positive feedback. In fact, we had only a single negative comment, along the lines of “I was proud of myself, until I saw it was marked easy”.

In November and December, we went broader, adding the same system to every city’s copies. This went over less well, with several complaints from GC that people hated the system and, in December, that the ratings were inaccurate. The ratings were based strictly on the “difficulty” responses provided by playtesters on their feedback forms, but there was some judgement in determining the cutoff values between easy, medium, and hard.

This negative feedback from GC was concerning and confusing, since the Austin test had gone so well. We didn’t know if only the players that hated it were complaining and GC wasn’t getting the positive feedback, or if the testing in Austin was an outlier and the hate was universal.

Another Test

Because of that negative feedback, we decided to scrap the ratings system for January 2017 and think about a resolution. It seemed that “I felt bad because this was supposed to be easy” was a common complaint on the feedback thread, so we decided on a slightly different system: a 5-star rating instead of the English words. Hopefully, this would convey the same information but let people judge their own abilities, without the implicit judgement of not being able to get an ‘easy’ puzzle.

Thus, February’s puzzles had this new system, but, by golly, we were going to solicit player feedback this time to make sure. If you love charts like I do, you’ll like this next bit…

We had 434 teams do February’s set (1440 people, not including Game Control members). The puzzles’ difficulties ranged from 2 to 4 stars, as is our goal, not too easy and not too hard. Of course, in the future, it’s possible for various reasons that a puzzle set might legitimately contain a 1 or a 5-star puzzle.

February’s set was the most playtested, perhaps ever, in Puzzled Pint history. This helped ensure that the feedback on the system wasn’t tainted by incorrect difficulty ratings. Even so, I made the call to bump the Cupid puzzle to 4 stars because it had a large standard deviation, instead of keeping it at the strict mean, which would have been 3 stars. We will probably formalize that going forward by setting star difficulties at one standard deviation to the right of the mean.
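For the curious, here is a minimal sketch of what a "mean plus one standard deviation" rule could look like. It assumes playtesters score difficulty on a 1 to 5 scale and that the population standard deviation is used; neither detail is confirmed above, and the function name and sample scores are invented for illustration.

```python
from statistics import mean, pstdev


def star_rating(responses, use_spread=True):
    """Turn playtester difficulty scores (assumed 1-5) into a star rating.

    With use_spread=True the rating sits one standard deviation to the
    right of the mean; with use_spread=False it is the strict mean.
    """
    center = mean(responses)
    if use_spread:
        center += pstdev(responses)  # population std dev is an assumption
    # Round to a whole star and keep the result inside the 1-5 range.
    return max(1, min(5, round(center)))


# Invented scores with a wide spread, not real playtest data.
wide_spread = [1, 2, 2, 3, 3, 4, 5, 5]
print(star_rating(wide_spread, use_spread=False))  # strict mean
print(star_rating(wide_spread))                    # mean + one std dev
```

A puzzle that splits playtesters like this lands a star higher under the spread-aware rule, which mirrors the kind of bump described for the Cupid puzzle.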

So, how were the responses?

Well, first off, we didn’t have the response rate I’d hoped. Even if you don’t consider teams that didn’t finish the puzzle set (i.e. had fewer than 5 completed puzzles), here’s the response rate by city:

Still, a 59% response rate is enough to represent a good section of our players, and the results are likely skewed away from beginners anyhow because beginners are less likely to finish the set and thus have not given a response. Recall, Puzzled Pint is very much targeting the experience of the beginner puzzler, not the experts or even the ‘regulars’.

So, what were the survey results?

Only 4 teams in our survey reported that the difficulty ratings were harmful. Amazing! We figured that would be higher considering the feedback on the GC thread.

90 teams did say that they were not helpful, but that they didn’t mind their existence.

67 teams said that they were helpful, but not necessary.

Finally, 63 teams said that they really wanted us to keep them!
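To put those four counts in proportion, a short sketch like the following turns them into shares of all responding teams. The counts come straight from the survey numbers above; the labels and variable names are just shorthand made up for this example.

```python
# Team counts from the February survey results listed above.
responses = {
    "harmful": 4,
    "not helpful, but don't mind": 90,
    "helpful, but not necessary": 67,
    "please keep them": 63,
}

total = sum(responses.values())

# Share of all responding teams for each answer, as a percentage.
shares = {label: 100 * count / total for label, count in responses.items()}

for label, pct in shares.items():
    print(f"{label}: {responses[label]} teams ({pct:.1f}%)")

# Teams that found the ratings helpful to some degree.
helpful = responses["helpful, but not necessary"] + responses["please keep them"]
print(f"helpful overall: {helpful} of {total} teams")
```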

Breaking Down the Data

Those stats alone don’t tell the full story. Yes, more teams wanted us to keep them than thought they were harmful, but we are interested to know how much they helped the more novice teams.

How are we to judge which QotM responses came in from the beginners vs the more experienced? We hypothesized that, since we collect solve times, we could look at those and assume that teams that took longer to finish the set were the less experienced puzzlers.

But wait! What about team size? Don’t smaller teams take longer? To check, I ran those numbers and came up with this lovely chart:

Nope! Team size matters very little to overall solve time. There is a clear downward trend, but the standard deviation is a nearly constant 40 minutes for each team size.

In the chart, I used larger bubbles when multiple teams had the same exact size and minutes taken. As you can see by the data points, there was a huge variance of solve times, no matter the team size.
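The team-size check described above can be sketched roughly as follows: group solve times by team size, then compare the mean and standard deviation of each group. The rows here are invented placeholders, not the real February data, and the function name is hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev

# (team_size, minutes_to_finish) pairs; invented for illustration only.
teams = [
    (2, 95), (2, 150),
    (3, 80), (3, 140),
    (4, 70), (4, 135),
    (5, 65), (5, 130),
]


def spread_by_size(rows):
    """Return {team_size: (mean_minutes, std_dev_minutes)}."""
    groups = defaultdict(list)
    for size, minutes in rows:
        groups[size].append(minutes)
    return {size: (mean(v), pstdev(v)) for size, v in sorted(groups.items())}


for size, (avg, spread) in spread_by_size(teams).items():
    print(f"team of {size}: mean {avg:.1f} min, std dev {spread:.1f} min")
```

If the means drift down only slightly while the spread stays large for every size, team size alone says little about how long a given team will take.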

Therefore, I felt safe in doing the analysis based purely on the number of minutes taken to do the set, assuming those that took longer were the more inexperienced. A simple histogram sorting those times into buckets of 15 minutes allowed me to create this graph of the opinion results:
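A rough sketch of that bucketing step is below. It assumes the buckets start at 20 minutes, which is inferred from the "20-34" bucket mentioned with the percentage chart; the opinion labels and sample rows are made up for the example.

```python
from collections import Counter


def bucket_label(minutes, width=15, start=20):
    """Return the label of the 15-minute bucket containing `minutes`.

    The 20-minute starting offset is an assumption.
    """
    low = start + width * ((minutes - start) // width)
    return f"{low}-{low + width - 1}"


def histogram(rows):
    """Count teams per (bucket, opinion) from (minutes, opinion) rows."""
    counts = Counter()
    for minutes, opinion in rows:
        counts[(bucket_label(minutes), opinion)] += 1
    return counts


# Invented rows, not real survey responses.
sample = [(34, "keep"), (35, "not helpful"), (48, "keep"), (100, "helpful")]
for (bucket, opinion), n in sorted(histogram(sample).items()):
    print(bucket, opinion, n)
```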

First of all, let us revel in the lovely emergence of the Gaussian curve again in nature. This one has a fatter tail than true normal, but it’s nice and smooth. Ahh.

Next, we can clearly ignore the red bars, the ‘complainers’, as they are so few. So, let’s look only at the rest.

Both light green and orange show no clear trend, but it does seem the dark green increases as solving time increases. For a clearer picture let’s ignore the number of teams in each category and look at the percentages within each:

Now we can see a significant trend, ignoring the outliers on either end (there’s only one team in the 20-34 bucket). Even though a pretty constant percentage of teams think the ratings aren’t helpful, the longer a team takes to solve, the more they like the rating system!
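Converting counts to percentages within each time bucket is a simple normalization. Here is a minimal sketch; the bucket labels, opinion names, and counts are placeholders rather than the actual survey tallies.

```python
def shares_within_buckets(counts):
    """Convert {bucket: {opinion: teams}} into {bucket: {opinion: percent}}."""
    result = {}
    for bucket, by_opinion in counts.items():
        total = sum(by_opinion.values())
        result[bucket] = {
            opinion: 100 * n / total for opinion, n in by_opinion.items()
        }
    return result


# Invented counts for two buckets.
counts = {
    "50-64": {"not helpful": 3, "helpful": 4, "keep": 1},
    "110-124": {"not helpful": 3, "helpful": 2, "keep": 3},
}

for bucket, shares in shares_within_buckets(counts).items():
    print(bucket, {op: f"{pct:.1f}%" for op, pct in shares.items()})
```

Normalizing this way removes the effect of how many teams fall in each bucket, so a rising share of one answer stands out even when the raw bars are small.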

Okay, so there’s one lingering question that remained in my mind. Is this city dependent? Maybe some cities just hate them and others like them? Will we see a significant variance among cities, or will they all just be average? Well, check this out:

Boom! We have a triangle! I’ve put the size of each city on the X axis, and their average rating on the Y. As cities grow the responses move towards the mean answer of ‘slightly yes’. Still, I’m amazed at the variance in the smaller cities (this is actually locations, not cities, but you know what I mean).

Boston, our largest city by far, is clearly supportive of the rating system: not like Victoria’s 100% support, of course, but solidly above the disdain that Tacoma’s 17 people have. Luckily, this chart shows that, by combining cities, I wasn’t significantly masking any strong negatives from only a few. No city really minds the system (on average), and most of them are well into the ‘yes’ range. Austin (both sites) is well into the yes range, which validates the earlier testing there.
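A per-location summary like the one plotted above could be built along these lines. The numeric scale that maps each answer to a score is purely an assumption (the actual scale is not stated), and the location names and rows are invented.

```python
from collections import defaultdict

# Hypothetical scoring scale; the real one is not given in this post.
SCORE = {"harmful": -2, "not helpful": -1, "helpful": 1, "keep": 2}


def location_summary(rows):
    """Return {location: (players, average_score)}.

    rows are (location, team_size, opinion) tuples, one per team.
    """
    players = defaultdict(int)
    scores = defaultdict(list)
    for location, size, opinion in rows:
        players[location] += size
        scores[location].append(SCORE[opinion])
    return {
        loc: (players[loc], sum(vals) / len(vals)) for loc, vals in scores.items()
    }


# Invented responses for two made-up locations.
rows = [
    ("Location A", 4, "keep"),
    ("Location A", 3, "helpful"),
    ("Location B", 5, "not helpful"),
    ("Location B", 4, "keep"),
    ("Location B", 2, "helpful"),
    ("Location B", 3, "helpful"),
]

for loc, (n, avg) in location_summary(rows).items():
    print(f"{loc}: {n} players, average score {avg:+.2f}")
```

With only a handful of teams, a single answer can swing a location's average a long way, which is consistent with the wide spread among small locations.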

Overall folks, the rating system is here to stay. Thanks for participating, and please keep the suggestions and feedback coming, so we can continue to improve in the future.

Yours Truly,

Neal Tibrewala
Puzzled Pint HQ

Puzzled Pint is always looking for authors, both seasoned veterans and people who want to get their first taste of puzzle design. Because we do one set of puzzles per month, our waitlist is about a year out, but that’s a good thing for all. It means we have the time to work with draft puzzles, provide direct feedback, bounce the puzzles off of playtesters in the US and abroad, and route that feedback to the author as suggested revisions.

Some authors like to come up with a theme first, then see what sorts of puzzle mechanisms that theme inspires. Others like to come up with mechanisms first and then wrap them in story and theme. Both ways are equally valid. For what it’s worth, bonus puzzles are often — but not always — in the latter camp.

This month we asked a Question of the Month to our Puzzled Pint attendees. We asked you to suggest themes for upcoming months. Our hope was that this could provide a source of inspiration to future authors. We would like to share the results here. There were 374 total suggestions from 23 cities. The top suggestions (with 3 or more votes) are:

Star Wars (14)
Harry Potter (13)
Disney (10)
board games (9)
geography (6)
video games (6)
Pokemon (6)
Alice in Wonderland (5)
Game of Thrones (4)
Doctor Who (4)
Star Trek (4)
Dr. Seuss (4)
Lord of the Rings (3)
superheroes (3)
space (3)
Carmen Sandiego (3)
pirates (3)
alcohol (3)
Buffy the Vampire Slayer (3)
Olympics (3)
food (3)
X-Files (3)
pizza (3)
Breaking Bad (3)
James Bond (3)
Shakespeare (3)

We happened to do Disney Star Wars back in 2012 when the Lucas/Disney sale was first announced, but there’s no reason why we couldn’t do Disney, Star Wars, or another Disney Star Wars. We also did board games back in May of 2011, Doctor Who a little more recently in September, and James Bond in 2013. But these are all good themes, and as long as the puzzles are unique, fun, and challenging, we’re open to revisiting past motifs.

The remaining suggestions are as follows (~~highlights added strictly for humor value~~ I’ve highlighted a few unique entries to better stand out):

30 Rock, 7 Wonders, 80s, 80s action movies, 90s, a salute to gingers, AA Milne, Adventure Time, ALF, anagrams, Ancient Rome, Angry Birds, astrology, Austin, automobiles, Back to the Future, backpacking, bad Scifi movies, Barbie, Beatles songs, beer, Best of the MIT Mystery Hunt, birdwatching, Blade Runner, Bones, boy bands, branches of the Armed Forces, Britney Spears, Broadway shows, butts, Calvin and Hobbes, cats, cheese (A Brie Encounter, Cheddar Off Dead, etc.), childhood, childhood games, chocolate, circus, classic cinema, classic literature, Clue, Clue (the game), Coca-Cola, college football conferences, colors, comic books, Comics, composers of classical music, conspiracy theories, crosswords, cryptology, cuisines, cultures around the world, David Bowie + The Muppets = Labyrinth, David Bowie/labyrinth, DC, dessert, dinosaurs, Disney Princesses, dogs, donuts, Downton Abbey, Edgar Allan Poe, Egypt, emoji, escape rooms, Ex Machina, exploring, fairy tale, fairy tales, famous cathedrals, famous Chicagoans/landmarks/history, famous crossroads, Fargo, Firefly, Firefly/fireflies/“Firefly”, fish, flowers, Follow that Bird, football (soccer), Futurama, G. I. Joe, Ghostbusters, Gilmore Girls, grade school, Gravity’s Rainbow, hair metal, hair metal bands, Hamilton, He-Man, Hello Kitty/Sanrio, history, holidays, Hollywood/directing a film, hot cheese, House MD, HP Lovecraft, Hunger Games, Jeopardy, Jim Henson, John Hughes movies, Keep Austin Weird, Labyrinth (the film), Lady Gaga outfits, Larry Bird vs. Dr. J, Law & Order SVU, League of Legends, Lego, libraries, Limburger, literature, logic puzzles, Looney Toons, Mad Magazine, magic, March Madness, Mardi Gras, Marvel, Mass Effect, math, mazes, MegaMan, Mel Brooks, Michael J. Fox, Miyazaki (Totoro), moar robots, Monty Python, movies, Muppets, museums, music, music, sheet music, musicals, mythical creatures, Nickelodeon, Nintendo, Orphan Black, outer space, Parks & Rec, Party Down!, pinball, Pixar, Portal, Portlandia, Post Apocalyptia, presidents, pro wrestling, psychology, Pulp Fiction, pumpkin everything, QI, raccoons, Rambo, Red Dwarf, Rocky Horror, Roman, Roman numerals, RPGs, running, running a newspaper, science, scifi, scifi movies, Scooby Doo, seas on Earth, seas on the moon, secret agent, Seinfeld, Sesame Street, sharks, Sharktopus, Sherlock Holmes, Simpsons, Smurfs, solar system, songs by a famous band, space/planets/astronomy, spies, spies/spying, sports, Steven King, Story Lords (on YouTube, created in 1984, children’s reading educational program produced in Wisconsin), stupid laws (e.g. emergency Sasquatch ordinance), summer camp, Super Mario, Terry Pratchett works, the (fictional) Martians, The Big Bang Theory, The Birdman of Alcatraz, the elements, The Hateful Eight, the impressionists, The Legend of Zelda, The Martian, The Matrix, the movies of Gene Kelly, The Office, The Power Broker: Robert Moses and the Fall of New York, the presidents, The Smashing Pumpkins, the victorian era, The World Series, time travel, Tom Cruise, Tour de France, transportation, trash pandas, travel/flights, traveling, TV game shows, TV shows with initialisms, Twin Peaks, twitter, US presidents, varieties of tomatoes, Walking Dead, War of 1812, We care more about good content. Anything can make a good theme if handled well., weather systems, Weird Al, Whedonverse, who-dunnits (Like Jan 2016), wilderness survival, Winnie the Pooh, women scientists or musicians, World of Warcraft, xkcd, Yo-yos, “don’t be such butts”, “less poetry, more math”, and “one that does not require scissors”

A few of the suggestions are funny in context (such as the one team that suggested Angry Birds, birdwatching, Follow that Bird, The Birdman of Alcatraz, and Larry Bird vs. Dr. J, or the team that suggested The Martian as well as sci-fi martians in general). A few of the suggestions might not resonate in all countries globally, such as US presidents. Several suggestions all orbited around David Bowie, Labyrinth, the Muppets, and Jim Henson. It was tough to normalize them down to one specific word or phrase, but any of those ideas could be fun.

If you’re interested in writing a whole month or simply contributing a bonus puzzle or two, please contact us and we’ll point you to the author guidelines. And if you’d like to perform your own analysis, you can download the raw data.
