I’ve been lucky to attend every CraftConf Budapest since its inception in 2014. It has always been a mind-expanding experience, with a healthy mix of established and newcomer speakers from the tech industry worldwide. Although Craft, as the name suggests, focuses on Software Craftsmanship, the specific topics of the talks vary each year as industry trends shift. I wanted to take a deeper look at which technologies and paradigms became more or less popular over time, and to find those, if any, whose popularity has stayed roughly constant.
While doing some research into data analytics for a project at work, I came across Microsoft SandDance. My interest in the CraftConf data was the perfect opportunity to teach myself SandDance (and some Python as well). So I put together a quick experiment: using Martin’s excellent tutorial on Web Scraping with Python as a reference, I wrote a Python script that scrapes CraftConf’s talk descriptions (using archives from 2014, 2015 and 2016) and produces a CSV file of the most frequently occurring words. Common English words such as pronouns and days of the week are ignored. What we’re left with is a list of the top 100 words for each year, along with their frequencies of occurrence, which can be visualized in SandDance.
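To give a feel for the counting step, here is a minimal sketch of it, not the actual script. It skips the scraping and works on a plain string. The tiny `STOP_WORDS` set, the function names `top_words` and `write_csv`, and the `Year` column are assumptions made for illustration; `Keyword` and `Frequency` match the column names used in SandDance further down.

```python
import csv
import io
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (assumption: the real
# script's list is larger and is not shown in this post).
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it", "we", "you", "monday"}

def top_words(text, limit=100):
    """Return the `limit` most frequent non-stop words as (word, count) pairs."""
    # Lowercase and keep runs of letters only, so punctuation is dropped.
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return counts.most_common(limit)

def write_csv(rows_by_year, out):
    """Write one Year,Keyword,Frequency row per word (column layout is assumed)."""
    writer = csv.writer(out)
    writer.writerow(["Year", "Keyword", "Frequency"])
    for year, rows in rows_by_year.items():
        for word, count in rows:
            writer.writerow([year, word, count])

sample = "Microservices and the art of architecture. Microservices in production."
buffer = io.StringIO()
write_csv({2016: top_words(sample)}, buffer)
print(buffer.getvalue())
```

The letters-only pattern is a shortcut: it splits contractions such as “don’t” into two tokens, which is acceptable for a rough count but worth knowing about.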
Although this approach is quick and more or less effective, it counts single words only, so it may not accurately reflect trends for phrases like “Agile Methodology”: the word frequency of “agile” may not be the same as that of “methodology”. But that’s something that can be worked on later. So although I would take this as a good indicator (which meets my purpose), I wouldn’t use the analysis outcomes as a serious reference.
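One possible way around the phrase limitation, which the current script does not do, would be to count pairs of adjacent words (bigrams) instead of single words. The sketch below is an assumption about how that could look; `top_bigrams` is a hypothetical name.

```python
import re
from collections import Counter

def top_bigrams(text, limit=10):
    """Return the `limit` most frequent pairs of adjacent words."""
    words = re.findall(r"[a-z]+", text.lower())
    # Pair each word with its successor. This simple version also pairs
    # words across sentence boundaries, which adds some noise.
    pairs = zip(words, words[1:])
    return Counter(" ".join(pair) for pair in pairs).most_common(limit)

sample = "Agile methodology works. Agile methodology scales. Agile teams ship."
print(top_bigrams(sample, 3))
```

With this, “agile methodology” is counted as one unit, separately from other uses of “agile”.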
What the Data Tells Us
Here are some interesting findings from the first pass (taken from the top 100 words per year that remain after filtering out common and punctuated words):
- “Product” showed up 26 times in 2015 and nearly doubled to 51 in 2016 (a growing trend: no talk, some talk, twice the talk…)
- “Functional” [programming] appeared 16 times in 2014, but not in the other two years (something that’s coming and going?)
- “Architecture” showed up 15, 37 and 23 times in 2014, 2015 and 2016 respectively (up and down)
- “DevOps” was an equally hot trend in 2014 and 2015 but didn’t show up in 2016 (presumably because the hype is over)
- “Microservices” appeared 29 times in 2016, but didn’t show up in the previous years (so there is a recent spike in popularity)
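Comparisons like these are easier to make once the per-year rows are laid out side by side. Here is a small sketch of that pivot, using only the figures quoted in the list above. The `(year, keyword, frequency)` row layout and the `pivot` helper are assumptions, and a missing year is shown as 0 even though it really means the word did not make that year’s list.

```python
from collections import defaultdict

YEARS = (2014, 2015, 2016)

# (year, keyword, frequency) rows, limited to the numbers quoted in the post.
rows = [
    (2015, "product", 26), (2016, "product", 51),
    (2014, "functional", 16),
    (2014, "architecture", 15), (2015, "architecture", 37), (2016, "architecture", 23),
    (2016, "microservices", 29),
]

def pivot(rows):
    """Group frequencies by keyword, with one slot per year (0 if absent)."""
    table = defaultdict(lambda: dict.fromkeys(YEARS, 0))
    for year, keyword, freq in rows:
        table[keyword][year] = freq
    return table

for keyword, by_year in pivot(rows).items():
    print(keyword, [by_year[y] for y in YEARS])
```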
See for Yourself
As a fun exercise in data visualization and trend analysis, I encourage you to try it out for yourself, using the CSV file produced by my script. To start with:
- Load the dataset: Dataset > Web > CSV file (Keep “First line is header” checked)
- Set the URL to the CSV file link above, and click Load
- View as: Column
- X Axis: Keyword
- Sum by, Facet by: None
- Color by: Keyword
- Sort by: Frequency
- Set the X axis bins to: 100
The typical way to drill down using SandDance would be:
- Select a keyword (say “lean”)
- Click Isolate. Everything else gets filtered out (note the “Filtered” count increased from 0 to 2)
- Now you can check “Details”, or “Facet by…”, for example
- To go back, simply click “Filtered” to clear the selection
Isolating the “Other” keyword will reveal a whole lot of keywords that don’t show up in the first 100 bins. You can also take the SandDance tour (by clicking on Tour) and discover many other interesting ways of playing with SandDance.
You can find the source code in my Git Repository WordFreqCount. If you find it useful, please feel free to reuse, derive from or improve it. Credit goes to Martin for the original code on web scraping using Python and Beautiful Soup, which I adapted heavily. And of course, thanks to Microsoft for making the elegant and powerful SandDance available for free!