Language popularity: It’s not about search engine result counts

Recently there has been a lot of noise about the Tiobe index, in which search engine result counts are compared for various programming languages. Looking at search engine results is a crude and very inaccurate method of measuring popularity. One problem with such a method is that the mere mention of a programming language on any web page (regardless of context) is interpreted as “popularity”. There is no notion of how old a web page is. If C++ gets mentioned in a blog post from 1997, that’s counted towards current “popularity” even if the author of the blog post no longer uses C++. Search results including “I hate XYZ programming” and “XYZ programming sucks” also get counted as “popularity”.

What we should really be measuring is which languages are actively being used. How do you measure usage? The first idea which usually springs to mind is to see how many open source projects are using language X on GitHub or Sourceforge. This logic is deeply flawed as a great deal of code being written today is not open source. Focusing only on open source projects excludes vast quantities of code being churned out by paid developers working on projects and internal systems which will never be open sourced.

We need to measure the number of developers actively writing code in a particular language *today*. What do programming languages all have in common? They all have developers trying to solve real problems. Typically when a developer has a problem he can’t solve he goes to a site like stackoverflow.com and asks for advice. If you’re asking questions about how to do something in a programming language there’s a very high probability that you are actively using that language.
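As a rough sketch of that idea (not the exact method used for the numbers in this post), the snippet below counts questions per language tag over the last seven days. The `QUESTIONS` records, the tag names and the `questions_per_language` helper are all hypothetical and made up for illustration; real data would have to come from a Stack Overflow data export or API.

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical sample records: (creation time, tags on the question).
# These values are invented purely for illustration.
QUESTIONS = [
    (datetime(2012, 11, 20, 9, 30), ["java", "swing"]),
    (datetime(2012, 11, 21, 14, 0), ["c#", "linq"]),
    (datetime(2012, 11, 22, 8, 15), ["java", "scala"]),
    (datetime(2012, 11, 23, 17, 45), ["c", "pointers"]),
    (datetime(2012, 11, 10, 12, 0), ["c"]),  # older than a week, ignored
    (datetime(2012, 11, 24, 11, 5), ["php", "mysql"]),
]

# Assumed set of language tags we care about.
LANGUAGES = {"java", "c#", "c", "php", "scala", "javascript"}


def questions_per_language(questions, languages, now, days=7):
    """Count questions tagged with each language in the last `days` days."""
    languages = set(languages)
    cutoff = now - timedelta(days=days)
    counts = Counter()
    for created, tags in questions:
        if cutoff <= created <= now:
            # A question tagged with two languages counts once for each.
            for tag in set(tags) & languages:
                counts[tag] += 1
    return counts


if __name__ == "__main__":
    now = datetime(2012, 11, 26)
    ranking = questions_per_language(QUESTIONS, LANGUAGES, now)
    for language, count in ranking.most_common():
        print(f"{language}: {count}")
```

Counting by tag rather than by text search is the point here: a question only counts when the asker explicitly labelled it with the language, and only if it was asked inside the time window.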

Looking at stackoverflow.com data for the last week, we get a picture which is very different from the Tiobe index. The first thing that stands out is that Java, C#, JavaScript and PHP feature much higher in the rankings than C. This should not be surprising. While C is suited to many tasks, such as operating system and device driver development, the vast majority of code being churned out by Joe Developer is not written in C.

The next thing that stands out is that the next generation of JVM languages (Scala, Groovy, Clojure) feature well ahead of languages such as Ada, NXT-G and Logo, which are ranked surprisingly high in naive search engine result counts. Scala is in fact getting very close to breaking into the mainstream group.

Figure: Stack Overflow questions per week by language


20 thoughts on “Language popularity: It’s not about search engine result counts”

    • Thank you for the link. Maybe we should not just grab as much data as possible but somehow statistically compare the measurements. Developers of some language may be concentrated around a few sites, so if you are measuring, say, 50 languages, there may be more than 100 sites that we should include in the research. The problem is that developers of different languages are not spread out evenly on the Web.

      A similar problem would be trying to measure the popularity of human languages by just going to different cities around the world and building statistics on what languages people speak there. This would work well for the top languages, but may not work well for the rest. It is also relatively hard to reliably measure the popularity of different languages relative to each other this way.

      I did similar research but stopped after one site because I understood that getting data from another 5 sites would not give me much more confidence http://smthngsmwhr.wordpress.com/2012/11/19/measuring-popularity-of-programming-languages/

  1. Stackoverflow shows languages people have questions about. Novices have more questions than experts. Novices don’t write much code, experts do.

  2. So it’s somewhat reasonable that Stack Exchange / Stack Overflow questions and traffic may be influenced by usage, but it also seems reasonable to assume that large numbers of questions may be indicative of two other attributes: languages with poor documentation, bad design choices, or a general lack of clarity, and language users who lack training, experience, or knowledge.

    People who program in C++ or Python seem objectively less likely to post lots of questions to a stack* site, as they are either, respectively, likely to know their language and use context well, or have access to a great deal of well documented, clear, and concise references as well as consistent use paradigms.

    C++, yeah, not a clean language, but it’s used by more competent programmers.
    Python, well, it’s clean and there’s a documented proper way to do everything.

    PHP? Lots of clueless non-programmers hack it, plus it’s a dirty language with lots of variations of syntax and convention, and plenty of unexpected gotchas.

    No wonder there are a lot of questions.

    JAVA… not as bad as PHP, but it’s designed to create debate. Should I use an ArrayList or a HashMap? How about our UML diagrams? I don’t know, let’s have a big discussion and pretend to our enterprise overlords that stuff is happening.

    So I guess my point is that there’s not an easy way to get a clear answer to such an esoteric question.

    • C’mon, the difference between a list and a map is basic computer science knowledge, and UML diagrams??? Who is using them these days? Anyways, they are way better than doing the same thing in plain text…

    • i agree with your reply in principle.

      a better way to determine popularity would be to look at unique users or unique browsers on stack overflow: this would avoid the “language x is crap so i have to ask more questions” problem.

      one minor point: if you can’t tell a list from a hash map you need to rtfm!

  3. Great post, thanks for bringing another angle on the whole ‘Java once more surpassed by C’ story. I don’t put much stock in the TIOBE index, and I’d argue that anyone who thinks that language popularity is an important metric for the economics of software production has the cart thoroughly before the horse. That is, aside from the question of, can we find lots of people who *express* an interest in this language. That was a reasonably good metric when computer programming was in its infancy, and mainly popular with math, philosophy and logistics nerds.

    Now that software engineering is one of the last good gigs in the developed world, popularity is about as good a metric of quality as it is for novels. Efforts at building ‘programming for dummies kits’ seem to largely fail because anything complicated enough to require a computer program seems to be complicated enough to be poorly expressible with tinker toys. Picking languages based on popularity makes hiring computer programmers as if they were machinists easier yes, but you really don’t want to do that (probably not for machinists either anymore). Most of the cost in software is in maintenance. Hiring people who build things that require less maintenance is a much better play. As the successes of Amazon, Google and Apple make plain, the best move is to hire people who produce technical appreciating assets, not those who take on technical debt. As software becomes a globally competitive business this is going to become more dramatically true.

    The metric I’ve always applied to assessing any software technology is a very simple ratio:

    number of smart people I encounter interested in a technology
    + amount of things smart people have continued to use for some time built with said technology

    /

    apparent marketing spend to convince people this technology will save their enterprise a bajillion dollars

    C ranks pretty highly, as do Python and Unix. Java has nice garbage collection and some good concurrency primitives, but almost every smart person I’ve encountered who programs in Java seems to feel they are stuck with Java for one pragmatic reason or another. There’s a lot of awesome stuff like Hadoop & friends built in Java, but I’d way rather interact with those tools through a Scala or a Clojure.

  4. “Java, C#, Javascript and PHP feature much higher in the rankings than C.”

    Aren’t there other plausible explanations for this?

    C is a much simpler language, for one. ANSI C is a 200-page spec; even C99 (which nobody is really using yet) is 500 pages. C# is on par with C++ now, at around 1000 pages. I would expect a lot more questions about complex languages. Are languages like Tcl and Scheme off the charts here because they have no users, or because the languages are so dang simple that there’s nothing left to ask?

    Also, C is not changing much, especially the version that people actually use (Microsoft isn’t supporting the 13-year-old C99 because they say there’s no demand). C# has a new major version every couple of years. It’s quite likely that programmers have a C book from the past couple of decades that will answer their questions just fine, while virtually nobody bothers getting a C# book that will be obsolete next year. I would totally have believed it if you labeled your graph “how much would you hate yourself if you woke up hungover and discovered that in a drunken stupor you’d actually spent money on a book about this language?”. :-)

      • But for established languages you can simply ask a colleague, while for new ones you’d better use Stack Overflow

  5. You missed Haskell too. It usually hovers around 100 qs/week (similar to Scala), and in the top 15 overall (with a similar ranking on GitHub).

    • Google lets you search based on the age of content. I doubt that all of the search engines Tiobe is using support this feature. It would definitely be an improvement for Tiobe and it crossed my mind while writing this blog entry.

  6. Let’s step back even further and say “Why do we actually care about language popularity?” Unless you’re in the business of creating tools for a given language, or want to write a book about a language, I can’t see what relevance popularity has. TIOBE is answering a question that few need an answer to.

    TIOBE is sound and fury, signifying nothing.

  7. If I want to gauge the relative popularity of some technology, I go look at the “trends” page on Indeed.com. I figure if an employer is seeking a person to work with that technology then they must be using it.

  8. Thank you for the article. I did similar research here http://smthngsmwhr.wordpress.com/2012/11/19/measuring-popularity-of-programming-languages/ and compared the popularity of languages on github.com, stackoverflow.com and in the TIOBE index.

    The result is that we cannot judge relative popularity well even for the top languages based on these 3 sources, although the group of top languages is more or less the same. Visual Basic is popular in the TIOBE index, JavaScript on github.com and C# on stackoverflow.com.

    So, first, using only stackoverflow.com is not enough. Second, there might not be much improvement even if you survey 10 sites like stackoverflow.com.

    The problem is that language popularity is not just the online activity of a language’s users, and developers of a specific language tend to flock to one or two sites rather than spreading evenly over the whole Internet.
