
How many lines of open source code are hosted at the Eclipse Foundation?

 5 years ago
source link: https://www.tuicool.com/articles/hit/RVnMF3U

Spoiler alert: 162 million!

That’s right, as of August 1st, there are 330 active open-source projects hosted at the Eclipse Foundation, and if you look across the 1120 Git repositories that this represents, you will find over 162 million physical source lines of code. But beyond this number, let’s look at how it was obtained, and what it really means.

I’ve blogged several times about the importance of using metrics to monitor the health (and hopefully, growth!) of an open source project/community, and lines of code are just one of them. You should always have other metrics on your radar, like the number of contributors, diversity, etc.

There are many ways, and many tools available out there, to count source lines of code. Open Hub (previously known as Ohloh) used to be a really good tool, but it doesn’t seem to be actively maintained. For a few years now, I’ve been relying on a home-made script to analyze Eclipse IoT projects, and it’s only recently that I realized I should probably run it against the entire eclipse.org codebase!

In this blog post, I will briefly talk about how the aforementioned script works, why you should make sure to take these metrics with a pinch of salt and finally, go through some noteworthy findings.

Line counting process

The script used to count the number of lines of code is available on GitHub. It takes a list of Eclipse project identifiers (e.g. ‘iot.paho’) and a time range as input, and outputs a consolidated CSV file.

The main script (main.js) uses the Eclipse Project Management Infrastructure (PMI) API to retrieve the list of Git repositories for the requested projects, then clones the repos and runs the cloc command-line tool against each of them. The script can also compute the statistics for a given time period, in which case it looks at the state of each repository at the beginning of each month in that period.
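To make the monthly sampling more concrete, here is a minimal sketch in Python (not the Node.js code of the actual script). It only computes the snapshot dates and builds the git and cloc command lines without running them. The function names are made up for illustration, and the PMI lookup and the cloning step are deliberately left out; the real main.js may well proceed differently.

```python
from datetime import date
from typing import List


def month_starts(start: date, end: date) -> List[date]:
    """Return the first day of every month that falls within [start, end]."""
    year, month = start.year, start.month
    # If the range does not begin on the 1st, the first snapshot is next month.
    if start.day != 1:
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    snapshots = []
    while date(year, month, 1) <= end:
        snapshots.append(date(year, month, 1))
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return snapshots


def last_commit_before_command(repo_dir: str, day: date) -> List[str]:
    """Command that prints the last commit reachable from HEAD before `day`."""
    return ["git", "-C", repo_dir, "rev-list", "-n", "1",
            "--before=" + day.isoformat(), "HEAD"]


def cloc_command(repo_dir: str) -> List[str]:
    """Command that counts lines in `repo_dir` and prints the result as CSV."""
    return ["cloc", "--csv", "--quiet", repo_dir]
```

For each date returned by `month_starts`, one would check out the commit reported by the first command and then run the second one, for instance through `subprocess.run`.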

Once the main script has completed (and it can obviously take quite some time), the csv-concat.js script can be used to consolidate all the produced metric files into a single CSV file that contains the detailed breakdown of lines of code per project and per programming language, the top-level project each project belongs to, the number of blank and comment lines, etc. It is then pretty easy to feed this CSV into Excel or Google Spreadsheets, and use it as the source for building pivot tables for specific breakdowns.
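If you would rather not open a spreadsheet, the same kind of pivot can be sketched in a few lines of Python. The column names below (`project`, `top_level`, `language`, `blank`, `comment`, `code`) are an assumption about what the consolidated file looks like, and the sample rows are invented purely for illustration; adjust both to the real output of csv-concat.js.

```python
import csv
import io
from collections import defaultdict

# Made-up sample data; the column names are assumed, not taken from the script.
SAMPLE = """project,top_level,language,blank,comment,code
iot.paho,iot,Java,10,20,300
iot.paho,iot,C,5,5,200
modeling.example,modeling,Java,1,2,400
"""


def pivot_sloc(csv_text, key):
    """Sum the `code` column per distinct value of `key`, largest first."""
    totals = defaultdict(int)
    for row in csv.DictReader(io.StringIO(csv_text)):
        totals[row[key]] += int(row["code"])
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for language, sloc in pivot_sloc(SAMPLE, "language"):
        print(language, sloc)
```

Calling `pivot_sloc(text, "top_level")` instead gives the breakdown per top-level project.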

Caveats

Just like virtually any KPI, you want to take the number of lines of code in your project with a grain of salt. Here are a few things to keep in mind:

All lines of code are not created equal

There is an incredible diversity of projects at Eclipse, and while a majority of them use Java as their main programming language, there is also a lot of C, C++, Python, JavaScript, and more. 10M lines of Java code probably don’t carry the same value (i.e. how much effort was needed to produce them) as 10M lines of C code.

Trends are more important than snapshots

It is nice to know that as of today there are 162 million lines of code in the Eclipse repositories, but it is, in my opinion, more important to look at trends over time. Is a particular programming language becoming more popular? Are all the top-level projects equally active?

I didn’t have a chance to run the scripts for a longer time period yet, but I will make sure to share the results when I get a chance!

Generated code, should it count?

There is a fair amount of generated code in some projects (in the Modeling top-level project in particular, of course), which certainly accounts for a few million lines of code. However, generated code often is customized, so I think it doesn’t necessarily skew the numbers as much as one would think.

Development does not always happen in a single branch

My script just looks at the code stored in the main (HEAD) branch of the Git repository. Some projects may have more than one development stream and may e.g. have a “develop” branch that is ahead of the main stable branch. Therefore, there is very likely more code in our repositories than what this quick analysis shows.

Additional findings

As my script outputs pretty detailed statistics, it is interesting to have a quick look at e.g. how the different top-level projects and programming languages compare.

Top 3 top-level projects: Runtime, Technology & Modeling

| Top-level project | Physical SLOC |
|---|---|
| rt | 54,961,728 |
| technology | 28,887,621 |
| modeling | 27,140,344 |
| tools | 14,214,182 |
| webtools | 9,651,900 |
| eclipse | 6,401,518 |
| ee4j | 5,809,126 |
| ecd | 3,114,768 |
| polarsys | 3,105,229 |
| iot | 2,930,217 |
| birt | 2,235,624 |
| science | 1,670,051 |
| datatools | 939,424 |
| mylyn | 767,652 |
| soa | 752,774 |
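As a quick sanity check, the per-project figures above can be summed and turned into shares with a short Python snippet. The numbers are copied from the table; the variable and function names are only illustrative.

```python
# Physical SLOC per top-level project, copied from the table above.
SLOC_BY_TOP_LEVEL = {
    "rt": 54_961_728, "technology": 28_887_621, "modeling": 27_140_344,
    "tools": 14_214_182, "webtools": 9_651_900, "eclipse": 6_401_518,
    "ee4j": 5_809_126, "ecd": 3_114_768, "polarsys": 3_105_229,
    "iot": 2_930_217, "birt": 2_235_624, "science": 1_670_051,
    "datatools": 939_424, "mylyn": 767_652, "soa": 752_774,
}


def share_of_top(sloc_by_project, n):
    """Fraction of all lines held by the n largest top-level projects."""
    counts = sorted(sloc_by_project.values(), reverse=True)
    return sum(counts[:n]) / sum(counts)


if __name__ == "__main__":
    total = sum(SLOC_BY_TOP_LEVEL.values())
    print("total physical SLOC:", total)
    print("share of the top 3: {:.1%}".format(share_of_top(SLOC_BY_TOP_LEVEL, 3)))
```

The rows add up to 162,582,158 lines, which matches the "over 162 million" headline, and the three largest top-level projects account for roughly two thirds of that total.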

Top programming language: Java

| Programming language | Physical SLOC |
|---|---|
| Java | 72,349,870 |
| HTML | 61,119,106 |
| XML | 7,543,689 |
| ANTLR Grammar | 3,161,339 |
| JSON | 2,313,556 |
| JavaScript | 2,251,418 |
| C++ | 2,245,759 |
| C | 1,446,013 |
| XMI | 1,355,914 |
| C/C++ Header | 1,019,368 |
| TTCN | 923,098 |
| Maven | 884,271 |
| CSS | 805,073 |
| Assembly | 717,771 |
| XSD | 688,764 |
| PHP | 459,237 |
| Python | 316,553 |
| Markdown | 304,421 |
| XSLT | 256,857 |
| Scala | 229,560 |
| Bourne Shell | 214,142 |
| Go | 184,306 |
| SWIG | 152,062 |
| JSP | 142,190 |
| Gencat NLS | 125,251 |
| Ant | 113,133 |
| TypeScript | 108,217 |
| AsciiDoc | 105,552 |
| Windows Module Definition | 64,843 |
| TITAN Project File Information | 64,014 |
| Groovy | 55,261 |
| Sass | 53,915 |
| XQuery | 51,432 |
| XHTML | 51,166 |
| DTD | 51,052 |
| make | 48,021 |
| Perl | 43,643 |
| DITA | 42,526 |
| yacc | 39,876 |
| TeX | 36,400 |
| m4 | 34,438 |
| AspectJ | 33,717 |
| Ruby | 28,355 |
| Scheme | 27,484 |
| YAML | 26,348 |
| CMake | 25,182 |
| Lua | 23,646 |
| LESS | 18,712 |
| SQL | 16,070 |
| Cucumber | 15,454 |
| IDL | 12,564 |
| INI | 12,171 |
| Bourne Again Shell | 11,978 |
| Pascal | 11,915 |
| lex | 11,795 |
| DOS Batch | 11,675 |
| Windows Resource File | 10,278 |
| Blade | 8,295 |
| C# | 7,983 |
| Tcl/Tk | 7,611 |
| Stylus | 7,477 |
| Fortran 90 | 7,211 |
| ERB | 7,048 |
| Vuejs Component | 6,281 |
| Visualforce Component | 5,047 |
| MSBuild script | 4,538 |
| Freemarker Template | 4,077 |
| Dockerfile | 3,696 |
| Velocity Template Language | 3,649 |
| awk | 3,068 |
| Rust | 2,903 |
| Qt | 2,772 |
| CUDA | 2,533 |
| Puppet | 2,084 |
| diff | 1,880 |
| Haml | 1,819 |
| Oracle PL/SQL | 1,778 |
| ProGuard | 1,739 |
| Objective C | 1,469 |
| ActionScript | 1,459 |
| Visual Basic | 1,365 |
| Mathematica | 1,247 |
| RobotFramework | 1,074 |
| Korn Shell | 1,023 |
| D | 1,007 |
| Smalltalk | 911 |
| R | 887 |
| TOML | 826 |
| Ada | 668 |
| Lisp | 618 |
| Objective C++ | 589 |
| Fortran 77 | 588 |
| Arduino Sketch | 480 |
| MATLAB | 476 |
| sed | 461 |
| Protocol Buffers | 454 |
| WiX source | 446 |
| JavaServer Faces | 440 |
| PowerShell | 284 |
| Qt Project | 176 |
| Windows Message File | 139 |
| Expect | 120 |
| NAnt script | 110 |
| Smarty | 109 |
| HCL | 78 |
| CoffeeScript | 78 |
| Skylark | 74 |
| Forth | 69 |
| Qt Linguist | 61 |
| WiX include | 52 |
| XAML | 49 |
| QML | 48 |
| Handlebars | 46 |
| Clojure | 38 |
| Prolog | 37 |
| Razor | 32 |
| PO File | 29 |
| Haskell | 27 |
| JSX | 24 |
| ASP.NET | 21 |
| HLSL | 15 |
| F# | 11 |
| Swift | 10 |
| GLSL | 8 |
| Kotlin | 7 |
| C Shell | 7 |
| Mustache | 1 |

If you end up using my script and have any questions, please let me know in the comments or directly on GitHub!
