Watc

Watc | codesuji

2022-08-13

Read Time: 7 minutes

Recently Ted Unangst wrote about his tool, watc, to extract line count and file size statistics to support some of his work. Chris Wellons followed up with his take on watc. Inspired by both posts, I thought it would be an interesting tool to add to my own toolbox. It pairs nicely with some of my current work on extracting useful information from code repositories. This feels like a good way to put together a quick tool using F#. I’ll also use this as an opportunity to show some F# along the way.

Like Chris, I tend to favor non-interactive apps for this type of tooling. I have my own personal additions, but I follow his design a bit more closely. At a high level, the app is a relatively simple matter of iterating a directory structure and aggregating line counts and file sizes. Since the goal is analyzing source code, it filters out binaries, .git, build artifacts, etc., allowing me to stay focused on what I immediately care about. Command line parameters allow me to dictate summary level, sorting, and report formatting. You can find the full code here, but I’m just going to focus on a couple of small aspects. Before I get to the point, below is a small example of what the results look like.

$ ./watc --depth=2 --sort=lines ~/projects/fsharp/src
/home/codesuji/projects/fsharp/src       430.9K LOC    23.0MB
  Compiler                               356.0K LOC    19.3MB
    xlf                                  126.6K LOC     8.8MB
    Checking                              55.2K LOC     2.8MB
    TypedTree                             31.5K LOC     1.4MB
    AbstractIL                            30.6K LOC     1.1MB
    Service                               22.2K LOC     1.1MB
    Driver                                14.0K LOC   659.7KB
    Utilities                             11.5K LOC   409.5KB
    SyntaxTree                            11.1K LOC   426.0KB
    CodeGen                               10.8K LOC   536.4KB
    Optimize                               8.9K LOC   417.7KB
    Interactive                            8.2K LOC   433.6KB (1)
    Symbols                                7.3K LOC   317.3KB
    Facilities                             5.0K LOC   206.6KB (1)
    DependencyManager                      1.2K LOC    57.5KB (1)
    Legacy                                  659 LOC    31.4KB
  FSharp.Core                             66.2K LOC     3.2MB
    xlf                                    9.3K LOC   571.8KB
    math                                    134 LOC     5.0KB
  FSharp.Build                             4.0K LOC   191.4KB
    xlf                                     416 LOC    29.1KB
  FSharp.DependencyManager.Nuget           1.9K LOC    88.0KB
    xlf                                     818 LOC    38.7KB
  fsi                                      1.7K LOC   159.0KB
  FSharp.Compiler.Interactive.Settings      408 LOC    16.4KB
    xlf                                      78 LOC     5.1KB
  fsc                                       194 LOC     7.7KB
  Microsoft.FSharp.Compiler                 103 LOC     7.7KB
  FSharp.Compiler.Server.Shared              95 LOC     3.2KB
  fsiAnyCpu                                  69 LOC     2.7KB
  fscAnyCpu                                  66 LOC     2.9KB
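
For a rough idea of the traversal described above, here is a minimal sketch. It is not the actual watc code: the `Stats` record, the `skipDirs` list, and the `walk` function are all names I made up for illustration, and it leaves out binary-file filtering, depth limits, and sorting entirely.

```fsharp
open System.IO

// Hypothetical summary record; watc's real Node type carries more fields.
type Stats = { Lines: int; Bytes: int64 }

// Assumed skip list for illustration; the real tool's filters may differ.
let skipDirs = set [ ".git"; "bin"; "obj"; "node_modules" ]

// Recursively total line counts and file sizes under a directory.
let rec walk (dir: string) : Stats =
    let fileStats =
        Directory.GetFiles(dir)
        |> Array.map (fun f ->
            { Lines = File.ReadLines(f) |> Seq.length
              Bytes = FileInfo(f).Length })
    let dirStats =
        Directory.GetDirectories(dir)
        |> Array.filter (fun d -> not (skipDirs.Contains(Path.GetFileName(d))))
        |> Array.map walk
    Array.append fileStats dirStats
    |> Array.fold
        (fun (acc: Stats) s -> { Lines = acc.Lines + s.Lines; Bytes = acc.Bytes + s.Bytes })
        { Lines = 0; Bytes = 0L }

// Build a tiny throwaway tree to exercise the sketch.
let root = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName())
Directory.CreateDirectory(Path.Combine(root, "src")) |> ignore
Directory.CreateDirectory(Path.Combine(root, ".git")) |> ignore
File.WriteAllLines(Path.Combine(root, "src", "a.fs"), [| "let x = 1"; "let y = 2" |])
File.WriteAllLines(Path.Combine(root, ".git", "config"), [| "ignored" |])

let total = walk root
printfn "%d LOC, %d bytes" total.Lines total.Bytes

Directory.Delete(root, true)
```

Only the two lines in `src/a.fs` are counted here, since the `.git` directory is skipped before recursion.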

With some of the demonstration out of the way, time to get to the point. Improving application performance is a complicated and nuanced topic; obvious statement I know. Seeing the hoops some languages need to jump through to support parallelism is a good reminder that it doesn’t always have to be difficult. This leads me to F#. Today’s post is a pretty shallow view, looking for a quick win, but sometimes that’s all you need. For relatively simple tasks, parallelism can be simple to achieve with F#. Converting Array.map to Array.Parallel.map gives quick access to parallelism out of the box. To illustrate this, I’ll pull the related section out of the code.

Before, single-threaded:

let processDir maxDepth showFiles dir =
  ...
    let filesLines =
      files
      |> Array.map (fun x ->
          { Node.Name = Path.GetFileName x
            Type = NodeType.File
            Lines = getFileLines x
            Bytes = getFileBytes x
            DirCount = 0
            Children = [||]
          })

    let dirsLines =
      dirs
      |> Array.map (fun x -> processDir' (currentDepth + 1) x)

    let lineSum =
      [| filesLines; dirsLines |]
      |> Array.concat
      |> Array.map (fun x -> x.Lines)
      |> Array.sum

    let byteSum =
      [| filesLines; dirsLines |]
      |> Array.concat
      |> Array.map (fun x -> x.Bytes)
      |> Array.sum

    ...

After, multi-threaded:

let processDir maxDepth showFiles dir =
  ...
    let filesLines =
      files
      |> Array.Parallel.map (fun x -> // LINE CHANGED
          { Node.Name = Path.GetFileName x
            Type = NodeType.File
            Lines = getFileLines x
            Bytes = getFileBytes x
            DirCount = 0
            Children = [||]
          })

    let dirsLines =
      dirs
      |> Array.Parallel.map (fun x -> processDir' (currentDepth + 1) x) // LINE CHANGED

    let lineSum =
      [| filesLines; dirsLines |]
      |> Array.concat
      |> Array.Parallel.map (fun x -> x.Lines) // LINE CHANGED
      |> Array.sum

    let byteSum =
      [| filesLines; dirsLines |]
      |> Array.concat
      |> Array.Parallel.map (fun x -> x.Bytes) // LINE CHANGED
      |> Array.sum

    ...

Above you’ll see four line changes, resulting in a faster application. At this point, it is worth noting this is a cool trick, with caveats. When it fits the need, it is a simple way to get a performance improvement. But not all situations are the same. Sometimes design dictates a need for more control over the implementation. It is also something you need to test to ensure you’re getting the expected benefits and making the correct tradeoffs. There are many cases, particularly in large-scale apps, where this won’t necessarily work and you’d have to use other techniques. But I do enjoy how, for many cases, this is a quick win.
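
As a self-contained illustration of the swap, here is a minimal sketch that can be pasted into an .fsx script. The `countLines` helper and the temporary files are stand-ins invented for this example, not part of watc. The point is only that `Array.map` and `Array.Parallel.map` produce the same results in the same order, provided the mapped function doesn't touch shared mutable state.

```fsharp
open System.IO

// Hypothetical helper (not from watc): count the lines in one file.
let countLines (path: string) =
    File.ReadLines(path) |> Seq.length

// Create a few throwaway files so the script is self-contained.
let tempDir = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName())
Directory.CreateDirectory(tempDir) |> ignore

let files =
    [| 1 .. 8 |]
    |> Array.map (fun i ->
        let path = Path.Combine(tempDir, sprintf "file%d.txt" i)
        // File i gets i * 100 lines.
        File.WriteAllLines(path, Array.init (i * 100) string)
        path)

// The serial and parallel versions differ only in the module used.
let serialCounts = files |> Array.map countLines
let parallelCounts = files |> Array.Parallel.map countLines

// Array.Parallel.map preserves element order, so the arrays compare equal.
printfn "same results: %b" (serialCounts = parallelCounts)
printfn "total lines: %d" (Array.sum parallelCounts)

Directory.Delete(tempDir, true)
```

If the mapped function wrote to a shared counter or collection instead of returning a value, the one-line swap would no longer be safe without extra synchronization, which is one of the caveats mentioned above.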

I mentioned testing earlier. This is such a small project that I didn’t break out more advanced benchmarks. I just ran some quick sanity checks to see how the changes impacted runtime. I performed tests using two different directories, the F# and Rust language GitHub repos. I ran each multiple times, clearing system caches between tests. In a very unscientific fashion, below are representative running times for the serial versus parallel versions of watc. They show the app running faster in elapsed time (real time), which is what I’m aiming for.

# time ./watc ~/projects/fsharp

Serial:
real  0m2.429s
user  0m0.933s
sys   0m0.467s

Parallel:
real  0m1.135s
user  0m1.156s
sys   0m0.346s

# time ./watc ~/projects/rust/

Serial:
real  0m6.367s
user  0m2.071s
sys   0m1.286s

Parallel:
real  0m0.855s
user  0m2.222s
sys   0m0.971s
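
The `time` numbers above cover the whole process. For a quick in-process sanity check of a single `Array.map` versus `Array.Parallel.map` call, something like the following sketch could work. `timeIt` and `work` are hypothetical names, and `work` is a CPU-bound stand-in rather than watc's file processing, so its timings say nothing about the numbers above.

```fsharp
open System.Diagnostics

// Hypothetical helper: run f once and print the elapsed wall-clock time.
let timeIt (label: string) (f: unit -> 'T) : 'T =
    let sw = Stopwatch.StartNew()
    let result = f ()
    sw.Stop()
    printfn "%s: %dms" label sw.ElapsedMilliseconds
    result

// Stand-in CPU-bound workload with no shared state.
let work (n: int) =
    let mutable acc = 0L
    for i in 1 .. 2000000 do
        acc <- acc + int64 ((i + n) % 7)
    acc

let inputs = [| 1 .. 32 |]

let serialResult = timeIt "serial" (fun () -> inputs |> Array.map work)
let parallelResult = timeIt "parallel" (fun () -> inputs |> Array.Parallel.map work)

// Both versions should compute identical values; only the timing differs.
printfn "results match: %b" (serialResult = parallelResult)
```

A single run like this is just as unscientific as the `time` checks; JIT warm-up and machine load can easily skew the first measurement.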

That’s all I have for today. Array.Parallel has given me a nice performance boost when I’m doing repo recon, and I’ll take it. Beyond that, I just wanted to give a quick view into watc, F#-style. Until next time.
