

Exploratory Data Analysis on a large dataset using ROOT
Source: https://mightynotes.wordpress.com/2019/09/04/exploratory-data-analysis-on-a-large-dataset-using-root/

Having been messing with the physicists’ toy for a while, and not being a Python fan (Python makes it too easy to write dirty code), I find ROOT to be a fascinating tool: it scales almost perfectly and can be used in a way that needs basically no memory. In this post I’ll be putting ROOT to real work, in a scalable way.
I’ll be using some advanced C++ (lambdas, automatic type deduction, STL) and ROOT/cling trickery in this analysis. This is also an exercise to push myself to my limits and learn more about ROOT. You have been warned.
The dataset
After spending hours on Kaggle trying to find an easy-to-process yet large enough dataset, I ended up with the 8 GB Seattle Checkouts by Title dataset, which is one huge CSV file. Looks good enough. First I’ll need to convert the CSV into something ROOT can handle. From experience we know there must be missing data, so let’s dump the CSV into a ROOT archive of all strings first and worry about types later.
Unfortunately, ROOT’s CSV parser is a very basic one and won’t parse our dataset properly. So we’ll use an external one.
#include "csv.hpp" // assumes the single-header csv-parser library providing csv::CSVReader

void to_root_raw() {
    // Create a reader for the CSV file
    csv::CSVReader reader("checkouts-by-title.csv");

    // Create a ROOT file. For the first pass, we'll be saving everything as strings
    auto f = new TFile("library_raw.root", "recreate");
    TTree *t = new TTree("library_raw", "raw data from csv file");
    auto fields = reader.get_col_names();
    auto buf = std::vector<std::string>(fields.size());

    // Assign a string buffer to each column
    for(size_t i = 0; i < fields.size(); i++)
        t->Branch(fields[i].c_str(), &buf[i]);

    // Copy data from CSV to ROOT
    for(auto& row : reader) {
        for(size_t i = 0; i < fields.size(); i++)
            buf[i] = row[fields[i]].get();
        t->Fill();
    }

    // Write the data to the ROOT file
    f->Write();
}
Now we have library_raw.root. Let’s find out how much of the data is empty.
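One thing first: the ProgressDisplay used in the snippets below is not a ROOT class, just a tiny progress printer of mine. A minimal sketch of such a helper could look like this (the exact implementation doesn’t matter):

#include <cstddef>
#include <iostream>

// Minimal progress printer used by the snippets below (a sketch; any implementation will do)
struct ProgressDisplay {
    size_t total;
    explicit ProgressDisplay(size_t total_) : total(total_) {}
    void update(size_t done) {
        // '\r' rewrites the same console line instead of scrolling
        std::cout << "\r" << done << "/" << total
                  << " (" << 100.0 * done / total << "%)" << std::flush;
    }
};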
void count_zeros() {
    // Now we can use the new and improved RDataFrame interface
    auto rdf = ROOT::RDataFrame("library_raw", "library_raw.root");

    auto columns = rdf.GetColumnNames();
    auto empty_lst = std::map<std::string, size_t>();

    ProgressDisplay disp(*rdf.Count() * columns.size());
    size_t finished = 0;

    for(auto column : columns) {
        size_t empty_count = 0;
        rdf.Foreach([&](std::string s) {
            empty_count += (s == "");
            finished += 1;
            if(finished % 100000 == 0)
                disp.update(finished);
        }, {column});
        empty_lst[column] = empty_count;
    }
    std::cout << "\n";

    for(auto [column, count] : empty_lst)
        std::cout << "Column " << column << " has " << count << " empty entries" << std::endl;
}
And we get…:
Column CheckoutMonth has 0 empty entries
Column CheckoutType has 0 empty entries
Column CheckoutYear has 0 empty entries
Column Checkouts has 0 empty entries
Column Creator has 12001187 empty entries
Column MaterialType has 0 empty entries
Column PublicationYear has 9796934 empty entries
Column Publisher has 9475964 empty entries
Column Subjects has 1729855 empty entries
Column Title has 0 empty entries
Column UsageClass has 0 empty entries
Seems the data is quite complete except for a few columns. The string columns (Creator, Publisher, Subjects) can simply stay empty, so only PublicationYear, which we’ll convert to an integer, needs a default value. Let’s convert the raw archive into a typed one so processing gets easier.
// A basic year parser. It might get some entries wrong, but it's good enough for EDA
Int_t parse_year(std::string year) {
    if(year == "")
        throw std::runtime_error("year is empty");
    if(year[0] == '[' || year[0] == 'c' || year[0] == 'p')
        // Handle the [2000], c2000 and p2000 formats
        return std::stoi(std::string(year.begin()+1, year.begin()+5));
    else if(isdigit(year[0]))
        return std::stoi(std::string(year.begin(), year.begin()+4));
    throw std::runtime_error("cannot parse format");
}
void to_root() {
    // Open the raw archive
    using string = std::string;
    auto rdf = ROOT::RDataFrame("library_raw", "library_raw.root");

    // Create the new archive that we'll be copying into
    auto f = new TFile("library.root", "recreate");
    auto t = new TTree("library", "checkout_data");

    // All the fields
    TDatime date;
    string usage, checkout_type, creator, material, title, publisher;
    std::vector<std::string> subjects;
    Int_t num_checkout, publication_year;

    // Assign columns/branches to the fields
    t->Branch("usage", &usage);
    t->Branch("checkout_type", &checkout_type);
    t->Branch("material", &material);
    t->Branch("checkout_month", &date);
    t->Branch("num_checkout", &num_checkout);
    t->Branch("title", &title);
    t->Branch("creator", &creator);
    t->Branch("subjects", &subjects);
    t->Branch("publisher", &publisher);
    t->Branch("publication_year", &publication_year);

    ProgressDisplay disp(*rdf.Count());
    size_t i = 0;

    rdf.Foreach([&](string usage_, string checkout_type_, string material_,
                    string checkout_year_, string checkout_month_, string checkouts_,
                    string title_, string creator_, string subjects_, string publisher_,
                    string publication_year_) {
        // Copy data into the fields
        usage = usage_;
        checkout_type = checkout_type_;
        material = material_;
        num_checkout = std::stoi(checkouts_);
        title = title_;
        creator = creator_;
        subjects = split(subjects_);
        publisher = publisher_;

        try { publication_year = parse_year(publication_year_); }
        catch(...) { publication_year = 0x7fffffff; }

        date.Set(std::stoi(checkout_year_), std::stoi(checkout_month_), 0, 0, 0, 0);

        t->Fill();

        i++;
        if(i % 100000 == 0)
            disp.update(i);
    }, {"UsageClass", "CheckoutType", "MaterialType", "CheckoutYear", "CheckoutMonth",
        "Checkouts", "Title", "Creator", "Subjects", "Publisher", "PublicationYear"});

    f->Write();
}
Now, let’s see… I’m curious how many borrowed books come from each publication year. The code I use is not accurate (it counts the number of entries in the dataset, and the same title can appear in many entries for different dates), but it should give me a rough idea.
root [0] rdf = ROOT::RDataFrame("library", "library.root");
root [1] h1 = rdf.Filter("publication_year != 0x7fffffff").Histo1D("publication_year");
root [2] h1->Draw();
root [3] c1->SetLogy();

As expected, most books are from the late 20th century to now. Let’s zoom in a bit.
root [5] h1 = rdf.Filter("publication_year != 0x7fffffff && publication_year > 1750").Histo1D("publication_year");
root [6] c1->SetLogy(kFALSE);
root [7] h1->Draw();

I’m quite curious which books from around year zero are being borrowed. My guess is that they are historical books, the Bible, etc.?
root [11] rdf_oldbook = rdf.Filter("publication_year < 200 && publication_year > -200");
root [12] set<string> books;
root [13] rdf_oldbook.Foreach([&](string title){books.insert(title);}, {"title"});
root [14] books.size()
1744
root [15] for(auto title : books) cout << title << endl;
...
A contemporary concept of bowing technique for the double bass [music] / by Frederick Zimmeramann.
A day in the death of Joe Egg : a play / by Peter Nichols.
A field guide to the mammals : North America north of Mexico / text and maps by William Henry Burt ; illustrations by Richard Philip Grossenheider ; sponsored by the National Audubon Society and National Wildlife Federation.
...
Wait, what? Ohh… shi*… They must be using two-digit year codes. Let’s modify the year-parsing code and try again (a sketch of the fix is below).
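The change itself is small: re-map two-digit year codes onto full years before storing them. The helper below is my own guess at the encoding (anything below 20 is assumed to mean 20xx, the rest 19xx), not something from the dataset documentation:

// Sketch: turn a two-digit year code into a full year.
// Assumption: codes below 20 mean 20xx, everything else means 19xx.
Int_t fix_two_digit_year(Int_t year) {
    if(year >= 100)
        return year;                // already a full year, leave it alone
    return year < 20 ? 2000 + year  // e.g. 5  -> 2005
                     : 1900 + year; // e.g. 64 -> 1964
}

Calling this on the result of parse_year before storing publication_year is enough, then the archive gets regenerated.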

Looks better. Well, what’s the deal with the spike around 1000 AD? Let’s see…
root [0] rdf = ROOT::RDataFrame("library", "library.root");
root [1] rdf_oldbook = rdf.Filter("publication_year < 1100 && publication_year > 800");
root [2] map<string, int> books;
root [3] rdf_oldbook.Foreach([&](string title){books[title] += 1;}, {"title"});
root [4] books
(std::map<std::string, int> &) { "Der Fall Gouffé; Roman." => 3, "Hilda Hen's search / Mary Wormell." => 59, "Prehistoric cave art in Northern Spain, Asturias / Magin Berenguer ; translated into English by Henry Hinds ; introduction by Fredo Arias de la Canal." => 28 }
There are only three books, and judging from the titles they look like books from the modern era. Maybe typos in the data? Well, it’s a small enough sample that I don’t really care.
The missing data
After printing the statistics on missing data, I’m curious which books are missing their creator and what period they are from.
void missing_creator() {
    // Now we can use the new and improved RDataFrame interface
    auto rdf = ROOT::RDataFrame("library", "library.root");
    // Select entries that are missing a creator but do have a publication year
    auto rdf_missing = rdf.Filter("creator == \"\" && publication_year != 0x7fffffff");

    std::map<std::string, int> books;
    rdf_missing.Foreach([&](std::string title, int year) { books[title] = year; },
                        {"title", "publication_year"});

    auto h1 = new TH1F("h1", "books with missing creator", 100, 1850, 2019);
    for(auto [title, year] : books)
        h1->Fill(year);
    h1->Draw();
}

Oh… so this is not just a problem of old books missing information about their creators. Let’s normalize the plot so we can see what percentage of books have missing data.
void missing_creator_pct() {
    // Now we can use the new and improved RDataFrame interface
    auto rdf = ROOT::RDataFrame("library", "library.root");
    auto rdf_with_year = rdf.Filter("publication_year != 0x7fffffff");
    auto rdf_missing = rdf_with_year.Filter("creator == \"\"");

    std::map<std::string, int> books;
    std::map<std::string, int> all_books;
    rdf_with_year.Foreach([&](std::string title, int year) { all_books[title] = year; },
                          {"title", "publication_year"});
    rdf_missing.Foreach([&](std::string title, int year) { books[title] = year; },
                        {"title", "publication_year"});

    auto h1 = new TH1F("h1", "% of books with missing creator", 100, 1850, 2019);
    for(auto [title, year] : books)
        h1->Fill(year);

    auto h2 = new TH1F("h2", "all books with a known year", 100, 1850, 2019);
    for(auto [title, year] : all_books)
        h2->Fill(year);

    // Divide bin by bin to get the fraction of books missing a creator per year
    for(int i = 0; i <= 100; i++) {
        if(h2->GetBinContent(i) == 0)
            continue;
        h1->SetBinContent(i, h1->GetBinContent(i) / h2->GetBinContent(i));
    }
    h1->Draw();
}
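(As an aside, the manual bin-by-bin loop at the end can also be written with TH1::Divide, which does essentially the same element-wise ratio:)

// Alternative to the manual loop: bin-by-bin ratio via TH1::Divide
auto h_ratio = (TH1F*)h1->Clone("h_ratio");
h_ratio->SetTitle("% of books with missing creator");
h_ratio->Divide(h2);   // h_ratio(i) = h1(i) / h2(i)
h_ratio->Draw();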

It looks bad. Why? Maybe people were just being lazy and didn’t bother to type in the creators? Let’s find out how many entries don’t have a creator.
root [0] rdf = ROOT::RDataFrame("library", "library.root");
root [1] num_entries = *rdf.Count();
root [2] num_entries_no_creator = *(rdf.Filter("creator == \"\"").Count());
root [3] (double)num_entries_no_creator/num_entries
(double) 0.35279131
Only 35% of the entries don’t have a creator? That doesn’t fit the plot I just drew. Maybe there’s a high correlation between having no publication year and having no creator?
root [5] num_entries_no_year = *(rdf.Filter("publication_year == 0x7fffffff").Count());
root [6] (double)num_entries_no_year/num_entries
(double) 0.68926208
That still makes no sense. 35% of entries have no creator and about 69% have no publication year, yet in the per-year plot roughly 90% of books are missing a creator. Maybe the books without creators are borrowed more often, and thus produce more entries?
root [7] num_entries_no_both = *(rdf.Filter("creator == \"\" && publication_year == 0x7fffffff").Count());
root [8] (double)num_entries_no_both/num_entries
(double) 0.28426520
Bingo! 28% of the entries are missing both the year and the creator, while 35% are missing the creator. Bayesian math tells us that about 80.6% (0.284/0.353) of entries without a creator also lack a publication date.
Now let’s look at the subjects of the borrowed books.
root [0] rdf = ROOT::RDataFrame("library", "library.root");
root [1] unordered_map<string, int> subjects;
root [2] rdf.Foreach([&](vector<string> subjs){for(auto sub : subjs) subjects[sub]++;}, {"subjects"});
root [3] subjects.size()
(unsigned long) 611269
I was hoping to make a correlation matrix of the subjects. With over 600,000 distinct subjects, that’s not happening.
root [4] vector<pair<string, int>> subject_counts(subjects.begin(), subjects.end());
root [5] sort(subject_counts.begin(), subject_counts.end(), [](auto a, auto b){return a.second>b.second;});
root [6] for(int i=0;i<10;i++) cout << subject_counts[i].first << endl;
Fiction
Video recordings for the hearing impaired
Feature films
Nonfiction
Mystery fiction
Romance
Graphic novels
Mystery
Fiction films
So… most people are borrowing films. I was totally not expecting "Video recordings for the hearing impaired" to show up, but it is in fact the 2nd item in the list. Interesting… And why are feature films so near the top of the list?
Back to the beginning
Oops, I’ve gone too deep into the rabbit hole of the missing creators. Let’s go back and do some basic analysis, starting with a plot of the number of entries vs. time. Unfortunately, as of writing this post, ROOT can’t conveniently use dates as an axis here, so Unix time it is!
Also, since we’re no longer using custom Foreach loops, we can finally turn on EnableImplicitMT for parallel processing!
root [0] ROOT::EnableImplicitMT();
root [1] rdf = ROOT::RDataFrame("library", "library.root");
root [2] unixtime = rdf.Define("unixtime", [](TDatime t){return t.Convert();}, {"checkout_month"});
root [3] h1 = unixtime.Histo1D("unixtime");
root [4] h1->Draw("L");

Seems there’s a steady increase in the number of borrowings. Looks like a linear increase? It definitely looks like it!
root [5] h1->Fit("pol1");

Now, let’s visualize the actual number of books borrowed, and properly set up the x-axis this time (remember, before we were only counting entries).
root [0] rdf = ROOT::RDataFrame("library", "library.root");
root [1] unixtime = rdf.Define("unixtime", [](TDatime t){return t.Convert();}, {"checkout_month"});
root [2] min_time = *unixtime.Min("unixtime");
root [3] max_time = *unixtime.Max("unixtime");
root [4] h1 = new TH1D("h1", "# of borrowings", 100, min_time, max_time);
root [5] unixtime.Foreach([&](unsigned int uxtime, int count){ for(int i=0;i<count;i++) h1->Fill(uxtime);}, {"unixtime", "num_checkout"});
root [6] h1->Draw("L");
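(RDataFrame can also do this weighted fill on its own: Histo1D accepts a weight column, which avoids the explicit Foreach and works with implicit multithreading. Roughly:)

// Equivalent weighted histogram, letting RDataFrame do the filling
auto h2 = unixtime.Histo1D({"h2", "# of borrowings", 100, (double)min_time, (double)max_time},
                           "unixtime", "num_checkout");
h2->Draw("L");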

And let’s fit again.
root [7] h1->Fit("pol1");

I mean, the linear fit is good, but it’s unrealistic. Let’s try a sigmoid; maybe sub-linear growth will look better.
root [8] f1 = new TF1("f1","[1]/(1.0+exp(-[0]*x+[2]))",min_time, max_time);
root [9] h1->Fit("f1");
Warning in <Fit>: Abnormal termination of minimization.
 FCN=1.60201e+07 FROM MIGRAD    STATUS=FAILED        131 CALLS         132 TOTAL
                     EDM=24.0194    STRATEGY= 1      ERR MATRIX NOT POS-DEF
  EXT PARAMETER                APPROXIMATE        STEP         FIRST
  NO.   NAME      VALUE            ERROR          SIZE      DERIVATIVE
   1  p0           0.00000e+00   3.78518e+00  -0.00000e+00   2.61551e+00
   2  p1           2.05311e+06   2.51338e+03   0.00000e+00   1.04223e-07
   3  p2           0.00000e+00   9.39291e-01  -0.00000e+00  -1.06962e-01
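(This isn’t too surprising: all three parameters start at zero, so MIGRAD has little to work with. Seeding the fit with rough starting values via TF1::SetParameters would give it a fairer chance; the numbers below are just eyeballed guesses, not fitted values.)

// Give the sigmoid rough starting values before fitting (eyeballed guesses)
f1->SetParameters(1e-8,   // [0]: slope, of order 1/(time span in seconds)
                  2e6,    // [1]: plateau, around the histogram's maximum
                  10);    // [2]: offset
h1->Fit("f1");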

Maybe the sigmoid wasn’t the brightest idea. I’m curious about the FFT of the plot: it looks like a wave, so there should be a dominant frequency.
root [10] TVirtualFFT::SetTransform(0);
root [11] TH1* hm = nullptr;
root [12] hm = h1->FFT(hm, "MAG");
root [13] hm->SetTitle("Magnitude of the 1st transform");
root [14] hm->SetBinContent(1, 0); // set DC to 0
root [15] hm->Draw();

There definitely is. We can calculate the frequency as 30*(100/(max_time-min_time))*3600*24 = 0.019327406 cycles per day, i.e. a period of roughly 51 days, about every two months. Possibly the library has some clean-up or activity every two months?
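(If you’d rather not read the peak position off the plot by eye, the same estimate can be pulled out programmatically; a sketch, reusing hm, min_time and max_time from above:)

// Sketch: locate the dominant FFT bin and convert it to a period in days.
// Bin k of the magnitude histogram corresponds to k full cycles over the time span.
int    k          = hm->GetMaximumBin() - 1;          // -1 because bin 1 is the DC term
double span_sec   = max_time - min_time;              // time span covered by h1, in seconds
double period_day = span_sec / k / (3600.0 * 24.0);   // length of one cycle, in days
std::cout << "dominant period ~ " << period_day << " days" << std::endl;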
I guess it’s time to end this post. I have learned a lot from doing the entire analysis in C++ in a memory-frugal way. Hopefully you have learned something too!