

Exploratory Data Analysis on a large dataset using ROOT
Source: https://mightynotes.wordpress.com/2019/09/04/exploratory-data-analysis-on-a-large-dataset-using-root/

Having been messing with the physicists’ toy for a while, and not being a Python fan (Python makes it too easy to write dirty code), I find ROOT to be a fascinating tool: it scales almost perfectly and can be used in a way that needs basically no memory. In this post I’ll be putting ROOT to real work, in a scalable way.
I’ll be using some advanced C++ (lambdas, automatic type deduction, STL) and ROOT/cling trickery in this analysis. This is also an exercise to push myself to my limits and learn more about ROOT. You have been warned.
The dataset
After spending hours on Kaggle trying to find an easy-to-process yet large enough dataset, I ended up with the 8 GB Seattle Checkouts by Title dataset, which is one huge CSV file. Looks good enough. First I’ll need to convert the CSV into something ROOT can handle. From experience we know there must be missing data, so let’s dump the CSV into a ROOT archive of all strings first and worry about types later.
Unfortunately, ROOT’s CSV parser is a very basic one and won’t parse our dataset properly. So we’ll use an external one.
#include "csv.hpp" // assumes the single-header csv-parser library providing csv::CSVReader

void to_root_raw() {
    // Create a reader for the CSV file
    csv::CSVReader reader("checkouts-by-title.csv");

    // Create a ROOT file. For the first pass, we'll be saving everything as strings
    auto f = new TFile("library_raw.root", "recreate");
    TTree *t = new TTree("library_raw", "raw data from csv file");
    auto fields = reader.get_col_names();
    auto buf = std::vector<std::string>(fields.size());

    // Assign a string buffer to each column
    for(size_t i = 0; i < fields.size(); i++)
        t->Branch(fields[i].c_str(), &buf[i]);

    // Copy data from CSV to ROOT
    for(auto& row : reader) {
        for(size_t i = 0; i < fields.size(); i++)
            buf[i] = row[fields[i]].get();
        t->Fill();
    }

    // Write the data to the ROOT file
    f->Write();
}
Now we have library_raw.root. Let’s find out how much of the data is empty.
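One thing first: the ProgressDisplay used in the snippets below is not a ROOT class, just a tiny progress printer of mine. A minimal sketch of such a helper could look like this (the exact implementation doesn’t matter):

#include <cstddef>
#include <iostream>

// Minimal progress printer used by the snippets below (a sketch; any implementation will do)
struct ProgressDisplay {
    size_t total;
    explicit ProgressDisplay(size_t total_) : total(total_) {}
    void update(size_t done) {
        // '\r' rewrites the same console line instead of scrolling
        std::cout << "\r" << done << "/" << total
                  << " (" << 100.0 * done / total << "%)" << std::flush;
    }
};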
void count_zeros() {
    // Now we can use the new and improved RDataFrame interface
    auto rdf = ROOT::RDataFrame("library_raw", "library_raw.root");

    auto columns = rdf.GetColumnNames();
    auto empty_lst = std::map<std::string, size_t>();

    ProgressDisplay disp(*rdf.Count() * columns.size());
    size_t finished = 0;

    for(auto column : columns) {
        size_t empty_count = 0;
        rdf.Foreach([&](std::string s) {
            empty_count += (s == "");
            finished += 1;
            if(finished % 100000 == 0)
                disp.update(finished);
        }, {column});
        empty_lst[column] = empty_count;
    }
    std::cout << "\n";

    for(auto [column, count] : empty_lst)
        std::cout << "Column " << column << " has " << count << " empty entries" << std::endl;
}
And we get…:
Column CheckoutMonth has 0 empty entries
Column CheckoutType has 0 empty entries
Column CheckoutYear has 0 empty entries
Column Checkouts has 0 empty entries
Column Creator has 12001187 empty entries
Column MaterialType has 0 empty entries
Column PublicationYear has 9796934 empty entries
Column Publisher has 9475964 empty entries
Column Subjects has 1729855 empty entries
Column Title has 0 empty entries
Column UsageClass has 0 empty entries
Seems the data is quite complete except for a few columns. The string columns (Creator, Publisher, Subjects) can simply stay empty, so only PublicationYear, which we’ll convert to an integer, needs a default value. Let’s convert the raw archive into a typed one so processing gets easier.
// A basic year parser. It might get some entries wrong, but it's good enough for EDA
Int_t parse_year(std::string year) {
    if(year == "")
        throw std::runtime_error("year is empty");
    if(year[0] == '[' || year[0] == 'c' || year[0] == 'p')
        // Handle the [2000], c2000 and p2000 formats
        return std::stoi(std::string(year.begin()+1, year.begin()+5));
    else if(isdigit(year[0]))
        return std::stoi(std::string(year.begin(), year.begin()+4));
    throw std::runtime_error("cannot parse format");
}
void to_root() {
    // Open the raw archive
    using string = std::string;
    auto rdf = ROOT::RDataFrame("library_raw", "library_raw.root");

    // Create the new archive that we'll be copying into
    auto f = new TFile("library.root", "recreate");
    auto t = new TTree("library", "checkout_data");

    // All the fields
    TDatime date;
    string usage, checkout_type, creator, material, title, publisher;
    std::vector<std::string> subjects;
    Int_t num_checkout, publication_year;

    // Assign columns/branches to the fields
    t->Branch("usage", &usage);
    t->Branch("checkout_type", &checkout_type);
    t->Branch("material", &material);
    t->Branch("checkout_month", &date);
    t->Branch("num_checkout", &num_checkout);
    t->Branch("title", &title);
    t->Branch("creator", &creator);
    t->Branch("subjects", &subjects);
    t->Branch("publisher", &publisher);
    t->Branch("publication_year", &publication_year);

    ProgressDisplay disp(*rdf.Count());
    size_t i = 0;

    rdf.Foreach([&](string usage_, string checkout_type_, string material_,
                    string checkout_year_, string checkout_month_, string checkouts_,
                    string title_, string creator_, string subjects_, string publisher_,
                    string publication_year_) {
        // Copy data into the fields
        usage = usage_;
        checkout_type = checkout_type_;
        material = material_;
        num_checkout = std::stoi(checkouts_);
        title = title_;
        creator = creator_;
        subjects = split(subjects_);
        publisher = publisher_;

        try { publication_year = parse_year(publication_year_); }
        catch(...) { publication_year = 0x7fffffff; }

        date.Set(std::stoi(checkout_year_), std::stoi(checkout_month_), 0, 0, 0, 0);

        t->Fill();

        i++;
        if(i % 100000 == 0)
            disp.update(i);
    }, {"UsageClass", "CheckoutType", "MaterialType", "CheckoutYear", "CheckoutMonth",
        "Checkouts", "Title", "Creator", "Subjects", "Publisher", "PublicationYear"});

    f->Write();
}
Now, let’s see… I’m curious how many borrowed books come from each publication year. The code I use is not accurate (it counts the number of entries in the dataset, and the same title can appear in many entries for different dates), but it should give me a rough idea.
root [0] rdf = ROOT::RDataFrame("library", "library.root");
root [1] h1 = rdf.Filter("publication_year != 0x7fffffff").Histo1D("publication_year");
root [2] h1->Draw();
root [3] c1->SetLogy();

As expected, most books are from the late 20th century to now. Let’s zoom in a bit.
root [5] h1 = rdf.Filter("publication_year != 0x7fffffff && publication_year > 1750").Histo1D("publication_year");
root [6] c1->SetLogy(kFALSE);
root [7] h1->Draw();

I’m quite curious which books from around year zero are being borrowed. My guess is that they are historical books, the Bible, etc.?
root [11] rdf_oldbook = rdf.Filter("publication_year < 200 && publication_year > -200");
root [12] set<string> books;
root [13] rdf_oldbook.Foreach([&](string title){books.insert(title);}, {"title"});
root [14] books.size()
1744
root [15] for(auto title : books) cout << title << endl;
...
A contemporary concept of bowing technique for the double bass [music] / by Frederick Zimmeramann.
A day in the death of Joe Egg : a play / by Peter Nichols.
A field guide to the mammals : North America north of Mexico / text and maps by William Henry Burt ; illustrations by Richard Philip Grossenheider ; sponsored by the National Audubon Society and National Wildlife Federation.
...
Wait, what? Ohh… shi*… They must be using two-digit year codes. Let’s modify the year-parsing code and try again (a sketch of the fix is below).
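The change itself is small: re-map two-digit year codes onto full years before storing them. The helper below is my own guess at the encoding (anything below 20 is assumed to mean 20xx, the rest 19xx), not something from the dataset documentation:

// Sketch: turn a two-digit year code into a full year.
// Assumption: codes below 20 mean 20xx, everything else means 19xx.
Int_t fix_two_digit_year(Int_t year) {
    if(year >= 100)
        return year;                // already a full year, leave it alone
    return year < 20 ? 2000 + year  // e.g. 5  -> 2005
                     : 1900 + year; // e.g. 64 -> 1964
}

Calling this on the result of parse_year before storing publication_year is enough, then the archive gets regenerated.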

Looks better. Well, what’s the deal with the spike around 1000 AD? Let’s see…
root [0] rdf = ROOT::RDataFrame("library", "library.root");
root [1] rdf_oldbook = rdf.Filter("publication_year < 1100 && publication_year > 800");
root [2] map<string, int> books;
root [3] rdf_oldbook.Foreach([&](string title){books[title] += 1;}, {"title"});
root [4] books
(std::map<std::string, int> &) { "Der Fall Gouffé; Roman." => 3, "Hilda Hen's search / Mary Wormell." => 59, "Prehistoric cave art in Northern Spain, Asturias / Magin Berenguer ; translated into English by Henry Hinds ; introduction by Fredo Arias de la Canal." => 28 }
There are only three books, and judging from the titles they look like books from the modern era. Maybe typos in the data? Well, it’s a small enough sample that I don’t really care.
The missing data
After printing the statistics on missing data, I’m curious which books are missing their creator and what period they are from.
void missing_creator() {
    // Now we can use the new and improved RDataFrame interface
    auto rdf = ROOT::RDataFrame("library", "library.root");
    // Select entries that are missing a creator but do have a publication year
    auto rdf_missing = rdf.Filter("creator == \"\" && publication_year != 0x7fffffff");

    std::map<std::string, int> books;
    rdf_missing.Foreach([&](std::string title, int year) { books[title] = year; },
                        {"title", "publication_year"});

    auto h1 = new TH1F("h1", "books with missing creator", 100, 1850, 2019);
    for(auto [title, year] : books)
        h1->Fill(year);
    h1->Draw();
}

Oh… so this is not just a problem of old books missing information about their creators. Let’s normalize the plot so we can see what percentage of books have missing data.
void missing_creator_pct() {
    // Now we can use the new and improved RDataFrame interface
    auto rdf = ROOT::RDataFrame("library", "library.root");
    auto rdf_with_year = rdf.Filter("publication_year != 0x7fffffff");
    auto rdf_missing = rdf_with_year.Filter("creator == \"\"");

    std::map<std::string, int> books;
    std::map<std::string, int> all_books;
    rdf_with_year.Foreach([&](std::string title, int year) { all_books[title] = year; },
                          {"title", "publication_year"});
    rdf_missing.Foreach([&](std::string title, int year) { books[title] = year; },
                        {"title", "publication_year"});

    auto h1 = new TH1F("h1", "% of books with missing creator", 100, 1850, 2019);
    for(auto [title, year] : books)
        h1->Fill(year);

    auto h2 = new TH1F("h2", "all books with a known year", 100, 1850, 2019);
    for(auto [title, year] : all_books)
        h2->Fill(year);

    // Divide bin by bin to get the fraction of books missing a creator per year
    for(int i = 0; i <= 100; i++) {
        if(h2->GetBinContent(i) == 0)
            continue;
        h1->SetBinContent(i, h1->GetBinContent(i) / h2->GetBinContent(i));
    }
    h1->Draw();
}
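(As an aside, the manual bin-by-bin loop at the end can also be written with TH1::Divide, which does essentially the same element-wise ratio:)

// Alternative to the manual loop: bin-by-bin ratio via TH1::Divide
auto h_ratio = (TH1F*)h1->Clone("h_ratio");
h_ratio->SetTitle("% of books with missing creator");
h_ratio->Divide(h2);   // h_ratio(i) = h1(i) / h2(i)
h_ratio->Draw();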

It looks bad. Why? Maybe people were just being lazy and didn’t bother to type in the creators? Let’s find out how many entries don’t have a creator.
root [0] rdf = ROOT::RDataFrame("library", "library.root");
root [1] num_entries = *rdf.Count();
root [2] num_entries_no_creator = *(rdf.Filter("creator == \"\"").Count());
root [3] (double)num_entries_no_creator/num_entries
(double) 0.35279131
Only 35% of the entries don’t have a creator? That doesn’t fit the plot I just drew. Maybe there’s a high correlation between having no publication year and having no creator?
root [5] num_entries_no_year = *(rdf.Filter("publication_year == 0x7fffffff").Count());
root [6] (double)num_entries_no_year/num_entries
(double) 0.68926208
That still makes no sense. 35% of entries have no creator and about 69% have no publication year, yet in the per-year plot roughly 90% of books are missing a creator. Maybe the books without creators are borrowed more often, and thus produce more entries?
root [7] num_entries_no_both = *(rdf.Filter("creator == \"\" && publication_year == 0x7fffffff").Count());
root [8] (double)num_entries_no_both/num_entries
(double) 0.28426520
Bingo! 28% of the entries are missing both the year and the creator, while 35% are missing the creator. Bayesian math tells us that about 80.6% (0.284/0.353) of entries without a creator also lack a publication date.
Now let’s look at the subjects of the borrowed books.
root [0] rdf = ROOT::RDataFrame("library", "library.root");
root [1] unordered_map<string, int> subjects;
root [2] rdf.Foreach([&](vector<string> subjs){for(auto sub : subjs) subjects[sub]++;}, {"subjects"});
root [3] subjects.size()
(unsigned long) 611269
I was hoping to make a correlation matrix of the subjects. With over 600,000 distinct subjects, that’s not happening.
root [4] vector<pair<string, int>> subject_counts(subjects.begin(), subjects.end());
root [5] sort(subject_counts.begin(), subject_counts.end(), [](auto a, auto b){return a.second>b.second;});
root [6] for(int i=0;i<10;i++) cout << subject_counts[i].first << endl;
Fiction
Video recordings for the hearing impaired
Feature films
Nonfiction
Mystery fiction
Romance
Graphic novels
Mystery
Fiction films
So… most people are borrowing films. I was totally not expecting "Video recordings for the hearing impaired" to show up, but it is in fact the 2nd item in the list. Interesting… And why are feature films so near the top of the list?
Back to the beginning
Oops, I’ve gone too deep into the rabbit hole of the missing creators. Let’s go back and do some basic analysis, starting with a plot of the number of entries vs. time. Unfortunately, as of writing this post, ROOT can’t conveniently use dates as an axis here, so Unix time it is!
Also, since we’re no longer using custom Foreach loops, we can finally turn on EnableImplicitMT for parallel processing!
root [0] ROOT::EnableImplicitMT();
root [1] rdf = ROOT::RDataFrame("library", "library.root");
root [2] unixtime = rdf.Define("unixtime", [](TDatime t){return t.Convert();}, {"checkout_month"});
root [3] h1 = unixtime.Histo1D("unixtime");
root [4] h1->Draw("L");

Seems there’s a steady increase in the number of borrowings. Looks like a linear increase? It definitely looks like it!
root [5] h1->Fit("pol1");

Now, let’s visualize the actual number of books borrowed, and properly set up the x-axis this time (remember, before we were only counting entries).
root [0] rdf = ROOT::RDataFrame("library", "library.root");
root [1] unixtime = rdf.Define("unixtime", [](TDatime t){return t.Convert();}, {"checkout_month"});
root [2] min_time = *unixtime.Min("unixtime");
root [3] max_time = *unixtime.Max("unixtime");
root [4] h1 = new TH1D("h1", "# of borrowings", 100, min_time, max_time);
root [5] unixtime.Foreach([&](unsigned int uxtime, int count){ for(int i=0;i<count;i++) h1->Fill(uxtime);}, {"unixtime", "num_checkout"});
root [6] h1->Draw("L");
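(RDataFrame can also do this weighted fill on its own: Histo1D accepts a weight column, which avoids the explicit Foreach and works with implicit multithreading. Roughly:)

// Equivalent weighted histogram, letting RDataFrame do the filling
auto h2 = unixtime.Histo1D({"h2", "# of borrowings", 100, (double)min_time, (double)max_time},
                           "unixtime", "num_checkout");
h2->Draw("L");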

And let’s fit again.
root [7] h1->Fit("pol1");

I mean, the linear fit is good, but it’s unrealistic. Let’s try a sigmoid; maybe sub-linear growth will look better.
root [8] f1 = new TF1("f1","[1]/(1.0+exp(-[0]*x+[2]))",min_time, max_time);
root [9] h1->Fit("f1");
Warning in <Fit>: Abnormal termination of minimization.
 FCN=1.60201e+07 FROM MIGRAD    STATUS=FAILED        131 CALLS         132 TOTAL
                     EDM=24.0194    STRATEGY= 1      ERR MATRIX NOT POS-DEF
  EXT PARAMETER                APPROXIMATE        STEP         FIRST
  NO.   NAME      VALUE            ERROR          SIZE      DERIVATIVE
   1  p0           0.00000e+00   3.78518e+00  -0.00000e+00   2.61551e+00
   2  p1           2.05311e+06   2.51338e+03   0.00000e+00   1.04223e-07
   3  p2           0.00000e+00   9.39291e-01  -0.00000e+00  -1.06962e-01
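(This isn’t too surprising: all three parameters start at zero, so MIGRAD has little to work with. Seeding the fit with rough starting values via TF1::SetParameters would give it a fairer chance; the numbers below are just eyeballed guesses, not fitted values.)

// Give the sigmoid rough starting values before fitting (eyeballed guesses)
f1->SetParameters(1e-8,   // [0]: slope, of order 1/(time span in seconds)
                  2e6,    // [1]: plateau, around the histogram's maximum
                  10);    // [2]: offset
h1->Fit("f1");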

Maybe the sigmoid wasn’t the brightest idea. I’m curious about the FFT of the plot: it looks like a wave, so there should be a dominant frequency.
root [10] TVirtualFFT::SetTransform(0);
root [11] TH1* hm = nullptr;
root [12] hm = h1->FFT(hm, "MAG");
root [13] hm->SetTitle("Magnitude of the 1st transform");
root [14] hm->SetBinContent(1, 0); // set DC to 0
root [15] hm->Draw();

There definitely is. We can calculate the frequency as 30*(100/(max_time-min_time))*3600*24 = 0.019327406 cycles per day, i.e. a period of roughly 51 days, about every two months. Possibly the library has some clean-up or activity every two months?
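(If you’d rather not read the peak position off the plot by eye, the same estimate can be pulled out programmatically; a sketch, reusing hm, min_time and max_time from above:)

// Sketch: locate the dominant FFT bin and convert it to a period in days.
// Bin k of the magnitude histogram corresponds to k full cycles over the time span.
int    k          = hm->GetMaximumBin() - 1;          // -1 because bin 1 is the DC term
double span_sec   = max_time - min_time;              // time span covered by h1, in seconds
double period_day = span_sec / k / (3600.0 * 24.0);   // length of one cycle, in days
std::cout << "dominant period ~ " << period_day << " days" << std::endl;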
I guess it’s time to end this post. I have learned a lot from doing the entire analysis in C++ in a memory-frugal way. Hopefully you have learned something too!