Monthly report for March 2018
Making data explorable
The project I’m working on had its dataset scattered in various places: CSV files, MySQL tables, even JSON files. This dataset is meant to be used by my teammates to verify some of their hypotheses, but also to be explored in a more free fashion by other researchers.
To verify hypotheses, my teammates extracted some data from one source, copied it in some pivot format, performed a query on another tool, reformated and reinserted data in the same pivot format… You get the idea, it’s not really something that sounds fun to deal with. Not to mention each of these steps was done manually, making the whole process error-prone.
What seemed obvious to me was to have the whole dataset pieces as easy to reach as possible for the end user, and easy to maintain for me. The consequence of this assumption was: I wanted to find a storage system that allowed me to plug some kind of visualization system on top of it. I could have created a piece of software that could use the existing storage systems, but I didn’t have much time to create it, and virtually no time at all for new features.
After looking at the landscape of visualization tools, Grafana and Kibana were the last two remaining contenders. They both offer good support from the community, they allow custom visualization building, exist in a free version, and most importantly can be hosted on our own servers. This is an important matter to us because we own sensitive data and legally can’t rely on an external storage service. Well, we could, but the paperwork would take ages to be filled, and it would be more expensive anyway.
Both of these tools can read data from Elasticsearch. Since I could request whatever machine resources I needed, chosing storage was actually almost a no-brainer and I went the Elasticsearch way.
Using Grafana would have been a decent choice, but I got the impression that Kibana was more user-friendly. Kibana integration with Elasticsearch is straightforward, and having both services up and running is only a matter of minutes using the official RPM repository. I toyed with the tools using the Docker images, which I find even more convenient than setting up a Vagrant box.
The only downside of this setup is that our data format didn’t fit quite well.
The dataset is made of Users
objects, and UserEvents
belonging to them. I
could have used
Elasticsearch nested objects,
but Kibana
doesn’t support nested type aggregations
and our team needed to be able to study UserEvents
without querying the
related Users
.
In a flat world, this could be done using two separated indices and have each
UserEvent
wear the whole User
properties, but this doesn’t seem quite right
given our use case since this would have required more manipulation from my
teammates. In addition to that, Users
and UserEvents
were completely
separated in the application logic, and I didn’t have the time to migrate data
and update the application code.
Hopefully, I managed to make it right using only a single index for the whole
documents, and use the
join
datatype
to bridge my objects. Using this parent/children relationship induces
sparsity
when documents fields are different –which was true for us–, but storage
requirements were not my main concern and this was an acceptable trade-off.
Tools I used
The Post Javascript Apocalypse - Douglas Crockford
Another good parts about programming languages.
Performant web rendering
I’ve read quite a lot about the page rendering process of the browsers, and found some very interesting articles to begin with: