Polling a filesystem to detect files creation might be cumbersome and inefficient. Linux brought an answer to this called inotify: you register a listener to a folder and get notified asynchronously when an change occurs in that folder.
I started to ask myself the question back in 2013: what about HDFS, how to get notified when a new file is created in HDFS?
Use-cases leveraging this feature are multiples.
- The most obvious one being a processing workflow, when a file is created, the next stage of processing could start immediately: instead of polling HDFS to detect, say, a _SUCCESS file in a particular folder, the next stage would start directly by being notified once the file created
- The most interesting to replace would be files monitoring: instead of recursively polling a entire directory tree, which might be consuming a significant amount of resources in HDFS, the monitoring system would be notified on new file creation in a given folder.
Convinced about the value of such system, I started working on it with my colleagues. After several iterations and multiple failures, the current version is running in production for several months, handling on average 1000's of events per seconds, and an order of magnitude more with peaks.
Key features of Trumpet are:
- Non-intrusive to the NameNode: as a Hadoop admin, my first priority is to protect the NameNode. Trumpet server is then running in a different JVM than the NameNode or the JournalNode, extracting the transactions from the edits log files, the HDFS redo log files. This actually happened to be pretty straight forward, and (almost :)) without touching the NameNode code.
- Scalability: Trumpet is leveraging Kafka -- its pub-sub messaging model fits perfectly what I'm trying to achieve with Trumpet: reliably broadcasting HDFS events to multiple consumers. The current implementation can server hundreds of subscribers without problem -- actually, Trumpet scales as well as your Kafka cluster.
- Highly Available: Trumpet is designed from day one to run in HA mode, with multiple processes electing a leader, and recovering from a previous state.
- Ops friendly: I'm Hadoop ops, and I like making my life easier. Trumpet provides all the tools to validate, dig, watch, monitor a running installation. This also includes the support for rolling upgrade!
- Simple: in total, Trumpet is about 3K line of codes. Most of them are boilerplate for exception handling. The main components of Trumpet probably fits in a 100's lines of code. Trumpet is only doing one thing, but is doing it rather well.
And that's it. Install it and use it, and build awesome reactive pipeline on top of it!
Oh yeah, and if you don't have Kafka running along side your Hadoop cluster, well, you definitely should -- Kafka is awesome!