Apache Flume is a real-time ETL tool for data warehouse platforms. It consists of different types of components, all of which are managed at runtime by Flume’s lifecycle and supervisor mechanism. This article walks through the source code of Flume’s component lifecycle management.
Flume’s source code can be downloaded from GitHub. It’s a Maven project, so we can import it into an IDE for efficient code reading. The following is the main structure of the project:
The main entry point of the Flume agent is in the org.apache.flume.node.Application class of the flume-ng-node module. The following is an abridged version of its source code:
The process can be illustrated as follows:
- Parse command line arguments with commons-cli, including the Flume agent’s name, configuration method and path.
- Configurations can be provided via a properties file or ZooKeeper. Both providers support live-reload, i.e. we can update component settings without restarting the agent.
  - File-based live-reload is implemented by a background thread that checks the last modification time of the file.
  - ZooKeeper-based live-reload is provided by Curator’s NodeCache recipe, which uses ZooKeeper’s watch functionality underneath.
- If live-reload is on (the default), configuration providers will add themselves to the application’s component list, and once the application is started, LifecycleSupervisor will start the provider and trigger the reload event to parse the configuration and load all defined components.
- If live-reload is off, configuration providers will parse the file immediately and start all components, also supervised by LifecycleSupervisor.
- Finally, add a JVM shutdown hook via Runtime#addShutdownHook, which in turn invokes Application#stop to shut down the Flume agent.
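
The file-based live-reload step in the list above can be sketched as a small polling watcher. This is a minimal illustration of the technique, not Flume’s actual implementation: the class name PollingFileWatcher, the onChange callback and the polling interval are all assumptions made for this example.

```java
import java.io.File;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch: reloads a configuration when the file's mtime advances. */
class PollingFileWatcher {
    private final File file;
    private final Runnable onChange;   // assumed callback that re-parses the configuration
    private long lastSeen = 0L;        // 0 means "never loaded", so the first check loads
    private ScheduledExecutorService executor;

    PollingFileWatcher(File file, Runnable onChange) {
        this.file = file;
        this.onChange = onChange;
    }

    /** Runs one check; returns true if a reload was triggered. */
    synchronized boolean checkOnce() {
        long modified = file.lastModified(); // returns 0 if the file does not exist
        if (modified > lastSeen) {
            lastSeen = modified;
            onChange.run();
            return true;
        }
        return false;
    }

    /** Starts a daemon background thread that polls at a fixed delay. */
    synchronized void start(long intervalSeconds) {
        executor = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "conf-file-poller");
            t.setDaemon(true);
            return t;
        });
        executor.scheduleWithFixedDelay(() -> {
            try {
                checkOnce();
            } catch (RuntimeException e) {
                // Swallow failures so one bad reload does not cancel future polls.
                System.err.println("Reload failed: " + e);
            }
        }, 0, intervalSeconds, TimeUnit.SECONDS);
    }

    /** Stops the polling thread if it was started. */
    synchronized void stop() {
        if (executor != null) {
            executor.shutdownNow();
        }
    }
}
```

Comparing modification times keeps the check cheap, at the cost of picking up changes only on the next poll rather than immediately.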
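
The last step, registering a shutdown hook that stops the agent, might look roughly like the sketch below. AgentSketch and its methods are hypothetical stand-ins for the Application class, which does considerably more work when stopping; only Runtime#addShutdownHook is the real JDK API.

```java
/** Hypothetical stand-in for the agent; only illustrates the shutdown-hook wiring. */
class AgentSketch {
    private volatile boolean running;

    void start() {
        running = true;
    }

    /** In a real agent this would stop all supervised components. */
    void stop() {
        running = false;
    }

    boolean isRunning() {
        return running;
    }

    /** Registers a JVM shutdown hook that calls stop(), and returns the hook thread. */
    Thread installShutdownHook() {
        Thread hook = new Thread(this::stop, "agent-shutdown-hook");
        Runtime.getRuntime().addShutdownHook(hook);
        return hook;
    }
}
```

The JVM runs registered hooks on normal exit and on signals such as SIGTERM, but not when the process is killed forcibly, so a hook like this is a best-effort cleanup path.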