Apache Flume Tutorial: An Introduction to Log Collection and Aggregation

Installing and Configuring Flume

Flume is a distributed service for collecting, aggregating and moving large amounts of log data from various sources to a centralized data store. Flume can help you manage the flow of data from your applications to your analytics systems in a reliable and scalable way.

In this blog post, we will show you how to install and configure Flume on a Linux machine. We will use Flume 1.9.0 as an example, but you can follow the same steps for other versions.

Step 1: Download Flume

You can download Flume from its official website: https://flume.apache.org/download.html
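Download the binary distribution (for Flume 1.9.0, apache-flume-1.9.0-bin.tar.gz) and extract it to a directory of your choice. The same archive works on any system architecture because Flume runs on the JVM; Flume 1.9.0 needs a Java 1.8 or later runtime.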
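If you prefer the command line, the download and extraction can be scripted. The following is a minimal sketch, assuming that curl and tar are installed and that the release is still hosted under the Apache archive at the URL built below. The function names and the target directory are our own choices for illustration, not part of Flume.

```shell
# Version used throughout this tutorial.
FLUME_VERSION="1.9.0"

# Build the download URL for a given version.
# Assumption: the release follows the usual Apache archive layout.
flume_tarball_url() {
  printf 'https://archive.apache.org/dist/flume/%s/apache-flume-%s-bin.tar.gz\n' "$1" "$1"
}

# Download the tarball into a directory, unpack it there,
# and print the installed version as a quick smoke test.
install_flume() {
  version="$1"
  dest="$2"
  mkdir -p "$dest"
  curl -fsSL "$(flume_tarball_url "$version")" -o "$dest/flume.tar.gz"
  tar -xzf "$dest/flume.tar.gz" -C "$dest"
  "$dest/apache-flume-$version-bin/bin/flume-ng" version
}

# Example (hypothetical target directory):
#   install_flume "$FLUME_VERSION" "$HOME/opt"
```

Nothing is downloaded until you call install_flume yourself. The final flume-ng version call needs Java on the PATH.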

Step 2: Configure Flume

Flume requires a configuration file that specifies the sources, channels and sinks that define the data flow. A source is where Flume receives data from, such as a log file or a socket. A channel is where Flume temporarily stores the data before sending it to a sink. A sink is where Flume delivers the data to, such as HDFS or Kafka.

You can write your own configuration file or start from the template shipped in the conf directory of Flume (flume-conf.properties.template). For this tutorial, we will use the following configuration file, named flume.conf and saved in the conf directory:

# Agent r1: define a source named s1 that reads from /var/log/syslog
r1.sources = s1
r1.sources.s1.type = exec
r1.sources.s1.command = tail -F /var/log/syslog

# Define a channel named c1 that uses memory as storage
r1.channels = c1
r1.channels.c1.type = memory

# Define a sink named k1 that writes to Kafka topic logs
r1.sinks = k1
r1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
r1.sinks.k1.kafka.topic = logs
r1.sinks.k1.kafka.bootstrap.servers = localhost:9092

# Bind the source and sink to the channel
r1.sources.s1.channels = c1
r1.sinks.k1.channel = c1

Step 3: Start Flume

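To start Flume, run the flume-ng script from the Flume installation directory and pass the agent name and the configuration file as arguments. Passing the configuration directory with --conf as well lets Flume pick up log4j.properties and flume-env.sh. For example:

$ bin/flume-ng agent --conf conf --conf-file conf/flume.conf --name r1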

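This will start an agent named r1 that reads from /var/log/syslog and writes to the Kafka topic logs. The name passed with --name must match the prefix used in the configuration file (r1 here), and a Kafka broker must be reachable at localhost:9092.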
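A common stumbling block at this step is an agent name that does not match the prefix used in the configuration file. In that case the agent typically starts but loads no sources, channels or sinks. The following is a small sketch of a pre-flight check, assuming the configuration uses the plain key layout shown above. The helper names agent_names and check_agent_name are hypothetical and not part of Flume.

```shell
# List the agent names declared in a Flume properties file.
# An agent is recognised by a top-level line such as "r1.sources = s1".
agent_names() {
  grep -E '^[[:space:]]*[A-Za-z0-9_-]+\.(sources|channels|sinks)[[:space:]]*=' "$1" \
    | sed -E 's/^[[:space:]]*([A-Za-z0-9_-]+)\..*/\1/' \
    | sort -u
}

# Succeed only if the given agent name is declared in the file.
check_agent_name() {
  if agent_names "$2" | grep -qxF "$1"; then
    echo "agent '$1' is defined in $2"
  else
    echo "agent '$1' is not defined in $2" >&2
    return 1
  fi
}

# Example:
#   check_agent_name r1 conf/flume.conf && \
#     bin/flume-ng agent --conf conf --conf-file conf/flume.conf --name r1
```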


Conclusion

In this blog post, we have learned how to install and configure Flume on a Linux machine. We have also seen how to define a simple data flow using sources, channels and sinks. Flume is a powerful tool for collecting and moving large amounts of log data in an efficient and reliable way.


Frequently Asked Questions

Q: How can I monitor the status of my Flume agents?

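A: Flume does not start a monitoring endpoint by default. If you start the agent with -Dflume.monitoring.type=http and -Dflume.monitoring.port=<port>, it serves its metrics as JSON at http://localhost:<port>/metrics. The same metrics are also exposed through JMX.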
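As an illustration, HTTP metrics reporting is switched on with two JVM system properties passed to flume-ng, after which the counters can be fetched as JSON from the /metrics path. The following is a minimal sketch, assuming the agent r1 and conf/flume.conf from this tutorial. The port 34545 is an arbitrary choice and the function names are our own.

```shell
# Assumption: any free port works; 34545 is only an example.
METRICS_PORT=34545

# Start agent r1 with the HTTP metrics server enabled.
# Run this from the Flume installation directory.
start_agent_with_metrics() {
  bin/flume-ng agent --conf conf --conf-file conf/flume.conf --name r1 \
    -Dflume.monitoring.type=http \
    -Dflume.monitoring.port="$METRICS_PORT"
}

# Build the URL of the JSON metrics endpoint (host and port are optional).
metrics_url() {
  printf 'http://%s:%s/metrics\n' "${1:-localhost}" "${2:-$METRICS_PORT}"
}

# In one terminal:      start_agent_with_metrics
# In another terminal:  curl -s "$(metrics_url)"
```

The response groups counters by component, so you can watch the source, channel and sink of the agent separately.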

Q: How can I troubleshoot errors or failures in my Flume agents?

A: You can check the log files that Flume writes to its logs directory, or enable debug logging by setting log4j.logger.org.apache.flume=DEBUG in conf/log4j.properties.
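Both ways of turning on debug output can be scripted. The following is a small sketch, assuming Flume 1.9.0 (which reads conf/log4j.properties) and the agent r1 from this tutorial. The helper name enable_flume_debug is hypothetical.

```shell
# Append the DEBUG logger setting to a log4j properties file,
# unless the exact line is already present.
enable_flume_debug() {
  props="${1:-conf/log4j.properties}"
  line='log4j.logger.org.apache.flume=DEBUG'
  grep -qxF "$line" "$props" 2>/dev/null || printf '%s\n' "$line" >> "$props"
}

# Alternative for a single run: override the root logger on the
# command line and send the output to the console.
run_agent_with_debug_console() {
  bin/flume-ng agent --conf conf --conf-file conf/flume.conf --name r1 \
    -Dflume.root.logger=DEBUG,console
}

# Example:
#   enable_flume_debug conf/log4j.properties
```

The command-line override is convenient while experimenting because it leaves the configuration file untouched.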

Q: How can I customize or extend Flume functionality?

A: You can write your own custom sources, channels or sinks using the Java API provided by Flume, or use third-party plugins available online.
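Once a custom component is packaged as a jar, Flume can load it from a plugins.d directory under the installation directory. Each plugin gets its own subdirectory with lib (the plugin jar), libext (its dependencies) and native (native libraries). The following is a minimal sketch of that layout. The function name, the plugin name and the jar path are hypothetical.

```shell
# Create the plugins.d layout for one plugin and copy its jar into lib/.
#   $1 = Flume installation directory
#   $2 = plugin name (used as the directory name)
#   $3 = path to the plugin jar
install_flume_plugin() {
  flume_home="$1"
  name="$2"
  jar="$3"
  plugin_dir="$flume_home/plugins.d/$name"
  mkdir -p "$plugin_dir/lib" "$plugin_dir/libext" "$plugin_dir/native"
  cp "$jar" "$plugin_dir/lib/"
}

# Example (hypothetical paths):
#   install_flume_plugin "$HOME/opt/apache-flume-1.9.0-bin" my-source target/my-source.jar
```

After that, the component is referenced in the configuration file by its fully qualified class name in the type property, in the same way as the Kafka sink in this tutorial.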
