x-forwarded-host – Header

We have a requirement in our REST application to generate a hyperlink to one of our web pages inside a PDF file. To build that link, we need the host name of the environment where the application is running. The request flow is as follows:
user => Web application => REST application

The x-forwarded-host header carries the host (domain name) that the user originally requested, as forwarded by the web application. We read it in our code and use it to generate the link.

        String host = httpServletRequest.getHeader("x-forwarded-host");
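
The header lookup above can be turned into a link with a small helper. The following is a minimal sketch rather than the code used in the application: the class name ForwardedHostLink, the methods resolveHost and buildLink, the fallback host, and the example host names are all hypothetical. It assumes the header may be missing, and that it may hold a comma-separated list when the request passed through more than one proxy, in which case the first entry is taken. The header should only be trusted when it is set by a proxy you control, since a client can send it too.

```java
class ForwardedHostLink {

    /**
     * Picks the host to use in generated links.
     *
     * @param forwardedHost raw value of the x-forwarded-host header, may be null
     * @param fallbackHost  host to use when the header is absent or empty
     *                      (for example, the value of request.getServerName())
     */
    static String resolveHost(String forwardedHost, String fallbackHost) {
        if (forwardedHost == null) {
            return fallbackHost;
        }
        // Assumption: with several proxies the header may contain
        // "original, proxy1, proxy2"; keep only the first entry.
        int comma = forwardedHost.indexOf(',');
        String first = (comma >= 0 ? forwardedHost.substring(0, comma) : forwardedHost).trim();
        return first.isEmpty() ? fallbackHost : first;
    }

    /** Joins scheme, host and path into an absolute URL. */
    static String buildLink(String scheme, String host, String path) {
        String normalizedPath = path.startsWith("/") ? path : "/" + path;
        return scheme + "://" + host + normalizedPath;
    }

    public static void main(String[] args) {
        // In a servlet, the first argument would be
        // httpServletRequest.getHeader("x-forwarded-host").
        String host = resolveHost("app.example.com, proxy.internal", "localhost:8080");
        System.out.println(buildLink("https", host, "reports/summary"));
    }
}
```

Keeping the host resolution separate from the servlet request makes the fallback behaviour easy to unit test without a running container.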


Hadoop Map Reduce Output

During a MapReduce job run, the output of each map task is written to the local file system of the node where that map task is running. This intermediate output is removed once the job is completed. The location of this directory is defined by the property mapreduce.cluster.local.dir.

The output of the reduce tasks is persisted to HDFS, where it is then replicated according to the configured replication factor.

Hadoop Reducer – Configuration

If we don't set the number of reduce tasks in our driver class, it defaults to 1 and the job runs with a single reducer.

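If we don't set the reducer class in our driver class, an identity reducer is used by default (IdentityReducer in the old API, the base Reducer class in the new API). It passes every key/value pair through unchanged, so the result is simply the shuffled and sorted map output. The shuffling and sorting are done by the framework, not by the reducer. With the default of one reduce task, the results land in a single output file.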

If we set the number of reduce tasks to 0, no reduce tasks are run; the map output becomes the final output and is written directly to HDFS.
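
To make the three cases concrete, here is a toy model in plain Java. It does not use the Hadoop API: the class ReducerCountSketch and its run method are hypothetical, records are simple "key<TAB>value" strings, and the reducer is assumed to be the identity. The partition formula mirrors the one in Hadoop's default HashPartitioner, although a real job hashes Writable keys such as Text rather than Java strings, so the actual partition numbers would differ. The file names follow the part-m-NNNNN and part-r-NNNNN pattern that Hadoop uses for map-only and reduce output.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ReducerCountSketch {

    /** Same formula as Hadoop's default HashPartitioner. */
    static int partition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    /** Extracts the key from a "key<TAB>value" record. */
    static String keyOf(String record) {
        return record.substring(0, record.indexOf('\t'));
    }

    /**
     * Simulates where the records end up.
     *
     * @param mapOutputs     one list of records per map task
     * @param numReduceTasks 0 for a map-only job, otherwise the reducer count
     * @return output file name mapped to the records written to it
     */
    static Map<String, List<String>> run(List<List<String>> mapOutputs, int numReduceTasks) {
        Map<String, List<String>> files = new TreeMap<>();
        if (numReduceTasks == 0) {
            // Map-only job: no shuffle and no sort; each map task writes its own file.
            for (int m = 0; m < mapOutputs.size(); m++) {
                files.put(String.format("part-m-%05d", m), new ArrayList<>(mapOutputs.get(m)));
            }
            return files;
        }
        // One output file per reduce task, even if it receives no records.
        for (int r = 0; r < numReduceTasks; r++) {
            files.put(String.format("part-r-%05d", r), new ArrayList<>());
        }
        // Shuffle: route every record to the reduce task chosen by the partitioner.
        for (List<String> mapOutput : mapOutputs) {
            for (String record : mapOutput) {
                int target = partition(keyOf(record), numReduceTasks);
                files.get(String.format("part-r-%05d", target)).add(record);
            }
        }
        // Sort by key within each partition; the identity reducer then
        // writes the records out unchanged.
        for (List<String> records : files.values()) {
            records.sort(Comparator.comparing((String rec) -> keyOf(rec)));
        }
        return files;
    }
}
```

Calling run with 1 yields a single sorted part-r-00000 file, while calling it with 0 yields one unsorted part-m file per map task, which matches the behaviour described above.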

Apache Flume Vs Apache Kafka

Kafka: A publish-subscribe messaging system.
Flume: A service for collecting, aggregating, and moving large amounts of data into Hadoop, or for processing the data and persisting it into a relational database system.

Kafka: Messages are replicated across multiple broker nodes, so if a broker fails, the messages can still be retrieved.
Flume: Events are not replicated, so if a node fails, the events held on that node can be lost.

Kafka: It is a pull-based messaging system. Messages remain available for a configured number of days, so clients in different consumer groups can each pull the same message.
Flume: Data is pushed to the destination, which could be a logger, Hadoop, or a custom sink, so messages are not retained the way they are in Kafka.

Both systems can be used together. Messages can be published to Kafka and then consumed by a Flume agent through KafkaSource, and a Flume agent can also push data into Kafka through KafkaSink.
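
The pull versus push distinction in the comparison can be illustrated with a toy in-memory model. This is a sketch only and uses neither the Kafka nor the Flume API: the classes RetainedLog and PushChannel and their methods are hypothetical, and time-based retention is not modelled. The point it shows is that a retained log lets every consumer group read the same messages at its own pace, while a push channel hands each event to the sink once and then no longer holds it.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class PullVsPushSketch {

    /** Kafka-like: an append-only log where each consumer group tracks its own offset. */
    static class RetainedLog {
        private final List<String> messages = new ArrayList<>();
        private final Map<String, Integer> offsets = new HashMap<>();

        void publish(String message) {
            messages.add(message);
        }

        /** Returns the messages this group has not read yet and advances only its offset. */
        List<String> poll(String group) {
            int from = offsets.getOrDefault(group, 0);
            List<String> batch = new ArrayList<>(messages.subList(from, messages.size()));
            offsets.put(group, messages.size());
            return batch;
        }
    }

    /** Flume-like: events wait in a channel only until they are delivered to the sink. */
    static class PushChannel {
        private final Deque<String> channel = new ArrayDeque<>();

        void put(String event) {
            channel.addLast(event);
        }

        /** Hands all pending events to the sink; afterwards the channel is empty. */
        List<String> deliverToSink() {
            List<String> delivered = new ArrayList<>(channel);
            channel.clear();
            return delivered;
        }
    }

    public static void main(String[] args) {
        RetainedLog log = new RetainedLog();
        log.publish("event-1");
        log.publish("event-2");
        // Two independent consumer groups both see the full log.
        System.out.println(log.poll("reporting"));
        System.out.println(log.poll("archiving"));

        PushChannel channel = new PushChannel();
        channel.put("event-1");
        channel.put("event-2");
        // The first delivery drains the channel; the second finds nothing.
        System.out.println(channel.deliverToSink());
        System.out.println(channel.deliverToSink());
    }
}
```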