Automata is Hongyu Su's journey through data engineering and architecture!
A feed of the most recent posts is available.
Provision AWS EC2 cluster with Spark version 2.x
This article shows how to provision an EC2 cluster with Spark and Hadoop, so that one can run Spark applications against the HDFS file system.
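As a minimal sketch of the provisioning step (not code from the article itself), the cluster nodes could be launched programmatically with the AWS SDK for Java from Scala before installing Spark and Hadoop on them; the AMI id, key pair, and security group below are placeholders:

```scala
import com.amazonaws.services.ec2.AmazonEC2ClientBuilder
import com.amazonaws.services.ec2.model.RunInstancesRequest
import scala.collection.JavaConverters._

object LaunchCluster {
  def main(args: Array[String]): Unit = {
    // Uses credentials and region from the default provider chain (env vars, ~/.aws).
    val ec2 = AmazonEC2ClientBuilder.defaultClient()

    // Placeholder values: substitute your own AMI, key pair, and security group.
    val request = new RunInstancesRequest()
      .withImageId("ami-xxxxxxxx")          // hypothetical AMI id
      .withInstanceType("m4.large")
      .withKeyName("spark-cluster-key")     // hypothetical key pair name
      .withSecurityGroups("spark-cluster")  // hypothetical security group
      .withMinCount(1)
      .withMaxCount(4)                      // e.g. one master plus three workers

    val result = ec2.runInstances(request)
    result.getReservation.getInstances.asScala.foreach(i => println(i.getInstanceId))
  }
}
```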
Streaming with Apache Storm
Several previous posts briefly demonstrated some of my experiences with streaming data processing, mostly using Kafka and Spark. Keep in mind that Kafka gives us realtime processing capability, while Spark gives us near-realtime processing, mostly because Spark processes a stream of RDDs generated over some time window. In this article, let me quickly walk through the basic ideas and an example of streaming data processing with Apache Storm, another popular stream processing framework. Sadly, Storm is not part of the Cloudera package, so if you have a Cloudera sandbox, you will be missing Storm.
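To give a flavor of the Storm programming model up front, here is a minimal Scala sketch (not code from the article itself): it wires the TestWordSpout that ships with storm-core to a hypothetical bolt and runs the topology in-process.

```scala
import org.apache.storm.{Config, LocalCluster}
import org.apache.storm.testing.TestWordSpout
import org.apache.storm.topology.base.BaseBasicBolt
import org.apache.storm.topology.{BasicOutputCollector, OutputFieldsDeclarer, TopologyBuilder}
import org.apache.storm.tuple.{Fields, Tuple, Values}

// A hypothetical bolt that upper-cases each word it receives, tuple by tuple.
class UpperCaseBolt extends BaseBasicBolt {
  override def execute(tuple: Tuple, collector: BasicOutputCollector): Unit =
    collector.emit(new Values(tuple.getString(0).toUpperCase))

  override def declareOutputFields(declarer: OutputFieldsDeclarer): Unit =
    declarer.declare(new Fields("word"))
}

object StormDemo {
  def main(args: Array[String]): Unit = {
    val builder = new TopologyBuilder
    // TestWordSpout ships with storm-core and emits random words indefinitely.
    builder.setSpout("words", new TestWordSpout, 1)
    builder.setBolt("upper", new UpperCaseBolt, 2).shuffleGrouping("words")

    // Run the topology in-process; a real deployment would use StormSubmitter.
    val cluster = new LocalCluster
    cluster.submitTopology("demo", new Config, builder.createTopology())
  }
}
```

Unlike Spark's micro-batches, each tuple here flows through the bolt as soon as the spout emits it, which is what makes Storm a true record-at-a-time framework.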
Data ingestion and loading: Flume, Sqoop, Hive, and HBase
Extraction and loading are important parts of big data ETL operations. In this article, we will focus on data ingestion, mainly with Sqoop and Flume. These operations are quite often used to transfer data between file systems (e.g. HDFS), NoSQL databases (e.g. HBase), SQL databases (e.g. Hive), message queuing systems (e.g. Kafka), and other sources and sinks.
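As one small illustration of the loading side (Sqoop and Flume themselves are driven from the command line and config files), a record can be written into HBase with the standard HBase client API; the table and column family names below are hypothetical, not from the article:

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes

object HBaseLoad {
  def main(args: Array[String]): Unit = {
    // Picks up hbase-site.xml from the classpath (ZooKeeper quorum etc.).
    val conf = HBaseConfiguration.create()
    val connection = ConnectionFactory.createConnection(conf)
    // The "events" table and column family "d" are hypothetical names.
    val table = connection.getTable(TableName.valueOf("events"))
    try {
      val put = new Put(Bytes.toBytes("row-1"))
      put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("payload"), Bytes.toBytes("hello"))
      table.put(put)
    } finally {
      table.close()
      connection.close()
    }
  }
}
```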
Streaming processing (III): Best Spark Practice
The previous article illustrated some experience with applying Kafka to streaming data processing; it extended an earlier article that discussed the basic concepts, setup, and integration of Kafka, Spark Streaming, Confluent Schema Registry, and Avro. This post is devoted to best practices for Spark Streaming operations (e.g., transform, broadcast variables), serialization and deserialization, and unit testing of Spark Streaming applications.
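For a taste of the transform-plus-broadcast pattern covered there, here is a minimal Scala sketch (the socket source, port, and lookup map are placeholders, not code from the article):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SparkStreamingDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("broadcast-transform").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))

    // A small lookup table shipped once to every executor via a broadcast variable.
    val countries = ssc.sparkContext.broadcast(Map("fi" -> "Finland", "se" -> "Sweden"))

    // Hypothetical source: lines of country codes arriving on a local socket.
    val lines = ssc.socketTextStream("localhost", 9999)

    // transform exposes the underlying RDD of each micro-batch, so arbitrary
    // RDD operations (here, a lookup against the broadcast map) are possible.
    val named = lines.transform { rdd =>
      rdd.map(code => countries.value.getOrElse(code, "unknown"))
    }
    named.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```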
Streaming processing (II): Best Kafka Practice
In the previous article, I briefly discussed the basic setup and integration of Spark Streaming, Kafka, Confluent Schema Registry, and Avro for streaming data processing. My focus here is to demonstrate best practices for applying these stream processing technologies. In particular, I will illustrate a few common KStream operations (e.g., ValueMapper, KeyValueMapper, ValueJoiner, Predicate), serialization and deserialization, and unit testing for Kafka Streams processing.
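As a small appetizer, the following Scala sketch wires a Predicate and a ValueMapper into a KStream pipeline; the application id and topic names are placeholders, not taken from the article:

```scala
import java.util.Properties
import org.apache.kafka.common.serialization.Serdes
import org.apache.kafka.streams.{KafkaStreams, StreamsBuilder, StreamsConfig}
import org.apache.kafka.streams.kstream.{Consumed, KStream, Predicate, Produced, ValueMapper}

object KStreamDemo {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "kstream-demo")       // hypothetical app id
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")

    val builder = new StreamsBuilder()
    // "input" and "output" are placeholder topic names.
    val source: KStream[String, String] =
      builder.stream("input", Consumed.`with`(Serdes.String(), Serdes.String()))

    // The Predicate drops empty records; the ValueMapper normalizes the rest.
    source
      .filter(new Predicate[String, String] {
        override def test(key: String, value: String): Boolean =
          value != null && value.nonEmpty
      })
      .mapValues(new ValueMapper[String, String] {
        override def apply(value: String): String = value.trim.toLowerCase
      })
      .to("output", Produced.`with`(Serdes.String(), Serdes.String()))

    val streams = new KafkaStreams(builder.build(), props)
    streams.start()
  }
}
```

The explicit anonymous classes (rather than Scala lambdas) are deliberate: mapValues is overloaded in the Java API, and naming the ValueMapper type avoids SAM-conversion ambiguity.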