Automata is Hongyu Su's journey through data engineering and architecture!
A feed of the most recent posts is available.
Provision AWS EC2 cluster with Spark version 2.x
This article shows how to provision an EC2 cluster with Spark and Hadoop, so that one can run Spark applications against the HDFS file system.
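As a minimal sketch of the provisioning step (not code from the article itself), the cluster nodes could be launched programmatically with the AWS SDK for Java from Scala before installing Spark and Hadoop on them; the AMI id, key pair, and security group below are placeholders:

```scala
import com.amazonaws.services.ec2.AmazonEC2ClientBuilder
import com.amazonaws.services.ec2.model.RunInstancesRequest
import scala.collection.JavaConverters._

object LaunchCluster {
  def main(args: Array[String]): Unit = {
    // Uses credentials and region from the default provider chain (env vars, ~/.aws).
    val ec2 = AmazonEC2ClientBuilder.defaultClient()

    // Placeholder values: substitute your own AMI, key pair, and security group.
    val request = new RunInstancesRequest()
      .withImageId("ami-xxxxxxxx")          // hypothetical AMI id
      .withInstanceType("m4.large")
      .withKeyName("spark-cluster-key")     // hypothetical key pair name
      .withSecurityGroups("spark-cluster")  // hypothetical security group
      .withMinCount(1)
      .withMaxCount(4)                      // e.g. one master plus three workers

    val result = ec2.runInstances(request)
    result.getReservation.getInstances.asScala.foreach(i => println(i.getInstanceId))
  }
}
```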
Streaming with Apache Storm
Several previous posts briefly demonstrated some of my experiences with streaming data processing, mostly using Kafka and Spark. Keep in mind that Kafka gives us realtime processing capability, while Spark gives us near-realtime processing, mostly because Spark processes a stream of RDDs generated over some time window. In this article, let me quickly walk through the basic ideas and an example of streaming data processing with Apache Storm, another popular stream processing framework. Sadly, Storm is not part of the Cloudera package, so if you have a Cloudera sandbox, you will be missing Storm.
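To give a flavor of the Storm programming model up front, here is a minimal Scala sketch (not code from the article itself): it wires the TestWordSpout that ships with storm-core to a hypothetical bolt and runs the topology in-process.

```scala
import org.apache.storm.{Config, LocalCluster}
import org.apache.storm.testing.TestWordSpout
import org.apache.storm.topology.base.BaseBasicBolt
import org.apache.storm.topology.{BasicOutputCollector, OutputFieldsDeclarer, TopologyBuilder}
import org.apache.storm.tuple.{Fields, Tuple, Values}

// A hypothetical bolt that upper-cases each word it receives, tuple by tuple.
class UpperCaseBolt extends BaseBasicBolt {
  override def execute(tuple: Tuple, collector: BasicOutputCollector): Unit =
    collector.emit(new Values(tuple.getString(0).toUpperCase))

  override def declareOutputFields(declarer: OutputFieldsDeclarer): Unit =
    declarer.declare(new Fields("word"))
}

object StormDemo {
  def main(args: Array[String]): Unit = {
    val builder = new TopologyBuilder
    // TestWordSpout ships with storm-core and emits random words indefinitely.
    builder.setSpout("words", new TestWordSpout, 1)
    builder.setBolt("upper", new UpperCaseBolt, 2).shuffleGrouping("words")

    // Run the topology in-process; a real deployment would use StormSubmitter.
    val cluster = new LocalCluster
    cluster.submitTopology("demo", new Config, builder.createTopology())
  }
}
```

Unlike Spark's micro-batches, each tuple here flows through the bolt as soon as the spout emits it, which is what makes Storm a true record-at-a-time framework.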
Data ingestion and loading: Flume, Sqoop, Hive, and HBase
Extraction and loading are important parts of big data ETL operations. In this article, we will focus on data ingestion, mainly with Sqoop and Flume. These operations are quite often used to transfer data between file systems (e.g. HDFS), NoSQL databases (e.g. HBase), SQL databases (e.g. Hive), message queuing systems (e.g. Kafka), and other sources and sinks.
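As one small illustration of the loading side (Sqoop and Flume themselves are driven from the command line and config files), a record can be written into HBase with the standard HBase client API; the table and column family names below are hypothetical, not from the article:

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes

object HBaseLoad {
  def main(args: Array[String]): Unit = {
    // Picks up hbase-site.xml from the classpath (ZooKeeper quorum etc.).
    val conf = HBaseConfiguration.create()
    val connection = ConnectionFactory.createConnection(conf)
    // The "events" table and column family "d" are hypothetical names.
    val table = connection.getTable(TableName.valueOf("events"))
    try {
      val put = new Put(Bytes.toBytes("row-1"))
      put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("payload"), Bytes.toBytes("hello"))
      table.put(put)
    } finally {
      table.close()
      connection.close()
    }
  }
}
```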
Streaming processing (III): Best Spark Practice
The previous article illustrated some experience with applying Kafka to streaming data processing; it extended an earlier article that discussed the basic concepts, setup, and integration of Kafka, Spark Streaming, Confluent Schema Registry, and Avro. This post is devoted to best practices for Spark Streaming operations (e.g., transform, broadcast variables), serialization and deserialization, and unit testing of Spark Streaming applications.
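For a taste of the transform-plus-broadcast pattern covered there, here is a minimal Scala sketch (the socket source, port, and lookup map are placeholders, not code from the article):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SparkStreamingDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("broadcast-transform").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))

    // A small lookup table shipped once to every executor via a broadcast variable.
    val countries = ssc.sparkContext.broadcast(Map("fi" -> "Finland", "se" -> "Sweden"))

    // Hypothetical source: lines of country codes arriving on a local socket.
    val lines = ssc.socketTextStream("localhost", 9999)

    // transform exposes the underlying RDD of each micro-batch, so arbitrary
    // RDD operations (here, a lookup against the broadcast map) are possible.
    val named = lines.transform { rdd =>
      rdd.map(code => countries.value.getOrElse(code, "unknown"))
    }
    named.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```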
Streaming processing (II): Best Kafka Practice
In the previous article, I briefly discussed the basic setup and integration of Spark Streaming, Kafka, Confluent Schema Registry, and Avro for streaming data processing. My focus here is to demonstrate best practices for applying these stream processing technologies. In particular, I will illustrate a few common KStream operations (e.g., ValueMapper, KeyValueMapper, ValueJoiner, Predicate), serialization and deserialization, and unit testing for Kafka Streams processing.
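As a small appetizer, the following Scala sketch wires a Predicate and a ValueMapper into a KStream pipeline; the application id and topic names are placeholders, not taken from the article:

```scala
import java.util.Properties
import org.apache.kafka.common.serialization.Serdes
import org.apache.kafka.streams.{KafkaStreams, StreamsBuilder, StreamsConfig}
import org.apache.kafka.streams.kstream.{Consumed, KStream, Predicate, Produced, ValueMapper}

object KStreamDemo {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "kstream-demo")       // hypothetical app id
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")

    val builder = new StreamsBuilder()
    // "input" and "output" are placeholder topic names.
    val source: KStream[String, String] =
      builder.stream("input", Consumed.`with`(Serdes.String(), Serdes.String()))

    // The Predicate drops empty records; the ValueMapper normalizes the rest.
    source
      .filter(new Predicate[String, String] {
        override def test(key: String, value: String): Boolean =
          value != null && value.nonEmpty
      })
      .mapValues(new ValueMapper[String, String] {
        override def apply(value: String): String = value.trim.toLowerCase
      })
      .to("output", Produced.`with`(Serdes.String(), Serdes.String()))

    val streams = new KafkaStreams(builder.build(), props)
    streams.start()
  }
}
```

The explicit anonymous classes (rather than Scala lambdas) are deliberate: mapValues is overloaded in the Java API, and naming the ValueMapper type avoids SAM-conversion ambiguity.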