This is the complete archive of posts from Automata in reverse chronological order.
Provision AWS EC2 cluster with Spark version 2.x
This article shows how to provision an EC2 cluster with Spark and Hadoop. As a result, one should be able to run Spark applications that use the HDFS file system.
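As a minimal sketch of the kind of provisioning call involved (assuming the AWS CLI is used to launch the nodes; the AMI ID, instance type, and key name below are placeholders, not values from the article):

```python
import shlex

def run_instances_cmd(ami, count, instance_type, key_name):
    """Assemble an `aws ec2 run-instances` call that launches the cluster nodes."""
    args = [
        "aws", "ec2", "run-instances",
        "--image-id", ami,
        "--count", str(count),
        "--instance-type", instance_type,
        "--key-name", key_name,
    ]
    # Quote each argument so the command is safe to paste into a shell.
    return " ".join(shlex.quote(a) for a in args)

# Hypothetical values for a 4-node cluster.
cmd = run_instances_cmd("ami-0abcd1234", 4, "m4.large", "spark-cluster-key")
```

Installing Spark and Hadoop on the launched instances is then the subject of the article itself.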
Streaming with Apache Storm
Several previous posts briefly demonstrated some of my experience with streaming data processing, mostly using Kafka and Spark. Keep in mind that Kafka gives us real-time processing, while Spark gives near-real-time processing: Spark operates on a stream of RDDs generated over some time window. In this article, let me quickly walk through the basic ideas and an example of streaming data processing using Apache Storm, another popular stream-processing framework. Sadly, Storm is not part of the Cloudera package, so if you have a Cloudera sandbox, Storm will be missing from it.
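To give a flavor of Storm's spout/bolt model without the Storm runtime, here is a plain-Python sketch of the classic word-count topology (the functions merely stand in for spouts and bolts; this is illustrative, not the Storm API):

```python
from collections import Counter

def sentence_spout():
    # A spout emits a stream of tuples; here, raw sentences.
    for sentence in ["the quick brown fox", "the lazy dog"]:
        yield sentence

def split_bolt(sentences):
    # A bolt consumes tuples and emits new ones; here, individual words.
    for s in sentences:
        for word in s.split():
            yield word

def count_bolt(words):
    # A stateful bolt keeping running totals per word.
    counts = Counter()
    for w in words:
        counts[w] += 1
    return counts

counts = count_bolt(split_bolt(sentence_spout()))
```

In real Storm the spout and bolts run as parallel tasks wired together by a topology, but the dataflow is the same.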
Data ingestion and loading: Flume, Sqoop, Hive, and HBase
Extraction and loading are important parts of big-data ETL operations. In this article, we focus on data ingestion, mainly with Sqoop and Flume. These operations are often used to transfer data between file systems (e.g., HDFS), NoSQL databases (e.g., HBase), SQL engines (e.g., Hive), message queues (e.g., Kafka), and other sources and sinks.
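For a taste of the Sqoop side, the sketch below assembles a typical `sqoop import` invocation as an argument list (the JDBC URL, credentials, table, and target directory are made-up examples):

```python
def sqoop_import_cmd(jdbc_url, user, table, target_dir, mappers=4):
    """Build the argument list for importing one RDBMS table into HDFS."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,       # JDBC URL of the source database
        "--username", user,
        "--table", table,            # table to import
        "--target-dir", target_dir,  # HDFS destination directory
        "-m", str(mappers),          # number of parallel map tasks
    ]

cmd = sqoop_import_cmd("jdbc:mysql://dbhost/retail", "etl_user",
                       "orders", "/user/etl/orders")
```

The list could be handed to `subprocess.run` on a machine where Sqoop is installed.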
Streaming processing (III): Best Spark Practice
The previous article illustrated some experience with applying Kafka to streaming data processing; it extended an earlier article that discussed the basic concepts, setup, and integration of Kafka, Spark Streaming, Confluent Schema Registry, and Avro. This post is devoted to best practices for Spark Streaming: common operations (e.g., transform, broadcast variables), serialization and deserialization, and unit testing of Spark streaming code.
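As a conceptual model of the `transform` pattern, the sketch below treats a DStream as a list of micro-batches and a broadcast variable as a read-only lookup table shared across batches (pure Python, no Spark; all names are illustrative):

```python
# "Broadcast" lookup table: shipped once to the workers, read-only thereafter.
country_lookup = {"US": "United States", "DE": "Germany"}

def transform(dstream, batch_fn):
    # DStream.transform applies an arbitrary batch-level function
    # to every micro-batch (here, a list of records).
    return [batch_fn(batch) for batch in dstream]

def enrich(batch):
    # Join each record against the broadcast table.
    return [(code, country_lookup.get(code, "unknown")) for code in batch]

stream = [["US", "DE"], ["US", "FR"]]  # two micro-batches of country codes
enriched = transform(stream, enrich)
```

In real Spark, the lookup table would be wrapped with `sc.broadcast(...)` so it is serialized to each executor once instead of once per task.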
Streaming processing (II): Best Kafka Practice
In the previous article, I briefly discussed the basic setup and integration of Spark Streaming, Kafka, Confluent Schema Registry, and Avro for streaming data processing. My focus here is to demonstrate best practices for applying these stream-processing technologies. In particular, I will illustrate a few common KStream operations (e.g., ValueMapper, KeyValueMapper, ValueJoiner, Predicate), serialization and deserialization, and unit testing of Kafka streaming code.
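The KStream interfaces mentioned above are essentially single-method callbacks; as a rough plain-Python analogue (no Kafka involved, records are fabricated), they can be modeled as functions over (key, value) pairs:

```python
records = [("user1", 3), ("user2", -1), ("user1", 7)]

value_mapper = lambda v: v * 10                 # like KStream.mapValues
predicate = lambda k, v: v > 0                  # like KStream.filter
key_value_mapper = lambda k, v: (k.upper(), v)  # like KStream.map

# Apply the operations in the order filter -> mapValues -> map.
filtered = [(k, v) for k, v in records if predicate(k, v)]
mapped = [(k, value_mapper(v)) for k, v in filtered]
remapped = [key_value_mapper(k, v) for k, v in mapped]
```

In Kafka Streams proper, these callbacks are passed to a `KStream` built from a topic, and the pipeline runs continuously rather than over a fixed list.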
Streaming processing (I): Kafka, Spark, Avro Integration
Streaming data processing is yet another interesting topic in data science. In this article, we walk through the integration of Spark Streaming, Kafka, and the Confluent Schema Registry for the purpose of exchanging Avro-format messages. Spark, Kafka, and ZooKeeper run on a single machine (standalone cluster); their actual configurations are to some extent irrelevant to this integration.
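One concrete detail worth noting is the wire format used by the Confluent Schema Registry serializers: each message starts with a magic byte 0 and a 4-byte big-endian schema ID, followed by the Avro-encoded body. A minimal sketch of framing and unframing (the payload bytes here are a stand-in, not real Avro):

```python
import struct

def frame(schema_id, avro_payload):
    # Confluent wire format: magic byte 0x00, then a 4-byte big-endian
    # schema ID, then the Avro-encoded message body.
    return b"\x00" + struct.pack(">I", schema_id) + avro_payload

def unframe(message):
    magic = message[0]
    assert magic == 0, "not a Confluent-framed message"
    schema_id = struct.unpack(">I", message[1:5])[0]
    return schema_id, message[5:]

msg = frame(42, b"avro-bytes")
sid, body = unframe(msg)
```

This is why a consumer without access to the registry cannot decode such messages: it needs the schema ID to fetch the writer schema before deserializing the body.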
After living on this planet for about 30 years, I decided to track my sport activities somehow, which mostly means recording the amount of exercise I have done. I am also hoping that this kind of sport history will eventually encourage me to do more and more. Of course, for the sake of data science and analytics, activities are documented in a JSON table and presented via JavaScript, in particular D3.js.
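As a small illustration of the data-preparation step (the records below are made up), a calendar view typically wants one aggregate value per day, which is easy to derive from such a JSON table:

```python
import json
from collections import Counter

# Hypothetical JSON table of activities, one record per session.
raw = """[
  {"date": "2016-05-01", "sport": "run"},
  {"date": "2016-05-01", "sport": "swim"},
  {"date": "2016-05-02", "sport": "run"}
]"""

activities = json.loads(raw)
# Count sessions per day; this per-day total is what a D3.js
# calendar heatmap would color each cell by.
per_day = Counter(a["date"] for a in activities)
```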
Deploy ELK stack on Amazon AWS
This article is a practical guide to deploying the ELK stack (Elasticsearch, Logstash, Kibana) on Amazon AWS.
Build a simple web application with Amazon AWS
The goal is to define an AWS Lambda function that responds to events emitted by Amazon S3.
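A minimal sketch of such a handler, assuming the standard S3 event-notification structure (the bucket and key below are fabricated test values):

```python
def handler(event, context):
    """Collect (bucket, key) pairs from an S3 event notification."""
    records = []
    for rec in event.get("Records", []):
        # S3 notifications carry the bucket name and object key here.
        bucket = rec["s3"]["bucket"]["name"]
        key = rec["s3"]["object"]["key"]
        records.append((bucket, key))
    return records

# Simulate the event Lambda would receive when an object is created.
fake_event = {"Records": [{"s3": {"bucket": {"name": "my-bucket"},
                                  "object": {"key": "data/input.csv"}}}]}
result = handler(fake_event, None)
```

In the real deployment, S3 invokes the function automatically once the bucket's event notification is wired to the Lambda.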
Build web applications with Flask+Heroku
Build a web application with Flask and deploy it on Heroku.
Calendar view of data in Jekyll with D3.js
Use JavaScript, in particular D3.js, within Jekyll Bootstrap and Markdown syntax.
Documentation and test modules for Python
This article discusses Python's documentation and testing modules.
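As a small example of one such module, `doctest` turns usage examples embedded in docstrings into tests:

```python
import doctest

def double(x):
    """Return twice the value of x.

    >>> double(3)
    6
    """
    return 2 * x

# Run every doctest found in this module's docstrings;
# results.failed counts examples whose output did not match.
results = doctest.testmod()
```

The same docstring thus serves as documentation (readable via `help(double)`) and as an executable test.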