Article by Ishwar Dhumal, Big Data & Cloud Developer
In this blog, we are going to look at what Apache Flink is, what it is used for, how it works internally, and how it compares to Apache Spark.
What is Apache Flink?
Apache Flink is another new-generation, general-purpose big data processing engine that aims to unify different data workloads. Does that sound like Apache Spark? Exactly. Flink is trying to address the same problem that Spark is trying to solve: both systems aim to provide a single platform where you can run batch, streaming, interactive, graph processing, ML, and other workloads. So Flink does not differ much from Spark in ideology, but the two differ considerably in their implementation details.
Why Apache Flink?
Apache Flink is built around a streaming model: it processes data record by record through a pipelined streaming architecture, and support for iterative algorithms is built into its query optimizer. This pipelined architecture allows Flink to process streaming data with lower latency than micro-batch architectures such as Spark's. Flink is therefore a good fit where you need true real-time processing.
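To make the pipelined model concrete, here is a minimal conceptual sketch in plain Python (not the Flink API): each record flows through the operators the moment it arrives, instead of waiting for a batch to fill up.

```python
def source():
    """Simulate an unbounded stream of events (finite here for the demo)."""
    for i in range(5):
        yield {"id": i, "value": i * 10}

def transform(records):
    # Pipelined processing: each record passes through the operator
    # as soon as it arrives -- no batch accumulation, hence low latency.
    for r in records:
        yield {**r, "value": r["value"] + 1}

def sink(records):
    # Results are emitted per record, as they are produced upstream.
    return [r["value"] for r in records]

results = sink(transform(source()))
print(results)  # [1, 11, 21, 31, 41]
```

In a real Flink job the operators would be DataStream transformations distributed across a cluster; the point here is only that no stage waits for a complete batch before forwarding a record.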
The following are the features of Apache Flink.
=> High throughput, and hence high performance
=> Low latency
=> Fault tolerance
=> One runtime for both streaming and batch processing
=> Programs are optimized before execution
Architecture, API, and Language support.
Apache Flink has a master-slave architecture: a deployment can have multiple masters and as many slaves as the project requires. It also supports auto-scaling when needed.
Apache Flink provides APIs such as DataSet for batch processing, DataStream for real-time processing, the Table API for SQL-style processing, Gelly for graph processing, and FlinkML for machine learning algorithms.
It has language support for Java, Scala, and Python. Since Flink itself is written in Java and Scala, those are the better choices for writing your jobs or big data logic, because they have the richest support for built-in functions.
How Fault Tolerance is Achieved in Flink?
The core concept through which fault tolerance is achieved in Flink is checkpointing.
- The central part of Flink’s fault tolerance mechanism is drawing consistent snapshots of the distributed data stream and operator state. These snapshots act as consistent checkpoints to which the system can fall back in case of a failure.
- Flink’s mechanism for drawing these snapshots is inspired by the standard Chandy-Lamport algorithm for distributed snapshots and is specifically tailored to Flink’s execution model.
- In case of a program failure (due to machine-, network-, or software failure), Flink stops the distributed streaming dataflow. The system then restarts the operators and resets them to the latest successful checkpoint. The input streams are reset to the point of the state snapshot. Any records that are processed as part of the restarted parallel dataflow are guaranteed to not have been part of the previously checkpointed state.
Comparison between Spark and Flink.
Apache Flink and Spark are major technologies in the big data landscape, and there is some overlap (and confusion) about what each does and how they differ. This post compares Spark and Flink: what they do, how they are different, what people use them for, and what streaming is.
How is Apache Flink different from Apache Spark, and, in particular, Apache Spark Streaming?
At first glance, Flink and Spark appear to be the same. The main difference is that Flink was built from the ground up as a streaming product, whereas Spark added streaming to its product later.
Let’s take a look at the technical details of both.
Spark Micro Batches
Spark divides streaming into discrete chunks of data called micro-batches. Then it repeats that in a continuous loop. Flink takes a checkpoint on streaming data to break it into finite sets. Incoming data is not lost when taking this checkpoint as it is preserved in a queue in both products.
Either way you do it, there will always be some lag when processing live data, so dividing it into sets should not matter much. After all, when a program runs a MapReduce operation, the reduce step runs on a map dataset that was created only a few seconds earlier.
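The micro-batch division Spark uses can be sketched as follows; this is a plain-Python illustration of the idea, not Spark's DStream implementation. The stream is chopped into small, finite batches, and each batch is then processed as a short job in a loop.

```python
def micro_batches(stream, batch_size):
    """Chop an incoming stream into discrete micro-batches."""
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch     # hand off a finite chunk for processing
            batch = []
    if batch:
        yield batch         # flush the final partial batch

# Each micro-batch is processed as a small, finite job in a loop.
totals = [sum(b) for b in micro_batches(range(1, 8), batch_size=3)]
print(totals)  # [6, 15, 7]
```

In real Spark Streaming the batching interval is time-based rather than count-based, but the structure is the same: a continuous loop over finite chunks of the stream.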
Using Flink for Batch Operations
Spark was built, like Hadoop, to run over static data sets. Flink can do this by just stopping the streaming source.
Flink processes data the same way whether it is finite or infinite. Spark does not: it uses DStreams (Discretized Streams) for streaming data and RDDs (Resilient Distributed Datasets) for batch data.
The Flink Streaming Model
To say that a program processes streaming data means it opens a file and never closes it. That is like keeping a TCP socket open, which is how Syslog works. Batch programs, on the other hand, open a file, process it, then close it.
Flink says it has developed a way to checkpoint streams without having to open and close them. To checkpoint means to note where one has left off and then resume from that point. Flink also runs a type of native iteration that lets its machine learning algorithms run faster than Spark's. That is not insignificant, as ML routines can take many hours to run.
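One reason Flink's native iterations can be fast is its delta-iteration idea: each pass reprocesses only the elements that are still changing, rather than the whole data set. Here is a hypothetical toy sketch of that pattern in plain Python (the function and the increment-to-target task are invented for illustration, not part of any Flink API):

```python
def delta_iterate(values, target):
    """Increment each value up to target, touching only unconverged elements.

    Returns the converged values and the number of passes taken.
    """
    # The "workset" holds indices of elements that still need work.
    workset = {i for i, v in enumerate(values) if v < target}
    passes = 0
    while workset:
        passes += 1
        next_workset = set()
        for i in workset:           # only unconverged elements are visited
            values[i] += 1
            if values[i] < target:
                next_workset.add(i)
        workset = next_workset      # the workset shrinks as elements converge
    return values, passes

print(delta_iterate([1, 4, 5], target=5))  # ([5, 5, 5], 4)
```

As elements converge they drop out of the workset, so later passes do less work; iterative graph and ML algorithms often have exactly this shape.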
Flink versus Spark in Memory Management
Flink takes a different approach to memory management. Flink pages out to disk when memory is full, which is also what Windows and Linux do. Spark crashes that node when it runs out of memory, but it does not lose data, since it is fault-tolerant. To address this, Spark has a newer project called Tungsten.
Flink and Spark Machine Learning
What Spark and Flink do well is take machine learning algorithms and make them run over a distributed architecture. That, for example, is something that the Python scikit-learn framework and CRAN (Comprehensive R Archive Network) cannot do. Those are designed to work on one file, on one system. Spark and Flink can scale and process enormous training and testing data sets over a distributed system. It is worth noting, however, that the Flink graph processing and ML libraries are still in beta test, as of January 2017.
Command Line Interface (CLI)
Spark has CLIs in Scala, Python, and R. Flink does not really have an interactive CLI in the same sense, though the distinction is subtle.
To have a Spark CLI means a user can start up Spark, obtain a SparkContext, and write programs one line at a time. That makes walking through data and debugging easier. Walking through data and running map and reduce processes, and doing that in stages, is how data scientists work.
Flink has a Scala CLI too, but it is not the same. With Flink, you write code and then run print() to submit it in batch mode and wait for the output.
Again, this might be a matter of semantics. You could argue that spark-shell (Scala), pySpark (Python), and sparkR (R) are batch too. Spark is said to be "lazy": when you create an object, Spark only creates a pointer to it and does not run any operation until you ask for something like count(), which requires materializing the object in order to measure it. Only then does Spark submit the work to its batch engine.
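Lazy evaluation is easy to demonstrate without Spark itself. The following is a minimal sketch, with a made-up LazyDataset class standing in for an RDD: transformations such as map() and filter() only record a plan, and nothing executes until an action like count() is called.

```python
class LazyDataset:
    """Toy stand-in for a lazy RDD-like handle (illustration only)."""

    def __init__(self, data, ops=None):
        self.data = data
        self.ops = ops or []    # recorded transformations, not yet run

    def map(self, fn):
        # Transformation: append to the plan, return a new handle. No work yet.
        return LazyDataset(self.data, self.ops + [("map", fn)])

    def filter(self, pred):
        return LazyDataset(self.data, self.ops + [("filter", pred)])

    def count(self):
        # Action: only now is the recorded plan actually executed.
        result = iter(self.data)
        for kind, fn in self.ops:
            result = map(fn, result) if kind == "map" else filter(fn, result)
        return sum(1 for _ in result)

ds = LazyDataset(range(10)).map(lambda x: x * 2).filter(lambda x: x > 8)
print(ds.count())  # 5
```

Building `ds` above does no computation at all; the doubling and filtering of all ten elements happens inside count(), which is exactly the transformation-versus-action split Spark uses.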
Both frameworks, of course, support batch jobs, which is how most people run their code once they have written and debugged it. With Flink, you can run Java and Scala code in batch; with Spark, you can run Java, Scala, and Python code in batch.
Support for Other Streaming Products
Both Flink and Spark work with Kafka, the streaming platform originally developed at LinkedIn. Flink also works with Storm.
Spark and Flink can run locally or on a cluster. To run Hadoop or Flink on a cluster, one usually uses YARN. Spark is often run with Mesos. If you want to use YARN with Spark, you have to download a version of Spark that has been compiled with YARN support.
What Spark has that Flink does not is a large installed base of production users. If you look at the Flink website, you can see examples of what some companies have built with it, including Capital One and Alibaba. So it is certainly poised to move beyond the beta stage and into the mainstream. Which one will have more users? No one can say, although more funding is pouring into Spark right now. Which one is more suitable for streaming data? That depends on the requirements and merits further study.