Apache Flink

Article by Ishwar Dhumal, Big Data & Cloud Developer

In this blog, we are going to look at what Apache Flink is, what it is used for, and how it works internally, and then compare Flink with Apache Spark.

What is Apache Flink?

Why Apache Flink?

The following are the features of Apache Flink.

Architecture, API, and Language support.

Apache Flink provides APIs such as the DataSet API for batch processing, the DataStream API for real-time stream processing, the Table API for SQL-style processing, Gelly for graph processing, and FlinkML for machine learning algorithms.
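To make the API list concrete, here is a minimal batch word count using the DataSet API in Scala. This is only a sketch: the input strings are placeholder data, and the object name is mine.

```scala
import org.apache.flink.api.scala._

object BatchWordCount {
  def main(args: Array[String]): Unit = {
    // Entry point for the DataSet (batch) API
    val env = ExecutionEnvironment.getExecutionEnvironment

    // Placeholder input data; a real job would read from a file or other source
    val text = env.fromElements("flink handles batch", "flink handles streams")

    val counts = text
      .flatMap(_.toLowerCase.split("\\W+").filter(_.nonEmpty)) // split lines into words
      .map((_, 1))                                             // pair each word with a count of 1
      .groupBy(0)                                              // group by the word
      .sum(1)                                                  // sum the counts

    counts.print() // print() triggers execution of the batch job
  }
}
```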

It supports Java, Scala, and Python. Since Flink itself is written in Java and Scala, those two languages are the better choice for writing jobs and big data logic, because the Java and Scala APIs expose the richest set of built-in functions.
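As a rough illustration of the Scala API on the streaming side, below is a minimal streaming word count with the DataStream API. The hostname and port are example values, and the windowless running aggregation is just one possible choice.

```scala
import org.apache.flink.streaming.api.scala._

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    // Entry point for the DataStream (streaming) API
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Example unbounded source: a text socket on localhost:9999 (placeholder values)
    val lines = env.socketTextStream("localhost", 9999)

    val counts = lines
      .flatMap(_.toLowerCase.split("\\W+").filter(_.nonEmpty)) // split lines into words
      .map((_, 1))                                             // pair each word with a count of 1
      .keyBy(_._1)                                             // partition the stream by word
      .sum(1)                                                  // emit a running count per word

    counts.print()
    env.execute("Streaming WordCount") // streaming jobs start only when execute() is called
  }
}
```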

Execution Model

[Figure: Flink's execution model (source: DataFlair)]

How Fault Tolerance is Achieved in Flink?

  • The central part of Flink’s fault tolerance mechanism is drawing consistent snapshots of the distributed data stream and operator state. These snapshots act as consistent checkpoints to which the system can fall back in case of a failure.
  • Flink’s mechanism for drawing these snapshots is inspired by the standard Chandy-Lamport algorithm for distributed snapshots and is specifically tailored to Flink’s execution model.
  • In case of a program failure (due to a machine, network, or software failure), Flink stops the distributed streaming dataflow. The system then restarts the operators and resets them to the latest successful checkpoint. The input streams are reset to the point of the state snapshot. Any records that are processed as part of the restarted parallel dataflow are guaranteed not to have been part of the previously checkpointed state. A sketch of how checkpointing is enabled in a job follows this list.
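The sketch below shows how a job opts into this checkpointing mechanism through the CheckpointConfig. The interval and pause values, the source data, and the object name are example choices, not recommendations.

```scala
import org.apache.flink.streaming.api.CheckpointingMode
import org.apache.flink.streaming.api.scala._

object CheckpointedJob {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Draw a consistent snapshot of operator state every 10 seconds (example interval)
    env.enableCheckpointing(10000L)

    // Exactly-once checkpointing semantics (shown explicitly for clarity)
    env.getCheckpointConfig.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE)

    // Leave at least 5 seconds between the end of one checkpoint and the start of the next
    env.getCheckpointConfig.setMinPauseBetweenCheckpoints(5000L)

    // Trivial placeholder pipeline so the job has something to run
    env.fromElements(1, 2, 3, 4, 5).map(_ * 2).print()

    env.execute("Checkpointed streaming job")
  }
}
```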

Comparison between Spark and Flink.

How is Apache Flink different from Apache Spark, and, in particular, Apache Spark Streaming?

Let’s take a look at the technical details of both.

Spark Micro Batches

Either way you do it, there will always be some lag when processing live data, so dividing the stream into small sets should not matter much. After all, when a program runs a MapReduce operation, the reduce step runs on a map dataset that was created only a few seconds earlier.
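For context, this is what the micro-batch model looks like in code. The sketch below uses the classic Spark Streaming (DStream) API with a 2-second batch interval; the interval, host, and port are placeholder values.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object MicroBatchExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("MicroBatchExample").setMaster("local[2]")

    // Every 2 seconds, Spark Streaming cuts the live stream into a small RDD (a "micro-batch")
    val ssc = new StreamingContext(conf, Seconds(2))

    // Example socket source on localhost:9999 (placeholder values)
    val lines = ssc.socketTextStream("localhost", 9999)

    val counts = lines
      .flatMap(_.split("\\s+"))  // split each line into words
      .map((_, 1))               // pair each word with a count of 1
      .reduceByKey(_ + _)        // sum counts within each micro-batch

    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```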

Using Flink for Batch Operations

Flink processes data the same way whether it is finite or infinite. Spark does not: it uses DStreams (Discretized Streams) for streaming data and RDDs (Resilient Distributed Datasets) for batch data.
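A small sketch of that point: in Flink, a finite file and an infinite socket can both be read as a DataStream, and the same transformation code applies to both. The file path, host, and port below are placeholders.

```scala
import org.apache.flink.streaming.api.scala._

object BoundedAndUnbounded {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // A bounded source: the file is finite, yet it is still a DataStream[String]
    val fromFile: DataStream[String] = env.readTextFile("/tmp/input.txt")

    // An unbounded source: a socket that never ends, also a DataStream[String]
    val fromSocket: DataStream[String] = env.socketTextStream("localhost", 9999)

    // The same transformation works on both kinds of streams
    fromFile.map(_.length).print()
    fromSocket.map(_.length).print()

    env.execute("Bounded and unbounded sources")
  }
}
```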

The Flink Streaming Model

Flink says it has developed a way to checkpoint streams without having to stop and reopen them. To checkpoint means to record where processing left off so the system can resume from that point. Flink also runs a type of native iteration that lets its machine learning algorithms run faster than Spark's. That is not insignificant, as ML routines can take many hours to run.
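The native iteration mentioned here is exposed in the DataSet API as a bulk iteration operator: the step function runs inside the engine instead of the driver re-submitting a job per round. The sketch below refines a trivial placeholder "model" (a single Int) for 10 rounds; a real ML algorithm would carry a vector of weights instead.

```scala
import org.apache.flink.api.scala._

object BulkIterationSketch {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment

    // Placeholder initial state: a single integer standing in for model parameters
    val initial = env.fromElements(0)

    // Run 10 iterations; each round applies one refinement step to the current state
    val result = initial.iterate(10) { current =>
      current.map(_ + 1) // one step of the iterative refinement (placeholder logic)
    }

    result.print()
  }
}
```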

Flink versus Spark in Memory Management

Flink and Spark Machine Learning

Command Line Interface (CLI)

To have a Spark CLI means a user can start up Spark, obtain a SparkContext, and write programs one line at a time. That makes walking through data and debugging easier, and stepping through data while running map and reduce operations in stages is how data scientists work.
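A short spark-shell session illustrates the line-at-a-time style. The log file path is a placeholder; `sc` is the SparkContext that spark-shell creates for you.

```scala
// Inside spark-shell, the SparkContext is already available as `sc`.
// Each line can be run and inspected on its own, which is what makes
// step-by-step exploration and debugging convenient.
val lines = sc.textFile("/tmp/server.log")               // placeholder path
val errors = lines.filter(_.contains("ERROR"))           // a map/filter stage
errors.take(5).foreach(println)                          // inspect a small sample
val numErrors = errors.count()                           // a reduce-style action
```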

Flink has a Scala CLI too, but it is not the same. With Flink, you write the whole pipeline and then call print() to submit it as a batch job and wait for the output.
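In Flink versions that ship the Scala shell (started with bin/start-scala-shell.sh), a batch environment is pre-bound as `benv`. The snippet below is a sketch of that workflow; the input strings are placeholder data.

```scala
// Inside Flink's Scala shell, a batch ExecutionEnvironment is pre-bound as `benv`.
// Nothing runs until a sink such as print() is called; at that point the whole
// pipeline is submitted as a job and the shell waits for the output.
val text = benv.fromElements("to be", "or not to be")    // placeholder input
val counts = text
  .flatMap(_.split("\\s+"))
  .map((_, 1))
  .groupBy(0)
  .sum(1)
counts.print()   // submits the batch job and waits for the output
```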

Again, this might be a matter of semantics. You could argue that spark-shell (Scala), PySpark (Python), and SparkR (R) are batch too. Spark is said to be "lazy": when you create an object, Spark only records a pointer to it. It does not run any operation until you ask for something like count(), which requires materializing the object in order to measure it. Only then does Spark submit the work to its batch engine.
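A tiny spark-shell example makes the laziness visible; the numbers are placeholder data.

```scala
// Transformations are lazy; actions trigger execution.
val nums = sc.parallelize(1 to 1000000)
val squares = nums.map(n => n.toLong * n)   // nothing is computed yet:
                                            // Spark only records the lineage
val total = squares.count()                 // count() is an action, so the job
                                            // is submitted to the engine now
```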

Both frameworks, of course, support batch jobs, which is how most people would run their code once they have written and debugged it. With Flink, you can run Java and Scala code in batch. With Spark, you can run Java, Scala, and Python code in batch.

Support for Other Streaming Products

Cluster Operations

Conclusion

BigData & Cloud Practice

Abzooba is an AI and data company. The BD&C Practice is one of the fastest-growing groups in Abzooba, helping several Fortune 500 clients in their cognitive journey.