Spark Context, Spark Session, and the JVM

 Why do we have one SparkContext per JVM?

Handling a single data-intensive application is painful enough (tuning GC, dealing with leaking resources, communication overhead). Multiple Spark applications running in a single JVM would be impossible to tune and manage in the long run.


If one of the processes hangs, fails, or has its security compromised, the others are not affected.

Having separate runtimes also helps garbage collection, because each GC has fewer references to handle than it would if everything ran together.


What is the JVM?

The Java Virtual Machine (JVM) is the virtual machine that runs Java bytecode. The JVM doesn't understand Java source code; that's why you need to compile your *.java files into *.class files, which contain the bytecode the JVM understands. The JVM is also what makes Java a "portable language" (write once, run anywhere): there are specific implementations of the JVM for different systems (Windows, Linux, macOS; see the Wikipedia list), and the aim is that the same bytecode produces the same results on all of them.


Spark Concepts 

How are resources distributed when we create multiple Spark sessions in one Spark context?


When multiple Spark sessions are created within a single Spark context, they share the underlying resources allocated to that Spark context. The sessions provide isolated SQL environments but do not have separate physical resources. Resource management, including allocation and any dynamic adjustment, happens at the Spark context level and affects all sessions within the application.

In Apache Spark, SQL configurations are isolated across different Spark sessions to allow for different configurations and temporary views within the same Spark application. This isolation is achieved through the concept of session-scoped SQL configurations.


  • Separate Configurations: When you create a new SparkSession using the newSession() method on an existing SparkSession, the new session does not share SQL configurations with the parent session. Changes to configurations in one session, such as those set with the setConf method, do not affect the other sessions.
  • Temporary Views: Temporary views and databases created in one SparkSession are scoped to that session and are not visible in other sessions. This allows users to create temporary views with the same name in different sessions without conflict.
  • User-Defined Functions (UDFs): Similarly, UDFs registered in one SparkSession are not available in other sessions unless explicitly registered in each session.
  • Global Temporary Views: A temporary view created in one SparkSession is not visible in another SparkSession unless it is created as a global temporary view. Global temporary views are visible across all SparkSessions but are tied to the system-preserved database global_temp.
https://medium.com/@kar9475/data-sharing-between-multiple-spark-jobs-in-databricks-308687c99897


