Showing posts from March, 2024

Performance factors affecting Spark

Spark performance is sensitive to several factors:

- application code
- configuration settings
- data layout and storage
- multi-tenancy
- resource allocation and elasticity in cloud deployments such as Amazon EMR, Microsoft Azure, Google Dataproc, and Qubole

Tuning memory usage comes down to three costs: the amount of memory used by your objects, the cost of accessing those objects, and the overhead of garbage collection.
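Configuration settings are usually the first lever for the three memory costs above. A minimal sketch of memory-related Spark properties follows; the property names are real Spark configuration keys, but the values are illustrative placeholders to adapt per workload, not recommendations:

```python
# Illustrative memory-tuning properties. They can be passed on the command
# line via `spark-submit --conf key=value` or set on a SparkConf object.
memory_conf = {
    "spark.executor.memory": "4g",          # heap size of each executor JVM
    "spark.memory.fraction": "0.6",         # share of heap for execution + storage
    "spark.memory.storageFraction": "0.5",  # portion of the above protected from eviction
    # Kryo serialization produces smaller objects, reducing both the cost of
    # accessing them and the pressure on the garbage collector:
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
}

for key, value in sorted(memory_conf.items()):
    print(f"--conf {key}={value}")
```

Each entry maps to one of the costs listed above: executor memory bounds the total, the fraction settings split it between execution and caching, and the serializer shrinks the objects themselves.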

Spark Context, Spark Session and the JVM

Why do we have one Spark context per JVM? Handling a single data-intensive application is painful enough (tuning GC, dealing with leaking resources, communication overhead). Multiple Spark applications running in a single JVM would be impossible to tune and manage in the long run. With separate processes, if one of them hangs, fails, or has its security compromised, the others are not affected. Having separate runtimes also helps GC, because each collector has fewer references to handle than it would if everything ran together.

What is the JVM? The Java Virtual Machine (JVM) is the virtual machine that runs Java bytecode. The JVM doesn't understand Java source code; that's why you need to compile your *.java files into *.class files that contain the bytecode the JVM understands. It is also what makes Java a "portable language" (write once, run anywhere): there are specific implementations of the JVM for different systems (Windows, Linux, macOS, ...)
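The one-context-per-JVM rule shows up in Spark's API as `SparkContext.getOrCreate()`, which reuses the active context instead of building a second one. A minimal Python sketch of that singleton behavior follows; the `Context` class and its names are illustrative stand-ins, not Spark's actual implementation:

```python
class Context:
    """Illustrative stand-in for SparkContext: at most one instance per process."""

    _active = None  # process-wide slot holding the single active context

    def __init__(self, app_name):
        if Context._active is not None:
            raise RuntimeError("Only one Context may be active per process")
        self.app_name = app_name
        Context._active = self

    @classmethod
    def get_or_create(cls, app_name="default"):
        # Mirrors the getOrCreate pattern: reuse the active instance if any.
        return cls._active if cls._active is not None else cls(app_name)

    def stop(self):
        # Releasing the context allows a new one to be created afterwards.
        Context._active = None


a = Context.get_or_create("job-a")
b = Context.get_or_create("job-b")  # returns the existing context, not a new one
assert a is b
a.stop()
c = Context.get_or_create("job-c")  # a fresh context is allowed after stop()
assert c is not a
```

Constructing a second context directly while one is active raises an error, which is the same discipline Spark enforces so that two applications never share one JVM's runtime state.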