Spark Memory
Spark early memory architecture :
Local JVM
heap size
By default, Spark starts with 512MB JVM heap.
Spark allows to utilize only 90% of the heap(460), which is controlled by the spark.storage.safetyFraction parameter of Spark. Ok, as you might have heard of Spark as an in-memory tool, Spark allows you to store some data in memory.
So some amount of memory is reserved for the caching of the data you are processing, and this part is usually 60% of the safe heap, which is controlled by the spark.storage.memoryFraction parameter
how much data you can cache in Spark,
you should take the sum of all the heap sizes for all the executors, multiply it by safetyFraction and by storage.memoryFraction, and by default it is 0.9 * 0.6 = 0.54 or 54% of the total heap size you allow Spark to use.
shuffle memory
Heap Size” * spark.shuffle.safetyFraction * spark.shuffle.memoryFraction. Default value for spark.shuffle.safetyFraction is 0.8 or 80%, default value for spark.shuffle.memoryFraction is 0.2 or 20%. So finally you can use up to 0.8*0.2 = 0.16 or 16% of the JVM heap for the shuffle.
When the shuffle is performed, sometimes you as well need to sort the data. When you sort the data, you usually need a buffer to store the sorted data (remember, you cannot modify the data in the LRU cache in place as it is there to be reused later). So it needs some amount of RAM to store the sorted chunks of data.
Serialized data to deserialization - use- unroll memory
The amount of RAM that is allowed to be utilized by unroll process is spark.storage.unrollFraction * spark.storage.memoryFraction * spark.storage.safetyFraction, which with the default values equal to 0.2 * 0.6 * 0.9 = 0.108 or 10.8% of the heap. This is the memory that can be used when you are unrolling the block of data into the memory. Why do you need to unroll it after all? Spark allows you to store the data both in serialized and deserialized form. The data in serialized form cannot be used directly, so you have to unroll it before using, so this is the RAM that is used for unrolling. It is shared with the storage RAM, which means that if you need some memory to unroll the data,
Comments
Post a Comment