What is Catalyst optimizer?
Spark SQL handles both SQL queries and the DataFrame API.
The Catalyst optimizer uses advanced programming language features to build an extensible query optimizer.
The Catalyst optimizer supports both rule-based and cost-based optimization.
- In rule-based optimization, a set of rules determines how to execute the query.
- In cost-based optimization, multiple plans are generated using rules, the cost of each plan is computed, and the cheapest plan is chosen to carry out the SQL statement.
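To see the plans Catalyst produces for a query, you can call explain on a DataFrame. A minimal sketch (the DataFrame and the expressions here are made up purely for illustration):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("CatalystDemo").master("local[*]").getOrCreate()
val df = spark.range(1000).toDF("id")

// explain(true) prints the parsed, analyzed, and optimized logical plans plus the physical plan
df.filter(col("id") > 10).select(col("id") * 2).explain(true)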
What are the fundamentals of Catalyst Optimizer?
The Catalyst optimizer uses standard pattern matching. Catalyst contains trees and a set of rules to manipulate the trees, and it provides specific libraries to process relational queries. Analysis, logical optimization, physical planning, and code generation are the phases that compile parts of queries into Java bytecode.
Trees
The main data type is a tree, and a tree is composed of node objects. These objects are immutable. A node can have zero or more children, and new nodes are always defined as subclasses of the TreeNode class. Trees are manipulated using functional transformations.
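The idea can be illustrated with a highly simplified sketch. This is not Catalyst's real API, just a toy immutable expression tree and one constant-folding rule written with Scala pattern matching:

// A toy immutable expression tree, loosely mimicking the TreeNode idea
sealed trait Expr
case class Literal(value: Int) extends Expr
case class Attribute(name: String) extends Expr
case class Add(left: Expr, right: Expr) extends Expr

// A "rule": fold Add(Literal, Literal) into a single Literal, applied bottom-up
def constantFold(e: Expr): Expr = e match {
  case Add(l, r) =>
    (constantFold(l), constantFold(r)) match {
      case (Literal(a), Literal(b)) => Literal(a + b)
      case (fl, fr)                 => Add(fl, fr)
    }
  case other => other
}

// Example: x + (1 + 2) becomes x + 3, producing a new tree rather than mutating the old one
val folded = constantFold(Add(Attribute("x"), Add(Literal(1), Literal(2))))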
How will you run an application locally on 8 cores?
./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master local[8] \
/path/to/examples.jar \
100
How will you run on a Spark standalone cluster in client deploy mode?
./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master spark://207.184.161.138:7077 \
--executor-memory 20G \
--total-executor-cores 100 \
/path/to/examples.jar \
1000
How will you run on a Spark standalone cluster in cluster deploy mode with supervise?
./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master spark://207.184.161.138:7077 \
--deploy-mode cluster \
--supervise \
--executor-memory 20G \
--total-executor-cores 100 \
/path/to/examples.jar \
1000
Run on a YARN cluster
export HADOOP_CONF_DIR=XXX
./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master yarn \
--deploy-mode cluster \  # can be client for client mode
--executor-memory 20G \
--num-executors 50 \
/path/to/examples.jar \
1000
Run a Python application on a Spark standalone cluster
./bin/spark-submit \
--master spark://207.184.161.138:7077 \
examples/src/main/python/pi.py \
1000
Run on a Mesos cluster in cluster deploy mode with supervise
./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master mesos://207.184.161.138:7077 \
--deploy-mode cluster \
--supervise \
--executor-memory 20G \
--total-executor-cores 100 \
http://path/to/examples.jar \
1000
Run on a Kubernetes cluster in cluster deploy mode
./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master k8s://xx.yy.zz.ww:443 \
--deploy-mode cluster \
--executor-memory 20G \
--num-executors 50 \
http://path/to/examples.jar \
1000
Do I need Hadoop to run Spark?
Answer) No. However, if we run on a cluster, we need some form of shared file system (for example, an NFS mount at the same path on each node). If we have this type of filesystem, we can simply deploy Spark in standalone mode.
What is the default level of parallelism in Spark?
Answer) The default level of parallelism is the number of partitions Spark uses when one is not specified explicitly; it is controlled by the spark.default.parallelism setting.
Is it possible to have multiple SparkContexts in a single JVM?
Answer) Yes, if spark.driver.allowMultipleContexts is set to true (default: false). When it is true, multiple SparkContexts can run in a single JVM.
What is the advantage of broadcasting values across a Spark cluster?
Answer) Spark transfers the value to each executor only once, and tasks can then share it without incurring repetitive network transmission.
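A minimal sketch of using a broadcast variable (the lookup map and values here are made up):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("BroadcastDemo").setMaster("local[*]"))
val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2, "c" -> 3))

// Each executor receives the map once; tasks read it via .value
val codes = sc.parallelize(Seq("a", "b", "c", "a")).map(k => lookup.value.getOrElse(k, 0))
println(codes.collect().mkString(","))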
How do you disable INFO messages when running a Spark application?
Answer) Navigate to the $SPARK_HOME/conf directory and modify the log4j.properties file, changing the log level from INFO to ERROR.
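For example, in Spark distributions that ship conf/log4j.properties.template, copying it to log4j.properties and changing the root level is usually enough (newer versions use log4j2, so the file name and syntax may differ):

# in $SPARK_HOME/conf/log4j.properties
log4j.rootCategory=ERROR, console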
In your project, you have access to a cluster of 12 nodes, where each node has 2 Intel(R) Xeon(R) CPU E5-2650 2.00GHz processors and each processor has 8 cores. How will you tune the application and observe its performance?
For tuning, we have to consider the points below:
1) Monitor the application: check whether your cluster is under-utilized and how many resources your application actually uses. Monitoring can be done with tools such as Ganglia, where you can see CPU, memory, and network usage.
2) Serialization: decide what kind of serialization is needed and how much driver memory and executor memory your application requires. These parameters can be tuned based on your requirements, for example:
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.driver.memory 5g
spark.executor.memory 3g
spark.executor.extraJavaOptions -XX:MaxPermSize=2G -XX:+UseG1GC
spark.driver.extraJavaOptions -XX:MaxPermSize=6G -XX:+UseG1GC
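The same settings can also be supplied programmatically through SparkConf, as in this sketch (the values are examples only and should be adapted to your workload):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("TunedApp")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.executor.memory", "3g")
  .set("spark.driver.memory", "5g")  // only effective if set before the driver JVM starts (e.g. via spark-submit)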
What does lazy evaluation of an RDD mean?
Answer) Lazy evaluation means the data inside an RDD is not loaded or transformed until an action is executed; the action triggers the actual computation.
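A small sketch that shows laziness in practice (assumes an existing SparkContext named sc):

// Nothing is computed yet: map only records the transformation in the lineage
val numbers = sc.parallelize(1 to 1000000)
val doubled = numbers.map(_ * 2)

// The action triggers the actual evaluation of the whole chain
val total = doubled.reduce(_ + _)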
How would you control the number of partitions of an RDD?
Answer) By using the repartition or coalesce operations.
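For example (assuming an existing SparkContext named sc):

val rdd = sc.parallelize(1 to 100, 4)
println(rdd.getNumPartitions)     // 4

val more = rdd.repartition(8)     // full shuffle, redistributes data into 8 partitions
println(more.getNumPartitions)    // 8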
Data is spread across all the nodes of the cluster; how does Spark try to process this data?
Answer) By default, Spark tries to read data into an RDD from the nodes that are close to it (data locality). To optimize transformation operations, Spark creates partitions to hold the data chunks.
What is coalesce transformation?
Answer) The coalesce transformation is used to change the number of partitions. It can trigger RDD shuffling depending on its shuffle boolean parameter (which defaults to false).
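A short sketch of the shuffle flag's effect (assuming an existing SparkContext named sc):

val rdd = sc.parallelize(1 to 100, 8)

val fewer  = rdd.coalesce(2)                  // no shuffle: partitions are merged locally
val spread = rdd.coalesce(16, shuffle = true) // shuffle = true allows increasing the partition count

println(fewer.getNumPartitions)   // 2
println(spread.getNumPartitions)  // 16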
What is Shuffling?
Answer) Shuffling is the process of repartitioning data across partitions; it may move data across JVMs or even across the network when the data is redistributed among executors.
What is the difference between groupByKey and reduceByKey?
Answer) We should avoid groupByKey and use reduceByKey or combineByKey instead.
groupByKey shuffles all the data, which is slow.
reduceByKey first combines values with the same key within each partition and shuffles only those sub-aggregated results.
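For example, a word count with both approaches (assuming an existing SparkContext named sc):

val words = sc.parallelize(Seq("a", "b", "a", "c", "b", "a")).map(w => (w, 1))

// Preferred: partial sums are computed per partition, then only those are shuffled
val counts = words.reduceByKey(_ + _)

// Works, but ships every (word, 1) pair across the network before summing
val countsViaGroup = words.groupByKey().mapValues(_.sum)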
What is checkpointing?
Answer) Checkpointing is the process of truncating an RDD's lineage graph and saving the RDD to HDFS. RDD checkpointing saves the actual intermediate RDD data to a reliable distributed file system.
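A minimal sketch (the checkpoint directory path is made up; it should point at reliable storage such as HDFS, and an existing SparkContext named sc is assumed):

sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")   // hypothetical path

val rdd = sc.parallelize(1 to 100).map(_ * 2)
rdd.checkpoint()   // marks the RDD for checkpointing

rdd.count()        // the first action materializes the RDD and writes the checkpoint; the lineage is then truncated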
What is a stage, with regard to Spark job execution?
Answer)A stage is a set of parallel tasks, one per partition of an RDD, that compute partial results of a function executed as part of a Spark job.
What is speculative execution of tasks?
Answer) Speculative execution is a health-check procedure that monitors task execution time. If a task is found to be running slowly, a new copy of that task is launched in parallel on another worker.
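Speculation is controlled by configuration; a sketch of enabling it via SparkConf (the thresholds shown are examples only):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.speculation", "true")            // enable speculative execution
  .set("spark.speculation.multiplier", "1.5")  // a task this many times slower than the median is a candidate
  .set("spark.speculation.quantile", "0.75")   // fraction of tasks that must finish before speculation starts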
Which cluster managers can be used with Spark?
Answer) Spark standalone, Apache Mesos, Hadoop YARN, and Kubernetes (as shown in the spark-submit examples above).
Hadoop uses replication to achieve fault tolerance. How is this achieved in Apache Spark?
The data storage model in Apache Spark is based on RDDs. RDDs achieve fault tolerance through lineage: if a partition is lost, Spark can recompute it from the source data by replaying the recorded transformations.
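You can inspect the recorded lineage of an RDD with toDebugString (assuming an existing SparkContext named sc):

val rdd = sc.parallelize(1 to 100)
  .map(_ * 2)
  .filter(_ % 3 == 0)

// Prints the chain of transformations Spark would replay to recompute lost partitions
println(rdd.toDebugString)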
What are the various levels of persistence in Apache Spark?
Apache Spark automatically persists the intermediary data from various shuffle operations; however, it is often suggested that users call the persist() method on an RDD they plan to reuse. Spark has various persistence levels to store RDDs on disk, in memory, or as a combination of both, with different replication levels.
The various storage/persistence levels in Spark are:
- MEMORY_ONLY
- MEMORY_ONLY_SER
- MEMORY_AND_DISK
- MEMORY_AND_DISK_SER
- DISK_ONLY
- OFF_HEAP
What is the difference between persist() and cache()?
persist() allows the user to specify the storage level, whereas cache() uses the default storage level (MEMORY_ONLY).
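For example (assuming an existing SparkContext named sc):

import org.apache.spark.storage.StorageLevel

val a = sc.parallelize(1 to 1000).map(_ * 2)
a.cache()                                    // uses the default level, MEMORY_ONLY

val b = sc.parallelize(1 to 1000).map(_ * 3)
b.persist(StorageLevel.MEMORY_AND_DISK_SER)  // the caller picks the storage level explicitly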
What are the various data sources available in SparkSQL?
- Parquet file
- JSON Datasets
- Hive tables
What is the advantage of a Parquet file?
A Parquet file is a columnar-format file that helps to:
- Limit I/O operations
- Consume less space
- Fetch only the required columns
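A short sketch of reading Parquet with column pruning (the path and column names are made up; assumes an existing SparkSession named spark):

// Only the requested columns are read from the columnar file
val users = spark.read.parquet("/data/users.parquet")   // hypothetical path
users.select("id", "name").show()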