Spark Objective

Question: Select the API used for performing ChiSquare tests on features in Spark.

ChiSquareSelector API

ChiSqSelector API

Ans:- ChiSqSelector API

Question: Which of these is not an advantage of Hadoop over Spark?

Better cost-efficiency

In-memory computation capabilities

Ans:- In-memory computation capabilities

Question: What makes Apache Spark stand out compared to other platforms?

Apache Spark is easily integrated with multiple programming languages and frameworks

Only Apache Spark supports Hadoop integration

Ans:- Apache Spark is easily integrated with multiple programming languages and frameworks

Question: Which statement best describes possible sources for RDD?

RDD can be created from a parquet or .json file, a Hive table, or a JDBC database

RDD can be created from a .json file, a Hive table, or a MongoDB database

Ans:- RDD can be created from a parquet or .json file, a Hive table, or a JDBC database

Question: Which statement best describes RDDs?

RDD is a fault-tolerant collection of data running on a single machine

RDD is a fault-tolerant collection of data which is distributed across multiple nodes in a cluster

Ans:- RDD is a fault-tolerant collection of data which is distributed across multiple nodes in a cluster

Question: Which statement best describes how an RDD is created?

RDD can only be created by providing a database in Hadoop supported format

Create an RDD by calling the parallelize method on an existing collection, by providing an external dataset, or from an existing RDD

Ans:- Create an RDD by calling the parallelize method on an existing collection, by providing an external dataset, or from an existing RDD

Question: Match RDD operations with the correct descriptions.

Answer Options:
A:Transformations
B:Actions

Operations which trigger execution of RDD functionality

A

B
Ans:- B

Operations which accept RDD as an input and may produce multiple output RDDs

A

B
Ans:- A

Operations which send the execution results to the driver

A

B
Ans:- B

Lazy RDD operations combining multiple partitions into a single RDD or vice versa

A

B
Ans:- A

Question: Identify the statement that best describes the Apache Spark Data Frame sources.

Hive tables, structured data files, ORC or AVRO files

Structured data files, pandas Data Frames and Hive tables

Ans:- Hive tables, structured data files, ORC or AVRO files

Question: Which statement best describes a Resilient Distributed Data Frame?

Apache Spark Python API for working with Resilient Distributed Datasets

Named column dataset processed on a single cluster which supports multiple file formats, frameworks and languages

Ans:- Named column dataset processed on a single cluster which supports multiple file formats, frameworks and languages

Question: Which of the following cannot be used to create a Resilient Distributed Data Frame?

An .xls file

A .json file

Ans:- An .xls file

Question: Select the reasons for using ML pipelines instead of a straightforward approach.

ML pipeline is another term for long sequential code that trains an ML model

ML pipelines can quickly iterate through multiple combinations of features and models

ML pipelines provide a reusable declarative interface and a high-level overview of the workflow

Ans:-

ML pipelines can quickly iterate through multiple combinations of features and models

ML pipelines provide a reusable declarative interface and a high-level overview of the workflow

Question: Which statement best describes what an ML Pipeline is?

ML pipeline is a combination of untrained estimators and transformers used to create an ML workflow

ML pipeline is the process of fitting data which creates a functional ML model

Ans:- ML pipeline is a combination of untrained estimators and transformers used to create an ML workflow

Question: Which object is not used to create an ML pipeline?

DataFrame

HashingTF

Ans:- DataFrame

Question: RDD in Spark stands for what?

Resilient Distributed Datasets

Recursive Data Distributions

Ans:- Resilient Distributed Datasets

Question: In which ways can a deep learning solution be deployed in Spark?

By registering a UDF from a saved model file

Using TensorRT inference server

As a SparkSQL UDF

Ans:-

By registering a UDF from a saved model file

As a SparkSQL UDF

Question: When was Apache Spark made open source?

2011

2010

Ans:- 2010

Question: What are some of the characteristics that define a Tuple in Apache Storm?

It is used to perform simple stream transformations

It is the fundamental unit of data that is transferred from node to node

It is a predefined named list of fields whose values can be of any type

Ans:-

It is the fundamental unit of data that is transferred from node to node

It is a predefined named list of fields whose values can be of any type

Question: What Spout classification operates on an at-most-once principle and does not have the ability to replay tuples?

Unistream

Unreliable

Ans:- Unreliable

Question: What are the features of streams in Apache Storm?

They are a container of Bolts that partition Tuples

They can be processed by single or multiple types of Bolts

They are a sequence of tuples that are handled in a parallel fashion

Ans:-

They can be processed by single or multiple types of Bolts

They are a sequence of tuples that are handled in a parallel fashion

Question: Which technique is used by Apache Spark?

In-memory computation

Static resource allocation

Ans:- In-memory computation

Question: Which is not a benefit of Apache Spark?

Reusable codebase

More people are skilled in Spark compared to Hadoop

Ans:- More people are skilled in Spark compared to Hadoop

Question: When would you use a standalone single server ZooKeeper setup?

When an odd number of nodes are needed

On, or for, a development system

Ans:- On, or for, a development system

Question: What ZooKeeper command starts a new node called zk_test_node in the root directory?

create /zk_test_node somedata

set /zk_test_node somedata

Ans:- create /zk_test_node somedata

Question: What file name should the ZooKeeper configuration file be given (assuming you are using the default expected by ZooKeeper)?

zoo.cfg

zoo_sample.conf

Ans:- zoo.cfg
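For reference, a minimal standalone zoo.cfg follows the shape below (tickTime and clientPort are the defaults from the ZooKeeper Getting Started guide; the dataDir value is illustrative):

```
tickTime=2000
dataDir=/tmp/zookeeper
clientPort=2181
```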

Question: Which is not a feature of Spark NLP?

Sentiment detection

Use of noise-contrastive estimation loss

Ans:- Use of noise-contrastive estimation loss

Question: SURF and SIFT detectors for images are used to detect what?

Primarily local features

Primarily color features

Ans:- Primarily local features

Question: Which command starts the Nimbus daemon from the storm binaries directory?

binaries/nimbus -daemon

bin/storm nimbus

Ans:- bin/storm nimbus

Question: What CLI command would modify the parallelism of a topology called myTopo on the fly?

storm rebalance myTopo -n 7 -e spt=5 -e blt=12

storm modify -tp myTopo -n 7 -e spt=5 -e blt=12

Ans:- storm rebalance myTopo -n 7 -e spt=5 -e blt=12

Question: How many executors will be used in this sample Storm Topology?


TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("word", new TestWordSpout(), 10);
builder.setBolt("exclaim1", new ExclamationBolt(), 3)
       .shuffleGrouping("word");
builder.setBolt("exclaim2", new ExclamationBolt(), 2)
       .shuffleGrouping("exclaim1");

15

10

60

Ans:- 15 (10 spout executors + 3 + 2 bolt executors)

Question: Which Stream Grouping Type effectively serializes the processing of the tuples?

Partial key grouping

Global grouping

Ans:- Global grouping

Question: What stream grouping type should be added to the last Bolt to group the word tuples from the SplitSentence bolt such that tuples with the same word are always routed to the same task?


TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("spout", new RandomSentenceSpout(), 5);
builder.setBolt("split", new SplitSentence(), 8)
       .shuffleGrouping("spout");
builder.setBolt("count", new WordCount(), 12).

globalGrouping("split", new Fields("word"));

fieldsGrouping("split", new Fields("word"));

Ans:- fieldsGrouping("split", new Fields("word"));

Question: Which of these is not a feature of distributed systems?

Fault tolerance

Static system structure

Ans:- Static system structure

Question: Why are memory-based systems considered better than disk-based systems?

Memory-based systems are quite fast

Memory-based systems can store information for a long time

Ans:- Memory-based systems are quite fast

Question: A transformation in Spark RDD___.

Takes an RDD as input and produces one or more RDDs as output

Sends results from executors to the driver

Ans:- Takes an RDD as input and produces one or more RDDs as output

Question: What does Storm use to guarantee that messages from a spout are fully processed?

Multicasting each message

Unique message IDs

Tuple tree timeouts

Acknowledgements

Ans:-

Unique message IDs

Acknowledgements

Question: In which ways can Storm recover from a failure of a worker?
Instruction: Choose all options that best answer the question.

The Supervisor can restart the worker

It lets it run sub-optimally and uses another worker to verify the output

If the worker continuously fails, the Nimbus server can reassign the worker to another node

Ans:-

The Supervisor can restart the worker

If the worker continuously fails, the Nimbus server can reassign the worker to another node

Question: Match the features to either Storm Core or Storm Trident.

Answer Options:
A:Event-streaming processing
B:Micro-batching processing
C:Sub-second Latency
D:Seconds Latency
E:Does not support stateful operations
F:Supports stateful operations

Storm Core

A

B

C

D

E

F
Ans:- A,C,E

Storm Trident

A

B

C

D

E

F
Ans:- B,D,F

Question: How are streams processed in a Trident topology?

Stream aggregations are performed as a single read and write request

Streams are processed as a series of micro-batches

Streams are partitioned among various nodes in a cluster

Stream operations are applied in parallel across each partition

Ans:-

Streams are processed as a series of micro-batches

Streams are partitioned among various nodes in a cluster

Stream operations are applied in parallel across each partition

Question: What are some of the stream operations available in Storm Trident?

Projection-Filter

Persistence

Repartitioning

Partition-local

Aggregation

Ans:-

Repartitioning

Partition-local

Aggregation