Question: Select the API used for performing ChiSquare tests on features in Spark.
ChiSquareSelector API
ChiSqSelector API
Ans:- ChiSqSelector API
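For context: ChiSqSelector ranks each categorical feature against the label by its chi-square statistic. A toy pure-Python illustration of that statistic (not the Spark API itself):

```python
# Toy illustration (not the Spark API): the chi-square statistic that
# ChiSqSelector uses to score a feature against the label.
def chi_square(observed):
    """observed: 2-D contingency table (feature value x label) as nested lists."""
    total = sum(sum(row) for row in observed)
    row_sums = [sum(row) for row in observed]
    col_sums = [sum(col) for col in zip(*observed)]
    stat = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            expected = row_sums[i] * col_sums[j] / total
            stat += (obs - expected) ** 2 / expected
    return stat

# A feature whose values split evenly across labels scores near zero;
# a skewed split scores higher, marking the feature as more informative.
print(chi_square([[10, 20], [20, 10]]))  # ≈ 6.667
```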
Question: Which of these is not an advantage of Hadoop over Spark?
Better cost-efficiency
In-memory computation capabilities
Ans:- In-memory computation capabilities
Question: What makes Apache Spark stand out compared to other platforms?
Apache Spark is easily integrated with multiple programming languages and frameworks
Only Apache Spark supports Hadoop integration
Ans:- Apache Spark is easily integrated with multiple programming languages and frameworks
Question: Which statement best describes possible sources for RDD?
RDD can be created from a parquet or .json file, a Hive table, or a JDBC database
RDD can be created from a .json file, a Hive table, or a MongoDB database
Ans:- RDD can be created from a parquet or .json file, a Hive table, or a JDBC database
Question: Which statement best describes RDDs?
RDD is a fault-tolerant dataset running on a single machine
RDD is a fault-tolerant dataset which is distributed across multiple nodes in a cluster
Ans:- RDD is a fault-tolerant dataset which is distributed across multiple nodes in a cluster
Question: Which statement best describes how an RDD is created?
RDD can only be created by providing a database in Hadoop supported format
Create an RDD by calling the parallelize method on an existing collection, by providing an external dataset, or from an existing RDD
Ans:- Create an RDD by calling the parallelize method on an existing collection, by providing an external dataset, or from an existing RDD
Question: Match RDD operations with the correct descriptions.
Answer Options:
A:Transformations
B:Actions
Operations which trigger execution of RDD functionality
A
B
Ans:- B
Operations which accept RDD as an input and may produce multiple output RDDs
A
B
Ans:- A
Operations which send the execution results to the driver
A
B
Ans:- B
Lazy RDD operations combining multiple partitions into a single RDD or vice versa
A
B
Ans:- A
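The transformation/action split above can be illustrated with a plain-Python analogy (not Spark itself): generator pipelines are lazy like transformations, and only a terminal call forces evaluation, like an action.

```python
# Plain-Python analogy (not Spark): transformations are lazy, actions execute.
data = range(1, 6)

# "Transformations": nothing is computed yet, just a description of work.
doubled = (x * 2 for x in data)        # like rdd.map(lambda x: x * 2)
big = (x for x in doubled if x > 4)    # like rdd.filter(lambda x: x > 4)

# "Action": consuming the pipeline triggers the whole computation at once.
result = list(big)                     # like rdd.collect()
print(result)  # [6, 8, 10]
```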
Question: Identify the statement that best describes the Apache Spark Data Frame sources.
Hive tables, structured data files, ORC or AVRO files
Structured data files, pandas Data Frames and Hive tables
Ans:- Hive tables, structured data files, ORC or AVRO files
Question: Which statement best describes a Resilient Distributed Data Frame?
Apache Spark Python API for working with Resilient Distributed Datasets
Named column dataset processed on a single cluster which supports multiple file formats, frameworks and languages
Ans:- Named column dataset processed on a single cluster which supports multiple file formats, frameworks and languages
Question: Which of the following cannot be used to create a Resilient Distributed Data Frame?
An .xls file
A .json file
Ans:- An .xls file
Question: Select the reasons for using ML pipelines instead of a straightforward approach.
ML pipeline is another word for a long sequential code training an ML model
ML pipelines can quickly iterate through multiple combinations of features and models
ML pipelines create a reusable declarative interface and a high-level overview of the workflow
Ans:-
ML pipelines can quickly iterate through multiple combinations of features and models
ML pipelines create a reusable declarative interface and a high-level overview of the workflow
Question: Which statement best describes what an ML Pipeline is?
ML pipeline is a combination of untrained estimators and transformers used to create ML workflow
ML pipeline is the process of fitting data which creates a functional ML model
Ans:- ML pipeline is a combination of untrained estimators and transformers used to create ML workflow
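A minimal pure-Python sketch of the estimator/transformer pattern that answer describes (a toy, not the Spark ML API): a transformer maps a dataset to a new dataset, an estimator's fit() produces a trained model that is itself a transformer, and a pipeline is just the stages chained in order.

```python
# Toy sketch of the estimator/transformer pattern (not the Spark ML API).
class Lowercase:                    # a transformer: dataset -> dataset
    def transform(self, docs):
        return [d.lower() for d in docs]

class VocabEstimator:               # an estimator: fit() -> trained model
    def fit(self, docs):
        vocab = sorted({w for d in docs for w in d.split()})
        return VocabModel(vocab)

class VocabModel:                   # the fitted model is itself a transformer
    def __init__(self, vocab):
        self.vocab = vocab
    def transform(self, docs):
        return [[d.split().count(w) for w in self.vocab] for d in docs]

# "Fitting the pipeline" turns estimators into models, leaving a
# reusable sequence of pure transformers for new data.
docs = ["Spark is fast", "spark is simple"]
stage1 = Lowercase()
model = VocabEstimator().fit(stage1.transform(docs))
print(model.transform(stage1.transform(docs)))  # term counts per document
```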
Question: Which object is not used to create an ML pipeline?
DataFrame
HashingTF
Ans:- DataFrame
Question: RDD in Spark stands for what?
Resilient Distributed Datasets
Recursive Data Distributions
Ans:- Resilient Distributed Datasets
Question: In which ways can a deep learning solution be deployed in Spark?
By registering a UDF from a saved model file
Using TensorRT inference server
As a SparkSQL UDF
Ans:-
By registering a UDF from a saved model file
As a SparkSQL UDF
Question: When was Apache Spark made open source?
2011
2010
Ans:- 2010
Question: What are some of the characteristics that define a Tuple in Apache Storm?
It is used to perform simple stream transformations
It is the fundamental unit of data that is transferred from node to node
It is a predefined named list of fields whose values can be of any type
Ans:-
It is the fundamental unit of data that is transferred from node to node
It is a predefined named list of fields whose values can be of any type
Question: What Spout classification operates on an at-most-once principle and does not have the ability to replay tuples?
Unistream
Unreliable
Ans:- Unreliable
Question: What are the features of streams in Apache Storm?
They are a container of Bolts that partition Tuples
They can be processed by single or multiple types of Bolts
They are a sequence of tuples that are handled in a parallel fashion
Ans:-
They can be processed by single or multiple types of Bolts
They are a sequence of tuples that are handled in a parallel fashion
Question: Which technique is used by Apache Spark?
In-memory computation
Static resource allocation
Ans:- In-memory computation
Question: Which is not a benefit of Apache Spark?
Reusable codebase
More people are skilled in Spark compared to Hadoop
Ans:- More people are skilled in Spark compared to Hadoop
Question: When would you use a standalone, single-server ZooKeeper setup?
When an odd number of nodes are needed
On a development system
Ans:- On a development system
Question: What ZooKeeper command starts a new node called zk_test_node in the root directory?
create /zk_test_node somedata
set /zk_test_node somedata
Ans:- create /zk_test_node somedata
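For context, a minimal zkCli session (assuming a ZooKeeper server is already running on the default port) that creates, reads, and updates such a node might look like:

```
# Inside bin/zkCli.sh, connected to a running server:
create /zk_test_node somedata    # create the znode with initial data
get /zk_test_node                # read it back -> somedata
set /zk_test_node newdata        # 'set' only updates an existing znode
```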
Question: What file name should the ZooKeeper configuration file be given (assuming you are using the default expected by ZooKeeper)?
zoo.cfg
zoo_sample.conf
Ans:- zoo.cfg
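A minimal zoo.cfg for such a standalone setup (the path and values here are illustrative) might contain:

```
# zoo.cfg - minimal standalone configuration (illustrative values)
tickTime=2000                 # basic time unit in milliseconds
dataDir=/var/lib/zookeeper    # where snapshots are stored
clientPort=2181               # port clients connect to
```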
Question: Which is not a feature of Spark NLP?
Sentiment detection
Use of noise-contrastive estimation loss
Ans:- Use of noise-contrastive estimation loss
Question: SURF and SIFT detectors for images are used to detect what?
Primarily local features
Primarily color features
Ans:- Primarily local features
Question: Which command starts the Nimbus daemon from the storm binaries directory?
binaries/nimbus -daemon
bin/storm nimbus
Ans:- bin/storm nimbus
Question: What CLI command would modify the parallelism of a topology called myTopo on the fly?
storm rebalance myTopo -n 7 -e spt=5 -e blt=12
storm modify -tp myTopo -n 7 -e spt=5 -e blt=12
Ans:- storm rebalance myTopo -n 7 -e spt=5 -e blt=12
Question: How many executors will be used in this sample Storm Topology?
…
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("word", new TestWordSpout(), 10);
builder.setBolt("exclaim1", new ExclamationBolt(), 3).
shuffleGrouping("word");
builder.setBolt("exclaim2", new ExclamationBolt(), 2).
shuffleGrouping("exclaim1");
…
15
10
60
Ans:- 15
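The answer follows from summing the parallelism hints passed to setSpout and setBolt (10 + 3 + 2); a quick check:

```python
# Each parallelism hint is the number of executors for that component.
parallelism_hints = {"word": 10, "exclaim1": 3, "exclaim2": 2}
total_executors = sum(parallelism_hints.values())
print(total_executors)  # 15
```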
Question: Which Stream Grouping Type effectively serializes the processing of the tuples?
Partial key grouping
Global grouping
Ans:- Global grouping
Question: What stream grouping type should be added to the last Bolt to group the word tuples from the SplitSentence bolt such that each word tuple is passed to a specific task?
…
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("spout", new RandomSentenceSpout(), 5);
builder.setBolt("split", new SplitSentence(), 8).
shuffleGrouping("spout");
builder.setBolt("count", new WordCount(), 12).
globalGrouping("split", new Fields("word"));
fieldsGrouping("split", new Fields("word"));
Ans:- fieldsGrouping("split", new Fields("word"));
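Why fieldsGrouping is the right choice can be sketched in plain Python (a toy model, not Storm's implementation): tuples are routed by a hash of the grouping field, so equal field values always land on the same task.

```python
import zlib

# Toy model of fieldsGrouping (not Storm's actual hash): route each tuple
# to a task by hashing the grouping field, so every occurrence of the
# same word reaches the same task and a per-word count stays correct.
NUM_TASKS = 12  # matches the WordCount bolt's parallelism hint above

def fields_grouping(word, num_tasks=NUM_TASKS):
    return zlib.crc32(word.encode()) % num_tasks

# Same word, same task -- which is what a running count per word needs.
assert fields_grouping("storm") == fields_grouping("storm")
# shuffleGrouping, by contrast, would spread copies of a word randomly.
```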
Question: Which of these is not a feature of distributed systems?
Fault tolerance
Static system structure
Ans:- Static system structure
Question: Why are memory-based systems considered better than disk-based systems?
Memory-based systems are quite fast
Memory-based systems can store information for a long time
Ans:- Memory-based systems are quite fast
Question: A transformation on a Spark RDD ___.
Takes an RDD as input and produces one or more RDDs as outputs
Sends results from executors to the driver
Ans:- Takes an RDD as input and produces one or more RDDs as outputs
Question: What does Storm use to guarantee messages from a spout that are fully processed?
Multicasting each message
Unique message IDs
Tuple tree timeouts
Acknowledgements
Ans:-
Unique message IDs
Acknowledgements
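A toy sketch (not Storm's internals) of how unique IDs plus acknowledgements give that guarantee: the spout remembers every emitted tuple by its ID until an ack arrives, and anything still pending can be replayed.

```python
import itertools

# Toy sketch of at-least-once delivery via IDs + acks (not Storm's internals).
class Spout:
    def __init__(self):
        self._ids = itertools.count()
        self.pending = {}                # message id -> tuple awaiting an ack

    def emit(self, tup):
        msg_id = next(self._ids)         # unique ID tags the tuple tree
        self.pending[msg_id] = tup
        return msg_id

    def ack(self, msg_id):
        self.pending.pop(msg_id, None)   # fully processed: forget it

    def replayable(self):
        return list(self.pending.values())  # unacked tuples can be re-emitted

spout = Spout()
a = spout.emit("sentence one")
b = spout.emit("sentence two")
spout.ack(a)
print(spout.replayable())  # ['sentence two'] -- only the unacked tuple
```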
Question: In which ways can Storm recover from a failure of a worker?
Instruction: Choose all options that best answer the question.
The Supervisor can restart the worker
It lets it run sub-optimally and uses another worker to verify the output
If the worker continuously fails the Nimbus server can reassign the worker to another node
Ans:-
The Supervisor can restart the worker
If the worker continuously fails the Nimbus server can reassign the worker to another node
Question: Match the features to either Storm Core or Storm Trident.
Answer Options:
A:Event-streaming processing
B:Micro-batching processing
C:Sub-second Latency
D:Seconds Latency
E:Does not support stateful operations
F:Supports stateful operations
Storm Core
A
B
C
D
E
F
Ans:- A,C,E
Storm Trident
A
B
C
D
E
F
Ans:- B,D,F
Question: How are streams processed in a Trident topology?
Stream aggregations are performed as a single read and write request
Streams are processed as a series of micro-batches
Streams are partitioned among various nodes in a cluster
Stream operations are applied in parallel across each partition
Ans:-
Streams are processed as a series of micro-batches
Streams are partitioned among various nodes in a cluster
Stream operations are applied in parallel across each partition
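The micro-batching model above can be sketched in plain Python (a toy, not Trident): the stream is cut into small batches, and each operation is then applied batch by batch.

```python
# Toy sketch of Trident-style micro-batching (not the real implementation).
def micro_batches(stream, batch_size):
    batch = []
    for item in stream:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch                  # final partial batch

# Each batch is processed as a unit, e.g. a per-batch aggregation:
stream = range(1, 8)
sums = [sum(b) for b in micro_batches(stream, 3)]
print(sums)  # [6, 15, 7] -- batches [1,2,3], [4,5,6], [7]
```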
Question: What are some of the stream operations available in Storm Trident?
Projection-Filter
Persistence
Repartitioning
Partition-local
Aggregation
Ans:-
Repartitioning
Partition-local
Aggregation