What are the key / value stores (opaque)?
  • Keys are mapped to values
  • Values are treated as BLOBs (opaque data)
  • No type information is stored ■
  • Values can be heterogenous
What are the Document stores (non-shaped) ?
  • Keys are mapped to documents
  • Documents consist of attributes
  • Attributes are name/typed value pairs, which may be nested
  • Type information is stored per attribute
  • Documents can be heterogenous
  • Documents may be organised in collections or databases
  • Relational databases
  • Tables (relations) consist of rows and columns
  • Columns have a type. Type information is stored once per column
  • A rows contains just values for a record (no type information)
  • All rows in a table have the same columns and are homogenous
What are Row-oriented storage and Column-oriented storage?
  • Row-oriented storage is especially inefficient when only a small amout of columns is needed but the table has many columns

Column-oriented storage

  • Column-oriented databases primarily work on columns
  • All columns are treated individually
  • Values of a single column are stored contiguously
  • This allows array-processing the values of a column
  • Rows may be constructed from column values later if required
  • This means column stores can still produce row output (tables)
  • Values from multiple columns need to be retrieved and assembled for that, making implementation of bit more complex
  • Query processors in columnar databases work on columns, too
  • Column stores can greatly improve the performance of queries that only touch a small amount of columns
  • This is because they will only access these columns’ particular data
  • Simple math: table t has a total of 10 GB data, with
  • column a: 4 GB
  • column b: 2 GB
  • column c: 3 GB
  • column d: 1 GB
  • If a query only uses column d, at most 1 GB of data will be processed by a column store

= In a row store the full 10 GB will ho processed

  • Column stores store data in column-specific files
  • Simplest case: one datafile per column
  • Row values for each column are stored contiguously
  • All data within each column datafile have the same type, making it ideal for compression
  • Usually a much better compression factor can be achieved for single columns than for entire rows
  • Compression allows reducing disk I/O when reading/writing column data but has some CPU cost
  • For data sets bigger than the memory size compression is often beneficial because disk access is slower than decompression
  • Reading all columns of a row is an expensive operation in a column store, so full row tuple construction is avoided or delayed as much as possible internally
  • Updating or inserting rows may also be very expensive and may cost much more time than in a row store
  • Some column stores are hybrids, with read-optimised (column) storage and write-optimised OLTP storage
  • Still, column stores are not really made for OLTP workloads, and if you need to work with many columns at once, you’ll pay a price in a column store
  • An Amazon Redshift data warehouse is an enterprise-class relational database query and management system.
  • An Amazon Redshift data warehouse is a collection of computing resources called nodes, which are organized into a group called a cluster. Each cluster runs an Amazon Redshift engine and contains one or more databases.
  • Amazon Redshift supports client connections with many types of applications, including business intelligence (Bl), reporting, data, and analytics tools.
  • When you execute analytic queries, you are retrieving, comparing, and evaluating large amounts of data in multiple-stage operations to produce a final result.
  • Amazon Redshift achieves efficient storage and optimum query performance through a combination of massively parallel processing, columnar data storage, and very efficient, targeted data compression encoding schemes.
  • Amazon Redshift is based on PostgreSQL 8.0.2. but have a number of very important differences too.
What is Client applications in Redshift?
  • Amazon Redshift integrates with various data loading and ETL (extract, transform, and load) tools and business intelligence (Bl) reporting, data mining, and analytics tools.
  • Amazon Redshift is based on industry-standard PostgreSQL, so most existing SQL client applications will work with only minimal changes.
What is Connections in Redshift?

Amazon Redshift communicates with client applications by using industry-standard PostgreSQL JDBC and ODBC drivers.

What is Clusters in Redshift?
  • The core infrastructure component of an Amazon Redshift data warehouse is a cluster.
  • A cluster is composed of one or more compute nodes.
  • If a cluster is provisioned with two or more compute nodes, an additional leader node coordinates the compute nodes and handles external communication.
  • Your client application interacts directly only with the leader node.
  • The compute nodes are transparent to external applications.

What is Leader node in Redshift?

  • The leader node manages communications with client programs and all communication with compute nodes.
  • It parses and develops execution plans to carry out database operations, in particular, the series of steps necessary to obtain results for complex queries.
  • Based on the execution plan, the leader node compiles code, distributes the compiled code to the compute nodes, and assigns a portion of the data to each compute node.
  • The leader node distributes SQL statements to the compute nodes only when a query references tables that are stored on the compute nodes.
  • All other queries run exclusively on the leader node.
What is Compute nodes in Redshift?
  • The leader node compiles code for individual elements of the execution plan and assigns the code to individual compute nodes.
  • The compute nodes execute the compiled code and send intermediate results back to the leader node for final aggregation.
What is Node slices?
  • A compute node is partitioned into slices.
  • Each slice is allocated a portion of the node’s memory and disk space, where it processes a portion of the workload assigned to the node.
  • The leader node manages distributing data to the slices and apportions the workload for any queries or other database operations to the slices.
  • The slices then work in parallel to complete the operation.
  • The number of slices per node is determined by the node size of the cluster.
Setting up multi-factor authentication in Redshift
  • To support multi-factor authentication (MFA), Amazon Redshift provides browser-based plugins.
  • Use the browser SAML plugin for Okta, PingOne, and Active Directory Federation Services, and the browser Azure AD plugin for Microsoft Azure Active Directory
Stored Procedures

You can define an Amazon Redshift stored procedure using the PostgreSQL procedural language PL/pgSQL to perform a set of SQL queries and logical operations. The procedure is stored in the database and is available for any user with sufficient privileges to run.

Unlike a user-defined function (UDF), a stored procedure can incorporate data definition language (DDL) and data manipulation language (DML) in addition to SELECT queries. A stored procedure doesn’t need to return a value. You can use procedural language, including looping and conditional expressions, to control logical flow.

  • CALL
Enhanced VPC Routing in Redshift
  • When you use Amazon Redshift Enhanced VPC Routing, Amazon Redshift forces all COPY and UNLOAD traffic between your cluster and your data repositories through your Amazon VPC. If Enhanced VPC Routing is not enabled, Amazon Redshift routes traffic through the Internet, including traffic to other services within the AWS network.
  • When you execute a COPY or UNLOAD command on a cluster that has Enhanced VPC Routing enabled, your VPC routes the traffic to the specified resource using the strictest, or most specific, network path available.
  • You can use a VPC endpoint to create a managed connection between your Amazon Redshift cluster in a VPC and Amazon Simple Storage Service (Amazon S3). When you do, COPY and UNLOAD traffic between your cluster and your data on Amazon S3 stays in your Amazon VPC.

You can attach an endpoint policy to your endpoint to more closely manage access to your data. For example, you can add a policy to your VPC endpoint that permits unloading data only to a specific Amazon S3 bucket in your account.

  • A VPC endpoint uses route tables to control the routing of traffic between a cluster in the VPC and Amazon S3. All clusters in subnets associated with the specified route tables automatically use that endpoint to access the service.
  • Your VPC uses the most specific, or most restrictive, route that matches your cluster’s traffic to determine how to route the traffic. For example, if you have a route in your route table for all Internet traffic ( ) that points to an Internet gateway and an Amazon S3 endpoint, the endpoint route takes precedence for all traffic destined for Amazon S3, because the IP address range for the Amazon S3 service is more specific than In this example, all other Internet traffic goes to your Internet gateway, including traffic that’s destined for Amazon S3 buckets in other regions.
  • You use endpoint policies to control access from your cluster to the Amazon S3 buckets that hold your data files. By default, the Create Endpoint wizard attaches an endpoint policy doesn’t further restrict access from any user or service within the VPC. For more specific control, you can optionally attach a custom endpoint policy.
Query Diagnosis Process

Run the EXPLAIN command to get a query plan. To analyze the data provided by the query plan, follow these steps:

  1. Identify the steps with the highest cost. Concentrate on optimizing those when proceeding through the remaining steps.
  2. Look at the join types:
  3. Nested Loop: Such joins usually occur because a join condition was omitted.
  4. Hash and Hash Join: Hash joins are used when joining tables where the join columns are not distribution keys and also not sort keys
  5. Merge Join: No change is needed.
  6. Notice which table is used for the inner join, and which for the outer join. The query engine generally chooses the smaller table for the inner join, and the larger table for the outer join. If such a choice doesn’t occur, your statistics are likely out of date.
  7. See if there are any high-cost sort operations.
  8. Look for the following broadcast operators where there are high-cost operations:
  9. DS_BCAST_INNER: Indicates the table is broadcast to all the compute nodes, which is fine for a small table but not ideal for a larger table.
  10. DS_DIST_ALL_INNER: Indicates that all of the workload is on a single slice.
  11. DS_DIST_BOTH: Indicates heavy redistribution.
Common Issues Affecting Query Performance
IssueHow to confirm the issueHow to fix the issue
Table Statistics Missing or Out of DateA warning message in EXPLAIN command results. A missing statistics alert event in STLALERTEVENTLOG.Run ANALYZE
Nested LoopA nested loop alert event in STLALERTEVENTLOGReview your query for cross-joins and remove them if possible.
Hash JoinHash and hash join operations in the query plan. An HJOIN step in the segment with the highest maxtime value in S V L_QU E R Y_S U M M A R Y.Rewrite the query to use a merge join if possible. You can do this by specifying join columns that are both distribution keys and sort keys.
Ghost Rows or Uncommitted RowsAn alert event in STLALERTEVENTLOG that indicates excessive ghost rows.If there are no active load operations, run VACUUM on the query tables to remove deleted rows.
Unsorted or Missorted RowsA very selective filter alert event in STL_ALERT_EVENT_LOGRun VACUUM on the query tables to re-sort the rows. Review SORT KEYS on query tables.
Suboptimal Data DistributionA serial execution, large broadcast, or large distribution alert event appears in STLALERTEVENTLOG. Slices are not processing approximately the same number of rows for a given step. Slices are not taking approximately the same amount of time for a given step. Verify if any of the tables in your query have data skew.Take another look at the distribution styles for the tables in the query and see if any improvements can be made.
Insufficient Memory Allocated to the QueryYou might see a step in SVL QUERY SUMMARY that has an is diskbased value of true.Allocate more memory to the query by temporarily increasing the number of query slots it uses. Workload Management (WLM) reserves slots in a query queue equivalent to the concurrency level set for the queue.
Common Issues Affecting Query Performance
IssueHow to confirm the issueHow to fix the issue
Suboptimal WHERE clauseYou might see a SCAN step in the segment with the highest maxtime value in SVL_QUERY_SUMMARYAdd a WHERE clause to the query based on the primary sort column of the largest table. This approach will help minimize scanning time.
Insufficiently Restrictive Predicate A SCAN step in the segment with the highest maxtime value     Try adding a predicate to the query or making the existing   in SVL_QUERY_SUMMARY that has a very high rows value predicate more restrictive to narrow the output.   compared to the rows value in the final RETURN step in the query.
Very Large Result SetQuery the S V L_QU E R Y_S U M M A RY view.Consider rewriting the query to use UNLOAD to write the results to Amazon S3. This approach will improve the performance of the RETURN step by taking advantage of parallel processing.
Large SELECT list      You might see a bytes value that is high relative to the rows value for any step (in comparison to other steps) in SVL_QUERY_SUMMARY. This high bytes value can be an indicator that you are selecting a lot of columns.Review the columns you are selecting and see if any can be removed.
Amazon Redshift introduces support for multi-factor authentication
  • Amazon Redshift now supports multi-factor authentication (MFA). Customers can use MFA to provide additional security when authenticating to their Amazon Redshift cluster.
  • Customers can configure identity federation provider to require users to prove their identity using additional authentication factors.
  • Amazon Redshift now supports a browser-based authentication workflow that works with any MFA configured by the identity federation provider.
  • This browser-based authentication workflow also makes integration with any Security Assertion Markup Language (SAMI) based identity federation easier.
  • To use multi-factor authentication with Amazon Redshift. get Amazon Redshift JDBC driver version and ODBC driver version or later. Refer to the AWS Region Table for Amazon Redshift availability.
Concurrency Scaling Pricing in Redshift
  • Amazon Redshift automatically adds transient capacity to provide consistently fast performance, even with thousands of concurrent users and queries.
  • There are no resources to manage, no upfront costs, and you are not charged for the startup or shutdown time of the transient clusters. With free concurrency scaling credits that are sufficient for most customers’ concurrency needs, Amazon Redshift allows you to scale with minimal cost-impact and with predictability in your month-to-month cost, even during periods of fluctuating analytical demand.
  • You are charged the per-second on-demand rate for a transient cluster used in excess of the free credits – only when it’s serving your queries – with a one-minute minimum charge each time a Concurrency Scaling cluster is activated.
  • The per-second on- demand rate is based on the type and number of nodes in your Amazon Redshift cluster.
What is Encryption and durability in Redshift?
  • When you provision the cluster, you can optionally choose to encrypt the cluster for additional security.
  • When you enable encryption, Amazon Redshift stores all data in user-created tables in an encrypted format.
  • Encryption is an immutable property of the cluster.
  • The only way to switch from an encrypted cluster to a nonencrypted cluster is to unload the data and reload it into a new cluster.
  • Encryption applies to the cluster and any backups.
  • When you restore a cluster from an encrypted snapshot, the new cluster is encrypted as well.
  • Durability – Because Amazon Redshift distributes and executes queries in parallel across all of a cluster’s compute nodes, you can increase query performance by adding nodes to your cluster.
  • Amazon Redshift also distributes your data across all compute nodes in a cluster.
  • When you run a cluster with at least two compute nodes, data on each node will always be mirrored on disks on another node and you reduce the risk of incurring data loss.
Explain important Redshift Query Basics.
  • The maximum size for a single Amazon Redshift SQL statement is 16 MB
  • Some Amazon Redshift queries are distributed and executed on the compute nodes, and other queries execute exclusively on the leader node
  • The leader node distributes SQL to the compute nodes whenever a query references user-created tables or system tables (tables with an STL or STV prefix and system views with an SVL or SVV prefix).
  • A query that references only catalog tables (tables with a PG prefix, such as PG TABLE DEF, which reside on the leader node) or that does not reference any tables, runs exclusively on the leader node.
  • To avoid client-side out-of-memory errors when retrieving large data sets using JDBC, you can enable your client to fetch data in batches by setting the JDBC fetch size parameter.
  • Amazon Redshift does not recognize the JDBC maxRows parameter. Instead, specify a LIMIT clause to restrict the result set. You can also use an OFFSET clause to skip to a specific starting point in the result set.
  • Removes a view from the database. Multiple views can be dropped with a single DROP VIEW command.
  • Syntax
  • DROP VIEW [ IF EXISTS ] name [, … ] [ CASCADE | RESTRICT ]
  • To remove a view that has dependencies, use the CASCADE option
  • drop view eventview cascade;
  • Removes a user-defined function (UDF) from the database. The function’s signature, or list of argument data types, must be specified because multiple functions can exist with the same name but different signatures.
  • This command is not reversible.
  • Syntax
  • DROP FUNCTION name ( [arg_name] arg_type [, …] )
  • Examples
  • The following example drops the function named f_sqrt:
  • drop function f_sqrt(int);
  • To remove a function that has dependencies, use the CASCADE option, as shown in the following example:
  • drop function f_sqrt(int)cascade;
CREATE TABLE in Redshift.
  • The maximum number of tables is 9,900 for large and xlarge cluster node types and 20,000 for 8xlarge cluster node types. The limit includes temporary tables and excludes views.
  • The maximum number of characters for a table name is 127.
  • The maximum number of columns you can define in a single table is 1,600.
  • The maximum number of SORTKEY columns you can define in a single table is 400.

Creates a new table in the current database. The owner of this table is the issuer of the CREATE TABL command.


( { column_name dota_type [column_attributes] [ column_constraints ]
I table_constraints
[, … ] )
where column_attributes are:
[ DEFAULT default_expr ]
[ IDENTITY ( seed, step ) ]
[ ENCODE encoding ]
and column_constraints are:
[ REFERENCES reftable [ ( refcolumn ) ] ]
and table_constraints are:
[ UNIQUE ( column_name [ , … ] ) ]
[ PRIMARY KEY ( column_name [,…]) ]
[ FOREIGN KEY (column_name [, … ] ) REFERENCES reftable [ ( refcolumn ) ]
and table_attributes are:
L DISTSTYLE { EVEN | KEY | ALL > J [ DISTKEY ( column_name ) ] k
[ [COMPOUND I INTERLEAVED ] SOIWWEY ( column_name [, …] ) ]
UNLOAD in Redshift

• UNLOAD connects to Amazon S3 using an HTTPS connection.
• By default, UNLOAD writes data in parallel to multiple files,according to the number of slices in the cluster.
• By default, UNLOAD writes one or more files per slice. UNLOAD appends a slice number and part number to the specified name prefix as follows:
• The Amazon S3 bucket where Amazon Redshift will write the output files must reside in the same region as your cluster.
Unloads the result of a query to one or more files on Amazon Simple Storage Service (Amazon S3)

  • UNLOAD (‘select-statement’)
  • TO ‘s3://object-path/name-prefix’ authorization [ option [ …
  • ‘select-statement’ – A select query.
  • Authorization – The UNLOAD command uses the same parameters the COPY command uses for authorization MANIFEST – Creates a manifest file that explicitly lists the data files that are created by the UNLOAD process.
  • DELIMITER AS ‘delimiter_character’ – Single ASCII character that is used to separate fields in the output file, such as a pipe character ( | ), a comma (, ), or a tab ( \t ). The default delimiter is a pipe character. FIXEDWIDTH ‘fixedwidth_spec’ – Unloads the data to a file where each column width is a fixed length, rather than separated by a delimiter.
  • ENCRYPTED – A clause that specifies that the output files on Amazon S3 will be encrypted using Amazon S3 server-side encryption or client-side encryption. If MANIFEST is specified, the manifest file is also encrypted. BZIP2 – Unloads data to one or more bzip2-compressed files per slice. Each resulting file is appended with a .bz2 extension.
  • GZIP – Unloads data to one or more gzip-compressed files per slice. Each resulting file is appended with a .gz extension.
  • ADDQUOTES – Places quotation marks around each unloaded data field, so that Amazon Redshift can unload data values that contain the delimiter itself.
  • NULL AS ‘null-string’ – Replaces the specified string in place of nulls in the unloaded files.
  • ESCAPE – For CFIAR and VARCFIAR columns in delimited unload files, an escape character () is placed before every occurrence of \n, \r, \, “, ‘.
  • ALLOWOVERWRITE – If fctLOWOVERWRITE is specified, UNLOAD will overwrite existing files, including the manifest file.
  • PARALLEL – The default option is ON or TRUE. If PARALLEL is OFF or FALSE, UNLOAD writes to one or more data files serially, sorted absolutely according to the ORDER BY clause, if one is used. The maximum size for a
  • creates in Amazon S3. Specify a creates value between 5 mb and 6.2 GB. the AS keyword is optional. The default unit is MB.
Data Load Operations in Redshift
  • Manage the default behavior of the load operation for troubleshooting or to reduce load times by specifying the following parameters.
  • COMPROWS – Specifies the number of rows to be used as the sample size for compression analysis. The analysis is run on rows from each data slice.
  • COMPUPDATE – Controls whether compression encodings are automatically applied during a COPY. If COMPUPDATE is omitted, COPY applies automatic compression only if the target table is empty and all the table columns either have RAW encoding or no encoding. This behavior is the default. With COMPUPDATE OFF (or FALSE), automatic compression is disabled.
  • MAXERROR – If the load returns the specified number of errors or greater, the load fails. If the load returns fewer errors, it continues and returns an INFO message that states the number of rows that could not be loaded.
  • NOLOAD – Checks the validity of the data file without actually loading the data. Use the NOLOAD parameter to make sure that your data file will load without any errors before running the actual data load. Running COPY with the NOLOAD parameter is much faster than loading the data because it only parses the files.
  • STATUPDATE – Governs automatic computation and refresh of optimizer statistics at the end
  • Using a Manifest to Specify Data Files

“entries”: [
{“url”:”s3://mybucket/custdata.1″,”mandatory”: true}, {“url”:”s3://mybucket/custdata.2″,”mandatory”:true}, {“url”:”s3://mybucket/custdata.3″,”mandatory”:true}

CREATE VIEW in Redshift
  • You can’t update, insert into, or delete from a view.
  • You can create a view even if the referenced objects don’t exist. Because there is no dependency, you can drop or alter a referenced object without affecting the view.
  • Amazon Redshift doesn’t check for dependencies until the view is queried.
  • Creates a view in a database. The view is not physically materialized; the query that defines the view is run every time the view is referenced in a query. To create a view with an external table, include the WITH NO SCHEMA BINDING clause.

Changes the definition of a database table or Amazon Redshift Spectrum external table. This command updates the values and properties set by CREATE TABLE or CREATE EXTERNAL TABLE.

ALTER TABLE table_name
ADD table_constraint \
OWNER TO new_owner |
RENAME TO new_name \
RENAME COLUMN column_name TO new_name \
ADD [ COLUMN ] column_rtame column_type [ DEFAULT default_expr ]
[ ENCODE encoding ]
DROP [ COLUMN ] column_nome [ RESTRICT | CASCADE ] }
where table_constraint is:
[ CONSTRAINT constraint_name ]
{ UNIQUE ( column_nome [,…]) I PRIMARY KEY ( column_name [,…]) |
FOREIGN KEY (column_name [, … ] )
REFERENCES reftable [ ( refcolumn ) ]}
The following options apply only to external tables:
SET LOCATION { ‘s3://bucket/folder/’ | ‘s3://bucket/manifest_fHe’ } |
SET TABLE PROPERTIES (‘property_name’=’property_value’) |
PARTITION ( partition_column=partition_value [, …] )
SET LOCATION { ‘s3://bucket/folder’ | ‘s3://bucket/manifest_file’ } |
ADD [IF NOT EXISTS] PARTITION ( partition_column=partition_value [, …] )
LOCATION { *s3://bucket/folder’ \’s3://bucket/manifest_file’ } |
DROP PARTITION ( partition_column=partition_value [, …] )
The following command renames the USERS table to USERS_BKUP:
in a note here.
alter table users rename to users _bkup

  • Deletes a user group. This command is not reversible. This command does not delete the individual users in a group.
    See DROP USER to delete an individual user.
  • Syntex
  • DROP GROUP name
    The following example deletes the GUEST user group:
    drop group guests;
    You can’t drop a group if the group has any privileges on an object. If you attempt to drop such a group, you will receive the following error.
    ERROR: group “guest” can’t be dropped because the group has a privilege on some object
    If the group has privileges for an object, first revoke the privileges before dropping the group. The following example revokes all privileges on all tables in the public schema from the guest user group, and then drops the group.
    revoke all on all tables in schema public from group guest; drop group guests;
DROP USER in Redshift
  • Drops a user from a database. Multiple users can be dropped with a single DROP USER command. You must be a database superuser to execute this command.
  • Syntax
    DROP USER [ IF EXISTS ] name [, … ]
  • Amazon Redshift checks only the current database before dropping a user.
  • DROP USER doesn’t return an error if the user owns database objects or has any privileges on objects in another database.
  • If you drop a user that owns objects in another database, the owner for those objects is changed to ‘unknown’.
  • If a user owns an object, first drop the object or change its ownership to another user before dropping the original user.
  • If the user has privileges for an object, first revoke the privileges before dropping the user. The following example shows dropping an object, changing ownership, and revoking privileges before dropping the user.
  • drop database dwdatabase;
    alter schema dw owner to dwadmin;
    revoke all on table dwtable from dwuser
    drop user dwuser;
Explain DROP SCHEMA in Redshift
  • Deletes a schema. This command is not reversible.
  • Syntax
  • Example
  • The following example deletes a schema named S_SAIES. This example uses RESTRICT as a safety mechanism so that the schema will not be deleted if it contains any objects. In this case, you would to delete the schema objects before deleting the schema:
  • drop schema s_sales reatrictjc
  • The following example deletes a schema named S.SALES and all objects that depend on that schema:
  • drop schema s_sales cascade;
  • The following example either drops the S.SALES schema if it exists, or does nothing and returns a message if it does not:
  • drop schema if exists s_sales;
DROP TABLE in Redshift
  • Removes a table from a database. Only the owner of the table, the schema owner, or a superuser can drop a table.
  • If you are trying to empty a table of rows, without removing the table, use the DELETE or TRUNCATE command.
  • DROP TABLE removes constraints that exist on the target table. Multiple tables can be removed with a single DROP TABLE command.
  • DROP TABLE with an external table can’t be used inside a transaction (BEGIN … END).
  • Syntax
  • Examples
  • Dropping a Table with No Dependencies
  • The following example creates and drops a table called FEEDBACK that has no dependencies:
  • drop table feedback;
Factors Affecting Query Performance
  • Number of nodes, processors, or slices – A compute node is partitioned into slices. More nodes means more processors and more slices, which enables your queries to process faster by running portions of the query concurrently across the slices. However, more nodes also means greater expense, so you will need to find the balance of cost and performance that is appropriate for your system.       
  • Node types – An Amazon Redshift cluster can use either dense storage or dense compute nodes. The dense storage node types are recommended for substantial data storage needs, while dense compute node types are optimized for performance-intensive workloads. Each node type offers different sizes and limits to help you scale your cluster appropriately. The node size determines the storage capacity, memory, CPU, and price of each node in the cluster.
  • Data distribution – Amazon Redshift stores table data on the compute nodes according to a table’s distribution style. When you execute a query, the query optimizer redistributes the data to the compute nodes as needed to perform any joins and aggregations. Choosing the right distribution style for a table helps minimize the impact of the redistribution step by locating the data where it needs to be before the joins are performed.
  • Data sort order – Amazon Redshift stores table data on disk in sorted order according to a table’s sort keys. The query optimizer and the query processor use the information about where the data is located to reduce the number of blocks that need to be scanned and thereby improve query speed.
  • Dataset size – A higher volume of data in the cluster can slow query performance for queries, because more rows need to be scanned and redistributed. You can mitigate this effect by regular vacuuming and archiving of data, and by using a predicate to restrict the query dataset.
  • Concurrent operations – Running multiple operations at once can affect query performance. Each operation takes one or more slots in an available query queue and uses the memory associated with those slots. If other operations are running, enough query queue slots might not be available. In this case, the query will have to wait for slots to open before it can begin processing.
  • Query structure – How your query is written will affect its performance. As much as possible, write queries to process and return as little data as will meet your needs.
Query Performance Factors
  • Code compilation – Amazon Redshift generates and compiles code for each query execution plan. The compiled code segments are stored in a least recently used (LRU) cache and shared across sessions in a cluster. Thus, subsequent executions of the same query, even in different sessions and often even with different query parameters, will run faster because they can skip the initial generation and compilation steps.
  • The LRU cache persists through cluster reboots, but is wiped by maintenance upgrades. The compiled code executes faster because it eliminates the overhead of using an interpreter. You always have some overhead cost the first time code is generated and compiled. As a result, the performance of a query the first time you run it can be misleading. The overhead cost might be especially noticeable when you run one-off (ad hoc) queries. You should always run a query a second time to determine its typical performance.
  • Similarly, be careful about comparing the performance of the same query sent from different clients. The execution engine generates different code for the JDBC connection protocols and ODBC and psql (libpq) connection protocols. If two clients use different protocols, each client will incur the first-time cost of generating compiled code, even for the same query. Other clients that use the same protocol, however, will benefit from sharing the cached code. A client that uses ODBC and a client running psql with libpq can share the same compiled code.