REDSHIFT

Redshift Query Basics

  • The maximum size for a single Amazon Redshift SQL statement is 16 MB
  • Some Amazon Redshift queries are distributed and executed on the compute nodes, and other queries execute exclusively on the leader node

  • The leader node distributes SQL to the compute nodes whenever a query references user-created tables or system tables (tables with an STL or STV prefix and system views with an SVL or SVV prefix).

  • A query that references only catalog tables (tables with a PG prefix, such as PG_TABLE_DEF, which reside on the leader node) or that does not reference any tables, runs exclusively on the leader node.
  • To avoid client-side out-of-memory errors when retrieving large data sets using JDBC, you can enable your client to fetch data in batches by setting the JDBC fetch size parameter.
  • Amazon Redshift does not recognize the JDBC maxRows parameter. Instead, specify a LIMIT clause to restrict the result set. You can also use an OFFSET clause to skip to a specific starting point in the result set.

Standard Identifiers

  • There are two types of identifiers, standard identifiers and quoted or delimited identifiers.
  • Identifiers must consist of only UTF-8 printable characters. ASCII letters in standard and delimited identifiers are case-insensitive and are folded to lowercase in the database.
  • In query results, column names are returned as lowercase by default
  • Standard SQL identifiers adhere to a set of rules and must:
  • Begin with an ASCII single-byte alphabetic character or underscore character, or a UTF-8 multibyte character two to four bytes long.
  • Subsequent characters can be ASCII single-byte alphanumeric characters, underscores, or dollar signs, or UTF-8 multibyte characters two to four bytes long.
  • Be between 1 and 127 bytes in length, not including quotes for delimited identifiers.
  • Contain no quotation marks and no spaces.
  • Not be a reserved SQL key word.

Delimited Identifiers

  • Delimited identifiers (also known as quoted identifiers) begin and end with double quotation marks (").
  • If you use a delimited identifier, you must use the double quotation marks for every reference to that object. The identifier can contain any standard UTF-8 printable characters other than the double quote itself.
  • Therefore, you can create column or table names that include otherwise illegal characters, such as spaces or the percent symbol.
  • To use a double quote in a string, you must precede it with another double quote character.
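
For illustration, the following is a minimal sketch (table and column names are hypothetical) showing delimited identifiers that contain a space and a reserved word; every later reference must quote them the same way:

create table "group" ("this name" char(10));
select "this name" from "group";
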
NULLs
  • If a column in a row is missing, unknown, or not applicable, it is a null value or is said to contain null.
  • Nulls can appear in fields of any data type that are not restricted by primary key or NOT NULL constraints.
  • A null is not equivalent to the value zero or to an empty string.
  • Any arithmetic expression containing a null always evaluates to a null. All operators except concatenation return a null when given a null argument or operand.
  • To test for nulls, use the comparison conditions IS NULL and IS NOT NULL.
  • Because null represents a lack of data, a null is not equal or unequal to any value or to another null. 
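
As a quick illustration (assuming the TICKIT sample USERS table with a nullable phone column), IS NULL is the reliable test; a comparison such as phone = NULL never evaluates to true:

select userid from users where phone is null;
select userid from users where phone is not null;
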
Implicit and Explicit Conversions
  • In general, data types that fall into the same type category (such as different numeric data types) are compatible and can be implicitly converted. For example, with implicit conversion you can insert a decimal value into an integer column. The decimal is rounded to produce a whole number. Or you can extract a numeric value, such as 2008, from a date and insert that value into an integer column.
  • Numeric data types enforce overflow conditions that occur when you attempt to insert out-of-range values. For example, a decimal value with a precision of 5 does not fit into a decimal column that was defined with a precision of 4. An integer or the whole part of a decimal is never truncated; however, the fractional part of a decimal can be rounded up or down, as appropriate.
  • If you compare numeric values with character strings, the numeric values are converted to character strings. To enforce the opposite conversion (converting character strings to numeric values), use an explicit function, such as CAST and CONVERT.
  • To convert 64-bit DECIMAL or NUMERIC values to a higher precision, you must use an explicit conversion function such as the CAST or CONVERT functions.
  • When converting DATE or TIMESTAMP to TIMESTAMPTZ, DATE or TIMESTAMP are assumed to use the current session time zone. The session time zone is UTC by default.
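
A small sketch of explicit conversion, assuming the TICKIT sample SALES table (PRICEPAID is DECIMAL); CAST and CONVERT are equivalent ways to force the conversion:

select cast(pricepaid as integer) from sales limit 10;
select convert(integer, pricepaid) from sales limit 10;
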
Operators and Precedence

A compound arithmetic expression can be constructed using the following operators, in this order of precedence:
( ) : parentheses to control the order of evaluation

  • + , - : positive and negative sign/operator
  • ^ , |/ , ||/ : exponentiation, square root, cube root
  • * , / , % : multiplication, division, and modulo operators
  • @ : absolute value
  • + , - : addition and subtraction
  • & , |, #, ~, <<, >> : AND, OR, NOT, shift left, shift right bitwise operators
  • || : concatenation

Conditional Operators

Comparison conditions state logical relationships between two values. All comparison conditions are binary operators with a Boolean return type. Amazon Redshift supports the comparison operators described in the following table:

Operator                    Syntax                 Description
<                           a < b                  Value a is less than value b.
>                           a > b                  Value a is greater than value b.
<=                          a <= b                 Value a is less than or equal to value b.
>=                          a >= b                 Value a is greater than or equal to value b.
=                           a = b                  Value a is equal to value b.
<> or !=                    a <> b or a != b       Value a is not equal to value b.
ANY | SOME                  a = ANY(subquery)      Value a is equal to any value returned by the subquery.
ALL                         a <> ALL(subquery)     Value a is not equal to any value returned by the subquery.
IS TRUE | FALSE | UNKNOWN   a IS TRUE              Value a is Boolean TRUE.

Pattern Matching Expressions

A pattern-matching operator searches a string for a pattern specified in the conditional expression and returns true or false depending on whether it finds a match. Amazon Redshift uses three methods for pattern matching:

  • LIKE expressions

The LIKE operator compares a string expression, such as a column name, with a pattern that uses the wildcard characters % (percent) and _ (underscore). LIKE pattern matching always covers the entire string. LIKE performs a case-sensitive match and ILIKE performs a case-insensitive match.

  • SIMILAR TO regular expressions

The SIMILAR TO operator matches a string expression with a SQL standard regular expression pattern, which can include a set of pattern-matching metacharacters that includes the two supported by the LIKE operator. SIMILAR TO matches the entire string and performs a case-sensitive match.

  • POSIX-style regular expressions

POSIX regular expressions provide a more powerful means for pattern matching than the LIKE and SIMILAR TO operators. POSIX regular expression patterns can match any portion of the string and performs a case-sensitive match.

Note: Regular expression matching, using SIMILAR TO or POSIX operators, is computationally expensive. We recommend using LIKE whenever possible, especially when processing a very large number of rows.

LIKE Operator

The LIKE operator compares a string expression, such as a column name, with a pattern that uses the wildcard characters % (percent) and _ (underscore). LIKE pattern matching always covers the entire string. To match a sequence anywhere within a string, the pattern must start and end with a percent sign.

Operator   Description
%          Matches any sequence of zero or more characters.
_          Matches any single character.
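
A small sketch, assuming the TICKIT sample CATEGORY table; the pattern must match the entire value, and ILIKE ignores case:

select distinct catgroup from category where catgroup like 'S%';
select distinct catgroup from category where catgroup ilike 's%';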

SIMILAR TO Operator

The SIMILAR TO operator matches a string expression, such as a column name, with a SQL standard regular expression pattern. A SQL regular expression pattern can include a set of pattern-matching metacharacters. The SIMILAR TO operator returns true only if its pattern matches the entire string, unlike POSIX regular expression behavior, where the pattern can match any portion of the string. SIMILAR TO performs a case-sensitive match.

  • % : Matches any sequence of zero or more characters. Example: 'abc' SIMILAR TO 'abc'
  • _ : Matches any single character. Example: 'abc' SIMILAR TO '_b_'; 'abc' SIMILAR TO '_A_' (case-sensitive)
  • | : Denotes alternation (either of two alternatives).
  • * : Repeat the previous item zero or more times.
  • + : Repeat the previous item one or more times. Example: 'abc' SIMILAR TO '%(b|d)%'
  • ? : Repeat the previous item zero or one time. Example: 'abc' SIMILAR TO '(b|c)%'
  • {m} : Repeat the previous item exactly m times.
  • {m,} : Repeat the previous item m or more times.
  • {m,n} : Repeat the previous item at least m and not more than n times. Example: 'aaaaaab11111xy' SIMILAR TO 'a{6}_[0-9]{5}(x|y){2}'
  • ( ) : Parentheses group items into a single logical item.
  • [...] : A bracket expression specifies a character class, just as in POSIX regular expressions. Example: '$0.87' SIMILAR TO '$[0-9]+(.[0-9][0-9])?'

CREATE GROUP

Defines a new user group. Only a superuser can create a group.

Syntax

CREATE GROUP group_name

[ [ WITH ] USER username [, ...] ]

Parameters

group_name

Name of the new user group. Group names beginning with two underscores are reserved for Amazon Redshift internal use.

WITH

Optional syntax to indicate additional parameters for CREATE GROUP.

USER

Add one or more users to the group. username

Name of the user to add to the group.
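
As a sketch (the group and user names are hypothetical, and the users must already exist):

create group admin_group with user admin1, admin2;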

CREATE USER

Creates a new database user account. You must be a database superuser to execute this command.

Syntax
CREATE USER name [ WITH ]
PASSWORD { 'password' | 'md5hash' | DISABLE }
[ option [ … ] ]

where option can be:

CREATEDB | NOCREATEDB
CREATEUSER | NOCREATEUSER
SYSLOG ACCESS { RESTRICTED | UNRESTRICTED }
IN GROUP groupname [, ... ]
VALID UNTIL abstime
CONNECTION LIMIT { limit | UNLIMITED }
Parameters
name
The name of the user account to create. The user name can’t be public.
WITH
Optional keyword. WITH is ignored by Amazon Redshift.
PASSWORD { 'password' | 'md5hash' | DISABLE }
Sets the user’s password.
By default, users can change their own passwords, unless the password is disabled. To disable a user’s password, specify DISABLE. When a user’s password is disabled, the password is deleted from the system and the user can log on only using temporary IAM user credentials. You can specify the password in clear text or as an MD5 hash string.

CREATE USER

As a more secure alternative to passing the CREATE USER password parameter as clear text, you can specify an MD5 hash of a string that includes the password and user name.

When you launch a new cluster using the AWS Management Console, AWS CLI, or Amazon Redshift API, you must supply a clear text password for the master database user.
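
A minimal sketch of the MD5 approach, assuming a hypothetical user example_user with password ez: concatenate the password and user name, compute the MD5 hash, then pass the hash prefixed with md5 to CREATE USER.

select md5('ez' || 'example_user');
-- substitute the hash returned above for <hash> (placeholder, not a real value)
create user example_user password 'md5<hash>';
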
CREATEDB | NOCREATEDB
The CREATEDB option allows the new user account to create databases. The default is NOCREATEDB.
CREATEUSER | NOCREATEUSER
The CREATEUSER option creates a superuser with all database privileges, including CREATE USER. The default is NOCREATEUSER.
SYSLOG ACCESS { RESTRICTED | UNRESTRICTED }
A clause that specifies the level of access the user has to the Amazon Redshift system tables and views. If RESTRICTED is specified, the user can see only the rows generated by that user in user-visible system tables and views. The default is RESTRICTED.
• IN GROUP groupname
Specifies the name of an existing group that the user belongs to. Multiple group names may be listed.
• VALID UNTIL abstime
The VALID UNTIL option sets an absolute time after which the user account password is no longer valid. By default the password has no time limit.
• CONNECTION LIMIT {limit | UNLIMITED }
The maximum number of database connections the user is permitted to have open concurrently. The limit is not enforced for super users. Use the UNLIMITED keyword to permit the maximum number of concurrent connections. The limit of concurrent connections for each cluster is 500.
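
For example, the following sketch (user name and password are hypothetical) creates a user who can create databases and is limited to 30 concurrent connections:

create user dbuser with password 'abcD1234' createdb connection limit 30;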

CREATE DATABASE

•           Maximum of 60 user-defined databases per cluster.

•           Maximum of 127 bytes for a database name.

•           Cannot be a reserved word.

•           Maximum 500 concurrent connections for each cluster.

•           Limit is not enforced for super users.

Syntax

CREATE DATABASE database_name [ WITH ]

[ OWNER [=] db_owner ]

[ CONNECTION LIMIT { limit | UNLIMITED } ]

Parameters

•           database_name

Name of the new database.

•           WITH

Optional keyword.

•           OWNER

Specifies a database owner.

•           =

Optional character.

•           db_owner

Username for the database owner.

•           CONNECTION LIMIT {limit | UNLIMITED }

The maximum number of database connections users are permitted to have open concurrently. Use the UNLIMITED keyword to permit the maximum number of concurrent connections. The default is UNLIMITED.
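
A short sketch (the database and owner names are hypothetical; the owner must be an existing user):

create database salesdb with owner dwuser connection limit 30;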

CREATE SCHEMA

There is a maximum of 9900 schemas per database.

Defines a new schema for the current database.

Syntax

CREATE SCHEMA [ IF NOT EXISTS ] schema_name [ AUTHORIZATION username ] [ schema_element [ ... ] ]

CREATE SCHEMA AUTHORIZATION username [ schema_element [ ... ] ]


Parameters

  • IF NOT EXISTS

Clause that indicates that if the specified schema already exists, the command should make no changes and return a message that the schema exists, rather than terminating with an error. This clause is useful when scripting, so the script doesn’t fail if CREATE SCHEMA tries to create a schema that already exists.

  • schema_name

Name of the new schema. The schema name can’t be PUBLIC.

  • AUTHORIZATION

Clause that gives ownership to a specified user.

  • username

Name of the schema owner.

  • schema_element

Definition for one or more objects to be created within the schema.
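
For example (the schema and owner names are hypothetical):

create schema if not exists us_sales authorization dwuser;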

CREATE TABLE

The maximum number of tables is 9,900 for large and xlarge cluster node types and 20,000 for 8xlarge cluster node types. The limit includes temporary tables and excludes views.
The maximum number of characters for a table name is 127.
The maximum number of columns you can define in a single table is 1,600.
The maximum number of SORTKEY columns you can define in a single table is 400.

Creates a new table in the current database. The owner of this table is the issuer of the CREATE TABLE command.

Syntax

CREATE [ [ LOCAL ] { TEMPORARY | TEMP } ] TABLE [ IF NOT EXISTS ] table_name
( { column_name data_type [ column_attributes ] [ column_constraints ]
  | table_constraints
  | LIKE parent_table [ { INCLUDING | EXCLUDING } DEFAULTS ] }
  [, ... ] )
[ BACKUP { YES | NO } ]
[ table_attribute ]
where column_attributes are:
[ DEFAULT default_expr ]
[ IDENTITY ( seed, step ) ]
[ ENCODE encoding ]
[ DISTKEY ]
[ SORTKEY ]
and column_constraints are:
[ { NOT NULL | NULL } ]
[ { UNIQUE | PRIMARY KEY } ]
[ REFERENCES reftable [ ( refcolumn ) ] ]
and table_constraints are:
[ UNIQUE ( column_name [, ... ] ) ]
[ PRIMARY KEY ( column_name [, ... ] ) ]
[ FOREIGN KEY ( column_name [, ... ] ) REFERENCES reftable [ ( refcolumn ) ] ]
and table_attributes are:
[ DISTSTYLE { EVEN | KEY | ALL } ]
[ DISTKEY ( column_name ) ]
[ [ COMPOUND | INTERLEAVED ] SORTKEY ( column_name [, ... ] ) ]
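
The following sketch ties the clauses together using hypothetical column names; it assumes listid is a reasonable distribution key and that queries filter and sort on listid and saletime:

create table if not exists sales_demo (
  salesid integer identity(1,1),
  listid integer not null,
  qtysold smallint default 1 not null,
  saletime timestamp encode lzo
)
diststyle key
distkey (listid)
compound sortkey (listid, saletime);
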
CREATE TABLE AS

•           CREATE TABLE AS (CTAS) tables don’t inherit constraints, identity columns, default column values, or the primary key from the table that they were created from.

•           You can’t specify column compression encodings for CTAS tables. Amazon Redshift automatically assigns compression encoding as follows:

•           Columns that are defined as sort keys are assigned RAW compression.

•           Columns that are defined as BOOLEAN, REAL, or DOUBLE PRECISION data types are assigned RAW compression.

•           All other columns are assigned LZO compression.

Creates a new table based on a query. The owner of this table is the user that issues the command.

The new table is loaded with data defined by the query in the command. The table columns have names and data types associated with the output columns of the query. The CREATE TABLE AS (CTAS) command creates a new table and evaluates the query to load the new table.

Syntax
CREATE [ [LOCAL ] { TEMPORARY | TEMP } ]
TABLE table_name
[ ( column_name [, … ] ) ]
[ table_attributes ]
[ BACKUP { YES | NO } ]
AS query

where table_attributes are:
[ DISTSTYLE { EVEN | ALL | KEY } ]
[ DISTKEY ( distkey_identifier ) ]
[ [ { COMPOUND | INTERLEAVED } ] SORTKEY ( column_name [, ...] ) ]
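
A short CTAS sketch, assuming the TICKIT sample SALES table used elsewhere in these notes; the new table’s column names and types come from the query output:

create table event_sales_summary
diststyle key
distkey (eventid)
as
select eventid, count(*) as sales_count
from sales
group by eventid;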

CREATE VIEW

  • You can’t update, insert into, or delete from a view.
  • You can create a view even if the referenced objects don’t exist. Because there is no dependency, you can drop or alter a referenced object without affecting the view.
  • Amazon Redshift doesn’t check for dependencies until the view is queried.

Creates a view in a database. The view is not physically materialized; the query that defines the view is run every time the view is referenced in a query. To create a view with an external table, include the WITH NO SCHEMA BINDING clause.
Having ownership of a view, or having privileges granted on a view, does not imply access to the underlying tables. You need to grant access to the underlying tables explicitly.
Syntax
CREATE [ OR REPLACE ] VIEW name [ ( column_name [, ...] ) ] AS query [ WITH NO SCHEMA BINDING ]
Parameters
· OR REPLACE
If a view of the same name already exists, the view is replaced. You can only replace a view with a new query that generates the identical set of columns, using the same column names and data types. CREATE OR REPLACE VIEW locks the view for reads and writes until the operation completes.
· name
The name of the view. If a schema name is given (such as myschema.myview) the view is created using the specified schema. Otherwise, the view is created in the current schema. If you specify a view name that begins with '#', the view is created as a temporary view that is visible only in the current session.
· column_name
Optional list of names to be used for the columns in the view. If no column names are given, the column names are derived from the query. The maximum number of columns you can define in a single view is 1,600.
· query
A query (in the form of a SELECT statement) that evaluates to a table. This table defines the columns and rows in the view.
· WITH NO SCHEMA BINDING
Clause that specifies that the view is not bound to the underlying database objects, such as tables and user-defined functions. As a result, there is no dependency between the view and the objects it references.
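
For example, the following sketch assumes the TICKIT sample EVENT table; the second form is a late-binding view, so its table references must be schema-qualified and are not checked until the view is queried:

create view myevent as
select eventid, eventname from event where eventname = 'LeAnn Rimes';

create view myevent_lb as
select eventid, eventname from public.event
with no schema binding;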

CREATE LIBRARY

Installs a Python library, which will be available for users to incorporate when creating a user-defined function (UDF) with the CREATE FUNCTION command.

The total size of user-installed libraries can’t exceed 100 MB. CREATE LIBRARY can’t be run inside a transaction block (BEGIN … END).

• Amazon Redshift supports Python version 2.7.

Syntax
CREATE [ OR REPLACE ] LIBRARY library_name LANGUAGE plpythonu
FROM
{ 'https://file_url'
| 's3://bucketname/file_name' authorization
  [ REGION [AS] 'aws_region' ]
}

Examples
The following command installs a UDF library named f_urlparse from a package that has been uploaded to an Amazon S3 bucket located in the US East region.
create library f_urlparse language plpythonu
from 's3://mybucket/urlparse3-1.0.3.zip'
credentials 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
region as 'us-east-1';

The following example installs a library named f_urlparse from a library file on a website.

create library f_urlparse
language plpythonu
from 'https://example.com/packages/urlparse3-1.0.3.zip';

CREATE FUNCTION

Creates a new scalar user-defined function (UDF) using either a SQL SELECT clause or a Python program.

Syntax
CREATE [ OR REPLACE ] FUNCTION f_function_name
( { [ py_arg_name ] py_arg_data_type | sql_arg_data_type } [, ... ] )
RETURNS data_type
{ VOLATILE | STABLE | IMMUTABLE }
AS $$
  { python_program | SELECT_clause }
$$ LANGUAGE { plpythonu | sql }
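
A minimal sketch of a SQL UDF (the function name and logic are illustrative); it returns the greater of its two arguments. The usage line assumes the TICKIT sample SALES table:

create or replace function f_sql_greater (float, float)
returns float
stable
as $$
  select case when $1 > $2 then $1 else $2 end
$$ language sql;

select f_sql_greater(commission, pricepaid * 0.10) from sales limit 10;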

ALTER GROUP

Changes a user group. Use this command to add users to the group, drop users from the group, or rename the group.

Syntax
ALTER GROUP group_name
{
ADD USER username [, ... ] |
DROP USER username [, ... ] |
RENAME TO new_name
}

Examples
The following example adds a user named DWUSER to the ADMIN_GROUP group:
alter group admin_group
add user dwuser;

The following example renames the group ADMIN_GROUP to ADMINISTRATORS:
alter group admin_group
rename to administrators;

ALTER USER


Changes a database user account. If you are the current user, you can change your own password. For all other options, you must be a database superuser to execute this command.

Syntax
ALTER USER username [ WITH ] option [, ... ]
where option is
CREATEDB | NOCREATEDB
| CREATEUSER | NOCREATEUSER
| SYSLOG ACCESS { RESTRICTED | UNRESTRICTED }
| PASSWORD { 'password' | 'md5hash' | DISABLE }
  [ VALID UNTIL 'expiration_date' ]
| RENAME TO new_name
| CONNECTION LIMIT { limit | UNLIMITED }
| SET parameter { TO | = } { value | DEFAULT }
| RESET parameter
Examples

The following example gives the user ADMIN the privilege to create databases:
alter user admin createdb;

The following example sets the password of the user ADMIN to adminPass9 and sets an expiration date and time for the password:

alter user admin password 'adminPass9'
valid until '2017-12-31 23:59';
The following example renames the user ADMIN to SYSADMIN:

alter user admin rename to sysadmin;

ALTER DATABASE

Changes the attributes of a database.

Syntax
ALTER DATABASE database_name
{ RENAME TO new_name
| OWNER TO new_owner
| CONNECTION LIMIT { limit | UNLIMITED } }

Usage Notes
ALTER DATABASE commands apply to subsequent sessions, not to current sessions. You need to reconnect to the altered database to see the effect of the change.
Examples
The following example renames a database named TICKIT_SANDBOX to TICKIT_TEST:
alter database tickit_sandbox rename to tickit_test;
The following example changes the owner of the TICKIT database (the current database) to DWUSER:
alter database tickit owner to dwuser;

ALTER TABLE

Changes the definition of a database table or Amazon Redshift Spectrum external table. This command updates the values and properties set by CREATE TABLE or CREATE EXTERNAL TABLE.

Syntax
ALTER TABLE table_name
{
ADD table_constraint |
DROP CONSTRAINT constraint_name [ RESTRICT | CASCADE ] |
OWNER TO new_owner |
RENAME TO new_name |
RENAME COLUMN column_name TO new_name |
ADD [ COLUMN ] column_name column_type [ DEFAULT default_expr ]
  [ ENCODE encoding ]
  [ NOT NULL | NULL ] |
DROP [ COLUMN ] column_name [ RESTRICT | CASCADE ]
}
where table_constraint is:
[ CONSTRAINT constraint_name ]
{ UNIQUE ( column_name [, ... ] ) |
PRIMARY KEY ( column_name [, ... ] ) |
FOREIGN KEY ( column_name [, ... ] ) REFERENCES reftable [ ( refcolumn ) ] }
The following options apply only to external tables:
SET LOCATION { 's3://bucket/folder/' | 's3://bucket/manifest_file' } |
SET FILE FORMAT format |
SET TABLE PROPERTIES ( 'property_name' = 'property_value' ) |
PARTITION ( partition_column=partition_value [, ...] )
  SET LOCATION { 's3://bucket/folder' | 's3://bucket/manifest_file' } |
ADD [ IF NOT EXISTS ] PARTITION ( partition_column=partition_value [, ...] )
  LOCATION { 's3://bucket/folder' | 's3://bucket/manifest_file' } |
DROP PARTITION ( partition_column=partition_value [, ...] )
The following command renames the USERS table to USERS_BKUP:
alter table users rename to users_bkup;

DROP GROUP

Deletes a user group. This command is not reversible. This command does not delete the individual users in a group.
See DROP USER to delete an individual user.

Syntax

DROP GROUP name

EXAMPLE
The following example deletes the GUESTS user group:
drop group guests;
You can’t drop a group if the group has any privileges on an object. If you attempt to drop such a group, you will receive the following error.
ERROR: group "guests" can't be dropped because the group has a privilege on some object
If the group has privileges for an object, first revoke the privileges before dropping the group. The following example revokes all privileges on all tables in the public schema from the GUESTS user group, and then drops the group.
revoke all on all tables in schema public from group guests;
drop group guests;

DROP USER

Drops a user from a database. Multiple users can be dropped with a single DROP USER command. You must be a database superuser to execute this command.
Syntax
DROP USER [ IF EXISTS ] name [, … ]
You can’t drop a user if the user owns any database object, such as a schema, database, table, or view, or if the user has any privileges on a table, database, or group. If you attempt to drop such a user, you will receive one of the following errors.
ERROR: user "username" can't be dropped because the user owns some object [SQL State=55006]
ERROR: user "username" can't be dropped because the user has a privilege on some object [SQL State=55006]
Amazon Redshift checks only the current database before dropping a user. DROP USER doesn’t return an error if the user owns database objects or has any privileges on objects in another database. If you drop a user that owns objects in another database, the owner for those objects is changed to ‘unknown’.
If a user owns an object, first drop the object or change its ownership to another user before dropping the original user. If the user has privileges for an object, first revoke the privileges before dropping the user. The following example shows dropping an object, changing ownership, and revoking privileges before dropping the user.

drop database dwdatabase;
alter schema dw owner to dwadmin;
revoke all on table dwtable from dwuser;
drop user dwuser;
The following example drops the user account danny if it exists, or does nothing and returns a message if it does not:
drop user if exists danny;

DROP SCHEMA

Deletes a schema. This command is not reversible.
Syntax
DROP SCHEMA [ IF EXISTS ] name [, ...] [ CASCADE | RESTRICT ]
Example
The following example deletes a schema named S_SALES. This example uses RESTRICT as a safety mechanism so that the schema will not be deleted if it contains any objects. In this case, you would need to delete the schema objects before deleting the schema:
drop schema s_sales restrict;
The following example deletes a schema named S_SALES and all objects that depend on that schema:
drop schema s_sales cascade;
The following example either drops the S_SALES schema if it exists, or does nothing and returns a message if it does not:
drop schema if exists s_sales;

DROP TABLE

Removes a table from a database. Only the owner of the table, the schema owner, or a superuser can drop a table.
If you are trying to empty a table of rows, without removing the table, use the DELETE or TRUNCATE command.
DROP TABLE removes constraints that exist on the target table. Multiple tables can be removed with a single DROP TABLE command.
DROP TABLE with an external table can’t be used inside a transaction (BEGIN … END).
Syntax
DROP TABLE [ IF EXISTS ] name [, ...] [ CASCADE | RESTRICT ]
Examples
Dropping a Table with No Dependencies
The following example creates and drops a table called FEEDBACK that has no dependencies:
create table feedback(a int);
drop table feedback;

DROP VIEW

Removes a view from the database. Multiple views can be dropped with a single DROP VIEW command. This command is not reversible.
Syntax
DROP VIEW [ IF EXISTS ] name [, … ] [ CASCADE | RESTRICT ]
Examples
The following example drops the view called event:
drop view event;
To remove a view that has dependencies, use the CASCADE option:
drop view eventview cascade;

DROP FUNCTION

Removes a user-defined function (UDF) from the database. The function’s signature, or list of argument data types, must be specified because multiple functions can exist with the same name but different signatures. You can’t drop an Amazon Redshift built-in function.
This command is not reversible.
Syntax
DROP FUNCTION name ( [arg_name] arg_type [, …] )
[ CASCADE | RESTRICT ]
Examples
The following example drops the function named f_sqrt:
drop function f_sqrt(int);
To remove a function that has dependencies, use the CASCADE option, as shown in the following example:
drop function f_sqrt(int) cascade;

COPY
The COPY command appends the new input data to any existing rows in the table.

· The maximum size of a single input row from any source is 4 MB.
· To use the COPY command, you must have INSERT privilege for the Amazon Redshift table.

Loads data into a table from data files or from an Amazon DynamoDB table. The files can be located in an Amazon Simple Storage Service (Amazon S3) bucket, an Amazon EMR cluster, or a remote host that is accessed using a Secure Shell (SSH) connection.
COPY Syntax
COPY table-name [ column-list ]
FROM data_source authorization
[ [ FORMAT ] [ AS ] data_format ]
[ parameter [ argument ] [, … ] ]
Amazon Redshift extends the functionality of the COPY command to enable you to load data in several
data formats from multiple data sources, control access to load data, manage data transformations, and manage the load operation.
Optional Parameters
You can optionally specify how COPY will map field data to columns in the target table, define source data attributes to enable the COPY command to correctly read and parse the source data, and manage which operations the COPY command performs during the load process.
· Column Mapping Options
· Data Format Parameters
· Data Conversion Parameters
· Data Load Operations
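
A minimal COPY sketch (the bucket, prefix, and IAM role ARN are hypothetical); with no format parameters it expects pipe-delimited UTF-8 text, the default:

copy sales
from 's3://mybucket/tickit/sales/'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole';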

COPY

If a column in the target table is omitted from the column list, then COPY loads the target column’s DEFAULT expression.
If the target column does not have a default, then COPY attempts to load NULL.
If COPY attempts to assign NULL to a column that is defined as NOT NULL, the COPY command fails.

COPY
Column Mapping
By default, COPY inserts field values into the target table’s columns in the same order as the fields occur in the data files. If the default column order will not work, you can specify a column list or use JSONPath expressions to map source data fields to the target columns.
Column List
You can specify a comma-separated list of column names to load source data fields into specific target columns. The columns can be in any order in the COPY statement, but when loading from flat files, such as in an Amazon S3 bucket, their order must match the order of the source data.
When loading from an Amazon DynamoDB table, order does not matter. The COPY command matches attribute names in the items retrieved from the DynamoDB table to column names in the Amazon Redshift table.
The format for a column list is as follows.
copy tablename (column1 [, column2, ...])
JSONPaths File
When loading from data files in JSON or Avro format, COPY automatically maps the data elements in the JSON or Avro source data to the columns in the target table by matching field names in the Avro schema to column names in the target table or column list.
If your column names and field names don’t match, or to map to deeper levels in the data hierarchy, you can use a JSONPaths file to explicitly map JSON or Avro data elements to columns.
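
For illustration, a JSONPaths file is itself a small JSON document; the following sketch (field names are hypothetical) maps two nested source fields, in order, to the first two columns of the target table or column list:

{
  "jsonpaths": [
    "$.venue.name",
    "$.venue.city"
  ]
}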

Data Format Parameters

By default, the COPY command expects the source data to be character-delimited UTF-8 text. The default delimiter is a pipe character ( | ). If the source data is in another format, use the following parameters to specify the data format.
· FORMAT
· CSV
· DELIMITER
· FIXEDWIDTH
· AVRO
· JSON
· ENCRYPTED
· BZIP2
· GZIP
· LZOP
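
For example, the following sketch (paths and role are hypothetical) uses the CSV and GZIP parameters above to load a gzip-compressed CSV file:

copy event
from 's3://mybucket/tickit/event.csv.gz'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
csv gzip;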

Data Conversion Parameters

As it loads the table, COPY attempts to implicitly convert the strings in the source data to the data type of the target column. If you need to specify a conversion that is different from the default behavior, or if the default conversion results in errors, you can manage data conversions by specifying the following parameters.
• If you specify ‘auto’ as the argument for the DATEFORMAT or TIMEFORMAT parameter, Amazon Redshift will automatically recognize and convert the date format or time format in your source data.
ACCEPTANYDATE – Allows any date format to be loaded without generating an error.
ACCEPTINVCHARS – Enables loading invalid UTF-8 characters in VARCHAR columns.
BLANKSASNULL – Loads blank fields, which consist of only white space characters, as NULL
DATEFORMAT – If no DATEFORMAT is specified, the default format is ‘YYYY-MM-DD’.
EMPTYASNULL – Indicates to load empty CHAR and VARCHAR fields as NULL.
ENCODING – Specifies encoding type of the data to be loaded. UTF8/16/16LE/16BE
ESCAPE – Specifies to treat the backslash character (\) in data as an escape character
EXPLICIT_IDS – Overrides IDENTITY columns and populates them with values from the load data
FILLRECORD – Fills missing contiguous columns at the end of records with NULLs or blanks.
IGNOREBLANKLINES- Ignores and does not load blank lines that only contain a line feed.
IGNOREHEADER – Treats specified number of rows as a file header and does not load them
NULL AS – Loads fields that match the specified value as NULL
REMOVEQUOTES – Removes surrounding quotation marks from strings in the incoming data.
ROUNDEC – Rounds up values when the input value’s scale is greater than target column scale.
TIMEFORMAT – Specifies the time format. Default format is YYYY-MM-DD HH:MI:SS
TRIMBLANKS – Removes the trailing white space characters from a VARCHAR string.
• TRUNCATECOLUMNS – Truncates data in columns to fit the VARCHAR or CHAR column size.

Data Load Operations

Manage the default behavior of the load operation for troubleshooting or to reduce
load times by specifying the following parameters.
• COMPROWS – Specifies the number of rows to be used as the sample size for compression analysis. The analysis is run on rows from each data slice. For example, if you specify COMPROWS 1000000 (1,000,000) and the system contains four total slices, no more than 250,000 rows for each slice are read and analyzed. If COMPROWS is not specified, the sample size defaults to 100,000 for each slice.
· If the COPY command fails, the entire transaction is aborted and all changes are rolled back.
· After a COPY command is successfully initiated, it does not fail if the session terminates, for example when the client disconnects.
· However, if the COPY command is within a BEGIN … END transaction block that does not complete because the session terminates, the entire transaction, including the COPY, is rolled back.
COMPUPDATE – Controls whether compression encodings are automatically applied during a COPY. If COMPUPDATE is omitted, COPY applies automatic compression only if the target table is empty and all the table columns either have RAW encoding or no encoding. This behavior is the default. With COMPUPDATE OFF (or FALSE), automatic compression is disabled.
MAXERROR – If the load returns the specified number of errors or greater, the load fails. If the load returns fewer errors, it continues and returns an INFO message that states the number of rows that could not be loaded.
NOLOAD – Checks the validity of the data file without actually loading the data. Use the NOLOAD parameter to make sure that your data file will load without any errors before running the actual data load. Running COPY with the NOLOAD parameter is much faster than loading the data because it only parses the files.
• STATUPDATE – Governs automatic computation and refresh of optimizer statistics at the end of a successful COPY command. By default, if the STATUPDATE parameter is not used, statistics are updated automatically if the table is initially empty.

Using a Manifest to Specify Data Files

You can use a manifest to ensure that your COPY command loads all of the required files, and only the required files, from Amazon S3. You can also use a manifest when you need to load multiple files from different buckets or files that do not share the same prefix.
For example, suppose you need to load the following three files: custdata1.txt, custdata2.txt, and custdata3.txt. You could use the following command to load all of the files in mybucket that begin with custdata by specifying a prefix:
copy category
from 's3://mybucket/custdata'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole';

If only two of the files exist because of an error, COPY will load only those two files and finish successfully, resulting in an incomplete data load. If the bucket also contains an unwanted file that happens to use the same prefix, such as a file named custdata.backup for example, COPY will load that file as well, resulting in unwanted data being loaded.
To ensure that all of the required files are loaded and to prevent unwanted files from being loaded, you can use a manifest file. The manifest is a JSON-formatted text file that lists the files to be processed by the COPY command. For example, the following manifest loads the three files in the previous example.

{
  "entries": [
    {"url":"s3://mybucket/custdata.1", "mandatory":true},
    {"url":"s3://mybucket/custdata.2", "mandatory":true},
    {"url":"s3://mybucket/custdata.3", "mandatory":true}
  ]
}

The optional mandatory flag indicates whether COPY should terminate if the file does not exist. The default is false. Regardless of any mandatory settings, COPY will terminate if no files are found.
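
To use the manifest, point COPY at the manifest file itself and add the MANIFEST parameter; this sketch assumes the manifest above was saved under a hypothetical key s3://mybucket/cust.manifest and the same hypothetical IAM role:

copy category
from 's3://mybucket/cust.manifest'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
manifest;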

UNLOAD

• UNLOAD connects to Amazon S3 using an HTTPS connection.
• By default, UNLOAD writes data in parallel to multiple files, according to the number of slices in the cluster.
• By default, UNLOAD writes one or more files per slice. UNLOAD appends a slice number and part number to the specified name prefix as follows:
<object-path>/<name-prefix><slice-number>_part_<part-number>
• The Amazon S3 bucket where Amazon Redshift will write the output files must reside in the same region as your cluster.
Unloads the result of a query to one or more files on Amazon Simple Storage Service (Amazon S3).

UNLOAD ('select-statement')
TO 's3://object-path/name-prefix'
authorization
[ option [ ... ] ]

'select-statement' – A SELECT query. It is recommended to unload data in sorted order by using ORDER BY in the query.
TO 's3://object-path/name-prefix' – Full Amazon S3 path.
authorization – The UNLOAD command uses the same parameters the COPY command uses for authorization.
MANIFEST – Creates a manifest file that explicitly lists the data files that are created by the UNLOAD process.
DELIMITER AS 'delimiter_character' – Single ASCII character that is used to separate fields in the output file, such as a pipe character ( | ), a comma ( , ), or a tab ( \t ). The default delimiter is a pipe character.
FIXEDWIDTH 'fixedwidth_spec' – Unloads the data to a file where each column width is a fixed length, rather than separated by a delimiter.
ENCRYPTED – A clause that specifies that the output files on Amazon S3 will be encrypted using Amazon S3 server-side encryption or client-side encryption. If MANIFEST is specified, the manifest file is also encrypted. BZIP2 – Unloads data to one or more bzip2-compressed files per slice. Each resulting file is appended with a .bz2 extension.
GZIP – Unloads data to one or more gzip-compressed files per slice. Each resulting file is appended with a .gz extension.
ADDQUOTES – Places quotation marks around each unloaded data field, so that Amazon Redshift can unload data values that contain the delimiter itself.
NULL AS ‘null-string’ – Replaces the specified string in place of nulls in the unloaded files.
ESCAPE – For CHAR and VARCHAR columns in delimited unload files, an escape character (\) is placed before every occurrence of the following characters: linefeed (\n), carriage return (\r), the delimiter character, the escape character (\), and single or double quotation marks.
ALLOWOVERWRITE – If ALLOWOVERWRITE is specified, UNLOAD will overwrite existing files, including the manifest file.
• By default, UNLOAD fails if it finds files that it would possibly overwrite.
PARALLEL – The default option is ON or TRUE. If PARALLEL is OFF or FALSE, UNLOAD writes to one or more data files serially, sorted absolutely according to the ORDER BY clause, if one is used. The maximum size for a data file is 6.2 GB.
• MAXFILESIZE AS max-size [ MB | GB ] – The maximum size of files UNLOAD creates in Amazon S3. Specify a value between 5 MB and 6.2 GB. The AS keyword is optional. The default unit is MB.
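
A short UNLOAD sketch (the bucket, prefix, and role are hypothetical); it writes gzip-compressed, pipe-delimited files in parallel and also writes a manifest listing them:

unload ('select * from venue order by venueid')
to 's3://mybucket/unload/venue_'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
gzip
manifest
allowoverwrite;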

COMMENT

Creates or changes a comment about a database object.
Syntax
COMMENT ON
{
TABLE object_name |
COLUMN object_name.column_name |
CONSTRAINT constraint_name ON table_name |
DATABASE object_name |
VIEW object_name
}
IS 'text'
Example
The following example adds a descriptive comment to the EVENT table:
comment on table
event is ‘Contains listings of individual events.’;
To view comments, query the PG_DESCRIPTION system catalog. The following example returns the description for the EVENT table.
select * from pg_catalog.pg_description where objoid =
(select oid from pg_class where relname = ‘event’ and relnamespace =
(select oid from pg_catalog.pg_namespace where nspname = ‘public’) );
objoid | classoid | objsubid | description

INSERT

• The maximum size for a single SQL statement is 16 MB.

Inserts new rows into a table. You can insert a single row with the VALUES syntax, multiple rows with the VALUES syntax, or one or more rows defined by the results of a query (INSERT INTO…SELECT).
Syntax
INSERT INTO table_name [ ( column [, ...] ) ]
{ DEFAULT VALUES |
VALUES ( { expression | DEFAULT } [, ...] )
  [, ( { expression | DEFAULT } [, ...] ) [, ...] ] |
query }

•           If you do not specify a column list, the values to be inserted must correspond to the table columns in the order in which they were declared in the CREATE TABLE statement.

table_name – A temporary or persistent table. Only the owner of the table or a user with INSERT privilege on the table can insert rows.
DEFAULT VALUES – If the columns in the table were assigned default values when the table was created, use these keywords to insert a row that consists entirely of default values. If any of the columns do not have default values, nulls are inserted into those columns. If any of the columns are declared NOT NULL, the INSERT statement returns an error.
expression – A single value or an expression that evaluates to a single value. Each value must be compatible with the data type of the column where it is being inserted. If possible, a value whose data type does not match the column’s declared data type is automatically converted to a compatible data type.
DEFAULT – Use this keyword to insert the default value for a column, as defined when the table was created. If no default value exists for a column, a null is inserted.
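
For example, the following sketch (assuming a hypothetical CATEGORY_STAGE table with default values defined) shows the single-row, multi-row, and INSERT INTO ... SELECT forms:

insert into category_stage values (12, 'Concerts', 'Comedy', 'All stand-up comedy performances');
insert into category_stage values
  (13, 'Concerts', 'Other', default),
  (14, 'Concerts', default, default);
insert into category_stage (select * from category);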

SELECT

•           When a query doesn’t contain an ORDER BY clause, the system returns result sets with no predictable ordering of the rows. The same query executed twice might return the result set in a different order.

•           The LIMIT and OFFSET options can be used without an ORDER BY clause; however, to return a consistent set of rows, use these options in conjunction with an ORDER BY clause.

Returns rows from tables, views, and user-defined functions.
[ WITH with_subquery [, ...] ]
SELECT
[ TOP number | [ ALL | DISTINCT ] ]
* | expression [ AS output_name ] [, ...]
[ FROM table_reference [, ...] ]
[ WHERE condition ]
[ GROUP BY expression [, ...] ]
[ HAVING condition ]
[ { UNION | ALL | INTERSECT | EXCEPT | MINUS } query ]
[ ORDER BY expression [ ASC | DESC ] [ NULLS FIRST | NULLS LAST ] ]
[ LIMIT { count | ALL } ]
[ OFFSET start ]

NULLS FIRST | NULLS LAST – Option that specifies whether NULL values should be ordered first, before non-null values, or last, after non-null values. By default, NULL values are sorted and ranked last in default ASC ordering, and sorted and ranked first in DESC ordering.
LIMIT number | ALL – Option that controls the number of sorted rows that the query returns. The LIMIT number must be a positive integer; the maximum value is 2147483647. LIMIT 0 returns no rows. You can use this syntax for testing purposes: to check that a query runs (without displaying any rows) or to return a column list from a table. An ORDER BY clause is redundant if you are using LIMIT 0 to return a column list. The default is LIMIT ALL.
OFFSET start – Option that specifies to skip the number of rows before start before beginning to return rows. The OFFSET number must be a positive integer; the maximum value is 2147483647. When used with the LIMIT option, OFFSET rows are skipped before starting to count the LIMIT rows that are returned. If the LIMIT option is not used, the number of rows in the result set is reduced by the number of rows that are skipped. The rows skipped by an OFFSET clause still have to be scanned, so it might be inefficient to use a large OFFSET value.
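
A small sketch, assuming the TICKIT sample EVENT table; combining ORDER BY, NULLS LAST, LIMIT, and OFFSET returns a consistent page of rows on repeated runs:

select eventid, eventname, starttime
from event
order by starttime desc nulls last
limit 10 offset 20;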

UPDATE

• After updating a large number of rows in a table, analyze the table to update statistics for the query planner.

• When the target table is joined to itself or another table, a best practice is to use a subquery that clearly separates the join conditions from the criteria that qualify rows for updates.

Syntax
UPDATE table_name SET column = { expression | DEFAULT } [,…]
[ FROM fromlist ]
[ WHERE condition ]

DEFAULT – Updates the column with the default value that was assigned to the column in the CREATE TABLE statement.
FROM tablelist – You can update a table by referencing information in other tables. List these other tables in the FROM clause or use a subquery as part of the WHERE condition. Tables listed in the FROM clause can have aliases. If you need to include the target table of the UPDATE statement in the list, use an alias.
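
For example, the first sketch assumes the TICKIT sample CATEGORY table; the second updates the target table from a hypothetical CATEGORY_STAGE table listed in the FROM clause:

update category
set catdesc = 'Broadway Musical'
where catname = 'Musicals';

update category
set catdesc = cs.catdesc
from category_stage cs
where category.catid = cs.catid;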

DELETE

Deletes rows from tables.

Syntax

DELETE [ FROM ] table_name [ USING table_name [, ...] ] [ WHERE condition ]

FROM -The FROM keyword is optional, except when the USING clause is specified. The statements
delete from event; and delete event; are equivalent operations that remove all of the rows from the EVENT table.
USING table_name – The USING keyword is used to introduce a table list when additional tables are referenced in the WHERE clause condition. For example, the following statement deletes all of the rows from the EVENT table that satisfy the join condition over the EVENT and SALES tables. The SALES table must be explicitly named in the USING list:
delete from event using sales where event.eventid = sales.eventid;
If you repeat the target table name in the USING clause, the DELETE operation runs a self-join. You can use a subquery in the WHERE clause instead of the USING syntax as an alternative way to write the same query.

TRUNCATE

Deletes all of the rows from a table without doing a table scan: this operation is a faster alternative to an unqualified DELETE operation. To execute a TRUNCATE command, you must be the owner of the table or a superuser.
TRUNCATE is much more efficient than DELETE and does not require a VACUUM and ANALYZE. However, be aware that TRUNCATE commits the transaction in which it is run.
Syntax
TRUNCATE [ TABLE ] table_name
Examples
Use the TRUNCATE command to delete all of the rows from the CATEGORY table:
truncate category;
Attempt to roll back a TRUNCATE operation:
begin;
truncate date;
rollback;

select count(*) from date;
count
-------
0
(1 row)

The DATE table remains empty after the ROLLBACK because the TRUNCATE command committed the transaction implicitly.

VACUUM

Reclaims space and resorts rows in either a specified table or all tables in the current database.
• By default, VACUUM skips the sort phase for any table where more than 95 percent of the table’s rows are already sorted. Skipping the sort phase can significantly improve VACUUM performance.
• If the sort threshold is not met (for example, if 90 percent of rows are sorted) and VACUUM performs a full sort, then it also performs a complete delete operation, recovering space from 100 percent of deleted rows.
Syntax
You can’t run VACUUM within a transaction block (BEGIN … END).
You can run only one VACUUM command on a cluster at any given time. If you attempt to run multiple vacuum operations concurrently, Amazon Redshift returns an error.
VACUUM [ FULL | SORT ONLY | DELETE ONLY | REINDEX ]
[ table_name [ TO threshold PERCENT ] ]
During vacuum operations, some degree of query performance degradation is expected. Normal performance resumes as soon as the vacuum operation is complete.
A vacuum operation might not be able to start if a load or insert operation is already in progress.
Vacuum operations temporarily require exclusive access to tables in order to start. This exclusive access is required briefly, so vacuum operations don’t block concurrent loads and inserts for any significant period of time.

FULL – Sorts the specified table (or all tables in the current database) and reclaims disk space occupied by rows that were marked for deletion by previous UPDATE and DELETE operations. VACUUM FULL is the default.
SORT ONLY – Sorts the specified table (or all tables in the current database) without reclaiming space freed by deleted rows. This option is useful when reclaiming disk space is not important but resorting new rows is important.
DELETE ONLY – Reclaims disk space occupied by rows that were marked for deletion by previous UPDATE and DELETE operations, and compacts the table to free up the consumed space. A DELETE ONLY vacuum operation doesn’t sort table data.
table_name TO threshold PERCENT – You can change the default vacuum threshold only for a single table. To change the default vacuum threshold for a single table, include the table name and the TO threshold PERCENT parameter.

VACUUM

Examples
Reclaim space and resort rows in all tables in the current database, based on the default 95 percent vacuum threshold.
vacuum;
Reclaim space and resort rows in the SALES table based on the default 95 percent threshold.
vacuum sales;
Always reclaim space and resort rows in the SALES table.
vacuum sales to 100 percent;
Resort rows in the SALES table only if fewer than 75 percent of rows are already sorted.
vacuum sort only sales to 75 percent;
Reclaim space in the SALES table such that at least 75 percent of the remaining rows are not marked for deletion following the vacuum.
vacuum delete only sales to 75 percent;
Reindex and then vacuum the LISTING table.
vacuum reindex listing;

BEGIN | START TRANSACTION    END | COMMIT    ABORT | ROLLBACK

BEGIN | START TRANSACTION
Starts a transaction.
In general, all commands in a transaction execute on a snapshot of the database. By default, individual Amazon Redshift operations (queries, DDL statements, loads) are automatically committed to the database. If you want to suspend the commit for an operation until subsequent work is completed, you need to open a transaction with the BEGIN statement, then run the required commands, then close the transaction with a COMMIT or END statement.
Syntax
• The ABORT and ROLLBACK commands perform the same function.
• ABORT and ROLLBACK abort the current transaction and discard all updates made by that transaction. These commands are the exact opposite of the COMMIT command.
BEGIN [ WORK | TRANSACTION ] [ ISOLATION LEVEL option ] [ READ WRITE | READ ONLY ]
START TRANSACTION [ ISOLATION LEVEL option ] [ READ WRITE | READ ONLY ]
Where option is
SERIALIZABLE | READ UNCOMMITTED | READ COMMITTED | REPEATABLE READ
Note: READ UNCOMMITTED, READ COMMITTED, and REPEATABLE READ have no operational impact and map to SERIALIZABLE in Amazon Redshift.
END | COMMIT
Commits the current transaction and makes the database updates from the transaction permanent. Both commands perform exactly the same function.
Syntax
END [ WORK | TRANSACTION ]
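
A minimal sketch of an explicit transaction, assuming the TICKIT sample CATEGORY table; nothing is visible to other sessions until the COMMIT, and ROLLBACK could be issued instead to discard the insert:

begin;
insert into category values (15, 'Shows', 'Circus', 'Circus performances');
commit;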

PREPARE
EXECUTE I DEALLOCATE

• Prepared statements only last for the duration of the current session. When the session ends, the prepared statement is discarded, so it must be re-created before being used again.
• Prepared statements have the largest performance advantage when a single session is being used to execute a large number of similar statements.
• When an EXECUTE command is issued for the prepared statement, for each new execution of a prepared statement, Amazon Redshift may revise the query execution plan to improve performance based on the specified parameter values.
• DEALLOCATE is used to deallocate a previously prepared SQL statement. If you do not explicitly deallocate a prepared statement, it is deallocated when the current session ends.

PREPARE
Prepare a statement for execution.
PREPARE creates a prepared statement. When the PREPARE statement is executed, the specified statement (SELECT, INSERT, UPDATE, or DELETE) is parsed, rewritten, and planned. When an EXECUTE command is then issued for the prepared statement, Amazon Redshift may optionally revise the query execution plan (to improve performance based on the specified parameter values) before executing the prepared statement.

Syntax
PREPARE plan_name [ ( datatype [, ...] ) ] AS statement
plan_name – An arbitrary name given to this particular prepared statement. It must be unique within a single session and is subsequently used to execute or deallocate a previously prepared statement.
datatype – The data type of a parameter to the prepared statement. To refer to the parameters in the prepared statement itself, use $1, $2, and so on.
statement – Any SELECT, INSERT, UPDATE, or DELETE statement.
Examples
Create a temporary table, prepare INSERT statement and then execute it:
DROP TABLE IF EXISTS prep1;
CREATE TABLE prep1 (c1 int, c2 char(20));
PREPARE prep_insert_plan (int, char)
AS insert into prep1 values ($1, $2);
EXECUTE prep_insert_plan (1, 'one');
EXECUTE prep_insert_plan (2, 'two');
EXECUTE prep_insert_plan (3, 'three');
DEALLOCATE prep_insert_plan;
Prepare a SELECT statement and then execute it:
PREPARE prep_select_plan (int)
AS select * from prep1 where c1 = $1;
EXECUTE prep_select_plan (2);
EXECUTE prep_select_plan (3);

LOCK SET SESSION AUTHORIZATION

The LOCK command obtains a table-level lock in “ACCESS EXCLUSIVE” mode, waiting if necessary for any conflicting locks to be released.

LOCK
Restricts access to a database table. This command is only meaningful when it is run inside a transaction block. An explicit table lock created by one user temporarily prevents another user from selecting data from that table or loading data into it. The lock is released when the transaction that contains the LOCK command completes.
Syntax
LOCK [ TABLE ] table_name [, …]
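
For example, the following sketch (assuming the TICKIT sample EVENT and SALES tables) locks two tables inside a transaction block; the locks are released when the transaction commits:

begin;
lock event, sales;
-- statements that need exclusive access to both tables go here
commit;
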
SET SESSION AUTHORIZATION
Sets the user name for the current session. You can use the SET SESSION AUTHORIZATION command, for example, to test database access by temporarily running a session or transaction as an unprivileged user.
Syntax
SET [ SESSION | LOCAL ] SESSION AUTHORIZATION { user_name | DEFAULT }
Examples
The following example sets the user name for the current session to dwuser:
SET SESSION AUTHORIZATION 'dwuser';
The following example sets the user name for the current transaction to dwuser:
SET LOCAL SESSION AUTHORIZATION 'dwuser';
This example sets the user name for the current session to the default user name:
SET SESSION AUTHORIZATION DEFAULT;

DECLARE CURSOR FETCH | CLOSE

You must declare a cursor within a transaction block. Only one cursor at a time can be open per session.
Because of the potential negative performance impact of using cursors with large result sets, it is recommended to use alternative approaches whenever possible.

DECLARE CURSOR
Defines a new cursor. Use a cursor to retrieve a few rows at a time from the result set of a larger query. When the first row of a cursor is fetched, the entire result set is materialized on the leader node, in memory or on disk, if needed
FETCH
FETCH retrieves rows based on the current position within the cursor. When a cursor is created, it is positioned before the first row. After a FETCH, the cursor is positioned on the last row retrieved. If FETCH runs off the end of the available rows, such as following a FETCH ALL, the cursor is left positioned after the last row.
CLOSE
Closes all of the free resources that are associated with an open cursor.
Cursor Constraints
When the first row of a cursor is fetched, the entire result set is materialized on the leader node.
If the result set does not fit in memory, it is written to disk as needed. To protect the integrity of the leader node, Amazon Redshift enforces constraints on the size of all cursor result sets, based on the cluster’s node type.
Node type – Maximum result set per cluster (MB)
DS1 or DS2 XL, single node – 64000
DS1 or DS2 XL, multiple nodes – 1800000
DS1 or DS2 8XL, multiple nodes – 14400000
DC1 Large, single node – 16000
DC1 Large, multiple nodes – 384000
DC1 8XL, multiple nodes – 3000000
DC2 Large, single node – 8000
DC2 Large, multiple nodes – 192000

EXPLAIN

•      You can use EXPLAIN only for the following commands: SELECT, SELECT INTO, CREATE TABLE AS, INSERT, UPDATE, DELETE.

•      The EXPLAIN command will fail if you use it for other SQL commands, such as data definition language (DDL) or database operations.

Displays the execution plan for a query statement without running the query.
Syntax
EXPLAIN [ VERBOSE ] query
Parameters
VERBOSE – Displays the full query plan instead of just a summary.
query – Query statement to explain. The query can be a SELECT, INSERT, CREATE TABLE AS, UPDATE, or DELETE statement.
The following example returns the query plan for a query that selects the EVENTID, EVENTNAME, VENUEID, and VENUENAME from the EVENT and VENUE tables:
explain
select eventid, eventname, event.venueid, venuename from event, venue
where event.venueid = venue.venueid;
QUERY PLAN
XN Hash Join DS_DIST_OUTER (cost=2.52..58653620.93 rows=8712 width=43)
  Hash Cond: ("outer".venueid = "inner".venueid)
-> XN Seq Scan on event (cost=0.00..87.98 rows=8798 width=23)
-> XN Hash (cost=2.02..2.02 rows=202 width=22)
-> XN Seq Scan on venue (cost=0.00..2.02 rows=202 width=22)
(5 rows)

Displays the execution plan for a query statement without running the query.
EXPLAIN Operators, Query Execution Steps, and Descriptions
SCAN:
Sequential Scan (scan) – Amazon Redshift relation scan or table scan operator or step. Scans the whole table sequentially from beginning to end; also evaluates query constraints for every row (Filter) if specified with a WHERE clause. Also used to run INSERT, UPDATE, and DELETE statements.
JOINS: Amazon Redshift uses different join operators based on the physical design of the tables being joined, the location of the data required for the join, and specific attributes of the query itself. Subquery Scan – Subquery scan and append are used to run UNION queries.
Nested Loop (nloop) – Least optimal join; mainly used for cross-joins (Cartesian products, without a join condition) and some inequality joins.
Hash Join (hjoin) – Also used for inner joins and left and right outer joins and typically faster than a nested loop join. Hash Join reads the outer table, hashes the joining column, and finds matches in the inner hash table. The step can spill to disk. (The inner input of hjoin is a hash step, which can be disk-based.)
Merge Join (mjoin) – Also used for inner joins and outer joins (for join tables that are both distributed and sorted on the joining columns). Typically the fastest Amazon Redshift join algorithm, not including other cost considerations.

EXPLAIN

Displays the execution plan for a query statement without running the query.

EXPLAIN Operators, Query Execution Steps, and Descriptions
Network Operations:
Network (Broadcast) (bcast) – Broadcast is also an attribute of Join Explain operators and steps.
Network (Distribute) (dist) – Distributes rows to compute nodes for parallel processing by the data warehouse cluster.
Network (Send to Leader) (return) – Sends results back to the leader for further processing.
DML Operations (operators that modify data):
Insert (using Result) (insert) – Inserts data.
Delete (Scan + Filter) (delete) – Deletes data. Can operate from disk.
Update (Scan + Filter) (delete, insert) – Implemented as a delete followed by an insert.

ANALYZE

To view the results of ANALYZE operations, query the STL_ANALYZE system table.
You can analyze specific tables, including temporary tables, and you can qualify a table with its schema name. You can optionally specify a table name to analyze a single table.
You can’t specify more than one table name with a single ANALYZE table name statement. If you don’t specify a table name value, all of the tables in the currently connected database are analyzed, including the persistent tables in the system catalog.
Amazon Redshift skips analyzing a table if the percentage of rows that have changed since the last ANALYZE is lower than the analyze threshold.
If you specify a table name, you can also specify one or more columns in the table (as a comma-separated list within parentheses). If a column list is specified, only the listed columns are analyzed.

ANALYZE
Updates table statistics for use by the query planner.
Syntax
ANALYZE [ VERBOSE ]
[ [ table_name [ ( column_name [, …] ) ] ]
[ PREDICATE COLUMNS | ALL COLUMNS ]
To reduce processing time and improve overall system performance, Amazon Redshift skips ANALYZE for a table if the percentage of rows that have changed since the last ANALYZE command run is lower than the analyze threshold specified by the analyze_threshold_percent parameter.
By default, analyze_threshold_percent is 10. To change analyze_threshold_percent for the current session, execute the SET command. The following example changes analyze_threshold_percent to 20 percent.
set analyze_threshold_percent to 20;
To analyze tables when only a small number of rows have changed, set analyze_threshold_percent to an arbitrarily small number. For example, if you set analyze_threshold_percent to 0.01, then a table with 100,000,000 rows will not be skipped if at least 10,000 rows have changed.
set analyze_threshold_percent to 0.01;
If ANALYZE skips a table because it doesn’t meet the analyze threshold, Amazon Redshift returns the following message.
ANALYZE SKIP
To analyze all tables even if no rows have changed, set analyze_threshold_percent to 0.

ANALYZE

Updates table statistics for use by the query planner.
Examples
Analyze all of the tables in the TICKIT database and return progress information.
analyze verbose;
Analyze the LISTING table only.
analyze listing;
Analyze the VENUEID and VENUENAME columns in the VENUE table.
analyze venue(venueid, venuename);
Analyze only predicate columns in the VENUE table.
analyze venue predicate columns;

ANALYZE COMPRESSION

ANALYZE COMPRESSION acquires an exclusive table lock, which prevents concurrent reads and writes against the table. Only run the ANALYZE COMPRESSION command when the table is idle.
ANALYZE COMPRESSION is an advisory tool and doesn’t modify the column encodings of the table.
The analysis is run on rows from each data slice. For example, if you specify COMPROWS 1000000 (1,000,000) and the system contains 4 total slices, no more than 250,000 rows per slice are read and analyzed. If COMPROWS is not specified, the sample size defaults to 100,000 per slice.

Run ANALYZE COMPRESSION to get recommendations for column encoding schemes, based on a sample of the table’s contents. The suggested encoding can be applied by recreating the table, or creating a new table with the same schema.

Recreating an uncompressed table with appropriate encoding schemes can significantly reduce its on-disk footprint, saving disk space and improving query performance for I/O-bound workloads.

Syntax

ANALYZE COMPRESSION
[ [ table_name ]
[ ( column_name [, …] ) ] ]
[ COMPROWS numrows ]
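For example, assuming the LISTING table from the TICKIT sample schema, a sketch that samples up to 1,000,000 rows in total (divided across slices, as described above):
analyze compression listing comprows 1000000;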

Built-in Aggregation and Conditional Functions

AVG – The AVG function returns the average (arithmetic mean) of the input expression values.

COUNT – The COUNT function counts the rows defined by the expression.

MAX – Returns the maximum value in a set of rows.
MEDIAN – Calculates the median value for the range of values.

MIN – The MIN function returns the minimum value in a set of rows.

STDDEV_SAMP/ STDDEV_POP- Returns sample and population standard deviation.

SUM – Returns the sum of the input column or expression values.

VAR_SAMP, VAR_POP – Returns sample and population variance.
BIT_AND, BIT_OR – Performs bit-wise AND and OR operations.
BOOL_AND, BOOL_OR – Performs Boolean AND and OR operations.

CASE – A conditional expression, similar to if/then/else statements found in other languages.

DECODE – Equivalent to the operation of a simple CASE or an IF-THEN-ELSE statement

GREATEST, LEAST – Returns the largest or smallest value from a list of any expressions.

NVL, COALESCE – Returns the value of the first expression in the list that is not null.

NULLIF – Compares two arguments and returns null if the arguments are equal. Inverse of NVL

LISTAGG – For each group in a query, the LISTAGG aggregate function orders the rows for that group according to the ORDER BY expression, then concatenates values into a single string.

APPROXIMATE PERCENTILE_DISC – For a given percentile value, this function uses a quantile summary algorithm to approximate the discrete percentile of the expression in the ORDER BY clause.

PERCENTILE_CONT – This function takes a percentile value and a sort specification, and returns an interpolated value that would fall into the given percentile value with respect to the sort specification.
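A hedged sketch combining a few of these functions on the VENUE table from the TICKIT sample schema (the labels and substitution values are illustrative):
-- NVL and DECODE evaluated per row
select venuename,
       nvl(venueseats, 0) as seats,
       decode(venuestate, 'NY', 'New York', 'Other') as state_label
from venue;
-- LISTAGG as a grouped aggregate
select venuestate,
       listagg(venuename, ', ') within group (order by venuename) as venue_list
from venue
group by venuestate;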

COMPRESSION ENCODING
  • Compression is a column-level operation that reduces the size of data when it is stored. Compression conserves storage space and reduces the size of data that is read from storage, which reduces the amount of disk I/O and therefore improves query performance.
  • By default, Amazon Redshift stores data in its raw, uncompressed format. You can apply a compression type, or encoding, to the columns in a table manually when you create the table, or you can use the COPY command to analyze and apply compression automatically.
  • If you choose to apply compression encodings manually, you can run the ANALYZE COMPRESSION command against an already populated table and use the results to choose compression encodings.
  • If you do not specify a compression encoding during the table creation process using CREATE TABLE command, the default encodings are applied. In particular, LZO is the default compression encoding for columns of CHAR and VARCHAR data types.
  • Compression can save significant storage space and can improve I/O performance, which can enable using the most efficient size and type of the cluster, which directly translates into cost savings.
  • There are 8 types of compression encodings in Amazon Redshift – RAW, BYTE-DICTIONARY, DELTA, LZO, MOSTLY, RUNLENGTH, TEXT, and ZSTANDARD. The most frequently used are the LZO, TEXT, BYTE-DICTIONARY, and ZSTD encodings. ZSTD is a universal encoding applicable to all data types.

If no compression is specified in a CREATE TABLE or ALTER TABLE statement, Amazon Redshift automatically assigns compression encoding as follows:
· Columns that are defined as sort keys are assigned RAW compression.
· Columns that are defined as BOOLEAN, REAL, or DOUBLE PRECISION data types are assigned RAW compression.
· All other columns are assigned LZO compression.

COMPRESSION ENCODING

Encoding and Description

  1. Raw
  2. Byte-dictionary
  3. Delta
  4. LZO
  5. Mostly
  6. Runlength
  7. Text
  8. Zstandard
    Raw – Default encoding for columns that are designated as sort keys and columns that are defined as BOOLEAN, REAL, or DOUBLE PRECISION data types. With raw encoding, data is stored in raw, uncompressed form.
    Byte-dictionary – In byte dictionary encoding, a separate dictionary of unique values is created for each block of column values on disk. The dictionary contains the indexed unique values, and the table itself contains only the one-byte subscripts of the corresponding values. Suitable for columns with a limited number of unique values, such as a COUNTRY column.
    Delta – Compresses data by recording the difference between values that follow each other in the column. This difference is recorded in a separate dictionary for each block of column values on disk. Useful for date and time columns. Comes in two variations – DELTA and DELTA32K.
    LZO – Provides a very high compression ratio with good performance. LZO encoding works especially well for CHAR and VARCHAR columns that store very long character strings, especially free-form text or JSON strings. LZO is the default encoding except for columns that are designated as sort keys and columns that are defined as BOOLEAN, REAL, or DOUBLE PRECISION data types.
    Mostly – Comes in three variations – MOSTLY8, MOSTLY16, and MOSTLY32. Mostly encodings are useful when the data type for a column is larger than most of the stored values require. By specifying a mostly encoding for this type of column, you can compress the majority of the values in the column to a smaller standard storage size. The remaining values that cannot be compressed are stored in their raw form. For example, you can compress a 16-bit column, such as an INT2 column, to 8-bit storage.
    Runlength – Replaces a value that is repeated consecutively with a token that consists of the value and a count of the number of consecutive occurrences. A separate dictionary of unique values is created for each block of column values on disk. This encoding is best suited to a table in which data values are often repeated consecutively, for example, when the table is sorted by those values.
    Text – Comes in two variations – TEXT255 and TEXT32K. Useful for compressing VARCHAR columns in which the same words recur often. A separate dictionary of unique words is created for each block of column values on disk. The dictionary contains the first 245 unique words in the column, which are replaced on disk by a one-byte index value; any words that are not represented in the dictionary are stored uncompressed. For the TEXT32K encoding, the principle is the same, except the dictionary indexes each unique word it finds until the combined entries reach a length of 32K, minus some overhead. The index values are stored in two bytes.
    Zstandard – Zstandard (ZSTD) encoding provides a high compression ratio with very good performance across diverse data sets. ZSTD works especially well with CHAR and VARCHAR columns that store a wide range of long and short strings and JSON strings. Where some algorithms, such as Delta encoding or Mostly encoding, can potentially use more storage space than no compression, ZSTD is very unlikely to increase disk usage. ZSTD supports all Amazon Redshift data types.

--Default Encoding
create table cartesian_venue_default(
venueid smallint not null distkey sortkey,
venuename varchar(100),
venuecity varchar(30),
venuestate char(2),
venueseats integer);
select "column", type, encoding
from pg_table_def
where tablename = 'cartesian_venue_default';
insert into cartesian_venue_default
select venueid, venuename, venuecity, venuestate, venueseats
from venue, listing;
analyze compression cartesian_venue_default;
--Raw Encoding
create table cartesian_venue_raw(
venueid smallint not null distkey sortkey encode raw,
venuename varchar(100) encode raw,
venuecity varchar(30) encode raw,
venuestate char(2) encode raw,
venueseats integer encode raw);

select "column", type, encoding
from pg_table_def
where tablename = 'cartesian_venue_raw';
insert into cartesian_venue_raw
select venueid, venuename, venuecity, venuestate, venueseats
from venue, listing;
analyze compression cartesian_venue_raw;

DISTRIBUTION STYLE

You cannot change a table’s distribution style after it is created. To recreate tables with a different distribution style, use a deep copy.
Every table you create has three hidden columns appended to it: INSERT_XID, DELETE_XID, and ROW_ID (OID).
A table with 3 user-defined columns contains 6 actual columns, and the user-defined columns are internally numbered as 0, 1, and 2. The INSERT_XID, DELETE_XID, and ROW_ID columns are numbered 3, 4, and 5, respectively.
The EXPLAIN command can be used to analyze query plans, which provides insight into the data redistribution between nodes that results from the distribution style.

TYPES OF DISTRIBUTION
• Even distribution – The leader node distributes the rows across the slices in a round-robin fashion, regardless of the values in any particular column. Appropriate when a table does not participate in joins. Default distribution style.
• Key distribution – The rows are distributed according to the values in one column. The leader node attempts to place matching values on the same node slice, so that matching values from the common columns are physically stored together. Only one pair of tables can be collocated using key distribution.
• ALL distribution – A copy of the entire table is distributed to every node. Ensures that every row is collocated for every join that the table participates in. (See the sketch below for the corresponding CREATE TABLE syntax.)
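A minimal sketch of declaring each distribution style at table creation (table and column names are hypothetical):
create table sales_even (saleid integer, qty integer) diststyle even;
create table sales_key (saleid integer, listid integer) distkey(listid);
create table date_dim (dateid smallint, caldate date) diststyle all;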

Evaluating Query Plans

You can use query plans to identify candidates for optimizing the distribution style. Below are the table redistribution identifiers with the description of their functionality.
DS_DIST_NONE
No redistribution is required, because corresponding slices are collocated on the compute nodes. You will typically have only one DS_DIST_NONE step, the join between the fact table and one dimension table.
DS_DIST_ALL_NONE
No redistribution is required, because the inner join table used DISTSTYLE ALL. The entire table is located on every node.
DS_DIST_INNER
The inner table is redistributed.
DS_DIST_OUTER
The outer table is redistributed.
DS_BCAST_INNER
A copy of the entire inner table is broadcast to all the compute nodes.
DS_DIST_ALL_INNER
The entire inner table is redistributed to a single slice because the outer table uses DISTSTYLE ALL.
DS_DIST_BOTH
Both tables are redistributed.

Amazon Redshift stores columnar data in 1 MB disk blocks. The min and max values for each block are stored as part of the metadata.

If query uses a range-restricted predicate, the query processor can use the min and max values to rapidly skip over large numbers of blocks during table scans.

Cost consists of two decimal values separated by two periods. The first value, for example 15345152254.79, provides the relative cost of returning the first row for this operation. The second value provides the relative cost of completing the operation. The costs in the query plan are cumulative as you read up the plan, so the HashAggregate cost in this example (15345152254.79) includes the cost of the operations below it (0.00..87.98).

[Figure: output of the EXPLAIN command in Aginity Workbench for Redshift]

Sort Keys

  • When you create a table, you can define one or more of its columns as sort keys. When data is initially loaded into the empty table, the rows are stored on disk in sorted order. Information about sort key columns is passed to the query planner, and the planner uses this information to construct plans that exploit the way that the data is sorted. Sorting enables efficient handling of range-restricted predicates.
  • You can specify either a compound or interleaved sort key. To define a sort type, use either the INTERLEAVED or COMPOUND keyword with your CREATE TABLE or CREATE TABLE AS statement. The default is COMPOUND.
  • COMPOUND SORT KEY – Specifies that the data is sorted using a compound key made up of all of the listed columns, in the order they are listed. A compound sort key is most useful when a query scans rows according to the order of the sort columns. The performance benefits of sorting with a compound key decrease when queries depend only on secondary sort columns, without referencing the primary columns. Compound sort keys also help improve compression.
  • INTERLEAVED SORT KEY – An interleaved sort gives equal weight to each column, or subset of columns, in the sort key, so queries do not depend on the order of the columns in the sort key. When a query uses one or more secondary sort columns, interleaved sorting significantly improves query performance. The performance improvements you gain by implementing an interleaved sort key should be weighed against increased load and vacuum times.
  • As you add rows to a sorted table that already contains data, the unsorted region grows, which has a significant effect on performance. The effect is greater when the table uses interleaved sorting

Sorting enables efficient handling of range- restricted predicates.

To view the sort keys for a table, query the SVV_TABLE_INFO system view.

If you do not specify any sort keys, the table is not sorted by default. You can define a maximum of 400 COMPOUND SORTKEY columns or 8 INTERLEAVED SORTKEY columns per table.
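A minimal sketch of defining the two sort key types at table creation (table and column names are hypothetical):
create table sales_compound (
  saletime timestamp,
  listid integer,
  sellerid integer)
compound sortkey(saletime, listid);
create table sales_interleaved (
  saletime timestamp,
  listid integer,
  sellerid integer)
interleaved sortkey(saletime, listid);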

You should run a Vacuum operation regularly, especially after large data loads, to re-sort and  re-analyze the data.

• After vacuuming to re-sort the data, it's a good practice to run an ANALYZE command to update the statistical metadata for the query planner.

RESTORING SORT ORDER

  • When a set of rows is added by a COPY operation, the new set of rows is sorted on the sort key as it is added to the unsorted region at the end of the table.
  • The new rows are ordered within their own set, but not within the unsorted region.

When tables are initially loaded, Amazon Redshift analyzes the distribution of the values in the sort key columns and uses that information for optimal interleaving of the sort key columns.

As a table grows, the distribution of the values in the sort key columns can change, or skew, especially with date or timestamp columns.

If the skew becomes too large, performance might be affected.

To re-analyze the sort keys and restore performance, run the VACUUM command with the REINDEX key word.

SORT KEYS
  • VACUUM TYPES:
  • VACUUM FULL
  • VACUUM DELETE ONLY
  • VACUUM SORT ONLY
  • VACUUM REINDEX
  • To determine what proportion of a table was remerged, query SVV_VACUUM_SUMMARY after the vacuum operation completes (see the sketch below).
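A hedged sketch of the four vacuum variants against the LISTING table used earlier, followed by the summary view mentioned above:
vacuum full listing;
vacuum delete only listing;
vacuum sort only listing;
vacuum reindex listing;
select * from svv_vacuum_summary;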

RESTORING SORT ORDER

A vacuum restores the table’s sort order in two stages:

  • Sort the unsorted region into a newly-sorted region.
  • Merge the newly-sorted region with the previously-sorted region

STAR SCHEMA BENCHMARK (SSB)

SSB Dataset is hosted on Amazon S3 buckets owned by Amazon Redshift, and read access is available to all authenticated AWS users. This is an ideal dataset for any performance testing of table design related queries. Loading this dataset requires an available storage space of around 100 GB.

CREATE TABLE

•    The maximum number of tables is 9,900 for large and xlarge cluster node types and 20,000 for 8xlarge cluster node types. The limit includes temporary tables, and excludes views.

•    The maximum number of characters for a table name is 127.

•    The maximum number of columns you can define in a single table is 1,600.

•    The maximum number of SORTKEY columns you can define in a single table is 400.

Creates a new table in the current database. The owner of this table is the issuer of the CREATE TABLE command.

Syntax

CREATE [ [ LOCAL ] { TEMPORARY | TEMP } ] TABLE
[ IF NOT EXISTS ] table_name
( { column_name data_type [ column_attributes ] [ column_constraints ]
| table_constraints
| LIKE parent_table [ { INCLUDING | EXCLUDING } DEFAULTS ] }
[, … ] )
[ BACKUP { YES | NO } ]
[ table_attributes ]

where column_attributes are:
[ DEFAULT default_expr ]
[ IDENTITY ( seed, step ) ]
[ ENCODE encoding ]
[ DISTKEY ]
[ SORTKEY ]

and column_constraints are:
[ { NOT NULL | NULL } ]
[ { UNIQUE | PRIMARY KEY } ]
[ REFERENCES reftable [ ( refcolumn ) ] ]

and table_constraints are:
[ UNIQUE ( column_name [, … ] ) ]
[ PRIMARY KEY ( column_name [, … ] ) ]
[ FOREIGN KEY ( column_name [, … ] ) REFERENCES reftable [ ( refcolumn ) ] ]

and table_attributes are:
[ DISTSTYLE { EVEN | KEY | ALL } ]
[ DISTKEY ( column_name ) ]
[ [ COMPOUND | INTERLEAVED ] SORTKEY ( column_name [, … ] ) ]

CREATE TABLE

TEMPORARY | TEMP – Keyword that creates a temporary table that is visible only within the current session. The table is automatically dropped at the end of the session in which it is created. The temporary table can have the same name as a permanent table.

The temporary table is created in a separate, session specific schema. (You can’t specify a name for this schema.) This temporary schema becomes the first schema in the search path, so the temporary table will take precedence over the permanent table unless you qualify the table name with the schema name to access the permanent table.

By default, users have permission to create temporary tables by their automatic membership in the PUBLIC group.

To deny this privilege to a user, revoke the TEMP privilege from the PUBLIC group, and then explicitly grant the TEMP privilege only to specific users or groups of users.

LIKE parent_table [ { INCLUDING | EXCLUDING } DEFAULTS ] – A clause that specifies an existing table from which the new table automatically copies column names, data types, and NOT NULL constraints. Default expressions for the copied column definitions are copied only if INCLUDING DEFAULTS is specified. Tables created with the LIKE option don't inherit primary and foreign key constraints. Distribution style, sort keys, BACKUP, and NULL properties are inherited by LIKE tables, but you can't explicitly set them in the CREATE TABLE … LIKE statement.

BACKUP { YES | NO } – A clause that specifies whether the table should be included in automated and manual cluster snapshots. For tables, such as staging tables, that won't contain critical data, specify BACKUP NO to save processing time when creating snapshots and restoring from snapshots and to reduce storage space on Amazon Simple Storage Service. The BACKUP NO setting has no effect on automatic replication of data to other nodes within the cluster, so tables with BACKUP NO specified are restored in a node failure. The default is BACKUP YES.
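A minimal sketch of the TEMPORARY, LIKE, and BACKUP clauses described above (stage_event and load_audit are hypothetical names; since BACKUP cannot be set explicitly in a LIKE statement, it is shown on a separate table):
-- session-scoped copy of EVENT's structure
create temp table stage_event (like event including defaults);
-- staging table excluded from cluster snapshots
create table load_audit (filename varchar(256), loaded_at timestamp) backup no;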

CREATE TABLE

PRIMARY KEY, FOREIGN KEY, and UNIQUE CONSTRAINTS are informational only. They are not enforced by the system, but they are used by the planner.

NOT NULL constraints are enforced by Amazon Redshift.

Amazon Redshift assumes that all keys in tables are valid as loaded. If your application allows invalid foreign keys or primary keys, some queries could return incorrect results.

Do not define key constraints for your tables if you doubt their validity.

UNIQUE – Keyword that specifies that the column can contain only unique values. The behavior of the unique table constraint is the same as that for column constraints, with the additional capability to span multiple columns. To define a unique table constraint, use the UNIQUE ( column_name [,… ] ) syntax.

PRIMARY KEY – Keyword that specifies that the column is the primary key for the table. Only one column can be defined as the primary key by using a column definition. To define a table constraint with a multiple column primary key, use the PRIMARY KEY ( column_name [,… ] ) syntax.

FOREIGN KEY ( column_name [,… ] ) REFERENCES reftable [ ( refcolumn ) ] – Constraint that specifies a foreign key constraint, which requires that a group of one or more columns of the new table must only contain values that match values in the referenced column(s) of some row of the referenced table. If refcolumn is omitted, the primary key of reftable is used. The referenced columns must be the columns of a unique or primary key constraint in the referenced table.
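A hedged sketch of the constraints described above (category_dim and event_fact are hypothetical tables; remember that only NOT NULL is enforced, the rest are informational):
create table category_dim (
  catid smallint primary key,
  catname varchar(20) not null);
create table event_fact (
  eventid integer not null,
  catid smallint references category_dim,   -- refcolumn omitted, so the primary key of category_dim is used
  eventname varchar(200),
  primary key (eventid),
  unique (eventname));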

Amazon Redshift System and Catalog Tables

STL tables are generated from logs that have been persisted to disk to provide a history of the system. These files reside on every node in the data warehouse cluster. The STL tables take the information from the logs and format them into usable tables for system administrators.

To manage disk space, the STL log tables only retain approximately two to five days of log history, depending on log usage and available disk space. If you want to retain the log data, you will need to periodically copy it to other tables or unload it to Amazon S3.
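One possible pattern for retaining the log data, sketched on the assumption that a regular table can be created from a SELECT over STL_QUERY (admin_query_history is a hypothetical name):
-- one-off snapshot of the current log contents
create table admin_query_history as select * from stl_query;
-- later runs append only rows newer than what is already saved
insert into admin_query_history
select * from stl_query
where starttime > (select max(starttime) from admin_query_history);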

STV tables are virtual tables that contain snapshots of the current system data. They are based on transient in-memory data and are not persisted to disk-based logs or regular tables.

The system catalogs store schema metadata, such as information about tables and columns. System catalog tables have a PG prefix. The standard PostgreSQL catalog tables are accessible to Amazon Redshift users.

There are two classes of visibility for data in system tables and views: visible to users and visible to superusers.

Only users with superuser privileges can see the data in those tables that are in the superuser-visible category.

STL – System Tables – Network and Authentication

STL_BCAST – Logs information about network activity during execution of query steps that broadcast data. To identify broadcast steps in a query, look for bcast labels in the SVL_QUERY_SUMMARY view or run the EXPLAIN command and then look for step attributes that include bcast.

STL_COMMIT_STATS- Provides metrics related to commit performance, including the timing of the various stages of commit and the number of blocks committed. Query STL_COMMIT_STATS to determine what portion of a transaction was spent on commit and how much queuing is occurring.

STL_CONNECTION_LOG – Logs authentication attempts and connections and disconnections.

STL_DIST – Logs information about network activity during execution of query steps that distribute data. To identify distribution steps in a query, look for dist labels in the QUERY_SUMMARY view or run the EXPLAIN command and then look for step attributes that include dist.

STL_SESSION – Returns information about user session history.

STL_RESTARTED_SESSIONS-To maintain continuous availability following certain internal events, Amazon Redshift might restart an active session with a new process ID (PID). When Amazon Redshift restarts a session, STL_RESTARTED_SESSIONS records the new PID and the old PID.

STL_USERLOG – Records details for the following changes to a database user: Create user, Drop user, Alter user (rename), Alter user (alter properties)

STL – System Tables – Data Transfers

•             STL_FILE_SCAN – Returns the files that Amazon Redshift read while loading data via the COPY command. Querying this table can help troubleshoot data load errors. STL_FILE_SCAN can be particularly helpful with pinpointing issues in parallel data loads, because parallel data loads typically load many files with a single COPY command.

•             STL_LOAD_COMMITS – Returns information to track or troubleshoot a data load. This table records the progress of each data file as it is loaded into a database table.

•             STL_LOAD_ERRORS – Displays the records of all Amazon Redshift load errors (see the query sketch after this list).

•             STL_LOADERROR_DETAIL – Displays a log of data parse errors that occurred while using a COPY command to load tables. To conserve disk space, a maximum of 20 errors per node slice are logged for each load operation.

•             STL_ERROR – Records internal processing errors generated by the Amazon Redshift database engine. STL_ERROR does not record SQL errors or messages. The information in STL_ERROR is useful for troubleshooting certain errors. An AWS support engineer might ask you to provide this information as part of the troubleshooting process.

•             STL_S3CLIENT- Records transfer time and other performance metrics. Use the STL_S3CLIENT table to find the time spent transferring data from Amazon S3 as part of a COPY command.

•             STL_S3CLIENT_ERROR – Records errors encountered by a slice while loading a file from Amazon S3. Use the STL_S3CLIENT_ERROR to find details for errors encountered while transferring data from Amazon S3 as part of a COPY command.

•             STL_REPLACEMENTS – Displays a log that records when invalid UTF-8 characters were replaced by the COPY command with the ACCEPTINVCHARS option. A log entry is added to STL_REPLACEMENTS for each of the first 100 rows on each node slice that required at least one replacement.

•             STL_SSHCLIENT_ERROR – Records all errors seen by the SSH client.

•             STL_UNLOAD_LOG – Records the details for an unload operation. STL_UNLOAD_LOG records one row for each file created by an UNLOAD statement. For example, if an UNLOAD creates 12 files, STL_UNLOAD_LOG will contain 12 corresponding rows.

•             STL_VACUUM – Displays row and block statistics for tables that have been vacuumed. The table shows information specific to when each vacuum operation started and finished, and demonstrates the benefits of running the operation
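As referenced above for STL_LOAD_ERRORS, a hedged query sketch for inspecting the most recent load failures:
select query, filename, line_number, colname, err_reason
from stl_load_errors
order by query desc
limit 10;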

STL – System Tables – Transactions and Workload

•             STL_DDLTEXT – Captures CREATE / DROP / ALTER DDL statements on SCHEMA / TABLE / VIEW objects.

•             STL_QUERYTEXT- Captures the query text for SQL commands. Query the STL_QUERYTEXT table to capture the SQL that was logged for the following statements: SELECT, SELECT INTO, INSERT, UPDATE, DELETE, COPY, VACUUM, ANALYZE, CREATE TABLE AS (CTAS)

•             STL_UTILITYTEXT- Captures the text of non-SELECT SQL commands run on the database. Query the STL_UTILITYTEXT table to capture the following subset of SQL statements that were run on the system: ABORT, BEGIN, COMMIT, END, ROLLBACK, CANCEL, COMMENT, CREATE, ALTER, DROP DATABASE, CREATE, ALTER, DROP USER, EXPLAIN, GRANT, REVOKE, LOCK, RESET, SET, SHOW, TRUNCATE

•             STL_WLM_ERROR – Records all Work Load Management related errors as they occur.

•             STL_WLM_RULE_ACTION – Records details about actions resulting from WLM query monitoring rules associated with user-defined queues.

•             STL_WLM_QUERY – Contains a record of each attempted execution of a query in a service class handled by WLM.

•             STL_TR_CONFLICT – Displays information to identify and resolve transaction conflicts with database tables. A transaction conflict occurs when two or more users are querying and modifying data rows from tables such that their transactions cannot be serialized. The transaction that executes a statement that would break serializability is aborted and rolled back. Every time a transaction conflict occurs, Amazon Redshift writes a data row to the STL_TR_CONFLICT system table containing details about the aborted transaction.

•             STL_UNDONE – Displays information about transactions that have been undone.

STL – System Tables – Query execution

STL_EXPLAIN – Displays the EXPLAIN plan for a query that has been submitted for execution.

STL_PLAN_INFO – Use the STL_PLAN_INFO table to look at the EXPLAIN output for a query in terms of a set of rows. This is an alternative way to look at query plans.

STL_QUERY – Returns execution information about a database query. To manage disk space, the STL log tables only retain approximately two to five days of log history, depending on log usage and available disk space.

STL_QUERY_METRICS – Contains metrics information, such as the number of rows processed, CPU usage, input/output, and disk use, for queries that have completed running in user-defined query queues (service classes).

STL_ALERT_EVENT_LOG – Records an alert when the query optimizer identifies conditions that might indicate performance issues. Use it to identify opportunities to improve query performance.

STL_STREAM_SEGS – Lists the relationship between streams and concurrent segments.

STL_PARSE – Analyzes query steps that parse strings into binary values for loading.

STL_SCAN – Analyzes table scan steps for queries. The step number for rows in this table is always 0 because a scan is the first step in a segment.

STL_HASH – Analyzes hash execution steps for queries.

STL_HASHJOIN – Analyzes hash join execution steps for queries.

STL_DELETE – Analyzes delete execution steps for queries.

STL_INSERT – Analyzes insert execution steps for queries.

STL_LIMIT- Analyzes the execution steps that occur when a LIMIT clause is used in a SELECT query.

STL – System Tables – Query execution

STL_MERGE – Analyzes merge execution steps for queries. These steps occur when the results of parallel operations (such as sorts and joins) are merged for subsequent processing.

STL_MERGEJOIN – Analyzes merge join execution steps for queries.

STL_NESTLOOP – Analyzes nested-loop join execution steps for queries.

STL_PROJECT – Contains rows for query steps that are used to evaluate expressions.

STL_SORT- Displays sort execution steps for queries, such as steps that use ORDER BY processing.

STL_UNIQUE – Analyzes execution steps that occur when a DISTINCT function is used in the SELECT list or when duplicates are removed in a UNION or INTERSECT query.

STL_WINDOW – Analyzes query steps that execute window functions.

STL_RETURN – Contains details for return steps in queries. A return step returns the results of queries executed on the compute nodes to the leader node. The leader node then merges the data and returns the results to the requesting client. For queries executed on the leader node, a return step returns results to the client.

STL_SAVE – Contains details for save steps in queries. A save step saves the input stream to a transient table. A transient table is a temporary table that stores intermediate results during query execution.

STL_AGGR – Analyzes aggregate execution steps for queries. These steps occur during execution of aggregate functions and GROUP BY clauses.

STL_ANALYZE – Records details of ANALYZE operations

STV – System Tables – Data Transfers and Structures

STV_STARTUP_RECOVERY_STATE – Records the state of tables that are temporarily locked during cluster restart operations. Amazon Redshift places a temporary lock on tables while they are being processed to resolve stale transactions following a cluster restart.

STV_SLICES – Use this table to view the current mapping of a slice to a node. The information in STV_SLICES is used mainly for investigation purposes.

STV_BLOCKLIST- Contains the number of 1 MB disk blocks that are used by each slice, table, or column in a database. Use aggregate queries with STV_BLOCKLIST, to determine the number of 1 MB disk blocks allocated per database, table, slice, or column.

STV_LOCKS – Use this table to view any current updates on tables in the database. Amazon Redshift locks tables to prevent two users from updating the same table at the same time.

STV_LOAD_STATE – Use this table to find information about current state of ongoing COPY statements. The COPY command updates this table after every million records are loaded.

STV_ACTIVE_CURSORS – Displays details for currently open cursors.

STV_CURSOR_CONFIGURATION – Displays cursor configuration constraints.

STV_TBL_PERM – The STV_TBL_PERM table contains information about the permanent tables in Amazon Redshift, including temporary tables created by a user for the current session. STV_TBL_PERM contains information for all tables in all databases.

STV_TBL_TRANS – Use the STV_TBL_TRANS table to find out information about the transient database tables that are currently in memory. Transient tables are typically temporary row sets that are used as intermediate results while a query runs.

STV – System Tables – Query execution

•             STV_QUERY_METRICS – Contains metrics information, such as the number of rows processed, CPU usage, input/output, and disk use, for active queries running in user-defined query queues (service classes).

•             STV_EXEC_STATE – Use this table to find out information about queries and query steps that are actively running on compute nodes. This information is usually used only to troubleshoot engineering issues.

•             STV_INFLIGHT – Use this table to determine what queries are currently running on the cluster. STV_INFLIGHT does not show leader-node only queries.

•             STV_RECENTS – Use this table to find out information about the currently active and recently run queries against a database.

Amazon Redshift System Views

System views contain a subset of data found in several of the STL and STV system tables.

System views that contain any reference to a transient STV table are called SVV views.

Views containing only references to STL tables are called SVL views.

These views provide quicker and easier access to commonly queried data found in those tables.

SVV and SVL – System Views – System Information

SVV_TABLES – Use SVV_TABLES to view tables in local catalogs.

SVV_TABLE_INFO – Shows summary information for tables in the database. The view filters system tables and shows only user-defined tables. You can use the SVV_TABLE_INFO view to diagnose and address table design issues that can influence query performance, including issues with compression encoding, distribution keys, sort style, data distribution skew, table size, and statistics. The SVV_TABLE_INFO view doesn't return any information for empty tables. The SVV_TABLE_INFO view summarizes information from the STV_BLOCKLIST, STV_PARTITIONS, STV_TBL_PERM, and STV_SLICES system tables and from the PG_DATABASE, PG_ATTRIBUTE, PG_CLASS, PG_NAMESPACE, and PG_TYPE catalog tables.

SVV_COLUMNS – Use SVV_COLUMNS to view catalog information about the columns of local and external tables and views.

SVV_DISKUSAGE – Amazon Redshift creates the SVV_DISKUSAGE system view by joining the STV_TBL_PERM and STV_BLOCKLIST tables. The SVV_DISKUSAGE view contains information about data allocation for the tables in a database. Use aggregate queries with SVV_DISKUSAGE, to determine the number of disk blocks allocated per database, table, slice, or column. Each data block uses 1MB.

SVV_INTERLEAVED_COLUMNS – Use the SVV_INTERLEAVED_COLUMNS view to help determine whether a table that uses interleaved sort keys should be re-indexed using VACUUM REINDEX.

SVV and SVL – System Views – Query execution

SVL_QLOG – The SVL_QLOG view contains a log of all queries run against the database. Amazon Redshift creates the SVL_QLOG view as a readable subset of information from the STL_QUERY table. Use this table to find the query ID for a recently run query or to see how long it took a query to complete.
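A hedged sketch of using SVL_QLOG to find recent query IDs and elapsed times (the elapsed column is reported in microseconds):
select query, starttime, elapsed, substring
from svl_qlog
order by starttime desc
limit 10;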

SVV_QUERY_INFLIGHT – Use the SVV_QUERY_INFLIGHT view to determine what queries are currently running on the database. This view joins STV_INFLIGHT to STL_QUERYTEXT. SVV_QUERY_INFLIGHT does not show leader-node only queries.

SVL_QUERY_QUEUE_INFO – Summarizes details for queries that spent time in a workload management (WLM) query queue or a commit queue. The SVL_QUERY_QUEUE_INFO view filters queries executed by the system and shows only queries executed by a user. The SVL_QUERY_QUEUE_INFO view summarizes information from the STL_QUERY, STL_WLM_QUERY, and STL_COMMIT_STATS system tables.

SVL_QUERY_METRICS – The SVL_QUERY_METRICS view shows the metrics for completed queries. This view is derived from the STL_QUERY_METRICS system table. Use the values in this view as an aid to determine threshold values for defining query monitoring rules.

SVL_QUERY_METRICS_SUMMARY – The SVL_QUERY_METRICS_SUMMARY view shows the maximum values of metrics for completed queries. This view is derived from the STL_QUERY_METRICS system table. Use the values in this view as an aid to determine threshold values for defining query monitoring rules.

SVL_QUERY_REPORT – Amazon Redshift creates the SVL_QUERY_REPORT view from a UNION of a number of Amazon Redshift STL system tables to provide information about executed query steps. This view breaks down the information about executed queries by slice and by step, which can help with troubleshooting node and slice issues in the Amazon Redshift cluster.

SVV_QUERY_STATE – Use SVV_QUERY_STATE to view information about the execution of currently running queries. The SVV_QUERY_STATE view contains a data subset of the STV_EXEC_STATE table.

SVL_QUERY_SUMMARY – Use the SVL_QUERY_SUMMARY view to find general information about the execution of a query. The SVL_QUERY_SUMMARY view contains a subset of data from the SVL_QUERY_REPORT view. Note that the information in SVL_QUERY_SUMMARY is aggregated from all nodes.

SVL_COMPILE – Records compile time and location for each query segment of queries.

SVV and SVL – System Views – Vacuum

•             SVV_VACUUM_PROGRESS – This view returns an estimate of how much time it will take to complete a vacuum operation that is currently in progress.

•             SVV_VACUUM_SUMMARY – The SVV_VACUUM_SUMMARY view joins the STL_VACUUM, STL_QUERY, and STV_TBL_PERM tables to summarize information about vacuum operations logged by the system. The view returns one row per table per vacuum transaction. The view records the elapsed time of the operation, the number of sort partitions created, the number of merge increments required, and deltas in row and block counts before and after the operation was performed.

•             SVL_VACUUM_PERCENTAGE – The SVL_VACUUM_PERCENTAGE view reports the percentage of data blocks allocated to a table after performing a vacuum. This percentage number shows how much disk space was reclaimed.

Query Planning and Execution Workflow

The query planning and execution workflow follows these steps:

  1. The leader node receives the query and parses the SQL.
  2. The parser produces an initial query tree that is a logical representation of the original query. Amazon Redshift then inputs this query tree into the query optimizer.
  3. The optimizer evaluates and, if necessary, rewrites the query to maximize its efficiency. This process sometimes results in creating multiple related queries to replace a single one.
  4. The optimizer generates a query plan (or several, if the previous step resulted in multiple queries) for the execution with the best performance. The query plan specifies execution options such as join types, join order, aggregation options, and data distribution requirements.
  5. The execution engine translates the query plan into steps, segments, and streams:
  Step: Each step is an individual operation needed during query execution. Steps can be combined to allow compute nodes to perform a query, join, or other database operation.
  Segment: A combination of several steps that can be done by a single process, also the smallest compilation unit executable by a compute node slice. A slice is the unit of parallel processing in Amazon Redshift. The segments in a stream run in parallel.
  Stream: A collection of segments to be parceled out over the available compute node slices.
  The execution engine generates compiled C++ code based on steps, segments, and streams. Compiled code executes faster than interpreted code and uses less compute capacity. This compiled code is then broadcast to the compute nodes.
  6. The compute node slices execute the query segments in parallel. As part of this process, Amazon Redshift takes advantage of optimized network communication, memory, and disk management to pass intermediate results from one query plan step to the next, which also helps to speed query execution.

Steps 5 and 6 happen once for each stream. The engine creates the executable segments for one stream and sends them to the compute nodes. When the segments of that stream are complete, the engine generates the segments for the next stream. In this way, the engine can analyze what happened in the prior stream (for example, whether operations were disk-based) to influence the generation of segments in the next stream.

When the compute nodes are done, they return the query results to the leader node for final processing. The leader node merges the data into a single result set and addresses any needed sorting or aggregation. The leader node then returns the results to the client.

Note: The compute nodes might return some data to the leader node during query execution if necessary. For example, if you have a subquery with a LIMIT clause, the limit is applied on the leader node before data is redistributed across the cluster for further processing.

Factors Affecting Query Performance

Number of nodes, processors, or slices – A compute node is partitioned into slices. More nodes means more processors and more slices, which enables your queries to process faster by running portions of the query concurrently across the slices. However, more nodes also means greater expense, so you will need to find the balance of cost and performance that is appropriate for your system.

Node types – An Amazon Redshift cluster can use either dense storage or dense compute nodes. The dense storage node types are recommended for substantial data storage needs, while dense compute node types are optimized for performance-intensive workloads. Each node type offers different sizes and limits to help you scale your cluster appropriately. The node size determines the storage capacity, memory, CPU, and price of each node in the cluster.

Data distribution – Amazon Redshift stores table data on the compute nodes according to a table’s distribution style. When you execute a query, the query optimizer redistributes the data to the compute nodes as needed to perform any joins and aggregations. Choosing the right distribution style for a table helps minimize the impact of the redistribution step by locating the data where it needs to be before the joins are performed.

Data sort order – Amazon Redshift stores table data on disk in sorted order according to a table’s sort keys. The query optimizer and the query processor use the information about where the data is located to reduce the number of blocks that need to be scanned and thereby improve query speed.

Dataset size – A higher volume of data in the cluster can slow query performance, because more rows need to be scanned and redistributed. You can mitigate this effect by regular vacuuming and archiving of data, and by using a predicate to restrict the query dataset.

Concurrent operations – Running multiple operations at once can affect query performance. Each operation takes one or more slots in an available query queue and uses the memory associated with those slots. If other operations are running, enough query queue slots might not be available. In this case, the query will have to wait for slots to open before it can begin processing.

Query structure – How your query is written will affect its performance. As much as possible, write queries to process and return as little data as will meet your needs.

Query Performance Factors

Code compilation – Amazon Redshift generates and compiles code for each query execution plan. The compiled code segments are stored in a least recently used (LRU) cache and shared across sessions in a cluster. Thus, subsequent executions of the same query, even in different sessions and often even with different query parameters, will run faster because they can skip the initial generation and compilation steps.

The LRU cache persists through cluster reboots, but is wiped by maintenance upgrades. The compiled code executes faster because it eliminates the overhead of using an interpreter. You always have some overhead cost the first time code is generated and compiled. As a result, the performance of a query the first time you run it can be misleading. The overhead cost might be especially noticeable when you run one-off (ad hoc) queries. You should always run a query a second time to determine its typical performance.

Similarly, be careful about comparing the performance of the same query sent from different clients. The execution engine generates different code for the JDBC connection protocols and ODBC and psql (libpq) connection protocols. If two clients use different protocols, each client will incur the first-time cost of generating compiled code, even for the same query. Other clients that use the same protocol, however, will benefit from sharing the cached code. A client that uses ODBC and a client running psql with libpq can share the same compiled code.

Query Diagnosis Process

Run the EXPLAIN command to get a query plan. To analyze the data provided by the query plan, follow these steps:

  1. Identify the steps with the highest cost. Concentrate on optimizing those when proceeding through the remaining steps.
  2. Look at the join types:
  • Nested Loop: Such joins usually occur because a join condition was omitted.
  • Hash and Hash Join: Hash joins are used when joining tables where the join columns are not distribution keys and also not sort keys.
  • Merge Join: No change is needed.
  3. Notice which table is used for the inner join, and which for the outer join. The query engine generally chooses the smaller table for the inner join, and the larger table for the outer join. If such a choice doesn't occur, your statistics are likely out of date.
  4. See if there are any high-cost sort operations.
  5. Look for the following broadcast operators where there are high-cost operations:
  • DS_BCAST_INNER: Indicates the table is broadcast to all the compute nodes, which is fine for a small table but not ideal for a larger table.
  • DS_DIST_ALL_INNER: Indicates that all of the workload is on a single slice.
  • DS_DIST_BOTH: Indicates heavy redistribution.
Common Issues Affecting Query Performance
Issue: Table Statistics Missing or Out of Date
How to confirm the issue: A warning message in EXPLAIN command results; a missing statistics alert event in STL_ALERT_EVENT_LOG.
How to fix the issue: Run ANALYZE.

Issue: Nested Loop
How to confirm the issue: A nested loop alert event in STL_ALERT_EVENT_LOG.
How to fix the issue: Review your query for cross-joins and remove them if possible.

Issue: Hash Join
How to confirm the issue: Hash and hash join operations in the query plan; an HJOIN step in the segment with the highest maxtime value in SVL_QUERY_SUMMARY.
How to fix the issue: Rewrite the query to use a merge join if possible. You can do this by specifying join columns that are both distribution keys and sort keys.

Issue: Ghost Rows or Uncommitted Rows
How to confirm the issue: An alert event in STL_ALERT_EVENT_LOG that indicates excessive ghost rows.
How to fix the issue: If there are no active load operations, run VACUUM on the query tables to remove deleted rows.

Issue: Unsorted or Missorted Rows
How to confirm the issue: A very selective filter alert event in STL_ALERT_EVENT_LOG.
How to fix the issue: Run VACUUM on the query tables to re-sort the rows. Review SORT KEYS on query tables.

Issue: Suboptimal Data Distribution
How to confirm the issue: A serial execution, large broadcast, or large distribution alert event appears in STL_ALERT_EVENT_LOG. Slices are not processing approximately the same number of rows for a given step. Slices are not taking approximately the same amount of time for a given step. Verify if any of the tables in your query have data skew.
How to fix the issue: Take another look at the distribution styles for the tables in the query and see if any improvements can be made.

Issue: Insufficient Memory Allocated to the Query
How to confirm the issue: You might see a step in SVL_QUERY_SUMMARY that has an is_diskbased value of true.
How to fix the issue: Allocate more memory to the query by temporarily increasing the number of query slots it uses. Workload Management (WLM) reserves slots in a query queue equivalent to the concurrency level set for the queue (see the sketch after this table).
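A minimal sketch of the slot-count workaround from the last row above, using the session-level wlm_query_slot_count setting:
set wlm_query_slot_count to 3;
-- run the memory-intensive query here
reset wlm_query_slot_count;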
Common Issues Affecting Query Performance
IssueHow to confirm the issueHow to fix the issue
Suboptimal WHERE clauseYou might see a SCAN step in the segment with the highest maxtime value in SVL_QUERY_SUMMARYAdd a WHERE clause to the query based on the primary sort column of the largest table. This approach will help minimize scanning time.
Insufficiently Restrictive Predicate A SCAN step in the segment with the highest maxtime value     Try adding a predicate to the query or making the existing   in SVL_QUERY_SUMMARY that has a very high rows value predicate more restrictive to narrow the output.   compared to the rows value in the final RETURN step in the query.
Very Large Result SetQuery the S V L_QU E R Y_S U M M A RY view.Consider rewriting the query to use UNLOAD to write the results to Amazon S3. This approach will improve the performance of the RETURN step by taking advantage of parallel processing.
Large SELECT list      You might see a bytes value that is high relative to the rows value for any step (in comparison to other steps) in SVL_QUERY_SUMMARY. This high bytes value can be an indicator that you are selecting a lot of columns.Review the columns you are selecting and see if any can be removed.
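
Several of the checks above rely on SVL_QUERY_SUMMARY. A minimal sketch of inspecting the per-step summary for one query, assuming a hypothetical query ID of 12345:

select query, seg, step, label, maxtime, rows, bytes, is_diskbased
from svl_query_summary
where query = 12345   -- hypothetical query ID, found in SVL_QLOG or STL_QUERY
order by maxtime desc;
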
PARAMETER GROUPS

In Amazon Redshift, you associate a parameter group with each cluster that you create. The parameter group is a group of parameters that apply to all of the databases that you create in the cluster. These parameters configure database settings such as query timeout and date style.

Each parameter group has several parameters to configure settings for the database. The list of available parameters depends on the parameter group family to which the parameter group belongs. The parameter group family is the version of the Amazon Redshift engine to which the parameters in the parameter group apply.

Amazon Redshift provides one default parameter group for each parameter group family. The default parameter group has preset values for each of its parameters, and it cannot be modified. The format of the default parameter group name is default.parameter_group_family, where parameter_group_family is the version of the engine to which the parameter group belongs. For example, default.redshift-1.0.

If you want to use different parameter values than the default parameter group, you must create a custom parameter group and then associate your cluster with it.

For parameter groups that you create, you can modify a parameter value at any time, or you can reset all parameter values to their defaults. You can also associate a different parameter group with a cluster. If you modify parameter values in a parameter group that is already associated with a cluster or you associate a different parameter group with the cluster, you might need to restart the cluster for the updated parameter values to take effect.

WORKLOAD MANAGEMENT (WLM)

You can improve system performance and your users’ experience by modifying your WLM configuration to create separate queues for the long-running queries and the short-running queries.

When users run queries in Amazon Redshift, the queries are routed to query queues.

At run time, you can route queries to these queues according to user groups or query groups.

In Amazon Redshift, you use workload management (WLM) to define the number of query queues that are available, and how queries are routed to those queues for processing.

WLM is part of parameter group configuration. A cluster uses the WLM configuration that is specified in its associated parameter group.

When you create a parameter group, the default WLM configuration contains one queue that can run up to five queries concurrently. You can add additional queues and configure WLM properties in each of them if you want more control over query processing.

Each queue that you add has the same default WLM configuration until you configure its properties.

When you add additional queues, the last queue in the configuration is the default queue. Unless a query is routed to another queue based on criteria in the WLM configuration, it is processed by the default queue.

You cannot specify user groups or query groups for the default queue.

Workload Management Configuration Queues

By default, Amazon Redshift configures the following query queues:

One superuser queue:

  • The superuser queue is reserved for superusers only and it can’t be configured. You should only use this queue when you need to run queries that affect the system or for troubleshooting purposes.
  • The queue doesn’t appear in the console, but it does appear in the system tables in the database as the fifth queue. To run a query in the superuser queue, a user must be logged in as a superuser and must run the query using the predefined superuser query group.

One default user queue:

  • The default queue is initially configured to run five queries concurrently. You can change the concurrency, timeout, and memory allocation properties for the default queue, but you cannot specify user groups or query groups. The default queue must be the last queue in the WLM configuration. Any queries that are not routed to other queues run in the default queue.
  • The WLM configuration is an editable parameter (wlm_json_configuration) in a parameter group, which can be associated with one or more clusters.

Additional queues:

  • You can set up rules to route queries to particular queues based on the user running the query or labels that you specify. You can also configure the amount of memory allocated to each queue so that large queries run in queues with more memory than other queues.
  • Each query queue contains a number of query slots. Each queue is allocated a portion of the cluster’s available memory. A queue’s memory is divided among the queue’s query slots. You can also use the wlm_query_slot_count parameter, which is separate from the WLM properties, to temporarily enable queries to use more memory by allocating multiple slots (see the example after this list).
  • You can add additional query queues to the default WLM configuration, up to a total of eight user queues. You can configure the following for each query queue:
  • Concurrency level
  • User groups
  • Query groups
  • WLM memory percent to use
  • WLM timeout
  • WLM query queue hopping
  • Query monitoring rules
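
As referenced above, a minimal sketch of temporarily claiming more query slots for a heavy operation, assuming a hypothetical sales table:

set wlm_query_slot_count to 3;  -- claim three slots (and their memory) in the current queue
vacuum sales;                   -- hypothetical large table that benefits from extra memory
set wlm_query_slot_count to 1;  -- return to the default of one slot per query
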
Workload Management Configuration Dynamic Properties

To configure WLM, you modify the wlm_json_configuration parameter. The value is formatted in JavaScript Object Notation (JSON).

Enable short query acceleration – Short query acceleration (SQA) prioritizes selected short-running queries ahead of longer-running queries. SQA executes short-running queries in a dedicated space so that SQA queries aren’t forced to wait in queues behind longer queries. With SQA, short-running queries begin executing more quickly and users see results sooner. To enable SQA, specify true; the default is false. When you enable SQA, you can also specify a maximum run time for short queries between 1 and 20 seconds. The default value of 5 seconds works well for most use cases.
JSON property: short_query_queue
Maximum run time for short queries – When you enable SQA, you can specify a maximum run time for short queries between 1 and 20 seconds. In the JSON configuration, the value is expressed in milliseconds; the default value is 5000.
JSON property: max_execution_time
Concurrency – The number of queries that can run concurrently in a queue. When a queue reaches the concurrency level, any subsequent queries wait in the queue until resources are available to process them. The range is between 1 and 50.
JSON property: query_concurrency
Timeout (ms) – The maximum time, in milliseconds, queries can run before being canceled. If a read-only query, such as a SELECT statement, is canceled due to a WLM timeout, WLM attempts to route the query to the next matching queue based on the WLM Queue Assignment Rules. If the query doesn’t match any other queue definition, the query is canceled; it is not assigned to the default queue.
JSON property: max_execution_time
Memory (%) – The percentage of memory to allocate to the queue. If you specify a memory percentage for at least one of the queues, you must specify a percentage for all of the other queues, up to a total of 100 percent. If your memory allocation is below 100 percent across all of the queues, the unallocated memory is managed by the service and can be temporarily given to a queue that requests additional memory for processing.

The memory that is allocated to each queue is divided among the query slots in that queue. The amount of memory available to a query is the memory allocated to the query slot in which the query is running, regardless of the number of queries that are actually running concurrently. If a specific query needs more memory than is allocated to a single query slot, you can increase the available memory by increasing the wlm_query_slot_count parameter.

JSON property: memory_percent_to_use

Workload Management Configuration Static Properties

Query Groups – A comma-separated list of query groups. When members of the query group run queries in the database, their queries are routed to the queue that is associated with their query group.

JSON property: query_group

Query Group Wildcard – A Boolean value that indicates whether to enable wildcards for query groups. If this is 0, wildcards are disabled; if this is 1, wildcards are enabled. When wildcards are enabled, you can use “*” or “?” to specify multiple query groups when running queries.

JSON property: query_group_wild_card

User Groups – A comma-separated list of user group names. When members of the user group run queries in the database, their queries are routed to the queue that is associated with their user group.

JSON property: user_group

User Group Wildcard – A Boolean value that indicates whether to enable wildcards for user groups. If this is 0, wildcards are disabled; if this is 1, wildcards are enabled. When wildcards are enabled, you can use “*” or “?” to specify multiple user groups when running queries.

JSON property: user_group_wild_card
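
A minimal sketch of routing a session’s queries to a queue by query group, assuming a hypothetical ‘dashboard’ label configured in one queue’s query_group list:

set query_group to 'dashboard';   -- subsequent queries route to the matching queue
select count(*) from sales;       -- hypothetical query
reset query_group;                -- return to default routing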

Workload Management Configuration Query Monitoring Rules

Query Monitoring Rules – You can use WLM query monitoring rules to continuously monitor your WLM queues for queries based on criteria, or predicates, that you specify. For example, you might monitor queries that tend to consume excessive system resources, and then initiate a specified action when a query exceeds your specified performance boundaries. You associate a query monitoring rule with a specific query queue. You can have up to eight rules per queue, and the total limit for all queues is eight rules.
WLM evaluates metrics every 10 seconds. If more than one rule is triggered during the same period, WLM initiates the most severe action—abort, then hop, then log. If the action is hop or abort, the action is logged and the query is evicted from the queue. If the action is log, the query continues to run in the queue. WLM initiates only one log action per query per rule. If the queue contains other rules, those rules remain in effect. If the action is hop and the query is routed to another queue, the rules for the new queue apply.
JSON property: rules

"rules": [
  {
    "rule_name": "rule_1",
    "predicate": [
      {
        "metric_name": "query_cpu_time",
        "operator": ">",
        "value": 100000
      },
      {
        "metric_name": "query_blocks_read",
        "operator": ">",
        "value": 1000
      }
    ],
    "action": "hop"
  }
]
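
When a rule’s log, hop, or abort action fires, it is recorded in the STL_WLM_RULE_ACTION system table. A minimal sketch of reviewing recent rule actions:

select query, service_class, rule, action, recordtime
from stl_wlm_rule_action
order by recordtime desc
limit 20;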

REDSHIFT LIMITS
Resource – Default Limit
Nodes per cluster – 101
Nodes – 200
Reserved Nodes – 200
Snapshots – 20
Parameter Groups – 20
Security Groups – 20
Subnet Groups – 20
Subnets per Subnet Group – 20
Event Subscriptions – 20

Amazon Redshift has quotas that limit the total number of nodes that you can provision, and the number of snapshots that you can create; these quotas are per AWS account per region.

Amazon Redshift has a default quota for each of these. If you attempt to exceed any of these quotas, the attempt will fail.

To increase these Amazon Redshift quota limits for your account in a region, request a change by submitting an Amazon Redshift Limit Increase Form.

Max Total tables – 9,900 for large and xlarge cluster node types and 20,000 for 8xlarge cluster node types.

Max Total user-defined databases – 60 per cluster

Max Total schemas – 9900 per cluster

Max Total concurrent user connections – 500 per cluster

Max Concurrency level in workload management – 50 for all user-defined queues

Max AWS Accounts authorized to restore a snapshot – 20 per snapshot

Max size of a single row in a COPY command – 4 MB

Max IAM roles associated with a cluster – 10 roles

Enhanced VPC Routing
  • When you use Amazon Redshift Enhanced VPC Routing, Amazon Redshift forces all COPY and UNLOAD traffic between your cluster and your data repositories through your Amazon VPC. If Enhanced VPC Routing is not enabled, Amazon Redshift routes traffic through the Internet, including traffic to other services within the AWS network.
  • When you execute a COPY or UNLOAD command on a cluster that has Enhanced VPC Routing enabled, your VPC routes the traffic to the specified resource using the strictest, or most specific, network path available.
  • You can use a VPC endpoint to create a managed connection between your Amazon Redshift cluster in a VPC and Amazon Simple Storage Service (Amazon S3). When you do, COPY and UNLOAD traffic between your cluster and your data on Amazon S3 stays in your Amazon VPC.

You can attach an endpoint policy to your endpoint to more closely manage access to your data. For example, you can add a policy to your VPC endpoint that permits unloading data only to a specific Amazon S3 bucket in your account.

  • A VPC endpoint uses route tables to control the routing of traffic between a cluster in the VPC and Amazon S3. All clusters in subnets associated with the specified route tables automatically use that endpoint to access the service.
  • Your VPC uses the most specific, or most restrictive, route that matches your cluster’s traffic to determine how to route the traffic. For example, if you have a route in your route table for all Internet traffic ( 0.0.0.0/0 ) that points to an Internet gateway and an Amazon S3 endpoint, the endpoint route takes precedence for all traffic destined for Amazon S3, because the IP address range for the Amazon S3 service is more specific than 0.0.0.0/0. In this example, all other Internet traffic goes to your Internet gateway, including traffic that’s destined for Amazon S3 buckets in other regions.
  • You use endpoint policies to control access from your cluster to the Amazon S3 buckets that hold your data files. By default, the Create Endpoint wizard attaches an endpoint policy that doesn’t further restrict access from any user or service within the VPC. For more specific control, you can optionally attach a custom endpoint policy.

Enhanced VPC Routing

To work with Enhanced VPC Routing, your cluster must meet the following requirements and constraints:

  • Your cluster must be in a VPC.
  • You must enable Domain Name Service (DNS) resolution in your VPC, or if you’re using your own DNS server, ensure that DNS requests to Amazon S3 are resolved correctly to the IP addresses maintained by AWS.
  • DNS hostnames must be enabled in your VPC. DNS hostnames are enabled by default.
  • Your VPC endpoint policies must allow access to any Amazon S3 buckets used with COPY, UNLOAD, or CREATE LIBRARY calls in Amazon Redshift, including access to any manifest files involved.
Tagging Resources in Amazon Redshift

In AWS, tags are user-defined labels that consist of key-value pairs. Amazon Redshift supports tagging to provide metadata about resources at a glance, and to categorize your billing reports based on cost allocation.

Tags are retained for resources after you resize a cluster, and after you restore a snapshot of a cluster within the same region. However, tags are not retained if you copy a snapshot to another region, so you must recreate the tags in the new region. If you delete a resource, any associated tags are deleted.

Each resource has one tag set, which is a collection of one or more tags assigned to the resource. Each resource can have up to 50 tags per tag set. You can add tags when you create a resource and after a resource has been created. You can add tags to the following resource types in Amazon Redshift:

  • CIDR/IP
  • Cluster
  • Cluster security group
  • Cluster security group ingress rule
  • EC2 security group
  • HSM connection
  • HSM client certificate
  • Parameter group
  • Snapshot
  • Subnet group
Amazon Redshift Auditing and Logging

Amazon Redshift logs information in the following log files:

  • Connection log — logs authentication attempts, and connections and disconnections.
  • User log — logs information about changes to database user definitions.
  • User activity log — logs each query before it is run on the database.

The connection and user logs are useful primarily for security purposes. You can use the connection log to monitor information about the users who are connecting to the database and the related connection information, such as their IP address, when they made the request, what type of authentication they used, and so on. You can use the user log to monitor changes to the definitions of database users.

The user activity log is useful primarily for troubleshooting purposes. It tracks information about the types of queries that both the users and the system perform in the database.

The connection log and user log both correspond to information that is stored in the system tables in your database. You can use the system tables to obtain the same information, but the log files provide an easier mechanism for retrieval and review.

The log files rely on Amazon S3 permissions rather than database permissions to perform queries against the tables. Additionally, by viewing the information in log files rather than querying the system tables, you reduce any impact of interacting with the database.
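
For example, a minimal sketch of pulling the same connection information directly from the system table, assuming you are connected as a superuser:

select recordtime, username, dbname, remotehost, event
from stl_connection_log
order by recordtime desc
limit 20;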

  • The connection log, user log, and user activity log are enabled together by using the AWS Management Console.
  • For the user activity log, you must also enable the enable_user_activity_logging database parameter.
  • If you enable only the audit logging feature, but not the associated parameter, the database audit logs will log information for only the connection log and user log, but not for the user activity log.
  • The enable_user_activity_logging parameter is disabled (false) by default, but you can set it to true to enable the user activity log.

Amazon Redshift Backup

Snapshots are point-in-time backups of a cluster. There are two types of snapshots: automated and manual. Amazon Redshift stores these snapshots internally in Amazon S3 by using an encrypted Secure Sockets Layer (SSL) connection. If you need to restore from a snapshot, Amazon Redshift creates a new cluster and imports data from the snapshot that you specify.

When you restore from a snapshot, Amazon Redshift creates a new cluster and makes the new cluster available before all of the data is loaded, so you can begin querying the new cluster immediately. The cluster streams data on demand from the snapshot in response to active queries, then loads the remaining data in the background.

Amazon Redshift periodically takes incremental snapshots that track changes to the cluster since the previous snapshot. Amazon Redshift retains all of the data required to restore a cluster from a snapshot.

Amazon Redshift provides free storage for snapshots that is equal to the storage capacity of your cluster until you delete the cluster. After you reach the free snapshot storage limit, you are charged for any additional storage at the normal rate.

Because of this, you should evaluate how many days you need to keep automated snapshots and configure their retention period accordingly, and delete any manual snapshots that you no longer need.

Amazon Redshift Automated Snapshots

When automated snapshots are enabled for a cluster, Amazon Redshift periodically takes snapshots of that cluster, usually every eight hours or following every 5 GB per node of data changes, whichever comes first.

Automated snapshots are enabled by default when you create a cluster. These snapshots are deleted at the end of a retention period. The default retention period is one day, but you can modify it by using the Amazon Redshift console or programmatically by using the Amazon Redshift API.

To disable automated snapshots, set the retention period to zero. If you disable automated snapshots, Amazon Redshift stops taking snapshots and deletes any existing automated snapshots for the cluster. Only Amazon Redshift can delete an automated snapshot; you cannot delete them manually.

Amazon Redshift deletes automated snapshots at the end of a snapshot’s retention period, when you disable automated snapshots, or when you delete the cluster. Amazon Redshift retains the latest automated snapshot until you disable automated snapshots or delete the cluster.

If you want to keep an automated snapshot for a longer period, you can create a copy of it as a manual snapshot. The automated snapshot is retained until the end of its retention period, but the corresponding manual snapshot is retained until you manually delete it.

Amazon Redshift Manual Snapshots

Regardless of whether you enable automated snapshots, you can take a manual snapshot whenever you want.

Amazon Redshift will never automatically delete a manual snapshot.

Manual snapshots are retained even after you delete your cluster.

Because manual snapshots accrue storage charges, it’s important that you manually delete them if you no longer need them.

If you delete a manual snapshot, you cannot start any new operations that reference that snapshot. However, if a restore operation is in progress, that restore operation will run to completion.

Amazon Redshift has a quota that limits the total number of manual snapshots that you can create

• By default, all user-defined permanent tables are included in snapshots.

  • If a table, such as a staging table, doesn’t need to be backed up, you can exclude it from snapshots by creating it as a no-backup table. This significantly reduces the time needed to create snapshots and restore from snapshots, and it also reduces storage space on Amazon S3.
  • To create a no-backup table, include the BACKUP NO parameter when you create the table.

CREATE [ [ LOCAL ] { TEMPORARY | TEMP } ] TABLE
[ IF NOT EXISTS ] table_name
( { column_name data_type [ column_attributes ] [ column_constraints ]
  | table_constraints
  | LIKE parent_table [ { INCLUDING | EXCLUDING } DEFAULTS ] }
  [, ... ] )
[ BACKUP { YES | NO } ]
[ table_attribute ]
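
For example, a minimal sketch of a hypothetical staging table that is excluded from snapshots:

create table stage_clicks (
    click_id   bigint,
    user_id    bigint,
    clicked_at timestamp
)
backup no;   -- this table is skipped by automated and manual snapshots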

Amazon Redshift Disaster Recovery

You can configure Amazon Redshift to automatically copy snapshots (automated or manual) for a cluster to another region.

When a snapshot is created in the cluster’s primary region, it will be copied to a secondary region; these are known respectively as the source region and destination region.

By storing a copy of your snapshots in another region, you have the ability to restore your cluster from recent data if anything affects the primary region. You can configure your cluster to copy snapshots to only one destination region at a time

When you enable Amazon Redshift to automatically copy snapshots to another region, you specify the destination region where you want snapshots to be copied. In the case of automated snapshots, you can also specify the retention period that they should be kept in the destination region.

After an automated snapshot is copied to the destination region and it reaches its retention period there, it is deleted from the destination region, keeping your snapshot usage low. You can change this retention period if you need to keep the automated snapshots for a shorter or longer period of time in the destination region.

The retention period that you set for automated snapshots that are copied to the destination region is separate from the retention period for automated snapshots in the source region. The default retention period for copied snapshots is seven days. That seven-day period only applies to automated snapshots.

You can disable automatic snapshot copy for a cluster at any time. When you disable this feature, snapshots are no longer copied from the source region to the destination region.

If you want to change the destination region that you copy snapshots to, you have to first disable the automatic copy feature and then re-enable it, specifying the new destination region.

Copying snapshots across regions incurs data transfer charges. Once a snapshot is copied to the destination region, it becomes active and available for restoration purposes.

Amazon Redshift Full Restore

A snapshot contains data from any databases that are running on your cluster, and also information about your cluster, including the number of nodes, node type, and master user name.
If you need to restore your cluster from a snapshot, Amazon Redshift uses the cluster information to create a new cluster and then restores all the databases from the snapshot data.
The new cluster that Amazon Redshift creates from the snapshot will have the same configuration, including the number and type of nodes, as the original cluster from which the snapshot was taken.
The cluster is restored in the same region and a random, system-chosen Availability Zone, unless you specify another Availability Zone in your request.
You cannot use a snapshot to revert an active cluster to a previous state.
When you restore a snapshot into a new cluster, the default security group and parameter group are used unless you specify different values.

Amazon Redshift Table Restore

You can restore a single table from a snapshot instead of restoring an entire cluster. When you restore a single table from a snapshot, you specify the source snapshot, database, schema, and table name, and the target cluster, schema, and a new table name for the restored table.
• The new table name cannot be the name of an existing table. To replace an existing table with a restored table from a snapshot, rename or drop the existing table before you restore the table from the snapshot.

The target table is created using the source table’s column definitions, table attributes, and column attributes except for foreign keys.
If the owner of the source table exists, then that user is the owner of the restored table, provided that the user has sufficient permissions to become the owner of a relation in the specified database and schema. Otherwise, the restored table is owned by the master user that was created when the cluster was launched.
The restored table returns to the state it was in at the time the backup was taken.

Restoring a table from a snapshot has the following limitations:
  • You can restore a table only to the current, active running cluster and from a snapshot that was taken of that cluster.
  • You can restore only one table at a time.
  • You cannot restore a table from a cluster snapshot that was taken prior to a cluster being resized.

Comparing Reserved Node Offerings

• No Upfront – Payment schedule: monthly installments for the duration of the reservation; no upfront payment. Comparative savings: about a 20 percent discount over on-demand rates. Duration: one-year term.
• Partial Upfront – Payment schedule: partial upfront payment, and monthly installments for the duration of the reservation. Comparative savings: up to 41 percent to 73 percent discount depending on duration. Duration: one-year or three-year term.
• All Upfront – Payment schedule: full upfront payment for the reservation; no monthly charges. Comparative savings: up to 42 percent to 76 percent discount depending on duration. Duration: one-year or three-year term.
Database Encryption for Amazon Redshift

In Amazon Redshift, you can enable database encryption for your clusters to help protect data at rest.

When you enable encryption for a cluster, the data blocks and system metadata are encrypted for the cluster and its snapshots.

Encryption is an optional, immutable setting of a cluster. If you want encryption, you enable it during the cluster launch process. To go from an unencrypted cluster to an encrypted cluster or the other way around, unload your data from the existing cluster and reload it in a new cluster with the chosen encryption setting.

Amazon Redshift uses a hierarchy of encryption keys to encrypt the database. You can use either AWS Key Management Service (AWS KMS) or a hardware security module (HSM) to manage the top level encryption keys in this hierarchy. The process that Amazon Redshift uses for encryption differs depending on how you manage keys.

When you launch your cluster, Amazon Redshift returns a list of the customer master keys (CMKs) that your AWS account has created or has permission to use in AWS KMS. You select a CMK to use as your master key in the encryption hierarchy.

When you choose AWS KMS for key management with Amazon Redshift, there is a four-tier hierarchy of encryption keys. These keys, in hierarchical order, are the master key, a cluster encryption key (CEK), a database encryption key (DEK), and data encryption keys.

By default, Amazon Redshift selects your default key as the master key. Your default key is an AWS managed key that is created for your AWS account to use in Amazon Redshift. AWS KMS creates this key the first time you launch an encrypted cluster in a region and choose the default key.

If you don’t want to use the default key, you must have (or create) a customer-managed CMK separately in AWS KMS before you launch your cluster in Amazon Redshift. Customer-managed CMKs give you more flexibility, including the ability to create, rotate, disable, define access control for, and audit the encryption keys used to help protect your data.

Amazon Redshift automatically integrates with AWS KMS but not with an HSM. When you use an HSM, you must use client and server certificates to configure a trusted connection between Amazon Redshift and your HSM.

HSMs are devices that provide direct control of key generation and management. They provide greater security by separating key management from the application and database layers. Amazon Redshift supports AWS CloudHSM Classic for key management. The encryption process is different when you use HSM to manage your encryption keys instead of AWS KMS.

Amazon Redshift supports only AWS CloudHSM Classic, and not the AWS CloudHSM service. HSM encryption is not supported for DC2 node types.

Amazon Redshift announces Query Editor to run queries directly from the AWS Management Console

You can now query data in your Amazon Redshift cluster directly from your AWS Management Console, using the new Query Editor. This provides an easier way for admins and end users to run SQL queries without having to install and set up an external JDBC/ODBC client. Query results are instantly visible within the console.

With the Query Editor’s Saved Queries feature, you can save your most commonly run queries and have them conveniently available the next time you need them. You can also view the time taken in every step of query execution with View Execution button to isolate bottlenecks and optimize the query. Finally, you can easily export the results of your query with the Download CSV button for further analysis.

Amazon Redshift Makes Short Query Acceleration Self- Optimizing

Amazon Redshift improved Short Query Acceleration (SQA) by automating the maximum time-out setting for short queries. On Nov 20th, 2017, we announced SQA which uses machine learning algorithms to predict the execution time of a query and move short-running queries to an express queue for immediate processing. In the past, SQA needed user-defined ’maximum run time’ (between 1- 20 seconds) to identify short queries. With this update, by setting the maximum short query run time to ’dynamic,’ you can let Redshift automate this setting. In addition, SQA’s improved machine learning algorithm adapts the maximum run time to the changing workload, making it easier to use to minimize short query queuing time and increase throughput.

If you have SQA enabled already, select ‘dynamic’ in the ‘Maximum run time for short queries’ box to enable self-optimization. If you have not enabled SQA, you can enable dynamic SQA in three easy steps. Step one, edit your cluster parameter group by going to your Workload Management (WLM) settings in your console and click on ‘Edit.’ Step two, select the ‘Enable short query acceleration’ checkbox. Step three, ‘Save’ to enable dynamic SQA on your cluster. Alternatively, you can enable dynamic SQA using the AWS command line interface (CLI). To learn more, refer to our documentation.

Dynamic SQA is available with the latest build 1.0.2294 or Build 1.0.2369 in all AWS commercial regions. Please see the AWS Region Table for Amazon Redshift availability.

Amazon Redshift automatically enables short query acceleration

Amazon Redshift now enables short query acceleration by default to speed up execution of short-running queries such as reports, dashboards, and interactive analysis. Short query acceleration uses machine learning to provide higher performance, faster results, and better predictability of query execution times.

We plan to enable short query acceleration (SQA) for all clusters in the coming weeks. The exact timing of the change depends on the AWS Region that your cluster is in and your cluster’s maintenance window. You can opt out of this change by deselecting the ‘Enable short query acceleration’ check box in the Workload Management configuration on your Amazon Redshift console. If you do so, SQA won’t be enabled when we update your cluster to the latest build. Also we won’t enable SQA if you have tried it in the past, and switched it off. To learn more, see our documentation.

We strongly recommend having SQA enabled, with Maximum run time set to ‘dynamic.’ With SQA, short-running queries begin executing more quickly, and users see results sooner. The acceleration varies based on workload, though we have observed 3x improvements in short query performance for internal workloads.

You can still choose to immediately turn SQA on or off by selecting or deselecting the ‘Enable short query acceleration’ checkbox in the Workload Management configuration on your Amazon Redshift console. Please do reach out to us if you have feedback regarding Short Query Acceleration via sqa-feedback@amazon.com. Refer to the AWS Region Table for Amazon Redshift availability.

Amazon Redshift Can Now COPY from Parquet and ORC File Formats

You can now COPY Apache Parquet and Apache ORC file formats from Amazon S3 to your Amazon Redshift cluster. Apache Parquet and ORC are columnar data formats that allow users to store their data more efficiently and cost-effectively. With this update, Redshift now supports COPY from six file formats: AVRO, CSV, JSON, Parquet, ORC and TXT.

The nomenclature for copying Parquet or ORC is the same as for the existing COPY command. For example, to load the Parquet files inside the “parquet” folder at the Amazon S3 location “s3://mybucket/data/listings/parquet/”, you would use the following command:

COPY listing
FROM 's3://mybucket/data/listings/parquet/'
IAM_ROLE 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
FORMAT AS PARQUET;

All general purpose Amazon S3 storage classes are supported by this new feature, including S3 Standard, S3 Standard-Infrequent Access, and S3 One Zone-Infrequent Access. The current version of the COPY command supports certain parameters, such as FROM, IAM_ROLE, CREDENTIALS, STATUPDATE, and MANIFEST. Succeeding versions will include more COPY parameters. The Amazon Redshift documentation lists the current restrictions on the command.

COPY from Parquet and ORC is available with the latest release <1.0.2294> in the following AWS regions: US East (N. Virginia, Ohio), US West (Oregon, N. California), Canada (Central), South America (Sao Paulo), EU (Frankfurt, Ireland, London), Asia Pacific (Mumbai, Seoul, Singapore, Sydney, Tokyo).

Amazon Redshift now provides customized best practice recommendations with Advisor

Amazon Redshift announces Advisor, a new feature that provides automated recommendations to help you optimize database performance and decrease operating costs. Advisor is available via the Amazon Redshift console at no charge.

Advisor is like a personal database assistant that generates tailored recommendations related to database operations and cluster configuration based on analyzing your cluster’s performance and usage metrics. Currently, Advisor can show up to seven recommendations to help you optimize your cluster. However, it displays only those recommendations that will have a significant impact for your workload. When Advisor determines that a recommendation has been addressed, it will be removed from your recommendation list.

Advisor is accessible via the left-hand navigation menu on your Amazon Redshift console. On the Advisor page, simply select the cluster for which you want to see the recommendations. Do try out the Advisor today and provide feedback for each recommendation by clicking on the ‘Was this useful?’ button.

Read more about Amazon Redshift Advisor in our documentation.

Advisor is available in the following AWS regions: US East (N. Virginia), US West (N. California), US West (Oregon), EU (Frankfurt), EU (Ireland), Asia Pacific (Seoul), Asia Pacific (Singapore), Asia Pacific (Sydney), and Asia Pacific (Tokyo).

Amazon Redshift Advisor Recommendations

Amazon Redshift Advisor offers recommendations about how to optimize your Amazon Redshift cluster to increase performance and save on operating costs. You can find explanations for each recommendation in the console, as described preceding. You can find further details on these recommendations in the following sections.

Topics

Compress Table Data

Compress Amazon S3 File Objects Loaded by COPY

Isolate Multiple Active Databases

Reallocate Workload Management (WLM) Memory

Skip Compression Analysis During COPY

Split Amazon S3 Objects Loaded by COPY

Update Table Statistics

Enable Short Query Acceleration

Replace Single-Column Interleaved Sort Keys

Compress Table Data

Amazon Redshift is optimized to reduce your storage footprint and improve query performance by using compression encodings. When you don’t use compression, data consumes additional space and requires additional disk I/O. Applying compression to large uncompressed columns can have a big impact on your cluster.

Analysis

The compression analysis in Advisor tracks uncompressed storage allocated to permanent user tables. It reviews storage metadata

Amazon Redshift announces new metrics to help optimize cluster performance

You can now get a detailed view of your Amazon Redshift cluster performance with the Workload Execution Breakdown graph on the console Database Performance page or the QueryRuntimeBreakdown metric via Cloudwatch. You can use these new metrics to optimize cluster performance to provide higher throughput and faster results.

Workload Execution Breakdown shows the amount of time queries have spent in the plan, wait, read, and write stages. With this update, the Database Performance page now contains 5 key performance monitoring metrics – Workload Execution Breakdown, Query Throughput, Query Duration, Query Throughput by WLM Queues, and Query Duration by WLM Queues.

To learn how to use these graphs to optimize your cluster settings, see Analyzing Workload Performance in the Cluster Management Guide. These new metrics are now available in the following AWS Regions: US East (N. Virginia), US East (Ohio), US West (N. California), US West (Oregon), Canada (Central), South America (Sao Paulo), EU (Frankfurt), EU (Ireland), EU (London), Asia Pacific (Beijing), Asia Pacific (Mumbai), Asia Pacific (Seoul), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Tokyo), China (Ningxia).

Amazon Redshift announces support for lateral column alias reference

Amazon Redshift now enables you to write queries that refer to a column alias within the same query immediately after it is declared, improving the readability of complex SQL queries.

The support for lateral column alias reference enables you to write queries without repeating the same expressions in the SELECT list. For example, you can define the alias ‘probability’ and use it within the same select statement:

select clicks / impressions as probability, round(100 * probability, 1) as percentage from raw_data;

For more information on how to use lateral column alias reference, refer to our documentation. See the AWS Global Region Table for region availability.

Encrypt your previously unencrypted Amazon Redshift cluster with 1-click

You can now easily encrypt a previously unencrypted Amazon Redshift cluster with an AWS Key Management Service (AWS KMS) encryption key.

When you enable KMS encryption, Amazon Redshift automatically migrates your data to a new, encrypted cluster. The data blocks and system metadata are encrypted on the new cluster, as well as on subsequent snapshots. With encryption, you can protect your sensitive data at rest and be GDPR ready.

You can use one-click encryption only when migrating to a KMS-encrypted cluster. To convert to a cluster using a hardware security module (HSM), you can create a new encrypted cluster and move your data to it.

You can modify your cluster’s encryption using the AWS management console or the AWS CLI. During the migration process, the cluster is available in read-only mode and the cluster status appears as ‘resizing’. To learn more, see Amazon Redshift Database Encryption in the Amazon Redshift Cluster Management Guide.

Amazon Redshift announces Elastic resize: Add and remove nodes in minutes

You can now quickly resize your Amazon Redshift cluster in minutes by adding nodes to get better performance and more storage for demanding workloads or by removing nodes to save cost.

Elastic resize is significantly faster than the Classic resize operation, which Amazon Redshift has had since its inception. Additional nodes are added or removed in minutes with minimal disruption to on-going read and write queries. This enables you to quickly size up for faster performance and size down when the job is complete to save cost. While Elastic resize is the ideal approach for adding or removing nodes, Classic resize can still be used when you need to change the node type.

Additional nodes that are not covered by existing Reserved Instances (RIs) are charged at the on-demand pricing rate in that region. To learn more, visit the Amazon Redshift pricing page. Elastic resize is not available for DC1 node types. To use elastic resize, please migrate to DC2 for free.

You can resize using the AWS Management console or programmatically using the AWS CLI or API. To learn more, see Resizing Clusters in the Amazon Redshift Cluster Management Guide.

With elastic resize you can now add or subtract nodes in minutes

  • Scale compute and storage on demand
  • Faster query processing
  • Finish large ETL jobs faster
  • Save on-demand cost during off-peak hours

Amazon Redshift announces Deferred Maintenance and Advance Event Notifications

You can now defer maintenance of your Amazon Redshift cluster to keep your data warehouse running without interruption during critical business periods. You will now also receive advance notifications from Amazon Redshift prior to any upcoming maintenance on your cluster.

Amazon Redshift reserves a 30-minute weekly window to apply upgrades and do any other scheduled maintenance activities, such as replacing nodes to avoid hardware failures. You can select a day and time for the maintenance window using the Modify Cluster setting on the AWS management console or using the Modify Cluster API. If there are no maintenance tasks to perform during the scheduled maintenance window, your cluster continues to operate normally until the next scheduled maintenance window.

From large enterprises to fast growing start-ups, many businesses run mission critical workloads on Amazon Redshift. During high business activity periods, you might want to defer the scheduled maintenance to a less busy time. Using the ‘Deferred Maintenance’ feature, you can now postpone scheduled maintenance. Amazon Redshift waits for the following maintenance window to apply software updates. The deferred maintenance period is overridden if a mandatory hardware replacement is scheduled on your cluster. If hardware replacement is required, you will get an event notification through the AWS management console and the SNS subscription.

Amazon Redshift now sends advance notifications, under the Pending category, if a software upgrade or hardware replacement is scheduled during the upcoming maintenance window. The advance notification is sent when the maintenance event is scheduled by the system. You can subscribe to Pending events using Amazon SNS.

To learn more about the maintenance window and how to defer maintenance, refer to Maintenance Windows in the Amazon Redshift Cluster Management Guide. To learn how to subscribe to Pending event notifications, see Event Notifications in the Amazon Redshift Cluster Management Guide.

Choosing Cluster Maintenance Tracks

When Amazon Redshift releases a new cluster version, your cluster is updated during its maintenance window. You can control whether your cluster is updated to the most recent approved release or to the previous release.

The maintenance track controls which cluster version is applied during a maintenance window. When Amazon Redshift releases a new cluster version, that version is assigned to the current track, and the previous version is assigned to the trailing track. To set the maintenance track for the cluster, specify one of the following values:

  • Current – Use the most current approved cluster version.
  • Trailing – Use the cluster version before the current version.

For example, suppose your cluster is currently running version 1.0.2762 and the Amazon Redshift current version is 1.0.3072. If you set the maintenance track value to Current, your cluster is updated to version 1.0.3072 (the next approved release) during the next maintenance window. If you set the maintenance track value to Trailing, your cluster isn’t updated until there is a new release after 1.0.3072.

Switching between Maintenance Tracks

Changing tracks for a cluster is generally a one-time decision. You should exercise caution in changing tracks. If you change the maintenance track from Trailing to Current, we will update the cluster to the Current track release version during the next maintenance window. However, if you change the cluster’s maintenance track to Trailing we won’t update your cluster until there is a new release after the Current track release version.

Maintenance Tracks and Restore

A snapshot inherits the source cluster’s maintenance track. If you change the source cluster’s maintenance track after you take a snapshot, the snapshot and the source cluster are on different tracks. When you restore from the snapshot, the new cluster will be on the maintenance track that was inherited from the source cluster. You can change the maintenance track after the restore operation completes. Resizing a cluster doesn’t affect the cluster’s maintenance track.

For more information, see Setting the Maintenance Track for a Cluster.

Amazon Redshift now runs VACUUM DELETE automatically

Amazon Redshift now automatically runs the VACUUM DELETE operation to reclaim disk space occupied by rows that were marked for deletion by previous UPDATE and DELETE operations. It also defragments the tables to free up consumed space and improves performance for your workloads.

VACUUM DELETE is scheduled to run based on query load and the number of deleted rows in tables. For example, VACUUM DELETE runs only sporadically during times of high load to reduce the impact on users and queries. Automatic VACUUM DELETE pauses when the incoming query load is high, then resumes later. Routinely scheduled VACUUM DELETE jobs don’t need to be modified because Amazon Redshift skips tables that don’t need to be vacuumed.

Additionally, all vacuum operations now run only on a portion of a table at a given time rather than running on the full table. This drastically reduces the amount of resources such as memory, CPU, and disk I/O required to vacuum.

You can track when VACUUM DELETE is running in the background by monitoring ‘Space reclaimed by auto vacuum delete’ on the Cluster Performance tab on the AWS Management Console and using the Cloudwatch metric AutoVacuumSpaceFreed. For more information, see VACUUM in the Amazon Redshift Database Developer Guide.
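
If you still need to reclaim space on demand, a minimal sketch of running the delete phase manually for a hypothetical sales table and checking progress:

vacuum delete only sales;            -- reclaim space from deleted rows without re-sorting
select * from svv_vacuum_progress;   -- check the status of the current or most recent vacuum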

Auto VACUUM DELETE is now available with the release version 1.0.5290 or higher in all AWS commercial regions. Refer to the AWS Region Table for Amazon Redshift availability.

Amazon Redshift now updates table statistics by running ANALYZE automatically

Analyze operations now run automatically on your Amazon Redshift tables in the background to deliver improved query performance and optimal use of system resources.

Amazon Redshift’s sophisticated query planner uses a table’s statistical metadata to choose the optimal query execution plan for better query performance. The analyze operation generates or updates the table statistics. With this update, you no longer need to explicitly run the ANALYZE command. If you do run it as part of your extract, transform, and load (ETL) workflow, automatic analyze skips tables with up-to-date statistics. Similarly, an explicit ANALYZE skips tables with up-to-date table statistics.

For more information, see Automatic Analyze in the Amazon Redshift Database Developer Guide.

Automatic Analyze is now available with the release version 1.0.5671 or higher in all AWS commercial regions. Refer to the AWS Region Table for Amazon Redshift availability.

Automatic Analyze

Amazon Redshift continuously monitors your database and automatically performs analyze operations in the background. To minimize impact to your system performance, automatic analyze runs during periods when workloads are light.

Automatic analyze is enabled by default. To disable automatic analyze, set the auto_analyze parameter to false by modifying your cluster’s parameter group.

To reduce processing time and improve overall system performance, Amazon Redshift skips automatic analyze for any table where the extent of modifications is small.

An analyze operation skips tables that have up-to-date statistics. If you run ANALYZE as part of your extract, transform, and load (ETL) workflow, automatic analyze skips tables that have current statistics. Similarly, an explicit ANALYZE skips tables when automatic analyze has updated the table’s statistics.
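
A minimal sketch of running an explicit ANALYZE, for example as part of an ETL job, assuming a hypothetical sales table:

analyze sales;                    -- refresh statistics for the whole table
analyze sales predicate columns;  -- or analyze only columns used as predicates, sort keys, and distribution keys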

Analysis of New Table Data

By default, the COPY command performs an ANALYZE after it loads data into an empty table. You can force an ANALYZE regardless of whether a table is empty by setting STATUPDATE ON. If you specify STATUPDATE OFF, an ANALYZE is not performed. Only the table owner or a superuser can run the ANALYZE command or run the COPY command with STATUPDATE set to ON.
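
For example, a minimal sketch of forcing statistics to be refreshed during a load, assuming a hypothetical bucket and the role ARN shown in the earlier Parquet example:

copy sales
from 's3://mybucket/tickit/sales/'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
delimiter '|'
statupdate on;   -- run ANALYZE after the load even if the table is not empty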

Amazon Redshift also analyzes new tables that you create with the following commands:

  • CREATE TABLE AS (CTAS)
  • CREATE TEMP TABLE AS
  • SELECT INTO

Amazon Redshift returns a warning message when you run a query against a new table that was not analyzed after its data was initially loaded. No warning occurs when you query a table after a subsequent update or load. The same warning message is

Amazon Redshift announces Concurrency Scaling: Consistently fast performance during bursts of user activity

Amazon Redshift now automatically and elastically scales query processing power to provide consistently fast performance for hundreds of concurrent queries. Concurrency Scaling resources are added to your Redshift cluster transparently in seconds, as concurrency increases, to process queries without wait time. Once workload demand subsides, Amazon Redshift automatically shuts down Concurrency Scaling resources to save you cost. You can continue to use your existing applications and Business Intelligence tools without any changes.

Concurrency Scaling is free for most Redshift customers. For every 24 hours that your main cluster is in use, you accrue a one-hour credit for Concurrency Scaling. These credits enable you to offset the cost of handling significant variance in your workloads, while ensuring consistently fast performance throughout the course of a day. For any usage that exceeds the accrued free usage credits, you’re billed on a per-second basis based on the on-demand rate of your Amazon Redshift cluster. Refer to the Amazon Redshift pricing for more details.

To enable Concurrency Scaling, simply set the Concurrency Scaling Mode to Auto in the AWS Management Console, as described in the Redshift Cluster Management Guide. You can allocate Concurrency Scaling usage to specific user groups and workloads, control the number of Concurrency Scaling clusters that can be used, and monitor Cloudwatch performance and usage metrics.

Concurrency Scaling Pricing

Amazon Redshift automatically adds transient capacity to provide consistently fast performance, even with thousands of concurrent users and queries. There are no resources to manage, no upfront costs, and you are not charged for the startup or shutdown time of the transient clusters. With free concurrency scaling credits that are sufficient for most customers’ concurrency needs, Amazon Redshift allows you to scale with minimal cost-impact and with predictability in your month-to-month cost, even during periods of fluctuating analytical demand. You are charged the per-second on-demand rate for a transient cluster used in excess of the free credits – only when it’s serving your queries – with a one-minute minimum charge each time a Concurrency Scaling cluster is activated. The per-second on- demand rate is based on the type and number of nodes in your Amazon Redshift cluster.

Concurrency Scaling credits

Amazon Redshift clusters earn up to one hour of free Concurrency Scaling credits per day. Credits are earned on an hourly basis for each active cluster in your AWS account. You can accumulate up to 30 hours of free Concurrency Scaling credits for each active cluster. Credits do not expire as long as your cluster is not terminated.

Pricing example for Concurrency Scaling

A 10-node DC2.8XL Redshift cluster in US East costs $48 per hour. Consider a scenario where two transient clusters are utilized for five minutes beyond the free Concurrency Scaling credits. The per-second on-demand rate for Concurrency Scaling is $48 * 1/3600 = $0.013 per second. The additional cost for Concurrency Scaling in this case is $0.013 per second * 300 seconds * 2 transient clusters = $8. Therefore, the total cost of the Amazon Redshift cluster and the two transient clusters in this case is $56.

Concurrency Scaling mode

To enable Concurrency Scaling on a queue, set Concurrency Scaling mode to auto. When the number of queries routed to a queue exceeds the queue’s configured concurrency, eligible queries are sent to the scaling cluster. When slots become available, queries are run on the main cluster. The default is off.

JSON property: concurrency_scaling

Concurrency

The number of queries that can run concurrently in a queue. If Concurrency Scaling mode is enabled, eligible queries are sent to a scaling cluster when a queue reaches the concurrency level. If Concurrency Scaling is disabled, queries wait in the queue until a slot becomes available. The range is between 1 and 50.

JSON property: query_concurrency

User Groups

A comma-separated list of user group names. When members of the user group run queries in the database, their queries are routed to the queue that is associated with their user group.

JSON property: user_group

User Group Wildcard

A Boolean value that indicates whether to enable wildcards for user groups. If this is 0, wildcards are disabled; if this is 1, wildcards are enabled. When wildcards are enabled, you can use "*" or "?" to specify multiple user groups when running queries. For more information, see Wildcards.

JSON property: user_group_wild_card

Query Groups

Amazon Redshift now provides more control over snapshots

Posted On: Apr 4, 2019

Amazon Redshift automatically takes incremental snapshots (backups) of your data every 8 hours or 5 GB per node of data change. You now get more information and control over a snapshot including the ability to control the automatic snapshots schedule.

Amazon Redshift now provides the ability to:

  1. View snapshots that are not associated with any cluster so you can remove unnecessary snapshots
  2. Bulk-delete snapshots to allow you to quickly delete unnecessary snapshots and reduce your S3 storage needs
  3. Control your cluster’s automatic snapshot schedule using the snapshot scheduler. The snapshot schedule can be configured with a cron style granularity via an API or with the AWS Management Console. You can create a schedule and attach the schedule to your cluster to have full control of when automated snapshots are taken.

For more information, see Snapshot scheduler in the Amazon Redshift Cluster Management Guide. Refer to the AWS Region Table for Amazon Redshift availability.

Amazon Redshift announces a console refresh to improve management and monitoring of your data warehouse

Amazon Redshift now offers a new console for Redshift users. The new user interface and new features simplify management and improve insights into the health and performance of your Redshift clusters and workloads.

The new monitoring dashboard in the new Redshift Console makes it easier for administrators to monitor all of their Redshift clusters from a single page by unifying all relevant information such as availability status, cluster and query performance, events, and alarms. This lets administrators quickly understand activity associated with their workloads.

The new console simplifies the management of Redshift clusters by streamlining user experiences and reducing the number of clicks for everyday operations. You can view your queries associated with workload management queues and correlate them with cluster performance.

The new console reduces the time needed to understand and optimize query performance by allowing users to search for and explore queries, and then drill down to visualize query plans and execution statistics. It also gives users the ability to correlate execution time with cluster performance metrics and provides in-place optimization recommendations.

The new Query Editor improves authoring of complex queries with a SQL editor that can manage queries and provide visual analysis of query results. It improves user productivity with features such as content assist, query formatting, and keyboard shortcuts. You can also view execution plans for executed queries.

You can now easily explore partner solutions that integrate with Redshift directly from the Redshift console using the Marketplace widget.

For details about the new Redshift Console, view this blog.

The new Redshift console is now available in the following AWS Regions: Asia Pacific (Hong Kong), Asia Pacific (Mumbai), Asia Pacific (Osaka), Asia Pacific (Seoul), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Tokyo), Canada (Central), EU (Frankfurt), EU (Ireland), EU (London), EU (Paris), EU (Stockholm), Middle East (Bahrain), South America (Sao Paulo), US East (N. Virginia), US East (Ohio), US West (N. California) and US West (Oregon).

Amazon Redshift now supports stored procedures

Amazon Redshift now supports SQL stored procedures to make migration to Amazon Redshift easier. Stored procedures are used to encapsulate logic for data transformation, data validation, and business-specific logic. By combining multiple SQL steps into a stored procedure, you can reduce round trips between your applications and the database.

Until now customers that use stored procedures in their legacy data warehouse had to consider re-implementing the logic via services such as Amazon EMR before migrating to a cloud data warehouse. Now these customers can bring their existing stored procedures to Amazon Redshift and reduce their migration effort.

Amazon Redshift supports stored procedures in the PL/pgSQL dialect. Stored procedures can include variable declarations, control logic, loops, error raising, the SECURITY DEFINER attribute, and other features. You can create stored procedures to perform functions without giving a user access to the underlying tables by using SECURITY DEFINER controls. You can find more information about creating and running stored procedures in the Amazon Redshift Database Developer Guide.

Stored procedures are supported with the release version 1.0.7562 or higher in all AWS commercial regions. Refer to the AWS Region Table for Amazon Redshift availability.

Stored Procedures

You can define an Amazon Redshift stored procedure using the PostgreSQL procedural language PL/pgSQL to perform a set of SQL queries and logical operations. The procedure is stored in the database and is available for any user with sufficient privileges to run.

Unlike a user-defined function (UDF), a stored procedure can incorporate data definition language (DDL) and data manipulation language (DML) in addition to SELECT queries. A stored procedure doesn’t need to return a value. You can use procedural language, including looping and conditional expressions, to control logical flow.

  • CREATE PROCEDURE
  • ALTER PROCEDURE
  • DROP PROCEDURE
  • SHOW PROCEDURE
  • CALL
  • GRANT
  • REVOKE
  • ALTER DEFAULT PRIVILEGES
Amazon Redshift now supports column level access control with AWS Lake Formation

Amazon Redshift Spectrum now supports column level access control for data stored in Amazon S3 and managed by AWS Lake Formation. Column level access control can be used to limit access to only the specific columns of a table rather than allowing access to all columns of a table.

For many customers, their security and data governance compliance policy require more specific access controls on their data. Now these customers can take advantage of Amazon Redshift’s integration with AWS Lake Formation to implement finer-grained access control on their managed data lake while still being able to query the data lake with Amazon Redshift Spectrum.

Amazon Redshift’s Spectrum capability allows Redshift to query data stored in Amazon S3 directly. Querying S3 can be more cost- effective and it eliminates the need to load data. Now Amazon Redshift Spectrum’s S3 query capability is further enhanced to support column level access control for data stored in Amazon S3.

To use this feature, an administrator creates an IAM role for Amazon Redshift and creates the policy to allow Redshift to access AWS Lake Formation. The administrator can then use the Lake Formation console to specify the tables and columns that the role is allowed access to. The column level access control policies can also be created and managed by the SQL grant statements.
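As a rough sketch only (the schema, table, column names, and IAM role ARN below are hypothetical, and the exact GRANT form for Lake Formation-managed external tables should be checked against the GRANT reference), a column level grant looks something like this:

-- Allow the cluster's IAM role to read only two columns of an external table.
grant select (customer_id, order_total)
on external table spectrum_schema.orders
to iam_role 'arn:aws:iam::123456789012:role/RedshiftSpectrumRole';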

You can find more information about enabling column level access control for Amazon Redshift Spectrum in the Amazon Redshift database developer guide. Amazon Redshift supports column level access with the release version 1.0.8610 or higher in all AWS commercial regions supported by AWS Lake Formation. Refer to the AWS Region Table for AWS Lake Formation availability.

Step 6: Load Sample Data from Amazon S3

At this point, you have a database called dev and you are connected to it. Next, you create some tables in the database, upload data to the tables, and try a query. For your convenience, the sample data you load is available in an Amazon S3 bucket.

Note

If you’re using a SQL client tool, ensure that your SQL client is connected to the cluster.

After you complete this step, you can find more information about Amazon Redshift and reset your environment at Where Do I Go From Here?.

To load sample data

1. Create tables.

Individually copy and run the following create table statements to create tables in the dev database. For more information about the syntax, see CREATE TABLE in the Amazon Redshift Database Developer Guide.

create table users(
userid integer not null distkey sortkey,
username char(8),
firstname varchar(30),
lastname varchar(30),
city varchar(30),

Stored Procedures Overview

Stored procedures are commonly used to encapsulate logic for data transformation, data validation, and business-specific logic. By combining multiple SQL steps into a stored procedure, you can reduce round trips between your applications and the database.

For fine-grained access control, you can create stored procedures to perform functions without giving a user access to the underlying tables.

For example, only the owner or a superuser can truncate a table, and a user needs write permission to insert data into a table. Instead of granting a user permissions on the underlying tables, you can create a stored procedure that performs the task. You then give the user permission to run the stored procedure.

A stored procedure with the DEFINER security attribute runs with the privileges of the stored procedure’s owner. By default, a stored procedure has INVOKER security, which means the procedure uses the permissions of the user that calls the procedure.

To create a stored procedure, use the CREATE PROCEDURE command. To run a procedure, use the CALL command.
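As a minimal sketch (the table, procedure, and group names here are hypothetical rather than taken from the AWS documentation), a SECURITY DEFINER procedure can reload a staging table on behalf of users who have no direct privileges on it:

-- Runs with the owner's privileges because of SECURITY DEFINER.
create or replace procedure refresh_staging_sales()
as $$
begin
  truncate staging_sales;
  insert into staging_sales
  select * from sales where saletime > dateadd(day, -1, getdate());
end;
$$ language plpgsql
security definer;

-- Callers only need permission to run the procedure, not to touch the tables.
grant execute on procedure refresh_staging_sales() to group analysts;

call refresh_staging_sales();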

Stored Procedure Overloading and Security

A procedure is identified by its name and signature, which is the number of input arguments and the data types of the arguments. Two procedures in the same schema can have the same name if they have different signatures. In other words, you can overload procedure names.

When you run a procedure, the query engine determines which procedure to call based on the number of arguments that you provide and the data types of the arguments. You can use overloading to simulate procedures with a variable number of arguments, up to the limit allowed by the CREATE PROCEDURE command.
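For illustration, here is a hedged sketch of overloading; the procedure names and bodies are hypothetical:

-- Two procedures share a name but differ in signature (argument count and types).
create or replace procedure log_event(msg varchar)
as $$
begin
  raise info 'event: %', msg;
end;
$$ language plpgsql;

create or replace procedure log_event(msg varchar, severity int)
as $$
begin
  raise info 'event (severity %): %', severity, msg;
end;
$$ language plpgsql;

call log_event('load finished');      -- resolves to the one-argument version
call log_event('load finished', 2);   -- resolves to the two-argument version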

By default, all users have permission to create a procedure. To create a procedure, you must have USAGE permission on the language PL/pgSQL, which is granted to PUBLIC by default. Only superusers and owners have the permission to call a procedure by default. Superusers can run REVOKE USAGE on PL/pgSQL from a user if they want to prevent the user from creating a stored procedure.

The SECURITY attribute controls a procedure’s privileges to access database objects. When you create a stored procedure, you can set the SECURITY attribute to either DEFINER or INVOKER. If you specify SECURITY INVOKER, the procedure uses the privileges of the user invoking the procedure. If you specify SECURITY DEFINER, the procedure uses the privileges of the owner of the procedure. INVOKER is the default.

Stored Procedure Transactions

The SQL statements inside a procedure behave as if they are in a transaction block that implicitly begins when the call starts and ends when the call finishes.

When working with stored procedures, consider that the BEGIN and END statements in PL/pgSQL are only for grouping. They don’t start or end a transaction.

When you call a stored procedure from within a user specified transaction block (defined by BEGIN…COMMIT), all statements in the stored procedure run in the context of the user specified transaction. The procedure doesn’t commit implicitly on exit. The caller controls the procedure commit or rollback.

TRUNCATE is another statement that you can issue from within a stored procedure and influences transaction management. In Amazon Redshift, TRUNCATE issues a commit implicitly. This behavior stays the same in the context of stored procedures. When a TRUNCATE statement is issued from within a stored procedure, it commits the current transaction and begins a new one.

You can't run COMMIT or ROLLBACK using dynamic SQL. However, you can run TRUNCATE using dynamic SQL.
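As a rough sketch of this behavior (the table and procedure names are hypothetical), a TRUNCATE inside a procedure commits everything that ran before it and opens a new transaction for the rest of the call:

create or replace procedure reload_stage()
as $$
begin
  insert into audit_log values ('reload started', getdate());
  truncate stage_table;   -- commits the INSERT above and begins a new transaction
  insert into stage_table select * from source_table;
end;                      -- the remaining work commits when the call finishes
$$ language plpgsql;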

Stored Procedure Logging

Details about stored procedures are logged in the following system tables and views:

SVL_STORED_PROC_CALL – details are logged about the stored procedure call's start time and end time, and whether the call was ended before completion. For more information, see SVL_STORED_PROC_CALL.

SVL_QLOG – the query ID of the procedure call is logged for each query called from a stored procedure. For more information, see SVL_QLOG.

STL_UTILITYTEXT – stored procedure calls are logged after they are completed. For more information, see STL_UTILITYTEXT.

PG_PROC_INFO – this system catalog view shows information about stored procedures. For more information, see PG_PROC_INFO.
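For example, a quick way to review recent procedure calls is to query SVL_STORED_PROC_CALL directly; the column list below (query, querytxt, starttime, endtime, aborted) is assumed from the system view documentation:

select query, querytxt, starttime, endtime, aborted
from svl_stored_proc_call
order by starttime desc
limit 10;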

Limits

The following are limits on stored procedures in Amazon Redshift:

  • Amazon Redshift doesn’t support subtransactions, and hence has limited support for exception handling blocks.
  • The maximum size of the source code for a procedure is 2 MB.
  • The maximum number of explicit and implicit cursors that you can open concurrently in a user session is one. FOR loops that iterate over the result set of a SQL statement open implicit cursors. Nested cursors aren’t supported.
  • Explicit and implicit cursors have the same restrictions on the result set size as standard Amazon Redshift cursors. For more information, see Cursor Constraints.
  • The maximum number of levels for nested calls is 16.
  • The maximum number of procedure parameters is 32 for input arguments and 32 for output arguments.
  • The maximum number of variables in a stored procedure is 1,024.
  • Any SQL command that requires its own transaction context isn’t supported inside a stored procedure. Examples are VACUUM, ALTER TABLE APPEND, and CREATE EXTERNAL TABLE.

Amazon Redshift now recommends distribution keys for improved query performance

Amazon Redshift Advisor now recommends the most appropriate distribution key for frequently queried tables to improve query performance. The Advisor generates tailored recommendations by analyzing the cluster’s performance and query patterns. You can then use the ALTER TABLE ALTER DISTKEY command to add or modify the distribution key of a table, without impacting concurrent read or write queries.
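A minimal sketch of acting on such a recommendation (the table and column names are hypothetical):

-- Change the distribution key of an existing table in place.
alter table sales alter distkey buyerid;

-- Confirm the new distribution style and key.
select "table", diststyle from svv_table_info where "table" = 'sales';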

When you specify the appropriate distribution key for a table, Amazon Redshift places a similar number of rows on each node when loading data into that table. A query that joins multiple tables runs much faster if those tables can be joined on their distribution key columns.

With this Advisor update, Amazon Redshift can now determine the appropriate distribution key by constructing a graph representation of the SQL join history and optimizing for the data transferred across nodes when joins occur. For more information, see Choosing a Data Distribution Style.

Advisor is accessible in the left-hand navigation menu on the Amazon Redshift console. Note that Advisor will only show distribution key recommendations if they would have a significant, positive impact on your workload. For more information, see Working with Recommendations from Amazon Redshift Advisor.

Amazon Redshift announces automatic workload management and query priorities

Amazon Redshift now makes it easy to maximize query throughput and get consistent performance for your most demanding analytics workloads. Automatic workload management (WLM) uses machine learning to dynamically manage memory and concurrency helping maximize query throughput. In addition, you can now easily set the priority of your most important queries, even when hundreds of queries are being submitted.

By setting query priorities, you can now ensure that higher priority workloads get preferential treatment in Redshift, including more resources during busy times for consistent query performance. Automatic WLM uses intelligent algorithms to make sure that lower priority queries don't stall, but continue to make progress. For more information, see Query Priority.

If you manually manage your workloads, we recommend that you switch to automatic WLM. Start by creating a new parameter group for automatic WLM. For more information, see Implementing Automatic WLM.

You can also enable concurrency scaling for any query queue to scale to a virtually unlimited number of concurrent queries, with consistently fast query performance. To learn more about concurrency scaling, see Working with Concurrency Scaling.

Automatic WLM with query priority is now available with cluster version 1.0.9459, or later. Refer to the AWS Region Table for Amazon Redshift availability.

Query priority

Not all queries are of equal importance, and often performance of one workload or set of users might be more important. If you have enabled automatic WLM, you can define the relative importance of queries in a workload by setting a priority value. The priority is specified for a queue and inherited by all queries associated with the queue. You associate queries to a queue by mapping user groups and query groups to the queue. You can set the following priorities (listed from highest to lowest priority):

  • HIGHEST
  • HIGH
  • NORMAL
  • LOW
  • LOWEST

Administrators use these priorities to show the relative importance of their workloads when there are queries with different priorities contending for the same resources. Amazon Redshift uses the priority when letting queries into the system, and to determine the amount of resources allocated to a query. By default, queries run with their priority set to NORMAL.

An additional priority, CRITICAL, which is a higher priority than HIGHEST, is available to superusers. To set this priority, you can use the functions CHANGE_QUERY_PRIORITY, CHANGE_SESSION_PRIORITY, and CHANGE_USER_PRIORITY. To grant a database user permission to use these functions, you can create a stored procedure and grant permission to a user. For an example, see CHANGE_SESSION_PRIORITY.

Note

Only one CRITICAL query can run at a time.
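As a minimal sketch, a superuser can change the priority of a running session directly; the process ID used here is hypothetical:

-- Lower the priority of the session whose pid is 12345.
select change_session_priority(12345, 'Lowest');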

Amazon Redshift introduces AZ64, a new compression encoding for optimized storage and high query performance

AZ64, a proprietary compression encoding that is designed to achieve a high compression ratio and improved query performance, is now available. Innovations in the AZ64 algorithm efficiently compress small groups of data values and leverage SIMD instructions for data parallel processing. This encoding results in significant storage savings in most cases relative to LZO and ZSTD encodings and optimal de-compression performance for numeric and date/time data stored in Amazon Redshift.

AZ64 encoding has consistently better performance and compression than LZO. It has comparable compression to ZSTD but significantly better performance. With the workloads we tested against, the following results were observed. We recommend that you evaluate the benefit for your workloads.

  • Compared to RAW encoding, AZ64 consumed 60-70% less storage, and was 25-30% faster.
  • Compared to LZO encoding, AZ64 consumed 35% less storage, and was 40% faster.
  • Compared to ZSTD encoding, AZ64 consumed 5-10% less storage, and was 70% faster.

Note that the above numbers are for a full workload and individual queries might get a much higher boost.

With your CREATE TABLE and ALTER TABLE statements, you can enable AZ64 encoding on columns with the following data types.

  • SMALLINT
  • INTEGER
  • BIGINT
  • DECIMAL
  • DATE
  • TIMESTAMP
  • TIMESTAMPTZ

For more details about AZ64 encoding, see Compression Encodings in the Amazon Redshift Database Developer Guide.
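A minimal sketch of enabling AZ64 at table creation (the table and column names are hypothetical):

create table web_events(
event_id bigint encode az64,
user_id integer encode az64,
event_time timestamp encode az64,
event_type varchar(32) encode zstd);  -- AZ64 doesn't apply to character types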

AZ64 encoding is now available with Redshift cluster versions 1.0.10013 or later. Refer to the AWS Region Table for Amazon Redshift availability.

Amazon Redshift Improves Performance of Inter-Region Snapshot Transfers

Performance enhancements have been made that allow Amazon Redshift to copy snapshots across regions much faster, allowing customers to support much more aggressive Recovery Time Objective (RTO) and Recovery Point Objective (RPO) Disaster Recovery (DR) policies. The performance of copying a snapshot across regions is now orders of magnitude faster than what was achievable earlier this year, with up to 40,000 MB/s transfer rates observed.

Amazon Redshift provides several features to ensure data is securely backed up both in the local and remote regions. Snapshots to S3 are automatically created on active clusters every 8 hours or when an amount of data equal to 5 GB per node changes. Cross-region snapshots can also be enabled, which will copy incremental changes to a secondary/DR region, allowing a Redshift cluster to be recovered in another region. Depending upon the snapshot policy configured on the primary cluster, the snapshot updates can either be scheduled or based upon data change, and any updates are then automatically replicated to the secondary/DR region.

The combination of these features, along with the performance improvements introduced for copying snapshots across regions, allows Redshift customers to implement a cross-region DR policy.

You can find more information about copying snapshots between regions in the Amazon Redshift Cluster Management Guide. The performance improvements for copying snapshots across regions are available in release version 1.0.2762 or higher in all AWS commercial regions.

Amazon Redshift now supports changing table sort keys dynamically

Amazon Redshift now enables users to add and change sort keys of existing Redshift tables without having to re-create the table. The new capability simplifies maintaining the optimal sort order in Redshift to achieve high performance as query patterns evolve, without interrupting access to the tables.

When creating Redshift tables, customers can optionally specify one or more table columns as sort keys. The sort keys are used to maintain the sort order of Redshift tables and allow the query engine to achieve high performance by reducing the amount of data to read from disk and to save on storage with better compression. Previously, Redshift customers who wanted to change the sort keys after the initial table creation needed to re-create the table with new sort key definitions.

With the new ALTER SORTKEY command, users can dynamically change the Redshift table sort keys as needed. Redshift takes care of adjusting the data layout behind the scenes, and the table remains available for users to query. Users can modify sort keys for a given table as many times as needed, and they can alter sort keys for multiple tables simultaneously.

For more information about ALTER SORTKEY, refer to the documentation.

This feature is available in Redshift 1.0.10654 and later. Refer to the AWS Region Table for Amazon Redshift availability.

ALTER [COMPOUND] SORTKEY ( column_name [,…] )

A clause that changes or adds the sort key used for a table. Consider the following:

  • You can define a maximum of 400 columns for a sort key per table.
  • You can only alter a compound sort key. You can’t alter an interleaved sort key.

When data is loaded into a table, the data is loaded in the order of the sort key. When you alter the sort key, Amazon Redshift reorders the data. For more information about SORTKEY, see CREATE TABLE.
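As a minimal sketch (the table and column names are hypothetical), changing an existing table's compound sort key looks like this:

alter table sales alter compound sortkey (saletime, sellerid);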

RENAME COLUMN column_name TO new_name

A clause that renames a column to the value specified in new_name. The maximum column name length is 127 bytes; longer names are truncated to 127 bytes. For more information about valid names, see Names and Identifiers.

ADD [ COLUMN ] column_name

A clause that adds a column with the specified name to the table. You can add only one column in each ALTER TABLE statement.

You can’t add a column that is the distribution key (DISTKEY) or a sort key (SORTKEY) of the table.

You can’t use an ALTER TABLE ADD COLUMN command to modify the following table and column attributes:

  • UNIQUE
  • PRIMARY KEY
  • REFERENCES (foreign key)
Amazon Redshift launches cross-instance restore

Amazon Redshift now supports cross-instance restore to allow you to restore Redshift snapshots to clusters that are different sizes or running different node types. This simplifies snapshot restore workflows to support your evolving use cases. You can change instance type and count based on compute resources, storage capacity and cost requirements of your specific workloads. For example, you can cross-instance restore a snapshot of your production dense-compute cluster with small instances into a cluster with powerful large instances to serve more demanding workloads, or to a lower-cost development cluster. Once a cluster is restored, you can resize it as your performance and capacity needs change over time.

You can access the cross-instance restore feature by using the snapshot restore function in the AWS Management Console or through the restore API. For more information, refer to the Amazon Redshift documentation.

Amazon Redshift cross-instance restore is supported with release version 1.0.10013 or later. Refer to the AWS Region Table for Amazon Redshift availability.

Restoring a Cluster from a Snapshot

A snapshot contains data from any databases that are running on your cluster. It also contains information about your cluster, including the number of nodes, node type, and master user name. If you restore your cluster from a snapshot, Amazon Redshift uses the cluster information to create a new cluster. Then it restores all the databases from the snapshot data.

For the new cluster created from the original snapshot, you can choose the configuration, such as node type and number of nodes. The cluster is restored in the same AWS Region and a random, system-chosen Availability Zone, unless you specify another Availability Zone in your request. When you restore a cluster from a snapshot, you can optionally choose a compatible maintenance track for the new cluster.

Note

When you restore a snapshot to a cluster with a different configuration, the snapshot must have been taken on a cluster with cluster version 1.0.10013, or later.

You can monitor the progress of a restore by either calling the DescribeClusters API operation, or viewing the cluster details in the AWS Management Console. For an in-progress restore, these display information such as the size of the snapshot data, the transfer rate, the elapsed time, and the estimated time remaining. For a description of these metrics, see RestoreStatus.

You can’t use a snapshot to revert an active cluster to a previous state.

Note

When you restore a snapshot into a new cluster, the default security group and parameter group are used unless you specify different values.

You might want to restore a snapshot to a cluster with a different configuration for these reasons:

  • When a cluster is made up of smaller node types and you want to consolidate it into a larger node type with fewer nodes.
  • When you have monitored your workload and determined the need to move to a node type with more CPU and storage.
  • When you want to measure performance of test workloads with different node types.

The following steps take a cluster with many nodes and consolidate it into a bigger node type with a smaller number of nodes using the AWS CLI. For this example, we start with a source cluster of 24 ds2.xlarge nodes. In this case, suppose that we already created a snapshot of this cluster and want to restore it into a bigger node type.

1. Run the following command to get the details of our 24-node ds2.xlarge cluster.

aws redshift describe-clusters --region eu-west-1 --cluster-identifier mycluster-123456789012

2. Run the following command to get the details of the snapshot.

aws redshift describe-cluster-snapshots --region eu-west-1 --snapshot-identifier mycluster-snapshot

Amazon Redshift announces support for spatial data

With the addition of a new polymorphic data type, GEOMETRY, Amazon Redshift now provides the ability to natively process spatial data. This capability enables customers to store, retrieve, and process spatial data so you can enhance your business insights by integrating spatial data into your analytical queries.

The new data type supports multiple geometric shapes such as Point, Linestring, Polygon, Multipoint, MultiLinestring, MultiPolygon, and GeometryCollection. Customers can add GEOMETRY columns to Redshift tables and write SQL queries spanning across spatial and non-spatial data. Redshift also adds over 40 new spatial SQL functions to construct geometric shapes, import, export, access and process the spatial data. With Redshift’s ability to seamlessly query data lakes, customers can now easily extend spatial processing to data lakes by integrating external tables in spatial queries.

The support for spatial data in Redshift is included with Redshift release version 1.0.11262 or later. This functionality is available to new and existing customers at no additional cost. To get started and learn more, visit our documentation. Refer to the AWS Region Table for Amazon Redshift availability.

Querying Spatial Data in Amazon Redshift

Spatial data describes the position and shape of a geometry in a defined space (a spatial reference system). Amazon Redshift supports spatial data with the GEOMETRY data type, which contains spatial data and optionally its spatial reference system identifier (SRID).

Spatial data contains geometric data that can be used to represent geographic features. Examples of this type of data include weather reports, map directions, tweets with geographic positions, store locations, and airline routes. Spatial data plays an important role in business analytics, reporting, and forecasting.

You can query spatial data with Amazon Redshift SQL functions. Spatial data contains geometric values for an object.

Using spatial data, you can run queries to do the following (a short example follows this list):

  • Find the distance between two points.
  • Check whether one area (polygon) contains another.
  • Check whether one linestring intersects another linestring or polygon.
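The following is a minimal sketch of such queries using WKT literals; the coordinates are hypothetical and, for longitude/latitude input, ST_Distance returns a value in degrees rather than meters:

-- Distance between two points.
select ST_Distance(
    ST_GeomFromText('POINT(-122.34 47.62)'),
    ST_GeomFromText('POINT(-122.42 37.77)'));

-- Does the polygon contain the point?
select ST_Contains(
    ST_GeomFromText('POLYGON((0 0, 0 2, 2 2, 2 0, 0 0))'),
    ST_GeomFromText('POINT(1 1)'));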

You can use the GEOMETRY data type to hold the values of spatial data. A GEOMETRY value in Amazon Redshift can define two-dimensional (2D) geometry primitives. Currently, Amazon Redshift doesn't support 3D or 4D geometry primitives. For more information about geometry primitives, see Well-known text representation of geometry in Wikipedia.

The GEOMETRY data type has the following subtypes:

  • POINT
  • LINESTRING
  • POLYGON
  • MULTIPOINT
  • MULTILINESTRING
  • MULTIPOLYGON
  • GEOMETRYCOLLECTION

There are Amazon Redshift SQL functions that support the following representations of geometric data:

  • GeoJSON
  • Well-known text (WKT)
  • Extended well-known text (EWKT)
  • Well-known binary (WKB) representation
  • Extended well-known binary (EWKB)

For details about SQL functions to query spatial data, see Spatial Functions.

Limitations When Using Spatial Data with Amazon Redshift

The following are limitations when using spatial data with Amazon Redshift:

  • The maximum size of a GEOMETRY object is 1,048,447 bytes.
  • Amazon Redshift Spectrum doesn’t natively support spatial data. Therefore, you can’t create or alter an external table with a GEOMETRY column.
  • Data types for Python user-defined functions (UDFs) don’t support the GEOMETRY data type.
  • You can’t use a GEOMETRY column as a sort key or a distribution key of an Amazon Redshift table.
  • You can’t use GEOMETRY columns in SQL ORDER BY, GROUP BY, or DISTINCT clauses.
  • You can’t use GEOMETRY columns in many SQL functions.
  • You can't UNLOAD GEOMETRY columns to every format. You can UNLOAD GEOMETRY columns to text or CSV, which writes GEOMETRY data in hexadecimal EWKB format. If the size of the EWKB data is more than 4 MB, then a warning occurs because the data can't later be loaded into a table.

The supported compression encoding of GEOMETRY data is RAW.

When using JDBC or ODBC drivers, use customized type mappings. In this case, the client application must have information on which parameters of a ResultSet object are GEOMETRY objects. The ResultSetMetadata operation returns type VARCHAR.

The following nonspatial functions can accept an input of type GEOMETRY or columns of type GEOMETRY:

  • The aggregate function COUNT
  • The conditional expressions COALESCE and NVL
  • CASE expressions
Amazon Redshift introduces Automatic Table Sort, an automated alternative to Vacuum Sort

Amazon Redshift now provides an efficient and automated way to maintain sort order of the data in Redshift tables to continuously optimize query performance. The new automatic table sort capability offers simplified maintenance and ease of use without compromising performance and access to Redshift tables. Automatic table sort complements Automatic Vacuum Delete and Automatic Analyze and together these capabilities fully automate table maintenance. Automatic table sort is now enabled by default on Redshift tables where a sort key is specified.

With Redshift automatic table sort, an administrator no longer needs to worry about tracking what to sort and when to sort. Redshift does it automatically. Redshift runs the sorting in the background and re-organizes the data in tables to maintain sort order and provide optimal performance. This operation does not interrupt query processing and reduces the compute resources required by operating only on frequently accessed blocks of data. It prioritizes which blocks of a table to sort by analyzing query patterns using machine learning. Automatic table sort is most useful for use cases with continuous ingestion of data and ETL/batch processing operations such as incremental daily updates. Redshift will provide a recommendation if there is a benefit to explicitly run vacuum sort on a given table.

For more information about automatic table sort, refer to the Amazon Redshift documentation.

This feature is available in Redshift 1.0.11118 and later. Refer to the AWS Region Table for Amazon Redshift availability.

Vacuuming Tables

Amazon Redshift can automatically sort and perform a VACUUM DELETE operation on tables in the background. To clean up tables after a load or a series of incremental updates, you can also run the VACUUM command, either against the entire database or against individual tables.

Note

Only the table owner or a superuser can effectively vacuum a table. If you don’t have owner or superuser privileges for a table, a VACUUM operation that specifies a single table fails. If you run a VACUUM of the entire database without specifying a table name, the operation completes successfully. However, the operation has no effect on tables for which you don’t have owner or superuser privileges.

For this reason, we recommend vacuuming individual tables as needed. We also recommend this approach because vacuuming the entire database is potentially an expensive operation.
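For example, a single table can be vacuumed on its own (the table name is hypothetical):

-- Re-sort rows in one table without reclaiming deleted space.
vacuum sort only sales;

-- Or run a full vacuum (sort plus delete) on just that table.
vacuum full sales;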

Automatic Table Sort

Amazon Redshift automatically sorts data in the background to maintain table data in the order of its sort key. Amazon Redshift keeps track of your scan queries to determine which sections of the table will benefit from sorting.

Depending on the load on the system, Amazon Redshift automatically initiates the sort. This automatic sort lessens the need to run the VACUUM command to keep data in sort key order. If you need data fully sorted in sort key order, for example after a large data load, then you can still manually run the VACUUM command. To determine whether your table will benefit by running VACUUM SORT, monitor the vacuum_sort_benefit column in SVV_TABLE_INFO.

Amazon Redshift tracks scan queries that use the sort key on each table. Amazon Redshift estimates the maximum percentage of improvement in scanning and filtering of data for each table (if the table was fully sorted). This estimate is visible in the vacuum_sort_benefit column in SVV_TABLE_INFO. You can use this column, along with the unsorted column, to determine when queries can benefit from manually running VACUUM SORT on a table. The unsorted column reflects the physical sort order of a table. The vacuum_sort_benefit column specifies the impact of sorting a table by manually running VACUUM SORT.
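For example, a query along the following lines returns both columns for every table (a sketch, assuming the documented SVV_TABLE_INFO column names); the rows below it can be read as sample output:

select "table", unsorted, vacuum_sort_benefit
from svv_table_info
order by 1;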

table | unsorted | vacuum_sort_benefit
------+----------+--------------------
sales |    85.71 |                5.00
event |    45.24 |               67.00

For the table "sales", even though the table is ~86% physically unsorted, the query performance impact from the table being 86% unsorted is only 5%. This might be either because only a small portion of the table is accessed by queries, or very few queries accessed the table. For the table "event", the table is ~45% physically unsorted. But the query performance impact of 67% indicates that either a larger portion of the table was accessed by queries, or the number of queries accessing the table was large.

Amazon Redshift now supports elastic resize scheduling

The Amazon Redshift cluster elastic resize operation can now be automated using a scheduler that allows you to automatically resize clusters to accommodate changes in workloads that occur on a regular basis. For example, you can now automatically expand a cluster to accommodate heavier workloads as well as shrink a cluster to accommodate lighter workloads at specific times of day. This will allow you to automate cluster resizing to balance price and performance when using Redshift.

With the ability to schedule elastic resize operations, you can resize your Amazon Redshift cluster in minutes by adding nodes to get better performance for demanding workloads or by removing nodes to save cost. Additional nodes are added or removed in minutes with minimal disruption to on-going read and write queries. By creating your own schedule for these operations, you can further automate the operation and management of your Redshift cluster.

You can schedule resizing your cluster using the AWS Management Console or programmatically using the AWS CLI or API. This capability is included with Redshift release version 1.0.11262 or later. Refer to the AWS Region Table for Amazon Redshift availability.

Elastic resize

To quickly add or remove nodes from your cluster, use elastic resize. The cluster is briefly unavailable, usually for only a few minutes. Amazon Redshift tries to hold connections open and temporarily pause queries.

Classic resize

To change the node type, number of nodes, or both, use classic resize. Your cluster goes into a read-only state during the classic resize operation. The duration of time that your cluster is read-only depends on your data’s size.

Amazon Redshift introduces support for materialized views (preview)

Starting today, Amazon Redshift adds support for materialized views in preview. Materialized views provide significantly faster query performance for repeated and predictable analytical workloads such as dashboarding, queries from business intelligence (BI) tools, and ELT (Extract, Load, Transform) data processing.

Using materialized views, you can store the pre-computed results of queries and efficiently maintain them by incrementally processing the latest changes made to the source tables. Subsequent queries referencing the materialized views use the pre-computed results to run much faster. Materialized views can be created based on one or more source tables using filters, projections, inner joins, aggregations, grouping, functions and other SQL constructs.

Data engineers can easily create and maintain efficient data processing pipelines with materialized views while seamlessly extending the performance benefits to data analysts and BI tools. Furthermore, materialized views make it easier to migrate to Redshift, and allow secure access to the pre-computed results.

Materialized views are available to preview by all customers. To get started and learn more, visit the documentation. Refer to the AWS Region Table for Amazon Redshift availability.

Preview Feature (Feb 2020): Materialized View in Redshift

In a data warehouse environment, applications often need to perform complex queries on large tables, such as SELECT statements that perform multiple-table joins and aggregations on tables that contain millions of rows. Processing these queries can be expensive, in terms of system resources and the time it takes to return the results.

Materialized views in Amazon Redshift provide a way to address these issues. A materialized view contains a precomputed result set, based on a SQL query over one or more base tables. You can issue SELECT statements to query a materialized view, in the same way that you can query other tables or views in the database. Amazon Redshift returns the precomputed results from the materialized view, without having to access the base tables at all.

From the user standpoint, the query results are returned much faster compared to when retrieving the same data from the base tables.

For example, consider an extract, transform, load (ETL) or business intelligence (BI) pipeline that performs stages of computations over very large datasets. Such a pipeline might need to run periodic batch jobs to load and transform the original data, using similar SQL statements during each run. This use case is ideal for a materialized view, because the queries are predictable and repeated over and over again.

To use materialized views during the preview:

• Confirm that your Amazon Redshift cluster is using the cluster maintenance track for Preview with the track named preview_features. The current cluster maintenance version for the preview track is 1.0.11746.

REFRESH MATERIALIZED VIEW

This is prerelease documentation for the materialized view feature for Amazon Redshift, which is in preview release. The documentation and the feature are both subject to change. We recommend that you use this feature only with test clusters, and not in production environments. For preview terms and conditions, see Beta Service Participation in AWS Service Terms.

Refreshes a materialized view.

When you create a materialized view, its contents reflect the state of the underlying database table or tables at that time. The data in the materialized view remains unchanged, even when applications make changes to the data in the underlying tables. To update the data in the materialized view, you can use the REFRESH MATERIALIZED VIEW statement at any time. When you do this, Amazon Redshift identifies changes that have taken place in the base table or tables, and then applies those changes to the materialized view.

REFRESH MATERIALIZED VIEW attempts to perform an incremental refresh. However, if the underlying SQL query for the materialized view can't be incrementally refreshed, then Amazon Redshift refreshes by full recomputation of the materialized view. The following SQL elements are compatible with incremental refreshes:

  • WHERE
  • INNER JOIN
  • GROUP BY
  • HAVING
  • Aggregate functions: COUNT and SUM
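A minimal sketch of the workflow (the materialized view is hypothetical; the base table is the sales table from the TICKIT sample data set):

create materialized view daily_sales as
select trunc(saletime) as sale_day, sellerid, sum(pricepaid) as revenue
from sales
group by trunc(saletime), sellerid;

-- Apply changes from the base table (incrementally when possible).
refresh materialized view daily_sales;

-- Query it like any other table or view.
select * from daily_sales order by revenue desc limit 10;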

Table Columns

  • db_name (char(128)) – The database that contains the materialized view.
  • schema (char(128)) – The schema of the database.
  • name (char(128)) – The materialized view name.
  • updated_upto_xid (bigint) – Reserved for internal use.
  • is_stale (char(1)) – A t indicates that the materialized view is stale. A stale materialized view is one where the base tables have been updated but the materialized view has not been refreshed. This information might not be accurate if a refresh has not been run since the last restart.
  • owner_user_name (char(128)) – The user who owns the materialized view.
  • state (integer) – The state of the materialized view as follows:
      • 0 – the materialized view is fully recomputed when refreshed.
      • 1 – the materialized view is incremental.
      • 101 – the materialized view can't be refreshed due to a dropped column, even if the column is not used in the materialized view.
      • 102 – the materialized view can't be refreshed due to a changed column type, even if the column is not used in the materialized view.
      • 103 – the materialized view can't be refreshed due to a renamed table.
      • 104 – the materialized view can't be refreshed due to a renamed column, even if the column is not used in the materialized view.
      • 105 – the materialized view can't be refreshed due to a renamed schema.
Amazon Redshift introduces RA3 nodes with managed storage enabling independent compute and storage scaling

Starting today, Amazon Redshift RA3 nodes with managed storage are generally available. RA3 nodes enable you to scale and pay for compute and storage independently allowing you to size your cluster based only on your compute needs. Now you can analyze even more data cost-effectively.

RA3 nodes are built on the AWS Nitro System and feature high bandwidth networking and performance indistinguishable from bare metal. RA3 nodes use very large, high performance SSDs as local caches. RA3 nodes leverage your workload patterns and advanced data management techniques, such as automatic fine-grained data eviction and intelligent data pre-fetching, to deliver the performance of local SSD while scaling storage automatically to S3. RA3 nodes remain fully compatible with all your existing workloads.

To get started with RA3 nodes, you can create a cluster with RA3 nodes via the AWS Management Console or the create cluster API. To migrate your cluster to an RA3 cluster, you can take a snapshot of your existing cluster and restore it to an RA3 cluster, or do a classic resize from your existing cluster to a new RA3 cluster. To learn more about RA3 nodes, see the cluster management guide. You can find more information on pricing by visiting the Amazon Redshift pricing page.

RA3 nodes are generally available in US East (Ohio), US East (N. Virginia), US West (N. California), US West (Oregon), Asia Pacific (Seoul), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Tokyo), EU (Frankfurt), EU (Ireland), EU (London) regions, and will expand to additional AWS Regions.

Amazon Redshift node types

Amazon Redshift offers three different node types to best accommodate your workloads. You can select RA3, DC2, or DS2 depending on your required performance and data size.

Amazon Redshift RA3 nodes with managed storage allow you to optimize your data warehouse by scaling and paying for compute and storage independently. With RA3, you choose the number of nodes you need based on your data warehousing workload's performance requirements, and only pay for the managed storage that you use. Redshift managed storage uses large, high-performance SSDs in each RA3 node for fast local storage and Amazon S3 for longer-term durable storage. If the data in a node grows beyond the size of the large local SSDs, Redshift managed storage automatically offloads that data to Amazon S3. You pay the same low rate for Redshift managed storage regardless of whether the data sits in high performance SSDs or in S3. For workloads that require a lot of storage, but not as much compute capacity, this lets you automatically scale your data warehouse storage capacity without adding and paying for additional nodes.

Redshift managed storage uses a variety of advanced data management techniques to optimize how efficiently data is offloaded to and retrieved from Amazon S3. In addition, RA3 nodes are built on the AWS Nitro System for high performance and enhanced security. These nodes feature fast, next generation CPUs, large local solid-state drives (SSD), and high bandwidth networking to deliver optimal performance at peak loads.

DC2 nodes enable you to create compute intensive data warehouses with local SSD storage. You choose the number of nodes you need based on data size and performance requirements. If you have less than 10TBs of data, we recommend DC2 node types for the best performance at the lowest price. If you expect your data to grow rapidly, we recommend using RA3 nodes.

DS2 nodes enable you to create large data warehouses using hard disk drives (HDDs) for a low price point. For most workloads we recommend exploring RA3 or DC2 nodes before choosing DS2 nodes.

Redshift managed storage pricing

You pay for data stored in managed storage at a fixed GB-month rate for your region. Managed storage comes exclusively with RA3 node types and you pay the same low rate for Redshift managed storage regardless of data size. Usage of managed storage is calculated hourly based on the total data present in the managed storage (see example below converting usage in GB-Hours to charges in GB-Month). You can monitor the amount of data in your RA3 cluster via Amazon CloudWatch or the AWS Management Console. You do not pay for any data transfer charges between RA3 nodes and managed storage. Managed storage charges do not include backup storage charges due to automated and manual snapshots (see Backup Storage). Once the cluster is terminated, you continue to be charged for the retention of your manual backups.

Region: US East (N. Virginia)

Storage             Pricing
Storage/month       $0.024 per GB

Pricing example for managed storage pricing

Assume you store 100GB of data in managed storage, with RA3 node types, for 15 days in April, and 100TB of data for the final 15 days in April.

Let's first calculate the usage in GB-Hours for the above scenario. For the first 15 days, you will have the following usage in GB-Hours: 100 GB x 15 days x (24 hours/day) = 36,000 GB-Hours.

For the last 15 days, you will have the following usage in GB-Hours: 100 TB x 1,024 GB/TB x 15 days x (24 hours/day) = 36,864,000 GB-Hours.

At the end of April, all usage in GB-Hours adds up to: 36,000 GB-Hours + 36,864,000 GB-Hours = 36,900,000 GB-Hours. Let's convert this to GB-Months: 36,900,000 GB-Hours / 720 hours per month in April = 51,250 GB-Months.

If this data was stored in the US East (Northern Virginia) Region, managed storage is charged at $0.024/GB-Month. Monthly storage charges for 51,250 GB-Months will be 51,250 x $0.024 = $1,230.

Announcing Amazon Redshift data lake export: share data in Apache Parquet format

You can now unload the result of an Amazon Redshift query to your Amazon S3 data lake as Apache Parquet, an efficient open columnar storage format for analytics. The Parquet format is up to 2x faster to unload and consumes up to 6x less storage in Amazon S3, compared to text formats. This enables you to save data transformation and enrichment you have done in Amazon Redshift into your Amazon S3 data lake in an open format. You can then analyze your data with Redshift Spectrum and other AWS services such as Amazon Athena, Amazon EMR, and Amazon SageMaker.

You can specify one or more partition columns so that unloaded data is automatically partitioned into folders in your Amazon S3 bucket. For example, you can choose to unload your marketing data and partition it by year, month, and day columns. This enables your queries to take advantage of partition pruning and skip scanning non-relevant partitions, improving query performance and minimizing cost.

For more information, refer to the Amazon Redshift documentation.

Amazon Redshift data lake export is supported with Redshift release version 1.0.10480 or later. Refer to the AWS Region Table for Amazon Redshift availability.

UNLOAD

Unloads the result of a query to one or more text or Apache Parquet files on Amazon S3, using Amazon S3 server-side encryption (SSE-S3). You can also specify server-side encryption with an AWS Key Management Service key (SSE-KMS) or client-side encryption with a customer-managed key (CSE-CMK).

FORMAT AS PARQUET Clause

Be aware of these considerations when using FORMAT AS PARQUET:

  • Unload to Parquet doesn’t use file level compression. Each row group is compressed with SNAPPY.
  • If MAXFILESIZE isn’t specified, the default maximum file size is 6.2 GB. You can use MAXFILESIZE to specify a file size of 5 MB-6.2 GB. The actual file size is approximated when the file is being written, so it might not be exactly equal to the number you specify.

To maximize scan performance, Amazon Redshift tries to create Parquet files that contain equally sized 32-MB row groups. The MAXFILESIZE value that you specify is automatically rounded down to the nearest multiple of 32 MB. For example, if you specify MAXFILESIZE 200 MB, then each Parquet file unloaded is approximately 192 MB (32 MB row group x 6 = 192 MB).

  • If a column uses TIMESTAMPTZ data format, only the timestamp values are unloaded. The time zone information isn’t unloaded.
  • Don’t specify file name prefixes that begin with underscore (_) or period (.) characters. Redshift Spectrum treats files that begin with these characters as hidden files and ignores them.
PARTITION BY Clause

Be aware of these considerations when using PARTITION BY:

  • Partition columns aren’t included in the output file.
  • Make sure to include partition columns in the SELECT query used in the UNLOAD statement. You can specify any number of partition columns in the UNLOAD command. However, at least one nonpartition column must be part of the unloaded file.

  • If the partition key value is null, Amazon Redshift automatically unloads that data into a default partition called partition_column=__HIVE_DEFAULT_PARTITION__.
  • The UNLOAD command doesn’t make any calls to an external catalog. To register your new partitions to be part of your existing external table, use a separate ALTER TABLE … ADD PARTITION … command. Or you can run a CREATE EXTERNAL TABLE command to register the unloaded data as a new external table. You can also use an AWS Glue crawler to populate your Data Catalog. For more information, see Defining Crawlers in the AWS Glue Developer Guide. (A sketch of both steps follows this list.)

  • The MANIFEST keyword isn’t supported.
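
Putting these pieces together, a minimal sketch might look like the following; the table, bucket, role ARN, and partition values are hypothetical:

-- unload as Parquet, partitioned into year/month/day folders in S3,
-- with each file capped at roughly 192 MB (200 MB rounded down to a multiple of 32 MB)
UNLOAD ('SELECT * FROM marketing_events')
TO 's3://mybucket/marketing/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
FORMAT AS PARQUET
PARTITION BY (year, month, day)
MAXFILESIZE 200 MB;

-- register one of the new partitions on an existing external (Spectrum) table
ALTER TABLE spectrum.marketing_events
ADD IF NOT EXISTS PARTITION (year='2020', month='01', day='15')
LOCATION 's3://mybucket/marketing/year=2020/month=01/day=15/';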

Amazon Redshift introduces support for federated querying (preview)

The in-preview Amazon Redshift Federated Query feature allows you to query and analyze data across operational databases, data warehouses, and data lakes. With Federated Query, you can now integrate queries on live data in Amazon RDS for PostgreSQL and Amazon Aurora PostgreSQL with queries across your Amazon Redshift and Amazon S3 environments.

Federated Query allows you to incorporate live data as part of your business intelligence (BI) and reporting applications. The intelligent optimizer in Redshift pushes down and distributes a portion of the computation directly into the remote operational databases to speed up performance by reducing data moved over the network. Redshift complements query execution, as needed, with its own massively parallel processing capabilities. Federated Query also makes it easy to ingest data into Redshift by letting you query operational databases directly, applying transformations on the fly, and loading data into the target tables without requiring complex ETL pipelines.

The Federated Query feature is available to all Redshift customers for preview. To get started and learn more, visit the documentation. Refer to the AWS Region Table for Amazon Redshift availability.

Amazon Redshift announces general availability for federated querying

Amazon Redshift enables you to query data across your operational databases, your data warehouse, and your data lake. With Amazon Redshift Federated Query, you can query live data in Amazon RDS for PostgreSQL and Amazon Aurora PostgreSQL and integrate it with the data in your Amazon Redshift and Amazon S3 environments.

Amazon Redshift Federated Query allows you to incorporate live data as part of your business intelligence (BI) and reporting applications. The intelligent optimizer in Redshift uses statistics from the operational database to determine the most efficient way to execute a federated query. Redshift pushes down and distributes a portion of the query directly into the remote operational databases to speed up query performance by reducing data moved over the network. Redshift complements query execution, as needed, with its own massively parallel processing capabilities. Federated Query also makes it easy to ingest data into Redshift by letting you query operational databases directly, apply transformations on the fly, and load data into the target tables without requiring complex ETL pipelines.

Federated Query is now available with Amazon Redshift cluster version 1.0.14677 or later. To learn more about federated query, visit the documentation and the blogs Federated query use cases and Getting started with federated query using AWS CloudFormation. Refer to the AWS Region Table for Amazon Redshift availability.
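
As a minimal sketch (the endpoint, role ARN, secret ARN, and table names below are placeholders you would replace with your own), setting up and using a federated schema looks roughly like this:

-- map a live Aurora PostgreSQL database into Redshift as an external schema
CREATE EXTERNAL SCHEMA apg_sales
FROM POSTGRES
DATABASE 'sales' SCHEMA 'public'
URI 'my-aurora-cluster.cluster-abc123.us-east-1.rds.amazonaws.com' PORT 5432
IAM_ROLE 'arn:aws:iam::123456789012:role/MyFederatedRole'
SECRET_ARN 'arn:aws:secretsmanager:us-east-1:123456789012:secret:aurora-pg-creds';

-- join live operational rows with a dimension table already in Redshift
SELECT o.order_id, o.order_total, c.segment
FROM apg_sales.orders o
JOIN dim_customer c ON o.customer_id = c.customer_id
WHERE o.order_date >= '2020-01-01';

-- or ingest directly, applying transformations on the fly, without a separate ETL pipeline
INSERT INTO stage_orders
SELECT order_id, order_total, UPPER(status)
FROM apg_sales.orders
WHERE order_date = CURRENT_DATE;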

Amazon Redshift now supports per-second billing

Starting today, Redshift will be billed in one-second increments for on-demand clusters. Pricing is still listed on a per-hour basis, but bills are now calculated down to the second and show usage in decimal form.

Per-second billing is applicable to instances that are newly launched or already running in all AWS Regions. Please visit the Amazon Redshift pricing page for more information.

Announcing Microsoft Azure Active Directory support for Amazon Redshift 

Customers can now sign on to an Amazon Redshift cluster with Microsoft Azure Active Directory (AD) identities. This allows customers who use Azure AD to sign on to Redshift without duplicating these identities in Redshift.

Built on the industry standard SAML 2.0, Azure AD integration creates a relying party trust between Azure AD and Amazon Redshift to allow Single Sign-On. Azure AD integration is enabled as a configuration for Amazon Redshift JDBC and ODBC drivers. To use this integration with a JDBC driver, the Amazon Redshift JDBC driver must be version 1.2.37.1061 or later. To use Azure AD with an ODBC driver, the Amazon Redshift ODBC driver must be version 1.4.10.1000 or later.

Azure AD integration with Amazon Redshift is available in all AWS commercial regions. For a complete list of AWS regions that Amazon Redshift is available in, see the AWS Region Table.

For information on how to use Azure AD with Amazon Redshift and to configure the JDBC or ODBC drivers, see the Amazon Redshift documentation.

Setting Up JDBC or ODBC Single Sign-on Authentication with Microsoft Azure AD

You can use Microsoft Azure AD as an identity provider (IdP) to access your Amazon Redshift cluster. Following, you can find a procedure that describes how to set up a trust relationship for this purpose. For more information about configuring AWS as a service provider for the IdP, see Configuring Your SAML 2.0 IdP with Relying Party Trust and Adding Claims in the IAM User Guide.

To set up Azure AD and your AWS account to trust each other

1. Create or use an existing Amazon Redshift cluster for your Azure AD users to connect to. To configure the connection, certain properties of this cluster are needed, such as the cluster identifier. For more information, see Creating a Cluster.

2. Add Amazon Redshift as a non-gallery application on the Microsoft Azure portal. For detailed steps, see Configure SAML-based single sign-on to non-gallery applications in the Microsoft Azure Application Management documentation.

3. Set up the Azure Enterprise Application to control Amazon Redshift access on the Microsoft Azure portal:

  • In the Set up Single Sign-On with SAML page, in the Basic SAML Configuration section, choose the Edit icon.

  • For Identifier (Entity ID), enter https://signin.aws.amazon.com/saml.

  • For Reply URL (Assertion Consumer Service URL), enter https://signin.aws.amazon.com/saml.

  • In the User attributes and claims section, create the claims as shown in the following table. Here, 123456789012 is your AWS account ID, AzureSSO is an IAM role you created, and AzureADProvider is the IAM provider.

Announcing column-level access control for Amazon Redshift

Amazon Redshift now supports access control at a column-level for data in Redshift. Customers can use column-level grant and revoke statements to help them meet their security and compliance needs.

Many customers already use Redshift’s table-level access control for data in Redshift, and also want the ability to control access at a finer granularity. Now they can control access to columns without having to implement view-based access control or use another system.

Column-level access control is available in all Amazon Redshift Regions. Refer to the AWS Region Table for Amazon Redshift availability. To learn more about using column-level access control, see the Amazon Redshift documentation.

Usage Notes for Column-Level Access Control

The following usage notes apply to column-level privileges on Amazon Redshift tables and views. These notes describe tables; the same notes apply to views unless we explicitly note an exception.

For an Amazon Redshift table, you can grant only the SELECT and UPDATE privileges at the column level. For an Amazon Redshift view, you can grant only the SELECT privilege at the column level.

The ALL keyword is a synonym for SELECT and UPDATE privileges combined when used in the context of a column-level GRANT on a table.

If you don’t have SELECT privilege on all columns in a table, performing a SELECT operation for all columns (SELECT *) fails.

If you have SELECT or UPDATE privilege on a table or view and add a column, you still have the same privileges on the table or view and thus all its columns.

Only a table’s owner or a superuser can grant column-level privileges.

The WITH GRANT OPTION clause isn’t supported for column-level privileges.

You can’t hold the same privilege at both the table level and the column level. For example, the user data_scientist can’t have both SELECT privilege on the table employee and SELECT privilege on the column employee.department. Consider the following results when granting the same privilege to a table and a column within the table:

  • If a user has a table-level privilege on a table, then granting the same privilege at the column level has no effect.
  • If a user has a table-level privilege on a table, then revoking the same privilege for one or more columns of the table returns an error. Instead, revoke the privilege at the table level.
  • If a user has a column-level privilege, then granting the same privilege at the table level returns an error.
  • If a user has a column-level privilege, then revoking the same privilege at the table level revokes both column and table privileges for all columns on the table.

The following example shows that granting ALL on a schema does not grant SELECT on the tables in that schema; the table privilege must be granted separately:

create user schema_user in group qa_users password 'Abcd1234';
create schema qa_tickit;
create table qa_tickit.test (col1 int);
grant all on schema qa_tickit to schema_user;
set session authorization schema_user;
select current_user;

current_user
--------------
schema_user
(1 row)

select count(*) from qa_tickit.test;
ERROR: permission denied for relation test [SQL State=42501]

set session authorization dw_user;
grant select on table qa_tickit.test to schema_user;
set session authorization schema_user;
select count(*) from qa_tickit.test;

count
-------
0
(1 row)

Achieve finer-grained data security with column-level access control in Amazon Redshift

Amazon Redshift is the most popular cloud data warehouse because it provides fast insights at a low cost. Customers can confidently run mission-critical workloads, even in highly regulated industries, because Amazon Redshift comes with out-of-the-box security and compliance. The security features, combined with the ability to easily analyze data in place and in open formats, along with compute and storage elasticity and ease of use, are what make tens of thousands of customers choose Amazon Redshift.

Many organizations store sensitive data, commonly classified as personally identifiable information (PII) or sensitive personal information (SPI), in Amazon Redshift, and access to this data is restricted for different personas in the organization. For example, your human resources, finance, sales, data science, and marketing departments may all have the required access privileges to view customer data, whereas only the finance department should have access to sensitive data like PII or payment card industry (PCI) data.

Previously, views or AWS Lake Formation on Amazon Redshift Spectrum were used to manage such scenarios; however, this adds extra overhead in creating and maintaining views or Amazon Redshift Spectrum. The view-based approach is also difficult to scale and can lead to a lack of security controls. Amazon Redshift column-level access control is a new feature that supports access control at the column level for data in Amazon Redshift. You can use column-level GRANT and REVOKE statements to help meet your security and compliance needs, similar to managing any database object.
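
As a minimal sketch with hypothetical table, column, and user names, column-level control looks like this:

-- finance needs every column, including the sensitive ones
GRANT SELECT ON customers TO finance_user;

-- marketing gets only the non-sensitive columns; SELECT * will fail for this user
GRANT SELECT (customer_id, customer_name, segment) ON customers TO marketing_user;

-- later, withdraw access to one of those columns
REVOKE SELECT (segment) ON customers FROM marketing_user;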

Amazon Redshift launches pause and resume

Amazon Redshift now supports the ability to pause and resume a cluster, allowing customers to easily suspend on-demand billing while the cluster is not being used. For example, a cluster used for development can now have compute billing suspended when not in use. While the cluster is paused, you are only charged for the cluster’s storage. This adds significant flexibility in managing operating costs for your Amazon Redshift clusters.

The pause and resume operations can be invoked in the Redshift console or using the Redshift API. Paused clusters will still appear as an entry in the console. Pause and resume can also be automated using a schedule you define to match your operational needs.

Pause and resume is available in all Amazon Redshift Regions. Refer to the AWS Region Table for Amazon Redshift availability.

To learn more about using pause and resume, see the Amazon Redshift documentation.

Pausing and Resuming Clusters

If you have a cluster that only needs to be available at specific times, you can pause the cluster and later resume it. While the cluster is paused, on-demand billing is suspended. Only the cluster’s storage incurs charges. For more information about pricing, see the Amazon Redshift pricing page.

When you pause a cluster, Amazon Redshift creates a snapshot, begins terminating queries, and puts the cluster in a pausing state. If you delete a paused cluster without requesting a final snapshot, then you can’t restore the cluster. You can’t cancel or roll back a pause or resume operation after it’s initiated.

You can pause and resume a cluster on the new Amazon Redshift console (not the original console), with the AWS CLI, or with Amazon Redshift API operations.

You can schedule actions to pause and resume a cluster. When you use the new Amazon Redshift console to create a recurring schedule to pause and resume, two scheduled actions are created for the date range that you choose. The scheduled action names are suffixed with -pause and -resume. The total length of the name must fit within the maximum size of a scheduled action name.

You can’t pause the following types of clusters:

  • EC2-Classic clusters.
  • Clusters that are not active, for example a cluster that is currently being modified.
  • Hardware security module (HSM) clusters.
  • Clusters that have automated snapshots disabled.

When deciding to pause a cluster, consider the following:

  • Connections or queries to the cluster aren’t available.
  • You can’t see query monitoring information of a paused cluster on the Amazon Redshift console.
  • You can’t modify a paused cluster. Any scheduled actions on the cluster aren’t done. These include creating snapshots, resizing clusters, and cluster maintenance operations.
  • Hardware metrics aren’t created. Update your CloudWatch alarms if you have alarms set on missing metrics.
  • You can’t copy the latest automated snapshots of a paused cluster to manual snapshots.
  • While a cluster is pausing, it can’t be resumed until the pause operation is complete.
  • When you pause a cluster, billing is suspended. However, the pause operation typically completes within 15 minutes, depending upon the size of the cluster.

When you resume a cluster, consider the following:

  • The cluster version of the resumed cluster is updated to the maintenance version based on the maintenance window of the cluster.
  • If you delete the subnet associated with a paused cluster, you might have an incompatible network. In this case, restore your cluster from the latest snapshot.
  • If you delete an Elastic IP address while the cluster is paused, then a new Elastic IP address is requested.
  • If Amazon Redshift can’t resume the cluster with its previous elastic network interface, then Amazon Redshift tries to allocate a new one.
  • When you resume a cluster, your node IP addresses might change. You might need to update your VPC settings to support these new IP addresses for features like COPY from Secure Shell (SSH) or COPY from Amazon EMR.
  • If you try to resume a cluster that isn’t paused, the resume operation returns an error. If the resume operation is part of a scheduled action, modify or delete the scheduled action to prevent future errors.
  • Logs and session data aren’t restored on resume.

Amazon Redshift now recommends sort keys for improved query performance

Amazon Redshift Advisor now recommends sort keys for frequently queried tables. With the ALTER TABLE command, you can add and change sort keys of existing Redshift tables without having to re-create the tables and without impacting concurrent read or write queries.
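
For example (a sketch with a hypothetical table and columns), applying a recommended sort key in place looks like this:

-- add or change the sort key without re-creating the table
ALTER TABLE sales ALTER COMPOUND SORTKEY (sold_date, region);

-- or drop the sort key entirely
ALTER TABLE sales ALTER SORTKEY NONE;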

Choosing the most appropriate sort key for a table accelerates the performance of queries, especially those with range-restricted predicates, by requiring less data to be read from disk. Advisor can now determine the sort key for a table by analyzing the query history of each table, eliminating the need to define a sort key in advance.

Advisor is like a personal database assistant that generates tailored recommendations related to database operations and cluster configuration based on analyzing your cluster’s performance and usage metrics. Advisor is accessible via the left-hand navigation menu on your Amazon Redshift console. On the Advisor page, simply select the cluster for which you want to see recommendations. Advisor only displays recommendations that will improve the overall query performance of your workload. Advisor refreshes recommendations regularly, and once it determines that a recommendation has been addressed, it removes it from the recommendations list. For more information, see Working with Recommendations from Amazon Redshift Advisor.

Sort key recommendation is now available with Amazon Redshift release version 1.0.12911 or higher in US East (N. Virginia), US West (N. California, Oregon), EU (Frankfurt, Ireland), and Asia Pacific (Seoul, Singapore, Sydney, Tokyo).

Amazon Redshift launches RA3.4xlarge nodes with managed storage

Starting today, Amazon Redshift RA3.4xlarge nodes are generally available. These new node types support up to 64 TB of Redshift managed storage per node. They enable you to scale and pay for compute and storage independently, allowing you to size your cluster based only on your compute needs.

Previously, we launched the now popular RA3.16xlarge nodes for the largest workloads, with 48 vCPUs, 384 GiB of memory, and support for up to 64 TB of Redshift managed storage per node. RA3.4xlarge nodes offer one-fourth of the compute (12 vCPUs) and memory (96 GiB) of RA3.16xlarge at one-fourth of the price. By choosing the RA3 node size and number of nodes, you can now provision the right amount of compute for your workload.

RA3 nodes are built on the AWS Nitro System and feature high-bandwidth networking and large high-performance SSDs as local caches. RA3 nodes leverage your workload patterns and advanced data management techniques, such as automatic fine-grained data eviction and intelligent data pre-fetching, to deliver the performance of local SSD while scaling storage automatically to S3. RA3 nodes remain fully compatible with all your existing workloads.

To get started with RA3 nodes, you can create a cluster with the AWS Management Console or the create cluster API. To upgrade your cluster to an RA3 cluster, you can take a snapshot of your existing cluster and restore it to an RA3 cluster, or do a resize from your existing cluster to a new RA3 cluster. To learn more about RA3 nodes, see the cluster management guide. You can find more information on pricing by visiting the Amazon Redshift pricing page.

Amazon Redshift now supports changing node types within minutes with elastic resize

Amazon Redshift now supports elastic resize across node types. Customers can change node types within minutes and with one simple operation using elastic resize.

Elastic resize across node types automates the steps of taking a snapshot, creating a new cluster, deleting the old cluster, and renaming the new cluster into a simple, quick, and familiar operation. The elastic resize operation can be run at any time or can be scheduled to run at a future time. Customers can quickly upgrade their existing DS2 or DC2 node type-based cluster to the new RA3 node type with elastic resize.

Elastic resize across node types is supported with release version 1.0.10013 or later. Refer to the AWS Region Table for Amazon Redshift availability.

To learn more about using elastic resize with your Redshift cluster, see resizing clusters in the Amazon Redshift Cluster Management Guide.

Amazon Redshift introduces support for multi-factor authentication

Amazon Redshift now supports multi-factor authentication (MFA). Customers can use MFA to provide additional security when authenticating to their Amazon Redshift cluster.

Customers can configure their identity federation provider to require users to prove their identity using additional authentication factors. Amazon Redshift now supports a browser-based authentication workflow that works with any MFA configured by the identity federation provider. This browser-based authentication workflow also makes integration with any Security Assertion Markup Language (SAML)-based identity federation easier.

To use multi-factor authentication with Amazon Redshift, get Amazon Redshift JDBC driver version 1.2.41.1065 and ODBC driver version 1.4.11.1000 or later. Refer to the AWS Region Table for Amazon Redshift availability.

To learn more about using MFA to authenticate to your Redshift cluster, see ‘Options for providing IAM credentials’ in the Amazon Redshift Cluster Management Guide.

Options for providing IAM credentials

To provide IAM credentials for a JDBC or ODBC connection, choose one of the following options.

  • AWS profile

As an alternative to providing credentials values in the form of JDBC or ODBC settings, you can put the values in a named profile. For more information, see Using a Configuration Profile.

  • IAM credentials

Provide values for AccessKeyID, SecretAccessKey, and, optionally, SessionToken in the form of JDBC or ODBC settings. SessionToken is required only for an IAM role with temporary credentials. For more information, see JDBC and ODBC options for providing IAM credentials.

  • Identity provider federation

When you use identity provider federation to enable users from an identity provider to authenticate to Amazon Redshift, specify the name of a credential provider plugin. For more information, see Using a credentials provider plugin.

The Amazon Redshift JDBC and ODBC drivers include plugins for the following SAML-based identity federation credential providers:

o Microsoft Active Directory Federation Services (AD FS)

o PingOne

o Okta

o Microsoft Azure Active Directory (Azure AD)

Setting up multi-factor authentication

To support multi-factor authentication (MFA), Amazon Redshift provides browser-based plugins. Use the browser SAML plugin for Okta, PingOne, and Active Directory Federation Services, and the browser Azure AD plugin for Microsoft Azure Active Directory.

Announcing cost controls for Amazon Redshift Spectrum and Concurrency Scaling

You can now monitor and control your usage and associated cost for Amazon Redshift Spectrum and Concurrency Scaling features. You can create daily, weekly, and monthly usage limits, and define actions that Amazon Redshift automatically takes if those limits are reached to maintain your budget with predictability. Actions include: logging an event to a system table, alerting with a CloudWatch alarm, notifying an administrator with SNS, and disabling further usage. Amazon Redshift Spectrum enables you to power a lake house architecture to directly query and join data across your data warehouse and data lake, and Concurrency Scaling enables you to support thousands of concurrent users and queries with consistently fast query performance.

To create usage limits in the Amazon Redshift console, simply choose “Configure usage limit” in the “Actions” menu for your cluster. You can monitor your usage trends and get alerts on usage exceeding your defined limits with automatically generated CloudWatch metrics in your cluster monitoring and performance tabs. You can also create, modify, and delete usage limits programmatically by using the AWS CLI and API.

Cost controls are available to all Amazon Redshift customers with release version 1.0.14677 or later in the AWS regions where Spectrum and Concurrency Scaling are available. For more information visit the documentation. Refer to the AWS Region Table for Amazon Redshift availability.

Redshift Skills You can add to your resume

  • Architecture and Planning
  • Basic Operations
  • DDL, DML
  • Table Design
  • Performance Tuning
  • System Information
  • Operations and Maintenance
  • Administration
  • Security and Cost Optimization