REDSHIFT

What are key/value stores (opaque)?

Keys are mapped to values

Values are treated as BLOBs (opaque data)

No type information is stored

Values can be heterogeneous

What are document stores (non-shaped)?

Keys are mapped to documents

Documents consist of attributes

Attributes are name/typed-value pairs, which may be nested

Type information is stored per attribute

Documents can be heterogeneous

Documents may be organized in collections or databases

Relational databases

Tables (relations) consist of rows and columns

Columns have a type. Type information is stored once per column

A row contains just the values for a record (no type information)

All rows in a table have the same columns and are homogeneous

Row-oriented storage

  • Row-oriented storage is especially inefficient when only a few columns are needed but the table has many columns

Column-oriented storage

  • Column-oriented databases primarily work on columns
  • All columns are treated individually
  • Values of a single column are stored contiguously
  • This allows array-processing the values of a column
  • Rows may be constructed from column values later if required
  • This means column stores can still produce row output (tables)
  • Values from multiple columns need to be retrieved and assembled for that, making the implementation a bit more complex
  • Query processors in columnar databases work on columns, too

Column-oriented storage

  • Column stores can greatly improve the performance of queries that only touch a small number of columns
  • This is because they will only access those particular columns’ data
  • Simple math: table t has a total of 10 GB of data, with
      • column a: 4 GB
      • column b: 2 GB
      • column c: 3 GB
      • column d: 1 GB
  • If a query only uses column d, at most 1 GB of data will be processed by a column store

  • In a row store, the full 10 GB will be processed (see the sketch below)
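
A quick sketch of such a query, using the hypothetical table t from the example above:

    -- A column store scans only column d's datafile (about 1 GB);
    -- a row store must read entire rows (all 10 GB) to answer this.
    SELECT SUM(d)
    FROM t;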

Column-oriented storage

  • Column stores store data in column-specific files
  • Simplest case: one datafile per column
  • Row values for each column are stored contiguously

Column-oriented storage

  • All data within each column datafile have the same type, making it ideal for compression
  • Usually a much better compression factor can be achieved for single columns than for entire rows
  • Compression allows reducing disk I/O when reading/writing column data but has some CPU cost
  • For data sets bigger than memory, compression is often beneficial because disk access is slower than decompression (see the sketch below)
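
In Amazon Redshift, for example, a compression encoding can be declared per column when a table is created. A minimal sketch; the table, columns, and encoding choices are illustrative:

    -- Each column gets its own encoding, chosen to fit its data.
    CREATE TABLE sales (
        sale_id  BIGINT       ENCODE zstd,
        region   VARCHAR(32)  ENCODE bytedict,  -- few distinct values
        quantity INTEGER      ENCODE delta,     -- small row-to-row changes
        note     VARCHAR(256) ENCODE lzo
    );

    -- Ask Redshift to analyze a loaded table and recommend encodings:
    ANALYZE COMPRESSION sales;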

Column-oriented processing

  • Reading all columns of a row is an expensive operation in a column store, so full row tuple construction is avoided or delayed as much as possible internally
  • Updating or inserting rows may also be very expensive and may cost much more time than in a row store
  • Some column stores are hybrids, with read-optimised (column) storage and write-optimised OLTP storage
  • Still, column stores are not really made for OLTP workloads, and if you need to work with many columns at once, you’ll pay a price in a column store

AMAZON REDSHIFT

An Amazon Redshift data warehouse is an enterprise-class relational database query and management system.

An Amazon Redshift data warehouse is a collection of computing resources called nodes, which are organized into a group called a cluster. Each cluster runs an Amazon Redshift engine and contains one or more databases.

Amazon Redshift supports client connections with many types of applications, including business intelligence (BI), reporting, data, and analytics tools.

When you execute analytic queries, you are retrieving, comparing, and evaluating large amounts of data in multiple-stage operations to produce a final result.

Amazon Redshift achieves efficient storage and optimum query performance through a combination of massively parallel processing, columnar data storage, and very efficient, targeted data compression encoding schemes.

Amazon Redshift is based on PostgreSQL 8.0.2, but it has a number of very important differences.

Client applications – Amazon Redshift integrates with various data loading and ETL (extract, transform, and load) tools and business intelligence (BI) reporting, data mining, and analytics tools. Amazon Redshift is based on industry-standard PostgreSQL, so most existing SQL client applications will work with only minimal changes.

Connections – Amazon Redshift communicates with client applications by using industry-standard PostgreSQL JDBC and ODBC drivers.

Clusters – The core infrastructure component of an Amazon Redshift data warehouse is a cluster. A cluster is composed of one or more compute nodes. If a cluster is provisioned with two or more compute nodes, an additional leader node coordinates the compute nodes and handles external communication. Your client application interacts directly only with the leader node. The compute nodes are transparent to external applications.

Leader node – The leader node manages communications with client programs and all communication with compute nodes. It parses and develops execution plans to carry out database operations, in particular, the series of steps necessary to obtain results for complex queries. Based on the execution plan, the leader node compiles code, distributes the compiled code to the compute nodes, and assigns a portion of the data to each compute node. The leader node distributes SQL statements to the compute nodes only when a query references tables that are stored on the compute nodes. All other queries run exclusively on the leader node.

Compute nodes – The leader node compiles code for individual elements of the execution plan and assigns the code to individual compute nodes. The compute nodes execute the compiled code and send intermediate results back to the leader node for final aggregation.

Node slices – A compute node is partitioned into slices. Each slice is allocated a portion of the node’s memory and disk space, where it processes a portion of the workload assigned to the node. The leader node manages distributing data to the slices and apportions the workload for any queries or other database operations to the slices. The slices then work in parallel to complete the operation. The number of slices per node is determined by the node size of the cluster.
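
The slice layout of a cluster can be inspected through the STV_SLICES system view; a quick sketch:

    -- Show which slices live on which compute node.
    SELECT node, slice
    FROM stv_slices
    ORDER BY node, slice;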

Internal network – Amazon Redshift takes advantage of high-bandwidth connections, close proximity, and custom communication protocols to provide private, very high-speed network communication between the leader node and compute nodes. The compute nodes run on a separate, isolated network that client applications never access directly.

• Massively parallel processing (MPP) – Enables fast execution of the most complex queries operating on large amounts of data. Multiple compute nodes handle all query processing leading up to final result aggregation, with each core of each node executing the same compiled query segments on portions of the entire data. Amazon Redshift distributes the rows of a table to the compute nodes so that the data can be processed in parallel.

• Columnar storage for database tables – Drastically reduces the overall disk I/O requirements and is an important factor in optimizing analytic query performance. Storing database table information in a columnar fashion reduces the number of disk I/O requests and reduces the amount of data you need to load from disk. Loading less data into memory enables Amazon Redshift to perform more in-memory processing when executing queries.

• Data compression – Reduces storage requirements, thereby reducing disk I/O, which improves query performance. When you execute a query, the compressed data is read into memory, then uncompressed during query execution. Loading less data into memory enables Amazon Redshift to allocate more memory to analyzing the data. Because columnar storage stores similar data sequentially, Amazon Redshift is able to apply adaptive compression encodings specifically tied to columnar data types. The best way to enable data compression on table columns is by allowing Amazon Redshift to apply optimal compression encodings when you load the table with data (see the sketch below).
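
One way to let Amazon Redshift apply optimal encodings is during the initial load with COPY. A minimal sketch; the table name, S3 path, and IAM role are placeholders:

    -- COMPUPDATE ON makes Redshift sample the incoming data and set
    -- compression encodings automatically on an empty table.
    COPY sales
    FROM 's3://my-bucket/sales/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
    COMPUPDATE ON;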

Encryption – When you provision the cluster, you can optionally choose to encrypt the cluster for additional security. When you enable encryption, Amazon Redshift stores all data in user-created tables in an encrypted format. Encryption is an immutable property of the cluster. The only way to switch from an encrypted cluster to a nonencrypted cluster is to unload the data and reload it into a new cluster. Encryption applies to the cluster and any backups. When you restore a cluster from an encrypted snapshot, the new cluster is encrypted as well.

Durability – Because Amazon Redshift distributes and executes queries in parallel across all of a cluster’s compute nodes, you can increase query performance by adding nodes to your cluster. Amazon Redshift also distributes your data across all compute nodes in a cluster. When you run a cluster with at least two compute nodes, data on each node is mirrored on disks on another node, reducing the risk of data loss.

Databases – Amazon Redshift creates one database when you provision a cluster. A cluster contains one or more databases. User data is stored on the compute nodes. Your SQL client communicates with the leader node, which in turn coordinates query execution with the compute nodes.

Schemas – A database contains one or more named schemas. Each schema in a database contains tables and other kinds of named objects. By default, a database has a single schema, which is named PUBLIC. You can use schemas to group database objects under a common name. Schemas are similar to operating system directories, except that schemas cannot be nested. Identical database object names can be used in different schemas in the same database without conflict. Users with the necessary privileges can access objects across multiple schemas in a database. By default, an object is created within the first schema in the search path of the database.
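
A minimal sketch of schemas in practice (all names are illustrative):

    -- Identically named tables can coexist in different schemas.
    CREATE SCHEMA sales;
    CREATE SCHEMA support;

    CREATE TABLE sales.events   (id BIGINT, amount DECIMAL(10,2));
    CREATE TABLE support.events (id BIGINT, ticket VARCHAR(64));

    -- Control which schema unqualified names resolve to:
    SET search_path TO sales, public;
    SELECT * FROM events;  -- resolves to sales.events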

Tables – When you load data into a table, Amazon Redshift distributes the rows of the table to each of the node slices according to the table’s distribution style. The number of slices per node depends on the node size of the cluster. The nodes all participate in parallel query execution, working on data that is distributed across the slices. When you execute a query, the query optimizer redistributes the rows to the compute nodes as needed to perform any joins and aggregations. Redistribution might involve either sending specific rows to nodes for joining or broadcasting an entire table to all of the nodes.
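
The distribution style is declared when a table is created; a sketch with illustrative names:

    -- DISTKEY co-locates rows with the same customer_id on the same
    -- slice, so joins on customer_id avoid redistribution at run time.
    CREATE TABLE orders (
        order_id    BIGINT,
        customer_id BIGINT,
        order_date  DATE
    )
    DISTSTYLE KEY
    DISTKEY (customer_id)
    SORTKEY (order_date);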

Views – A view is a stored SQL query that defines the logic and output schema of the view. Every time the view is referenced in a query, the definition of the view is executed.

A view is not physically materialized like a table, though the view output can be materialized with an external table after view execution. Having ownership of a view, or having privileges granted on a view, does not imply access to the underlying tables. You need to grant access to the underlying tables explicitly.
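
A minimal sketch (view, table, and user names are illustrative):

    CREATE VIEW recent_orders AS
    SELECT order_id, customer_id, order_date
    FROM orders
    WHERE order_date > '2017-01-01';

    GRANT SELECT ON recent_orders TO analyst;
    -- Per the note above, access to the underlying table may also
    -- need to be granted explicitly:
    GRANT SELECT ON orders TO analyst;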

Data Types – Each value that Amazon Redshift stores or retrieves has a data type with a fixed set of associated properties. Data types are declared when tables are created. A data type constrains the set of values that a column or argument can contain.

The following table lists the data types that you can use in Amazon Redshift tables.
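
    Data type          Aliases                        Description
    SMALLINT           INT2                           Signed two-byte integer
    INTEGER            INT, INT4                      Signed four-byte integer
    BIGINT             INT8                           Signed eight-byte integer
    DECIMAL            NUMERIC                        Exact numeric of selectable precision
    REAL               FLOAT4                         Single precision floating-point number
    DOUBLE PRECISION   FLOAT8, FLOAT                  Double precision floating-point number
    BOOLEAN            BOOL                           Logical Boolean (true/false)
    CHAR               CHARACTER, NCHAR               Fixed-length character string
    VARCHAR            CHARACTER VARYING              Variable-length character string
    DATE                                              Calendar date (year, month, day)
    TIMESTAMP          TIMESTAMP WITHOUT TIME ZONE    Date and time (without time zone)
    TIMESTAMPTZ        TIMESTAMP WITH TIME ZONE       Date and time (with time zone)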

Users – When you provision a cluster, you specify a master user who has access to all of the databases that are created within the cluster. This master user is a superuser who is the only user with access to the database initially, though this user can create additional superusers and users.

Super User – Database superusers have the same privileges as database owners for all databases. The master user, which is the user you created when you launched the cluster, is a superuser. You must be a superuser to create a superuser. Amazon Redshift system tables and system views are either visible only to superusers or visible to all users. Only superusers can query system tables and system views that are designated “visible to superusers”. A database superuser bypasses all permission checks.
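
A sketch of creating an additional superuser (the user name and password are placeholders):

    -- Only an existing superuser can run this; CREATEUSER makes the
    -- new user a superuser in Amazon Redshift.
    CREATE USER admin2 PASSWORD 'ChangeMe123' CREATEUSER;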

User Groups – Groups are collections of users who are all granted whatever privileges are associated with the group. You can use groups to assign privileges by role. For example, you can create different groups for sales, administration, and support and give the users in each group the appropriate access to the data they require for their work. You can grant or revoke privileges at the group level, and those changes will apply to all members of the group, except for superusers.
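
A sketch of role-style privileges via groups (group, user, and table names are illustrative; the users must already exist):

    CREATE GROUP sales_team;
    ALTER GROUP sales_team ADD USER alice, bob;

    -- The grant applies to every current member of the group.
    GRANT SELECT ON orders TO GROUP sales_team;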

User Defined Functions – You can create a custom user-defined scalar function (UDF) using either a SQL SELECT clause or a Python program. The new function is stored in the database and is available for any user with sufficient privileges to run. For Python UDFs, in addition to using the standard Python functionality, you can import your own custom Python modules.
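
A minimal Python UDF sketch, following the CREATE FUNCTION syntax (the function name and logic are illustrative):

    -- Scalar UDF written in Python, callable from SQL.
    CREATE FUNCTION f_greater (a float, b float)
    RETURNS float
    STABLE
    AS $$
      if a > b:
          return a
      return b
    $$ LANGUAGE plpythonu;

    SELECT f_greater(quantity, 10) FROM sales;  -- example usage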

Parameter Groups – Amazon Redshift uses parameter groups to define the behavior of all databases in a cluster, such as date presentation style and floating-point precision. If you don’t specify a parameter group when you provision your cluster, Amazon Redshift associates a default parameter group with the cluster.

Node Type – When you launch a cluster, one option you specify is the node type. The node type determines the CPU, RAM, storage capacity, and storage drive type for each node. The dense storage (DS) node types are storage optimized. The dense compute (DC) node types are compute optimized. The cost of your cluster depends on the region, node type, number of nodes, and whether the nodes are reserved in advance.

  • Each compute node has its own dedicated CPU, memory, and attached disk storage, which are determined by the node type.
  • As your workload grows, you can increase the compute capacity and storage capacity of a cluster by increasing the number of nodes, upgrading the node type, or both.
  • Each node provides two storage choices. You can start with a single 160 GB node and scale up to multiple 16 TB nodes to support a petabyte of data or more.

The node type that you choose depends heavily on three things:

  • The amount of data you import into Amazon Redshift
  • The complexity of the queries and operations that you run in the database

  • The needs of downstream systems that depend on the results from those queries and operations

Amazon Redshift Free Trial

Try Amazon Redshift for free! If you’ve never created an Amazon Redshift cluster, you’re eligible for a two-month free trial of our DC2.Large node.

You get 750 hours per month for free, enough hours to continuously run one DC2.Large node with 160GB of compressed SSD storage. You can also build clusters with multiple nodes to test larger data sets, which will consume your free hours more quickly. Once your two-month free trial expires or your usage exceeds 750 hours per month, you can shut down your cluster to avoid any charges, or keep it running at our standard On-Demand Rate.

On-Demand Pricing

Amazon Redshift On-Demand pricing allows you to pay for capacity by the hour with no commitments and no upfront costs – you simply pay an hourly rate based on the type and number of nodes in your cluster. This frees you from planning and purchasing data warehouse capacity ahead of your needs, and enables you to cost-effectively spin up and tear down environments for development or test purposes.