Vertica[l] Bits

Monday, 28 January 2013

Want to store frequently accessed data on faster storage (SSD)? Storage Policies can help.

Vertica "Storage Locations", as Vertica users know, are paths on disk that you specify for storing data and temp files. Each Vertica cluster requires one location to store data and another to store the catalog. These default storage locations (which we specify during Vertica installation and database creation) must exist on all nodes.

e.g.
/home/dbadmin/testdb/v_testdb_node0001_data
/home/dbadmin/testdb/v_testdb_node0001_catalog

Whenever we add data to the database or perform any DML operation, the data is stored in these storage locations on disk. Vertica also supports adding new storage locations to each node if we want to increase storage capacity. The new storage location can be on the Vertica nodes or on a local SAN.

So far so good, but up to Vertica 6.0 we had no option to selectively store a database object (database, schema, table) on a particular storage location. All data added or modified was stored in the default storage location. Having this freedom is quite desirable when we have business-critical (frequently accessed) tables that we want to keep on a faster (and, needless to say, costlier) SSD.

The Vertica 6.1 release (code-name Bulldozer) lets us do this with two newly introduced storage concepts, viz. Location Labels & Storage Policies.

Location Label: This lets us create labelled storage locations. A new optional parameter has been added to the ADD_LOCATION function to create a storage location with a descriptive label. These labelled storage locations are then used to define storage policies.

VSQL=> select add_location ('/home/dbadmin/SSD/schemas','v_testdb_node0001', 'data', 'SSD');
add_location
-------------------------------------
/home/dbadmin/SSD/schemas added.
(1 row)

The example above creates a new data storage location with the label "SSD". We will use this label to identify the storage location when creating a storage policy.

Once we have added our faster storage location (identified by a label), it's time to create a Storage Policy to associate database objects with it. The newly introduced Vertica function that lets us do this is called "SET_OBJECT_STORAGE_POLICY". Once a storage policy exists, Vertica uses the labelled location as the default storage location for that object's data. Storage policies therefore let you decide where your critical data lives, for example keeping the frequently accessed tables mentioned above on SSD. Only one storage policy can exist per database object.

VSQL=> select set_object_storage_policy ('SALES', 'SSD');
set_object_storage_policy
-----------------------------------
Default storage policy set.
(1 row)

Every time data is loaded or updated, Vertica checks whether the object has an associated storage policy. If it does, Vertica automatically uses the labelled storage location. If no storage policy exists for the object or its parent entities, Vertica uses its default storage algorithms on the available storage locations.
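The lookup described above can be pictured with a small Python model. This is only a sketch for intuition, not Vertica's implementation: the `policies` dictionary, the function name and the `schema.table` key format are all hypothetical, and the order of the checks (table first, then schema, then database) is my assumption about what "the object or its parent entities" means.

```python
from typing import Dict, Optional

def resolve_location_label(database: str, schema: str, table: str,
                           policies: Dict[str, str]) -> Optional[str]:
    """Return the label of the nearest storage policy, or None if there is none.

    `policies` is a hypothetical in-memory map of object name -> location
    label; it stands in for whatever Vertica keeps in its catalog.
    """
    # Assumed order: the most specific object first, then its parents.
    for name in (f"{schema}.{table}", schema, database):
        label = policies.get(name)
        if label is not None:
            return label
    # No policy on the object or its parents: default placement applies.
    return None

if __name__ == "__main__":
    policies = {"public.SALES": "SSD"}
    print(resolve_location_label("testdb", "public", "SALES", policies))
    print(resolve_location_label("testdb", "public", "ORDERS", policies))
```

In this model a policy set on a schema would also cover every table in that schema that has no policy of its own.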

A Storage Policy can be cleared using the function "clear_object_storage_policy", specifying the name of the object whose policy should be removed.

VSQL=> select clear_object_storage_policy('SALES');

Tuesday, 25 December 2012

COPY: How to skip first column that has a different delimiter

I had this very interesting problem some time back, wherein a customer had a data file whose first column used a different delimiter from the rest of the columns. The customer wanted to load this file using the COPY command and skip the first column.

The data file looked like this:
The first column of the data file has ~ as the field delimiter.
The remaining columns have pipe (|) as the field delimiter.

$ cat load.out
SkipMe1 ~ A1 | B1 | C1
SkipMe2 ~ A2 | B2 | C2

The table definition is below

CREATE TABLE t (
AColumn VARCHAR(10),
BColumn VARCHAR(10),
CColumn VARCHAR(10)
);
CREATE PROJECTION tp (
AColumn,
BColumn,
CColumn
)
AS SELECT * from t;

The data once loaded was supposed to be as shown below:

 AColumn | BColumn | CColumn
---------+---------+---------
 A1      | B1      | C1
 A2      | B2      | C2

Skipping a column in the COPY command is easy using "FILLER", and so is specifying a column delimiter. But I had never tried specifying a delimiter for one specific column, let alone in conjunction with the FILLER option. So I tested the solution out and voila...

The COPY command I used is shown below. Note the use of "FILLER" and specific use of "DELIMITER" only for first column.

$ cat load.out | vsql -c "copy t(c1 FILLER varchar(10) delimiter '~',c2 FILLER varchar(10), c3 FILLER varchar(10),c4 FILLER varchar(10), AColumn as c2, BColumn as c3, CColumn as c4) from stdin direct";
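To make the parsing order easier to follow, here is a rough Python sketch of what the command above asks for: the first field ends at '~', the remaining fields end at '|', and only the values mapped to real table columns are kept. The function name is hypothetical. Stripping the spaces around each value is this sketch's own choice, made so that the result matches the expected table shown earlier; it is not a statement about how COPY treats whitespace.

```python
def parse_row(line, first_delim="~", delim="|"):
    """Split one record: the first field ends at first_delim, the rest at delim."""
    # c1 is read and then thrown away, like the first FILLER column.
    c1, rest = line.rstrip("\n").split(first_delim, 1)
    # c2, c3, c4 play the role of the remaining FILLER columns.
    # Assumption: surrounding spaces are trimmed here for readability only.
    c2, c3, c4 = (field.strip() for field in rest.split(delim))
    # Map the filler values onto the real table columns.
    return {"AColumn": c2, "BColumn": c3, "CColumn": c4}

if __name__ == "__main__":
    sample = ["SkipMe1 ~ A1 | B1 | C1", "SkipMe2 ~ A2 | B2 | C2"]
    for line in sample:
        print(parse_row(line))
```

The sketch expects exactly three pipe-separated values after the '~' and raises a ValueError for any other shape.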

Sunday, 23 December 2012

Changing row delimiter in VSQL

Consider a scenario where you export some data with VSQL and some of the string columns contain newline characters as part of the column value. If you then want to load the data file into another table using the COPY statement, or parse it with a custom parser of your own, the newline characters in the string columns are bound to cause problems.

One of the easiest and cleanest ways to tackle this problem is to change the VSQL row delimiter to some character other than newline. When you export data, the rows are then delimited by the character of your choice, so your COPY command or custom parser can tell newline characters inside column values apart from the row delimiter.

In the example below, the record separator is changed from the default \n to '~'. The fields are separated by '|' and the records are separated by '~'.

At command line:

Use the -R option (along with -A for unaligned output) to specify the record separator.

$ vsql -o my_table.out -c "select * from my_table" -A -R '~';
$ cat my_table.out
id|status~1001|t~1002|f~1003|t~1004|f~(4 rows)

At vsql prompt:

Use \pset to specify the new record separator
vsql=> \pset recordsep '~'
Record separator is "~".

Set output format as unaligned
vsql=> \a
Output format is unaligned.

vsql=> \o my_table2.out
vsql=> select * from my_table;
vsql=> \o
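A custom parser for a file exported this way might look like the Python sketch below. It assumes the layout shown in my_table.out: a header record, '|' between fields, '~' between records, and a trailing "(N rows)" footer. The function name and the sample data with an embedded newline are made up for illustration, and the sketch does not handle values that themselves contain '~' or '|'.

```python
import re

# Matches the "(4 rows)" style footer that closes the export.
FOOTER = re.compile(r"^\(\d+ rows?\)\s*$")

def parse_export(text, record_sep="~", field_sep="|"):
    """Split exported text into a list of dicts keyed by the header names."""
    records = text.split(record_sep)
    # Drop an empty trailing piece, then the row-count footer, if present.
    if records and records[-1].strip() == "":
        records.pop()
    if records and FOOTER.match(records[-1]):
        records.pop()
    if not records:
        return []
    header = records[0].split(field_sep)
    # Newlines inside a record are kept as part of the column value.
    return [dict(zip(header, rec.split(field_sep))) for rec in records[1:]]

if __name__ == "__main__":
    sample = "id|note~1001|single line~1002|first line\nsecond line~(2 rows)\n"
    for row in parse_export(sample):
        print(row)
```

Because the split happens on '~' only, the newline inside the second note stays inside the value instead of starting a new row.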