Main Page | Report this Page
Computers Forum Index  »  Computer - Databases - IBM DB2  »  Generating Indentity Columns during load on large...
Page 1 of 1    

Generating Indentity Columns during load on large...

Author Message
Mark A...
Posted: Wed Oct 28, 2009 3:45 pm
Guest
DB2 for Linux 9.7. Since DPF (Data Partitioning Feature) tries to do things
in parallel, how much extra overhead is there is having DB2 assign values to
an identity column that is used as the hash partitioning key during a load
of a very large table (compared to having the value already populated in the
input file)? Obviously, this has to be done sequentially, and not in
parallel.
 
mirof007...
Posted: Thu Oct 29, 2009 5:38 pm
Guest
Hi Mark, I think this depends on the size of the values cache
associated with a given identity column - i.e., I would expect load to
perform better with larger cache sizes. By default, I believe, DB2
will use a cache of 20 values, so you're synchronizing only once every
20 rows, but the value cache size can be adjusted using ALTER TABLE
ALTER COLUMN.

Regards,
Miro
 
Mark A...
Posted: Fri Oct 30, 2009 5:15 am
Guest
"Mark A" <noone at (no spam) nowhere.com> wrote in message
news:hcdfq0$np6$1 at (no spam) news.eternal-september.org...
Quote:
I am aware of cache for generated columns. You didn't address the question
of generating a hash key in a DPF (Data Partitioning Feature) database. My
question was about the how much extra overhead is there for doing this.

The following doc suggests that there is a 198% "degradation" for
generating the identity column used as a hash partitioning key in DPF. See
page

Sorry, didn't finish the above post.

Here is the link (See page 35):
ftp://ftp.software.ibm.com/software/data/pubs/papers/loaderperf.pdf

C1 INTEGER GENERATED ALWAYS AS IDENTITY in partitioning key 1909 sec
(198±2)%.

198% degradation means the load runs 3 times as long. Granted, this was for
DB2 Version 8.1, but unless I find some other documentation to the contrary,
I will assume there is still a big penalty even if not still 3 times as
long. Even a 50% degradation is significant when having to load many
millions of rows.
 
Mark A...
Posted: Fri Oct 30, 2009 5:15 am
Guest
"mirof007" <mirof007 at (no spam) gmail.com> wrote in message
news:5a975cbd-e598-4b31-9abb-57e351254fef at (no spam) k13g2000prh.googlegroups.com...
Quote:
Hi Mark, I think this depends on the size of the values cache
associated with a given identity column - i.e., I would expect load to
perform better with larger cache sizes. By default, I believe, DB2
will use a cache of 20 values, so you're synchronizing only once every
20 rows, but the value cache size can be adjusted using ALTER TABLE
ALTER COLUMN.

Regards,
Miro

I am aware of cache for generated columns. You didn't address the question
of generating a hash key in a DPF (Data Partitioning Feature) database. My
question was about the how much extra overhead is there for doing this.

The following doc suggests that there is a 198% "degradation" for generating
the identity column used as a hash partitioning key in DPF. See page
 
mirof007...
Posted: Fri Oct 30, 2009 7:37 pm
Guest
Hi Mark, you're right, I misunderstood the question - missed the part
where the identity column is part of the partitioning key.
Unfortunately I don't have any further info to offer on this.

Miro
 
Ian...
Posted: Tue Nov 03, 2009 10:15 pm
Guest
Mark A wrote:
Quote:
"Mark A" <noone at (no spam) nowhere.com> wrote in message
news:hcdfq0$np6$1 at (no spam) news.eternal-september.org...
I am aware of cache for generated columns. You didn't address the question
of generating a hash key in a DPF (Data Partitioning Feature) database. My
question was about the how much extra overhead is there for doing this.

The following doc suggests that there is a 198% "degradation" for
generating the identity column used as a hash partitioning key in DPF. See
page

Sorry, didn't finish the above post.

Here is the link (See page 35):
ftp://ftp.software.ibm.com/software/data/pubs/papers/loaderperf.pdf

C1 INTEGER GENERATED ALWAYS AS IDENTITY in partitioning key 1909 sec
(198±2)%.

198% degradation means the load runs 3 times as long. Granted, this was for
DB2 Version 8.1, but unless I find some other documentation to the contrary,
I will assume there is still a big penalty even if not still 3 times as
long. Even a 50% degradation is significant when having to load many
millions of rows.

I think that this is an edge case. Using the database to generate a
surrogate key that is part of the table's partitioning key is clearly
not a best (or even recommended) practice. In fact, the DB2 optimizer
guys in Toronto will tell you that using a surrogate key instead of a
natural key isn't even best practice.

I'm not Joe Celko, so if you have to use a surrogate key the world won't
end, but there are much better ways to generate values than using an
identity column. All ETL tools have this, or you can even do it at
the UNIX level. For example (assuming you have a comma-delimited
input file, and you want to prepend the surrogate key to the existing
data):

mkfifo named_pipe
awk -F '{print NR, "," $0}' inputfile > named_pipe &
db2 "load from named_pipe of del insert into fact_table"


If you want to start with something other than 1, replace "NR" in the
awk statement with "NR+x" where x is the offset...
 
 
Page 1 of 1    
All times are GMT
The time now is Thu Dec 03, 2009 4:40 pm