IT:Transforming grouping in DCS

From Stiki

Jump to: navigation, search
This page contains sample code to be used as part of a complete program. You may need to adapt this code to make it suitable for your requirements. Please be aware of the disclaimer regarding the use of any code posted on this site.

DCS allows output data to be grouped by certain variables. The output variables that are not part of the grouping set of variables are then grouped together according to the rules specified in the Grouping tab. There are three possible calculation types allowed for any one output variable that is being grouped:

  • Average returns the arithmetic mean of the variable values within the group.
  • Sum returns the arithmetic sum of the variable values within the group.
  • SumLog returns the natural logarithm of the sum of the exponents of the variable values within the group.

These groupings can be further defined by specifying weighting variables, by determining the number of decimal points to show in the output file. The remainder of this article assumes that the weighting settings are not being used.

Any single variable that appears in an output file for a DCS program can only be given one grouping setting. That is to say if you are outputting the variable ANNUAL_PREM in three different output formats, each output format will use the same settings; you cannot define different settings for different output formats (and certainly not different settings for different files that use the same output format).

Contents

Example requiring different grouping calculations

There may be situations where you need to vary the grouping calculation used. For example your data may represent tranches of assurance business, where one policy may have multiple tranches. You wish to output a single record for each policy. The input data you have shows the same sum assured across each tranche, but a different premium value. Your in force output file therefore groups the records by policy number and sets the SUM_ASSURED variable to group by Average. However your new business file just needs to show the amount of the sum assured relating to any new tranches of business, and is calculated for each tranche based on the premium. You need to group the sum assured by Sum in order to get the correct value if there are more than one new business tranche per policy in a set of data.

Solution 1: one program for in force, one program for new business

One solution is to have two separate DCS programs: one that outputs the in force file using Average SUM_ASSURED and one that outputs the new business file using Sum SUM_ASSURED. This is a poor idea however, since it requires duplicate code which must be maintained in parallel, and will require a longer total run time since many of the same calculations will be performed twice.

Solution 2: an additional program to count the number of records per group

Another solution is to have a separate DCS program that only counts how many records there are for each policy. This should be run before the main program, and outputs the results into an external table. This table can then be read in by the main program.

The main program should have SUM_ASSURED set to group by Sum. For new business, the calculated values can be summed to get the correct total amount per policy. For in force business, the sums assured for each record can be divided by the numer of records per group, so that the sum of these adds back up to the policy's sum assured.

However this solution introduces new risks:

  • The additional program may not be run before the main program, which may result in a table of values from a previous run being erroneously used in the current run.
  • The additional program must replicate the validation of the main program, so that invalid records are excluded from its count. Unless the validations are very simple, this may mean that large sections of the main program need to be copied to the additional program, and the two sets of code must then be maintained in parallel.
  • Writing data to and reading data from an external table is relatively slow (compared with performing calculations), and the entire input data file must be processed twice (once by each program) so there will be an increase to the run time if this solution is used.

Solution 3: transform the values to match the grouping calculation

A still better solution is to transform the calculated data so that the variables, when grouped, return the values that you want. In this case, this would involve transforming the calculated value of SUM_ASSURED from Average into Sum for just the new business output file.

This cannot be done by simply dividing each record's SUM_ASSURED by the number of records in the record's group, since at the time that record 1 for a group is being processed we do not know how many records there will be in total for that group. As such we need to make running adjustments to each record's SUM_ASSURED so that the average across all records in that group is actually the sum of the calculated values.

Transforming the grouping calculation

For each of the six possible transformations below, the following terminology is used in the formulae:

A(k) is the actual calculated value for the variable, for record k of a single group (k=1, 2, 3, ...)
G(k) is the transformed value of A(k) so that the grouped value of G(k) gives the required result

G(1) is always equal to A(1). Each formula below then shows how to transform A(k) into G(k) for k>1, using only the values of A(k), k, and an aggregate sum of the preceding values A(1) to A(k-1).

The most common requirement will be to transform between Average and Sum (in either direction). The formulae for SumLog are also described below, although they are necessarily more complex. Do not let any perceived complexity in the mathematical formulae put you off; the sample DCS code for each case is relatively simple to implement.


Transforming Average into Sum

If the variable is usually grouped by Average but you need it to be grouped by Sum, use the following calculation:

G(k)=kA(k)+\sum_{i=1}^{k-1}A(i)

Transforming Average into SumLog

If the variable is usually grouped by Average but you need it to be grouped by SumLog, use the following calculation:

G(k)=k\log\left{e^{A(k)}+\sum_{i=1}^{k-1}e^{A(i)}\right}-(k-1)\log\left{\sum_{i=1}^{k-1}e^{A(i)}\right}
Valid only for A(k)<\log\left{1.7\times10^{308}-\sum_{i=1}^{k-1}e^{A(i)}\right}<709.7268...

Transforming Sum into Average

If the variable is usually grouped by Sum but you need it to be grouped by Average, use the following calculation:

G(k)=\frac{A(k)}{k}-\frac{\sum_{i=1}^{k-1}A(i)}{k(k-1)}

Transforming Sum into SumLog

If the variable is usually grouped by Sum but you need it to be grouped by SumLog, use the following calculation:

G(k)=\log\left{1+\frac{e^{A(k)}}{\sum_{i=1}^{k-1}e^{A(i)}}\right}
Valid only for A(k)<\log\left{1.7\times10^{308}-\sum_{i=1}^{k-1}e^{A(i)}\right}<709.7268...

Transforming SumLog into Average

If the variable is usually grouped by SumLog but you need it to be grouped by Average, use the following calculation:

G(k)=\log\left{e^{\left[\frac{A(k)}{k}-\frac{\sum_{i=1}^{k-1}A(i)}{k(k-1)}\right]}-1\right}+\frac{\sum_{i=1}^{k-1}A(i)}{k-1}
Valid only for \frac{\sum_{i=1}^{k-1}A(i)}{k-1}<A(k)<k\times709.7268...+\frac{\sum_{i=1}^{k-1}A(i)}{k-1}

Transforming SumLog into Sum

If the variable is usually grouped by SumLog but you need it to be grouped by Sum, use the following calculation:

G(k)=\log\left{e^{A(k)}-1\right}+\sum_{i=1}^{k-1}A(i)
Valid only for 0<A(k)<709.7268...

Notes on sample code

Description of variables

  • OUTPUT_VALUE is the variable that needs to be transformed so that its grouped value after transformation (by the grouping defined for that variable on the DCS Grouping tab) equals the required grouped total for the value (by the different grouping calculation as required for that particular value).
  • OUTPUT_VALUE_NEW is the value of OUTPUT_VALUE after transformation. This is then reassigned to OUTPUT_VALUE (after calculation of GROUP_SUM) so that it is output correctly.
  • GROUP_ID represents a variable that uniquely identifies the records belonging to each group. In the example given above this could be POLICY_NUM, say.
  • GROUP_ID_PREV holds the GROUP_ID of the last record considered, so that it can be compared against the GROUP_ID of the current record to see if they are grouped together or not.
  • GROUP_COUNT holds a count of how many records have been found so far in the current group.
  • GROUP_SUM holds an aggregate sum of the value of OUTPUT_VALUE for all of the earlier records in the current group.

Assumptions made in this code

  • The grouping calculation for OUTPUT_VALUE has no weighting variables.
  • The records of the input data are sorted by GROUP_ID. (Complete sorting is not necessary; it is sufficient that for any two records with the same value of GROUP_ID, there are no records between them with a different GROUP_ID that use the section of code performing the transformation. But note that sorting the data is the easiest way to achieve this if the data is not in a suitable order.)
  • Any calculations that depend on the value of OUTPUT_VALUE are performed before the transformation code, so that they are not affected by the transformation.
  • All values of OUTPUT_VALUE (pre-transformation) are within the valid range for that transformation type. (Note that no explicit validation is given for the transformation between Average and Sum (in either direction) since these just involve arithmetic sums, and so are restricted only by the limit on a number that its absolute value is no greater than 1.7×10308).
  • The part of the code conditional on FIRST_RECORD should not be nested within any other conditions, and should be placed before any validation. This is to ensure that it is always called on the very first record of the input data. Otherwise when the transformation code is first called, the value of GROUP_ID_PREV will be undefined and so set to a default value for its variable type. This is an issue if the GROUP_ID for the first record using the transformation code happens to equal this value, or if the variables have not been declared as types and cause a conflict when compared.
  • GROUP_ID is assumed to be a text variable that cannot have a blank value. If this is not the case, set GROUP_ID_PREV to some other appropriate value in the part of the code conditional on FIRST_RECORD (for example 0 or -1 for a numeric variable).
Personal tools