Selecting the Modulo for an Index

There are two ways of working out the best modulo for an index:

Use the algorithm described in the section Calculating the Modulo. You can use statistics from the file concerned or, for new files, estimates based on the type of index and number of items you expect it to contain.
For existing indexes, you can use the ISTAT command with the J option to try different modulos. The section ISTAT (J and how it's helpful in Resizing contains examples of correctly-, over- and under-sized indexes and gives resizing guidelines.

Calculating the Modulo

The method shown below allows you to calculate the modulo for an index. However, the type of data also needs to be considered. If the size of the file is expected to change, you should use estimated values.

The following calculations provide only approximate values and assume that at least 200 items are to be indexed.

Estimate the following:
- The number of items to be indexed (numItems). For an existing file, you can use the COUNT command (with any selection criteria specified in the index definition) to obtain this information.
- The average number of values/subvalues on which the index is based (mvsvAttrCount). This depends on the BY clauses used to define the index:
  - If there are no BY-EXP or BY-EXP-SUB clauses, mvsvAttrCount = 1.
  - If there are BY-EXP, but no BY-EXP-SUB, clauses, mvsvAttrCount = the average number of multi-values in the attribute(s) referenced by BY-EXP clauses
  - If there are BY-EXP-SUB clauses, mvsvAttrCount = the average number of sub-values in the attributes referenced by BY-EXP-SUB clauses (or, equivalently, the average number of multi-values in those attributes multiplied by the average number of sub-values per multi-value).
- The average item-id size in the data section (itemidSize). For an existing file, you can obtain this information from a file statistics report.
- The average key size in the index (defKeySize). You can estimate this by adding up the average lengths of all attributes (for BY clauses), multivalues (for BY-EXP clauses) and sub-values (for BY-EXP-SUB clauses) referred to in the index definition.
Use the index modulo calculator to calculate an approximate modulo. You must supply the frame size and the values that you estimated in step 1.
Pass the result to the SHOW-MODULI TCL command to obtain the next prime number.

Note: A modulo of 11 is not recommended.

The index modulo calculator uses the following algorithm:

Choose the appropriate overhead value (overhead), depending on the level of complexity of the index:
- If the index references only attributes (that is, no BY-EXP or BY-EXP-SUB clauses are used in the index definition), overhead = 0
- If the index references multivalues (that is, the definition includes BY-EXP, but no BY-EXP-SUB clauses), overhead = 5
- If the index references subvalues (that is, the definition includes BY-EXP-SUB clauses), overhead = 10
Calculate the number of index keys from the number of items to be indexed:

numKeys = numItems * mvsvAttrCount
Calculate the average key size by adding together your estimated key size, item-id size and the overhead value selected in 1 above:

averageKeySize = defKeySize + itemidSize + overhead
Calculate the average available node size by taking half the frame size, allowing for the size of an item header and then taking 90% of the result:

averageAvailableNodeSize = INT(((frameSize / 2) - 27) * 0.9)

where INT( ) means 'Integer part of'.
Calculate the number of keys per node from the average available node size and the average key size. There will be a minimum of six keys per node:

keysPerNode = INT(averageAvailableNodeSize / averageKeySize)
if keysPerNode < 6 then keysPerNode = 6
Calculate the number of leaf nodes by dividing the number of keys by the number of keys per node. Allow an additional 10% for branch nodes:

nodes = INT(INT(numKeys / keysPerNode) * 1.1)
The modulo depends on whether the nodes are in-group or out-of-group. The nodes will be out of group if the the average key size times number of keys per node is greater than half the frame size. In this case, the size of the in-group part of an out-of-group item is about 20 bytes, so the modulo is the number of 20 byte nodes that will fit in a frame.

Otherwise, the nodes will be in-group – we can assume that each group will contain two in-group nodes:
If (averageKeySize * keysPerNode) > (frameSize / 2) then
modulo = INT(nodes / INT(frameSize / 20))
else
modulo = INT(nodes / 2)
Add 20% to allow for bad hashing and growth:

modulo = INT(modulo * 1.2)
Pass the result to the SHOW-MODULI TCL command to obtain the next prime number.

Example 1

frameSize = 1Kb = 1024 bytes
numItems = 1000 items
mvsvAttrCount = 1 (attributes are not multivalued)
itemidSize = 10 bytes
defKeySize = 40 bytes
overhead = 0 (attributes are not multivalued)

numKeys = 1000 * 1 = 1000
averageKeySize = 40 + 10 + 0 = 50
availableNodeSize = (1024 / 2) – 27 = 512 – 27 = 485
averageAvailableNodeSize = INT(485 * 0.9) = 436
keysPerNode = INT(436 / 50) = 8
leafNodes = INT(1000 / 8) = 125
nodes = INT(125 * 1.1) = 137

(50 * 8) = 400 < 512 (frameSize / 2)
Therefore modulo = INT(137 / 2) = 68
modulo = next-prime(INT(68 * 1.2)) = next-prime(81) = 83

Example 2

frameSize = 4Kb = 4096 bytes
numItems = 1000 items
mvsvAttrCount = 1 (attributes are not multivalued)
itemidSize = 50 bytes
defKeySize = 50 bytes
overhead = 0 (attributes are not multivalued)

numKeys = 1000 * 1 = 1000
averageKeySize = 50 + 50 + 0 = 100
availableNodeSize = (4096 / 2) – 27 = 2048 – 27 = 2021
averageAvailableNodeSize = INT(2021 * 0.9) = 1818
keysPerNode = INT(1818 / 100) = 18
leafNodes = INT(1000 / 18) = 55
nodes = INT(55 * 1.1) = 60

(100 * 18) = 1800 < 2048 (frameSize / 2)
Therefore modulo = INT(60 / 2) = 30
modulo = next-prime(INT(30 * 1.2)) = next-prime(36) = 37

Example 3

frameSize = 1Kb = 1024 bytes
numItems = 1000 items
mvsvAttrCount = 1 (attributes are not multivalued)
itemidSize = 50 bytes
defKeySize = 100 bytes
overhead = 0 (attributes are not multivalued)

numKeys = 1000 * 1 = 1000
averageKeySize = 100 + 50 + 0 = 150
availableNodeSize = (1024 / 2) – 27 = 512 – 27 = 485
averageAvailableNodeSize = INT(485 * 0.9) = 436
keysPerNode = INT(436 / 150) = 2
keysPerNode < 6. Therefore keysPerNode = 6
leafNodes = INT(1000 / 6) = 166
nodes = INT(166 * 1.1) = 182

(150 * 6) = 900 > 512 (frameSize / 2)
Therefore modulo = INT(182 / INT(1024 / 20)) = INT(182 / 51) = 3
modulo = next-prime(INT(3 * 1.2)) = next-prime(3) = 3

Example 4

frameSize = 1Kb = 1024 bytes
numItems = 1000 items
mvsvAttrCount = 5 (multivalues per attribute)
itemidSize = 50 bytes
defKeySize = 100 bytes
overhead = 5 (attributes are multivalued)

numKeys = 1000 * 5 = 5000
averageKeySize = 100 + 50 + 5 = 155
availableNodeSize = (1024 / 2) – 27 = 512 – 27 = 485
averageAvailableNodeSize = INT(485 * 0.9) = 436
keysPerNode = INT(436 / 155) = 2
keysPerNode < 6. Therefore keysPerNode = 6
leafNodes = INT(5000 / 6) = 833
nodes = INT(833 * 1.1) = 916

(155 * 6) = 930 > 512 (frameSize / 2)
Therefore modulo = INT(916 / INT(1024 / 20)) = INT(916 / 51) = 17
modulo = next-prime(INT(17 * 1.2)) = next-prime(20) = 23

ISTAT (J and how it can help in Resizing

This section shows the results you might expect when you use ISTAT’s M command to try different modulos, and gives you some guidelines for choosing a suitable modulo. The examples use a file called EMPLOYEE containing 5000 items, one for each employee, on a database with a 1Kb frame size. Attribute 6 of the file has been used to build a simple index, NAME, which contains the employee's surname. The index contains about 1000 different names.

The following commands were used to obtain a display of the index statistics:

:ISTAT EMPLOYEE (J
Index name:NAME

After the initial display, the following command was entered:

Command: N100

This sets the maximum value of N to 100 (see below).

Correctly-sized Index

This example shows the output from ISTAT with the modulo set to 127 using the ISTAT command M127.

File='EMPLOYEE' Index='NAME' Modulus=127 Frame-size=1008 Hash-type=2

        ¦        Number of groups with :-
     N  ¦  N items   N used IG frames   N % usage of IG frames

     0  -     13            13            13
     1  -     44           114             0
     2  -     70             0             0
    49  -      0             0            24
    50  -      0             0            17
    53  -      0             0             1
    72  -      0             0             1
    93  -      0             0             1
    97  -      0             0             7
    98  -      0             0            29
    99  -      0             0            27
   100+ -      0             0             4

Total items=        184, Total bytes=     89724, Total frames=       127
IG bytes=         89724, Overflow frames=     0, Wasted bytes=     38292 (29%)
OG items=             0, OG frames=           0, Groups used=        114

Refer to the description of the ISTAT command for details of the information shown in this report.

The example above tells you the following:

Column 2: N items
Out of a total of 127 groups (the modulo equals 127), the file contains 13 empty groups, 44 groups with 1 item and 70 groups with 2 index node structures. About 10% of the groups are therefore empty, with no groups containing more than 2 node structures.

Column 3: N Used IG frames
The file contains 13 empty groups, and 114 groups that use a single frame. No groups have used any overflow frames.

Column 4: N % usage of IG frames
13 frames are empty (about 10%), 41 (24 + 17) frames (about 32%) are about 50% full and 64 (1+7+29+27) frames (about 50%) are over 90% full. 4 frames (about 3%) are full and overflow frames might therefore be needed if the file increases in size.

Undersized Index

This example shows the output from ISTAT with the modulo set to 29.

File='EMPLOYEE Index='NAME' Modulus=29 Frame-size=1008 Hash-type=2
        ¦        Number of groups with :-
     N  ¦  N items   N used IG frames   N % usage of IG frames

     0  -      0             0             0
     3  -      0            19             0
     4  -      0            10             0
     5  -      2             0             0
     6  -     17             0             0
     7  -      8             0             0
     8  -      2             0             0
    82  -      0             0             2
    83  -      0             0             1
    86  -      0             0             5
    87  -      0             0             3
    92  -      0             0             2
    98  -      0             0            13
    99  -      0             0             3
   100+ -      0             0             0

Total items=        184, Total bytes=     89724, Total frames=        97
IG bytes=         89724, Overflow frames=    68, Wasted bytes=      8052 (8%)
OG items=             0, OG frames=           0, Groups used=         29

This example shows the following:

Column 2: N items
This shows that all groups contain between 5 and 8 index node structures. The node structures are therefore fairly evenly distributed across the groups.

Column 3: N Used IG frames
This shows that 19 groups are using 2 linked overflow frames and 10 are using 3. Every group in the base allocation has therefore overflowed.

Column 4: N % usage of IG frames
This shows that all groups are over 82% full, with most over 90% full.

The summary at the bottom of the report shows that the 184 index node structures are badly overflowing the 29 base frames. A total of 97 frames are in use, 68 of which are overflow frames. This is the key indicator for this index, in that the total frame count is over three times the base group allocation.

Oversized Index

This example shows the output from ISTAT with the modulo set to 383.

File='EMPLOYEE' Index='NAME' Modulus=383 Frame-size=1008 Hash-type=2
        ¦        Number of groups with :-
     N  ¦  N items   N used IG frames   N % usage of IG frames

     0  -    201           201           201
     1  -    180           182             0
     2  -      2             0             0
    22  -      0             0             1
    45  -      0             0             1
    48  -      0             0             8
    49  -      0             0            84
    50  -      0             0            86
    53  -      0             0             1
    76  -      0             0             1

Total items=        184, Total bytes=     89724, Total frames=       383
IG bytes=         89724, Overflow frames=     0, Wasted bytes=    296340 (76%)
OG items=             0, OG frames=           0, Groups used=        182

This example shows the following:

Column 2: N items
This shows that most frames are empty, so the distribution of index node structures is very poor.

Column 3: N Used IG frames
This shows that about half of the frames are empty, with none overflowed.

Column 4: N % usage of IG frames
This also shows that about half of the frames are empty, with most of the remaining frames about 50% full.

The summary at the bottom of the report shows that the 184 index node structures are spread thinly across the base group allocation, with 201 (52%) of the base group allocation empty.

What to consider when resizing

Size for uniform item distribution.

You should aim to spread the number of items in a file uniformly across all groups. The more even the distribution, the better the compromise between disk space used and performance.
When to upsize.

If the index uses a large number of overflow frames, performance could be affected. You should consider increasing the size of the index when the number of overflow frames used is more than 50% of the base frame allocation. As a guide, increase the modulo by half the number of overflow frames.
When to downsize.

If the majority of in-group frames are empty this is wasting disk space. However, as index node structures are only accessed randomly, performance is unlikely to be affected.

Nevertheless, if more than 25% of the base frame allocation is empty, you should consider reducing the size of the index. As a guide, you should reduce the modulo by about half the number of empty frames.

File Statistics for Indexes

File statistics listings generated by routine database backup procedures show index sections after the associated data section, separated by colons (:). These statistics allow you to see what index sections exist and how much space they take up.

The LISTFILES command, which lists all files defined via a D-pointer from the current account (or a named dictionary), also lists index sections.