Generation algorithms of Object id

Generating Object id

The most important thing for an id is to never collide with another id for a Nature in a Universe and a partition (Zone and Group), and this in a distributed environment. If not, two distinct entities will share the same id…

The partition axes should be used to ensure segregation in distinct spaces for large scales of Objects ids. They have to be managed by architect, designers and administrators.

The Universe provides a stronger isolation space. This means that an Object id and its partition can collide if they are in distinct universes. However, it is also important to avoid Object id collision within an Universe when promoting objects from a Universe to another. And it also matters to manage the resolution of this case when it occurs.

To address large scale space domains, the generation algorithm have to ensure id uniqueness. The simplest way is to proceed like the GUID policy do, and to build an ID almost randomly in a wide range making collision impossible in real life. The problem for ISA is that a GUID is 128 bits wide, while a Sbyte has only 64 bits. This a very very enormous reduction of magnitude. Therefore this policy cannot be used every where.

Is GUID a solution ?
Despite the fact it is 2 Sbyte wide, using GUID could be a great solution because random generation is quick, flexible and simple, and moreover there is a large experience in using this solution. This approach is also efficient because in memory the pointer remains of 64 bits size and on storage it is just 1 Sbyte overhead per object. On the wire a binary format can be used, so the overhead remain small. For search operations, seeking on just 1 Sbyte first will very very often be sufficient to retrieve or not retrieve a single object. So here also the overhead is not so big as it can be expected. But the GUID specification RFC 4122 tell us:

Distributed applications generating UUIDs at a variety of hosts must be willing to rely on the random number source at all hosts. If this is not feasible, the namespace variant should be used

So the first option of having a central point to generates GUID make us lost the advantage of generation IDs in an independent manner! We just lost the most important advantage. The second option requires us to use namespace. Let see what says the RFC 4122

The mechanisms or conventions used for allocating names and ensuring their uniqueness within their name spaces are beyond the scope of this specification.

OK. And later:

The UUIDs generated from the same name in two different namespaces should be different with (very high probability).

The Universe could be a first class candidate for namespace. However, as we will see at § The Universe subject and Module object id, one of the advantage of using GUID, is the capability to share the same GUID for the same object pushed in different Universes. This simplifies greatly publication from an Universe to another. But this constraint disqualifies Universes in being namespace.

Finally, while seeming seducing at first glance, the usage of GUID does not provides more advantages than not to use them…

When the random generation can be it used ?

The main advantages of random generation are speed and full independence. These are so important, in particular in a distributed architecture, that we must study when it can be used.

The random generation on 64 bits cannot completely avoid collision as GUID can. Therefore the application domain should be compatible with this possibility, even if it is very improbable. These are : other Universes that Production, games, most administration software, document processing, etc. But these are not domains where human life or health can be impacted.

Golden rule: the random generation applicability must be stated by the designers.

The collision probability threshold must be determined. Collision must be so improbable that in real life it will never happen. The threshold is arbitrary fixed to 1 chance over 1 billion. This corresponds to no-collision probability of 99,9999999%.

This problem is known as the birthday problem. According to the table of correspondence of Wikipedia, for 64 bits with $p$ = 10⁻⁹, the random generation applicability is limited to object number of 1.9×10⁵ (190 000) for a given Nature.

Object id can be randomly generated when acceptable and if the number of objects is up to 190 000

What happens in case of collision?
Two major states of the Object have to be considered : In memory or On storage.

In memory.
When objects are in memory the references are all in the memory's address space, even for relationships. The Object id is just a field co-located to the object. So up to be stored, the collision does not result in a problem.On storage
Here, there are several cases that influences the collision detection and the side effects. Let's talk first of the detection.

- If the Object is stored on a single Node the detection is easy.
- If the Object resides on several Nodes the detection depends of the distribution policy: Independent, Replication, Multi-master. The first policy detects the collision very tardily, while the last one detects it before storing. The Replication detects it in a differed manner.

The preventive/corrective action can be either done by system administrator or by an OS reconciliation mechanism.

The side effect depends of the way preventive/corrective actions are supported. A simple policy can be to throw an error to prevent inconsistencies, but it is applicable only on a single node or multi-master nodes. In this case the system is warranted to be consistent, but the users may experiment unexpected errors (rarely).
To maintain consistency, automatic correction can also be supported, but it can be complex to implement in particular for independent distribution policy.

Secure generation

The securing of the Object id generation is the alternative to the random generation. It requires an adhesion between the Nodes producing the Objects. This results in the Object id can have 2 sections:

(A) A Node number unique in the domain space defined by the Zone and the Group.
The Node Object id cannot be used because it has already 64 bits. The collision of Node numbers can be avoided by managing the assignment by the Universe manager.

(B) An incremental section giving the object a sequence number.

The size of these sections may vary depending of the volumes. The axis (A) may be void when (A) is a single node that builds objects. We will therefore study just 2 particular cases :

1. (A) is not valued.
  In this case there is just an increment on a single Node. This is a very simple case, but quite an extreme one within a distributed environment like ISA… Here there are 18 446 744 073 709 551 615 positions without collision.
2. (A) has a value.
  Having at least 1 048 576 possible numbers per Node is considered a minimum (20 bits). The curve below shows the id value domain.

The default values for this policy are 24 bits for the Node number and the remaining 40 bits for the rooms per Node. This lets having up to 16 777 215 nodes each being able to create up to 1 099 511 627 775 objects.

The incremental section (B) can also be divided in 2 sub-sections when :

1. 1. A slot number. The slot size is fixed. This means it has a maximum number of increments. A slot is allocated by a master Node (distributed allocation).
  2. An increment in the slot. It is managed locally by a slave Node.
    Typically this increment have a size of 6 bits. This allows 64 rooms before a slave Node needs asking a new slot to the master Node.
    This policy is very efficient to increase of several order of magnitude the number of involved nodes (~ users) for a similar number of Object id.

To conclude, with these policies, the Object id space should be flexible enough to fulfill almost all common needs.

The Universe and the Module object id

A Universe manages an environment for a given Process scope and a Purpose. It could embrace several Zones. Its main goal is to isolate an instantiation of the ISA architecture. A Universe therefore manages Nodes, Assemblies and the object instances. The Universe is the object in charge of this management. One important thing is to be able to publish artifacts (assembly) from one Universe to another. For instance from the Development Domain 1 Universe to the Integration Internal Universe.

The artifacts to publish are always within an assembly. They are class, Interface, Nomenclature, Sequence, Context, Edition, etc. Which are they more generally?

- Most of them are manufactured Modules.
  This is studied just below.
- But they are also the Settings and the Software Configurations which are Parameters instances. They are able to point to other Types for parametric purposes. Note that the same setting object may be used by several Assemblies so it cannot be dependent to a single assembly.
  The § Parameters ids is dedicated to this case.
- Finally, there are also artifacts that are not Modules but that add some Context to the Modules.
  The § Context ids is dedicated to this case.

The first and the second are both involved in a Namespace hierarchy. This is important to be sure that these artifacts do not collide and therefore are identified unambiguously by humans. The third one is not in a Namespace hierarchy. However, it is contained in one assembly.

In order to simplify the travel of these particular objects thru the Universes and to ensure consistency of theirs links, it is important that the Objects Id remains the same over all Universes (that's why the GUID are rejected for ISA – see above).
It is also important to avoid id collision between several Module and Parameter producers that may be independents. So, this is important because an entity, like a class of a model (Person for instance), is always an instantiation of the meta-class class. This meta-class Class is shared by all the ISA users and in all Universes. As a consequence, avoid that the Id of Module instances collide is important, because the (meta)class is the same for all of us, and therefore do not allows to distinguish Modules when their Id are the same.

The solution to be both unique textually and at the Object id level is achievable via a computed hash code from a full qualified name. All of that for a given Zone and Group name, but not for an Edition of the assembly.

The policy for Modules is to concatenate : The Path and the Edition properties and to get a hash code of it starting at the first significant position and ending at the last significant one.
Note that despite the Zone and Group should be provided when identifying a Class, they are also incorporated in that hash code because they isolate the Class name spaces within the same Organization.

Here is the LL expression for that:

String: identity zone sight: Path; identity sight : Group; Sight: Path; Edition sight:Version, country, language, number.

Example strings :

- An assembly of the company e-Companion Software made by its France department. No grouping. The Assembly is a code generator. It is a component of a larger assembly Pollen UML GUI Designer (a product). It is the second version and its number is 20181025.
  Zone: Path: e-Companion Software\e-CS France. Path: Pollen UML GUI Designer\Code generator. Edition: Version: 2, number: 20181025.
  Significant string for hashing:
  e-Companion Software\e-CS France . Path: Pollen UML GUI Designer\Code generator>. Edition: Version: 2, number: 20181025
  Giving for instance the following Object id for the Assembly:189115859443450773.
- A class of a the above assembly is named Expression producer, its string is
  Zone: Path: e-Companion Software\e-CS France. Path: Pollen UML GUI Designer\Code generator\Expression producer. Edition: Version: 3.
  Significant string for hashing:
  e-Companion Software\e-CS France. Path: Pollen UML GUI Designer\Code generator\Expression producer. Edition: Version: 3
  Giving for instance the following Object id for the Class: 14264874237290243808
- A class named SBOM to XSD extractor not part of a Zone nor of a Group, and also not embedded in an Assembly (this means that this class is only available in Universe having at Process step “Software”).
  Path: SBOM to XSD extractor. Edition: Version: 1.
  Significant string for hashing:
  SBOM to XSD extractor. Edition: Version: 1
  Giving for instance the following Object id for the Class: 15734710905516895901.
- In this class named SBOM to XSD extractor become a component of an Assembly named Utility tools
  Path: Utility tools\SBOM to XSD extractor. Edition: Version: 1.
  Significant string for hashing:
  Utility tools\SBOM to XSD extractor. Edition: Version: 1
  Giving for instance the following Object id for the Class: 11312607599932062204

Within an Organization and Group, classes could share the same Nature (the meta-class Class) if the Object id was generated randomly the birthday problem limits the number of class to 1.9×10⁵ (190 000) for a collision probability with $p$ = 10⁻⁹, and to 6.1×10⁶ for $p$ = 10⁻⁶. However, having a hash code policy for Object id production should enlarge widely this space. If the hash function is good, the birthday problem is the more a lower limit. Finally, the detection of collision should be easy, and solving just needs to change the name. So for almost all organization this policy is applicable.

Parameters ids

The policy for Parameters is necessarily different as they are model objects they are from different classes (not of the Class meta-class for instance). However, the regular policies defined in the parts above are also relevant. ~~Although the way the namespace is built differs, the Object id generation relies on the same principle. The only difference is in fact the implementation details of the Namespace interface.~~

~~Here is the LL expression which remains the same:~~

~~String: identity zone sight: Path; Identity sight : Group; Sight: Path; Edition sight: Version, country, language, number.~~

~~Example strings :~~

- ~~A parameter named Taux de TVA normal (normal VAT rate) issued by the french government CEDEF department within an Assembly. No grouping. It is the 158th version and its number is 20181025.~~
  ~~Zone: Path: France\Gouvernement\Économie\CEDEF. Path: Taux de TVA\Normal. Edition: Version: 158, country: FR, language: fra, number: 20181025.~~
  ~~Significant string for hashing:~~
  ~~France\Gouvernement\Économie\CEDEF. Path: Taux de TVA\Normal. Edition: Version: 158, country: FR, language: fra, number: 20181025~~
  ~~Giving for instance the following Object id for the Parameter: 15130673818144055749~~

However, there is no Namespace because in ISA the parameters are integrated to the OO approach just as some thing similar to static member in languages like C++, in fact as Shared members on an Abstract Data Type metaclass. Therefore they are not in a hierarchy, as they could be in a not-OO approach (character string based resource or property files). So, to conciliate both the regular policy and the OO approach, the Objects used as parameters needs to be Named.

Here is the LL expression :

String: Identity zone sight: Path; Identity sight : Group; Sight: Name.

Example string :

- A parameter named Taux de TVA (VAT rates) having properties like Normal, Deducted. It is issued by the french government CEDEF department within an Assembly. No grouping.
  Zone: Path: France\Gouvernement\Économie\CEDEF. Name: Taux de TVA.
  Significant string for hashing:
  France\Gouvernement\Économie\CEDEF. Path: Taux de TVA
  Giving for instance the following Object id for the Parameter: 14009151622738369810.

Context Ids

Here the Context object are not supposed to have a name, but they are contained in one and only one Assembly thru the Agglomerate relationship. So, they are within a namespace, but it don't have a name or code or anything that identify them in this namespace. That's why they inherit from Time stamped object, in order to have a Creation time stamp to identify it.

Here is the LL expression used to build they unique name:

String: Agglomerate identity zone sight: Path; Agglomerate identity sight : Group; Agglomerate sight: Path; Sight: Creation time stamp.

Example strings :

- An assembly of the company e-Companion Software made by the France department. No grouping. The Assembly is a code generator. It is a component of a larger assembly Pollen UML GUI Designer (a product contains a Context created the 25-oct-2018 at 12:34:56.
  Zone: Path: e-Companion Software\e-CS France . Path: Pollen UML GUI Designer\Code generator. Creation time stamp: 2018-10-25T12:34:56,1234567.
  Significant string for hashing: e-Companion Software\e-CS France. Path: Pollen UML GUI Designer\Code generator. Creation time stamp: 2018-10-25T12:34:56,1234567 Giving for instance the following Object id for the Context:16123954031158963321.

The Universe and the Zone ids

The Zone object ids share some similar constraints with those of the Modules, but are different in several major points.

The collision must be avoided across all Universes, because promoting the objects of a zone “B” in an Universe where there are objects belonging to the zone “A” must be sure. Thus, having a zone “A” with an id “Z1”, one must be sure of the impossibility for a zone “B” coming from another Universe to have the “Z1” as id.

On the other end, a zone is transcendent to Universe and should have the same Id across all Universes.

The collision avoidance has only to be managed for the root zones (Organization, Individual and Zone), because sub-zones (that are owned by an Organization) have an id that is of a scope limited to this Organization. Moreover, as the sub-zones are in limited number, their ids can generated with the random policy.

The number of zone is estimated to 23 billions for one century. This number is huge and the random policy is not applicable. The secure generation could be applied but it requires some one to manage the Node allocation. This one appears difficult for every one to agree on it.
Therefore the Zone id attribution must be a mix of rules decentralized and of centralized. And moreover, preferably, relying or being compatible with existing identification mechanisms.
The proposed solution for ISA is explained below.

A Sbyte allows 1.84 ×10¹⁹ rooms. Therefore 19 decimal digits are available ( from 0 to 9 999 999 999 999 999 999).

This Zone id policy is from left to right:

- Position 19: Format

- - 0 : Geo-coordinates within the headquarter of the organization at its creation time.
  - 1 to 3 : Meaning of the number in the country.

- - - 1 : Individual (typically the social security number).
    - 2 : Company (typically the company registration number).
    - 3 : state (state dependent administration number).

- - 4 to 9 : Reserved.

- Geo-coordinates

- - Position 18: Latitude and Longitude orientation

- - - 0 or 4 : North – East
    - 1 or 5 : North – West
    - 2 or 6 : South – East
    - 3 or 7 : South – West

- - Positions 10 to 17: Latitude DDdddddd using decimal degrees with 6 decimal places (resolution ~0.1 meter at equator – Cf. Wikipedia).
  - Positions 1 to 9: Longitude DDDdddddd using decimal degrees with 6 decimal places (resolution ~0.1 meter at equator – Cf. Wikipedia).
    NB: The collision probability for 2 distinct organizations created in the same office surface of 10m² is 0.0005; which is considered to be acceptable.

- In country number

- - Positions 17 to 19: Country

- - - 3 digits for country code following the ISO 3166-1 numeric codification.
    - The value 000 means out of a defined country, therefore the number has a world wide scope.

- - Positions 1 to 15 : Number

- - - The number in the country.

These identifiers are in the Public identifier property.

Examples:

- Zone id of Luc Laforets in France social security number (NIR) : 1 250 000 1 600 727 366 001
- Zone id of e-Companion Software in France company number (SIREN) : 2 250 000 000 432 532 000
- Zone id of e-Companion Software in France at creation time (93 Rue Jules Guesde, 92300 Levallois-Perret, France) :
  0 0 48 896230 002 289940

The rooms above 1×10¹⁹ are reserved for the free zones (Zone class). There are 8.4 ×10¹⁸ rooms in this space (8 446 744 073 709 551 617). With a random allocation it results for this space a collision probability of $p$ = 10⁻⁶, allowing up to 4.1×10⁶ ( 4 110 169) Zones. That is not enough. Therefore, the Secure generation policy have to be applied. The way this is done is out of the scope of this post. It will be studied with the distribution and search engine subject. In short we can say for now that a Public certification authority should be able to allocate zone id thru the Get free zone id command. If there are 1 000 000 rooms for Public certification authorities and each room can produce 8 446 744 073 709 zone id. This is widely sufficient.

Choice of the hash algorithm

Algorithm

The choose of the good hash algorithm for characters strings is crucial for ISA's Object ids. The post Which hashing algorithm is best for uniqueness and speed at Stack Exchange is very useful to make it, and in particular to rely on the experience of many skilled contributors.

The taking into account of the Merkle–Damgård construction is also a guarantee of quality.

In our case, a good algorithm is one that avoid collision by having an uniform distribution. Speed and reversibility are not really concerns here.

In any case, this approach allows to select a sufficient algorithm even if new findings may revel better ones in the future.

The selected algorithm is MurmurHash2. In fact, the effective implementation is a combination of this algorithm to supports: 64 bits hash code, the Merkle–Damgård construction and to be neutral to big or little endian alignment.

Avoid strings representation bias

To be efficient in any language, including those relying on ideograms, the ISA String element is represented with 2 bytes (16 bits) using the encoding UCS2.

The hash code algorithms make an assumption often not expressed: Source strings representation should be almost uniform. With the UCS2 encoding and with alphabets, like Latin or Cyrillic which are some the most commonly used in the world, this is not the case. The top most byte is very often 0000 0000 and extremely rarely 1111 0000. This results in a bias in the hash code distribution, in particular for algorithms manipulating bits (bit shifting, boolean operators, etc.).

One way to avoid this bias of the hash algorithm is to convert the String Element values to a format taking into account the usage frequency of characters. The most simple way to do that is to convert the characters string to UTF-8 before playing the core of the hash algorithm.

Avoid bias for recurrent postfixes and prefixes in strings

As seen in the examples above, most Sight strings have a common start and a common end. This commonalities results in a constraint on the hash code value domain and therefore introduce a bias. The idea is to avoid that by removing all non significant string parts or characters, in particular at string extremities.

The most common characters are space characters at the beginning and at the end of string. Therefore, leading and ending spacing characters are removed for hash compute (u0020, u0009 and u000D typically). More generally all characters below u0020 are considered to be white space in this case.

The most common end is a dot . at the end. This ending point is removed. If any, the leading point is also removed.

The Sight use the LL convention, so all the leading data name have to be removed. So the hash string begin after the first non white space character following the last : character before the first occurrence of a separating character , ; or .
This implies, for being consistent, that the first lemme in the string is not optional. This is generally the case since the string is constructed from the more general to the more specific. If this is not possible, a dot . character have to be put at the beginning of the string.

Algorithm code

The code of the core of the used algorithm can be found :

Tags: (05 - Warm up), Foundations, Language, Requirement

ISA

Ideal Software Architecture Blog

Generation algorithms of Object id