BOSWAU + KNAUER

AI Video Models in the Gulf: Why a Thobe Looks Different from a Suit

Training set bias, regional clothing, gait. Why models trained on US/EU datasets misfire across the Gulf.

Dr. Raphael Nagel

July 27, 2025

A video analytics model is not a universal observer. It is a statistical artefact of the data it was fed, and it carries the silhouette of that data into every frame it ever processes.

The consequence is plain. A detector trained in Munich, Manchester or Mountain View has been taught what a person looks like in a fitted jacket, in jeans, in a high-visibility vest worn over a t-shirt. It has been taught how a person walks when their knees are visible and their arms swing freely beside a torso whose outline is preserved by tailored cloth. It has not been taught the thobe. It has not been taught the abaya. It has not been taught the kandura at the moment when the wearer crouches to inspect a pump skid, nor the ghutra moving in a forty-knot shamal. When the model is shipped to Riyadh, Dammam, Doha or Jubail and asked to count people on a logistics yard, it does not refuse. It guesses. It guesses badly, and it logs those guesses as ground truth.

What the dataset actually contains

Most commercial person-detection models in circulation today inherit their backbone from a small number of public datasets. COCO, ImageNet, Open Images, Cityscapes, Pascal VOC, and a handful of pedestrian benchmarks like Caltech and KITTI form the substrate. These datasets were assembled largely in North America and Western Europe between the mid-2000s and the late 2010s, with smaller contributions from East Asia. Their human subjects wear the clothing of those regions. The vendors who fine-tune on top of these backbones add proprietary data, but the proprietary data tends to come from the markets where the vendor sells, which closes the loop on the original bias rather than opening it.

The result is a feature space in which the prototypical person is defined by certain visual regularities. The head sits on a neck that is visible. The shoulders descend at a specific angle relative to the torso. The torso narrows at the waist. The legs are distinguishable as two columns whose motion can be tracked independently. Hands appear at the ends of arms whose silhouette is preserved by the cut of a sleeve. When any of these regularities is disturbed, the confidence score collapses, and the model either drops the detection or reclassifies the object as something else: a piece of equipment, a tarpaulin, a static occluder.

A thobe disturbs almost every one of those regularities. The garment runs from neck to ankle in a continuous vertical line. The waist is not narrowed. The legs do not appear as two separate columns until the wearer takes a long stride. The arms are sheathed in fabric that hangs in folds when the arm is at rest. In ambient temperatures above forty degrees Celsius, the fabric is loose precisely because it must move air; this looseness is functional, and it is also opaque to a detector that learned the human shape from people wearing winter coats in Toronto. The abaya extends this challenge further, with full sleeves, a covered head and, in many cases, a niqab that removes the facial landmarks the model relies on as a secondary check. NIST's Face Recognition Vendor Test programme has documented false positive rates that differ by one to two orders of magnitude across demographic groups for some algorithms, and the underlying mechanism, dataset composition, applies just as forcefully to whole-body detection.

How clothing affects the detection signal

The failure mode is not abstract. It can be observed in the confidence histograms that any honest integrator extracts from a pilot. On a logistics yard in the Eastern Province, a detector that registers warehouse workers in coveralls at confidence scores between 0.85 and 0.95 will register the same workers in thobes at scores between 0.45 and 0.70, often with the bounding box truncated at the waist because the detector is anchoring on the upper body and discarding the lower portion as background. The detector is not broken. It is doing exactly what its training data instructed it to do, which is to express low confidence about silhouettes it has rarely seen.
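The comparison described above can be sketched in a few lines. This is a minimal illustration, assuming the pilot has already been labelled so that each matched ground-truth person carries a clothing group and the detector's confidence; the function name, the group names and the sample values are hypothetical and are not measurements from any site.

```python
from collections import defaultdict
from statistics import median

def confidence_by_group(detections, threshold=0.5):
    """Summarise detector confidence per clothing group.

    detections: iterable of (group, confidence) pairs, one per matched
    ground-truth person in a labelled pilot clip.
    threshold:  the operating threshold below which a detection is dropped.
    """
    scores = defaultdict(list)
    for group, conf in detections:
        scores[group].append(conf)
    summary = {}
    for group, values in scores.items():
        below = sum(1 for v in values if v < threshold)
        summary[group] = {
            "n": len(values),
            "median": median(values),
            "share_below_threshold": below / len(values),
        }
    return summary

# Illustrative values only, not measurements.
pilot = [
    ("coverall", 0.91), ("coverall", 0.88), ("coverall", 0.93), ("coverall", 0.86),
    ("thobe", 0.62), ("thobe", 0.47), ("thobe", 0.55), ("thobe", 0.68),
]
summary = confidence_by_group(pilot, threshold=0.5)
for group, stats in summary.items():
    print(group, stats)
```

The share of scores below the operating threshold is the figure to watch, because it is the share of real people the downstream tracker never receives.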

There is a second-order effect. Tracking algorithms depend on consistent detections across frames. When confidence drops below the tracker's threshold for even a handful of frames, the identity is lost and a new track is opened. On a site with twenty workers in thobes moving between scaffolding, a tracker that loses identities every three to five seconds opens a fresh track for the same person each time, and can produce a count several times the true count, or, depending on the smoothing logic, a count well below the true count because the tracker discards short-lived tracks as noise. Both errors look plausible in a dashboard. Neither is correct.
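Both directions of the miscount can be reproduced with a toy tracker. The sketch below is a deliberate simplification, not any vendor's tracking logic: it follows one person, treats each frame as detected or missed, and uses two hypothetical parameters, `max_gap` (missed frames tolerated before a track is closed) and `min_len` (tracks shorter than this are discarded as noise).

```python
def count_tracks(detected, max_gap=2, min_len=1):
    """Count the tracks a simple tracker reports for ONE person.

    detected: list of bools, True where the detector fired above threshold.
    max_gap:  consecutive missed frames tolerated before the track is closed.
    min_len:  tracks spanning fewer frames than this are discarded as noise.
    """
    tracks = 0
    start = None      # first frame of the currently open track
    last_hit = None   # most recent frame with a detection
    for i, hit in enumerate(detected):
        if not hit:
            continue
        if start is None:
            start = i
        elif i - last_hit - 1 > max_gap:
            # Gap too long: close the old track, open a new one.
            if last_hit - start + 1 >= min_len:
                tracks += 1
            start = i
        last_hit = i
    # Close the final open track, if any.
    if start is not None and last_hit - start + 1 >= min_len:
        tracks += 1
    return tracks

# One worker, 60 frames. Steady detection versus flickering confidence
# (10 frames detected, 5 frames missed, repeated four times).
steady = [True] * 60
flicker = ([True] * 10 + [False] * 5) * 4

print(count_tracks(steady))                          # one person, one track
print(count_tracks(flicker, max_gap=2))              # same person, over-counted
print(count_tracks(flicker, max_gap=2, min_len=15))  # same person, filtered out
```

With these illustrative inputs the single flickering worker is reported as four tracks without the noise filter and as zero tracks with it. The detector output is identical in both cases; only the smoothing logic differs.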

Gait analysis suffers a parallel collapse. Gait recognition models trained on the CASIA-B, OU-ISIR and TUM-GAID corpora assume that limb motion is at least partially visible. A thobe occludes the lower leg motion until the wearer is moving at speed. The gait signature, which is one of the few biometrics that can survive at a distance and across changing illumination, is degraded to the point where it is no longer a reliable secondary signal. In a region where surveillance distances are long, illumination varies between brutal sun and floodlit night, and faces are often partially covered, the loss of gait as a usable channel is not a minor inconvenience.

What regional retraining actually requires

The remedy is not a slogan. It is a dataset, and the dataset is the expensive part. A defensible regional retraining programme for a Gulf deployment needs, at minimum, several tens of thousands of labelled frames covering thobe, kandura, abaya, ghutra, agal, shemagh and the regional variations of high-visibility workwear worn over traditional dress. The frames must span illumination conditions from pre-dawn to high noon to floodlit night, weather conditions from clear to dust storm, and camera angles from overhead PTZ to ground-level fixed. They must include occlusion patterns specific to the region, such as workers seated on the ground during prayer breaks, which is a posture that Western datasets contain in negligible quantity and which a naive detector classifies as a fallen person, generating false medical alarms.
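One way to keep such a programme honest is to treat the requirements as a coverage matrix and count frames per cell. The sketch below assumes each labelled frame carries three metadata fields; the field names and the category lists are hypothetical and much shorter than a real programme would need.

```python
from collections import Counter
from itertools import product

# Hypothetical, abbreviated category lists; a real matrix would also cover
# weather, further garments and further camera positions.
GARMENTS = ("thobe", "abaya", "coverall")
LIGHT = ("pre_dawn", "noon", "floodlit_night")
ANGLES = ("overhead_ptz", "ground_fixed")

def coverage_gaps(frames, min_per_cell):
    """Return every (garment, light, angle) cell with too few labelled frames."""
    counts = Counter((f["garment"], f["light"], f["angle"]) for f in frames)
    return [
        cell for cell in product(GARMENTS, LIGHT, ANGLES)
        if counts[cell] < min_per_cell
    ]

# Toy inventory: plenty of coveralls at noon, a single thobe frame.
frames = (
    [{"garment": "coverall", "light": "noon", "angle": "ground_fixed"}] * 3
    + [{"garment": "thobe", "light": "noon", "angle": "ground_fixed"}] * 1
)
gaps = coverage_gaps(frames, min_per_cell=2)
print(len(gaps), "of", len(GARMENTS) * len(LIGHT) * len(ANGLES), "cells under-covered")
```

A dataset can be large in total and still empty in exactly the cells the deployment depends on, which is why the count is taken per cell rather than overall.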

The labelling itself is non-trivial. A label schema designed for Western workwear distinguishes person, vest, helmet, glove. A schema fit for purpose in the Gulf needs additional classes and additional attributes, because a worker in a thobe with a hard hat over a ghutra is wearing a stack of garments whose individual presence and correct configuration matters for site safety compliance. The labellers must understand the garments they are annotating. Outsourcing this work to a generic data labelling vendor in a low-cost geography produces label noise that propagates directly into model errors, because the labellers cannot distinguish a correctly worn ghutra from an incorrectly worn one, and the model learns the average of correct and incorrect.
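What "additional classes and additional attributes" might look like can be sketched as a small annotation record. Every class, field and value below is a hypothetical illustration rather than an established schema; attributes for correct configuration, such as whether the hard hat is properly seated over the ghutra, would be added in the same way.

```python
from dataclasses import dataclass
from enum import Enum

class Garment(Enum):
    COVERALL = "coverall"
    THOBE = "thobe"
    KANDURA = "kandura"
    ABAYA = "abaya"

class HeadCover(Enum):
    NONE = "none"
    GHUTRA = "ghutra"
    SHEMAGH = "shemagh"

class Posture(Enum):
    STANDING = "standing"
    CROUCHING = "crouching"
    SEATED_ON_GROUND = "seated_on_ground"
    FALLEN = "fallen"

@dataclass(frozen=True)
class PersonLabel:
    box_xyxy: tuple        # (x1, y1, x2, y2) in pixels
    garment: Garment
    head_cover: HeadCover
    hard_hat: bool
    hi_vis_vest: bool
    posture: Posture

    def __post_init__(self):
        x1, y1, x2, y2 = self.box_xyxy
        if x2 <= x1 or y2 <= y1:
            raise ValueError("box must have positive width and height")

    def layers(self):
        """Annotated garment stack, from the body outward."""
        stack = [self.garment.value]
        if self.head_cover is not HeadCover.NONE:
            stack.append(self.head_cover.value)
        if self.hard_hat:
            stack.append("hard_hat")
        if self.hi_vis_vest:
            stack.append("hi_vis_vest")
        return stack

worker = PersonLabel(
    box_xyxy=(120, 40, 220, 380),
    garment=Garment.THOBE,
    head_cover=HeadCover.GHUTRA,
    hard_hat=True,
    hi_vis_vest=False,
    posture=Posture.STANDING,
)
print(worker.layers())
```

Keeping posture as an explicit attribute lets the annotator separate a worker seated on the ground from a fallen one at labelling time, instead of leaving the distinction to the model.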

ISO/IEC 27001 and IEC 62443 both speak to the integrity of the data on which security decisions are based. NIST CSF 2.0 adds a Govern function, and for any organisation deploying AI in a security context, dataset provenance belongs under it. NIST SP 800-53 control SI-7 requires integrity verification of software, firmware and information, which extends to the inputs of automated decision systems. None of these frameworks tells an operator how to assemble a regional dataset, but each of them obliges the operator to know what is in the dataset and to be able to defend the choice. An operator who deploys a US-trained detector on a Saudi site and cannot describe the gap between training distribution and deployment distribution is not compliant with the spirit of any of them.

The fairness question, reframed for operators

Fairness in computer vision is usually discussed in terms of demographic parity across protected categories. That discussion matters, and the NIST work on face recognition demographics is the canonical reference. For an operator, however, the question lands differently. The operator does not care, in the first instance, about parity as an abstract value. The operator cares whether the false negative rate on workers in traditional dress is higher than the false negative rate on workers in coveralls, because a higher false negative rate on one group of workers means that group is less protected by the system. The same camera that flags an intrusion when a Western-dressed contractor crosses a line at 02:00 fails to flag the same intrusion when a Gulf-dressed contractor crosses the same line, because the detector never registered the second contractor as a person.

The operational measure that matters is the disparity in error rates across the populations actually present on the site. This is measurable. It requires a labelled validation set drawn from the deployment environment, not from the vendor's marketing collateral. It requires that the validation set be large enough to produce statistically meaningful per-subgroup error estimates, which means several thousand examples per subgroup at minimum. It requires that the operator hold the validation set and run the evaluation, rather than accepting the vendor's claim. Any vendor who declines to be evaluated on the operator's data has answered the question already.
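The sample-size requirement can be made concrete with a confidence interval on the per-subgroup false negative rate. The sketch below uses the Wilson score interval, a standard choice for proportions; the function names and the counts are illustrative assumptions, not figures from a real evaluation.

```python
import math

def wilson_interval(errors, n, z=1.96):
    """Wilson score interval for a proportion errors/n (z=1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = errors / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)

def subgroup_fnr(records):
    """records: iterable of (subgroup, detected) pairs, one per ground-truth person.

    Returns {subgroup: (false_negative_rate, (low, high), n)}.
    """
    totals, misses = {}, {}
    for group, detected in records:
        totals[group] = totals.get(group, 0) + 1
        misses[group] = misses.get(group, 0) + (0 if detected else 1)
    return {
        g: (misses[g] / totals[g], wilson_interval(misses[g], totals[g]), totals[g])
        for g in totals
    }

# Same observed miss rate (15%), two validation-set sizes. Illustrative counts.
small = wilson_interval(30, 200)
large = wilson_interval(450, 3000)
print("n=200:  width", round(small[1] - small[0], 3))
print("n=3000: width", round(large[1] - large[0], 3))
```

At two hundred examples the interval is about ten percentage points wide, which is too coarse to tell a real disparity between subgroups from noise. At three thousand it narrows to under three points.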

The author's book, BOSWAU + KNAUER. From Building to Security Technology, argues that a security system which logs every event but decides nothing is an archive, and that a system which decides quickly but documents nothing is a liability. The same logic applies to bias. A system that detects bias but does not act on it is an archive. A system that acts without measurement is a liability. The discipline lies in measuring continuously and acting on the measurements.

What happens in the integration layer

Even with a regionally retrained model, the deployment is not finished. The detector sits inside an analytics stack that includes a tracker, a classifier for behaviour, an event generator and a notification layer. Each of these components has its own assumptions, and each can re-introduce the bias that retraining removed. A behaviour classifier trained to flag loitering uses the duration of a track as one of its inputs. If the tracker fragments tracks more often for workers in thobes, the loitering classifier will flag those workers less often, which sounds like a benefit until the operator realises that genuine loitering by intruders in regional dress is also being missed.

The notification layer compounds the problem in the opposite direction. Operators in the control room develop intuitions about which alerts are real and which are false. If the system generates false medical alarms every time a worker sits on the ground for prayer, the operators will start to discount sit-on-ground alerts. The next time a worker actually collapses, the alert will reach operators who have been trained, by the system's own behaviour, to ignore it. The fix is not to disable the alert. The fix is to retrain the classifier on regional postures, including prayer, so that the alert only fires on genuine anomalies. This is the kind of correction that requires the manufacturer to understand the deployment context, not merely to ship a model.

Integration with the operator's existing systems is the second axis. CISA's guidance on operational technology, IEC 62443's zone and conduit model, and the BSI's recommendations on industrial security all converge on the principle that analytics outputs must be auditable and that the chain from sensor to decision must be traceable. When a detection is wrong because the model was trained on the wrong distribution, the audit trail has to be able to surface that fact. A system that reports a confidence score without reporting the population that score was calibrated against is not auditable in any meaningful sense.
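A minimal sketch of an event record that carries its calibration context follows. The field names, identifiers and sample values are all hypothetical; the only point is that the population a score was calibrated against travels with the score into the audit log.

```python
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class CalibrationInfo:
    dataset_id: str    # identifier of the validation set used for calibration
    population: str    # plain-language description of who is in that set
    n_samples: int

@dataclass(frozen=True)
class DetectionEvent:
    camera_id: str
    timestamp_utc: str
    label: str
    confidence: float
    model_version: str
    calibration: CalibrationInfo

def to_audit_line(event):
    """Serialise one event as a single JSON line for an append-only log."""
    return json.dumps(asdict(event), sort_keys=True)

event = DetectionEvent(
    camera_id="yard-03",
    timestamp_utc="2025-07-27T02:00:00Z",
    label="person",
    confidence=0.52,
    model_version="detector-example-1.0",
    calibration=CalibrationInfo(
        dataset_id="val-example-001",
        population="workers in coveralls, daylight, ground-level cameras",
        n_samples=3000,
    ),
)
line = to_audit_line(event)
print(line)
```

An auditor reading this line can see at once that a score of 0.52 was calibrated on a population that may not match the person in the frame.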

What holds

The clothing question is a stand-in for a larger one. Models inherit the world of their training data, and that world is never neutral. A vendor that ships a US-trained detector to a Gulf operator without disclosing the gap is not selling a product. It is selling a guess, dressed in the confidence intervals of a different geography. The operator who accepts that guess is accepting the risk that comes with it, and the risk falls disproportionately on the workers whose appearance the model has never properly seen.

The remedy is straightforward in principle and demanding in execution. Build or commission a regional dataset. Measure error rates by subgroup on validation data drawn from the actual deployment environment. Require the manufacturer to demonstrate competence on regional populations before installation, not after. Audit the system on a recurring schedule, because populations change and models drift. None of this is novel. It is the same discipline that ASIS International recommends for any security control and that the GDV expects of any installation whose performance affects an insurance premium.

For operators in the Gulf who want to test their current analytics stack against the regional reality, Path I is a sixty-minute confidential conversation in which the gap between training distribution and deployment distribution can be sketched honestly. For operators ready to quantify the gap, Path II is a three-to-five-day audit that produces a written report including subgroup error rates on a validation set drawn from the operator's own sites. For operators who want to see a regionally retrained system run on their ground for a defined period, Path III is a ninety-day pilot at one location with success criteria agreed before installation.

Frequently asked questions

How does clothing affect detection?

Clothing changes the silhouette the detector learned to recognise. A thobe or abaya removes the waist narrowing, hides the leg columns and softens the arm outline, all of which are features the model uses as primary cues. Confidence scores drop, bounding boxes truncate at the waist, and trackers lose identity continuity between frames. Gait analysis degrades because lower limb motion is occluded. The detector is not malfunctioning. It is reporting low confidence about a silhouette it has rarely seen in training, which propagates into miscounts, false negatives and unreliable behaviour classification.

Are regional datasets available?

Public regional datasets covering Gulf clothing at the scale and quality required for production deployment are scarce. Academic datasets exist for specific narrow tasks, and some Gulf research institutions have begun publishing, but the volume falls short of what a commercial detector requires for robust retraining. Most operational regional datasets are proprietary, assembled by integrators or end users who recognised the gap and invested in closing it. Operators should expect to commission or co-fund dataset construction rather than to download a ready solution, and should ask for documented evidence behind any vendor claim of regional coverage.

Who labels them?

Labelling regional datasets requires annotators who understand the garments, postures and work practices being annotated. Generic labelling vendors in low-cost geographies produce high label noise on Gulf imagery because the annotators cannot reliably distinguish correct from incorrect garment configurations, prayer postures from medical incidents, or cultural variations within a single garment category. Defensible labelling pipelines combine regional annotators with quality control by domain experts, double-labelling on a sampled subset, and explicit label schemas that capture the attributes relevant to safety and security decisions, not only the object class.

How is fairness measured?

Fairness is measured by comparing error rates across the subgroups actually present in the deployment environment, using a validation set the operator controls. The relevant metrics are per-subgroup false negative rate, false positive rate and tracker identity persistence, computed on imagery the model has not seen in training. Sample sizes must be large enough to give statistically meaningful per-subgroup estimates, typically several thousand examples per subgroup. The operator runs the evaluation, not the vendor. Fairness is then re-measured at intervals because populations, garments and site conditions drift, and a model that was fair at installation can become unfair within months.

About the author

Dr. Raphael Nagel (LL.M.) is founding partner of Tactical Management. He acquires and restructures industrial businesses in demanding market environments and writes on capital, geopolitics, and technological transformation. raphaelnagel.com

Since 1892.

The firm is reached at boswau-knauer.de or +49 711 806 53 427.