AI CCTV Cameras 2026: Who Actually Trains the Models
A buyer-side guide to AI CCTV in 2026. Which vendors train their own models, which rebrand object detectors, and how to read the spec sheet without being sold to.

Dr. Raphael Nagel
December 14, 2024

Most cameras sold as AI CCTV in 2026 contain no proprietary intelligence at all. They contain a general-purpose object detector, usually a variant of YOLO or an equivalent open-weight model, wrapped in a branded user interface and resold at a margin that suggests something rarer is inside the box.
This matters because the buyer who signs the order is not buying a camera. The buyer is buying a claim. The claim is that this device will distinguish a delivery driver from an intruder, a pallet jack from a person, a fox from a thief at three in the morning. The camera that fulfils this claim and the camera that fails it look identical on the spec sheet. They differ in one thing only, which is who trained the model and on what data. That question is asked less often than it should be, and answered less honestly than it could be.
The two markets hiding inside one product category
The AI CCTV market in 2026 is not one market. It is two markets sold under the same name, and the spec sheet does not separate them. The first market is the volume market, dominated by manufacturers who source sensors from a handful of Asian foundries, integrate a publicly available detection model, and compete on price per channel. The second market is narrower, populated by manufacturers who maintain their own annotation teams, who collect data from the environments their cameras will be deployed in, and who treat the model as a product rather than a feature.
These two markets produce devices with overlapping data sheets. Both will list resolution, frame rate, low-light sensitivity, an IP rating, and a list of detected object classes. Both will state that artificial intelligence is on board. The first market is competing on hardware metrics because that is where its differentiation lies. The second market is competing on detection quality, which is harder to express in numbers a procurement officer can compare side by side. The buyer who reads only the data sheet will tend to choose the first market, because the first market's data sheets are written to be chosen.
The consequence is predictable. A construction site, a logistics yard, or an industrial perimeter that should have specialised detection ends up with a generic detector trained on the COCO dataset, which contains photographs of cats, dogs, surfboards and dining tables, and a comparatively small share of the situations that matter for security at three in the morning. The detector works, in the sense that it produces bounding boxes. Whether those bounding boxes correspond to events that warrant a response is a separate question, and one that the first market does not answer in the data sheet because answering it would require disclosing training data the manufacturer never collected.
The IEC 62443 family and the NIST Cybersecurity Framework 2.0 both treat the supply chain as a security concern in its own right. That principle extends to the model. A model whose provenance cannot be stated is a supply-chain artefact whose risk cannot be assessed. ISO 27001 controls around supplier relationships apply here in spirit, even if the certifier never opens the camera housing.
What "AI inside" actually means on a 2026 spec sheet
The phrase "AI inside" has been emptied of specific meaning through repetition. To recover a working definition, the buyer must distinguish four layers that the marketing flattens into one. The first layer is the sensor, which converts photons into pixels. The second layer is the image signal processor, which corrects, denoises and compresses those pixels. The third layer is the inference engine, which is the hardware that runs the model, typically a neural processing unit on the same board as the image signal processor. The fourth layer is the model itself, which is a set of weights produced by training on a dataset.
Most spec-sheet entries cover layers one through three: resolution, frame rate, codec, NPU TOPS rating, supported precision formats. These are real numbers and they are useful, but they describe the engine, not the driver. A camera with a six TOPS NPU running a poorly trained model will produce worse detections than a camera with a two TOPS NPU running a model trained on the right data. The buyer who optimises for TOPS is optimising for the wrong variable.
The fourth layer, the model, is where the actual intelligence lives, and it is the layer least disclosed. A specification that lists object classes, "person, vehicle, package, animal," is not a specification of the model. It is a specification of the model's output schema. The same output schema can be produced by a model trained on a million images from construction sites in Northern Europe and by a model trained on a hundred thousand stock images scraped from the internet. The two models will behave very differently when a worker in high-visibility clothing crosses the field of view at dusk, or when a deer enters from the treeline at four in the morning, or when an articulated lorry partly occludes a person crouching behind it.
CISA guidance on operational technology procurement makes the same point in a different domain. The buyer must know what is in the device, who put it there, and how it will behave under conditions the marketing material does not describe. For AI CCTV, the equivalent question is what data the model was trained on, who annotated that data, and how the manufacturer measures performance on conditions that match the deployment environment. A vendor who cannot answer those three questions has not built the model. They have integrated someone else's.
Who actually trains the models
In the current market, the population of manufacturers that train their own models is small. It is smaller than the marketing suggests. A useful test, which the buyer can apply in any vendor conversation, is to ask three questions in sequence and listen to which question produces evasion.
The first question is what model architecture the camera runs. A vendor who has built the model will name it, often a customised variant of a known family, and will discuss why that family was chosen. A vendor who has integrated will name the family and stop there. The second question is where the training data came from. A vendor who has trained will describe data collection partnerships, internal annotation operations, and the volume of frames in the training set. A vendor who has integrated will refer to public datasets, sometimes by name, sometimes vaguely. The third question is how the model is updated when a customer reports a failure. A vendor who controls the model will describe a feedback loop that runs from field report through annotation through retraining through validation through deployment. A vendor who does not control the model will describe a feature request to a software partner.
The third question is the one that separates the two markets most cleanly. Model ownership is not a marketing claim. It is an operational capacity that costs money and takes years to build. A manufacturer who has not built that capacity cannot acquire it in a release cycle. The buyer who needs the camera to work in a specific environment, with specific failure modes, needs a vendor whose model can be improved in response to that environment. That improvement requires the vendor to own the pipeline that produces the model, not merely the device that runs it.
The framework set out in BOSWAU + KNAUER. From Building to Security Technology makes a related point about manufacturer position. A manufacturer who has worked with the technology as a customer, before becoming a producer, knows which failures matter and which do not. That memory shapes the training data, the annotation guidelines and the validation criteria. A reseller who has never operated a camera in the conditions the camera will face cannot make those decisions, because the relevant information is not in the dataset they purchased.
Reading the spec sheet without being sold to
Spec sheets are written to survive comparison. They emphasise numbers that can be lined up against competitors and minimise descriptions that cannot. The buyer who reads them as engineering documents will be misled. The buyer who reads them as marketing documents, looking for what is not said, will be better informed.
Five absences are worth noting. The first is training data provenance. If the document does not state where the model's training data came from, the answer is almost always that the data came from public sources and the model was not customised. The second is false positive and false negative rates under specified conditions. A vendor who measures these will publish them, even if only in a separate technical note. A vendor who does not measure them will not. The third is a retraining cadence. A model that is never retrained is a model that decays as the environment changes. The fourth is edge-case behaviour documentation. How does the camera behave when the lens is partly obscured, when the light source flickers at the frame rate, when the network drops for ninety seconds, when the temperature crosses minus ten? The fifth is an audit log specification. A camera that detects events but does not log the inference inputs and outputs in a form that can be reviewed after the fact is a camera whose decisions cannot be defended in front of an insurer or a court.
NIST SP 800-53 controls around audit and accountability, applied to a video analytics device, would mean that every inference is traceable to a model version, a configuration state, and the raw data that produced it. Few cameras on the market meet that bar today. The buyer who needs to meet it must specify it, because it will not appear by default. The same applies to the ASIS International guidance on security operations, which treats the chain of evidence as part of the system, not as a feature added later.
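The traceability bar described above can be pictured as one record written per inference. The following is a minimal sketch under stated assumptions: the `InferenceRecord` class, its field names, the sample camera and model identifiers, and the choice of SHA-256 are all hypothetical illustrations, not a format defined by any standard or vendor.

```python
import hashlib
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class InferenceRecord:
    """Hypothetical audit entry tying one inference to what produced it."""
    timestamp_utc: str   # capture time of the analysed frame (ISO 8601)
    camera_id: str
    model_version: str   # which set of weights produced the output
    config_sha256: str   # hash of the active rule and threshold configuration
    frame_sha256: str    # hash of the raw frame that was analysed
    detections: tuple    # (label, confidence) pairs exactly as emitted


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def make_record(timestamp_utc, camera_id, model_version, config, frame_bytes, detections):
    # Sorted keys make the hash independent of the order the settings were written in.
    config_bytes = json.dumps(config, sort_keys=True).encode("utf-8")
    return InferenceRecord(
        timestamp_utc=timestamp_utc,
        camera_id=camera_id,
        model_version=model_version,
        config_sha256=sha256_hex(config_bytes),
        frame_sha256=sha256_hex(frame_bytes),
        detections=tuple(detections),
    )


# Illustrative values only; none of these identifiers come from a real product.
record = make_record(
    "2026-01-15T03:02:11Z",
    "yard-cam-07",
    "detector-1.4.2",
    {"min_confidence": 0.6, "zones": ["gate"]},
    b"raw-frame-bytes-placeholder",
    [("person", 0.71)],
)
print(json.dumps(asdict(record), indent=2))
```

A record of this shape lets a reviewer answer, after the fact, which model and which configuration produced a given alert, and check that the stored frame is the one that was analysed. Whether a particular camera can export anything comparable is a question for the vendor.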
A practical reading discipline emerges from these absences. The buyer reads the spec sheet, lists the questions the spec sheet does not answer, and sends those questions to the vendor in writing. The quality of the written response, not the quality of the spec sheet, is the indicator of what the buyer is actually buying. Vendors who train their own models will produce written responses that engage the questions. Vendors who do not will produce written responses that route the questions back to the spec sheet, or to a sales meeting in which the questions can be deflected through conversation.
Where rebranded detection becomes a problem
A generic object detector is not useless. It works for the cases on which it was trained, which are the cases present in the public datasets. For many security applications, those cases are sufficient. The buyer who watches a customer-facing retail floor during business hours, for example, will be reasonably served by a model that recognises people, baskets and packages, because that is what the model was trained on. The trouble begins when the deployment environment differs from the training environment in ways the buyer has not articulated, and the marketing has not surfaced.
Construction sites are one such environment. The training data for generic detectors does not contain enough construction footage at the relevant times of day to produce reliable detections of, for example, a person carrying a length of copper pipe across a partly-lit material yard at three in the morning. The model will produce a detection, but the confidence will be low, and the system will either trigger excessive alarms or suppress real events to keep the false-positive rate manageable. Either failure mode is expensive. The first burns out the operator. The second produces the loss the camera was bought to prevent.
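The trade-off between excessive alarms and suppressed events can be made concrete with a small sketch. The event list below is invented for illustration and the function name `alarm_outcomes` is hypothetical; the point is only to show how moving a single confidence threshold exchanges one failure mode for the other when the model's confidence on real events is low.

```python
# Hypothetical scored events from one night: (model confidence, was it a real intrusion?)
events = [
    (0.92, True), (0.81, False), (0.64, True), (0.58, False),
    (0.55, True), (0.47, False), (0.41, False), (0.38, True),
    (0.30, False), (0.22, False),
]


def alarm_outcomes(events, threshold):
    """Count false alarms and missed intrusions at a given confidence threshold."""
    false_alarms = sum(1 for conf, real in events if conf >= threshold and not real)
    missed = sum(1 for conf, real in events if conf < threshold and real)
    return false_alarms, missed


for threshold in (0.3, 0.5, 0.7):
    false_alarms, missed = alarm_outcomes(events, threshold)
    print(f"threshold={threshold:.1f}  false_alarms={false_alarms}  missed_intrusions={missed}")
```

On this invented data a low threshold raises five false alarms and misses nothing, while a high threshold raises one false alarm and misses three of the four real events. No setting fixes both, because the underlying problem is the overlap in confidence between real and spurious events, which only better-matched training data can reduce.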
Industrial perimeters present a different version of the same problem. Vegetation moves, animals enter the frame, lighting changes through the seasons. A model that has not been trained on continuous footage from comparable sites will treat these variations as anomalies, generating alerts that have no operational meaning. Over time the operator learns to ignore the system, which is the worst possible outcome. The German Federal Office for Information Security (BSI) and the GDV insurance association both note in their guidance that systems which produce ignored alerts increase risk rather than reduce it, because they create an illusion of coverage that displaces real attention.
Logistics yards present a third version. Here the question is often not whether a person is present, but whether the person's behaviour matches the expected workflow. A driver opening a trailer at a scheduled time is routine. The same driver opening the same trailer ninety minutes after the scheduled handover is not. Generic detectors do not see workflows. They see objects. To see workflows, the model must be trained on the workflows, which means the manufacturer must have access to footage from operating logistics environments and the annotation capacity to label it consistently.
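The trailer example above has two parts: recognising that a trailer was opened, and comparing that event with the expected schedule. The sketch below covers only the second part and assumes an upstream model has already recognised the opening event, which is the hard part the paragraph describes. The function name and the thirty-minute tolerance are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta


def classify_trailer_opening(opened_at, scheduled_at, tolerance=timedelta(minutes=30)):
    """Return 'routine' if the opening is within the tolerance of the schedule, else 'review'."""
    if abs(opened_at - scheduled_at) <= tolerance:
        return "routine"
    return "review"


# Hypothetical scheduled handover time.
scheduled = datetime(2026, 3, 2, 22, 0)

on_time = classify_trailer_opening(datetime(2026, 3, 2, 22, 10), scheduled)
late = classify_trailer_opening(scheduled + timedelta(minutes=90), scheduled)
print(on_time, late)
```

The comparison itself is trivial. What it depends on is an event stream that reliably says who opened which trailer and when, and that is what a detector limited to object classes does not provide.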
The point is not that generic detectors are bad. It is that they are general, and security is specific. A camera sold into a specific environment without specific training is a camera whose performance will be a matter of luck. The buyer who accepts that gamble is the buyer who has not asked the third question.
A procurement discipline that reflects the difference
A buyer who has accepted the framing above can structure procurement to reflect it. The structure has three elements. The first is a written specification that includes detection requirements stated in terms of the deployment environment, not in terms of generic object classes. The specification should describe the lighting conditions, the seasonal range, the workflow patterns, the false-positive tolerance and the response time requirement. The vendor's reply to this specification will indicate whether the vendor has the operational capacity to meet it.
The second element is a defined pilot. A pilot is the only mechanism that produces honest data about how a camera performs in the buyer's environment. Demonstrations in the vendor's facility produce data about how the camera performs in the vendor's environment, which is a different thing. A pilot of sixty to ninety days at a representative site, with documented performance metrics agreed before the pilot begins, separates the vendors who can deliver from the vendors who can demonstrate. The pilot costs money. It costs less than a procurement decision based on a demonstration.
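The metrics agreed before a pilot can be as simple as three numbers computed from operator-reviewed alerts. The following is a sketch under stated assumptions: the `PilotTally` class, its field names and the sample totals are hypothetical, and a real pilot would also need an agreed method for finding the events the camera missed.

```python
from dataclasses import dataclass


@dataclass
class PilotTally:
    """Hypothetical pilot totals, after each alert was reviewed by an operator."""
    days: int
    true_alerts: int     # alerts that corresponded to a real event
    false_alerts: int    # alerts with no real event behind them
    missed_events: int   # real events found by other means that raised no alert


def pilot_metrics(tally: PilotTally) -> dict:
    alerts = tally.true_alerts + tally.false_alerts
    real_events = tally.true_alerts + tally.missed_events
    return {
        # Share of alerts that were worth responding to.
        "precision": tally.true_alerts / alerts if alerts else None,
        # Share of real events that the camera actually flagged.
        "recall": tally.true_alerts / real_events if real_events else None,
        # Operator workload created by spurious alerts.
        "false_alerts_per_day": tally.false_alerts / tally.days,
    }


# Invented totals for a ninety-day pilot.
tally = PilotTally(days=90, true_alerts=18, false_alerts=270, missed_events=2)
metrics = pilot_metrics(tally)
print(metrics)
```

With these invented totals the camera catches most real events but only about one alert in sixteen is genuine, at three false alerts a day. Agreeing the acceptable values for each number in writing before the pilot starts is what makes the result usable in a procurement decision.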
The third element is a contractual right to model performance data over the life of the deployment. The camera that performs well in the first quarter is not necessarily the camera that performs well in the eighth quarter. Environments change, threats change, the model itself can drift if the inference path is updated. A contract that requires the vendor to report inference statistics, model versions and performance against agreed metrics is a contract that preserves the buyer's ability to manage the system as an operational asset, rather than as a sealed box.
These three elements correspond loosely to the three paths described in BOSWAU + KNAUER. From Building to Security Technology. The sixty-minute confidential conversation establishes whether a vendor's capacity matches the buyer's environment, before either side commits resources. The three to five day audit produces a written specification grounded in the buyer's actual sites, rather than in the categories the spec sheet provides. The ninety-day pilot tests the specification under operating conditions, with the data needed to make the scaling decision. A buyer who runs these three steps in sequence will not be sold to. They will be informed.
What holds
AI CCTV in 2026 is a category in which the marketing has run ahead of the engineering, and in which the engineering has run ahead of the operational discipline required to use it. The cameras themselves are capable. The models are improving. What remains weak is the buyer's position in the conversation, because the buyer is asked to evaluate claims against a specification format that does not contain the information needed to evaluate them.
The recovery of buyer position requires accepting that the camera is a vehicle for a model, that the model is the product, and that the manufacturer who controls the model is a different category of supplier from the manufacturer who integrates someone else's. This distinction is not visible on the spec sheet. It is visible in the written responses to specific questions, in the willingness to commit to detection performance under stated conditions, and in the existence of a feedback loop from field reports back into training data.
A buyer who needs cameras for a specific environment and a specific risk profile should not be selecting from data sheets. They should be running a sixty-minute conversation with a candidate manufacturer, followed where warranted by a structured audit of their existing sites, followed where warranted by a defined pilot. The three steps cost time. They produce a procurement decision that does not depend on luck.
Frequently asked questions
What is an AI CCTV camera?
An AI CCTV camera is a closed-circuit television device that runs an inference model on board, producing classifications or detections in addition to the video stream. The model typically identifies object categories, sometimes behaviours, and triggers events based on configured rules. The category covers a wide range of capability, from generic object detection running on integrated chips to specialised models trained for specific operational environments. The label "AI" does not by itself indicate either model quality or model provenance, both of which the buyer must determine through direct questions to the manufacturer.
Who actually trains the models?
A minority of manufacturers train their own models. Most integrate publicly available or third-party models, often variants of YOLO or comparable detection architectures, and brand the integrated result as proprietary. The distinction matters because a manufacturer who controls the training pipeline can improve the model in response to specific deployment environments, while an integrator can only request changes from an upstream party. The buyer can identify which category a vendor falls into by asking about training data provenance, annotation operations, and the feedback loop that runs from field reports back into retraining.
How is an AI camera different from a normal one?
A conventional CCTV camera records video for human review or playback. An AI camera adds an inference layer that processes the video in real time, producing structured outputs such as object classifications, event triggers or rule-based alerts. The practical effect is that the AI camera reduces the volume of footage requiring human attention, in principle. Whether it does so in practice depends on the quality of the model, the match between the training data and the deployment environment, and the configuration of the rules. A poorly matched AI camera produces more noise than a conventional camera, not less.
What does AI add over object detection?
Object detection is one capability inside the broader AI category. It produces bounding boxes around classes of objects. AI in a fuller sense extends to behaviour recognition, workflow anomaly detection, multi-camera tracking, and contextual filtering that suppresses expected events while surfacing unexpected ones. The added capabilities depend on the model architecture, the training data and the integration with other sensors. A camera marketed as AI that only performs object detection is offering object detection. The buyer who needs behaviour or workflow analysis must specify it directly, because it will not be present by default in most products sold under the AI label.

About the author
Dr. Raphael Nagel (LL.M.) is founding partner of Tactical Management. He acquires and restructures industrial businesses in demanding market environments and writes on capital, geopolitics, and technological transformation. raphaelnagel.com
The firm is reached at boswau-knauer.de or +49 711 806 53 427.


