Image recognition: from the early days of technology to endless business applications today.
Artificial intelligence and machine learning are hot topics within companies today and will transform virtually every economic activity in the coming years. One of the AI applications that has long appealed to the imagination is machines that, by analogy with the human brain, can process, analyze and give meaning to images. Think of the science fiction cult classic Robocop from the 1980s, in which police officer Alex Murphy scans his environment through a robot visor and thus distinguishes between innocent children and dangerous gangsters. Or, much more recently and no Hollywood fiction: computer vision systems that are built into cars and analyze the environment in depth, enabling self-driving cars.
Everyone has heard of terms such as image recognition, object recognition and computer vision. The first attempts to build such systems, however, date back to the middle of the last century, when the foundations for the high-tech applications we know today were laid. In this blog, we take a look at the evolution of the technology to date. Subsequently, we go deeper into the concrete business cases that are now within reach with the current technology. And finally, we take a look at how image recognition use cases can be built within the Trendskout AI software platform.
The emergence and evolution of AI image recognition as a scientific discipline
Pioneers from other fields formed the basis for the discipline.
The first steps towards what would later become image recognition technology were taken in the late 1950s. Reference is often made to an influential 1959 paper by neurophysiologists David Hubel and Torsten Wiesel. However, this work in itself had nothing to do with building software systems. In their publication “Receptive fields of single neurons in the cat’s striate cortex”, Hubel and Wiesel described the main properties of visual neurons and how the visual experiences of cats shape the cortical architecture. During their experiments on cats, they discovered, by chance, simple and complex neurons in the primary visual cortex, and found that visual processing always starts with simple structures such as easily distinguishable edges of objects. More complex structures in an image are then processed step by step. To this day, this principle is still the core idea behind the deep learning technology used in computer-aided image recognition.
Another milestone from the same period was the invention of the first digital photo scanner. A group of researchers led by Russell Kirsch developed a machine that made it possible to convert images into grids of numbers. It is thanks to their groundbreaking work that we can process digital images in various ways today. One of the first photos to be scanned was a photo of Kirsch's son Walden. It was a small, grainy photo captured at 30,976 pixels (176 × 176), but it has become an iconic image today.
Growing into an academic discipline
Lawrence Roberts is often referred to as a true founder of computer vision applications as we know them today. In his 1963 doctoral thesis, entitled “Machine perception of three-dimensional solids”, Roberts describes the process of deriving 3D information about objects from 2D photographs. The initial intention of the program he developed was to convert 2D photos into line drawings, and on the basis of these line drawings to then build 3D representations with the invisible lines omitted. In his thesis he described the processes that had to be followed to convert a 2D construction into a 3D one, and how a 3D view can subsequently be converted back into a 2D view. The processes described by Roberts proved to be an excellent impetus for later research into computer-aided 3D systems.
In the 1960s, artificial intelligence became a fully-fledged academic discipline. Among both researchers and believers outside the academic field, there was unbridled optimism about what the future of AI would bring. A number of researchers were convinced that within 25 years a computer would be built that would outperform humans in intelligence. Today, more than 60 years later, this is still not the case. Seymour Papert was one of these outspoken optimists.
Papert was a professor at the AI lab of the renowned Massachusetts Institute of Technology (MIT), and in 1966 he launched the “Summer Vision Project”. The intention was to tackle, during the summer months and with a small group of MIT students, the challenges and problems that the image recognition domain faced. The students had to develop a system that automatically segmented foreground from background and could extract non-overlapping objects from photos. The project fizzled out, and even today, despite undeniable progress, major challenges around image recognition remain. Nevertheless, this project was seen by many as the official birth of AI-based computer vision as a scientific discipline.
From hierarchy to neural network
Fast forward to 1982 and the moment when David Marr, a British neuroscientist, published the influential paper “Vision: A computational investigation into the human representation and processing of visual information”. This built on the earlier insight that image processing does not start from holistic objects, and Marr added another important one: the visual system works in a hierarchical manner. He stated that the main function of our visual system is to create 3D representations of the environment so that we can interact with it, and he introduced a framework in which low-level algorithms that detect edges, corners and curves are used as stepping stones towards understanding higher-level visual data.
At about the same time, a Japanese scientist, Kunihiko Fukushima, built a self-organizing artificial network of simple and complex cells that could recognize patterns and be unaffected by changes in position. This network, called Neocognitron, consisted of several convolutional layers whose receptive fields had weight vectors, known as filters. These filters moved across input values (such as image pixels), performed calculations, and then triggered events that were used as input by subsequent layers of the network. Neocognitron can thus be labeled as the first neural network to earn the label “deep” and is therefore rightly regarded as the ancestor of today’s convolutional networks.
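The core mechanism of such a convolutional layer can be made concrete with a small sketch. The code below is a deliberately minimal, pure-Python illustration (no ML framework, toy data) of the idea that a filter of weights slides across the image and responds strongly wherever its pattern occurs — here, a vertical edge.

```python
# A minimal sketch of a convolutional filter: a small grid of weights slides
# across the image and produces a high response where its pattern occurs.

def convolve2d(image, kernel):
    """Slide `kernel` over `image` (valid mode) and return the response map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            acc = 0
            for di in range(kh):
                for dj in range(kw):
                    acc += image[i + di][j + dj] * kernel[di][dj]
            row.append(acc)
        output.append(row)
    return output

# A tiny 4x4 "image" with a vertical edge between dark (0) and bright (1) pixels.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]

# A 2x2 vertical-edge filter: responds where brightness jumps from left to right.
edge_filter = [
    [-1, 1],
    [-1, 1],
]

response = convolve2d(image, edge_filter)
print(response)  # [[0, 2, 0], [0, 2, 0], [0, 2, 0]]
```

The strongest responses line up exactly with the edge in the middle of the image; in a real convolutional network, many such filters are learned automatically, and their response maps feed the next layer.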
From 1999 onwards, more and more researchers began to move away from the path Marr had set out, and attempts to reconstruct objects using 3D models were halted; efforts oriented instead towards feature-based object recognition. David Lowe's paper “Object Recognition from Local Scale-Invariant Features” was an important marker of this shift. It describes a visual image recognition system that uses features which are invariant to rotation, location and lighting. According to Lowe, these features resemble the properties of neurons in the inferior temporal cortex that are involved in object detection processes in primates.
Mature technology, widely applicable
Since the 2000s, the focus has shifted to recognizing objects. An important evolution took place in 2006 when Fei-Fei Li (a Princeton alumna, today Professor of Computer Science at Stanford) decided to establish Imagenet. At the time, Li was struggling with a number of obstacles in her machine learning research, including the problem of overfitting. Overfitting occurs when a model learns the peculiarities of a limited data set rather than generalizable patterns; the danger is that the model remembers noise instead of relevant features. Because image recognition systems can only recognize patterns based on what they have already seen during training, this results in unreliable performance on new data. The opposite problem, underfitting, causes the model to over-generalize and fail to capture the relevant patterns in the data.
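Overfitting in its most extreme form can be illustrated with a toy sketch: a "model" that simply memorizes its training examples scores perfectly on data it has seen, but is useless for anything new. The data and labels below are hypothetical, chosen only to make the point.

```python
# An extreme illustration of overfitting: pure memorization of training data.
# Toy "images" are reduced to a single feature value, each with a label.
train = {(0.10,): "cat", (0.12,): "dog", (0.90,): "dog", (0.95,): "dog"}

def memorizing_model(x):
    """Overfitted model: recalls the exact training label, fails otherwise."""
    return train.get(x, "unknown")

# Perfect score on the training set...
train_accuracy = sum(memorizing_model(x) == y for x, y in train.items()) / len(train)
print(train_accuracy)  # 1.0

# ...but no generalization at all: a new sample very close to known ones fails.
new_sample = (0.11,)
print(memorizing_model(new_sample))  # unknown
```

Real overfitting is subtler than a lookup table, but the failure mode is the same: excellent training metrics, unreliable behavior on data the model has never seen — which is exactly what a larger, more varied data set like Imagenet helps to prevent.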
To overcome these hurdles and help machines make better decisions, Li decided to build an improved data set. The project, which was given the name Imagenet, started in 2007. Barely three years later, Imagenet comprised more than 3 million images, all carefully labeled and segmented into more than 5,000 categories. This was only the beginning, and it grew into a huge boost for the entire image and object recognition field.
In order to further increase visibility, Imagenet organized its first Imagenet Large Scale Visual Recognition Challenge (ILSVRC) in 2010, in which algorithms for object detection and classification were evaluated at scale. Thanks to this competition, there was another major breakthrough in the domain in 2012. A team from the University of Toronto came up with Alexnet (named after Alex Krizhevsky, the scientist who led the project), which used a convolutional neural network architecture. In the first year of the competition, the overall error rate of the participants was at least 25%. With Alexnet, the first entry to use deep learning, the error rate dropped to 15.3%. This success unleashed the enormous potential of image recognition as a technology. By 2017, the competition's error rate had fallen below 5%.
This meant much more than just winning a competition. It proved that training on Imagenet could give models a big head start, so that only fine-tuning was required to perform other recognition tasks. Convolutional neural networks trained in this way are closely linked to transfer learning. These networks are now used in many applications, such as Facebook's automatic tag suggestions in photos.
Current image recognition technology and its business applications
Quality control and inspection in production environments
The sector in which machine or computer vision applications are most often used today is the production or manufacturing industry. In this sector, the human eye has been, and continues to be, called upon to carry out controls on product quality. Experience has shown that the human eye is not infallible and that external factors such as fatigue can have an impact on the results. These factors, combined with ever-increasing labor costs, ensured that computer vision systems quickly found their way into this sector.
Image recognition applications lend themselves perfectly to discovering abnormalities or anomalies on a large scale. Machines can be trained to detect imperfections in paintwork or foods with rotten spots that do not meet the expected quality standard. Another popular application is the completeness check during packing, where the machine verifies that each part is present.
Surveillance and security applications
Another application for which the human eye is often called upon is surveillance via camera systems. Often several screens need to be monitored constantly, which requires permanent concentration. Image recognition allows a machine to be taught to recognize events, such as intruders who do not belong in a specific location. Apart from this security aspect of surveillance, there are many safety use cases as well. For example, pedestrians or other vulnerable road users on industrial sites can be located to prevent incidents involving heavy equipment.
Asset management and project monitoring in energy, construction, rail or shipping
Large installations or infrastructure require immense efforts with regard to inspection and maintenance, often at high altitudes or in other hard-to-reach places, underground or even underwater. Small defects on large installations can escalate and cause major human and economic damage. Vision systems can be perfectly trained to take over these often risky inspection tasks.
Defects such as rust, missing bolts and nuts, damage or objects that do not belong can be located and identified. The results of this image recognition analysis can in turn serve as data sources for broader predictive maintenance cases. By combining AI applications, not only can the current state be mapped, but the data can also be used to predict future defects or fractures.
Mapping the health and quality of crops
Image recognition systems are also booming in the agricultural sector. Crops can be monitored for overall condition, for example by mapping which insects are found on crops and at what concentration. In this way, diseases can be predicted. Drone and even satellite images are increasingly being used to map large areas of crops. Based on light reflection and shading, invisible to the human eye, chemical processes in plants can be detected and crop diseases identified at an early stage, enabling proactive intervention and avoiding greater damage.
Automation of administrative processes
In many administrative processes, large efficiency gains can still be made by automating the processing of orders, order forms, emails and forms. A number of AI techniques, including image recognition, can be combined for this. Optical Character Recognition (OCR) is a technique that can be used to digitize texts. However, OCR lacks a smart component that gives meaning to the data; AI techniques such as named entity recognition are therefore used to detect entities in the texts. Even more is possible in combination with image recognition techniques, such as automatically scanning containers, trucks and ships based on the external markings on these means of transport.
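The "smart component" on top of OCR can be sketched as follows. Real systems use trained named entity recognition models; in this simplified illustration a few regular expressions stand in for them, and the input text, field names and document format are all hypothetical — the OCR step itself is assumed to have already produced the raw text.

```python
# A simplified stand-in for named entity recognition on OCR output:
# extract structured entities from the raw text of a scanned order form.
import re

ocr_text = "Order ORD-20431 from ACME BV, delivery date 2024-03-15, 12 pallets."

entities = {
    "order_number": re.search(r"\bORD-\d+\b", ocr_text).group(),
    "date": re.search(r"\b\d{4}-\d{2}-\d{2}\b", ocr_text).group(),
    "quantity": int(re.search(r"\b(\d+)\s+pallets\b", ocr_text).group(1)),
}
print(entities)
# {'order_number': 'ORD-20431', 'date': '2024-03-15', 'quantity': 12}
```

A trained NER model replaces the hand-written patterns with learned ones, which is what makes the approach robust to the enormous variety of real-world forms and emails.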
Image & object recognition within the Trendskout AI software platform
As described above, the technology behind image recognition applications has evolved tremendously. Today, deep learning algorithms and convolutional neural networks (convnets) are used for these types of applications. Within the Trendskout AI software platform, we abstract away the complex algorithms behind these applications, making it possible for non-data scientists to build state-of-the-art image recognition applications. In this way, as an AI company, we make the technology accessible to a wider audience, such as business users and analysts. In addition, the Trendskout AI software makes it possible to set up every step of this process, from labeling to training the model to controlling external systems such as robotics, within one and the same platform.
Input of training data and connection for real time AI image processing
A distinction is made between a data set to train the model and the data that will have to be processed live when the model is put into production. As training data you can choose to upload video or photo files in various formats (AVI, MP4, JPEG, …). When video files are used, the Trendskout AI software will automatically split them into separate frames, making labeling easier in the next step.
As with other AI or machine learning applications, the quality of the data is also important for the quality of the image recognition. The sharpness and resolution of the images will therefore have an impact on the result: the accuracy and deployability of the model. A good guideline here is that the more difficult something is to recognize with the human eye, the more difficult it will be for artificial intelligence.
It is often the case that in (video) images only a specific zone is relevant to perform an image recognition analysis. In the example used here, this concerned a specific area in which pedestrians had to be detected. In applications for quality control or inspection in production environments, this is often a zone like a specific part of the conveyor belt. To select certain zones, a user-friendly cropping function was therefore built in.
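Conceptually, such zone selection is nothing more than cropping a rectangular region of interest out of each frame before it is passed to the model. The sketch below illustrates this with a frame represented as a plain 2D list of pixel values; the frame contents are dummy data.

```python
# A minimal sketch of zone selection: crop a rectangular region of interest
# from a frame so that only the relevant zone reaches the recognition model.

def crop(frame, top, left, height, width):
    """Return the rectangular sub-region of `frame` starting at (top, left)."""
    return [row[left:left + width] for row in frame[top:top + height]]

frame = [[r * 10 + c for c in range(6)] for r in range(4)]  # 4x6 dummy frame
zone = crop(frame, top=1, left=2, height=2, width=3)
print(zone)  # [[12, 13, 14], [22, 23, 24]]
```

Besides focusing the analysis, cropping also reduces the amount of pixel data per frame, which speeds up both training and live processing.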
Labeling or annotating the data
To recognize objects or events, the Trendskout AI software must be trained accordingly. This is done by labeling or annotating the objects to be detected by the computer vision system, applying labels to the individual frames. Within the Trendskout AI software, this can easily be done via a drag & drop function. Once a label has been assigned, it is remembered by the software and can simply be clicked on in subsequent frames. In this way, you go through all frames of the training data and mark every object to be recognized.
Building and training the computer vision or image recognition model
Once all training data has been annotated, the deep learning model can be built. All you have to do is click the RUN button in the Trendskout AI platform; the automated search for the best performing model for your application then starts in the background. The Trendskout AI software tries out countless algorithm combinations in the backend. Depending on the number of frames and objects to be processed, this search can take several hours to days, but you do not have to wait for it: the user is notified when the best performing model has been composed. Along with this model, a number of metrics are presented that reflect the accuracy and overall quality of the trained model.
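To give an idea of what such quality metrics express, the sketch below computes three standard ones — accuracy, precision and recall — from a binary confusion matrix. The counts are invented for illustration, and the exact set of metrics shown in the platform is not specified here.

```python
# Standard quality metrics for a detection/classification model, computed
# from a binary confusion matrix (tp/fp/fn/tn counts).

def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision and recall from prediction counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,    # share of all predictions that are right
        "precision": tp / (tp + fp),      # share of detections that are genuine
        "recall": tp / (tp + fn),         # share of genuine objects that are found
    }

# Example: 90 correct detections, 5 false alarms, 10 misses, 95 correct rejections.
metrics = classification_metrics(tp=90, fp=5, fn=10, tn=95)
print(metrics)
```

Looking at precision and recall separately matters in practice: a quality-control model that misses defects (low recall) fails differently from one that raises constant false alarms (low precision).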
Deploying the model for your business case
After the model has been trained, it can be put into use. For this purpose, it is usually necessary to connect to the camera platform that captures the (real-time) video images. This can be done via the live camera input feature, which can connect to various video platforms via API. The outgoing signal consists of messages or coordinates generated on the basis of the image recognition model, which can then be used to control other software systems, robotics or even traffic lights.
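To make the outgoing signal concrete, the sketch below shows how a downstream system might consume such a detection message. The JSON payload format, field names and threshold are hypothetical; the actual format depends on the platform and the connected systems.

```python
# A hypothetical detection message and a downstream consumer that decides
# whether to trigger an action based on the reported confidence.
import json

message = json.dumps({
    "label": "pedestrian",
    "confidence": 0.97,
    "box": {"x": 412, "y": 230, "width": 80, "height": 190},
})

def handle_detection(raw, min_confidence=0.9):
    """Parse a detection message and decide whether to raise an alert."""
    event = json.loads(raw)
    if event["confidence"] >= min_confidence:
        return f"ALERT: {event['label']} at ({event['box']['x']}, {event['box']['y']})"
    return "ignored"

print(handle_detection(message))  # ALERT: pedestrian at (412, 230)
```

The same pattern works whether the receiving side is a dashboard, a PLC driving a conveyor belt, or a traffic control system: the recognition model emits structured events, and the consumer decides what to do with them.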
Do you yourself see opportunities within your company for image or object recognition, but did the large investment cost hold you back? With the Trendskout AI software platform you can quickly and inexpensively set up advanced AI use cases. That is our mission as an AI company: democratizing AI. Contact us for a free demo, we are happy to help you.