Cyberphysical Smart Cities Infrastructures - Group of authors


Rise of the Embodied AI

In the mid‐1980s, a major paradigm shift toward embodiment took place, and computer science began to favor practical systems over purely theoretical algorithms and approaches. Embedded systems started to appear in all kinds of forms to aid humans in everyday life. Controllers for trains, airplanes, elevators, and air conditioners, along with software for translation and audio manipulation, are some of the most important examples [23].

Embodied artificial intelligence is a broad term, and those successes were certainly a great start. Yet it was clear that there was huge room for improvement. Theoretically, the ultimate goal of AI is not only to master any algorithm or task it is given but also to gain the ability to multitask and reach human‐level intelligence, and that, as mentioned, requires meaningful interaction with the real world. There are many specialized robots for a vast set of tasks, especially in large industries, that can do their assigned task to perfection, be it cutting different metals, painting, soldering circuits, or many others. However, until a single machine emerges that can do different tasks, or at least a small subset of them, by itself and not just by following orders, it cannot be called intelligent.

Humanoids are the first thing that comes to mind when we talk about robots with intelligence. Although human‐level intelligence is the ultimate goal, it is not the only form of intelligence on Earth. Other animals, such as insects, have their own kind of intelligence, and because they are relatively simple compared to humans, they are a very good place to begin.

Rodney Brooks famously argued that it took evolution much longer to create insects from scratch than to get from insects to human‐level intelligence. Consequently, he suggested that these simpler bioinspired robots should be dealt with first on the road to much more complex ones. Genghis, a six‐legged walking robot [24], is one of his contributions to this field.

This line of thought was a fundamental change that led researchers to redirect their work, and with it came attention to new domains and topics such as robotics, locomotion, artificial life, and bioinspired systems. The classical approach did not care about tasks involving interaction with the real world, and consequently, this journey started with locomotion and grasping.

Since not much computational power was available at the time of this shift, a big challenge for researchers was the trade‐off between simplicity and the potential to operate in complex environments. An extensive amount of work has been done in this area to explore or invent ways to exploit natural body dynamics, the materials used in the modules, and their morphologies so that robots can move, grasp, and manipulate items without sophisticated processing units [25, 27]. Robots that could use their own physical properties and those of the environment to function were more energy‐efficient, but they had their own limitations. Not being able to generalize well to complex environments was a major drawback. They were fast, however, whereas machines with huge processing units needed a noticeable amount of time to think, plan their next action, and then move their rigid and non‐smooth actuators.

Nowadays, many of these issues have been solved, and we can see extremely fast, smooth, and natural‐moving robots capable of different types of maneuvers [28]. Still, it is foreseen that advances in artificial muscles, joints, and tendons will improve this progress further.

3.3 Breakdown of Embodied AI

In this section, we try to categorize a broad range of research that has been done under the field of embodied AI. Due to the huge diversity, each section will necessarily be abstract and selective and reflect the authors' personal opinion.

3.3.1 Language Grounding

Communication between machines and humans has always been a topic of interest. As time goes on, more and more aspects of our lives are controlled by AI systems, and hence it is crucial to have ways to talk with them. This is a must for giving them new instructions or receiving answers from them, and since we are talking about general day‐to‐day machines, we want this interface to be higher level than programming languages and closer to spoken language. To achieve this, machines must be capable of relating language to actions and the world. Language grounding is the field that tackles this problem by mapping natural language instructions to robot behavior.

Hermann et al.'s study shows that this can be achieved by rewarding an agent upon successful execution of written instructions in a 3D environment with a combination of unsupervised learning and reinforcement learning [29]. They also argue that their agent can generalize well after training and can interpret new unseen instructions and operate in unfamiliar situations.
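The idea of rewarding an agent only when it carries out a written instruction can be illustrated with a very small sketch. This is not the method of [29], which works in a 3D environment; it is a toy, assuming a five‐cell corridor with a "red" object at one end and a "blue" object at the other, tabular Q‐learning, and uniformly random exploration. All names (`OBJECTS`, `reward`, `train`, `follow`) are hypothetical.

```python
import random

OBJECTS = {0: "red", 4: "blue"}  # assumed toy world: one object at each end
START = 2                        # the agent always starts in the middle cell
ACTIONS = (-1, +1)               # step left or step right


def reward(instruction, position):
    """Return 1.0 only if the agent stands on the object named in the instruction."""
    obj = OBJECTS.get(position)
    return 1.0 if obj is not None and obj in instruction.split() else 0.0


def train(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    """Tabular Q-learning over (instruction, position, action) triples."""
    rng = random.Random(seed)
    q = {}
    instructions = ["go to red", "go to blue"]
    for _ in range(episodes):
        instr = rng.choice(instructions)
        pos = START
        while pos not in OBJECTS:            # an episode ends at either object
            action = rng.choice(ACTIONS)     # purely random exploration
            nxt = pos + action
            r = reward(instr, nxt)
            if nxt in OBJECTS:
                future = 0.0                 # terminal state has no future value
            else:
                future = max(q.get((instr, nxt, a), 0.0) for a in ACTIONS)
            old = q.get((instr, pos, action), 0.0)
            q[(instr, pos, action)] = old + alpha * (r + gamma * future - old)
            pos = nxt
    return q


def follow(q, instruction, max_steps=20):
    """Act greedily on the learned values and report which object is reached."""
    pos = START
    for _ in range(max_steps):
        if pos in OBJECTS:
            break
        pos += max(ACTIONS, key=lambda a: q.get((instruction, pos, a), 0.0))
    return OBJECTS.get(pos)
```

Calling `follow(train(), "go to blue")` should walk the agent to the blue object. The only supervision is the instruction‐dependent reward, which is the point the sketch is meant to convey; generalizing to unseen instructions, as reported in [29], needs learned language and vision representations that a lookup table cannot provide.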

3.3.2 Language Plus Vision

Now that we know that machines can understand language and that sophisticated models exist just for this purpose [30], it is time to bring another sense into play. One of the most popular ways to show the potential of jointly training vision and language is image and video captioning [31, 35].

More recently, a new line of work has been introduced to take advantage of this connection. Visual question answering (VQA) [17] is the task of receiving an image along with a natural language question about that image as input and attempting to produce an accurate natural language answer as output. The beauty of this task is that both the questions and the answers can be open‐ended, and the questions can target different aspects of the image, such as the objects present in it, their relationships or relative positions, colors, and background.

Following this research, Singh et al. [36] added an optical character recognition (OCR) module to the VQA model so that the agent can also read the text in the image and either answer questions about that text or use it as additional context to answer other questions better.

One may ask where the new task stands relative to the previous one. Are agents that can answer questions more intelligent than those that deal with captions? The answer is yes. In [17], the authors show that VQA agents need a deeper and more detailed understanding of the image, and more reasoning, than captioning models.

3.3.3 Embodied Visual Recognition

Passive or fixed agents may fail to recognize objects in scenes if they are partially or heavily occluded. Embodiment comes to the rescue here: it lets the agent move in the environment and actively control its viewing position and angle to remove ambiguity in object shapes and semantics.

Jayaraman and Grauman [37] set out to learn representations that exploit the link between how the agent moves and how that motion affects its visual surroundings. To do this, they used raw unlabeled videos along with an external GPS sensor that provided the agent's coordinates, and they trained their model to learn a representation linking the two. After training, the agent is able to predict the outcome of its future actions and guess what the scene would look like after moving forward or turning to a side.
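A minimal sketch can make the notion of "linking motion to visual change" concrete. It is not the model of [37]: instead of video frames and GPS coordinates, it assumes a stand‐in feature that encodes the agent's heading as a 2‐D vector, and it fits one linear map per motion by least squares so that applying the map to the current feature predicts the feature after the motion. The names `TURNS`, `observe`, `fit_motion_map`, `learn_motion_maps`, and `predict` are hypothetical.

```python
import math
import random

# Assumed ego-motions: quarter turns to the left and to the right.
TURNS = {"turn_left": math.pi / 2, "turn_right": -math.pi / 2}


def observe(heading):
    """Stand-in for an image feature: a 2-D encoding of the agent's heading."""
    return (math.cos(heading), math.sin(heading))


def fit_motion_map(pairs):
    """Least-squares 2x2 matrix M such that M @ before ~= after for all pairs."""
    # Accumulate A = sum(after * before^T) and B = sum(before * before^T);
    # the least-squares solution is M = A @ inverse(B).
    a = [[0.0, 0.0], [0.0, 0.0]]
    b = [[0.0, 0.0], [0.0, 0.0]]
    for before, after in pairs:
        for i in range(2):
            for j in range(2):
                a[i][j] += after[i] * before[j]
                b[i][j] += before[i] * before[j]
    det = b[0][0] * b[1][1] - b[0][1] * b[1][0]
    b_inv = [[b[1][1] / det, -b[0][1] / det],
             [-b[1][0] / det, b[0][0] / det]]
    return [[sum(a[i][k] * b_inv[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def learn_motion_maps(samples=200, seed=0):
    """Collect (before, after) feature pairs for each motion and fit its map."""
    rng = random.Random(seed)
    maps = {}
    for name, delta in TURNS.items():
        pairs = []
        for _ in range(samples):
            heading = rng.uniform(0.0, 2.0 * math.pi)
            pairs.append((observe(heading), observe(heading + delta)))
        maps[name] = fit_motion_map(pairs)
    return maps


def predict(maps, feature, motion):
    """Predict the feature the agent would observe after the given motion."""
    m = maps[motion]
    return tuple(m[i][0] * feature[0] + m[i][1] * feature[1] for i in range(2))
```

With `maps = learn_motion_maps()`, the call `predict(maps, observe(0.0), "turn_left")` returns a vector close to `observe(math.pi / 2)`, so the agent can anticipate what it would see after turning without actually turning. Real image features are not an exact linear function of the motion, which is why the representation itself has to be learned in practice.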

This was powerful, and in a sense the agent developed imagination. However, there was an issue. On closer inspection, the agent is still being fed prerecorded video as input, so it learns much like the observer kitten in the kitten carousel experiment explained above. To address this problem, the authors proposed training an agent that takes any given object from an arbitrary angle and then predicts, or rather imagines, the other views by finding the representation in a self‐supervised manner [38].

Up to this point, the agent has not used the sound of its surroundings, whereas humans experience the world in a multisensory manner. We can see, hear, smell, and touch all
