Cyberphysical Smart Cities Infrastructures. Группа авторов

Чтение книги онлайн.

Читать онлайн книгу Cyberphysical Smart Cities Infrastructures - Группа авторов страница 17

Cyberphysical Smart Cities Infrastructures - Группа авторов

Скачать книгу

Rise of the Embodied AI

      Embodied artificial intelligence is a broad term, and those successes were for sure great ones to start with. Yet, it could clearly be seen that it was a huge room for improvement. Theoretically, the ultimate goal of AI is to not only master any given algorithm or task that is given to but also gain the ability to multitask and get to human‐level intelligence, and that as mentioned requires meaningful interaction with the real world. There are many specialized robots for a vast set of tasks out there, especially in large industries, which can do the assigned task to perfection, let it be cutting different metals, painting, soldering circuits, and many more. However, until one single machine emerges to have the ability to do different tasks or at least a small subset of them by itself and not just by following orders, it cannot be called intelligence.

      Humanoids are the main thing that comes to mind when we talk about robots with intelligence. Although it is the ultimate goal, it is not the only form of intelligence on Earth. Other animals, such as insects, have their own kind of intelligence, and due to being relatively simpler compared to humans, they are a very good place to begin with.

      Rodney Brooks has a famous argument that says it took the evolution much longer to create insects from scratch than getting to human‐level intelligence from there. Consequently, he suggested that these simpler biorobotics should be first dealt with in the road to make much more complex ones. Genghis, a six‐legged walking robot [24], is one of his contributions to this field.

      This line of thought was a fundamental change and led researchers to have a change of direction in their work, and with that came attention to new domains and topics such as robotics, locomotion, artificial life, bioinspired systems, and so on. The classical approach did not care about tasks related to the interaction with the real world, and consequently, this journey is started by locomotion and grasping.

      Nowadays, a big part of these issues are solved, and we can see extremely fast and smooth natural moving robots capable of doing different types of maneuvers [28], but yet it is foreseen that with the advances of artificial muscles, joints, and tendons, this progress can be further improved.

      In this section, we try to categorize a broad range of research that has been done under the field of embodied AI. Due to the huge diversity, each section will necessarily be abstract and selective and reflect the authors' personal opinion.

      3.3.1 Language Grounding

      Machine and human communication has always been a topic of interest. As time goes on, more and more aspects of our lives are controlled by AIs, and hence it is crucial to have ways to talk with them. This is a must for giving new instructions to them or receiving an answer from them, and since we are talking about general day‐to‐day machines, we desire this interface to be higher level than programming languages and closer to spoken language. To achieve this, machines must be capable of relating language to actions and the world. Language grounding is the field that tries to tackle this and map natural language instructions to robot behavior.

      Hermann et al.'s study shows that this can be achieved by rewarding an agent upon successful execution of written instructions in a 3D environment with a combination of unsupervised learning and reinforcement learning [29]. They also argue that their agent can generalize well after training and can interpret new unseen instructions and operate in unfamiliar situations.

      3.3.2 Language Plus Vision

      Now that we know that machines can understand languages and there exist sophisticated models just for this purpose out there [30], it is time to bring another sense into play. One of the most popular ways to show the potential of joint training of vision and language is the image and video captioning [31, 35].

      Following this research, Singh et al. [36] cleverly added an optical character recognition (OCR) module to the VQA model to enable the agent to read the texts available in the image as well and answer questions asked from them or use the additional context indirectly to answer the question better.

      One may ask where the new task stands relative to the previous one. Do agents who can answer questions more intelligent than the ones who deal with captions or not? The answer is yes. In [17], the authors show that VQA agents need a deeper and more detailed understanding of the image and reasoning than models for captioning.

      3.3.3 Embodied Visual Recognition

      Passive or fixed agents may fail to recognize objects in scenes if they are partially or heavily occluded. Embodiment comes to the rescue here and gifts the possibility of moving in the environment to actively control the viewing position and angle to remove any ambiguity in object shapes and semantics.

      Jayaraman and Grauman [37] started to learn representations that will exploit the link between how the agent moves and how it will affect its visual surrounding. To do this they used raw unlabeled videos along with an external GPS sensor that provided the agent's coordinates and trained their model to learn a representation linking these two. So, after this, the agent would have the ability to predict the outcome of its future actions and guess how the scene would look like after moving forward or turning to a side.

      This was powerful and in a sense, the agent developed imagination. However, there was an issue here. If we pay attention, we realize that the agent is still being fed prerecorded video as the input and is learning similar to the observer kitten in the kitten carousel experiment explained above. So, following this, the authors went after this problem and proposed to train an agent that takes any given object from an arbitrary angle and then predict or better to say imagine the other views by finding the representation in a self‐supervised manner [38].

      Up until this point, the agent does not use the sound of its surroundings while humans are all about experiencing the world in a multisensory manner. We can see, hear, smell, and touch all

Скачать книгу