Cyberphysical Smart Cities Infrastructures. Group of authors

and use the relevant information that could be beneficial to our task at hand. All that said, understanding and learning the sounds of objects present in a scene is not easy, since the sounds overlap and are received through a single‐channel sensor. This is often treated as an audio source separation problem, which has been studied extensively in the literature [39, 43].

      Results show that policies indeed help the agent achieve better visual recognition performance, and that agents can plan their future moves and paths for better results; these paths mostly differ from the shortest paths [51].

      3.3.4 Embodied Question Answering

      Embodied Question Answering brings question answering (QA) into the embodied world. The task starts with an agent being spawned at a random location in a 3D environment and asked a question whose answer can be found somewhere in the environment. To answer it, the agent must first navigate strategically to explore the environment, gather the necessary data through its vision, and then answer the question once it has found the answer [52, 53].

      Following this, Das et al. [54] also presented a modular approach that further enhances this process by teaching the agent to break the master policy into subgoals that are also interpretable by humans and to execute them to answer the question. This was shown to increase the success rate.

      3.3.5 Interactive Question Answering

      Interactive Question Answering (IQA) is closely related to Embodied Question Answering. The main difference is that the question is designed so that the agent must interact with the environment to find the answer. For example, it may have to open the refrigerator or pick something up from the cabinet, and it must therefore plan a series of actions conditioned on the question [55].

      3.3.6 Multi‐agent Systems

      Multi‐agent systems (MAS) are another interesting line of development. By default, AI has focused strongly on individual agents. MAS research, which has its origins in biology, tries to change this and instead studies the emergence of behaviors in groups of agents or swarms [56, 57].

      Every agent has a set of abilities and is good at them to some extent. The point of interest in MAS is how sophisticated global behavior can emerge from a population of agents working together. Real‐life examples of such behavior can be found in insects such as ants and bees [58, 59]. One of the interesting goals of this research is ultimately to build agents that can self‐repair [60, 61].

      Now that we know about the fields and tasks in which embodied AI can shine, the question is how our agents should be trained. One may argue that it is best to train them directly in the physical world and expose them to its richness. Although this is a valid solution, it comes with a few drawbacks. First, training in the real world is slow, and the process cannot be sped up or parallelized. Second, it is very hard to control the environment and create custom scenarios. Third, it is expensive in terms of both power and time. Fourth, it is not safe: improperly or incompletely trained robots can hurt themselves, humans, animals, and other assets. Fifth, for the agent to generalize, training has to be done in many different environments, which is not feasible in the physical world.

      Our next choice is simulators, which can deal with all of the aforementioned problems well. In the shift from Internet AI to embodied AI, simulators take the role that was previously played by traditional datasets. One more advantage of using simulators is that the physics of the environment can be tweaked as well. For instance, some traditional approaches in this field [64] are sensitive to noise; as a remedy, the sensor noise can be turned off for such a task.
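      The idea of switching sensor noise off in a simulator can be sketched as follows. This is only a minimal illustration under stated assumptions: the class `SimulatedDepthSensor`, its `noise_std` parameter, and the Gaussian noise model are hypothetical and do not come from any of the simulators named in this chapter.

```python
import random


class SimulatedDepthSensor:
    """Hypothetical range sensor for a toy simulator (not a real simulator API).

    Readings are the true distance plus zero-mean Gaussian noise.
    Setting noise_std to 0.0 turns the noise off entirely.
    """

    def __init__(self, noise_std=0.05, seed=None):
        self.noise_std = noise_std
        # A private generator keeps runs reproducible when a seed is given.
        self._rng = random.Random(seed)

    def read(self, true_distance):
        """Return one (possibly noisy) measurement of true_distance."""
        if self.noise_std == 0.0:
            return true_distance
        return true_distance + self._rng.gauss(0.0, self.noise_std)


# Assumed usage: the same scene measured with and without sensor noise.
noisy_sensor = SimulatedDepthSensor(noise_std=0.05, seed=0)
clean_sensor = SimulatedDepthSensor(noise_std=0.0)

noisy_reading = noisy_sensor.read(2.5)
clean_reading = clean_sensor.read(2.5)
```

      In the physical world the noise level is a property of the hardware; in a sketch like this it is just a constructor argument, which is what makes noise‐sensitive methods easy to evaluate in isolation.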

      As a result, agents nowadays are often developed and benchmarked in simulators [65, 66], and once a promising model has been trained and tested, it can then be transferred to the physical world [67, 68].

      House3D [69], AI2‐THOR [70], Gibson [71], CHALET [72], MINOS [73], and Habitat [74] are some of the popular simulators for embodied AI studies. These platforms vary with respect to the 3D environments they use, the tasks they can handle, and the evaluation protocols they provide. These simulators support different sensors such as vision, depth, touch, and semantic segmentation.

      In the last section, we saw numerous task definitions and how each of them can be tackled by agents. So, before turning to the MINOS and Habitat simulators and reviewing them, let us first become more familiar with the three main goal‐directed navigation tasks, namely, PointGoal Navigation, ObjectGoal Navigation, and RoomGoal Navigation.

      In PointGoal Navigation, an agent is spawned at a random starting position and orientation in a 3D environment and is asked to navigate to target coordinates that are given relative to the agent's position. The agent can access its position via an indoor GPS. No ground‐truth map is available, and the agent must use only its sensors to complete the task. ObjectGoal Navigation and RoomGoal Navigation start in the same way; however, instead of coordinates, the agent is asked to find an object or to go to a specific room.
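      To make "target coordinates given relative to the agent's position" concrete, the sketch below expresses a goal as a distance and a bearing in the agent's own frame. It is a simplified illustration, not the encoding used by any particular simulator: the function name `goal_relative_to_agent`, the planar (2D) world, and the convention that heading is measured counterclockwise from the x‐axis are all assumptions made for this example.

```python
import math


def goal_relative_to_agent(agent_xy, agent_heading, goal_xy):
    """Return (distance, bearing) of the goal as seen from the agent.

    Assumptions for this sketch: a flat 2D world, agent_heading in radians
    measured counterclockwise from the x-axis. A bearing of 0 means the goal
    is straight ahead; a positive bearing means it lies to the agent's left.
    """
    dx = goal_xy[0] - agent_xy[0]
    dy = goal_xy[1] - agent_xy[1]
    distance = math.hypot(dx, dy)
    # Angle to the goal in the world frame, minus where the agent is facing.
    bearing = math.atan2(dy, dx) - agent_heading
    # Wrap the angle into [-pi, pi] so small turns stay small.
    bearing = math.atan2(math.sin(bearing), math.cos(bearing))
    return distance, bearing


# Assumed usage: the position would come from the indoor GPS reading.
distance, bearing = goal_relative_to_agent(
    agent_xy=(0.0, 0.0), agent_heading=math.pi / 2, goal_xy=(0.0, 2.0)
)
```

      Recomputing this pair after every step gives the agent an up‐to‐date goal vector without requiring a map, which matches the setting described above.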

      3.4.1 MINOS

      By default, the MINOS simulator provides access to 45 000 three‐dimensional models of furnished houses with more than 750 K rooms of different types, available in the SUNCG [76] dataset, and to 90 multi‐floor residences with approximately 2000 annotated room regions from the Matterport3D [77] dataset. Environments in Matterport3D look more realistic than those in SUNCG. The MINOS simulator can reach hundreds of frames per second on a normal workstation.

      To benchmark the system, the authors studied four navigation algorithms, three of which were based on the asynchronous advantage actor‐critic (A3C) approach [78], while the remaining one was direct future prediction (DFP) [79].

      The most basic of these algorithms was feedforward A3C. In this algorithm, a feedforward convolutional neural network (CNN) is employed as the function approximator to learn the policy along with the value function, that is, the expected sum of rewards from the current time step until the end of the episode. The second was LSTM A3C, which added an LSTM to the feedforward A3C model to act as a simple memory. Next was UNREAL, an LSTM A3C model boosted with auxiliary tasks such as value function replay and reward prediction. Last but not
