Abstract (EN):
With the emergence of an information-oriented society, it soon became clear that the massive amount of information being generated required effective ways of indexing and searching. As early as the 1950s, researchers sought ways to implement information retrieval systems. These systems, and in particular text retrieval systems, have evolved considerably and have become part of our daily lives. The fact that virtually all of the internet's text content is now searchable and accessible in less than half a second is paradigmatic of this. The natural next step was to index multimedia content in addition to text. However, multimedia content introduces additional problems to the indexing task. The large amount of information and the complexity of its relations are factors that dramatically increase the difficulty of achieving highly successful indexing and searching results. For instance, until recently, devising a system that could automatically detect and identify people in a complex scene, track them across multiple cameras, and analyse their behaviour in real time would have been far too arduous a task. Though such a system has not yet been fully realized, many recent advances, mostly in computer vision and machine learning, bring us much closer to that technological milestone.
In this dissertation we approach the issue of indexing content obtained from real-world scenes. We define a ``real-world scene'' as any scene captured continuously in public or private spaces by automated and often passive sensors. These scenes are usually captured by multiple sensors of multiple types. The actions portrayed in the captured sequences consist of everyday activities, such as people walking or running, cars passing by or parking, etc. An example application is a surveillance system. Most of the information in surveillance scenarios is conveyed by a sequence of images but, more often than not, important information can also be obtained by analysing other types of data, or modalities -- multimodal scene analysis relies on that premise. We start by analysing the concepts and challenges involved in multimodal analysis of real-world scenes. Three processing areas are considered: object detection, object recognition, and event analysis. With object detection we separate each object both in space and in time and associate a label with it. This label distinguishes objects from one another but carries no semantic knowledge. That is the goal of object recognition, with which we associate an identity with the object from a set of known classes. With event analysis, on the other hand, we identify relevant activities and events defined by the context of the scene under analysis. For each area we survey relevant algorithms and systems, and present original contributions.
Language:
Portuguese
Type (Professor's evaluation):
Scientific
Contact:
lfpt@fe.up.pt
No. of pages:
272