Subproject 5: On- and Offline Extraction of Semantic Structures

Work Package 5.1, Extraction from Syntactic Web Pages, deals with the development and application of techniques that answer arbitrary, domain-independent user requests online and in real time. These methods are rather shallow but fast and robust. They rely on existing search engines such as Google and consider only classic ("syntactic") web pages, that is, pages written in plain HTML without semantic annotation.
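To make the idea of shallow, fast extraction from plain HTML concrete, the following is a minimal sketch and not the SmartWeb implementation. It assumes the pages have already been retrieved (the search engine step is omitted) and are passed in as strings. The names `TextExtractor`, `html_to_text` and `extract_candidates`, as well as the use of a single regular expression with frequency-based ranking, are illustrative assumptions.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text from plain HTML, skipping script and style content."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside script/style.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Reduce an HTML string to its visible text, chunks joined by spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def extract_candidates(pages, pattern):
    """Apply a surface pattern to each page and rank answers by frequency.

    `pattern` is a regular expression (hypothetical, chosen per request)
    whose first group captures the answer candidate.
    """
    counts = {}
    for html in pages:
        for match in re.finditer(pattern, html_to_text(html)):
            answer = match.group(1)
            counts[answer] = counts.get(answer, 0) + 1
    # Most frequent candidate first; ties broken alphabetically.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
```

Such a sketch shows why the approach is fast and robust (no deep analysis, tolerance of malformed markup) and also why it is imprecise: the pattern matches surface strings only, so redundancy across pages is the main signal for ranking candidates.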

Work Package 5.2, Automatic Generation of Semantic Web Pages, deals with the definition and development of automatic methods for generating semantic web pages. To this end, web pages are analyzed offline and enriched with semantic metadata as defined in the SmartWeb ontologies (WP 4.2). This creates a knowledge base that can answer user requests with higher precision than the faster but less precise extraction from syntactic web pages (WP 5.1). The higher precision results from advanced inference and from the integration of information through incremental, ontology-based processing of web information.

The methods developed here are aimed mainly at unstructured or semi-structured data such as textual descriptions and pictures. For the annotation of textual data, different approaches from the area of information extraction are used that combine linguistic, semantic and document-structure information. Pictures, especially logos, are analyzed and annotated using image recognition methods. The following sections discuss these aspects in more detail, broken down into four sub work packages: Work Package 5.2.1: Generation of a Document Collection; Work Package 5.2.2: Linguistic and Semantic Analysis for the Extraction of Information; Work Package 5.2.3: Document Structure Analysis and Image Recognition; Work Package 5.2.4: Ontology-based Information Extraction.

