Merging HTML tables for extracting relations

Luzuriaga Carpio, Jhomara Tamara

Thesis advisor	dc.contributor.advisor	Hogan, Aidan
Author	dc.contributor.author	Luzuriaga Carpio, Jhomara Tamara
Associate professor	dc.contributor.other	Navarro Badino, Gonzalo
Associate professor	dc.contributor.other	Pérez Rojas, Jorge
Associate professor	dc.contributor.other	Angles Rojas, Renzo
Accession date	dc.date.accessioned	2020-04-01T22:45:02Z
Available date	dc.date.available	2020-04-01T22:45:02Z
Publication date	dc.date.issued	2019
Identifier	dc.identifier.uri	https://repositorio.uchile.cl/handle/2250/173797
General note	dc.description	Thesis for the degree of Master of Science in Computer Science (Magíster en Ciencias, Mención Computación)	es_ES
Abstract	dc.description.abstract	With the emergence and evolution of the Semantic Web, knowledge bases (e.g. DBpedia and Wikidata) have become increasingly important as a source of information for new projects; at least 1700 knowledge bases are registered in the Linked Open Data cloud, whereas the web has about 2 billion sites; hence the need to increase the amount of information in structured format. The more high-confidence information these knowledge bases hold, the greater the benefit for the applications that use them. In this work we propose extracting information from HTML tables in a structured format (RDF) to feed these knowledge bases, specifically Wikidata. HTML makes it possible to create web documents in a semi-structured format that certain software interprets in order to display the documents to the user; however, because the content lacks semantic structure, software is limited in its ability to read and exploit HTML content. Extracting information from HTML and giving it a semantic structure is therefore the subject of much research. In general, tables have a relational structure from which entities, attributes and relations can be extracted; however, the web contains countless table designs, which makes extracting their information a non-trivial challenge. Our research focuses on extracting relations between entities that can be identified in HTML tables. Although a number of research works have already addressed this problem, existing relation extraction approaches tend to process each table individually. 
We propose an extension of these methods based on grouping tables with similar information, so that we can increase the context available for small and complex tables, which on their own do not provide enough information to extract relations with a good level of confidence. We apply the proposed method to enrich Wikidata with triples extracted from Wikipedia. The results of the thesis show that our table-grouping method achieves higher precision by providing more robust features for classifying candidate relations as correct or incorrect, reaching 75% precision in the evaluation on individual tables, whereas the features proposed by the method of Muñoz et al. [30] yielded 71%. In addition, at 70% precision, our proposal of grouping tables yielded more triples. We consider these results satisfactory because, despite the large number of incorrect triples that grouping tables can generate, we were able to obtain new triples at a similar level of precision.	es_ES
Abstract	dc.description.abstract	With the appearance and evolution of the Semantic Web, knowledge bases (e.g. DBpedia and Wikidata) have acquired great importance as a source of information for new projects; there are at least 1700 knowledge bases registered in the Linked Open Data cloud, while there are about 2 billion websites, hence the need to increase the information available in a structured format. The more high-confidence information is available in these knowledge bases, the greater the benefit for the applications that use them. In this work we propose to extract information from HTML tables in a structured format (RDF) to feed these knowledge bases, specifically Wikidata. The HTML language allows web documents to be created in a semi-structured format that is interpreted by certain software to show the documents to the user; however, since the content lacks semantic structure, the software is limited in its ability to read and exploit the HTML content. Extracting information from HTML and providing it with a semantic structure is therefore the subject of many research papers. In general, tables have a relational structure from which we can extract entities, attributes and relationships; nevertheless, on the web we find innumerable table designs, which make extracting their information a non-trivial challenge. Our research work is based on the extraction of relationships between entities that can be identified in HTML tables. Although a number of research papers have already addressed this problem, existing relationship extraction approaches tend to process each table individually. We propose an extension of these methods based on the grouping of tables with similar information, so that we can increase the context of the information contained in small and complex tables, which by themselves do not provide enough information to extract relationships with a good level of confidence. 
We apply the proposed method to enrich Wikidata with triples extracted from Wikipedia. The results of the thesis show that our method for grouping tables obtains greater precision by providing more robust features for classifying candidate relationships as correct or incorrect, reaching 75% precision in the evaluation performed on individual tables; in comparison, the features proposed by the method of Muñoz et al. [30] obtained 71%. In addition, at 70% precision, more triples could be obtained through our proposal of grouping tables. We consider these results satisfactory since, despite the large number of incorrect triples that can be generated when grouping the tables, we were able to obtain new triples with a similar level of precision.	es_ES
Language	dc.language.iso	en	es_ES
Publisher	dc.publisher	Universidad de Chile	es_ES
Type of license	dc.rights	Attribution-NonCommercial-NoDerivs 3.0 Chile	*
Link to License	dc.rights.uri	http://creativecommons.org/licenses/by-nc-nd/3.0/cl/	*
Keywords	dc.subject	Data mining	es_ES
Keywords	dc.subject	Intelligent agents (Computer software)	es_ES
Keywords	dc.subject	Semantic Web	es_ES
Keywords	dc.subject	HTML (Document markup language)	es_ES
Title	dc.title	Merging HTML tables for extracting relations	es_ES
Document type	dc.type	Thesis
Cataloguer	uchile.catalogador	gmm	es_ES
Department	uchile.departamento	Departamento de Ciencias de la Computación	es_ES
Faculty	uchile.facultad	Facultad de Ciencias Físicas y Matemáticas	es_ES

Files in this item

Name: cf-luzuriaga_jc.pdf
Size: 5.763Mb
Format: PDF

This item appears in the following Collection(s)

Tesis Postgrado

Except where otherwise noted, this item's license is described as Attribution-NonCommercial-NoDerivs 3.0 Chile