RESEARCH SPOTLIGHT

Research Spotlight (RS) provides an automated workflow for populating the core entities and relations of the Scholarly Ontology (SO). To do so, RS offers distant supervision techniques that make it easy to train machine learning models, interconnects with various APIs to harvest (linked) data and information from the web, and uses pretrained ML models along with lexico-semantic rules to extract information from the text of research articles, associate it with information from the article's metadata and other digital repositories, and publish the inferred knowledge as linked data. Simply put, Research Spotlight transforms the text of a research article into queryable knowledge graphs based on the semantics provided by the Scholarly Ontology.
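
To give an idea of what "queryable" means in practice, the sketch below queries such a graph with SPARQL from Python. The endpoint URL and the SO namespace URI are placeholders (the actual URIs depend on the deployment); the hasObjective relation used here is one of the SO relations described further down this page.

    from SPARQLWrapper import SPARQLWrapper, JSON

    # Hypothetical RS endpoint and SO namespace; replace with the actual deployment URIs.
    sparql = SPARQLWrapper("http://example.org/rs/sparql")
    sparql.setReturnFormat(JSON)
    sparql.setQuery("""
        PREFIX so: <http://example.org/scholarly-ontology#>
        SELECT ?activity ?goal WHERE {
            ?activity a so:Activity ;
                      so:hasObjective ?goal .
        }
        LIMIT 10
    """)

    # Print each research activity together with the goal it pursues.
    for row in sparql.query().convert()["results"]["bindings"]:
        print(row["activity"]["value"], "hasObjective", row["goal"]["value"])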

RS employs a modular architecture that allows for flexible expansion and upgrading of its components. It is written in Python and makes use of libraries such as spaCy for parsing and syntactic analysis of text, Beautiful Soup for parsing the HTML/XML structure of web pages, and scikit-learn for the machine learning methods used to extract entities and relations from text.
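
As a rough illustration of how these libraries fit together, the following sketch fetches and strips an article page with Beautiful Soup, segments the raw text with spaCy, and sets up a scikit-learn classifier. The URL handling, model name and classification task are illustrative assumptions, not the actual RS code.

    import requests
    import spacy
    from bs4 import BeautifulSoup
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # spaCy pipeline for parsing and syntactic analysis (assumes the model is installed).
    nlp = spacy.load("en_core_web_sm")

    def fetch_article_text(url: str) -> str:
        """Download an article page and strip its HTML/XML markup."""
        html = requests.get(url, timeout=30).text
        soup = BeautifulSoup(html, "html.parser")
        return soup.get_text(separator=" ")

    def sentences(text: str):
        """Segment raw text into sentences using the spaCy pipeline."""
        return [sent.text for sent in nlp(text).sents]

    # A toy scikit-learn classifier, e.g. for deciding whether a sentence
    # describes a research Activity (the task here is invented for illustration).
    classifier = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))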

LAYERED APPROACH

In order to transform text into queryable knowledge graphs, Research Spotlight (RS) follows a layered approach. The input comprises published research articles retrieved from repositories or web pages, preferably in HTML/XML format. This format is exploited to extract an article's metadata, such as author information, references and their mentions in the text, and the legends of figures, tables, etc. Entities such as Activities, Methods, Goals and Propositions are extracted from the text of the article. These are associated in the relation extraction step through various relations, e.g. follows, hasPart, hasObjective, resultsIn, hasParticipant, hasTopic, hasAffiliation, etc. Encoded as RDF triples, the extracted entities and relations are published as linked data, using additional “meta-properties”, such as owl:sameAs, owl:equivalentProperty, rdfs:label and skos:altLabel, where appropriate.
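
The sketch below shows, using rdflib, what such triples might look like. The SO namespace URI, the instance URIs and the example entities are placeholders invented for illustration; only the relation and meta-property names are taken from the description above.

    from rdflib import Graph, Literal, Namespace, URIRef
    from rdflib.namespace import OWL, RDF, RDFS, SKOS

    # Placeholder namespace; the real SO namespace URI depends on the deployment.
    SO = Namespace("http://example.org/scholarly-ontology#")
    g = Graph()
    g.bind("so", SO)

    activity = URIRef("http://example.org/rs/resource/activity/123")
    method = URIRef("http://example.org/rs/resource/method/topic-modeling")
    goal = URIRef("http://example.org/rs/resource/goal/42")

    # Core SO entities and relations extracted from the article text.
    g.add((activity, RDF.type, SO.Activity))
    g.add((method, RDF.type, SO.Method))
    g.add((activity, SO.hasObjective, goal))
    g.add((activity, RDFS.label, Literal("corpus analysis with topic modeling")))

    # Meta-properties: link to a "strong" URI and record an alternative surface form.
    g.add((method, OWL.sameAs, URIRef("http://dbpedia.org/resource/Topic_model")))
    g.add((method, SKOS.altLabel, Literal("LDA topic modelling")))

    print(g.serialize(format="turtle"))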

PREPROCESSING

During Preprocessing, information is retrieved from sources such as DBpedia in order to build lists of named entities through the NE List Creation module. Specific queries using these entities are then submitted to the sources via the API Querying module. Retrieved articles are processed by the Text Cleaning module, and the resulting raw text is added to a training corpus through the Automatic Annotation module, which uses the entries of the NE list to spot named entities in the text. The annotated texts are used to train a classifier to recognize the desired types of named entities.
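
A minimal sketch of this distant-supervision idea follows: harvest candidate names from DBpedia, then mark their occurrences in raw text to obtain training spans for an entity recognizer. The DBpedia class queried, the entity label and the naive string matching are assumptions made for illustration, not the actual RS queries.

    from SPARQLWrapper import SPARQLWrapper, JSON

    def build_ne_list() -> set:
        """NE List Creation: harvest candidate entity names from DBpedia."""
        sparql = SPARQLWrapper("https://dbpedia.org/sparql")
        sparql.setReturnFormat(JSON)
        sparql.setQuery("""
            PREFIX dbo: <http://dbpedia.org/ontology/>
            PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
            SELECT DISTINCT ?label WHERE {
                ?m a dbo:Software ;        # placeholder class; RS would target
                   rdfs:label ?label .     # domain-relevant classes instead
                FILTER (lang(?label) = "en")
            } LIMIT 500
        """)
        rows = sparql.query().convert()["results"]["bindings"]
        return {r["label"]["value"] for r in rows}

    def auto_annotate(text: str, ne_list: set, label: str = "METHOD"):
        """Automatic Annotation: mark every NE-list entry found in the text
        as a (start, end, label) span.  Naive matching; a real pipeline would
        handle tokenization and overlapping matches."""
        spans = []
        for name in ne_list:
            start = text.find(name)
            if start != -1:
                spans.append((start, start + len(name), label))
        return text, {"entities": spans}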

MAIN PROCESSING

Main Processing begins with harvesting research articles from Web sources, either through their APIs or by scraping publication websites. The articles are scanned for metadata, which are mapped to SO instances according to a set of rules. In addition, specific HTML/XML tags inside the articles indicating figures/images, tables and references are extracted and associated with the appropriate SO entities, while the rest of the unstructured, “raw” text is cleaned and segmented into sentences by the Text Cleaning & Segmentation module. This raw text is then fed into the Named Entity Recognition module, where named entities of specific types are recognized. The segmented text is also passed through a dependency parser by the Syntactic Analysis module. The output consists of annotated text (in the form of dependency trees reflecting the internal syntax of each sentence), which is further processed by the Non-Named Entities Extraction module so that text segments containing other entities (such as Activities, Goals or Propositions) can be extracted.

The output of the above steps (named entities, non-named entities and metadata) is fed into the Relation Extraction module, which uses four kinds of rules: (i) syntactic patterns based on the output of the dependency parser; (ii) the surface form of words and POS tagging; (iii) semantic rules derived from the Scholarly Ontology; (iv) proximity constraints capturing structural idiosyncrasies of the texts. Finally, based on the information extracted in the previous steps, URIs in the SO namespace are generated and linked, when possible, to other strong URIs (such as the DBpedia entities stored in the named entity lists) in order to be published as linked data through a SPARQL endpoint.
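
As a simplified illustration of rule types (i) and (ii), the sketch below applies a single syntactic pattern over a spaCy dependency parse, pairing a main clause (a candidate Activity) with an infinitival purpose clause (a candidate Goal). The pattern and the relation it emits are assumptions made for the example, not the actual RS rule set.

    import spacy

    nlp = spacy.load("en_core_web_sm")

    def extract_has_objective(sentence: str):
        """Pair a main clause (candidate Activity) with a 'to'-purpose clause
        (candidate Goal) using one syntactic pattern over the dependency parse."""
        doc = nlp(sentence)
        relations = []
        for token in doc:
            # advcl = adverbial clause; an infinitival 'to' marks a purpose reading
            if token.dep_ == "advcl" and token.pos_ == "VERB" \
                    and any(child.dep_ == "aux" and child.lower_ == "to"
                            for child in token.children):
                goal_idx = {t.i for t in token.subtree}
                activity = " ".join(t.text for t in token.head.subtree
                                    if t.i not in goal_idx)
                goal = " ".join(t.text for t in token.subtree)
                relations.append((activity, "hasObjective", goal))
        return relations

    print(extract_has_objective(
        "We annotated the corpus to evaluate inter-coder agreement."))
    # Expected (approximately):
    # [('We annotated the corpus .', 'hasObjective', 'to evaluate inter-coder agreement')]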