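Knowledge Bases for the Digital Construction of Language Resources
Linguistic Knowledge Modeling and Applications Based on Knowledge Graphs and Ontologies
These works share a focus on using ontologies, knowledge graphs, and semantic technologies to build structured linguistic or domain knowledge bases that support knowledge reasoning, visualization, and intelligent management.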
- The Linguistic Design of the EuroWordNet Database (Antonietta Alonge, Nicoletta Calzolari, Piek Vossen, Laura Bloksma, Irene Castellón Masalles, M. Antònia Martí, Wim Peters, 1998, Computers and the Humanities)
- An Ontology based Smart Management of Linguistic Knowledge (Mariem Neji, Fatma Ghorbel, Bilel Gargouri, Nada Mimouni, Elisabeth Métais, 2022, Journal of Data Mining & Digital Humanities)
- Ontologies and ontological methods in linguistics (Andrea C. Schalley, 2019, Language and Linguistics Compass)
- A Comprehensive Survey on Automatic Knowledge Graph Construction (Lingfeng Zhong, Jia Wu, Qian Li, Hao Peng, Xindong Wu, 2023, ACM Computing Surveys)
- Automatic Construction of Subject Knowledge Graph based on Educational Big Data (Ying Su, Yong Zhang, 2020, Proceedings of the 2020 3rd International Conference on Big Data and Education)
- An Approach of Ontology Based Knowledge Base Construction for Chinese K12 Education (Jiawei Hu, Zheng Li, Bin Xu, 2016, 2016 First International Conference on Multimedia and Image Processing (ICMIP))
- DBpedia – A large-scale, multilingual knowledge base extracted from Wikipedia (Jens Lehmann, Robert Isele, Max Jakob, Anja Jentzsch, Dimitris Kontokostas, Pablo N. Mendes, Sebastian Hellmann, Mohamed Morsey, Patrick van Kleef, Sören Auer, Christian Bizer, 2015, Semantic Web)
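Construction and Digital Storage of Language Data Repositories
These works concentrate on collecting, annotating, organizing, and storing language data in specific areas (such as second language acquisition, low-resource languages, and listening and spoken language development), with an emphasis on building corpus or knowledge base platforms for researchers.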
- Wordbank: an open repository for developmental vocabulary data (Michael C. Frank, Mika Braginsky, Daniel Yurovsky, Virginia A. Marchman, 2016, Journal of Child Language)
- Methodology for the creation of a linguistic database: challenges and contributions to the teaching-learning process (Raimundo Gouveia da Silva, Iandra Maria Weirich da Silva Coelho, 2020, Revista de Estudos e Pesquisas sobre Ensino Tecnológico (EDUCITEC))
- Developing ODIN: A Multilingual Repository of Annotated Language Data for Hundreds of the World's Languages (W. David Lewis, Fan Xia, 2010, Literary and Linguistic Computing)
- The Listening and Spoken Language Data Repository: Design and Project Overview (Tamala S. Bradham, Christopher Fonnesbeck, Alice E. Toll, Barbara F. Hecht, 2017, Language, Speech, and Hearing Services in Schools)
- Discourse annotation guideline for low-resource languages (Francielle Vargas, Wolfgang S. Schmeisser-Nieto, Zohar Rabinovich, Thiago Alexandre Salgueiro Pardo, Fabrício Benevenuto, 2025, Natural Language Processing)
- A Metadata Best Practice for a Scientific Data Repository (Jane Greenberg, Hollie White, Sarah Carrier, Ryan Scherle, 2009, Journal of Library Metadata)
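Language Processing Toolchains and Automated Knowledge Base Construction Methods
These works discuss the technical implementations and algorithmic models for automatically extracting knowledge from raw text or richly formatted data, converting the formats of existing corpora, and integrating multimodal data (such as sensor data) to update knowledge bases.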
- Constructing a Second Language: Analyses and Computational Simulations of the Emergence of Linguistic Constructions From Usage (Nick C. Ellis, Diane Larsen-Freeman, 2009, Language Learning)
- UML AS DOMAIN SPECIFIC LANGUAGE FOR THE CONSTRUCTION OF KNOWLEDGE-BASED CONFIGURATION SYSTEMS (Alexander Felfernig, Gerhard Friedrich, Dietmar Jannach, 2000, International Journal of Software Engineering and Knowledge Engineering)
- Automatic construction and validation of French large lexical resources. Reuse of verb theoretical linguistic descriptions (Nabil Hathout, Fiammetta Namer, 1998, Proceedings of the Language Resources and Evaluation Conference)
- Research on Knowledge Base Construction and Incremental Updating Techniques for Low-resource domains Based on Large Language Models (Zixuan Zhang, Ziyao Han, Jixuan Zhang, Zhongbao Jia, Wentao Yu, Xiaohui Chen, 2025, 2025 5th International Conference on Computer, Internet of Things and Control Engineering (CITCE))
- Knowledge management and Cultural Heritage repositories: Cross-Lingual Information Retrieval strategies (Maria Pia di Buono, Mario Monteleone, Federica Marano, Johanna Monti, 2013, 2013 Digital Heritage International Congress (DigitalHeritage))
- Fonduer (Sen Wu, Luke Hsiao, Xiao Cheng, Braden Hancock, Theodoros Rekatsinas, Philip Levis, Christopher Ré, 2018, Proceedings of the 2018 International Conference on Management of Data)
- Construction of Learning Resources for International Chinese Language Education Based on Sensor Technology and Knowledge Graphs (Yue Fu, Lei Zhao, Borui Zheng, Yirong Wang, Liqing Yang, 2026, Sensors and Materials)
- Building a Morphological Treebank for German from a Linguistic Database (Petra Steiner, Josef Ruppenhofer, 2018, Proceedings of the Language Resources and Evaluation Conference)
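Theoretical Foundations and Overviews of Linguistic Database Design
These works provide foundational reviews of computer applications in linguistic research, database modeling principles, and design guidelines, serving as surveys and guides.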
- Designing linguistic databases: A primer for linguists (Alexis Dimitriadis, Simon Musgrave, 2009, The Use of Databases in Cross-Linguistic Studies)
- MAIN TYPES OF DATABASES IN LINGUISTIC RESEARCH OF THE XXI CENTURY: FEATURES AND FUNCTIONAL PURPOSE (V. V. Hromovenko, 2021, International Humanitarian University Herald. Philology)
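This report groups research on the digitization of language resources into four core dimensions: semantic modeling with ontologies and knowledge graphs, the practical construction of dedicated language resource repositories, automated knowledge extraction and multimodal data integration techniques, and the theoretical foundations and methodological overviews of database design. Together, these works reflect a trend in the field from simple corpus storage toward intelligent, structured, and semantically enriched knowledge bases.
23 related works in total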
We focus on knowledge base construction (KBC) from richly formatted data. In contrast to KBC from text or tabular data, KBC from richly formatted data aims to extract relations conveyed jointly via textual, structural, tabular, and visual expressions. We introduce Fonduer, a machine-learning-based KBC system for richly formatted data. Fonduer presents a new data model that accounts for three challenging characteristics of richly formatted data: (1) prevalent document-level relations, (2) multimodality, and (3) data variety. Fonduer uses a new deep-learning model to automatically capture the representation (i.e., features) needed to learn how to extract relations from richly formatted data. Finally, Fonduer provides a new programming model that enables users to convert domain expertise, based on multiple modalities of information, to meaningful signals of supervision for training a KBC system. Fonduer-based KBC systems are in production for a range of use cases, including at a major online retailer. We compare Fonduer against state-of-the-art KBC approaches in four different domains. We show that Fonduer achieves an average improvement of 41 F1 points on the quality of the output knowledge base (and in some cases produces up to 1.87× the number of correct entries) compared to expert-curated public knowledge bases. We also conduct a user study to assess the usability of Fonduer's new programming model. We show that after using Fonduer for only 30 minutes, non-domain experts are able to design KBC systems that achieve on average 23 F1 points higher quality than traditional machine-learning-based KBC approaches.
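The programming model mentioned above, in which users turn domain expertise from several modalities into supervision signals, can be illustrated with a minimal sketch of labeling functions combined by majority vote. This is not Fonduer's API: the candidate fields (`text`, `in_table`, `font_bold`), the function names, and the voting rule are all hypothetical, and a real system would more likely use a learned label model than a plain vote.

```python
from collections import Counter

ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1


def lf_keyword(cand):
    """Textual signal (hypothetical): the mention names a maximum rating."""
    return POSITIVE if "max" in cand["text"].lower() else ABSTAIN


def lf_in_table(cand):
    """Tabular signal (hypothetical): values outside a table are rejected."""
    return ABSTAIN if cand["in_table"] else NEGATIVE


def lf_bold_header(cand):
    """Visual signal (hypothetical): bold text is treated as a header."""
    return NEGATIVE if cand["font_bold"] else ABSTAIN


LABELING_FUNCTIONS = [lf_keyword, lf_in_table, lf_bold_header]


def weak_label(cand):
    """Majority vote over non-abstaining labeling functions; ties abstain."""
    votes = [lf(cand) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]
```

Each function looks at one modality only, which keeps the expertise it encodes easy to review; the noisy labels returned by `weak_label` would then serve as training data for an extraction model.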
The DBpedia community project extracts structured, multilingual knowledge from Wikipedia and makes it freely available on the Web using Semantic Web and Linked Data technologies. The project extracts knowledge from 111 different language editions of
This paper addresses the core challenges in constructing knowledge bases for low-resource domains, namely the difficulty of extracting unstructured data, the scarcity of annotation resources, and inefficient knowledge updates that are prone to conflicts, and proposes a knowledge base construction and incremental updating solution based on large language models. First, a three-tier knowledge extraction mechanism is designed: “rule engine pre-screening → large model few-shot extraction → BERT-BiLSTM-CRF model refinement”, achieving efficient conversion from unstructured text to structured entity-relationship pairs. Second, a dual-structure storage architecture combining “knowledge graph + vector database” is constructed to accommodate both structured relational queries and semantic similarity retrieval. Finally, a four-tier incremental update mechanism is proposed: “real-time monitoring - differential analysis - conflict detection - targeted updates”. This incorporates a joint conflict detection algorithm combining edit distance and semantic embedding to ensure low-latency, high-consistency dynamic knowledge updates. Experimental results on datasets demonstrate that this approach achieves an entity recognition F1 score of up to 92.6% and a relation extraction F1 score of up to 89.4%. Incremental update efficiency surpasses traditional full-update methods by over 81.6%, with conflict detection accuracy reaching 95.3% and a redundancy rate of merely 5.1%. This effectively resolves critical challenges in constructing and maintaining knowledge bases for vertically sparse domains, providing high-quality knowledge support for intelligent question-answering systems in low-resource domains.
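The joint conflict detection step, which combines edit distance with semantic embeddings, might look roughly like the sketch below. The abstract does not give the scoring formula, so the equal weighting (`alpha`), the `threshold`, the record fields (`name`, `vec`, `value`), and the rule "same entity but different value" are all assumptions made for illustration.

```python
import math


def edit_distance(a, b):
    """Levenshtein distance via the classic two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def string_similarity(a, b):
    """Edit distance normalized to a similarity in [0, 1]."""
    if not a and not b:
        return 1.0
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))


def cosine(u, v):
    """Cosine similarity of two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def is_conflict(old, new, alpha=0.5, threshold=0.8):
    """Assumed rule: records that look like the same entity (by the joint
    surface + semantic score) but carry different values are in conflict."""
    score = (alpha * string_similarity(old["name"], new["name"])
             + (1 - alpha) * cosine(old["vec"], new["vec"]))
    return score >= threshold and old["value"] != new["value"]
```

The surface score catches spelling variants of one entity name, while the embedding score catches paraphrases; using both is what lets an incremental update touch only the entries that actually disagree.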
The growing volume of learning materials on the Internet brings learners a flood of educational resources, along with many difficulties in the learning process. It becomes much harder for learners to find the required resources and get the right knowledge from these unsorted resources. To give learners a better learning experience and improve the efficiency of Internet-based education, an approach is proposed to solve these problems by constructing an ontology-based knowledge base for Chinese K12 education with high accuracy and efficiency. Since a domain ontology is beneficial to knowledge representation, storage, and sharing, and a knowledge base performs well on knowledge retrieval, learners who use applications built on the knowledge base will benefit in the learning process.
We developed a method for updating and constructing Chinese language teaching resources by integrating sensor technology and knowledge graphs. The method addresses the challenges of a mismatch between resource supply and learner demand through a closed loop of perception-analysis-update. Wearable sensors and AI cameras were used to collect real-time, quantitative multimodal data on learners' physiological and behavioral states, including body temperature, heart rate, and classroom interactions. These sensor data, along with learning platform data, were used to train an eXtreme Gradient Boosting model. The model achieved a prediction accuracy of 89.2% and an area under the receiver operating characteristic curve of 0.93, indicating the accurate distinction of knowledge entities that need updating and those that do not. The feature importance analysis revealed that user ratings (0.603) and recency (0.226) were the most influential factors for predicting update necessity. The knowledge graph was iteratively updated through a multistep process including pattern mining, filtering, and a final review by experts. The resulting knowledge graph incorporated nodes for new content, such as internet slang and cross-cultural variations in festivals, demonstrating the method's ability to adapt to linguistic evolution and cultural nuances. Through the establishment of a closed-loop architecture, multimodal sensor data, including physiological photoplethysmography signals and behavioral time-of-flight imaging, are used for the expansion and weight adjustment of a domain-specific knowledge graph. This cognitive-aware update mechanism ensures that learning resources evolve along with real-time learner demands, providing a scalable blueprint for intelligent, sensor-driven knowledge management systems in various disciplines. The results of this study also underscore the role of sensor data in developing contextualized, personalized, and optimized digital learning resources that lead to a learner-centered learning environment.
In the AI era, digital learning resources have become fundamental for the quality enhancement of Chinese language education (6). Research on effective Chinese language education has been extensively conducted (7), in which classification systems of Chinese learning resources have been developed to construct learning resources and databases (8). With the application of wearable sensors, edge computing devices, IoT, and knowledge graphs, learning resources can be further developed on the basis of new technological approaches and paradigms. Knowledge graphs, in the form of a formal semantic network of nodes, edges, and attributes, are used to model Chinese characters, vocabulary, grammar, cultural concepts, and their dynamic interrelations, for the scalable organization of learning resources (9). Sensors are employed to capture cognitive, emotional, and behavioral signals of learners in real or virtual contexts through data collection at a millisecond-level interval, enabling high-resolution and contextualized resource updates. While knowledge graphs are used to present static knowledge, sensor data are used to analyze learning patterns and assess the learner's attitude and responses. However, knowledge graphs and sensor data have been used separately, which hinders the integration required to construct a closed-loop framework that enables a positive feedback cycle among the knowledge graph, sensor perception, and learning resources. Therefore, we studied how to leverage the synergy of knowledge graphs and sensor technology in constructing Chinese language learning resources through the real-time collection of multimodal data on learners' physiology, behavior, and cognition, and the data analytics through adaptive iteration and personalized adaptation. In this study, a sensor-driven cognitively adaptive system and its underlying architecture were constructed for real-time data perception using wearable and ambient sensors, predictive cognitive modeling, and automated knowledge
It is a commonplace, by now, to refer to the recent explosive growth in the power and availability of computers as an information revolution. The most casual of computer users, linguists included, have at their fingertips an enormous amount of computing power. Tasks such as writing a document or playing
… In this paper the linguistic design of the database under construction within the EuroWordNet project is described. This is mainly structured along the same lines as the Princeton Word…
This paper presents theoretical and methodological questions related to the creation of a Linguistic Database, made up of samples from the Cazumbá Iracema Extractive Reserve, located in the state of Acre, and discusses the main challenges found and contributions to the teaching and learning process of Portuguese. The methodology for collecting and organizing this database is based on the theoretical assumptions of sociolinguistic patterns, the empirical foundations of the Theory of Linguistic Variation and Change, and the methodology for collecting and manipulating data in sociolinguistics. The implementation of the proposal involves the use of software that can be used in education. The results show how the use of this sample contributes to the creation of teaching proposals focusing on the language in use, to the identification of the sociocultural factors that influence the emergence and permanence of linguistic variation, and to research in the scope of natural languages.
We address in this paper some problems related to the reuse for NLP of LADL's Lexicon-Grammar (LG). This major source of lexical knowledge about French verbs has been publicly available on the Internet for several years. However, it has not been used by the NLP community, mainly because of its format: ASCII files, each containing a table with binary values (+/−). The interpretation of these tables is nontrivial because large parts of the linguistic information they contain are neither explicit nor represented in a uniform manner. The paper presents three aspects of the research: (1) The translation of LG into a PATR-II Intermediate Lexicon (IL). The aim of this translation is to normalize and to represent explicitly the lexical properties encoded in LG tables. IL representations are independent of any particular linguistic theory. (2) IL is used to generate lexicons for NLP applications based on unification grammars. We have built an HPSG lexicon used within the ALEP system to parse French, and a TAG lexicon used for French text generation. These lexicons are dual of one another since, for each entry, the first represents the properties that hold while the latter represents the ones that do not hold. The generation of these lexicons raises interesting questions regarding the lexicon organization in these theories. (3) The evaluation of LG coverage on a corpus. This evaluation uses a French shallow parser able to recognize quite precisely the constituents that the verbs take as arguments. The lexical descriptions of the verbs can then be saturated in order to recognize the phrases headed by these verbs.
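The normalization idea, turning a table of binary marks into explicit per-verb properties and then deriving two dual lexicons, can be sketched as follows. The file layout (semicolon-separated, one header row), the column names, and the sample marks are hypothetical stand-ins rather than the actual LADL format or data, and plain dictionaries stand in for the PATR-II representation the paper uses.

```python
# Hypothetical table: a header row, then one verb per row with +/- marks.
SAMPLE = """
verb;N0_human;N1_que_P;passive
dire;+;+;+
dormir;+;-;-
"""


def parse_lg_table(text, sep=";"):
    """Make each +/- mark explicit as a named boolean property per verb."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    properties = lines[0].split(sep)[1:]
    lexicon = {}
    for ln in lines[1:]:
        cells = ln.split(sep)
        verb, marks = cells[0], cells[1:]
        if len(marks) != len(properties):
            raise ValueError(f"row for {verb!r} has {len(marks)} marks, "
                             f"expected {len(properties)}")
        lexicon[verb] = {p: m.strip() == "+" for p, m in zip(properties, marks)}
    return lexicon


def dual_lexicons(lexicon):
    """Split into the properties that hold and those that do not hold."""
    holds = {v: sorted(p for p, ok in feats.items() if ok)
             for v, feats in lexicon.items()}
    fails = {v: sorted(p for p, ok in feats.items() if not ok)
             for v, feats in lexicon.items()}
    return holds, fails
```

Calling `dual_lexicons(parse_lg_table(SAMPLE))` returns the two complementary views of the same entries, mirroring the abstract's remark that one generated lexicon lists the properties that hold and the other those that do not.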
German is a language with complex morphological processes. Its long and often ambiguous word forms present a bottleneck problem in natural language processing. As a step towards morphological analyses of high quality, this paper introduces a morphological treebank for German. It is derived from the linguistic database CELEX, which is a standard resource for German morphology. We build on its refurbished, modernized and partially revised version. The derivation of the morphological trees is not trivial, especially for such cases of conversions which are morpho-semantically opaque and merely of diachronic interest. We develop solutions and present exemplary analyses. The resulting database comprises about 40,000 morphological trees of a German base vocabulary whose format and grade of detail can be chosen according to the requirements of the applications. The Perl scripts for the generation of the treebank are publicly available on GitHub. In our discussion, we show some future directions for morphological treebanks. In particular, we aim at the combination with other reliable lexical resources such as GermaNet.
… dictionaries using databases on personal computers: construction of an abstract model of this … database; filling the lexicographic database; converting the lexicographic database to the …
Automatic knowledge graph construction aims at manufacturing structured human knowledge. To this end, much effort has historically been spent extracting informative fact patterns from different data sources. However, more recently, research interest has shifted to acquiring conceptualized structured knowledge beyond informative data. In addition, researchers have also been exploring new ways of handling sophisticated construction tasks in diversified scenarios. Thus, there is a demand for a systematic review of paradigms to organize knowledge structures beyond data-level mentions. To meet this demand, we comprehensively survey more than 300 methods to summarize the latest developments in knowledge graph construction. A knowledge graph is built in three steps: knowledge acquisition, knowledge refinement, and knowledge evolution. The processes of knowledge acquisition are reviewed in detail, including obtaining entities with fine-grained types and their conceptual linkages to knowledge graphs; resolving coreferences; and extracting entity relationships in complex scenarios. The survey covers models for knowledge refinement, including knowledge graph completion, and knowledge fusion. Methods to handle knowledge evolution are also systematically presented, including condition knowledge acquisition, condition knowledge graph completion, and knowledge dynamic. We present the paradigms to compare the distinction among these methods along the axis of the data environment, motivation, and architecture. Additionally, we also provide briefs on accessible resources that can help readers to develop practical knowledge graph systems. The survey concludes with discussions on the challenges and possible directions for future exploration.
In this paper, we propose a method for automatically constructing subject knowledge graphs for educational applications. The subject knowledge graph is constructed from educational big data using a bootstrapping strategy that gradually expands the knowledge points and the connections between them. Two different datasets are used. One consists of subject teaching resources such as syllabuses, teaching plans, and textbooks, which are used to automatically construct the core of the subject knowledge graph so as to reduce the dependence on manual annotation; the high quality of these teaching resources guarantees the accuracy of the knowledge graph core. The other dataset consists of massive Internet encyclopedia texts, which are used to expand and complete the subject knowledge graph. For the algorithm, this paper uses the BERT-BiLSTM-CRF model to automatically identify subject knowledge points, and then evaluates the relationships between knowledge points by calculating their semantic similarity, pointwise mutual information (PMI), and Normalized Google Distance. The experimental results show that BERT-BiLSTM-CRF significantly outperforms the baselines, and that the three relationship evaluation models achieve good results. Finally, computer science and physics are taken as examples to successfully construct subject knowledge graphs, which shows the effectiveness of our method.
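Two of the relationship measures named above, PMI and Normalized Google Distance, have standard count-based definitions, sketched here. How the paper obtains its counts or thresholds the scores is not stated in the abstract, so the parameter names and the idea of counting documents that mention each knowledge point are assumptions.

```python
import math


def pmi(count_x, count_y, count_xy, total):
    """Pointwise mutual information: log2(p(x, y) / (p(x) * p(y))).

    Counts are numbers of documents (out of `total`) mentioning x, y,
    or both; returns -inf when the terms never co-occur.
    """
    if min(count_x, count_y, count_xy) <= 0:
        return float("-inf")
    return math.log2((count_xy * total) / (count_x * count_y))


def normalized_google_distance(count_x, count_y, count_xy, total):
    """NGD = (max(log fx, log fy) - log fxy) / (log N - min(log fx, log fy)).

    Smaller values mean the two terms are more closely related.
    """
    if min(count_x, count_y, count_xy) <= 0:
        return float("inf")
    if total <= min(count_x, count_y):
        raise ValueError("total must exceed the smaller term count")
    lx, ly, lxy = math.log(count_x), math.log(count_y), math.log(count_xy)
    return (max(lx, ly) - lxy) / (math.log(total) - min(lx, ly))
```

PMI rewards pairs that co-occur more often than chance would predict, while NGD behaves like a distance, so a pipeline using both would need opposite thresholds (high PMI, low NGD) to accept a candidate link between two knowledge points.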
Purpose: The purpose of the Listening and Spoken Language Data Repository (LSL-DR) was to address a critical need for a systemwide outcome data-monitoring program for the development of listening and spoken language skills in highly specialized educational programs for children with hearing loss highlighted in Goal 3b of the 2007 Joint Committee on Infant Hearing position statement supplement. Method: The LSL-DR is a multicenter, international data repository for recording and tracking the demographics and longitudinal outcomes achieved by children who have hearing loss who are enrolled in private, specialized programs focused on supporting listening and spoken language development. Since 2010, annual speech-language-hearing outcomes have been prospectively obtained by qualified clinicians and teachers across 48 programs in 4 countries. Results: The LSL-DR has been successfully implemented, bringing together the data collection efforts of these programs to create a large and diverse data repository of 5,748 children with hearing loss. Conclusion: Due to the size and diversity of the population, the range of assessments entered, and the demographic information collected, the LSL-DR will provide an unparalleled opportunity to examine the factors that influence the development of listening in spoken language in this population.
In this article, we review the process of building ODIN, the Online Database of Interlinear Text (http://odin.linguistlist.org), a multilingual repository of linguistically analyzed language data. ODIN is built from interlinear text that has been harvested from scholarly linguistic documents posted on the web. At the time of this writing, ODIN holds nearly 190,000 instances of interlinear text representing annotated language data for more than 1,000 languages (representing data from >10% of the world's languages). ODIN's charter has been to make these data available to linguists and other language researchers via search, providing the facility to find instances of language data and related resources (i.e. the documents from which data were extracted) by language name, language family, and even annotations used to mark up the data (e.g. NOM, ACC, ERG, PST, 3SG). Further, we have sought to enrich the data we have collected and extract ‘knowledge’ from the enriched content. To enrich the data, we use a variety of statistical tagging and parsing methods applied to the English translations. An enhanced search facility allows users to find data across languages for a variety of syntactic constructions and constituent orders, facilitating unprecedented automated and online discovery of language data.
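Search by language name and by gloss annotation, as described above, can be pictured with a small in-memory sketch. The `IGT` record layout, the tag-matching regular expression, and the two example sentences are illustrative assumptions and are not taken from ODIN's schema or holdings.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class IGT:
    """One instance of interlinear glossed text (hypothetical layout)."""
    language: str
    text: str
    gloss: str
    translation: str

    def tags(self):
        """Grammatical tags in the gloss line, e.g. NOM, ACC, PST, 3SG."""
        return set(re.findall(r"\b\d?[A-Z]{2,}\b", self.gloss))


# Illustrative examples written for this sketch.
CORPUS = [
    IGT("German", "Der Mann sieht den Hund",
        "the.NOM man sees the.ACC dog", "The man sees the dog."),
    IGT("Japanese", "Taroo-ga hon-o yon-da",
        "Taro-NOM book-ACC read-PST", "Taro read a book."),
]


def search(instances, language=None, tag=None):
    """Filter instances by language name and/or by a gloss tag."""
    return [i for i in instances
            if (language is None or i.language == language)
            and (tag is None or tag in i.tags())]
```

For example, `search(CORPUS, tag="PST")` keeps only the Japanese instance, while `search(CORPUS, tag="NOM")` returns both, which is the kind of cross-language lookup that annotation-based search makes possible.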
The MacArthur-Bates Communicative Development Inventories (CDIs) are a widely used family of parent-report instruments for easy and inexpensive data-gathering about early language acquisition. CDI data have been used to explore a variety of theoretically important topics, but, with few exceptions, researchers have had to rely on data collected in their own lab. In this paper, we remedy this issue by presenting Wordbank, a structured database of CDI data combined with a browsable web interface. Wordbank archives CDI data across languages and labs, providing a resource for researchers interested in early language, as well as a platform for novel analyses. The site allows interactive exploration of patterns of vocabulary growth at the level of both individual children and particular words. We also introduce wordbankr, a software package for connecting to the database directly. Together, these tools extend the abilities of students and researchers to explore quantitative trends in vocabulary development.
Most existing discourse annotation guidelines have focused on the English language. As a result, there is a significant lack of research and resources concerning computational discourse-level language understanding and generation for other languages. To fill this relevant gap, we introduce the first discourse annotation guideline using the rhetorical structure theory (RST) for low-resource languages. Specifically, this guideline provides accurate examples of discourse coherence relations in three Romance languages: Italian, Portuguese, and Spanish. We further discuss theoretical definitions of RST and compare different artificial intelligence discourse frameworks, hence offering a reliable and accessible survey to new researchers and annotators.
In the last decade, linguists have started to develop and make use of ontologies, encouraged by the progress made in areas such as Artificial Intelligence and the Semantic Web. This paper gives an overview of notions and dimensions of “ontology” and of ontologies for and in linguistics. It discusses building blocks, design aspects, and capabilities of formal ontologies and provides some implementation pointers. The focus of this paper, however, is on linguistic research and what a modelling framework based on ontologies has to offer. Accordingly, the paper does not aim at providing an overview of specific models for computational processing. To illustrate the issues at hand, an example scenario from linguistic typology is selected instead, where the aim of describing the world's languages is approached through ontologies.
In recent years, important initiatives such as the development of the European Library and Europeana have aimed to increase the availability of cultural content from various types of providers and institutions. Access to these resources requires the development of environments that make it possible both to manage multilingual complexity and to preserve semantic interoperability. The creation of Natural Language Processing (NLP) applications is aimed at achieving Cross-Lingual Information Retrieval (CLIR). This paper presents ongoing research on language processing based on the Lexicon-Grammar (LG) approach, with the goal of improving knowledge management in Cultural Heritage repositories. The proposed framework aims to guarantee interoperability between multilingual systems in order to overcome crucial issues like cross-language and cross-collection retrieval. Indeed, the LG methodology tries to overcome the shortcomings of statistical approaches, as in Google Translate or Bing by Microsoft, concerning Multi-Word Unit (MWU) processing in queries, where the lack of linguistic context represents a serious obstacle to disambiguation. In particular, translations concerning specific domains, as has been widely recognized, are unambiguous, since the meanings of terms are mono-referential and the type of relation that links a given term to its equivalent in a foreign language is biunivocal, i.e. a one-to-one coupling which causes this relation to be exclusive and reversible. Ontologies are used in CLIR and are considered by several scholars a promising research area to improve the effectiveness of Information Extraction (IE) techniques, particularly for technical-domain queries. Therefore, we present a methodological framework which makes it possible to map both the data and the metadata among the language-specific onto
Natural language processing makes a very significant contribution to various application areas such as multilingual big data, information retrieval, data integration and the multilingual web. However, handling linguistic knowledge to develop such lingware applications is a crucial issue, especially for linguistically novice users. To deal with this issue, a "smart" linguistic knowledge management may help users to understand the meaning, scope and especially the use of related techniques and algorithms. In this paper, (1) we propose a semantic processing of linguistic knowledge based on a multilingual linguistic domain ontology, called LingOnto. Compared to related work, LingOnto handles not only linguistic data, but also linguistic processing functionalities and linguistic processing features. Besides, it allows, via a reasoning engine, inferring new linguistic knowledge and assisting in the process of proposing lingware applications. This is particularly useful for novice users, but can also provide new perspectives for expert ones. LingOnto covers the French, English and Arabic languages. (2) We also propose an assisted, user-friendly ontology visualization tool called LingGraph. It facilitates the interaction with LingOnto. It offers an easy-to-use interface for users not familiar with ontologies. It is based on a SPARQL pattern-based approach that provides a smart search functionality, visualizing only the ontological view corresponding to the user's needs and preferences. In order to evaluate LingOnto, we apply it to a framework for identifying valid natural language processing pipelines. Finally, we give the results of the experiments carried out.
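The idea behind pattern-based search, selecting only the part of an ontology that matches a user's need, can be sketched with a tiny triple-pattern matcher that mimics a SPARQL basic graph pattern. It is a pure-Python stand-in, not a SPARQL engine, and the triples and vocabulary (`type`, `supportsLanguage`, `requires`, the class and instance names) are hypothetical rather than taken from LingOnto.

```python
# Hypothetical toy triples; not LingOnto's actual vocabulary or content.
TRIPLES = [
    ("Tokenization", "type", "ProcessingFunctionality"),
    ("Tokenization", "supportsLanguage", "French"),
    ("POSTagging", "type", "ProcessingFunctionality"),
    ("POSTagging", "supportsLanguage", "Arabic"),
    ("POSTagging", "requires", "Tokenization"),
]


def match(pattern, triple, binding):
    """Extend `binding` so `pattern` equals `triple`, or return None.

    Pattern terms starting with '?' are variables, as in SPARQL.
    """
    new = dict(binding)
    for p, t in zip(pattern, triple):
        if p.startswith("?"):
            if new.setdefault(p, t) != t:
                return None
        elif p != t:
            return None
    return new


def query(triples, patterns):
    """Evaluate a conjunction of triple patterns (a basic graph pattern)."""
    bindings = [{}]
    for pattern in patterns:
        extended = []
        for b in bindings:
            for t in triples:
                b2 = match(pattern, t, b)
                if b2 is not None:
                    extended.append(b2)
        bindings = extended
    return bindings
```

For instance, `query(TRIPLES, [("?f", "type", "ProcessingFunctionality"), ("?f", "supportsLanguage", "French")])` yields a single binding for `?f`, which a visualization layer could use to display only the matching portion of the graph.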
In many domains, software development has to meet the challenge of developing highly adaptable software very rapidly. In order to accomplish this task, domain-specific, formal description languages and knowledge-based systems are employed. From the viewpoint of the industrial software development process, it is important to integrate the construction and maintenance of these systems into standard software engineering processes. In addition, the descriptions should be comprehensible for the domain experts in order to facilitate the review process. For the realization of product configuration systems, we show how these requirements can be met by using a standard design language (UML, the Unified Modeling Language) as notation in order to simplify the construction of a logic-based description of the domain knowledge. We show how classical description concepts for expressing configuration knowledge can be introduced into UML and be translated into logical sentences automatically. These sentences are exploited by a general inference engine solving the configuration task.
This article presents an analysis of interactions in the usage, structure, cognition, coadaptation of conversational partners, and emergence of linguistic constructions. It focuses on second language development of English verb‐argument constructions (VACs: VL, verb locative; VOL, verb object locative; VOO, ditransitive) with particular reference to the following: (a) Construction learning as concept learning following the general cognitive and associative processes of the induction of categories from experience of exemplars in usage obtained through coadapted micro‐discursive interaction with conversation partners; (b) the empirical analysis of usage by means of corpus linguistic descriptions of native and nonnative speech and of longitudinal emergence in the interlanguage of second language learners; (c) the effects of the frequency and Zipfian type/token frequency distribution of exemplars within the Verb and other islands of the construction archipelago (e.g., [Subj V Obj Obl path/loc ]), by their prototypicality, their generic coverage, and their contingency of form‐meaning‐use mapping, and (d) computational (emergent connectionist) models of these various factors as they play out in the emergence of constructions as generalized linguistic schema.
Digital data repositories ought to support immediate operational needs and long-term project goals. This paper presents the Dryad repository's metadata best practice balancing of these two needs. The paper reviews background work exploring the meaning of science, characterizing data, and highlighting data curation metadata challenges. The Dryad repository is introduced, and the initiative's metadata best practice and underlying rationales are described. Dryad's metadata approach includes two prongs: one addressing the long-term goal to align with the Semantic Web via a metadata application profile; and another addressing the immediate need to make content available in DSpace via an extensible markup language (XML) schema. The conclusion summarizes limitations and advantages of the two prongs underlying Dryad's metadata effort. Keywords: metadata, scientific data, Dublin Core Application Profile, Singapore Framework, Semantic Web.
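This report organizes research on the digital construction of language resources into four core dimensions: semantic modeling with ontologies and knowledge graphs; the practical construction of dedicated language resource repositories; automated knowledge extraction and multimodal data integration; and the theoretical foundations and general methodology of database design. Taken together, these studies reflect a trend in the field away from simple corpus storage and toward intelligent, structured, and semantically rich knowledge bases.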