Nexus

About NEXUS

NEXUS: Legal Risk Database for Open‑source Datasets

Legal risks associated with AI training datasets cannot be properly understood by relying on their stated license terms alone. It is crucial to trace a dataset's redistribution and full lifecycle to assess its legal risks. However, even legal experts find it extremely difficult to manually sift through the vast amount of information surrounding AI training datasets to identify source-related details, read them accurately, and interpret their legal risks. Tracking provenance, verifying redistribution rights, and assessing evolving legal exposure across lifecycle stages require a level of precision and scale that exceeds human capability. To overcome these limitations, we have developed AutoCompliance, an AI agent for NEXUS that emulates how legal experts review datasets and automatically identifies the full spectrum of legal risks in AI training datasets.

NEXUS is an AI Compliance Database designed to deliver legal risk assessments generated by this agent. We are building a global repository of legal risk assessments for open-source datasets, with the goal of supporting transparent, responsible, and legally compliant AI development.

For more detailed information, please refer to our paper: Do Not Trust Licenses You See: Dataset Compliance Requires Massive-Scale AI-Powered Lifecycle Tracing

Q. How does NEXUS explore the data sources (dependencies) of a dataset?
A. When a dataset URL is submitted for review, AutoCompliance begins by identifying the dataset's immediate components. For each discovered component, the agent recursively searches across various web sources to determine whether it, too, contains additional underlying data sources. This recursive process continues until no further subcomponents can be found. The result is a structured representation of all underlying data sources that constitute the dataset. Each identified data source is categorized into one of six types:
- Dataset. When data is imported directly, in part or in whole, from an existing dataset, or is reconstructed from one for a specific purpose, the dataset used is defined as the dependency.
- Service Provider (Contents). When data is sourced from a service that directly produces and provides specific content, including copyrighted material, the content service provider is defined as the dependency. For example, this covers data created by content providers.
- Service Provider (Platform). Platforms that do not directly create and supply content but provide functionality for collecting or modifying data are defined as dependencies. This includes cases where data is collected through a platform.
- Underspecified. When data or content is described only in general terms rather than by reference to specific data, the description is defined as Underspecified. For instance, a source described simply as “a collection of books gathered from websites” is treated as collecting copyrighted works through crawling.
- AI Model. When a specific AI model is used to generate, augment, or transform data for the dataset, the AI model used is defined as the dependency.
- Software/API. When a specific tool is used to collect, preprocess, process, or modify data, the software or API used is defined as the dependency.
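The recursive exploration described above can be pictured with a minimal Python sketch. It is an illustration only, not AutoCompliance's actual implementation: the names `SourceType`, `SourceNode`, `trace_sources`, and the toy lookup table are all hypothetical, and the real agent searches web sources rather than a dictionary. The cycle guard is likewise an assumption added so the sketch always terminates.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Iterable


class SourceType(Enum):
    """The six data source types listed in the answer above."""
    DATASET = "Dataset"
    SERVICE_PROVIDER_CONTENTS = "Service Provider (Contents)"
    SERVICE_PROVIDER_PLATFORM = "Service Provider (Platform)"
    UNDERSPECIFIED = "Underspecified"
    AI_MODEL = "AI Model"
    SOFTWARE_API = "Software/API"


@dataclass
class SourceNode:
    """One node in the structured representation of a dataset's sources."""
    name: str
    source_type: SourceType
    children: list["SourceNode"] = field(default_factory=list)


# Hypothetical lookup: given a source name, return its immediate components.
Finder = Callable[[str], Iterable[tuple[str, SourceType]]]


def trace_sources(name: str, source_type: SourceType, find_components: Finder,
                  _path: frozenset = frozenset()) -> SourceNode:
    """Recursively expand a source until no further subcomponents are found."""
    node = SourceNode(name, source_type)
    if name in _path:
        # Assumption: stop if a source (indirectly) depends on itself.
        return node
    for child_name, child_type in find_components(name):
        node.children.append(
            trace_sources(child_name, child_type, find_components, _path | {name})
        )
    return node


# Toy stand-in for the web search step; all names here are invented.
TOY_SOURCES = {
    "corpus-a": [("corpus-b", SourceType.DATASET),
                 ("some-llm", SourceType.AI_MODEL)],
    "corpus-b": [("news-site", SourceType.SERVICE_PROVIDER_CONTENTS)],
}


def toy_finder(name: str):
    return TOY_SOURCES.get(name, [])


tree = trace_sources("corpus-a", SourceType.DATASET, toy_finder)
```

Here `tree` holds "corpus-a" at the root, with "corpus-b" and "some-llm" as children and "news-site" nested under "corpus-b". Recursion stops at any source for which the lookup returns nothing.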

Q. How does NEXUS assess the legal risk of a dataset?
A. NEXUS collects not only the stated (surface-level) license of a dataset but also the licenses and other risk determinants of its underlying data sources that are relevant to legal risk assessment. Based on this information, AutoCompliance evaluates the dataset against 18 predefined legal risk criteria, assigning each criterion a score from 1 to 5 according to a structured assessment framework. These scores are then aggregated using a proprietary formula to calculate an overall legal risk Class for the dataset. Details about the classes, the list of the 18 criteria, and the scoring methodology can be found in the Legal Framework of Data Compliance.
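As a rough illustration of the shape of this scoring step, the sketch below validates 18 scores in the 1 to 5 range and passes them to an aggregation function. Because the actual formula is proprietary and the Class definitions live in the Legal Framework of Data Compliance, the aggregator is left as a parameter. The names `validate_scores`, `assess`, and `placeholder_mean` are hypothetical, and the plain average is only a stand-in, not the NEXUS formula.

```python
from typing import Callable, Sequence

NUM_CRITERIA = 18              # number of legal risk criteria stated above
MIN_SCORE, MAX_SCORE = 1, 5    # each criterion is scored from 1 to 5


def validate_scores(scores: Sequence[int]) -> list[int]:
    """Check that there is exactly one integer score in range per criterion."""
    if len(scores) != NUM_CRITERIA:
        raise ValueError(f"expected {NUM_CRITERIA} scores, got {len(scores)}")
    for index, score in enumerate(scores, start=1):
        if not isinstance(score, int) or not MIN_SCORE <= score <= MAX_SCORE:
            raise ValueError(f"criterion {index}: score {score!r} is out of range")
    return list(scores)


def assess(scores: Sequence[int],
           aggregate: Callable[[Sequence[int]], float]) -> float:
    """Validate the scores, then apply the supplied aggregation function."""
    return aggregate(validate_scores(scores))


def placeholder_mean(scores: Sequence[int]) -> float:
    # Stand-in only: the real aggregation formula is proprietary.
    return sum(scores) / len(scores)


example_scores = [3] * 17 + [5]   # invented values for illustration
overall = assess(example_scores, placeholder_mean)
```

Keeping the aggregator as a parameter makes it explicit that the mapping from the 18 scores to a Class is defined elsewhere and is not reproduced here.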

Q. How can I trust the results provided by NEXUS?
A. We make every effort to ensure accuracy. Although AutoCompliance generally achieves higher precision than human reviewers, all findings are additionally checked by licensed attorneys across the relevant jurisdictions for an extra layer of reliability. However, producing perfectly error-free output remains beyond current limits. We therefore cannot guarantee absolute accuracy, and we ask for your understanding regarding this limitation. You can check whether a dataset has been reviewed by the agent or by a human expert by referring to its Progress status. Please note that human expert reviews are not yet performed on a fixed schedule, but we are continuously working to expand and accelerate this process to provide verified results as promptly as possible.

NEXUS is an open-source DB committed to transparently sharing the compliance risks it collects. If you require more in-depth legal analysis, or are interested in B2B AI compliance consulting and system implementation, please feel free to contact us at legal@lgresearch.ai. We’re always open to collaboration.