English
What is it?
Tianchi Dataset helps to improve accessibility, reproducibility, and responsible use of a particular dataset artifact. To ensure proper referencing and usage of a dataset, information regarding its licensing is important.
A dataset license should clearly describe how to properly and responsibly use a dataset. It’s intended to help guide the user of the dataset in making informed decisions about how the dataset can and cannot be used.
Common Dataset Licences
Similar to licenses used for software products, licenses applied to datasets also vary in restrictions and permissions. Below are some of the main types of dataset licenses used in the field of machine learning. The summaries for the CC licenses provided below are not a substitute for the license. Please refer to the official licensing guide for complete guidance.
- Public domain (CC0) - is a public dedication tool, which allows creators to give up their copyright and put their works into the worldwide public domain. CC0 allows reusers to distribute, remix, adapt, and build upon the material in any medium or format, with no conditions.
- CC BY - This license allows reusers to distribute, remix, adapt, and build upon the material in any medium or format, so long as attribution is given to the creator. The license allows for commercial use.
- CC BY-SA - This license allows reusers to distribute, remix, adapt, and build upon the material in any medium or format, so long as attribution is given to the creator. The license allows for commercial use. If you remix, adapt or build upon the material, you must license the modified material under identical terms.
- CC BY-NC - This license allows reusers to distribute, remix, adapt, and build upon the material in any medium or format for non-commercial purposes only, and only so long as attribution is given to the creator.
- CC BY-NC-SA - This license allows reusers to distribute, remix, adapt, and build upon the material in any medium or format for non-commercial purposes only, and only so long as attribution is given to the creator. If you remix, adapt or build upon the material, you must license the modified material under identical terms.
- CC BY-ND - This license allows reusers to copy and distribute the material in any medium or format in unadapted form only, and only so long as attribution is given to the creator. The license allows for commercial use.
- CC BY-NC-ND - This license allows reusers to copy and distribute the material in any medium or format in unadapted form only, for noncommercial purposes only, and only so long as attribution is given to the creator.
- MIT - This license is primarily used for licensing software products. See official license details here.
- GPL - This license is a free, copyleft license for software and other kinds of works. See official license details here.
- BSD-3-Clause - This license is primarily used for licensing software products. See official license details here.
- Custom - a non-standard license. Check the link to the license to learn more about the exact terms of usage. In some cases, we include a brief summary in parenthesis, e.g. Custom (non-commercial) is a custom license stipulating non-commercial usage.
- Unknown - refers to a dataset for which a license is unknown or hasn’t been annotated yet. Check the associated paper and dataset website for more information.
How to license your work?
- For CC-licensing it is as straightforward as you only need to choose the appropriate license that fits your needs. This must be communicated clearly, including a direct link to the license you have chosen. An example could be: This work is licensed under a CC BY 4.0 license. See official instructions here.
- For all other types of licenses, used primarily for software products like MIT, GPL, and Apache License 2.0, the instructions vary. Find more information about such licenses in the Open Source Initiative documentations.
Common Pitfalls
Regarding dataset licensing, here are some common pitfalls that may occur:
- Annotation-only licenses. In some cases, especially in Computer Vision, it’s common for only the annotations to be released under a free license, but images to remain proprietary (e.g. owned by Flickr). Check the license terms to understand what is covered and what is not.
- Code-only licenses. In some cases, only the code to e.g. scrape the data is available. While the code might be open source and free license, the resulting data may still not be made open and available for everything.
- Data ownership. Check the description of the dataset, as well as how it was collected, to determine whether you have permission to use the dataset for your purposes.
中文
什么是数据集许可证?
天池数据集致力于增进数据集的可访问性、可重复性和使用合法性。为了确保数据集被正确引用和使用,数据集的许可信息就很重要了。
数据集许可证旨在帮助和指导使用数据集的用户如何正确地使用数据集。
通用数据集许可证
与用于软件产品的许可证类似,应用于数据集的许可证在限制和权限方面也各不相同。以下是机器学习领域中使用的一些主要类型的数据集许可证。
- Public domain (CC0) - 允许创作者放弃其版权并将其作品放入全球公共领域。 CC0 允许使用者无条件地以任何媒体或格式分发、重新混合、改编和构建。
- CC BY - 该许可证允许使用者以任何媒体或格式分发、重新混合、改编和构建,只是数据集归其创作者所属;该许可证允许用于商业用途。
- CC BY-SA - 该许可证允许使用者以任何媒体或格式分发、重新混合、改编和构建,只要向创作者提供署名即可;该许可证允许用于商业用途;如果使用者重新混合、改编或构建,则必须在相同条款下许可修改后的内容。
- CC BY-NC - 该许可允许使用者以任何媒体或格式分发、重新混合、改编和构建,仅用于非商业目的,并且在只归属于创作者情况下。
- CC BY-NC-SA - 该许可允许使用者以任何媒体或格式分发、重新混合、改编和构建,仅用于非商业目的,并且只有在归属于创作者的情况下;如果使用者重新混合、改编或构建,则必须在相同条款下许可修改后的材料。
- CC BY-ND - 该许可仅允许使用者以任何媒体或格式以未经改编的形式复制和分发材料,并且仅当归属于创作者时;该许可证允许用于商业用途。
- CC BY-NC-ND - 该许可允许使用者以任何媒体或格式以未经改编的形式复制和分发材料,仅用于非商业目的,并且仅当归属于创作者时。
- GPL - 此许可证是软件和其他类型作品的免费版权许可。
- Custom - 非标准许可证。在某些情况下,我们会在括号中包含一个简短的摘要,例如自定义(非商业)是规定非商业用途的自定义许可证。
- Unknown - 指的是许可证未知或尚未注释的数据集。
如何授权您的作品?
常见问题
- 仅注释许可证:在某些情况下,尤其是在计算机视觉中,通常只在免费许可下发布注释,但图像仍然是专有的,您可以查看许可条款以了解涵盖的内容和不涵盖的内容。
- 纯代码许可证:在某些情况下,只有用于例如抓取数据的代码可用。虽然代码可能是开源的并且是免费许可的,但生成的数据可能仍然无法公开并可供所有人使用。
- 数据所有权:检查数据集的描述及其收集方式,以确定您是否有权将数据集用于您的目的。