Automated Evaluation of Scientific Writing Data Set

License

The Dataset is released for non-commercial purposes under the following license agreement:

By downloading this dataset and license, this license agreement is entered into, effective this date, between you, the Licensee, and VTeX, the Licensor.
Automated Evaluation of Scientific Writing Data Set is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/4.0/.
Several restrictions are applied in addition to the CC-BY-NC-SA 4.0 License:
- It is not allowed to do any kind of direct or indirect data engineering or data mining for the detection of personal identity of texts.
- If Licensee (i.e., you) has recognized personal identity of samples, Licensee is not allowed to bring this information to public or share with other persons.
- Licensee is not allowed to enrich the original data-set or make derivative data, which can facilitate in the detection of personal identity of text samples, and this information should be withdrawn from any derivative data immediately.
The Licensee shall acknowledge use of the licensed dataset in all publications of research based on it, in whole or in part, through citation of the following dataset: Vidas Daudaravicius. (2016). Automated Evaluation of Scientific Writing Data Set (Version 1.2) [Data file]. VTeX.

Download

You may download the AESW Dataset if you agree to the license.

The Official Data set (Version 1.2):

Supplementary data:
- Training: tokens (49MB), POS (20MB), CFG (65MB), DEP (186MB)
- Development: tokens (6MB), POS (3MB), CFG (8MB), DEP (23MB)
- Test (without edits): tokens (5MB), POS (2MB), CFG (6MB), DEP (17MB)
Tokens, POS, and CFG data have the following format: [TYPE] TAB [ID] TAB DATA, where TYPE is -1 (before editing), 0 (not edited), or 1 (after editing); ID is sentence ID. One sentence is one line.
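The line format above can be read with a few lines of Python. This is a minimal sketch, not an official reader: it assumes the square brackets in the format description are notation rather than literal characters (it strips them if present), and the names `Record` and `parse_line_records`, the sample sentence IDs, and the sample sentences are made up for illustration.

```python
from typing import Iterable, Iterator, NamedTuple

# Meaning of the TYPE field as described in the format above.
TYPE_LABELS = {-1: "before editing", 0: "not edited", 1: "after editing"}


class Record(NamedTuple):
    edit_type: int  # -1, 0, or 1
    sent_id: str    # sentence ID, kept as an opaque string
    data: str       # tokens, POS tags, or CFG parse, depending on the file


def parse_line_records(lines: Iterable[str]) -> Iterator[Record]:
    """Yield one Record per non-empty line of a tokens/POS/CFG file."""
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            continue
        # Split on the first two tabs only; DATA is kept whole.
        # A line with fewer than three fields raises ValueError here.
        type_field, sent_id, data = line.split("\t", 2)
        # Assumption: brackets are notation; strip them in case they are literal.
        edit_type = int(type_field.strip("[]"))
        if edit_type not in TYPE_LABELS:
            raise ValueError(f"unexpected TYPE value: {type_field!r}")
        yield Record(edit_type, sent_id.strip("[]"), data)


if __name__ == "__main__":
    # Hypothetical sample lines; real IDs and sentences will differ.
    sample = [
        "-1\ts1\tThis are a test .\n",
        "1\ts1\tThis is a test .\n",
        "0\ts2\tNothing changed here .\n",
    ]
    for rec in parse_line_records(sample):
        print(rec.sent_id, TYPE_LABELS[rec.edit_type], rec.data, sep=" | ")
```

To read a downloaded file, an open file object can be passed in place of the sample list, since the function only iterates over lines.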

DEP data has the following format: [TYPE] TAB [ID] NL DATA, where TYPE is -1 (before editing), 0 (not edited), or 1 (after editing); ID is sentence ID. There is an empty line between sentences.
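Because DEP entries span several lines, they need a block-based reader instead of a line-based one. The sketch below is one possible approach under stated assumptions: the brackets in the format description are treated as notation (and stripped if present), the dependency lines are kept as opaque strings because their internal layout is not described here, and the function names and sample content are hypothetical.

```python
from typing import Iterable, Iterator, List, Tuple

DepBlock = Tuple[int, str, List[str]]  # (TYPE, sentence ID, dependency lines)


def _split_block(block: List[str]) -> DepBlock:
    """Turn the collected lines of one sentence into (TYPE, ID, data lines)."""
    # The first line of a block is the header: TYPE TAB ID.
    type_field, sent_id = block[0].split("\t", 1)
    return int(type_field.strip("[]")), sent_id.strip("[]"), block[1:]


def parse_dep_blocks(lines: Iterable[str]) -> Iterator[DepBlock]:
    """Yield one block per sentence; blocks are separated by empty lines."""
    block: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.strip():
            block.append(line)
        elif block:
            # An empty line closes the current sentence.
            yield _split_block(block)
            block = []
    if block:
        # The last sentence may not be followed by an empty line.
        yield _split_block(block)


if __name__ == "__main__":
    # Hypothetical sample; the dependency notation shown is an assumption.
    sample = (
        "1\ts1\n"
        "nsubj(test-4, This-1)\n"
        "root(ROOT-0, test-4)\n"
        "\n"
        "0\ts2\n"
        "root(ROOT-0, changed-2)\n"
    ).splitlines(keepends=True)
    for edit_type, sent_id, deps in parse_dep_blocks(sample):
        print(edit_type, sent_id, len(deps))
```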

Note: If edits are related to sentence boundary insertion or deletion, then the same IDs may appear several times in supplementary data.
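Since an ID may repeat, code that pairs the before and after versions of a sentence should collect entries in lists, not assume one entry per ID and type. A minimal sketch of such grouping follows; it takes already-parsed (TYPE, ID, DATA) tuples, and the function name and sample values are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_by_sentence_id(
    records: Iterable[Tuple[int, str, str]]
) -> Dict[str, Dict[int, List[str]]]:
    """Map each sentence ID to its entries, keyed by TYPE (-1, 0, or 1)."""
    grouped: Dict[str, Dict[int, List[str]]] = defaultdict(
        lambda: {-1: [], 0: [], 1: []}
    )
    for edit_type, sent_id, data in records:
        # Append instead of assigning, so repeated IDs are not overwritten.
        grouped[sent_id][edit_type].append(data)
    return dict(grouped)


if __name__ == "__main__":
    # Hypothetical case: one sentence before editing, two after
    # (a sentence boundary was inserted), all sharing the same ID.
    sample = [
        (-1, "s1", "First part , second part ."),
        (1, "s1", "First part ."),
        (1, "s1", "Second part ."),
    ]
    grouped = group_by_sentence_id(sample)
    print(len(grouped["s1"][-1]), len(grouped["s1"][1]))
```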

Wikipedia 2015

We dumped the Wikipedia.org database in April 2015 and adapted the WikiExtractor.py tool to extract article texts. A 4 KB length limit on article text is applied to filter out suspicious texts. One article is one text file. You may download and use the extracted Wikipedia texts for the AESW Shared Task.

Extracted texts (Attention: 2.5GB)