Automated Evaluation of Scientific Writing Data Set


The Dataset is released for non-commercial purposes under the following license agreement:

  • By downloading this dataset and license, this license agreement is entered into, effective this date, between you, the Licensee, and VTeX, the Licensor.
  • Automated Evaluation of Scientific Writing Data Set is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
  • To view a copy of this license, visit
  • Several restrictions are applied in addition to the CC-BY-NC-SA 4.0 License:
    • It is not allowed to perform any kind of direct or indirect data engineering or data mining to detect the personal identity behind the texts.
    • If the Licensee (i.e., you) has recognized the personal identity behind any samples, the Licensee is not allowed to make this information public or share it with other persons.
    • The Licensee is not allowed to enrich the original dataset or create derivative data in a way that can facilitate the detection of the personal identity behind text samples; any such information must be withdrawn from derivative data immediately.
  • The Licensee shall acknowledge use of the licensed dataset in all publications of research based on it, in whole or in part, through citation of the following dataset: Vidas Daudaravicius. (2016). Automated Evaluation of Scientific Writing Data Set (Version 1.2) [Data file]. VTeX.


You may download the AESW Dataset if you agree to the license.

Wikipedia 2015

We dumped the Wikipedia database in April 2015 and adapted a tool to extract the article texts. A 4 KB length limit on article text is applied to filter out suspicious texts. Each article is stored as one text file. You may download and use the extracted Wikipedia texts for the AESW Shared Task.
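
For readers who want to apply a similar length filter to their own copy of the extracted texts, the following is a minimal sketch and not the tool used to build the release. It makes two assumptions that the description above does not confirm: that "4 KB" means 4096 bytes, and that the limit is a minimum size (shorter files are dropped). The function name `select_articles` and the `*.txt` file pattern are hypothetical.

```python
from pathlib import Path

# Assumption: "4 KB" is read as 4096 bytes and treated as a minimum size.
MIN_BYTES = 4 * 1024


def select_articles(directory, min_bytes=MIN_BYTES):
    """Return sorted paths of *.txt files whose size is at least min_bytes."""
    root = Path(directory)
    return sorted(
        path
        for path in root.glob("*.txt")  # one article per text file
        if path.is_file() and path.stat().st_size >= min_bytes
    )
```

If the limit is in fact an upper bound, the comparison in the last line of the generator would be reversed.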