DATA

Preprocessed ACL Anthology Reference Corpus

Version: aclarc2012-06-30 (80 Mb)

This preprocessed corpus is made freely available to anyone who needs a paper/article-based version of the ACL Anthology Reference Corpus, version 20090501.

This is the ACL ARC corpus that was used in:

  • Daudaravičius V. Applying Collocation Segmentation to the ACL Anthology Reference Corpus. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics. Jeju, Korea, July 2012.

Features:

  • Concatenated pages of the same paper: one file per paper.
  • Deleted papers with erroneous OCR, i.e., papers in which all characters are whitespace-separated and there are no clear word boundaries.
  • Cleaned headings, footers and page numbers.
  • Deleted all papers shorter than 1 Kb.
  • Each paragraph on a single line: line and word breaks removed.
  • 8581 papers.
  • 51,881,537 tokens.
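
The cleaning steps listed above can be illustrated with a minimal Python sketch. It is not the script used to build the corpus: the function names, the 0.9 single-character threshold, and the hyphen-rejoining rule are all assumptions chosen for illustration. Only the 1 Kb minimum size comes from the list above.

```python
def looks_like_broken_ocr(text, threshold=0.9):
    """Assumed heuristic: flag text whose whitespace-separated tokens
    are almost all single characters (no clear word boundaries)."""
    tokens = text.split()
    if not tokens:
        return True
    single = sum(1 for t in tokens if len(t) == 1)
    return single / len(tokens) >= threshold


def join_paragraph(lines):
    """Put one paragraph on a single line, removing line and word breaks.
    Assumption: a trailing hyphen marks a word split across lines."""
    out = ""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if out.endswith("-"):
            out = out[:-1] + line
        elif out:
            out += " " + line
        else:
            out = line
    return out


def keep_paper(text, min_bytes=1024):
    """Keep a paper only if it is at least 1 Kb and its OCR looks usable."""
    return len(text.encode("utf-8")) >= min_bytes and not looks_like_broken_ocr(text)
```

A rule this simple would also merge genuine hyphenated compounds that happen to break at a line end, so a real pipeline would likely need a dictionary check or a similar safeguard.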

Collocation segments from ACL ARC

Version: aclarccoll2012-06-30

The list of collocation segments extracted from the preprocessed ACL Anthology Reference Corpus (aclarc2012-06-30).

The full list of collocation segments used in:

  • Daudaravičius V. Applying Collocation Segmentation to the ACL Anthology Reference Corpus. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics. Jeju, Korea, July 2012.

Features:

  • 756,670 segments.
  • The histogram of the segment significance in the corpus throughout all the years.
  • Significance is based on NCTFIDF (Normalized TF-IDF with Confidence), a modified TF-IDF.
  • The year of the peak significance of a collocation segment.
  • Filtered yearly significance (above the average of the yearly significance of a collocation segment).
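
The peak year and the above-average filter can be shown on a small example. The sketch below assumes that the yearly significance of one segment is held in a dictionary mapping year to score. Both function names and the sample values are made up for illustration, and the computation of NCTFIDF itself is not reproduced here.

```python
def peak_year(yearly):
    """Return the year in which the segment's significance is highest."""
    return max(yearly, key=yearly.get)


def above_average(yearly):
    """Keep only the years whose significance exceeds the segment's
    own average yearly significance."""
    mean = sum(yearly.values()) / len(yearly)
    return {year: score for year, score in yearly.items() if score > mean}


# Made-up scores for a single hypothetical segment.
yearly = {1995: 2.0, 1996: 10.0, 1997: 3.0}
print(peak_year(yearly))      # 1996
print(above_average(yearly))  # {1996: 10.0}
```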

Lists:

  • Full list aclarccoll2012-06-30 (13 Mb)
  • Sorted by the peak significance and filtered (peak significance >15) txt (3 Mb).
  • Filtered (peak significance >15, appearing in at least three years) and sorted by the year txt (3 Mb).
  • Filtered (peak significance >15, appearing in at least three years, n-grams with n>1 only) and sorted by the year txt (2 Mb).
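
The filtering criteria of the lists above can be expressed as a short Python sketch. The record layout (`Segment`), the `select` function, and the sample entries are hypothetical, since the file format of the lists is not described here. Sorting by peak year and in descending order of peak significance are likewise assumptions about what "sorted by the year" and "sorted by the peak significance" mean. This is not a port of aclHistSort.hs.

```python
from typing import NamedTuple, Tuple


class Segment(NamedTuple):
    # Hypothetical record layout, not the actual file format.
    text: str
    peak_significance: float
    peak_year: int
    years: Tuple[int, ...]  # years in which the segment appears


def select(segments, min_peak=15.0, min_years=1, ngrams_only=False):
    """Keep segments with peak significance above min_peak that appear
    in at least min_years years; optionally keep only n-grams with n > 1."""
    return [
        s for s in segments
        if s.peak_significance > min_peak
        and len(s.years) >= min_years
        and (not ngrams_only or len(s.text.split()) > 1)
    ]


# Made-up entries for illustration only.
segments = [
    Segment("machine translation", 20.0, 2005, (2003, 2004, 2005)),
    Segment("parser", 30.0, 1998, (1996, 1997, 1998, 1999)),
    Segment("rare phrase", 40.0, 2001, (2001,)),
]

by_peak = sorted(select(segments), key=lambda s: s.peak_significance, reverse=True)
by_year = sorted(select(segments, min_years=3), key=lambda s: s.peak_year)
ngrams_by_year = sorted(select(segments, min_years=3, ngrams_only=True),
                        key=lambda s: s.peak_year)
```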

The full list was extracted with aclHist.hs; the filtered lists were then generated from it with aclHistSort.hs.