|
DATA
Preprocessed ACL Anthology Reference Corpus
Version: aclarc2012-06-30 (80 Mb)
This preprocessed corpus is freely available to anyone who needs a paper-based (one file per article) version of the ACL Anthology Reference Corpus, version 20090501.
This is the ACL ARC corpus used in:
- Daudaravičius V. Applying Collocation Segmentation to the ACL Anthology Reference Corpus. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics. Jeju, Korea, July 2012.
Features:
- Concatenated pages of the same paper: one file per paper.
- Deleted papers with erroneous OCR, i.e., papers in which all characters are separated by white space and there are no clear word boundaries.
- Cleaned headings, footers and page numbers.
- Deleted all papers shorter than 1 Kb.
- Each paragraph on a single line: line and word breaks removed.
- 8581 papers.
- 51,881,537 tokens.
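The cleaning steps listed above can be illustrated with a minimal Python sketch. This is not the code used to build the corpus: the 1024-byte reading of "1 Kb", the 0.8 single-character threshold, and the function names `looks_space_separated`, `join_paragraph`, and `keep_paper` are all assumptions made for illustration.

```python
MIN_BYTES = 1024  # assumed reading of "1 Kb"


def looks_space_separated(text, threshold=0.8):
    """Heuristic (an assumption): flag text in which most tokens are single characters."""
    tokens = text.split()
    if not tokens:
        return True
    single = sum(1 for t in tokens if len(t) == 1)
    return single / len(tokens) >= threshold


def join_paragraph(lines):
    """Put one paragraph on a single line, removing line breaks and word breaks.

    Naive: every line-final hyphen is treated as a word break, so genuine
    hyphenated compounds split across lines would lose their hyphen.
    """
    out = ""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if out.endswith("-"):
            out = out[:-1] + line   # rejoin a word broken across two lines
        elif out:
            out += " " + line
        else:
            out = line
    return out


def keep_paper(text):
    """Keep a paper only if it is long enough and does not look like broken OCR."""
    long_enough = len(text.encode("utf-8")) >= MIN_BYTES
    return long_enough and not looks_space_separated(text)
```

For example, `join_paragraph(["colloca-", "tion segments"])` yields `"collocation segments"`, and `keep_paper` rejects any text under the assumed size limit.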
Collocation segments from ACL ARC
Version: aclarccoll2012-06-30
The list of collocation segments extracted from the preprocessed ACL Anthology Reference Corpus (aclarc2012-06-30).
The full list of collocation segments used in:
- Daudaravičius V. Applying Collocation Segmentation to the ACL Anthology Reference Corpus. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics. Jeju, Korea, July 2012.
Features:
- 756,670 segments.
- A histogram of each segment's significance in the corpus across all years.
- Significance is based on NCTFIDF (Normalized TF-IDF with Confidence), a modified TF-IDF.
- The year of the peak significance of a collocation segment.
- Filtered yearly significance (above the average of the yearly significance of a collocation segment).
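The last two features can be illustrated with a short Python sketch. It assumes that a segment's yearly significance is available as a mapping from year to score and that the average is taken over the years present in that mapping; the source states neither detail, and the function names are hypothetical. The NCTFIDF score itself is not computed here.

```python
def peak_year(by_year):
    """Return the year in which the segment's significance is highest."""
    return max(by_year, key=by_year.get)


def above_average(by_year):
    """Keep only the years whose significance exceeds the segment's own yearly average."""
    mean = sum(by_year.values()) / len(by_year)
    return {year: score for year, score in by_year.items() if score > mean}
```

With the made-up scores `{1990: 2.0, 1991: 10.0, 1992: 3.0}`, the yearly average is 5.0, so the peak year is 1991 and only 1991 survives the filter.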
Lists:
- Full list aclarccoll2012-06-30 (13 Mb)
- Sorted by peak significance and filtered (peak significance >15) txt (3 Mb).
- Filtered (peak significance >15, appearing in at least three years) and sorted by year txt (3 Mb).
- Filtered (peak significance >15, appearing in at least three years, n-grams with n>1 only) and sorted by year txt (2 Mb).
The full list was extracted with aclHist.hs; the filtered lists were then generated from it with aclHistSort.hs.
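The filtering criteria of the two year-sorted lists can be sketched in Python as follows. This is not a port of aclHistSort.hs: the `Segment` record, the whitespace-based n-gram count, and the reading of "sorted by the year" as "sorted by the year of peak significance" are assumptions made for illustration.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    text: str       # the collocation segment
    by_year: dict   # hypothetical layout: year -> significance


def filter_segments(segments, min_peak=15.0, min_years=3, multiword_only=False):
    """Keep segments with peak significance above min_peak that appear in at
    least min_years years, optionally only n-grams with n > 1, and return them
    sorted by the year of peak significance."""
    kept = []
    for seg in segments:
        if not seg.by_year:
            continue
        if max(seg.by_year.values()) <= min_peak:
            continue
        if len(seg.by_year) < min_years:
            continue
        if multiword_only and len(seg.text.split()) < 2:
            continue
        kept.append(seg)
    return sorted(kept, key=lambda s: max(s.by_year, key=s.by_year.get))
```

Calling `filter_segments(segments)` corresponds to the criteria of the second list, and `filter_segments(segments, multiword_only=True)` to those of the third.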
|