Collocation Segmentation (CoSegment)

---------------------
0. Quick Start

Windows:
  unzip cosegment.zip
  cp your.files .\corpus\
  perl lowercase.pl
  perl tokenize.pl
  cosegment

Linux:
  tar -xzf cosegment.tgz
  g++ -O2 -o cosegment cosegment.cpp
  cp your.files ./corpus/
  perl lowercase.pl
  perl tokenize.pl
  ./cosegment

---------------------
1. Installation

Windows:
  a) unzip the cosegment.zip file;
  b) put your text files into the default directory .\corpus;
  c) run the Perl script lowercase.pl to lowercase the letters in your files;
  d) run the Perl script tokenize.pl to tokenize the words in your files;
  e) run cosegment.exe to perform collocation segmentation on your files;
  f) the segmented files are stored in the same directory as the source files, with the extension '.c' appended to each file name;
  g) the list of collocations is stored in the main directory as collocations_abc.txt (alphabetical order) and collocations_freq.txt (frequency order).

You can skip c) and d) if your data is already tokenized and lowercased. cosegment treats the strings between spaces as tokens.

Linux:
  a) untar the cosegment.tgz file, e.g.
       tar -xzf cosegment.tgz
  b) compile cosegment.cpp with g++ or a similar compiler, e.g.
       g++ -O2 -o cosegment cosegment.cpp
  c) put your text files into the default directory ./corpus;
  d) run the Perl script lowercase.pl to lowercase the letters in your files;
  e) run the Perl script tokenize.pl to tokenize the words in your files;
  f) run ./cosegment to perform collocation segmentation on your files;
  g) the segmented files are stored in the same directory as the source files, with the extension '.c' appended to each file name;
  h) the list of collocations is stored in the main directory as collocations_abc.txt (alphabetical order) and collocations_freq.txt (frequency order).

You can skip d) and e) if your data is already tokenized and lowercased. cosegment treats the strings between spaces as tokens.

---------------------
2. Data

cosegment looks for files in the 'corpus' directory and processes all data found in this directory.
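
Because cosegment reads everything in the corpus directory and treats space-separated strings as tokens, it can help to inspect the directory before running it. The following is a minimal sketch, not part of the cosegment distribution: the function names count_tokens and files_with_uppercase are hypothetical, and the uppercase check assumes that matching the [[:upper:]] class in your locale is a good enough hint that lowercase.pl has not yet been applied.

```shell
#!/bin/sh
# Hypothetical helper functions; they are NOT shipped with cosegment.

# Print the total number of whitespace-separated tokens
# in all files of the directory given as $1.
count_tokens() {
    cat "$1"/* | awk '{ n += NF } END { print n + 0 }'
}

# Print the names of files in directory $1 that still contain
# uppercase letters (a hint that lowercase.pl was not run on them).
# "|| true" keeps the exit status zero when no file matches.
files_with_uppercase() {
    grep -l '[[:upper:]]' "$1"/* || true
}

# Example use, from the directory that contains ./corpus:
#   count_tokens ./corpus
#   files_with_uppercase ./corpus
```

Run these before cosegment: the segmented '.c' output files are written to the same directory, so a later count would include them as well.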
The more data are processed, the more accurate the collocation segmentation becomes. A corpus of 1 million running words is enough to achieve good results. You can even try a small text file of 50 000 words.

---------------------
3. Reference

The tool implements the collocation segmentation described in the paper:

V. Daudaravičius. The Influence of Collocation Segmentation and Top 10 Items to Keyword Assignment Performance. In Proceedings of Computational Linguistics and Intelligent Text Processing, CICLing-2010, Iasi, Romania. Lecture Notes in Computer Science. Springer-Verlag. 648–660.

---------------------
4. Copyright

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see <https://www.gnu.org/licenses/>.