Currently, many projects at the GeoVista center (at Pennsylvania State University) deal with unstructured textual data and require parsing space, time, and entity information from text documents. While trying to compare one of our text-mining tools with other projects, I came across many freely available text-mining tools. However, comparing them directly wouldn’t have made sense, as they offer different levels of functionality. Hence, I have classified text-mining approaches into four categories, and the table below lists some of the text-mining tools that are freely available online.
Classification of Text-mining approaches:
1. Keyword Extractors – Traditionally, text-mining tools have mainly focused on determining the important keywords in a document. This is done by creating a “term vector matrix” and assigning a score to each word. This approach forms the core of most search engines (check out the table below).
2. Entity Extractors – Current text-mining tools go beyond identifying terms: they also try to classify these terms into basic categories such as person, organization, city, region, money, etc. Such text-mining tools are often referred to as “entity-extraction tools” (check out the table below).
3. Entity Relation Extractors – The objective here is not only to find the entities mentioned in the document, but also to determine how they are related to each other. I wasn’t able to find any freely available online tools that do this, but I am aware that some Penn State researchers are working on this.
4. Document Relation Extractors – The objective here is to go beyond the limits of a single document and identify common themes across different documents and how they are related to each other. I haven’t seen any tool that currently provides such a feature.
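To make the keyword-scoring idea from category 1 more concrete, here is a minimal sketch in Python. It assumes TF-IDF as the scoring scheme, which is one common choice and not necessarily what any of the tools listed in this post use. The function names (`tokenize`, `tfidf_scores`, `top_keywords`) and the sample documents are made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_scores(documents):
    """Return one {term: score} dict per document, using TF-IDF weighting.

    Assumption: TF-IDF stands in for the "score" assigned to each word;
    real tools may use a different weighting.
    """
    # One row of the term matrix per document: term -> raw count.
    term_counts = [Counter(tokenize(doc)) for doc in documents]
    n_docs = len(documents)

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for counts in term_counts:
        doc_freq.update(counts.keys())

    scores = []
    for counts in term_counts:
        total = sum(counts.values())
        scores.append({
            # Term frequency times inverse document frequency.
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


def top_keywords(documents, k=3):
    """Return the k highest-scoring terms per document (ties broken alphabetically)."""
    return [
        sorted(doc_scores, key=lambda t: (-doc_scores[t], t))[:k]
        for doc_scores in tfidf_scores(documents)
    ]


# Invented sample documents, for illustration only.
docs = [
    "the flood hit the city, the flood damaged the bridge",
    "the election results surprised the city",
    "the bridge reopened after the repairs",
]
keywords = top_keywords(docs, k=2)
print(keywords)
```

In this sketch, a word such as "the" that occurs in every document gets a score of zero, while a word that is frequent in one document and rare elsewhere rises to the top. That is the basic intuition behind keyword extraction.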
List of text-mining tools
|Organization||Web Service||Online Tool||Type (based on above categories)||Freeware||Comments|
|ClearForest SWS||Yes||Yes||Entity||Yes||They also provide a Java desktop client and a Firefox add-on|
|TermeExtractor||No||Yes||Keyword||Yes||To use the full version, you need to create a login. Also, the tool only works in Firefox.|
|Whatizit||Yes||Yes||Keyword/Entity||Yes||Whatizit has an interesting concept of a pipeline, which allows you to select a vocabulary|
Based on my personal evaluation, I felt ClearForest SWS does a pretty good job of entity extraction. It was able to find people, organizations, cities, regions, and countries. Further, it offers its technology and tools in various formats, such as a Firefox add-on, a desktop Java application, a web service, and an online tool. Below is an image of the ClearForest tool as a Firefox add-on.