Text mining is concerned with discovering structure and patterns in unstructured data, usually text. There are many different approaches to this task: some focus on ancillary structures such as taxonomies and ontologies, some focus on semantics and natural language processing, while others use various algorithms to categorise and summarise. Which approach is most appropriate depends on the need.
GATE (General Architecture for Text Engineering)
This is a large full-lifecycle open source text mining software suite with several components:
* GATE Developer is an integrated environment consisting of language processing components which incorporate the widely used Information Extraction system along with other plugins.
* GATE Teamware provides a collaborative environment for document annotation. This is built around a workflow paradigm.
* GATE Embedded is a Java object library that provides an interface to other applications within the organisation.
KNIME Text Processing is a plug-in to the (free) KNIME data mining suite. It supports a six-step text processing workflow, which starts with the reading and parsing of text, followed by named entity recognition, filtering and manipulation, word counting and keyword extraction, bag-of-words (BoW) and vector representation, and finally visualisation.
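As a rough illustration of the filtering, word counting and bag-of-words steps in such a workflow, here is a minimal Python sketch. It is not KNIME's implementation and does not use any KNIME API: the stopword list, the function names and the sample documents are all assumptions made for the example, and the named entity recognition and visualisation steps are left out.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real tools ship larger ones).
STOPWORDS = {"the", "a", "an", "of", "and", "is", "in", "to"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def filter_tokens(tokens):
    """Drop tokens that appear in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]

def bag_of_words(docs):
    """Return (vocabulary, vectors) with one term-count vector per document."""
    counts = [Counter(filter_tokens(tokenize(d))) for d in docs]
    vocab = sorted(set().union(*counts))
    vectors = [[c[term] for term in vocab] for c in counts]
    return vocab, vectors

# Hypothetical sample documents.
docs = ["The cat sat in the hat.", "A cat and a dog."]
vocab, vectors = bag_of_words(docs)
print(vocab)    # ['cat', 'dog', 'hat', 'sat']
print(vectors)  # [[1, 0, 1, 1], [1, 1, 0, 0]]
```

Each document becomes a vector of term counts over a shared vocabulary, which is the form that later clustering or classification steps usually expect.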
LPU (Learning from Positive and Unlabeled Examples)
This is a text learning and classification system that utilises support vector machines (SVM) and EM (Expectation Maximisation) techniques. It runs in a DOS window.
Orange-Text
This is an add-in to the free Orange data mining suite. It operates within the visual analytics tools provided in Orange and adds the ability to process unstructured data.
RapidMiner Text Extension
This provides operators for the RapidMiner environment for statistical text analysis. Many data sources are supported, including plain text, HTML and PDF. A large number of filtering techniques are available, along with tokenization, stemming, stopword filtering and n-gram generation. All of this sits within the graphical interface provided by RapidMiner (which is a free data mining suite), and many tasks can be completed through drag-and-drop functionality.
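To make two of these operations concrete, the following Python sketch shows stemming and n-gram generation on a whitespace-tokenized string. It is not how RapidMiner implements its operators: the suffix-stripping stemmer is a deliberately crude stand-in for a real algorithm such as Porter's, and the function names and sample text are assumptions.

```python
def naive_stem(token):
    """Strip a few common English suffixes; a crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def ngrams(tokens, n):
    """Return all contiguous n-token sequences as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical sample text, tokenized on whitespace.
tokens = "mining texts and clustering documents".split()
stems = [naive_stem(t) for t in tokens]
bigrams = ngrams(stems, 2)
print(stems)    # ['min', 'text', 'and', 'cluster', 'document']
print(bigrams)
```

The stem "min" for "mining" shows why naive suffix stripping is only a sketch: production stemmers apply more careful rules.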
Other products of interest include:
The Apache OpenNLP library is a machine learning based toolkit for the processing of natural language text. It supports the most common NLP tasks, such as tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, and coreference resolution. These tasks are usually required to build more advanced text processing services. OpenNLP also includes maximum entropy and perceptron based machine learning.
Apache Mahout supports recommendation mining, which takes users’ behavior and from that tries to find items users might like. Clustering takes, for example, text documents and groups them into sets of topically related documents. Classification learns from existing categorized documents what documents of a specific category look like and is able to assign unlabelled documents to the (hopefully) correct category.
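The classification idea can be sketched with a small multinomial Naive Bayes classifier in Python. This is not Mahout's API or code, only an illustration of learning from categorized documents and then labelling a new one. The training texts, the category names and the function names are all hypothetical.

```python
import math
from collections import Counter, defaultdict

def train(labelled_docs):
    """Count words per category from (text, category) pairs."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for text, category in labelled_docs:
        word_counts[category].update(text.lower().split())
        doc_counts[category] += 1
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocab

def classify(text, model):
    """Return the category with the highest log-probability (add-one smoothing)."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best, best_score = None, -math.inf
    for category in doc_counts:
        total_words = sum(word_counts[category].values())
        # Start from the log prior, then add one log-likelihood per word.
        score = math.log(doc_counts[category] / total_docs)
        for word in text.lower().split():
            score += math.log(
                (word_counts[category][word] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best, best_score = category, score
    return best

# Hypothetical training data.
training = [
    ("cheap pills buy now", "spam"),
    ("buy cheap watches", "spam"),
    ("meeting agenda for monday", "ham"),
    ("project meeting notes", "ham"),
]
model = train(training)
label = classify("cheap watches now", model)
print(label)  # spam
```

Working in log space avoids numeric underflow on longer documents, and add-one smoothing keeps unseen words from forcing a zero probability.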
Text Analytics: A Business Guide
A report for business and technology managers wishing to understand the impact of rapidly evolving text analytics capabilities, and their application in business.
Carrot2 – text and search results clustering framework.
GATE – General Architecture for Text Engineering, an open-source toolbox for natural language processing and language engineering
Gensim – large-scale topic modelling and extraction of semantic information from unstructured text (Python)
OpenNLP – natural language processing
Natural Language Toolkit (NLTK) – a suite of libraries and programs for symbolic and statistical natural language processing (NLP) for the Python programming language.
RapidMiner with its Text Processing Extension – data and text mining software.
Unstructured Information Management Architecture (UIMA) – a component framework to analyze unstructured content such as text, audio and video, originally developed by IBM.
The programming language R provides a framework for text mining applications in the package tm.[4] The Natural Language Processing task view contains tm and other text mining library packages.[5]
The KNIME Text Processing extension.
KH Coder – for content analysis, text mining or corpus linguistics.
I agree with Hassan. I'd suggest R, and in particular the package tm (in my opinion, it is not extremely efficient, but it makes it very easy to study text mining approaches on small/medium scale collections).
For my new scientific paper, I installed the text mining packages for R.
In my new study I will work in R. It does not look so difficult.
You can visit webpages such as http://www.r-project.org/ for R and http://www.rstudio.com/ for RStudio, and look at packages such as tm (http://127.0.0.1:33888/help/library/tm/doc/extensions.pdf) or mlflex.RLearner, which interfaces with the R statistical package.
I specialize in public data gathering (web harvesting) from open access websites by programming a web-crawler. The data can later be used for statistical or content analysis. For example, my recent collection was data from booking.com and tripadvisor.com with information about reviews, ratings and prices along with the accompanying data such as geographical region, addresses, and many more. The data comes out in a form that is easily converted to SPSS or Excel format.
Technically, any website or social network can be a source of data. Please feel free to contact me if you are interested as I am open for research collaboration.
The Coding Analysis Toolkit (CAT) is a free, open source, web-based, collaborative text analysis tool.
http://cat.texifter.com
The key features are related to the measurement of inter-rater reliability and the adjudication of annotator disagreements. I have attached a peer-reviewed scholarly paper on the software itself as well as the general question of using software for text analysis (I am the co-author).
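For readers unfamiliar with inter-rater reliability, one common measure is Cohen's kappa, which corrects the raw agreement between two annotators for the agreement expected by chance. The Python sketch below computes it for two lists of labels. Which statistics CAT itself reports is not stated here, so treat the choice of kappa, the function name and the sample labels as assumptions for illustration.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who labelled the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equally long, non-empty label lists")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's label proportions, summed.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical annotations from two coders on six documents.
a = ["pos", "pos", "neg", "neg", "pos", "neg"]
b = ["pos", "neg", "neg", "neg", "pos", "pos"]
kappa = cohens_kappa(a, b)
print(round(kappa, 3))  # 0.333
```

Here the coders agree on four of six items (0.667), chance agreement is 0.5, and kappa is therefore about 0.33. The items where they disagree are the ones an adjudication step would review.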
Hi Nouran Radwan, it really depends on what you are trying to accomplish with the user profile text mining. I usually work with interview data and rely on the R language (as Nikos Koutsoupias suggested, RQDA is one of the options here). I mostly stick with tidy tools. The 'Text Mining with R' book walks through the most common text mining operations in that style. The authors have a number of other related projects that are worth a look: