There is no strict "rule", but I can give you a simple example of a framework for the text classification task:
STEP 1 - Pre-Processing
Activities that might be performed in this step:
(i) Performing a preliminary descriptive-statistics study of your collection of documents (e.g., determining the frequency of each word in the collection, the strongest correlations among words, etc.).
(ii) According to the results of that study, applying a set of techniques to reduce the problem's dimensionality (stop-word removal, stemming, feature selection, etc.).
STEP 2 - Dataset Modeling
- Deciding how to construct your training dataset, i.e., how to transform the collection of documents into a dataset (for example: bag of words, n-grams, etc.).
STEP 3 - Analysis
- Constructing your text analysis model using one or more algorithms; a sketch follows below. In the case of classification you have a number of options: k-NN, SVM, Naive Bayes, neural networks, etc.
As I said before, this is just a simplified example.
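For concreteness, here is a minimal sketch of those three steps in Python with scikit-learn (assumed to be installed); the toy corpus, labels, and test sentence are invented purely for illustration.

```python
# Minimal sketch of the three steps above using scikit-learn.
# The toy corpus and labels below are invented for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

docs = [
    "the match ended in a draw after extra time",
    "the striker scored a late winning goal",
    "the central bank raised interest rates again",
    "stock markets fell on inflation fears",
]
labels = ["sports", "sports", "finance", "finance"]

# STEPS 1-2: CountVectorizer lowercases, tokenizes, drops English
# stop words, and turns the collection into a bag-of-words dataset.
# STEP 3: a Naive Bayes classifier is trained on that dataset.
model = make_pipeline(
    CountVectorizer(stop_words="english"),
    MultinomialNB(),
)
model.fit(docs, labels)

print(model.predict(["the keeper saved a late penalty"]))  # -> ['sports'] on this toy data
```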
As Eduardo says, there is no strict "rule", but a text analytics project is a data analytics project, so you can follow the five classic steps of any analytics project (KDD, for example):
Get the data
Preprocessing
Data Model (Text Mining Task)
Visualization
Knowledge
Get the data: The first step; obtain your raw data, the text source you want to analyze (blogs, news, tweets).
Preprocessing: The most important step. You transform and prepare your text data for analysis: stop-word elimination, tokenization, lemmatization, and other tasks (see the sketch after this list).
Data Model: Clustering, classification, summarization, topic modeling: the task and algorithms selected to answer the goal question.
Visualization: dendrograms, word clouds, histograms, correlation maps, and many others.
Knowledge: The interpretation of the results obtained, turning them into the desired knowledge.
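To make the preprocessing step concrete, here is a small Python sketch using NLTK (assumed installed; the resource downloads run once, and the sentence is invented for illustration) performing tokenization, stop-word elimination, and lemmatization.

```python
# Sketch of the preprocessing step with NLTK: tokenization,
# stop-word elimination, lemmatization. The sentence is invented.
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download("punkt")      # tokenizer models (newer NLTK may also need "punkt_tab")
nltk.download("stopwords")  # stop-word lists
nltk.download("wordnet")    # lemmatizer dictionary

text = "The cats were chasing mice across the old wooden floors."

# Tokenization: split raw text into word tokens.
tokens = word_tokenize(text.lower())

# Stop-word elimination: drop common function words and punctuation.
stops = set(stopwords.words("english"))
content = [t for t in tokens if t.isalpha() and t not in stops]

# Lemmatization: reduce each remaining token to its dictionary form.
lemmatizer = WordNetLemmatizer()
print([lemmatizer.lemmatize(t) for t in content])
# e.g. ['cat', 'chasing', 'mouse', 'across', 'old', 'wooden', 'floor']
```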
In the following books you can find more information about text mining:
Text Mining: Predictive Methods for Analyzing Unstructured Information by Sholom M. Weiss, Nitin Indurkhya, Tong Zhang, and Fred J. Damerau, published by Springer.
Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze, published by Cambridge University Press.
Mining Text Data by Charu C. Aggarwal and ChengXiang Zhai, published by Springer.
Natural Language Processing and Text Mining by Anne Kao and Stephen R. Poteet, published by Springer.
Text Mining and Visualization: Case Studies Using Open-Source Tools by Markus Hofmann and Andrew Chisholm, published by CRC Press.
That really depends on the problem you want to solve, but as stated before and from my own experience, pre-processing is the most important step. Typical pre-processing steps in text analytics are:
Data deduplication, entity resolution if you have several data sources, stemming and lemmatization, word sense disambiguation
Depending on the analysis method you might use word frequencies, bag-of-words, part-of-speech tagging, tokenization, TF-IDF
Feature extraction is probably the most important step; it is where you extract the observation points for your analysis tool.
Typical text analysis methods are Latent Dirichlet Allocation, subjectivity analysis, and clustering, but again that depends on what you want to achieve.
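As an example of one method named above, here is a hedged sketch of Latent Dirichlet Allocation with scikit-learn; the four-document corpus is invented for illustration, and a real topic model would need far more documents.

```python
# Sketch of Latent Dirichlet Allocation with scikit-learn.
# The tiny corpus is invented; real topic models need much more data.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "genes and proteins regulate cell growth",
    "protein folding affects cell function",
    "planets orbit stars in distant galaxies",
    "telescopes observe stars and distant planets",
]

# LDA works on raw term counts, so the features are a plain
# bag-of-words matrix.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(counts)

# Print the top words of each discovered topic.
terms = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = [terms[j] for j in topic.argsort()[-4:][::-1]]
    print(f"topic {i}:", top)
```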
Text mining is an interesting field. There are many libraries available that use NLP to facilitate the text mining process; Stanford NLP and Python are recommended.
As already mentioned, the pre-processing step is very important. In this phase a stop-word list is used, for example, to eliminate commonly occurring words, and punctuation such as full stops is removed. Tokenization is the second phase, which disintegrates sentences into n-gram tokens; from the tokens you create a bag of words as the feature vector to be used in text classification or clustering. The most commonly used model is the vector space model, which generates a bag of words from the unstructured text. Each document is represented as weighted terms, with weight = term frequency * inverse document frequency. This is represented as a term matrix, and a distance matrix is then calculated from the term matrix using similarity measures.
Similarity is measured using distance functions; commonly used distance functions are cosine and Jaccard, among others.
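A minimal sketch of that vector space model in Python with scikit-learn (assumed installed; the three documents are invented): documents become TF-IDF-weighted term vectors, and pairwise cosine similarity is computed from the resulting term matrix.

```python
# Vector space model sketch: TF-IDF term matrix plus a pairwise
# cosine similarity matrix. The documents are invented examples.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "text mining extracts knowledge from documents",
    "mining documents for knowledge is text mining",
    "the weather today is sunny and warm",
]

# Term matrix: one row per document, one column per term,
# each weight = term frequency * inverse document frequency.
tfidf = TfidfVectorizer(stop_words="english")
term_matrix = tfidf.fit_transform(docs)

# Similarity matrix: cosine of the angle between document vectors.
# The first two documents share vocabulary, so their score is high;
# both score near 0 against the unrelated third document.
print(cosine_similarity(term_matrix).round(2))
```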
Adding to the above, if your approach involves NLP at the pre-processing step, there are several sub-tasks in NLP which are generally represented as a sequential chain/pipeline performed over your input items. These tasks range from low-level operations (tokenization, stop-word removal, statistical analysis like TF-IDF) to higher-level ones (WSD, coreference detection, NER...). A quick search for "NLP chain" will give you examples and frameworks that suit your needs.
From this intermediate data representation you can build the analytics tasks described in previous answers (data modeling, clustering/classification, visualization...).
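As one concrete example of such a chain, the sketch below uses spaCy (one possible framework, assumed installed along with its en_core_web_sm model via `python -m spacy download en_core_web_sm`); a single pipeline call performs tokenization, stop-word flagging, POS tagging, lemmatization, and NER on an invented sentence.

```python
# Sketch of a sequential NLP chain with spaCy: one call to nlp()
# runs tokenization, tagging, lemmatization, and NER in order.
import spacy

nlp = spacy.load("en_core_web_sm")  # small English pipeline (assumed downloaded)
doc = nlp("Apple opened a new research lab in Berlin last March.")

# Low-level output: content tokens with POS tags and lemmas.
for token in doc:
    if not token.is_stop and not token.is_punct:
        print(token.text, token.pos_, token.lemma_)

# Higher-level output: named entities found by the chain.
for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. Apple/ORG, Berlin/GPE, last March/DATE
```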
1. File subtopics.txt - 20 topics from scientific articles on text analytics
2. File a3_3_last_years_articles_potentially_with_novelty.xlsm - articles published in the last 3 years that are the best candidates for containing novelty
3. File a1_basic_articles.xlsm - the most important articles with basic knowledge on the topic
Here are 20 topics from the articles that contain the words "text* analytic*" in their titles or abstracts. Each topic is presented with 20 words and 20 phrases, as they appear in those articles.
From previous experience, no more than 30% of these topics will be useful for you.
Reading these topics will require effort, since this is not a coherent presentation: each topic is a list of substances, methods, theories, cases, etc. associated with it. Try to be patient, in the hope that new hypotheses and thoughts will appear.