Ha! Loved the answer above :-). Indeed, I've been playing around with this subject for some time. Existing work claims to achieve an amazing accuracy of around 90% when predicting future stock trends (note: trends, not values). Well, when we tried to reproduce one such work, guess what? Slightly over 50%. For an up/down trend, that's the statistical equivalent of tossing a coin. All right, we did it in a different market (i.e., with different data), so I'm not claiming there was any dishonesty involved in the original publication, but still, the technique simply did not work when moving from one market to another.
But I'm puzzled. What do you intend to mine in your text? That would determine the best technique for mining the desired information. It also helps to know the genre of the text, since some genres are more formal and structured than others. And that's just part of the problem. A completely separate part, which lies at the heart of your intentions as I understand them, is how to use the information you have gathered from the text to correlate with future trends or values of some assets or indexes in the stock market.
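To make the coin-toss comparison concrete, here is a minimal sketch of an exact one-sided binomial test against a fair coin. The counts used (52 and 90 correct out of 100) are hypothetical and only for illustration; they are not figures from our experiment or from any published study.

```python
from math import comb

def p_value_vs_coin(hits: int, n: int) -> float:
    """One-sided exact binomial p-value: P(X >= hits) for X ~ Binomial(n, 0.5)."""
    # Count the outcomes with at least `hits` successes among all 2**n sequences.
    tail = sum(comb(n, k) for k in range(hits, n + 1))
    return tail / 2 ** n

# Hypothetical counts, chosen only for illustration.
p_slightly_over_half = p_value_vs_coin(52, 100)  # 52% accuracy on 100 trend calls
p_ninety_percent = p_value_vs_coin(90, 100)      # 90% accuracy on 100 trend calls

print(f"52/100 correct: p = {p_slightly_over_half:.3f}")
print(f"90/100 correct: p = {p_ninety_percent:.3g}")
```

With counts like these, an accuracy slightly over 50% gives a large p-value, so it cannot be told apart from guessing, while 90% would be essentially impossible by chance alone.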
I think you have got excellent answers already. I also have the same question: why text data? We did a study in which we tried to correlate Twitter mood with stock market movement; this was a case where we used textual data. The only way the treatment of such data differs is perhaps in the domain-specific steps. For textual data, if you use proper feature selection, you can get good results. We have done similar work, which you can refer to at
There doesn’t exist one optimal method of text mining (or data mining in general) in any domain.
The work that has been done so far focused on specific markets (an entire market represented by a stock index, a specific industry, individual companies), used different text data (e.g., annual reports, Twitter messages, Facebook posts and comments, newspaper articles), and analyzed the data differently (e.g., determining the mood in the texts, which should explain the irrational behavior of traders). You should also consider the situation on the markets. For example, during a crisis, all stocks might tend to fall in price, whereas normally, in the long run, they grow. Prices are influenced by many factors that are not necessarily covered in the texts. These factors might also change over time, so we might see something called concept drift. Claiming an accuracy of 90% might be true, but only under some specific conditions, not in general. Sometimes the price movements are completely random and cannot be explained by texts or by any other variables.
You might want to use the texts to find information that serves as one of the sources for your decisions. Then you might use information retrieval (to find relevant texts), information extraction (to find structured information in unstructured texts), or summarization (when you have many texts). Alternatively, you might use only the texts; then a relation between the texts and stock price movements has to be found. In that case the task can be seen as a classification problem where texts or days are classified into classes representing the price movement (usually its direction), and you need to define the classes well. You might use the daily differences of closing/opening/… stock prices, work with stock prices every minute/hour/…, or consider actual stock values or values smoothed, e.g., by a moving average. You might have a different number of classes, such as price increase, stagnation, and decrease, and you need to decide on the thresholds for assigning a class, etc. There is no simple solution. You might use different dictionaries in which words or expressions are associated with a sentiment that might be related to stock price movements. You might apply a machine learning approach, for which you need labeled training data to train a classifier. You might apply different linguistic procedures to the texts before further processing, and you might rely on external knowledge for a specific industry/period/…
Therefore, there is no single answer to the question asked.
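As a rough illustration of the class-definition step described above, here is a minimal sketch that turns a series of closing prices into increase/stagnation/decrease labels, optionally after smoothing. The function names, the 0.5% threshold, and the sample prices are all assumptions made for the example, not recommended values.

```python
from typing import List

def moving_average(prices: List[float], window: int) -> List[float]:
    """Trailing moving average; early points average whatever history exists."""
    smoothed = []
    for i in range(len(prices)):
        start = max(0, i - window + 1)
        chunk = prices[start:i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed

def label_movements(prices: List[float], threshold: float = 0.005) -> List[str]:
    """Label each step by its relative price change.

    A change above +threshold is "up", below -threshold is "down",
    anything in between is "flat". The threshold is an arbitrary choice.
    """
    labels = []
    for prev, curr in zip(prices, prices[1:]):
        change = (curr - prev) / prev
        if change > threshold:
            labels.append("up")
        elif change < -threshold:
            labels.append("down")
        else:
            labels.append("flat")
    return labels

# Made-up closing prices, for illustration only.
closes = [100.0, 101.0, 101.2, 99.0, 99.1]
raw_labels = label_movements(closes, threshold=0.005)
smoothed_labels = label_movements(moving_average(closes, window=2), threshold=0.005)
print(raw_labels)
print(smoothed_labels)
```

Changing the threshold, the smoothing window, or the sampling interval changes the labels, and therefore the classification problem itself, which is one reason why accuracies reported under different setups are hard to compare.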
Results greatly depend on the time frame considered: the variability is often greater or lower depending on the level of detail (i.e., on how much random noise it contains)... I bet that if those who claim to have an optimum really had it, they wouldn't be writing papers but swimming in other kinds of paper ;)
Seriously, as my colleagues above said, there is no general optimum that beats all the others; this applies to any prediction.
If you appreciate an answer, please use the green arrow, thanks