Lucene is a nice engine. Books like "Lucene in Action" (http://www.manning.com/hatcher2/) not only show how to use Lucene, but also give insight into how a search engine works.
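To give a feel for how little code this takes, here is a minimal indexing-and-searching sketch against Lucene's core API. This is a sketch only: the index path and field name are illustrative, and exact package names vary a bit between Lucene versions.

```java
import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.FSDirectory;

public class LuceneSketch {
    public static void main(String[] args) throws Exception {
        FSDirectory dir = FSDirectory.open(Paths.get("index"));
        StandardAnalyzer analyzer = new StandardAnalyzer();

        // Index one document with a single "body" text field.
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
            Document doc = new Document();
            doc.add(new TextField("body",
                    "The variance measures the spread of a distribution.", Field.Store.YES));
            writer.addDocument(doc);
        }

        // Search the index with a parsed, ranked query.
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            Query q = new QueryParser("body", analyzer).parse("variance");
            for (ScoreDoc hit : searcher.search(q, 10).scoreDocs) {
                System.out.println(searcher.doc(hit.doc).get("body"));
            }
        }
    }
}
```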
I want to build a search engine that searches specifically within the statistics domain.
So the engine will return only websites and journals about statistics.
Besides that, the requirements also call for a sort of dictionary inside the search engine, so that when users enter a statistics term, the results show the definition of that term.
Lastly, the search engine must be able to handle not only standard queries but also symbolized queries (queries in symbols, like beta, alpha, variance, etc.).
On the symbolic side, think of a symbol as a word in a graphical language. Knowing the character set used for your symbols, you can simply add the symbol set to your dictionary and assign a number to each symbol, the same way you did with the words.
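A minimal sketch of that numbering idea (the terms and symbols below are just examples): once words and symbols share one dictionary, a symbol is handled exactly like any other term.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SymbolDictionary {
    // Each term gets an ID; a symbol is just another entry in the same dictionary.
    private final Map<String, Integer> termIds = new LinkedHashMap<>();

    public int idOf(String term) {
        // Assign the next free ID on first sight, reuse it afterwards.
        return termIds.computeIfAbsent(term, t -> termIds.size());
    }

    public static void main(String[] args) {
        SymbolDictionary dict = new SymbolDictionary();
        // Words and symbols share one ID space.
        System.out.println(dict.idOf("variance")); // 0
        System.out.println(dict.idOf("β"));        // 1 (Greek beta as a dictionary entry)
        System.out.println(dict.idOf("α"));        // 2
        System.out.println(dict.idOf("variance")); // 0 again
    }
}
```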
To decide whether a document falls in your interest area, evaluate whether it contains at least a minimum number of high-value dictionary terms: set a threshold (a barrier number) and ignore every document with fewer dictionary terms than that. As well, if you are just going to define one term, you don't necessarily need a "define" statement, since any query consisting of only one symbol or word isn't really a search term.
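As a hedged sketch of such a threshold filter: the term list and threshold below are made up; a real dictionary would be far larger and the threshold tuned on data.

```java
import java.util.Arrays;
import java.util.Set;

public class TopicFilter {
    // Placeholder dictionary of high-value statistics terms (illustrative only).
    private static final Set<String> STATS_TERMS =
            Set.of("variance", "regression", "anova", "median", "kurtosis");
    private static final int THRESHOLD = 2; // barrier number, to be tuned

    /** Keep a document only if it mentions enough dictionary terms. */
    public static boolean isInTopic(String text) {
        long hits = Arrays.stream(text.toLowerCase().split("\\W+"))
                .filter(STATS_TERMS::contains)
                .count();
        return hits >= THRESHOLD;
    }

    public static void main(String[] args) {
        System.out.println(isInTopic("We ran an ANOVA and report the variance.")); // true
        System.out.println(isInTopic("A recipe for chocolate cake."));             // false
    }
}
```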
Documents like PDFs need special treatment to get the text out. For example, there is the Adobe IFilter plugin for Windows to extract text.
There is also free software for that. You can integrate software like Apache Tika (http://projects.apache.org/projects/tika.html) into your project; it can extract text from PDF, DOC, etc.
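With Tika's facade class the extraction is only a few lines (the file name here is illustrative):

```java
import java.io.File;
import org.apache.tika.Tika;

public class ExtractText {
    public static void main(String[] args) throws Exception {
        Tika tika = new Tika();
        // Tika detects the format (PDF, DOC, ...) and extracts plain text.
        String text = tika.parseToString(new File("paper.pdf"));
        System.out.println(text);
    }
}
```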
Although the approach explained by Joachim Pimiskern and Jens Peter Andersen is correct, all of these things are already integrated, implemented, and ready to use or modify in Apache Lucene or Solr. I don't see the need to reinvent the wheel.
I've read all your comments and recommendations, as well as the literature I might need to understand them better.
Mr Joachim: I've read about robots.txt, but pardon me, I still don't quite get it. What exactly is it? I've heard of it, and some say it's exactly the same thing as a crawler.
About ranking strategies: I studied them once; I guess I need to work harder on this.
Mr Pens: Of course I would much rather use that relevance method than the Boolean method. About the metadata, are you saying it's the indexing result? I'm not really sure about it now.. ^^'
I'd prefer to refine the query; if possible, I want to offer query suggestions to users when they make a typo in the query they need. How do I do that?
Mr Graeme: About that symbol thing, I read that Google and other conventional search engines don't really "see" symbols. Why is that? And is it important for me to index symbols when big search engines such as Google don't even "index" them?
Mr William: Yeah, I've been dreaming about that too. I just don't know how to do it.
Mr Bastian: That's very kind of you, thanks. I'll check it out.
To Mr Harsh: I'm sorry, what's a seed page?
Well, let me try to lay out some simple steps based on what I've learned these days.
If I want to build a statistical search engine, here are the things I will have to do:
1. Learn how to crawl web pages, PDF/DOC files, e-books, journals, etc., but only those that contain statistical information and knowledge. To do this I need to specialize the crawler for web pages differently from the crawler for documents (so there are "two crawlers"?). Besides that, I'll build a sort of focused crawler that wisely and smartly crawls the web (see the sketch after this list). I also need to know how to crawl symbols.
2. Build a spider/crawler to crawl the PDF/Word documents and web pages, and put the results on the server. This crawler must include some sort of robot that automatically keeps the crawling results up to date.
3. Build the indexer that stores and indexes the crawling results. This indexer then creates a dynamic-content database holding metadata about each and every web page and document. It also indexes the symbols the crawler has crawled.
4. Learn which method of searching would be best for me to use: Boolean, term frequency, relevance, or some other method of searching (query processing).
5. Build the query processor. It can either work traditionally (blindly processing whatever users type) or "modernly" (processing the query better, e.g. refining it first).
6. Build the web interface (search box) that finally integrates the whole system.
7. The whole system would be divided into four main parts: crawler, indexer, query processor, and search box.
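Regarding the focused crawler in step 1, here is a very rough sketch of the core loop, assuming the jsoup library for fetching and parsing HTML. The seed URL, topic terms, and budget are made up, and a real crawler must also honour robots.txt (the file a site uses to tell crawlers what they may fetch — it is not itself a crawler).

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class FocusedCrawler {
    public static void main(String[] args) throws Exception {
        Queue<String> frontier = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        frontier.add("https://example.org/statistics"); // seed page (illustrative)

        int budget = 50; // stop after this many fetches
        while (!frontier.isEmpty() && budget > 0) {
            String url = frontier.poll();
            if (!seen.add(url)) continue; // already visited
            budget--;

            Document page = Jsoup.connect(url).get();
            String text = page.text().toLowerCase();

            // The "focused" part: only keep (and expand from) on-topic pages.
            if (text.contains("variance") || text.contains("regression")) {
                System.out.println("INDEX: " + url); // hand off to the indexer here
                for (Element link : page.select("a[href]")) {
                    frontier.add(link.absUrl("href"));
                }
            }
        }
    }
}
```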
Do I get this right? If yes, thanks a lot. If not, I'd be really happy if you guys could help me again and point out which step I got wrong.
My next question would be: which programming language is best for me to use here? The ones I know well enough are PHP and Java. Can I use exactly the same language for each of the main parts above? What other tools could help me besides Lucene, Adobe IFilter, Apache Tika, and YiiFramework (or CI Framework)?
How long do you think it would take me to accomplish this? Thanks..
Oh yeah, to make it easier for us to imagine: I simply want to make an interface like Google's, simple, quick, and precise. It just simplifies and specializes the search results, which also means enhancing the crawler and the indexer.
Sorry for making you guys read such a long piece of text..
I'll be waiting for your replies. Thank you very much!
On the graphic symbols, the problem is to recognize the font. Once you have the font recognized, the symbols are just letters in that font. For each font you recognize (e.g., the Symbol font in Windows), a specific symbol has a specific code, which you can use in combination with the font to designate that symbol. Since different manufacturers put different symbols in different fonts, you will probably have to code for the particular font used.
In DOC and PDF files, the font type is encoded into the file and can be extracted when symbolic characters are found. Given the font and the character code, you can then figure out what the character looks like and transpose it into a symbol. This might be more work than you want to do for a Mini.
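A tiny sketch of that font-plus-code lookup (the font names and codes below are invented examples, not a real Symbol-font table):

```java
import java.util.HashMap;
import java.util.Map;

public class SymbolTranspose {
    // Key: font name + character code; value: the symbol's dictionary word.
    // NOTE: these entries are made-up examples, not a real font table.
    private static final Map<String, String> TABLE = new HashMap<>();
    static {
        TABLE.put("Symbol:0x62", "beta");
        TABLE.put("Symbol:0x61", "alpha");
        TABLE.put("SomeMathFont:0x73", "sigma");
    }

    /** Transpose a (font, code) pair from an extracted document into a searchable word. */
    public static String transpose(String font, int code) {
        return TABLE.getOrDefault(font + ":0x" + Integer.toHexString(code), "<unknown>");
    }

    public static void main(String[] args) {
        System.out.println(transpose("Symbol", 0x62)); // beta
    }
}
```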
Following Harsh's suggestion to go for PHP, may I put forward the modifiable open-source web spider and search engine Sphider (http://www.sphider.eu/), so you don't have to build everything from scratch?
You should take care not to step on your instructor's toes here, if only because he tends to mark your work. If he says Java, it could be that he wants Java because he can mark it better.
I am not familiar with the focussed crawler, but in general, the more specific your topic, the more specific your search has to be. Even semantic search may not be enough by itself to segregate out your required search terms.
Ah yeah, my friend said the same: I must not follow everything my supervisor suggests.
So is it OK if I use PHP instead of Java? What "scientific reason" can I give him for why I'm more willing to use PHP, other than my concern about my own ability in it?
I'm still confused about the focused crawler anyway; there are lots and lots of methods for it and I can't yet tell them apart.
Semantic search wouldn't be enough? What would I do then? I read one piece of literature and it said there are only three kinds of searching algorithms?
If the semantic search does not do the job, you might need to augment it with your own search, specific to what you are looking for. For instance, semantic search might work very well for words but not for symbols. This is a gotcha you sometimes get in software once you pass a certain point, where the generic stuff makes assumptions you need to override for a specific purpose.
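One way to sketch such an override is to expand symbols into their word names before the generic engine ever sees the query, so both notations hit the same index terms (the symbol map here is illustrative):

```java
import java.util.Map;

public class SymbolQueryExpander {
    // Illustrative mapping from symbols to the words the index actually stores.
    private static final Map<String, String> SYMBOL_WORDS = Map.of(
            "β", "beta",
            "α", "alpha",
            "σ²", "variance");

    /** Rewrite symbol tokens so the generic search sees ordinary words. */
    public static String expand(String query) {
        for (Map.Entry<String, String> e : SYMBOL_WORDS.entrySet()) {
            query = query.replace(e.getKey(), e.getValue());
        }
        return query;
    }

    public static void main(String[] args) {
        System.out.println(expand("σ² of β estimates")); // "variance of beta estimates"
    }
}
```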
If you are adamant that you will use PHP despite your advisor's advice to use Java, then you will need a reason; one reason is the maturity of the focussed-crawler package, if it is not available in Java. But that is also why you are lost in a thicket of methods and haven't learned to differentiate them yet: mature packages tend to have rethought their factoring a number of times, and thus carry unexplained assumptions in their oldest code that can confuse someone trying to use them out of the box.