Please guide me about designing training and test sets for a text classification algorithm. I am a beginner with classification algorithms; please guide me in this regard.
First, you have to construct the dictionary, i.e. the set of all words in your corpus without repetition. Second, you have to construct the feature vectors (for both training and testing). As a first step, I recommend using a bag-of-words representation (binary: 1 if the word occurs in the document, 0 otherwise). Then you construct the classifier, for example using libsvm (available in many languages), and save the .model file. Finally, you import this file for testing. If you want to use Naive Bayes, I recommend WEKA.
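The steps above (dictionary, binary bag-of-words vectors, linear SVM) can be sketched in Python with scikit-learn; the toy corpus and labels here are made up for illustration, and any linear SVM implementation (libsvm, LIBLINEAR, etc.) would work the same way:

```python
# Binary bag-of-words + linear SVM, assuming scikit-learn is installed.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

train_docs = [
    "the cat sat on the mat",
    "dogs chase cats",
    "stocks fell sharply",
    "markets rallied today",
]
train_labels = ["pets", "pets", "finance", "finance"]

# binary=True gives 1 if the word exists in the document, 0 otherwise;
# fit_transform also builds the dictionary (vocabulary) from the corpus.
vectorizer = CountVectorizer(binary=True)
X_train = vectorizer.fit_transform(train_docs)

clf = LinearSVC()
clf.fit(X_train, train_labels)

# Test documents must be vectorized with the SAME dictionary (transform,
# not fit_transform), otherwise the feature columns will not line up.
X_test = vectorizer.transform(["the dog sat down"])
print(clf.predict(X_test))
```

The key point for a beginner is the `fit_transform` / `transform` split: the dictionary is learned once from the training corpus and then reused as-is on the test documents.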
You can try an unsupervised approach for building the dictionary: start with known words and keep updating the dictionary's columns as new words appear. I advise you to give a clustering experiment a trial before classification; otherwise, you can classify by the angular proximity between documents, such as cosine similarity.
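Classifying by angular proximity can be sketched as a nearest-neighbour rule under cosine similarity; the term-count vectors and class names below are invented for illustration:

```python
import numpy as np

def cosine_similarity(a, b):
    # Angular proximity between two term-count vectors:
    # the cosine of the angle between them, in [0, 1] for non-negative counts.
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b) / denom if denom else 0.0

# Toy term-count vectors over a shared 3-word dictionary (made-up data).
labeled = {
    "sports":   np.array([3.0, 1.0, 0.0]),
    "politics": np.array([0.0, 1.0, 4.0]),
}
new_doc = np.array([2.0, 0.0, 1.0])

# Assign the class of the most similar labeled document.
predicted = max(labeled, key=lambda c: cosine_similarity(labeled[c], new_doc))
print(predicted)
```

Because cosine similarity normalises by vector length, a long and a short document about the same topic still end up close together, which is why it is a common choice for text.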
I assume you have the document repository, with a class assigned to each document.
You need to decide on the features of the documents that you think can discriminate between classes, and extract those features for all documents. So if you have N documents and you decide to extract f features, then your data set is an N × f matrix.
From this set, you can use 2/3 of the documents (rows) as the training set and 1/3 as the test set. This method is called hold-out testing; the 2/3 of documents are chosen randomly.
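A minimal sketch of that random 2/3 / 1/3 hold-out split (the `holdout_split` helper and the `docs` list are just illustrative; libraries like scikit-learn provide ready-made equivalents such as `train_test_split`):

```python
import random

def holdout_split(rows, train_fraction=2 / 3, seed=0):
    # Shuffle the rows with a fixed seed for reproducibility, then take
    # the first train_fraction of them as training and the rest as test.
    rows = list(rows)
    rng = random.Random(seed)
    rng.shuffle(rows)
    cut = round(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]

docs = [f"doc{i}" for i in range(30)]
train, test = holdout_split(docs)
print(len(train), len(test))  # 20 10
```

Shuffling before splitting matters: if the repository is stored class by class, taking the first 2/3 of the rows without shuffling could leave an entire class out of the training set.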
I suggest you use Weka (free data-mining software), which saves you the bother of constructing explicit train/test sets. The options in Weka allow you to easily classify your data with the selected algorithm. Documentation is available.
Thanks, all of you, for helping me. I am new to this field; please guide me step by step so that I can understand how to select features if I take books as the entire domain.
The categorisation is a problem in its own right, I mean deciding how to build the initial categories; for books, librarians already have a good system.
Now, to go from fixed categories to classifying text with algorithms, the following paper gives a good overview of the full process, from preprocessing to classification. As mentioned, you can use Weka to do all of it (it is not the only possibility; if you work in Python, for example, that language has packages for NLP).