Unigrams, bigrams or n-grams?

Vikas Ramachandra

It depends on your corpus.

You can treat the choice of n-gram size as a feature-selection step and pick the best option by cross-validation accuracy.
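To make the options concrete, here is a minimal sketch of what unigram and bigram features look like for one sentence. It assumes plain whitespace tokenisation, and the helper names `ngrams` and `ngram_features` are illustrative only, not taken from any library.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the n-grams (as tuples) found in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_features(text, n_values=(1, 2)):
    """Count n-grams of each requested size (assumes whitespace tokenisation)."""
    tokens = text.lower().split()
    counts = Counter()
    for n in n_values:
        counts.update(ngrams(tokens, n))
    return counts

# Unigrams and bigrams together for one short sentence.
features = ngram_features("the cat sat on the mat")
for gram, count in sorted(features.items()):
    print(" ".join(gram), count)
```

Each value of `n_values` gives a different feature space, so the candidates to compare by cross-validation would be, for example, `(1,)`, `(2,)` and `(1, 2)`.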

Samer Sarsam

Hi Manjula,

Generally, it depends on the task, so I recommend reading previous work in your area to figure this issue out.

On the other hand, applying feature selection can be useful once you know which tokenizer you need (i.e., unigrams, bigrams, etc.). In addition, accuracy alone is NOT enough to evaluate a prediction model; you need to use other measures as well, such as the ROC curve.
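As a rough illustration of why accuracy alone can mislead, the sketch below compares accuracy with the area under the ROC curve on an imbalanced toy example. The data are made up, the 0.5 decision threshold is an assumption, and `roc_auc` uses the pairwise-ranking definition of AUC rather than any particular library's implementation.

```python
def accuracy(labels, scores, threshold=0.5):
    """Fraction of correct predictions at a fixed (assumed) threshold."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

def roc_auc(labels, scores):
    """Area under the ROC curve: the probability that a random positive
    is scored above a random negative, with ties counted as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Made-up imbalanced data: nine negatives, one positive,
# and a "model" that gives every document the same low score.
labels = [0] * 9 + [1]
scores = [0.1] * 10

print("accuracy:", accuracy(labels, scores))
print("ROC AUC: ", roc_auc(labels, scores))
```

Here the accuracy is 0.9 simply because most documents are negative, while the AUC of 0.5 shows that the scores do not separate the classes at all.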

HTH.

Samer

Manjula Wijewickrema

Dear Vikas and Samer,

Thank you very much for your answers. It seems both of you suggest that the best fitting tokenizer, size of feature space, etc. will depend on the corpus we use, nature of the documents we classify, etc. Hence, the best way to know the most suitable model will be classifying a set of test documents and inspecting the accuracy, ROC curve, etc. Am I correct?

Generally, extract the text features and preprocess them. After that, apply classification algorithms and evaluate them using accuracy, ROC curve, etc.

Ian Wood

The best approach depends on the type of data (what the text is about), how much you have, whether you already have labelled data to train a model (without this you'll need to use unsupervised approaches and have a strategy for evaluation) and whether you want to use the model on new, unseen data later on.

For your evaluation, be sure that the categorisation of test data is done independently of the model (ie: don't look at model output and say "yes, looks good", instead categorise some test data FIRST, then see if the model agrees).

Be sure to use good experimental design here so as to be sure to detect overfitting. That is:

1. Take a sample of your data for evaluation and hide this sample away (ie: don't use it until the last step). ~30% of your data is a common choice, depending on the task at hand.

2. Use cross-validation to experiment with different tokenisations (eg: unigrams, bigrams, special lexicons appropriate to your task etc...) and different classification algorithms (eg: SVM, naive bayes, ... for supervised models, k-nearest neighbours, k-means, hierarchical clustering, ... for unsupervised). Here you should NOT include the data held out in step 1.

3. Once you have settled on what seems a good approach, apply it to the data held out in step 1 and evaluate the result.

This approach is important to get a realistic estimate of the quality of your model on new data. Even when you only want to classify data you already have, use this approach to estimate the quality of your final model, as overfitting may be a problem even then (it'll add a hidden bias to your categorisations), and it can be hard to detect post-factum.
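The three steps above can be sketched in plain Python as follows. This is only an outline under stated assumptions: the toy data are made up, `fit` and `score` are a deliberately simple stand-in classifier (n-gram overlap) rather than an SVM or naive Bayes, and all function names are hypothetical.

```python
import random

def ngram_set(text, n):
    """Set of word n-grams in a whitespace-tokenised text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def fit(train, n):
    """Toy stand-in model: remember which n-grams occurred with each label."""
    seen = {}
    for text, label in train:
        seen.setdefault(label, set()).update(ngram_set(text, n))
    return n, seen

def score(model, examples):
    """Accuracy of the toy model: predict the label with most shared n-grams."""
    n, seen = model
    correct = 0
    for text, label in examples:
        grams = ngram_set(text, n)
        predicted = max(sorted(seen), key=lambda lab: len(grams & seen[lab]))
        correct += predicted == label
    return correct / len(examples)

def holdout_split(data, test_fraction=0.3, seed=0):
    """Step 1: shuffle, then set aside a sample that is not used until the end."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]  # (development, held out)

def k_folds(data, k):
    """Yield (train, validation) pairs for k-fold cross-validation."""
    for i in range(k):
        validation = data[i::k]
        train = [x for j, x in enumerate(data) if j % k != i]
        yield train, validation

def select_config(dev, configs, k=4):
    """Step 2: choose the n-gram size with the best mean cross-validated
    score, using the development data only."""
    def mean_cv(n):
        results = [score(fit(train, n), val) for train, val in k_folds(dev, k)]
        return sum(results) / len(results)
    return max(configs, key=mean_cv)

# Made-up labelled documents, for illustration only.
data = [
    ("a good film", "pos"), ("really good acting", "pos"),
    ("good plot and good cast", "pos"), ("I liked it", "pos"),
    ("worth watching twice", "pos"), ("a good story", "pos"),
    ("not good at all", "neg"), ("a bad film", "neg"),
    ("not worth watching", "neg"), ("I disliked it", "neg"),
    ("bad plot and bad cast", "neg"), ("not a good story", "neg"),
]

dev, held_out = holdout_split(data)           # step 1
best_n = select_config(dev, configs=[1, 2])   # step 2: held_out is never touched
estimate = score(fit(dev, best_n), held_out)  # step 3: one final evaluation
print("chosen n-gram size:", best_n, "held-out accuracy:", estimate)
```

The point of the structure is that `held_out` appears only in the last line: every comparison between tokenisations happens inside `select_config`, so the final number is not influenced by the model selection.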

Jon Patrick

Information Gain is the easiest way to assess which features help your model best, but you will need to build those features first to get the evaluation.
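For reference, information gain for a single binary feature (an n-gram that is present or absent in a document) can be computed as in the sketch below. It assumes each document has already been turned into a set of features; the toy documents and the function names are illustrative.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(documents, labels, feature):
    """Reduction in label entropy from knowing whether `feature` is present.
    `documents` is a list of feature sets, one set per document."""
    present = [y for d, y in zip(documents, labels) if feature in d]
    absent = [y for d, y in zip(documents, labels) if feature not in d]
    total = len(labels)
    conditional = sum(len(part) / total * entropy(part)
                      for part in (present, absent) if part)
    return entropy(labels) - conditional

# Made-up example: "good" separates the classes perfectly, "film" not at all.
documents = [{"good"}, {"good", "film"}, {"bad"}, {"bad", "film"}]
labels = [1, 1, 0, 0]

for feature in ("good", "film"):
    print(feature, information_gain(documents, labels, feature))
```

The same function works for bigram features if the sets contain bigrams, which is why the features have to be built before they can be ranked.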

All your ideas are very helpful for improving my work. Thanks a lot for the contributions.

@Ian:

Could you please clarify the importance of evaluating the system again (using the ~30% of data initially not used) after selecting the best model? Why is it not enough to judge the performance of the best model from the results obtained in Step 2?

Thanks.