Instead of doing a simulation, I need a dataset with hundreds or thousands of records distributed over multiple data sources. I need it for data integration purposes where the same entity may have multiple records in different data sources.
Very good question. I'm looking for such datasets myself. I have tried making them, but the main problem is always providing ground truth: an indication of which records are duplicates is doable, but providing the best fusion result is really difficult. Anyway, the closest I have come is the datasets you can find here
I can see that some of the links provided still point to simulated datasets. I think it would be equally appropriate to simulate a dataset and maybe add Gaussian noise to make it a plausible stand-in for a real-life situation, then go ahead and test whatever algorithms you wish to test. You can reduce doubt about your results by carrying out a consistency analysis. I believe that way you can disseminate your findings with some degree of confidence.
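For concreteness, here is a minimal sketch of what "simulate a dataset and add Gaussian noise" could look like. The column names, sizes and noise scale are purely illustrative assumptions, not taken from any real benchmark.

    # Minimal sketch: simulate a small table with numeric attributes and
    # perturb them with zero-mean Gaussian noise. The untouched "id" column
    # serves as the ground-truth key for later evaluation.
    import numpy as np

    rng = np.random.default_rng(42)
    n = 1000

    records = {
        "id": np.arange(n),
        "age": rng.integers(18, 90, size=n).astype(float),
        "income": rng.lognormal(mean=10.5, sigma=0.6, size=n),
    }

    def add_gaussian_noise(values, relative_sd=0.05):
        """Add Gaussian noise scaled to the column's standard deviation."""
        sd = relative_sd * np.std(values)
        return values + rng.normal(0.0, sd, size=values.shape)

    noisy = {
        "id": records["id"],                       # keep the ground truth intact
        "age": add_gaussian_noise(records["age"]),
        "income": add_gaussian_noise(records["income"]),
    }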
Hi Cliff, a problem with making such a simulated dataset is that not much is known about how to introduce "noise" in a realistic fashion. Most researchers agree that the size of a cluster of duplicates is Zipf distributed, and some have proposed models for introducing typographical errors, but beyond that it is guesswork what a realistic error model looks like: think of abbreviations, multi-valued attributes, subjectiveness (e.g. the musical genre of a CD)... And of course, if there is no realistic error model, a simulated dataset tends to be biased toward working well for the very algorithm you want to test.
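To illustrate the part that is reasonably agreed upon, here is a minimal sketch of such an error model: duplicate-cluster sizes drawn from a truncated Zipf distribution and simple typographical edits (substitution, deletion, transposition) applied to string values. Every parameter here is a guess, which is exactly the problem; abbreviations, multi-valued attributes and subjective values are not modelled at all.

    import random
    import string
    import numpy as np

    random.seed(0)
    np.random.seed(0)

    def zipf_cluster_size(a=2.0, max_size=10):
        """Draw a duplicate-cluster size from a Zipf(a) law, truncated at max_size."""
        while True:
            k = int(np.random.zipf(a))
            if k <= max_size:
                return k

    def add_typo(s):
        """Apply one random typographical edit: substitution, deletion or transposition."""
        if len(s) < 2:
            return s
        i = random.randrange(len(s) - 1)
        op = random.choice(["substitute", "delete", "transpose"])
        if op == "substitute":
            return s[:i] + random.choice(string.ascii_lowercase) + s[i + 1:]
        if op == "delete":
            return s[:i] + s[i + 1:]
        return s[:i] + s[i + 1] + s[i] + s[i + 2:]   # swap two adjacent characters

    clean_names = ["john smith", "maria garcia", "wei zhang"]
    corpus = []
    for entity_id, name in enumerate(clean_names):
        for _ in range(zipf_cluster_size()):
            corpus.append((entity_id, add_typo(name)))   # entity_id is the ground truth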
By "powerful" I mean more trusted datasets that have multiple records and attributes, with a range of distinct values within each attribute and a good percentage of duplicates between datasets. In other words, datasets that could be used for duplicate detection between records.
Antoon, I can't disagree with you at all; however, the difficulty should not lie in how we introduce the error, but in the size of the error we ought to introduce and, perhaps in line with what you have pointed out, the distribution of that error, if not Gaussian. Your answer reminds me that we statisticians are always faced with challenges in tasks like estimation and forecasting, which would never be possible without making strong assumptions about the datasets used. In that regard, I believe we can make do with the noised simulated dataset until a "better" dataset is obtained, if one is ever obtained at all!