How can I find the collection of duplicate items in big data?

Tiago Tresoldi Popular answer

One option is to add a new field holding a hash of the concatenated fields. With a well-chosen hash function, false positives will be minimal even in big data (and you can still compare records with equal hashes field by field to make sure there are no false positives). Depending on your data, you can also use a probabilistic data structure such as a Bloom filter, or locality-sensitive hashing; in some (rare) cases, removing similar entries can improve performance.
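A minimal sketch of the hash-then-verify idea, assuming each record is a tuple of strings. The function names, the SHA-256 choice, and the unit-separator character used to join fields are illustrative assumptions, not part of the answer above.

```python
import hashlib
from collections import defaultdict


def record_digest(record, sep="\x1f"):
    """Hash the concatenated fields of one record (a tuple of strings)."""
    joined = sep.join(record)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def find_duplicates(records):
    """Return groups of indices whose records are identical."""
    # Step 1: bucket record indices by the hash of their fields.
    buckets = defaultdict(list)
    for index, record in enumerate(records):
        buckets[record_digest(record)].append(index)

    # Step 2: equal hashes are only candidates, so confirm field by field.
    groups = []
    for indices in buckets.values():
        if len(indices) < 2:
            continue
        confirmed = defaultdict(list)
        for index in indices:
            confirmed[tuple(records[index])].append(index)
        groups.extend(g for g in confirmed.values() if len(g) > 1)
    return groups


records = [
    ("alice", "paris", "1990"),
    ("bob", "rome", "1985"),
    ("alice", "paris", "1990"),
]
duplicate_groups = find_duplicates(records)
print(duplicate_groups)
```

On a real big-data platform the digest would typically be stored as an extra column and the grouping done by the engine; the in-memory dictionaries here only illustrate the logic.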

Ronald Avrom Barr

There are specific functions in R for identifying duplicated data, namely unique() and duplicated().

Is this what you are looking for?

Mohammad Ahmadzadeh

I need a scientific method

Mohammad zakaria Masoud

The easiest method is two nested for loops: when an item is found again, delete it. Note that this takes O(n^2) comparisons in the worst case, whether or not duplicates are deleted along the way; reaching O(n log n) requires sorting the items first and comparing neighbours. The nested-loop algorithm may take a long time in normal computation, so try running it in parallel (threads or GPUs).
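For comparison, here is a rough sketch of the nested-loop approach next to two common alternatives: sorting first, and a single pass with a hash set. It assumes the items fit in memory and can be compared and hashed; the function names are made up for illustration.

```python
def duplicates_nested(items):
    """Baseline: compare every pair of items (quadratic number of comparisons)."""
    found = []
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if items[i] == items[j] and items[i] not in found:
                found.append(items[i])
    return found


def duplicates_sorted(items):
    """Sort first, then compare neighbours: O(n log n) overall."""
    found = []
    ordered = sorted(items)
    for previous, current in zip(ordered, ordered[1:]):
        if previous == current and (not found or found[-1] != current):
            found.append(current)
    return found


def duplicates_hashed(items):
    """Single pass with a set: O(n) on average."""
    seen, found = set(), set()
    for item in items:
        if item in seen:
            found.add(item)
        seen.add(item)
    return found


items = [3, 1, 3, 2, 1, 3]
print(duplicates_nested(items), duplicates_sorted(items), duplicates_hashed(items))
```

The sorted and hashed variants are usually the ones that scale; both also split naturally across partitions of the data, which is what parallel or distributed frameworks exploit.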

Javad Ghorbani

You can use Hadoop to find duplicates in big data.

Julian Vasilev

You can add a new field to the dataset by concatenating several string fields with a separator, e.g. "~". Then group the dataset by the new field and count the cases.
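The concatenate-group-count steps can be sketched in a few lines, assuming the rows are dictionaries held in memory. The function name and the sample rows are hypothetical, and the separator must be a character that never occurs inside the field values, otherwise different rows could produce the same key.

```python
from collections import Counter


def count_by_key(rows, fields, sep="~"):
    """Build a concatenated key per row and return the keys seen more than once."""
    key_counts = Counter(sep.join(str(row[f]) for f in fields) for row in rows)
    return {key: n for key, n in key_counts.items() if n > 1}


rows = [
    {"name": "Ana", "city": "Varna"},
    {"name": "Ivo", "city": "Sofia"},
    {"name": "Ana", "city": "Varna"},
]
repeated = count_by_key(rows, ["name", "city"])
print(repeated)
```

In SQL or a dataframe library the same idea becomes a GROUP BY on the new field with a count greater than one.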

Ergin Soysal

It's a very broad question with few details to go on. First, you need to find the features (columns in your data) that make items the same or similar. Based on your data, you can develop a model to select or calculate an item score that uniquely represents these features. Then find the score threshold (acceptable score variation) that minimizes the error. This may save you from comparing each row with every other row.
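One possible reading of this advice, as a sketch only: normalise the chosen features into a single string, compare rows only within a coarse block, and treat a pair as similar when a similarity score passes a threshold. The blocking rule (first character), the 0.9 threshold, and the use of difflib as the scoring function are all assumptions made for illustration, not something the answer specifies.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def normalise(row, features):
    """Join the selected features into one lower-cased, trimmed string."""
    return " ".join(str(row[f]).strip().lower() for f in features)


def similar_pairs(rows, features, threshold=0.9):
    """Return index pairs whose normalised features score at or above the threshold."""
    # Blocking: only rows sharing the same first character are compared,
    # which avoids comparing every row with every other row.
    blocks = defaultdict(list)
    for index, row in enumerate(rows):
        text = normalise(row, features)
        blocks[text[:1]].append((index, text))

    pairs = []
    for members in blocks.values():
        for (i, a), (j, b) in combinations(members, 2):
            if SequenceMatcher(None, a, b).ratio() >= threshold:
                pairs.append((i, j))
    return pairs


rows = [
    {"name": "John Smith", "city": "London"},
    {"name": "john smith ", "city": "London"},
    {"name": "Jane Doe", "city": "Leeds"},
]
pairs = similar_pairs(rows, ["name", "city"])
print(pairs)
```

The threshold would need tuning against labelled examples from the actual data, and a cruder block key trades missed matches for speed.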

Jianlin Shi

How big will it be? Can you hash each record and compare the hash values? That approach should be able to handle a significant amount of data.

Hossein Hashemi

good