Transition from Python to PySpark?

Leon Palafox

Hi,

First of all, if you are using a single machine, I would highly recommend using threading in Python, or some other way of using all the cores at the same time.
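A minimal sketch of the single-machine idea, using only the standard library. The names `heavy_task` and `run_parallel` are hypothetical stand-ins, not anything from the original question. Note that it defaults to a process pool rather than threads: that is a swap on my part, on the assumption that the work is CPU-bound pure Python, where threads are held back by the GIL.

```python
import os
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def heavy_task(n):
    """Hypothetical CPU-bound stand-in for one loop iteration."""
    return sum(i * i for i in range(n))


def run_parallel(inputs, executor_cls=ProcessPoolExecutor, workers=None):
    """Apply heavy_task to every input using a pool of workers.

    Results come back in the same order as the inputs.
    """
    workers = workers or os.cpu_count()
    with executor_cls(max_workers=workers) as pool:
        return list(pool.map(heavy_task, inputs))
```

When using the process pool, call `run_parallel(...)` from under an `if __name__ == "__main__":` guard so that worker processes can import the module safely. Passing `executor_cls=ThreadPoolExecutor` switches to threads, which mainly helps when the work is I/O-bound or releases the GIL.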

If you have a while-do loop, I would recommend trying to parallelize it with one of Spark's mapping routines, since explicit loops (for, while) are not very efficient in PySpark.
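As a rough illustration of replacing a loop with a Spark map, here is a sketch that assumes each iteration is independent of the others. `process_item`, `run_sequential`, `run_with_spark`, and the app name are hypothetical; the real loop body would go where `process_item` is.

```python
def process_item(i):
    """Hypothetical stand-in for the body of one loop iteration."""
    return i * i


def run_sequential(items):
    # The plain-Python loop, which runs entirely on the driver.
    return [process_item(i) for i in items]


def run_with_spark(items, num_slices=8):
    # Imported lazily so the rest of the sketch works without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("loop-to-map").getOrCreate()
    try:
        rdd = spark.sparkContext.parallelize(items, numSlices=num_slices)
        # map() ships process_item to the executors; collect() brings results back.
        return rdd.map(process_item).collect()
    finally:
        spark.stop()
```

Both functions should return the same list for the same input; the Spark version only pays off if each `process_item` call is expensive enough to outweigh the cost of serializing the work and shipping it to the executors.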

Also, I would only recommend Spark if you have multiple machines. On a single machine, Spark is hardly more efficient than a good Python implementation.

Roberto Diaz

Is Spark a requirement?

As Leon commented, threading in Python is the best option for a single machine with many cores.

If you have several machines, the best option is MPI for Python:

http://mpi4py.scipy.org/docs/

You can also combine MPI with multithreading to use several machines and all the cores on each machine.
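For a flavour of what the MPI route looks like, here is a minimal scatter/gather sketch with mpi4py. It assumes independent work items; `split_work`, `process_item`, and `main` are hypothetical names, and the multithreading layer mentioned above is left out to keep it short.

```python
def split_work(items, size):
    """Deal items round-robin into `size` chunks, one per MPI rank."""
    return [items[r::size] for r in range(size)]


def process_item(i):
    """Hypothetical stand-in for one unit of work."""
    return i * i


def main():
    # Imported lazily so split_work can be tried without MPI installed.
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    size = comm.Get_size()

    # Rank 0 prepares the chunks; scatter hands one chunk to each rank.
    chunks = split_work(list(range(20)), size) if rank == 0 else None
    my_items = comm.scatter(chunks, root=0)

    my_results = [process_item(i) for i in my_items]

    # gather returns a list of per-rank results on rank 0 and None elsewhere.
    gathered = comm.gather(my_results, root=0)
    if rank == 0:
        flat = [x for part in gathered for x in part]
        print(sorted(flat))
```

To try it, call `main()` at the bottom of a script and launch it with something like `mpiexec -n 4 python script.py` (the script name is a placeholder).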

Spark offers fault tolerance and similar features, but it is very slow compared with MPI or multithreading.

Anyway, if you must use Spark, I think we need more information about your process to help you.

Tony C. Scott

Thank you Leon and Roberto. Yes, Spark is a requirement. What needs parallelization is a for loop, but the body of that loop is somewhat complicated: it creates a sparse matrix, sums over its columns, and then performs a logical row reduction inside a while-do loop. The result of each iteration is a much reduced sparse matrix. After the loop, all the resulting sparse matrices are concatenated and row reduced one final time. It's a bit like a map-reduce operation. I have looked at "parfor" in MATLAB (joblib in Python) but got only moderate results (2 processors only). The Spark machine I am using offers more in terms of cores and threads, though it is more complicated. Thanks again.
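The map-reduce shape described above can be sketched in plain Python. Everything here is a placeholder under stated assumptions: each sparse row is modelled as a frozenset of nonzero column indices, `build_rows` stands in for the matrix construction, and `reduce_rows` (which just drops duplicate rows) stands in for the actual column sums and logical row reduction, which are not specified in this thread.

```python
from itertools import chain


def build_rows(i):
    """Hypothetical stand-in for building one sparse matrix.

    Each row is a frozenset of the column indices holding a nonzero entry.
    """
    return [frozenset({i % 2, k}) for k in range(3)]


def reduce_rows(rows):
    """Placeholder row reduction: drop duplicate rows, keeping first-seen order."""
    return list(dict.fromkeys(rows))


def map_step(i):
    # Everything inside the original for loop, for a single index i.
    return reduce_rows(build_rows(i))


def reduce_step(parts):
    # Concatenate the per-iteration results, then row-reduce one final time.
    stacked = list(chain.from_iterable(parts))
    return reduce_rows(stacked)


def run_pipeline(indices, mapper=map):
    # `mapper` is anything with the signature of the built-in map.
    return reduce_step(mapper(map_step, indices))
```

Because `map_step` depends only on its index, the `mapper` argument can be swapped for a parallel map with the same signature, such as the `map` method of a `concurrent.futures` executor. With real scipy.sparse matrices, the concatenation step would typically be `scipy.sparse.vstack`.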

I am very disappointed! The idea of parallelization was to get improved computational performance. Neither threading nor multiprocessing in Python yields faster results than a simple list comprehension, which in general is no faster than a for loop unless the function is VERY simple. In part, this is because of the Global Interpreter Lock (GIL); David Beazley gave a talk on this point. Shared memory is often a liability. I had hoped that the Java layer in PySpark would produce better results, but it doesn't. A machine with 48 CPU cores using PySpark produces results no faster than what I can get on my Mac with 4 CPU cores. It's a disgrace. IMHO, it's a design flaw.