How to clean a complex dataset of anomalous observations?

01 January 2014

My co-author and I made the first cut at the MIT Sports Statistics & Analytics Conference, where we were looking at MLB pitcher injuries and Pitch f/x data. Needless to say, the data maintenance on the pitch-by-pitch data set (nearly 4 million observations) was intensive and took up much of the last year. We got the analysis done right before the deadline and were able to submit our paper.

Was it everything that we had wanted to do? Of course not. We didn't get to the graphical analysis or the predictive modeling that we had initially planned. But we were able to answer the primary research questions that we had proposed and demonstrate proof-of-concept for using Pitch f/x data to monitor propensity-to-injury indicators. Of course we had to say, "more refinement and research is needed in this area." That just gives us something to do for JSM 2014.

My question concerns the anomalies that we found in the data.

About 10% of the data (360K observations) were located in Seattle and had only pitch placement data. Pitch types, speeds, and all the other "good stuff" were missing. We were able to use those observations for analysis involving pitch counts, but not for anything more involved than that.
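To make this kind of pattern concrete, here is a minimal sketch of how missingness could be profiled by location before deciding what each subset can support. The field names (`park`, `pitch_type`, `speed`) and the toy rows are hypothetical and are not the actual Pitch f/x schema.

```python
from collections import defaultdict

def missing_rate_by_group(rows, group_key, fields):
    """Return {group: {field: fraction of rows in that group where field is None}}."""
    totals = defaultdict(int)
    missing = defaultdict(lambda: defaultdict(int))
    for row in rows:
        group = row[group_key]
        totals[group] += 1
        for field in fields:
            if row.get(field) is None:
                missing[group][field] += 1
    return {
        group: {field: missing[group][field] / totals[group] for field in fields}
        for group in totals
    }

# Toy rows with hypothetical column names (not real data).
pitches = [
    {"park": "SEA", "px": 0.1, "pz": 2.3, "pitch_type": None, "speed": None},
    {"park": "SEA", "px": -0.4, "pz": 1.9, "pitch_type": None, "speed": None},
    {"park": "BOS", "px": 0.3, "pz": 2.8, "pitch_type": "FF", "speed": 93.1},
    {"park": "BOS", "px": 0.0, "pz": 2.1, "pitch_type": "CU", "speed": None},
]

rates = missing_rate_by_group(pitches, "park", ["pitch_type", "speed"])
for park, by_field in rates.items():
    print(park, by_field)
```

A group whose rate is at or near 1.0 for every field except placement would be usable for pitch counts only, which matches how we treated those observations.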

The really anomalous observations involved "low volume" pitch types. Apparently the neural net algorithms used since 2007-08 to identify pitch types have changed and adapted. Some pitch-type labels appear only a few times (e.g., 26) and are never seen again, while others are legitimate pitch types (e.g., the Eephus, a very low speed pitch, which appears 326 times). These observations have a disproportionate influence on the outcomes, much like the effect you'd expect an outlier to have on a regression equation.
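One way to separate the two kinds of low-volume labels might be to look at how often a label appears and in how many seasons it recurs. The sketch below is only an illustration: the field names (`pitch_type`, `season`), the label `XX`, and both thresholds are assumptions chosen for the toy data, not values taken from our dataset.

```python
from collections import Counter, defaultdict

def profile_labels(rows, min_count=100, min_seasons=2):
    """Classify each pitch-type label as common, rare but recurring, or rare and transient.

    min_count and min_seasons are illustrative thresholds, not recommendations.
    """
    counts = Counter()
    seasons = defaultdict(set)
    for row in rows:
        label = row["pitch_type"]
        counts[label] += 1
        seasons[label].add(row["season"])

    profile = {}
    for label, n in counts.items():
        n_seasons = len(seasons[label])
        if n >= min_count:
            status = "common"
        elif n_seasons >= min_seasons:
            status = "rare_recurring"   # low volume, but seen in several seasons
        else:
            status = "rare_transient"   # low volume and confined to one season
        profile[label] = {"n": n, "seasons": n_seasons, "status": status}
    return profile

# Toy data with hypothetical labels and seasons.
rows = (
    [{"pitch_type": "FF", "season": 2008}] * 100
    + [{"pitch_type": "FF", "season": 2009}] * 50
    + [{"pitch_type": "EP", "season": 2008}] * 3
    + [{"pitch_type": "EP", "season": 2010}] * 3
    + [{"pitch_type": "XX", "season": 2008}] * 3
)

profile = profile_labels(rows)
for label, info in sorted(profile.items()):
    print(label, info)
```

Labels flagged as transient would still need a manual look before being dropped or recoded; the flag only narrows down where to look.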

Here's my problem. If you were going to "clean" the data, what would you do, and how would you go about it? I know that it's common practice in a lot of the published studies (Baseball Prospectus et al.) to limit subjects to players who meet a threshold number of innings. But if you're going to look at injuries, I would think that you would apply that threshold only to your control group.
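As a minimal sketch of what I mean by thresholding only the controls: injured pitchers are always kept, and the innings cutoff is applied to everyone else. The record fields (`id`, `innings`, `injured`) and the cutoff of 50 innings are hypothetical, chosen only to illustrate the idea.

```python
def apply_innings_threshold(pitchers, min_innings=50.0):
    """Keep every injured pitcher; keep a non-injured (control) pitcher
    only if he meets the innings threshold."""
    return [
        p for p in pitchers
        if p["injured"] or p["innings"] >= min_innings
    ]

# Toy records with hypothetical fields.
pitchers = [
    {"id": "A", "innings": 12.0, "injured": True},    # kept: injured, despite low innings
    {"id": "B", "innings": 180.0, "injured": False},  # kept: control above the threshold
    {"id": "C", "innings": 20.0, "injured": False},   # dropped: control below the threshold
    {"id": "D", "innings": 95.0, "injured": True},    # kept: injured
]

kept = apply_innings_threshold(pitchers)
kept_ids = [p["id"] for p in kept]
print(kept_ids)
```

Whether an asymmetric filter like this introduces its own bias is part of what I'm asking about.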

Since this was our first pass at the data, I was disinclined to do anything but use the whole population, be conservative in my findings, and explain the peculiarities of the dataset in the discussion. For further work, though, I know that I'm going to have to do things like stratify outcomes by injury type, control for pitcher/batter handedness, etc.

Any suggestions you have, based on working with messy, real-life data, would be appreciated.
