Analyzing Chemical Spills Datasets (.csv files)
An OSPool Tutorial
Spills of hazardous materials that can impact water and land quality, like petroleum, mercury, and battery acid, are required by law to be reported to the United States government. In this tutorial, we will analyze records provided by the state of New York on spills of hazardous materials that occurred from 1950 to 2019.
The data used in this tutorial was collected from https://catalog.data.gov/dataset/spill-incidents/resource/a8f9d3c8-c3fa-4ca1-a97a-55e55ca6f8c0 and modified for teaching purposes.
To access all of the materials needed to complete this tutorial, first log into your OSPool access point and run the following command:
git clone https://github.com/OSGConnect/tutorial-spills-R/
Step 1: Get to Know the Hazardous Spills Dataset
Let's explore the data files that we will be analyzing. Before we do so, we must make sure we are in the tutorial directory (tutorial-spills-R/). We can do this by printing our working directory (pwd):
pwd
We should see something similar to /home/jovyan/tutorial-spills-R/, where jovyan may instead be your OSG account username.
Next, let's navigate into the data/ directory and list (ls) the files inside of it:
cd data/
ls
We should see seven .csv files, one for each decade from 1950 to 2019.
To explore the contents of these files, we can use commands like head -n 5 <fileName> to view the first five lines of a data file.
head -n 5 spills_1980_1989.csv
We can also use the navigation bar on the left side of the notebook to double-click and open each comma-separated values (.csv) file and view it in a table format, instead of the command-line rendering above.
Step 2: Prepare the R Executable
Next, we need to create an R script to analyze our datasets. An example of an R script can be found in our main tutorial directory, so let's navigate there:
cd ../   # move up one directory level
ls       # list the files
Then, let's print the contents of our executable script:
cat spill_calculation.r
This script reads a dataset file name as a command-line argument and then carries out summary statistics, printing the number of spills recorded in that decade and the total size (in gallons) of the hazardous spills.
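To give a sense of what such a script involves, here is a minimal sketch. This is not the tutorial's actual script, and the column name Quantity is a hypothetical stand-in, so check the output of cat spill_calculation.r above for the real code and the dataset's real column names.

#!/usr/bin/env Rscript
# Minimal sketch of a spill-summary script (illustrative only, not the tutorial's script).
# Assumes the .csv file has a numeric column named "Quantity" -- a hypothetical column name.
args <- commandArgs(trailingOnly = TRUE)   # e.g., "spills_1950_1959.csv"
dataset <- args[1]
spills <- read.csv(dataset)
cat("File:", dataset, "\n")
cat("Number of spills:", nrow(spills), "\n")
cat("Total gallons:", sum(spills$Quantity, na.rm = TRUE), "\n")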
Step 3: Prepare Portable Software
Some common software, like R, is provided by OSG using containers. Because of this, you do not need to install R yourself; you just need to tell HTCondor which container to use for your jobs. Additionally, this tutorial uses only base R and no special libraries, but if you need libraries (e.g., tidyverse, ggplot2), you can always install them in your R container.
A list of containers and other software provided by OSG staff can be found on our website https://portal.osg-htc.org/documentation/, along with resources for learning how to add libraries to your container.
We will be using the container for R 3.5.0, which is accessible under /cvmfs/singularity.opensciencegrid.org/opensciencegrid/osgvo-r:3.5.0, so we must make sure to tell HTCondor to fetch this container when starting each of our jobs. The submit file in the next step shows how to do this.
Step 4: Prepare and Submit an HTCondor Submit File for One Test Job
The HTCondor submit file tells HTCondor how you would like your job to be run on your behalf. For example, you specify which executable you want to run, which container to use (if any), the resources you would like available to your job, and any special requirements.
Step 4A: Prepare and Submit an HTCondor Submit File
A sample submit file to analyze our smallest dataset, spills_1950_1959.csv, might look like:
cat R.submit
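The command above prints the actual submit file from the repository. As a rough sketch of what to expect, it will contain lines along these general lines; the resource requests and file names here are illustrative assumptions, so defer to the real R.submit where it differs:

# Sketch of a single-job submit file (illustrative values, not the repository's exact R.submit).
# Run the job inside the OSG-provided R 3.5.0 container from Step 3.
+SingularityImage = "/cvmfs/singularity.opensciencegrid.org/opensciencegrid/osgvo-r:3.5.0"

executable = spill_calculation.r
arguments = spills_1950_1959.csv
transfer_input_files = data/spills_1950_1959.csv

log = spills.log
error = spills.err
output = output/spills.out

request_cpus = 1
request_memory = 1GB
request_disk = 1GB

queue 1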
We can submit this job using condor_submit <SubmitFile>:
condor_submit R.submit
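If the submission succeeds, condor_submit prints a short confirmation like the following (the cluster number will differ on your access point):

Submitting job(s).
1 job(s) submitted to cluster 1234.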
We can check on the status of our job in HTCondor's queue by running:
condor_q
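While the job is in the queue, the output of condor_q looks roughly like this, with one row summarizing our batch of jobs; the access point name, times, and IDs below are placeholders:

-- Schedd: ap.example.osg-htc.org : <...> @ 01/01/24 12:00:00
OWNER    BATCH_NAME    SUBMITTED    DONE   RUN   IDLE   TOTAL  JOB_IDS
jovyan   ID: 1234      1/1 12:00      _     _      1       1   1234.0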
Once our job is done running, it will leave HTCondor's queue automatically.
Step 4B: Review Test Job Results
When the job has finished, we can check the results by looking in our output folder:
cat output/spills.out
We should see that from 1950 to 1959, New York recorded five spills, totaling 0 recorded gallons.
Step 5: Scale Out Your Workflow to Analyze Many Datasets
We just prepared and ran one job analyzing the spills_1950_1959.csv dataset! But now we want to analyze the remaining six datasets. Luckily, HTCondor is very helpful when it comes to rapidly queueing many small jobs!
To do so, we will update our submit file to use the queue <variable> from <list> syntax. But before we do this, we need to create a list of the files we want to queue a job for:
ls data > list_of_datasets.txt
cat list_of_datasets.txt
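Assuming the seven files follow the naming pattern we saw in the data/ directory (one file per decade), the list should contain one filename per line:

spills_1950_1959.csv
spills_1960_1969.csv
spills_1970_1979.csv
spills_1980_1989.csv
spills_1990_1999.csv
spills_2000_2009.csv
spills_2010_2019.csv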
Great! Now we have a list of the files we want analyzed, where each file is on its own separate line.
Step 5A: Update the Submit File to Queue a Job for Each Dataset
Now, let's modify the queue line of our submit file to use the new queue syntax. For this, we can choose almost any variable name; for simplicity, let's choose dataset, so that we have queue dataset from list_of_datasets.txt.
We can then call this new variable, dataset, elsewhere in our submit file by wrapping it with $(), like so: $(dataset).
Our updated submit file might look like this:
cat many-R.submit
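As before, the command above prints the real file. A sketch of how it might look, with the same illustrative values as earlier and $(dataset) substituted wherever a specific dataset name appeared, is:

# Sketch of a many-job submit file (illustrative values, not the repository's exact many-R.submit).
+SingularityImage = "/cvmfs/singularity.opensciencegrid.org/opensciencegrid/osgvo-r:3.5.0"

executable = spill_calculation.r
arguments = $(dataset)
transfer_input_files = data/$(dataset)

log = spills.log
error = output/$(dataset).err
output = output/$(dataset).out

request_cpus = 1
request_memory = 1GB
request_disk = 1GB

# Queue one job per line of list_of_datasets.txt
queue dataset from list_of_datasets.txt

Naming the standard output files output/$(dataset).out would produce one output/<dataset>.csv.out file per job, which matches the files we read back in Step 5C.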
Step 5B: Submit Many Jobs
Now we can submit our new submit file using condor_submit again:
condor_submit many-R.submit
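This time the confirmation should report seven jobs; condor_submit prints one dot per queued job, and the cluster number is again a placeholder:

Submitting job(s).......
7 job(s) submitted to cluster 1235.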
Notice that we have now queued 7 jobs using one submit file!
Step 5C: Analysis Completed!
We can check on the status of our 7 jobs using condor_q:
condor_q
Once our jobs are done, we can also review our output files:
cat output/*.csv.out
In a few minutes, we were able to take our R script and run several jobs to analyze all of our real-world data. Congratulations!