Introduction
CSV Analyzer is a simple, scriptable Python tool for plotting comma-separated value (CSV) data. It is aimed primarily at timeseries and scatter plots. You can create multiple plots from a single reading of a CSV file, which greatly speeds up many workflows involving very large data sets. In addition, plots can be highlighted based on selection criteria, defined either as "pseudo-SQL" selections or as numpy conditions.
Background
CSV data is parsed into a dictionary of arrays. Each array is generated from a column of CSV data, indexed by its header label. The Python "csv" module is used by default, but for the rare edge case where it fails on extremely large files, a raw file I/O parsing mode is available. In newer releases of Python, parsing large files (on the order of 100MB of text or more) seems to be less of an issue than it used to be. See the methods in the class "BigCSVReader".
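For illustration, loading a file with headers "t" and "v" yields a structure along these lines (a sketch; the exact value types depend on the parsing code):

# hypothetical result of loading a CSV with headers "t" and "v":
# each requested header maps to a list of that column's values
data = {
    "t": [0.0, 0.016, 0.033],
    "v": [12.5, 14.1, 13.8],
}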
def get_csv_data(self, filename, columns, xaxis_name, rowstart, rowend, rawmode=False):
    '''parse the raw CSV data from the source file using raw file I/O or csv module (default)'''
    for col in columns:
        self.data[col] = []
    row_count = 0
    row_num = 0
    # first pass: count the rows so bounds can be computed up front
    with open(filename, 'r') as csvfile:
        row_count = sum(1 for row in csvfile)
    if rowstart != rowend:
        row_count = min(row_count, rowend - rowstart)
    # second pass: parse with the csv module, or split raw lines in rawmode
    with open(filename, 'r') as csvfile:
        if not rawmode:
            reader = csv.reader(csvfile)
            for row in reader:
                if not self.__process_row_data(row, row_num, columns, xaxis_name, rowstart, rowend, row_count):
                    break
                row_num += 1
        else:
            lines = (line.rstrip() for line in csvfile)
            for rawrow in lines:
                row_array = rawrow.split(',')
                if not self.__process_row_data(row_array, row_num, columns, xaxis_name, rowstart, rowend, row_count):
                    break
                row_num += 1
    print("CSV loading complete")
    return self.data, self.xaxis
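The per-row work is delegated to the private helper "__process_row_data", which is not shown above. A minimal sketch of what such a helper might look like, assuming purely numeric columns (the unused row_count parameter is presumably for progress reporting in the real implementation; see the repository for the actual method):

def __process_row_data(self, row_array, row_num, columns, xaxis_name, rowstart, rowend, row_count):
    '''hypothetical sketch: append one row of values, return False to stop reading'''
    if row_num == 0:
        # header row: record the index of each requested column
        self._col_index = {name: row_array.index(name) for name in columns}
        return True
    if row_num < rowstart:
        return True                        # before the requested window
    if rowend != 0 and row_num > rowend:
        return False                       # past the window, stop reading
    for name in columns:
        self.data[name].append(float(row_array[self._col_index[name]]))
    self.xaxis = self.data.get(xaxis_name, [])
    return True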
When a CSV file is designated for loading by calling the script with the "--sessionstart" switch, a pickle file of the data dictionary is cached for subsequent loading and use, making reloading on the next call extremely fast. Subsequent script calls can then use the "--sessioncontinue" switch to load from that pickle file. The color palettes chosen for each header are also saved. Care is taken to "lock" the file in the case of multi-process access. See the methods in the class "CSVAnalyzer".
def restore_data(self):
    print('restoring session data')
    while os.path.exists('csvsession.pickle.lock'):
        print("waiting for lock on pickle file")
        time.sleep(2)
    with open("csvsession.pickle", "rb") as f:
        self.dict_data = pickle.load(f)
        self.x_axis = pickle.load(f)
        self.dict_colors = pickle.load(f)
        self.color_palette = pickle.load(f)
def serialize_pickle(self):
    lockfile = open('csvsession.pickle.lock', 'w+')
    lockfile.close()
    with open("csvsession.pickle", "wb") as f:
        pickle.dump(self.dict_data, f)
        pickle.dump(self.x_axis, f)
        pickle.dump(self.dict_colors, f)
        pickle.dump(self.color_palette, f)
    if os.path.exists('csvsession.pickle.lock'):
        os.remove('csvsession.pickle.lock')
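Note that this lock is advisory: the writer creates the lock file, and readers poll for its existence before loading. If stricter multi-process safety were ever needed, the creation and the existence check could be collapsed into one atomic step. A minimal sketch using os.open with O_CREAT | O_EXCL (an alternative illustration, not what the tool currently does):

import os
import time

def acquire_lock(path='csvsession.pickle.lock', poll_seconds=2):
    '''create the lock file atomically, waiting until it can be created'''
    while True:
        try:
            # O_CREAT | O_EXCL fails if the file already exists,
            # so the check and the creation cannot race
            fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            os.close(fd)
            return
        except FileExistsError:
            time.sleep(poll_seconds)

def release_lock(path='csvsession.pickle.lock'):
    if os.path.exists(path):
        os.remove(path)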
Highlighting and Selection
Critical events can be highlighted in a plot via the "filter" switch. This allows one to write pseudo-SQL style selections based on the CSV header labels. For instance, given the CSV headers t,x,y,v, where t is time and defined as the x-axis for a timeseries, all the time points where the velocity v exceeded a particular value on a timeseries plot of "x vs t" may be highlighted with the pseudo-SQL "filter":
"SELECT x WHERE v > 150"
"SELECT" items will be drawn in solid lines, other items will turn dotted. Alternatively, the precise numpy filter code for highlighted timestamp regions can be passed since it is simply a wrapper around calling "exec" on numpy arrays:
"numpy.where(numpy.array(dict_data[\"v\"]) > 150)"
The pseudo-SQL is simply translated to the appropriate numpy syntax and the final filtering code is wrapped into the run-time execution call of:
"res = " + filterstring
The timeseries indexes of the returned data are marked for highlighting in the output plot. Matplotlib is used for generating all plots.
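The translation itself is straightforward string manipulation. A minimal sketch of the idea, using a hypothetical translate_filter helper and assuming simple "SELECT col WHERE col OP value" expressions (the real translation code lives in the repository):

import re

def translate_filter(filterstring):
    '''hypothetical sketch: turn "SELECT x WHERE v > 150" into the
    equivalent numpy.where(...) expression string'''
    m = re.match(r'SELECT\s+(\w+)\s+WHERE\s+(\w+)\s*([<>=!]+)\s*(\S+)',
                 filterstring, re.IGNORECASE)
    if not m:
        return filterstring                # assume raw numpy code was passed
    selected, col, op, value = m.groups()
    op = '==' if op == '=' else op         # SQL equality to Python equality
    return 'numpy.where(numpy.array(dict_data["%s"]) %s %s)' % (col, op, value)

# the result is then wrapped and executed as described above:
#   exec("res = " + translate_filter(expression))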
Usage
usage: csv_analyzer.py [-h] [-f FILE] [-x X_COL_NAME] [-r STARTROW]
                       [-e ENDROW] [-t expression] [-i TITLE] [-s] [-c] [-m]
                       [--scatter] [--colorbyplot]
                       COL_NAME [COL_NAME ...]

Plot collection of variables from a csv file.

positional arguments:
  COL_NAME              column name(s) of the plot items

optional arguments:
  -h, --help            show this help message and exit
  -f FILE, --file FILE  CSV file to plot
  -x X_COL_NAME, --xaxis X_COL_NAME
                        column name of x-axis. Omission assumes first column
                        name is x-axis
  -r STARTROW, --rowstart STARTROW
                        row start number
  -e ENDROW, --rowend ENDROW
                        row end number (0 to end of file)
  -t expression, --filter expression
                        filtering expression
  -i TITLE, --title TITLE
                        title of plot
  -s, --sessionstart    starts a new session so we only load data & assign
                        colors once
  -c, --sessioncontinue
                        continues an existing session so we only load data &
                        assign colors once
  -m, --terminate       closes immediately after data load and session save,
                        no visual plotting
  --scatter             create scatter plots from pairs of header names
  --colorbyplot         keep plot color scheme consistent by plot order
Real Examples
Given a CSV file named "path.csv" with columns t, x, y, v, offx, offy
Timeseries plot of "v vs t". Note that the x-axis "t" is the first header passed:
./csv_analyzer.py t v -f path.csv --title "Mouse Speed"
Highlight x-axis timestamps where v > 150 and draw x as solid, other lines dotted (pseudo-SQL):
./csv_analyzer.py t x v -f path.csv --filter "SELECT x WHERE v > 150" --title "x WHERE v > 150"
Highlight x-axis timestamps where t > 1.5 (numpy where):
./csv_analyzer.py t x v -f path.csv --filter "numpy.where(numpy.array(dict_data[\"t\"]) > 1.50)" --title "numpy.where(numpy.array(dict_data[\"t\"]) > 1.50"
Show a scatter plot of x vs y (path plot):
./csv_analyzer.py x y -f path.csv --scatter --title "Path Tracking"
Compare 2 scatter plots:
./csv_analyzer.py x y offx offy -f path.csv --scatter --title "Path Tracking Compare"
You can generate multiple plots from the same data while loading the CSV file only once by using the "--sessionstart" switch on the first call and "--sessioncontinue" on subsequent calls. Intermediate calls to the script will load a temporary pickle file to save processing time. See the "update_plots.sh" script in the test directory for a working example, and the hypothetical sequence below.
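For instance, a hypothetical sequence using the column names from the examples above. The first call loads the CSV and saves the session without plotting; the later calls reuse the cached pickle:

./csv_analyzer.py t v -f path.csv --sessionstart --terminate
./csv_analyzer.py t v -f path.csv --sessioncontinue --title "Mouse Speed"
./csv_analyzer.py t x v -f path.csv --sessioncontinue --filter "SELECT x WHERE v > 150" --title "x WHERE v > 150"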
Give it a Try
A script to generate path and speed tracking data from your mouse movements exists in the "test" directory: "generate_path_data.py". Use the "duration" switch to set the collection time in seconds. The default output file name is "path.csv", as used in the examples above. In my testing, 1 minute of data collection generates around 30-40MB of CSV text.
A good example of making multiple calls to csv_analyzer while loading the raw CSV data only once also appears in the test directory; see the bash script "update_plots.sh". Note the little bash trick that traps SIGINT so that Ctrl-C automatically closes all the open matplotlib windows.
Visit the project page on GitHub: https://github.com/e1d1s1/csv_analyzer