DataPlunger

DataPlunger is a prototype ETL (extract, transform, load) toolchain.

The goal is a modular package that extracts data from multiple backing stores, applies an arbitrary number of transformation steps to those records, and loads the final output into a new format.

A workflow, or processing pipeline, is defined via a JSON configuration file containing the following information:

  • Connection information for the source data to be processed.
  • Processing steps to be applied to individual records extracted from that source.

Source code for this project can be found at: https://github.com/mattmakesmaps/DataPlunger

Install Instructions

# Create a virtualenv (requires virtualenvwrapper)
$ mkvirtualenv dp_dev_test
(dp_dev_test)$ cd /path/to/DataPlunger
# Install in development mode (sym-links the package into site-packages)
(dp_dev_test)$ python setup.py develop

Configuration

Processing pipelines are described using a JSON configuration file.
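
As a rough illustration, such a file might look like the sketch below. The key names and processor names are illustrative assumptions, not the package's actual schema.

{
    "connection": {
        "type": "csv",
        "path": "/data/source_records.csv"
    },
    "processing_steps": [
        {"processor": "ChangeCase", "case": "upper"},
        {"processor": "WriteToScreen"}
    ]
}

Each pipeline pairs one connection with an ordered list of processing steps; records extracted via the connection flow through the steps in order.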

Main Modules

The DataPlunger package is broken down into three main modules: dataplunger.core, dataplunger.processors, and dataplunger.readers.

Core contains configuration and control code.

Processors perform actions on a collection of records.

Readers are responsible for creating a connection to a backing datasource and returning an iterable that yields one record at a time from that datasource.
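
The sketch below illustrates this division of labor in Python. The class and method names (CsvReader, UpperCaseProcessor, process) are hypothetical illustrations of the reader/processor contract, not the package's actual API.

import csv

class CsvReader(object):
    """Connects to a CSV backing store; iterating yields one record per row."""
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path) as f:
            for row in csv.DictReader(f):
                yield row  # a single record from the datasource

class UpperCaseProcessor(object):
    """Performs an action on a collection of records: upper-cases every value."""
    def process(self, records):
        for record in records:
            yield dict((k, v.upper()) for k, v in record.items())

# Records flow from the reader through each processing step in turn:
# records = UpperCaseProcessor().process(CsvReader("input.csv"))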

Test Coverage

Unit tests currently cover the processors and readers modules.
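
Assuming the tests follow a conventional unittest layout (an assumption; the repository may use a different runner), they might be run from the project root with, for example:

(dp_dev_test)$ python -m unittest discover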
