• [ANN] mdarray-jCSV Version 0.6.0

    From Rodrigo Botafogo@21:1/5 to All on Wed May 25 06:42:53 2016
    Announcement

    MDArray-jCSV (jCSV for short) is the first and only (as far as I know) multidimensional
    CSV reader. Multidimensional? Yes… jCSV can read multidimensional data, sometimes also
    known as “panel data”.

    From Wikipedia: “In statistics and econometrics, the term panel data refers to
    multi-dimensional data frequently involving measurements over time. Panel data contain
    observations of multiple phenomena obtained over multiple time periods for the same firms
    or individuals. In biostatistics, the term longitudinal data is often used instead,
    wherein a subject or cluster constitutes a panel member or individual in a longitudinal
    study.” jCSV makes this definition a bit less strict as it can read observations of
    multiple phenomena obtained over multiple time periods for multiple firms or individuals.
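
    To make the definition concrete, here is a minimal sketch of panel data built with Ruby's
    standard CSV library only; it is not jCSV's API. The column names (firm, year, revenue,
    employees), the figures, and the read_panel helper are all made up for illustration. The
    firm and year columns act as dimensions and the remaining columns are the observations.

```ruby
require "csv"

# Hypothetical panel: two firms observed over two years.
PANEL_TEXT = <<~DATA
  firm,year,revenue,employees
  Acme,2014,100.5,10
  Acme,2015,120.0,12
  Globex,2014,80.0,7
  Globex,2015,95.25,9
DATA

# Build a nested hash: panel[firm][year] => { measurement => value }.
# read_panel is an illustrative helper, not part of any library.
def read_panel(text, dimensions)
  panel = {}
  CSV.parse(text, headers: true, converters: :numeric).each do |row|
    record = row.to_h
    # Pull the dimension columns out of the record; what is left is the data.
    keys = dimensions.map { |dim| record.delete(dim) }
    # Walk (and create) one hash level per dimension except the last.
    leaf = keys[0...-1].inject(panel) { |level, key| level[key] ||= {} }
    leaf[keys.last] = record
  end
  panel
end

panel = read_panel(PANEL_TEXT, %w[firm year])
p panel["Acme"][2015]
```

    Each extra dimension simply adds one more level of nesting, which is roughly the shape
    that the word "multidimensional" refers to in this announcement.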

    Other than reading panel data, jCSV is also a very powerful and feature-packed CSV
    reader. The CSV file format is a common format for data exchange between diverse
    applications. It is widely used; however, surprisingly, there aren’t that many good
    libraries for CSV reading and writing. In Ruby there are a couple of well-known
    libraries to accomplish this task. First, there is the standard Ruby CSV that comes
    with any Ruby implementation. This library, according to Smarter CSV
    (https://github.com/tilo/smarter_csv), has the following limitations:

    “Ruby’s CSV library’s API is pretty old, and it’s processing of CSV-files returning
    Arrays of Arrays feels ‘very close to the metal’. The output is not
    easy to use - especially not if you want to create database records from it. Another
    shortcoming is that Ruby’s CSV library does not have good support for huge CSV-files,
    e.g. there is no support for ‘chunking’ and/or parallel processing of the CSV-content
    (e.g. with Resque or Sidekiq).”

    smarter_csv was developed to eliminate those restrictions. Although it does remove
    them, it also drops support for Arrays of Arrays. Although such a format is really
    ‘very close to the metal’, in some cases it is actually what is needed. This format
    is less memory intensive than the ‘hash’ approach of smarter_csv, and it might make it
    easier to put the data in a simple table. When reading scientific data, such as a matrix
    or multidimensional array, it might also be better to remove headers and informational
    columns and read the actual data as just a plain array.
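
    The difference between the two shapes can be sketched with the standard Ruby CSV library
    alone (this is not jCSV or smarter_csv code, and the sample data is invented):

```ruby
require "csv"

SAMPLE_TEXT = <<~DATA
  name,age
  Ana,31
  Bruno,27
DATA

# Array of Arrays: compact, positional access; the header is just the first row.
rows = CSV.parse(SAMPLE_TEXT)

# Array of Hashes: fields are named, but every record repeats the keys.
records = CSV.parse(SAMPLE_TEXT, headers: true).map(&:to_h)

# Plain data matrix: drop the header row and the informational (name) column.
matrix = rows.drop(1).map { |row| row.drop(1).map(&:to_i) }

p rows.first
p records.first
p matrix
```

    The hash form is more convenient for building records, while the array form keeps only
    the values, which is why it tends to suit numeric tables better.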

    jCSV was developed to be the “ultimate” CSV reader (and soon writer). It tries to
    merge all the good features of the standard Ruby CSV library, smarter_csv, and CSV
    libraries from other languages. jCSV is based on Super CSV
    (http://super-csv.github.io/super-csv/index.html), a Java CSV library. According to
    the Super CSV web page, the motivation “for Super CSV is to be
    the foremost, fastest, and most programmer-friendly, free CSV package for Java”. jCSV’s
    motivation is to bring this view to the Ruby world and, since we are in Ruby, make
    it even easier and more programmer-friendly.

    jCSV reading features are:

    * Reads data as lists (Array of Arrays);
    * Reads data as maps (Array of hashes);
    * Reads multidimensional (panel) data to lists or hashes;
    * Reads multidimensional data to vectors, i.e., a multidimensional array (MDArray);
    * When reading panel data, uses dimensions as keys, allowing random access to any row in the data by use of the key. For instance, if first_name, last_name are dimensions, then one can access data by doing data["John.Smith"];
    * Reads panel data with the ‘critbit’ reader, which automagically sorts keys and allows for prefix retrieval of data, i.e., doing data.each("D") { } will retrieve all names starting with “D” and give them to the block;
    * When reading panel data, organizes data as maps of maps (deep_map);
    * Able to read files with or without headers;
    * When the file has no headers, allows the user to provide headers so that reading can be done either as array of arrays, array of hashes, or multidimensional with keys;
    * Able to process large CSV-files;
    * Able to chunk the input from the CSV file to avoid loading the whole CSV file into memory;
    * Able to treat the file as an enumerator, so that more data can be read at any time during the script execution; reading can be stopped and restarted at any time;
    * Able to pass a block to the read method, so data from the CSV file can be directly processed (e.g. Resque.enqueue);
    * Allows a bit more flexible input format, where comments are possible, and col_sep, row_sep can be set to any character sequence, including control characters;
    * Able to re-map CSV “column names” to Hash-keys of your choice (normalization);
    * Able to ignore “columns” in the input (delete columns);
    * Able to change columns’ order, when reading to an Array of Arrays;
    * Provides dozens of filters/validators for the data;
    * Filters can be chained, allowing for complex data manipulation. For instance, suppose one column can have empty values or dollar values. If it is a dollar value, then it should be a float. Consider that the data is stored using a Brazilian locale format, i.e., the decimal separator is ‘,’ and the grouping separator is ‘.’ (the reverse of the US locale). Suppose also that the value should be in the range of US$ 1.000,00 to US$ 2.000,00, and finally suppose that we actually want to see this data not as dollar amounts but as Brazilian Reais, converted at the day’s current rate. Then this sequence of filters should do it:

    Jcsv.optional >> Jcsv.float(locale: Brazil) >> Jcsv.in_range(1000, 2000) >> Jcsv.dynamic { |value| rate * value }

    * Dates can be parsed by any of Ruby’s DateTime formats: httpdate, iso8601, jd, etc.;
    * Can filter data by any of the Ruby String methods: :[], :reverse, :gsub, :prepend, etc.
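
    To illustrate the idea behind the chained filters in the list above, here is a small
    plain-Ruby sketch. It is not jCSV's implementation and does not use its API: the Filter
    class, the optional, brazilian_float, in_range and dynamic names, and the exchange rate
    are all hypothetical stand-ins that only mimic how a `>>` chain might pass a value along.

```ruby
# A hypothetical chainable filter; not jCSV's implementation.
class Filter
  def initialize(&block)
    @block = block
  end

  def call(value)
    @block.call(value)
  end

  # f >> g builds a filter that runs f, then g on f's result.
  # A nil result short-circuits the rest of the chain.
  def >>(other)
    Filter.new do |value|
      result = call(value)
      result.nil? ? nil : other.call(result)
    end
  end
end

# Empty cells become nil, which stops the chain.
optional = Filter.new { |v| v.nil? || v.strip.empty? ? nil : v }

# Brazilian-style number: '.' groups thousands, ',' is the decimal mark.
brazilian_float = Filter.new { |v| Float(v.delete(".").tr(",", ".")) }

# Validator: passes the value through unchanged or raises.
def in_range(min, max)
  Filter.new do |v|
    raise ArgumentError, "#{v} is outside #{min}..#{max}" unless (min..max).cover?(v)
    v
  end
end

# Wraps an arbitrary block as a filter.
def dynamic(&block)
  Filter.new(&block)
end

rate = 3.5 # assumed USD -> BRL rate, for illustration only
to_reais = optional >> brazilian_float >> in_range(1000, 2000) >> dynamic { |v| rate * v }

p to_reais.call("1.500,00") # => 5250.0
p to_reais.call("")         # => nil
```

    Keeping each step as a tiny object with a single call method is what makes the chain
    easy to extend: a new validator or converter only has to accept a value and return one.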

    Tutorial: https://github.com/rbotafogo/jCSV/wiki/Tutorial
