Announcement
MDArray-jCSV (jCSV for short) is the first and only (as far as I know) multidimensional
CSV reader. Multidimensional? Yes… jCSV can read multidimensional data, sometimes
also known as “panel data”.
From Wikipedia: “In statistics and econometrics, the term panel data refers to
multi-dimensional data frequently involving measurements over time. Panel data contain
observations of multiple phenomena obtained over multiple time periods for the same firms
or individuals. In biostatistics, the term longitudinal data is often used instead,
wherein a subject or cluster constitutes a panel member or individual in a longitudinal
study.” jCSV makes this definition a bit less strict as it can read observations of
multiple phenomena obtained over multiple time periods for multiple firms or individuals.
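To make the idea of panel data concrete, here is a minimal sketch that uses only Ruby's standard CSV library, not jCSV. The column names (firm, year, revenue, employees) and all figures are invented for illustration: firm and year play the role of dimensions, while revenue and employees are the observed phenomena.

```ruby
require "csv"

# Hypothetical panel data: two dimensions (firm, year) and two measured
# phenomena (revenue, employees). All names and figures are invented.
panel_text = <<~DATA
  firm,year,revenue,employees
  Acme,2014,10.5,120
  Acme,2015,11.2,130
  Globex,2014,7.9,80
  Globex,2015,8.4,85
DATA

# Build a map of maps keyed by the dimensions:
# panel[firm][year] holds the measurements for that firm in that year.
panel = Hash.new { |hash, firm| hash[firm] = {} }

# converters: :numeric turns "2014" into an Integer and "10.5" into a Float.
CSV.parse(panel_text, headers: true, converters: :numeric) do |row|
  panel[row["firm"]][row["year"]] = {
    revenue: row["revenue"],
    employees: row["employees"]
  }
end

puts panel["Globex"][2015][:revenue]
```

Each firm appears in several rows, once per time period, which is what distinguishes panel data from a flat table with one row per subject.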
Other than reading panel data, jCSV is also a very powerful and feature-packed CSV
reader. The CSV file format is a common format for data exchange between diverse
applications. It is widely used; however, surprisingly, there aren’t that many good
libraries for CSV reading and writing. In Ruby there are a couple of well-known libraries to accomplish this task. First, there is the standard Ruby CSV that comes
with any Ruby implementation. This library, according to Smarter CSV (
https://github.com/tilo/smarter_csv) has the following limitations:
“Ruby’s CSV library’s API is pretty old, and it’s processing of CSV-files returning
Arrays of Arrays feels ‘very close to the metal’. The output is not
easy to use - especially not if you want to create database records from it. Another
shortcoming is that Ruby’s CSV library does not have good support for huge CSV-files,
e.g. there is no support for ‘chunking’ and/or parallel processing of the CSV-content (e.g. with Resque or Sidekiq).”
In order to eliminate those restrictions, smarter_csv was developed. Although it does
remove those restrictions, it also removes support for Arrays of Arrays. Although such a
format is really ‘very close to the metal’, in some cases this is actually what is needed.
This format is less memory intensive than the ‘hash’ approach from smarter_csv and it might
make it easier to put the data in a simple table. When reading scientific data, such as a
matrix or multidimensional array, it might also be better to remove headers and
informational columns and read the actual data as just a plain array.
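The trade-off between the two shapes can be seen with Ruby's standard CSV library alone. The following is only a sketch with invented sample data; it shows neither jCSV nor smarter_csv calls.

```ruby
require "csv"

# Invented sample data: one informational column (name) and two numeric ones.
text = <<~DATA
  name,height,weight
  Ana,1.65,58
  Bruno,1.80,77
DATA

# Array of Arrays: compact, but the header is just the first row and
# every cell is addressed by position.
rows = CSV.parse(text)

# Array of Hashes: self-describing, but the keys are repeated in every record.
records = CSV.parse(text, headers: true).map(&:to_h)

# Plain numeric matrix: drop the header row and the informational first
# column, then convert the remaining cells to floats.
matrix = rows.drop(1).map { |row| row.drop(1).map(&:to_f) }

p rows.first
p records.first
p matrix
```

The hash form is convenient for creating database records, while the positional form is closer to what a matrix or multidimensional array expects.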
jCSV was developed to be the “ultimate” CSV reader (and soon writer). It tries to
merge all the good features of the standard Ruby CSV library, smarter_csv, and other CSV
libraries from other languages. jCSV is based on Super CSV (
http://super-csv.github.io/super-csv/index.html), a Java CSV library. According to
the Super CSV web page, its motivation is to be “the foremost, fastest, and most
programmer-friendly, free CSV package for Java”. jCSV’s motivation is to bring this
view to the Ruby world and, since we are in Ruby, to make it even easier and more
programmer-friendly.
jCSV reading features are:
* Reads data as lists (Array of Arrays);
* Reads data as maps (Array of hashes);
* Reads multidimensional (panel) data to lists or hashes;
* Reads multidimensional data to vectors, i.e., a multidimensional array (MDArray);
* When reading panel data, uses dimensions as keys, allowing random access to any row in the data by use of the key. For instance, if first_name and last_name are dimensions, then one can access data by doing data["John.Smith"];
* Reads panel data with the ‘critbit’ reader, which automagically sorts keys and allows for prefix retrieval of data, i.e., doing data.each("D") { } will retrieve all names starting with “D” and pass them to the block;
* When reading panel data, organize data as maps of maps (deep_map);
* Able to read files with or without headers;
* When the file has no headers, allows the user to provide headers so that reading can be done either as array of arrays, array of hashes, or multidimensional with keys;
* Able to process large CSV-files;
* Able to chunk the input from the CSV file to avoid loading the whole CSV file into memory;
* Able to treat the file as an enumerator, so that reading more data can be done at any time during the script execution; reading can be stopped and restarted at any time;
* Able to pass a block to the read method, so data from the CSV file can be directly processed (e.g. Resque.enqueue);
* Allows a bit more flexible input format, where comments are possible, and col_sep, row_sep can be set to any character sequence, including control characters;
* Able to re-map CSV “column names” to Hash-keys of your choice (normalization);
* Able to ignore “columns” in the input (delete columns);
* Able to change the columns’ order when reading to an Array of Arrays;
* Provides dozens of filters/validators for the data;
* Filters can be chained, allowing for complex data manipulation. For instance, suppose one column can hold either empty values or dollar values. If it is a dollar value, then it should be a float. Consider that the data is stored using a Brazilian locale format, i.e., the decimal separator is ‘,’ and the grouping separator is ‘.’ (the reverse of the US locale). Suppose also that the value should be in the range from US$ 1.000,00 to US$ 2.000,00, and finally suppose that we actually want to see this data not as dollar amounts but as Brazilian Reais, converted at the day’s current rate. Then this sequence of filters should do it:
Jcsv.optional >> Jcsv.float(locale: Brazil) >> Jcsv.in_range(1000, 2000) >> Jcsv.dynamic { |value| rate * value }
* Dates can be parsed by any of the Ruby DateTime formats: httpdate, iso8601, jd, etc.;
* Can filter data by any of the Ruby String methods: :[], :reverse, :gsub, :prepend, etc.
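To illustrate what the chained-filter item in the list above describes, here is a rough plain-Ruby sketch of the same four steps built from lambdas and composed with Ruby's own Proc#>>. It is not jCSV's implementation: the lambda names, the error handling, and the exchange rate of 3.25 are all assumptions made for the example.

```ruby
# Plain-Ruby stand-ins for the four steps; none of this is jCSV code.

# Step 1: an empty cell becomes nil, and later steps pass nil through.
optional = ->(cell) { cell.nil? || cell.strip.empty? ? nil : cell.strip }

# Step 2: parse a Brazilian-formatted number, where '.' groups digits
# and ',' marks the decimals ("1.500,00" becomes 1500.0).
brazilian_float = ->(cell) { cell && Float(cell.delete(".").tr(",", ".")) }

# Step 3: build a range validator that raises on out-of-range values.
in_range = lambda do |min, max|
  lambda do |value|
    next value if value.nil?
    unless value.between?(min, max)
      raise ArgumentError, "#{value} is outside #{min}..#{max}"
    end
    value
  end
end

# Step 4: convert dollars to Reais with a hypothetical fixed rate.
rate = 3.25
to_reais = ->(value) { value && value * rate }

# Proc#>> runs the left-hand callable first and feeds its result to the right.
pipeline = optional >> brazilian_float >> in_range.call(1000, 2000) >> to_reais

p pipeline.call("1.500,00")
p pipeline.call("")
```

Because every step accepts and returns a single value, steps can be added, removed, or reordered without touching the others, which is the main appeal of chaining.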
Tutorial:
https://github.com/rbotafogo/jCSV/wiki/Tutorial