New system cleans messy data tables automatically
By Rachel Paiste, May 11, 2021
-
https://news.mit.edu/2021/system-cleans-messy-data-tables-automatically-0511
"MIT researchers have created a new system that automatically cleans
dirty data: the typos, duplicates, missing values, misspellings,
and inconsistencies dreaded by data analysts, data engineers, and
data scientists. The system, called PClean, is the latest in a
series of domain-specific probabilistic programming languages
written by researchers at the Probabilistic Computing Project that
aim to simplify and automate the development of AI applications
(others include one for 3D perception via inverse graphics and
another for modeling time series and databases).
According to surveys conducted by Anaconda and Figure Eight, data
cleaning can take a quarter of a data scientist's time. Automating
the task is challenging because different datasets require
different types of cleaning, and common-sense judgment calls about
objects in the world are often needed (e.g., which of several
cities called Beverly Hills someone lives in). PClean provides
generic common-sense models for these kinds of judgment calls that
can be customized to specific databases and types of errors.
PClean uses a knowledge-based approach to automate the data
cleaning process: Users encode background knowledge about the
database and what sorts of issues might appear. Take, for instance,
the problem of cleaning state names in a database of apartment
listings. What if someone said they lived in Beverly Hills but left
the state column empty? Though there is a well-known Beverly Hills
in California, there's also one in Florida, Missouri, and Texas, and
there's a neighborhood of Baltimore known as Beverly Hills. How can
you know in which the person lives? This is where PClean's
expressive scripting language comes in. Users can give PClean
background knowledge about the domain and about how data might be
corrupted. PClean combines this knowledge via common-sense
probabilistic reasoning to come up with the answer. For example,
given additional knowledge about typical rents, PClean infers the
correct Beverly Hills is in California because of the high cost of
rent where the respondent lives." ...
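
The Beverly Hills reasoning quoted above can be sketched as a small Bayesian update. This is only an illustration in plain Python, not PClean's scripting language or API. The candidate table, the priors, and the rent figures are all made-up assumptions, and the names posterior_state and fill_missing_state are hypothetical helpers.

```python
import math

# Hypothetical candidates for "Beverly Hills". Priors and rent figures are
# invented for illustration only; they are not real statistics.
CANDIDATES = {
    "CA": {"prior": 0.40, "mean_rent": 4500.0, "std": 1200.0},
    "FL": {"prior": 0.15, "mean_rent": 1500.0, "std": 400.0},
    "MO": {"prior": 0.15, "mean_rent": 900.0, "std": 300.0},
    "TX": {"prior": 0.15, "mean_rent": 1200.0, "std": 350.0},
    "MD": {"prior": 0.15, "mean_rent": 1400.0, "std": 400.0},
}


def normal_pdf(x, mean, std):
    """Density of a normal distribution, used here as a toy rent model."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def posterior_state(rent, candidates=CANDIDATES):
    """Return P(state | rent) via Bayes' rule: prior times likelihood, normalized."""
    unnormalized = {
        state: c["prior"] * normal_pdf(rent, c["mean_rent"], c["std"])
        for state, c in candidates.items()
    }
    total = sum(unnormalized.values())
    return {state: weight / total for state, weight in unnormalized.items()}


def fill_missing_state(record, candidates=CANDIDATES):
    """Fill an empty 'state' field with the most probable candidate."""
    if record.get("state"):
        return record  # nothing to repair
    post = posterior_state(record["rent"], candidates)
    best = max(post, key=post.get)
    return {**record, "state": best}


if __name__ == "__main__":
    listing = {"city": "Beverly Hills", "state": "", "rent": 5000.0}
    print(fill_missing_state(listing))
```

Under these assumed numbers, a high rent makes California by far the most probable candidate, while a low rent shifts the probability toward the other states. A real system would learn such distributions from the data rather than hard-coding them.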