• [LINK] New system cleans messy data tables automatically

    From Computer Nerd Kev@21:1/5 to All on Sat Jul 24 02:27:31 2021
    New system cleans messy data tables automatically
    By Rachel Paiste, May 11, 2021
    - https://news.mit.edu/2021/system-cleans-messy-data-tables-automatically-0511

    "MIT researchers have created a new system that automatically cleans
    dirty data: the typos, duplicates, missing values, misspellings,
    and inconsistencies dreaded by data analysts, data engineers, and
    data scientists. The system, called PClean, is the latest in a
    series of domain-specific probabilistic programming languages
    written by researchers at the Probabilistic Computing Project that
    aim to simplify and automate the development of AI applications
    (others include one for 3D perception via inverse graphics and
    another for modeling time series and databases).

    According to surveys conducted by Anaconda and Figure Eight, data
    cleaning can take a quarter of a data scientist's time. Automating
    the task is challenging because different datasets require
    different types of cleaning, and common-sense judgment calls about
    objects in the world are often needed (e.g., which of several
    cities called Beverly Hills someone lives in). PClean provides
    generic common-sense models for these kinds of judgment calls that
    can be customized to specific databases and types of errors.

    PClean uses a knowledge-based approach to automate the data
    cleaning process: Users encode background knowledge about the
    database and what sorts of issues might appear. Take, for instance,
    the problem of cleaning state names in a database of apartment
    listings. What if someone said they lived in Beverly Hills but left
    the state column empty? Though there is a well-known Beverly Hills
    in California, there's also one in Florida, Missouri, and Texas, and
    there's a neighborhood of Baltimore known as Beverly Hills. How can
    you know in which the person lives? This is where PClean's
    expressive scripting language comes in. Users can give PClean
    background knowledge about the domain and about how data might be
    corrupted. PClean combines this knowledge via common-sense
    probabilistic reasoning to come up with the answer. For example,
    given additional knowledge about typical rents, PClean infers the
    correct Beverly Hills is in California because of the high cost of
    rent where the respondent lives." ...
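
To make the Beverly Hills example concrete, here is a minimal Python sketch of the kind of reasoning described: combine a prior over candidate states with a model of typical rents, then pick the most probable state. This is not PClean's scripting language or API. The names CANDIDATES, posterior_state and impute_state, the uniform priors, and every rent figure are made-up assumptions for illustration only.

```python
import math

# Hypothetical candidates for a listing in "Beverly Hills" with no state given.
# Priors and rent parameters are invented for illustration, not real data.
CANDIDATES = {
    "CA": {"prior": 0.2, "mean_rent": 4500.0, "sd": 1200.0},
    "FL": {"prior": 0.2, "mean_rent": 1500.0, "sd": 400.0},
    "MO": {"prior": 0.2, "mean_rent": 900.0, "sd": 300.0},
    "TX": {"prior": 0.2, "mean_rent": 1100.0, "sd": 350.0},
    "MD": {"prior": 0.2, "mean_rent": 1300.0, "sd": 400.0},
}


def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and std dev."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def posterior_state(rent):
    """Bayes' rule: P(state | rent) is proportional to P(state) * P(rent | state)."""
    unnormalized = {
        state: p["prior"] * normal_pdf(rent, p["mean_rent"], p["sd"])
        for state, p in CANDIDATES.items()
    }
    total = sum(unnormalized.values())
    return {state: w / total for state, w in unnormalized.items()}


def impute_state(rent):
    """Fill in the missing state with the most probable candidate."""
    post = posterior_state(rent)
    return max(post, key=post.get)


if __name__ == "__main__":
    for rent in (5000.0, 900.0):
        print(rent, impute_state(rent), posterior_state(rent))
```

With these invented numbers, a monthly rent of 5000 resolves to CA and a rent of 900 resolves to MO. The real system presumably reasons jointly over many columns and error types rather than one observed value, so treat this only as a toy version of the idea.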

    --
    __ __
    #_ < |\| |< _#

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)