• Preprocessing not-quite-fixed-width file before parsing

    From Loris Bennett@21:1/5 to All on Wed Nov 23 17:00:44 2022
    Hi,

    I am using pandas to parse a file with the following structure:

    Name fileset type KB quota limit in_doubt grace | files quota limit in_doubt grace
    shortname sharedhome USR 14097664 524288000 545259520 0 none | 107110 0 0 0 none
    gracedays sharedhome USR 774858944 524288000 775946240 0 5 days | 1115717 0 0 0 none
    nametoolong sharedhome USR 27418496 524288000 545259520 0 none | 11581 0 0 0 none

    I was initially able to use

    df = pandas.read_csv(file_name, delimiter=r"\s+")

    because all the values for 'grace' were 'none'. Now, however,
    non-"none" values have appeared and this fails.

    I can't use

    pandas.read_fwf

    even with an explicit colspec, because names in the first column that
    are too long for their column displace the rest of the data to the
    right.

    The report which produces the file could in fact also generate a
    properly delimited CSV file, but I have a lot of historical data in the
    readable but poorly parsable format above that I need to deal with.

    If I were doing something similar in the shell, I would just pipe the
    file through sed or something to replace '5 days' with, say '5_days'.
    How could I achieve a similar sort of preprocessing in Python, ideally
    without having to generate a lot of temporary files?

    Cheers,

    Loris

    --
    This signature is currently under construction.

  • From Stefan Ram@21:1/5 to Loris Bennett on Wed Nov 23 16:50:02 2022
    "Loris Bennett" <loris.bennett@fu-berlin.de> writes:
    If I were doing something similar in the shell, I would just pipe the
    file through sed or something to replace '5 days' with, say '5_days'.
    How could I achieve a similar sort of preprocessing in Python, ideally >without having to generate a lot of temporary files?

    I do not have Pandas installed, but the documentation of
    "read_csv" says it accepts any object with a "read" method
    (such as "StringIO").

    So maybe you could write the results of your preprocessor
    into a StringIO object, invoke "seek(0)" on it, and pass it
    to "read_csv".

  • From Weatherby,Gerard@21:1/5 to All on Wed Nov 23 20:38:49 2022
    This seems to work. I’m inferring the | is present in each line that needs to be fixed.

    import pandas
    import logging


    class Wrapper:
        """Wrap file to fix up data"""

        def __init__(self, filename):
            self.filename = filename

        def __enter__(self):
            self.fh = open(self.filename, 'r')
            return self

        def __exit__(self, exc_type, exc_val, exc_tb):
            self.fh.close()

        def __iter__(self):
            """This is required by pandas for some reason, even though it doesn't seem to be called"""
            raise ValueError("Unsupported operation")

        def read(self, n: int):
            """Read data. Replace spaces in the 'grace' field before the | with underscores"""
            try:
                # pandas calls read() repeatedly; returning one (possibly
                # fixed) line per call works, and '' at end-of-file ends it
                data = self.fh.readline()
                ht = data.split('|', maxsplit=1)  # split on the first | only
                if len(ht) == 2:
                    head, tail = ht
                    hparts = head.split(maxsplit=7)
                    assert len(hparts) == 8
                    if ' ' in hparts[7].strip():
                        hparts[7] = hparts[7].strip().replace(' ', '_')
                        fixed_data = f"{' '.join(hparts)} | {tail}"
                        return fixed_data
                return data
            except Exception:
                logging.exception("read")


    logging.basicConfig()
    with Wrapper('data.txt') as f:
        df = pandas.read_csv(f, delimiter=r"\s+")
    print(df)


  • From Weatherby,Gerard@21:1/5 to All on Wed Nov 23 20:40:59 2022
    Oops. Forgot to reformat the file before sending. Here's the proper PEP-8 version (at least according to PyCharm):

    import pandas
    import logging


    class Wrapper:
        """Wrap file to fix up data"""

        def __init__(self, filename):
            self.filename = filename

        def __enter__(self):
            self.fh = open(self.filename, 'r')
            return self

        def __exit__(self, exc_type, exc_val, exc_tb):
            self.fh.close()

        def __iter__(self):
            """This is required by pandas for some reason, even though it doesn't seem to be called"""
            raise ValueError("Unsupported operation")

        def read(self, n: int):
            """Read data. Replace spaces in the 'grace' field before the | with underscores"""
            try:
                # pandas calls read() repeatedly; returning one (possibly
                # fixed) line per call works, and '' at end-of-file ends it
                data = self.fh.readline()
                ht = data.split('|', maxsplit=1)  # split on the first | only
                if len(ht) == 2:
                    head, tail = ht
                    hparts = head.split(maxsplit=7)
                    assert len(hparts) == 8
                    if ' ' in hparts[7].strip():
                        hparts[7] = hparts[7].strip().replace(' ', '_')
                        fixed_data = f"{' '.join(hparts)} | {tail}"
                        return fixed_data
                return data
            except Exception:
                logging.exception("read")


    logging.basicConfig()
    with Wrapper('data.txt') as f:
        df = pandas.read_csv(f, delimiter=r"\s+")
    print(df)


  • From Thomas Passin@21:1/5 to Loris Bennett on Wed Nov 23 16:36:53 2022
    On 11/23/2022 11:00 AM, Loris Bennett wrote:
    > [...]
    >
    > If I were doing something similar in the shell, I would just pipe the
    > file through sed or something to replace '5 days' with, say '5_days'.
    > How could I achieve a similar sort of preprocessing in Python, ideally
    > without having to generate a lot of temporary files?

    This is really annoying, isn't it? A space-separated line with spaces
    in data entries. If the example you give is typical, I don't think
    there is a general solution. If you know there are only certain values
    like that, then you could do a search-and-replace for them in Python
    just like the example you gave for "5 days".

    If you know that the field that might contain entries with spaces is the
    same one, e.g., the one just before the "|" marker, you could make use
    of that. But it could be tricky.

    I don't know how many files like this you will need to process, nor how
    many rows they might contain. If I were to tackle this job, I would
    probably do some quality checking first. Using this example file,
    figure out how many fields there are supposed to be. First, split the
    file into lines:

    with open("filename") as f:
        lines = f.readlines()

    # Check space-separated fields defined in first row:
    fields = lines[0].split()
    num_fields = len(fields)
    print(num_fields)  # e.g., 100

    # Find lines that have the wrong number of fields
    bad_lines = []
    for line in lines:
        fields = line.split()
        if len(fields) != num_fields:
            bad_lines.append(line)

    print(len(bad_lines))

    # Inspect a sample
    for line in bad_lines[:10]:
        print(line)

    This will give you an idea of how many problem lines there are, and if
    they can all be fixed by a simple replacement. If they can and this is
    the only file you need to handle, just fix it up and run it. I would
    replace the spaces with tabs or commas. Splitting a line on spaces
    (split()) takes care of the issue of having a variable number of spaces,
    so that's easy enough.
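
    For instance, split() with no argument treats any run of whitespace as
    a single separator, while split(' ') does not:

    >>> "shortname  sharedhome   USR".split()
    ['shortname', 'sharedhome', 'USR']
    >>> "shortname  sharedhome   USR".split(' ')
    ['shortname', '', 'sharedhome', '', '', 'USR']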

    If you will need to handle many files, and you can automate the fixes - possibly with a regular expression - then you should preprocess each
    file before giving it to pandas. Something like this:

    import io

    def fix_line(line):
        """Test line for field errors and fix errors if any."""
        fixed = line  # .... (actual fix-up logic goes here)
        return fixed

    # For each file
    with open("filename") as f:
        lines = f.readlines()

    fixed_lines = []
    for line in lines:
        fixed = fix_line(line)
        fields = fixed.split()
        tabified = '\t'.join(fields)  # Could be done by fix_line()
        fixed_lines.append(tabified + '\n')  # writelines() adds no newlines

    # Now use a StringIO to feed the data to pandas
    # From memory, some details may not be right
    f = io.StringIO()
    f.writelines(fixed_lines)
    f.seek(0)  # rewind before reading

    # Give f to pandas as if it were an external file
    # ...

  • From Loris Bennett@21:1/5 to Thomas Passin on Thu Nov 24 15:06:17 2022
    Thomas Passin <list1@tompassin.net> writes:

    > [...]
    >
    > If you will need to handle many files, and you can automate the fixes
    > - possibly with a regular expression - then you should preprocess each
    > file before giving it to pandas.


    Thanks to both Gerard and Thomas for the pointer to StringIO. I ended up
    just reading the file line-by-line, using a regex to replace

    '<n> <units> |'

    with

    '<n><units> |'

    and writing the new lines to a StringIO, which I then passed to pandas.read_csv.
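
    Roughly like this; the exact pattern and file name are reconstructions
    of the replacement described above, not the code actually used:

    import io
    import re
    import pandas

    buf = io.StringIO()
    with open('report.txt') as fh:  # stand-in for the real report file
        for line in fh:
            # turn e.g. '5 days |' into '5days |'; the unit list is a guess
            buf.write(re.sub(r'(\d+) (days|hours|minutes) \|', r'\1\2 |', line))
    buf.seek(0)  # rewind before handing the buffer to pandas
    df = pandas.read_csv(buf, delimiter=r"\s+")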

    The wrapper approach looks interesting, but it looks like I need to read
    up more on context managers before adding that to my own code, otherwise
    I may not understand it in a month's time.

    Cheers,

    Loris

    --
    This signature is currently under construction.

  • From Thomas Passin@21:1/5 to Loris Bennett on Thu Nov 24 20:59:39 2022
    On 11/24/2022 9:06 AM, Loris Bennett wrote:
    > [...]
    >
    > Thanks to both Gerard and Thomas for the pointer to StringIO. I ended
    > up just reading the file line-by-line, using a regex to replace
    > '<n> <units> |' with '<n><units> |' and writing the new lines to a
    > StringIO, which I then passed to pandas.read_csv.

    Glad that StringIO works for you here. I seem to remember that after
    writing to the StringIO, you have to seek to 0 before reading from it.
    Better check that point!
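
    A quick check bears that out:

    import io

    buf = io.StringIO()
    buf.write("gracedays sharedhome USR\n")
    print(repr(buf.read()))  # '' -- the position is at the end after writing
    buf.seek(0)              # rewind to the start
    print(repr(buf.read()))  # now the whole line comes back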

  • From Loris Bennett@21:1/5 to Thomas Passin on Fri Nov 25 08:09:52 2022
    Thomas Passin <list1@tompassin.net> writes:

    > [...]
    >
    > Glad that StringIO works for you here. I seem to remember that after
    > writing to the StringIO, you have to seek to 0 before reading from it.
    > Better check that point!

    Stefan (whom I forgot to thank: my apologies, Stefan!) mentioned seek(0),
    so fortunately I was primed when I read the Python documentation for
    StringIO.

    Cheers,

    Loris

    --
    This signature is currently under construction.
