Forum: >>> Magnum BBS <<<

Need a little help

From Rider@21:1/5 to All on Fri May 14 00:43:19 2021

Hi experts,

I need a little perl help. Here is the requirement (at unix shell).

I have a text file, entries.txt, with lines like the ones below (the actual file has around 100 entries). Each line has an email ID followed by a user ID, separated by a tab. Here are the first three lines:
========================
abc@google.com abc1
cdef@yahoo.com cde
xyz@gmail.com xyz2
=========================
Now the Perl script should go through a big dump of data (a file called text.xml) and replace each email address with its corresponding user ID (for example, every abc@google.com in the dump should be replaced by abc1, and so on). Can someone help me with the code?

Now the Perl script should be like this:

read entries.txt file;
separate each line (split) into two entries
loop through the below dump (whatever is below __DATA__)
Replace the first email entry with the second user id
Write all the updated data to a new file, updated.xml

__DATA__ (the below dump is in fact a file text.xml)
Hello world abc@google.com this is line 1
This is the second line with a lot text cdef@yahoo.com and much more
Here is the third line xyz@gmail.com and lot of stuff here
One more line with abc@google.com

Now the output file, updated.xml should contain the following dump:
============
Hello world abc1 this is line 1
This is the second line with a lot text cde and much more
Here is the third line xyz2 and lot of stuff here
One more line with abc2
=============

Thanks in advance..
Ryder

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Ben Bacarisse@21:1/5 to Rider on Fri May 14 12:44:11 2021

Rider <clearguy02@yahoo.com> writes:

Hi experts,

I need a little perl help. Here is the requirement (at unix shell).

I have a text file, entries.txt with the following lines (just giving
a few lines here, but there are around 100 entries in the actual
file). Each line has an email ID followed by a user ID (both
separated by a tab). I am just giving first three lines here.
========================
abc@google.com abc1
cdef@yahoo.com cde
xyz@gmail.com xyz2
=========================
Now the perl script should parse through a big dump of data (a file
called text.xml) and replace the first email with the second entry
(example: all abc@google.com entries in the dump should be replaced by
abc1 and so on and so forth). Can someone help me with the code?

What have you tried? What bits are causing you trouble? If you just
want someone to write it for you, you may get lucky, but most people
prefer to help with learning rather than coding for free.

Now the Perl script should be like this:

read entries.txt file;
separate each line (split) in to two entries
loop through the below dump (whatever is below __DATA__)
Replace the first email entry with the second user id
Write all the updated data to a new file, updated.xml

Why must it be done like that? This very narrow prescription makes it
sound like coursework.

__DATA__ (the below dump is in fact a file text.xml)
Hello world abc@google.com this is line 1
This is the second line with a lot text cdef@yahoo.com and much more
Here is the third line xyz@gmail.com and lot of stuff here
One more line with abc@google.com

Now the output file, updated.xml should contain the following dump:
============
Hello world abc1 this is line 1
This is the second line with a lot text cde and much more
Here is the third line xyz2 and lot of stuff here
One more line with abc2
=============

I think that last abc2 should be abc1.

--
Ben.


From Otto J. Makela@21:1/5 to Ben Bacarisse on Fri May 14 19:24:32 2021

Ben Bacarisse <ben.usenet@bsb.me.uk> wrote:

What have you tried? What bits are causing you trouble? If you just
want someone to write it for you, you may get lucky, but most people
prefer to help with learning rather than coding for free.

Indeed. The quick-and-dirty approach I've done in this kind of stuff is
to collect the strings & replacements into a hash, then make a regexp

$r='\\b('.join('|',map {...} sort {...} keys %myhash).')\\b';

(with appropriate regexp quoting for the individual keys with map {},
selecting the sort {} to put long strings first), then do something like

s/$r/$myhash{$1}/goe

on our whole target string.
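
The hash-plus-alternation approach described above could be sketched roughly as follows. The variable names %myhash and $r come from the post; the sample entries and text are illustrative only, and the sort and quoting choices are one assumption about what the elided map {...} and sort {...} might contain. Note that /e is not needed when the replacement is a plain hash lookup, since the replacement part of s/// already interpolates.

```perl
use strict;
use warnings;

# Illustrative mapping; in the thread it would be read from entries.txt.
my %myhash = (
    'abc@google.com' => 'abc1',
    'cdef@yahoo.com' => 'cde',
    'xyz@gmail.com'  => 'xyz2',
);

# Longest keys first, so a key that is a prefix of another cannot win the
# alternation; quotemeta escapes regex metacharacters such as '.'.
my $alternation = join '|',
    map  { quotemeta($_) }
    sort { length($b) <=> length($a) or $a cmp $b }
    keys %myhash;
my $r = qr/\b($alternation)\b/;

my $text = "Hello world abc\@google.com this is line 1\n"
         . "One more line with abc\@google.com\n";

# One pass over the text; each match is looked up in the hash.
(my $updated = $text) =~ s/$r/$myhash{$1}/g;
print $updated;
```

Building the pattern once with qr// keeps the string handling in one place, which may address some of the "clunky" feeling, though it is still string concatenation underneath.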

I'm not terribly fond of using clunky string operations to build
regexps, and then there's the question of getting the regexp quoting
right. Is there some more elegant method people can think of?
--
/* * * Otto J. Makela <om@iki.fi> * * * * * * * * * */
/* Phone: +358 40 765 5772, ICBM: N 60 10' E 24 55' */
/* Mail: Mechelininkatu 26 B 27, FI-00100 Helsinki */
/* * * Computers Rule 01001111 01001011 * * * * * * */


From Wasell@21:1/5 to Rider on Fri May 14 20:22:42 2021

On Fri, 14 May 2021 00:43:19 -0700 (PDT), in article <b37cb72e-7787-48f5-b158-01992be9e789n@googlegroups.com>, Rider wrote:

[...]

Can I assume this is homework? Maybe we can have some fun with
it...

I'm not much of a Perl golfer, but here's an attempt:

#!/usr/bin/perl
{local$/;$s=<DATA>};@ARGV='entries.txt';for(map{[split' ']}<>){$s
=~s/\Q$_->[0]/$_->[1]/g};open$g,'>updated.xml';print$g $s;
__DATA__
Hello world abc@google.com this is line 1
This is the second line with a lot text cdef@yahoo.com and much more
Here is the third line xyz@gmail.com and lot of stuff here
One more line with abc@google.com

It seems to follow the specification.


From Ben Bacarisse@21:1/5 to Otto J. Makela on Fri May 14 20:08:26 2021

om@iki.fi (Otto J. Makela) writes:

Ben Bacarisse <ben.usenet@bsb.me.uk> wrote:

What have you tried? What bits are causing you trouble? If you just
want someone to write it for you, you may get lucky, but most people
prefer to help with learning rather than coding for free.

Indeed. The quick-and-dirty approach I've done in this kind of stuff is
to collect the strings & replacements into a hash, then make a regexp

$r='\\b('.join('|',map {...} sort {...} keys %myhash).')\\b';

(with appropriate regexp quoting for the individual keys with map {}, selecting the sort {} to put long strings first), then do something like

s/$r/$myhash{$1}/goe

on our whole target string.

I'm not terribly fond of using clunky string operations to build
regexps, and then there's the question of getting the regexp quoting
right. Is there some more elegant method people can think of?

Not elegant, no, but I think I'd slurp the input and then loop over the substitutions:

while (<$subs>) {
    chomp;
    my ($k, $s) = split /\t/;
    $content =~ s/\b\Q$k\E\b/$s/g;
}

\Q and \E ensure the quoting is correct. And I stole your \b...\b
because I'd forgotten about that! I expect the OP wants it.
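
A small sketch of why the \Q...\E quoting matters here: without it, the '.' in an email address is a regex wildcard. The sample key and text below are made up for illustration.

```perl
use strict;
use warnings;

my $key  = 'abc@google.com';
my $text = 'mail abc@google.com or abc@googleXcom';

# Without quoting, '.' in the key matches any character.
(my $unquoted = $text) =~ s/\b$key\b/abc1/g;

# With \Q...\E the key is matched literally.
(my $quoted = $text) =~ s/\b\Q$key\E\b/abc1/g;

print "$unquoted\n";
print "$quoted\n";
```

In the unquoted case both addresses are replaced; in the quoted case only the literal abc@google.com is.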
--
Ben.


From Rainer Weikusat@21:1/5 to Wasell on Fri May 14 20:02:34 2021

Wasell <usenet2020@wasell.eu> writes:

[...]

Can I assume this is homework? Maybe we can have some fun with
it...

I'm not much of a Perl golfer, but here's an attempt:

#!/usr/bin/perl
{local$/;$s=<DATA>};@ARGV='entries.txt';for(map{[split' ']}<>){$s
=~s/\Q$_->[0]/$_->[1]/g};open$g,'>updated.xml';print$g $s;
__DATA__
Hello world abc@google.com this is line 1
This is the second line with a lot text cdef@yahoo.com and much more
Here is the third line xyz@gmail.com and lot of stuff here
One more line with abc@google.com

It seems to follow the specification.

%m=map{split}`cat entries.txt`;
for(<DATA>){/\G\S+/gc&&(print($m{$&}//$&),redo);/\G\s+/gc&&(print($&),redo)}
__DATA__
Hello world abc@google.com this is line 1
This is the second line with a lot text cdef@yahoo.com and much more
Here is the third line xyz@gmail.com and lot of stuff here
One more line with abc@google.com

:-)


From Rainer Weikusat@21:1/5 to Rainer Weikusat on Fri May 14 20:10:03 2021

Rainer Weikusat <rweikusat@talktalk.net> writes:

Wasell <usenet2020@wasell.eu> writes:

[...]

%m=map{split}`cat entries.txt`;
for(<DATA>){/\G\S+/gc&&(print($m{$&}//$&),redo);/\G\s+/gc&&(print($&),redo)}
__DATA__
Hello world abc@google.com this is line 1
This is the second line with a lot text cdef@yahoo.com and much more
Here is the third line xyz@gmail.com and lot of stuff here
One more line with abc@google.com

:-)

Actually, that's much too complicated:

%m=map{split}`cat entries.txt`;
s|\S+|$m{$&}//$&|ge,print for<DATA>;
__DATA__
Hello world abc@google.com this is line 1
This is the second line with a lot text cdef@yahoo.com and much more
Here is the third line xyz@gmail.com and lot of stuff here
One more line with abc@google.com


From Rainer Weikusat@21:1/5 to Ben Bacarisse on Fri May 14 21:56:30 2021

Ben Bacarisse <ben.usenet@bsb.me.uk> writes:

Ben Bacarisse <ben.usenet@bsb.me.uk> wrote:

[...]

Not elegant, no, but I think I'd slurp the input and then loop over the substitutions:

while (<$subs>) {
chomp;
my ($k, $s) = split /\t/;
$content =~ s/\b\Q$k\E\b/$s/g;
}

A pretty awful algorithm: The runtime will be proportional to the number
of substitutions times the length of the text, ie, quadratic.

More defensively written alternate suggestion:

--------
my %subs;

{
    my $fh;
    open($fh, '<', 'entries.txt') or die("open: $!");
    %subs = map { split } <$fh>;
}

for (<DATA>) {
    s|\S+|$subs{$&} // $&|ge;
    print;
}

__DATA__
Hello world abc@google.com this is line 1
This is the second line with a lot text cdef@yahoo.com and much more
Here is the third line xyz@gmail.com and lot of stuff here
One more line with abc@google.com
-------

That's a linear algorithm as it makes just one pass through the input
data.

NB: I didn't benchmark this and the O-difference doesn't necessarily
mean it'll be faster in practice for realistic amounts of input data. It
also won't replace results of prior replacements which may or may not be desired.
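
For completeness, a minimal sketch of the same one-pass lookup applied to the files named in the original post (entries.txt, text.xml, updated.xml) instead of __DATA__. The sub name replace_ids is hypothetical, and the sketch keeps the same assumption as the code above: an address is only replaced when it is a whole whitespace-delimited token. The // operator needs Perl 5.10 or later.

```perl
use strict;
use warnings;

# Hypothetical helper: read the email => id map, then copy $in_file to
# $out_file, replacing every whitespace-delimited token found in the map.
sub replace_ids {
    my ($entries_file, $in_file, $out_file) = @_;

    open(my $efh, '<', $entries_file) or die("open $entries_file: $!");
    my %subs = map { split } <$efh>;
    close($efh);

    open(my $in,  '<', $in_file)  or die("open $in_file: $!");
    open(my $out, '>', $out_file) or die("open $out_file: $!");
    while (my $line = <$in>) {
        # One pass per line: look each token up, keep it if unknown.
        $line =~ s|\S+|$subs{$&} // $&|ge;
        print {$out} $line;
    }
    close($out) or die("close $out_file: $!");
    close($in);
    return;
}

# Only run against the real files when they are present.
replace_ids('entries.txt', 'text.xml', 'updated.xml')
    if -e 'entries.txt' && -e 'text.xml';
```

Reading text.xml line by line avoids holding the whole dump in memory, which may or may not matter for the sizes involved.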


From Rainer Weikusat@21:1/5 to Ben Bacarisse on Sat May 15 00:13:20 2021

Ben Bacarisse <ben.usenet@bsb.me.uk> writes:

Rainer Weikusat <rweikusat@talktalk.net> writes:

Ben Bacarisse <ben.usenet@bsb.me.uk> writes:

Ben Bacarisse <ben.usenet@bsb.me.uk> wrote:

[...]

Not elegant, no, but I think I'd slurp the input and then loop over the
substitutions:

while (<$subs>) {
chomp;
my ($k, $s) = split /\t/;
$content =~ s/\b\Q$k\E\b/$s/g;
}

A pretty awful algorithm: The runtime will be proportional to the number
of substitutions times the length of the text, ie, quadratic.

I don't think that's technically quadratic, but I know what you mean.
It's pretty awful. This looked like a throw-away task, so I didn't care about the O(mn) complexity.

The first time in my life I can do an actual mathematical proof: There
are two sets involved here with lengths n and m. The total running time
is proportional to n * m. There are two cases here:

1. n == m. In this case n * m = n * n which is obviously quadratic.

2. n < m or m < n; without loss of generality, n < m is assumed. In this
case, n * m = n * n * (m / n) [, m / n > 1 because n * m > n * n]. Hence,
it's quadratic as well.

:-))

More defensively written alternate suggestion:

--------
my %subs;

{
my $fh;
open($fh, '<', 'entries.txt') or die("open: $!");
%subs = map { split } <$fh>;

(The OP had tab separated pairs)

split without arguments splits $_ on \s+. That's going to cover
tab-separated text.
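
A short sketch of that behaviour, using two made-up lines in the format the original post describes (email, tab, user ID, newline):

```perl
use strict;
use warnings;

# Illustrative lines as they would come from reading entries.txt.
my @lines = ("abc\@google.com\tabc1\n", "cdef\@yahoo.com\tcde\n");

# Bare split works on $_, splits on runs of whitespace, and drops the
# trailing empty field left by the newline, so no chomp is needed.
my %subs = map { split } @lines;

print "$_ => $subs{$_}\n" for sort keys %subs;
```

One caveat worth keeping in mind: if a user ID could itself contain a space, splitting on whitespace would produce an odd number of fields and scramble the key/value pairing, so splitting on /\t/ after chomp would be the safer choice in that case.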

}

for (<DATA>) {
s|\S+|$subs{$&} // $&|ge;

This is likely to miss some expected cases in XML data since, say, <addr mail="abc@goole.com"> won't match abc@goole.com.

It's supposed to work for the provided example. It's also going to miss addresses at the end of a sentence, eg

His email address was wookie@chewbacca.com.
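
A minimal sketch of that limitation, assuming a single made-up entry. It contrasts the whole-token lookup with a pattern built from the keys themselves; the second variant is only one possible alternative, not something proposed in the posts above.

```perl
use strict;
use warnings;

my %subs = ('wookie@chewbacca.com' => 'wookie');
my $line = 'His email address was wookie@chewbacca.com.';

# Token lookup: the token is "wookie@chewbacca.com." (with the full stop),
# which is not a key, so the line is left unchanged.
(my $by_token = $line) =~ s|\S+|$subs{$&} // $&|ge;

# Alternation built from the keys: matches the address wherever it occurs.
my $alt = join '|',
    map  { quotemeta($_) }
    sort { length($b) <=> length($a) }
    keys %subs;
(my $by_key = $line) =~ s/($alt)/$subs{$1}/g;

print "$by_token\n";
print "$by_key\n";
```

The same effect applies to the XML attribute case mentioned above, where the token would include the surrounding mail="..." text.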


From Ben Bacarisse@21:1/5 to Rainer Weikusat on Fri May 14 23:29:47 2021

Rainer Weikusat <rweikusat@talktalk.net> writes:

Ben Bacarisse <ben.usenet@bsb.me.uk> writes:

Ben Bacarisse <ben.usenet@bsb.me.uk> wrote:

[...]

Not elegant, no, but I think I'd slurp the input and then loop over the
substitutions:

while (<$subs>) {
chomp;
my ($k, $s) = split /\t/;
$content =~ s/\b\Q$k\E\b/$s/g;
}

A pretty awful algorithm: The runtime will be proportional to the number
of substitutions times the length of the text, ie, quadratic.

I don't think that's technically quadratic, but I know what you mean.
It's pretty awful. This looked like a throw-away task, so I didn't care
about the O(mn) complexity.

More defensively written alternate suggestion:

--------
my %subs;

{
my $fh;
open($fh, '<', 'entries.txt') or die("open: $!");
%subs = map { split } <$fh>;

(The OP had tab separated pairs)

}

for (<DATA>) {
s|\S+|$subs{$&} // $&|ge;

This is likely to miss some expected cases in XML data since, say, <addr mail="abc@goole.com"> won't match abc@goole.com.

print;
}

__DATA__
Hello world abc@google.com this is line 1
This is the second line with a lot text cdef@yahoo.com and much more
Here is the third line xyz@gmail.com and lot of stuff here
One more line with abc@google.com
-------

That's a linear algorithm as it makes just one pass through the input
data.

NB: I didn't benchmark this and the O-difference doesn't necessarily
mean it'll be faster in practice for realistic amounts of input data. It
also won't replace results of prior replacements which may or may not be desired.

Yup. What's fast, or fast enough, is going to depend on a lot of
details. But, sure, as the number of search strings grows, looping over
them will eventually kill the performance.

--
Ben.


From Martin Vaeth@21:1/5 to Rainer Weikusat on Sat May 15 06:25:45 2021

Rainer Weikusat <rweikusat@talktalk.net> wrote:

Ben Bacarisse <ben.usenet@bsb.me.uk> writes:

Rainer Weikusat <rweikusat@talktalk.net> writes:

Ben Bacarisse <ben.usenet@bsb.me.uk> writes:

Ben Bacarisse <ben.usenet@bsb.me.uk> wrote:

[...]

Not elegant, no, but I think I'd slurp the input and then loop over the
substitutions:

while (<$subs>) {
chomp;
my ($k, $s) = split /\t/;
$content =~ s/\b\Q$k\E\b/$s/g;
}

A pretty awful algorithm: The runtime will be proportional to the number
of substitutions times the length of the text, ie, quadratic.

I don't think that's technically quadratic, but I know what you mean.
It's pretty awful. This looked like a throw-away task, so I didn't care
about the O(mn) complexity.

The first time in my life I can do an actual mathematical proof: There
are two sets involved here with lengths n and m. The total running time
is proportional to n * m. There are two cases here:

1. n == m. In this case n * m = n * n which is obviously quadratic.

So the worst case running time is quadratic in the input length,
and you are already done. (Usually, O(.) refers to the worst case.)

This is simultaneously the "average case" running time if one defines
the averaging in a natural way, but this is a bit harder to see.
(And one can argue about which definition of averaging is really natural
in this example; that depends on the planned use case.)

2. n < m or m < n; without loss of generality, n < m is assumed. In this case, n * m = n * n * (m / n) [, m / n > 1 because n * m > n * n]. Hence, it's quadratic as well.

If you mean to say here that even in the "best case" you have quadratic
running time, you are wrong:
In the "best data" case you have n=1 or m=1 (or at least one of them bounded
by a constant), even though the input length n+m can be arbitrarily large.
So the "best case" is only linear running time.


From Otto J. Makela@21:1/5 to Ben Bacarisse on Sat May 15 16:58:15 2021

Ben Bacarisse <ben.usenet@bsb.me.uk> wrote:

Not elegant, no, but I think I'd slurp the input and then loop over
the substitutions:

while (<$subs>) {
chomp;
my ($k, $s) = split /\t/;
$content =~ s/\b\Q$k\E\b/$s/g;
}

\Q and \E ensure the quoting is correct. And I stole your \b...\b
because I'd forgotten about that! I expect the OP wants it.

I believe your algorithm might fail if the replaced strings can be
substrings of each other, depending on the order they are presented?

OP's question of course didn't have any such cases, but since we're
talking algorithms here, it'd be nice if also the edge cases worked.
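
To illustrate the concern, here is a small sketch with two made-up entries where one key is a prefix of the other. The helper name apply_in_order is hypothetical; it simply applies the quoted loop to a list of pairs in the given order.

```perl
use strict;
use warnings;

# Apply each [key, replacement] pair in turn, as in the quoted loop.
sub apply_in_order {
    my ($text, @pairs) = @_;
    for my $pair (@pairs) {
        my ($k, $s) = @$pair;
        $text =~ s/\b\Q$k\E\b/$s/g;
    }
    return $text;
}

# Illustrative entries: the first key is a prefix of the second.
my @short_first = (['abc@google.com', 'abc1'], ['abc@google.com.au', 'abc2']);
my @long_first  = reverse @short_first;

my $text = 'mail abc@google.com.au please';

my $short_result = apply_in_order($text, @short_first);
my $long_result  = apply_in_order($text, @long_first);

print "$short_result\n";   # mail abc1.au please
print "$long_result\n";    # mail abc2 please
```

The \b after the shorter key does not help here, because the boundary between "m" and "." counts as a word boundary. Sorting the entries longest-first before looping is one way to avoid this particular case.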
--
/* * * Otto J. Makela <om@iki.fi> * * * * * * * * * */
/* Phone: +358 40 765 5772, ICBM: N 60 10' E 24 55' */
/* Mail: Mechelininkatu 26 B 27, FI-00100 Helsinki */
/* * * Computers Rule 01001111 01001011 * * * * * * */


From Ben Bacarisse@21:1/5 to Otto J. Makela on Sat May 15 16:07:30 2021

om@iki.fi (Otto J. Makela) writes:

Ben Bacarisse <ben.usenet@bsb.me.uk> wrote:

Not elegant, no, but I think I'd slurp the input and then loop over
the substitutions:

while (<$subs>) {
chomp;
my ($k, $s) = split /\t/;
$content =~ s/\b\Q$k\E\b/$s/g;
}

\Q and \E ensure the quoting is correct. And I stole your \b...\b
because I'd forgotten about that! I expect the OP wants it.

I believe your algorithm might fail if the replaced strings can be
substrings of each other, depending on the order they are presented?

Yes, but we don't even know if the \b...\b is correct so I think that's
too fine a point for the specific case.

OP's question of course didn't have any such cases, but since we're
talking algorithms here, it'd be nice if also the edge cases worked.

Sure, but as already pointed out, if we are talking algorithms I don't
think you'd want to do it this way. Mine was a quick-and-dirty,
get-it-done-now solution.

--
Ben.


From gamo@21:1/5 to All on Sun May 16 23:53:54 2021

On 14/5/21 at 20:22, Wasell wrote:

#!/usr/bin/perl
{local$/;$s=<DATA>};@ARGV='entries.txt';for(map{[split' ']}<>){$s
=~s/\Q$_->[0]/$_->[1]/g};open$g,'>updated.xml';print$g $s;
__DATA__

It's a mystery to me why you use split' '
instead of the more golf-like split"\t". Could you explain?
Thanks!

--
http://gamo.sdf-eu.org/
perl -E 'say "[U]ndo or [c]ontinue? (y/N) ";'


From gamo@21:1/5 to All on Mon May 17 03:50:04 2021

On 17/5/21 at 3:42, Randal L. Schwartz wrote:

"gamo" == gamo <gamo@telecable.es> writes:

gamo> It's a mystery to me why you use split' '
gamo> instead of the more golf-like split"\t". Could you explain?

One char in the string instead of two?

Oh, yes, sorry. I didn't know whether the goal was
about spacing or typing. Anyway, I think that
obfuscation can be done in any language,
and the possibility of being concise is not
a fault of the language, as I read it.

--
http://gamo.sdf-eu.org/
perl -E 'say "[W]ant a [m]isunderstood? (X/y) ";'


From Randal L. Schwartz@21:1/5 to All on Sun May 16 18:42:28 2021

"gamo" == gamo <gamo@telecable.es> writes:

gamo> It's a mystery to me why you use split' '
gamo> instead of the more golf-like split"\t". Could you explain?

One char in the string instead of two?
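
Beyond the character count, the two forms also behave differently on an unchomped line, which may matter for the golfed script since it never calls chomp. A minimal sketch with a made-up input line:

```perl
use strict;
use warnings;

my $line = "abc\@google.com\tabc1\n";   # an unchomped line, as read by <>

# split ' ' is the special awk-style mode: any run of whitespace separates
# fields, and the trailing newline does not end up in the last field.
my @by_space = split ' ', $line;

# split "\t" breaks on tabs only, so the newline stays on the last field.
my @by_tab = split "\t", $line;

printf "%d fields, last field has %d chars\n", scalar(@by_space), length $by_space[-1];
printf "%d fields, last field has %d chars\n", scalar(@by_tab),   length $by_tab[-1];
```

With split "\t" and no chomp, each replacement string would carry a newline into the output.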

--
Randal L. Schwartz - Stonehenge Consulting Services, Inc. - +1 503 777 0095
<merlyn@stonehenge.com> <URL:http://www.stonehenge.com/merlyn/>
Perl/Dart/Flutter consulting, Technical writing, Comedy, etc. etc.
Still trying to think of something clever for the fourth line of this .sig

