Re: [Gossip] tidying up mbox files

2006-08-13 Thread Jeff Breidenbach

Thanks!

Also, one of the people with slightly-broken mbox files suggested this:

perl -i -p -e '/^From / && !/\d\d:\d\d:\d\d \d\d\d\d$/ && s/(.+)/>$1/' A_*

I'm continually amazed by both Perl and the people whose brains are
capable of understanding it. :)
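
For anyone who wants the one-liner spelled out, here is a rough Python sketch of the same test. The function name escape_stray_from is made up for illustration; the logic is only my reading of the Perl above: a line starting with "From " that does not end in "hh:mm:ss yyyy" gets a ">" put in front of it, and everything else passes through.

```python
import re

# Same check as the one-liner: a genuine separator ends in "hh:mm:ss yyyy".
TIMESTAMP_AT_END = re.compile(r"\d\d:\d\d:\d\d \d\d\d\d$")


def escape_stray_from(line):
    """Prefix '>' to a 'From ' line that lacks the trailing timestamp.

    escape_stray_from is a hypothetical name, not part of any tool.
    """
    text = line.rstrip("\n")
    if text.startswith("From ") and not TIMESTAMP_AT_END.search(text):
        return ">" + line
    return line
```

Unlike perl -i, this sketch does not edit files in place; it only shows the per-line decision.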

-Jeff

___
Discussion list for The Mail Archive
[email protected]
http://jab.org/cgi-bin/mailman/listinfo/gossip


Re: [Gossip] tidying up mbox files

2006-08-13 Thread Earl Hood
On August 12, 2006 at 13:28, "Jeff Breidenbach" wrote:

> The majority of mbox files I've been handed do not escape "From" like
> they should, and this causes problems on M-A's end; inc from the nmh
> suite gets unhappy and starts trashing messages. Are there any
> recommendations for an mbox2mbox converter that will clean up
> these wayward almost-but-not-quite-mbox files?

Depends on how the bogus "From" lines are structured.

In mhonarc, the MSGSEP resource can be set to provide a stricter
check, which generally gets around most cases of unescaped "From "s.

For your case, a simple Perl script can be used to do what you
want.  Maybe something like:

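  #!/usr/bin/perl
  # Strict separator: "From", address, weekday, month, day, hh:mm:ss, year.
  my $msgsep =
    qr/^From\s+(?:"[^"]+"@\S+|\S+)\s+\S+\s+\S+\s+\d+\s+\d+:\d+:\d+\s+\d+/;
  while (<>) {
    # Pass ordinary lines and genuine separators through unchanged.
    if (!/^From / || /$msgsep/) {
      print STDOUT $_;
      next;
    }
    # A "From " line that fails the strict check is body text: escape it.
    print STDOUT '>'.$_;
  }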

If you call the above "escapefrom", invoke like the following:

  escapefrom mbox > escaped-mbox

Then run a diff to see how well it worked.
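
If a plain diff is too noisy, the same check can be sketched in a few lines of Python. This assumes the filter only ever prefixes ">" and never adds or drops lines, so the two files can be compared line by line. The function name escaped_lines is hypothetical.

```python
def escaped_lines(original, escaped):
    """Return (line_number, text) for each line the filter changed.

    Assumes both inputs have the same number of lines, which holds
    when the filter only prefixes '>' to some of them.
    """
    changes = []
    for number, (before, after) in enumerate(zip(original, escaped), start=1):
        if before != after:
            changes.append((number, after.rstrip("\n")))
    return changes
```

To use it on the files from the example, read "mbox" and "escaped-mbox" into lists with readlines() (opening them with errors="replace" is probably wise, since old mail is rarely clean text) and look over the returned list by eye.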

The main limitation is when message bodies contain unescaped lines
that look exactly like mbox "From " lines.  In this case, it requires
a human to determine whether the line indicates a new message or is
part of an existing one.
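
One way to narrow down what the human has to look at, sketched in Python under an assumption that is mine and not from the script above: a well-formed mbox writer leaves a blank line before each separator, so a separator-shaped line that directly follows a non-blank line is a likely candidate for body text. The pattern is the strict one from the script; the name separators_for_review is hypothetical.

```python
import re

# The strict separator pattern from the script, written as a Python regex.
MSGSEP = re.compile(
    r'^From\s+(?:"[^"]+"@\S+|\S+)\s+\S+\s+\S+\s+\d+\s+\d+:\d+:\d+\s+\d+'
)


def separators_for_review(lines):
    """List separator-shaped lines that do not follow a blank line.

    Heuristic only: it flags candidates for a person to check and
    decides nothing by itself.
    """
    suspects = []
    previous = ""
    for number, line in enumerate(lines, start=1):
        if number > 1 and MSGSEP.match(line) and previous.strip() != "":
            suspects.append((number, line.rstrip("\n")))
        previous = line
    return suspects
```

A quoted message pasted into a body after a blank line will still slip past this, so it reduces the manual work rather than removing it.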

If your MDA creates a "From " line that is unique to your site, you
can modify the above regex to just match that.

--ewh


[Gossip] tidying up mbox files

2006-08-12 Thread Jeff Breidenbach

Hi all,

When someone wants to import a bunch of messages into an archive,
they provide an mbox file. The mbox file format is simple, but has at
least one gotcha.

  In order to avoid misinterpretation of lines in message bodies which
  begin with the four characters "From", followed by a space character,
  the mail delivery agent must quote any occurrence of "From " at the
  start of a body line.
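
As a minimal illustration of that rule, here is a Python sketch of what a delivery agent is expected to do to a message body before appending it to the mbox. It assumes the usual ">" prefix as the quoting character; the function name quote_body is hypothetical.

```python
def quote_body(body_lines):
    """Prefix '>' to every body line that begins with 'From '.

    Applies to body lines only; the real "From " separator line that
    starts each message must be left alone.
    """
    return [
        ">" + line if line.startswith("From ") else line
        for line in body_lines
    ]
```

For example, a body line "From now on..." is stored as ">From now on...", while "Fromage" is left untouched because the fifth character is not a space.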

The majority of mbox files I've been handed do not escape "From" like
they should, and this causes problems on M-A's end; inc from the nmh
suite gets unhappy and starts trashing messages. Are there any
recommendations for an mbox2mbox converter that will clean up
these wayward almost-but-not-quite-mbox files?

Jeff
