head	1.13;
access;
symbols;
locks; strict;
comment	@# @;


1.13
date	2016.01.26.02.22.57;	author dyoung;	state Exp;
branches;
next	1.12;
commitid	TcUPVdKNFlwdWnSy;

1.12
date	2015.12.03.03.29.01;	author dyoung;	state Exp;
branches;
next	1.11;
commitid	KERKKFm8qHpI2sLy;

1.11
date	2015.12.02.23.39.51;	author dyoung;	state Exp;
branches;
next	1.10;
commitid	TzsqL4yA7qqYHqLy;

1.10
date	2015.09.23.19.32.34;	author dyoung;	state Exp;
branches;
next	1.9;
commitid	exic1zXtILEHEpCy;

1.9
date	2015.09.22.01.12.09;	author dyoung;	state Exp;
branches;
next	1.8;
commitid	tbaMkOcWCFfgBbCy;

1.8
date	2015.09.14.02.58.17;	author dyoung;	state Exp;
branches;
next	1.7;
commitid	O988RZ6i630AraBy;

1.7
date	2015.09.11.02.12.57;	author dyoung;	state Exp;
branches;
next	1.6;
commitid	edBwCEt5dcE2iMAy;

1.6
date	2015.09.11.01.50.42;	author dyoung;	state Exp;
branches;
next	1.5;
commitid	ptM4knOhaM6kaMAy;

1.5
date	2015.09.02.22.45.47;	author dyoung;	state Exp;
branches;
next	1.4;
commitid	dclRljt7Q5uToJzy;

1.4
date	2015.09.02.22.43.17;	author dyoung;	state Exp;
branches;
next	1.3;
commitid	EwjsMg946Wp5oJzy;

1.3
date	2015.08.31.01.58.23;	author dyoung;	state Exp;
branches;
next	1.2;
commitid	a1XZsFEXv03Vymzy;

1.2
date	2015.08.22.05.08.48;	author dyoung;	state Exp;
branches;
next	1.1;
commitid	hWHunK8RVFy5Udyy;

1.1
date	2015.08.10.21.10.59;	author dyoung;	state Exp;
branches;
next	;
commitid	dALcDv1uBdaPALwy;


desc
@@


1.13
log
@Bring NetBSD CVS up-to-date with my private Subversion repository.

In dt/core.c, delete dead scratch_t members L1 and L2.  Pull the gaps
count out of subcell_t and add to scratch_t members G[0..1] and Gclocc
to hold the gaps count when/where I use it.  This gives me back a bit of
speed, but I'm still not back to the previous best: affine gap penalties
are expensive.

Add a dummy 'test' target to it/Makefile.  Really need to come up with
some tests for IT.  Some version of the DT tests oughta do the trick.

Thanks, Thomas Klausner (wiz@@NetBSD.org), for the heads-up about some
Makefile problems.  Fix them: recurse on 'test' target.  Make 'test'
target depend on dependall.

Fix/improve usage messages.

Disable debug assertions for now.

Make 'make cleandir' remove *.gcov.

In dt/sym.c, don't dereference a NULL pointer trying to emit a symbol
that is too long.  XXX ought to revisit this.

Add the hierarchical-logging library that I developed for CUWiN and
start to use it to switch debugging printfs on and off.

Repair the 'tags' rule, making it write tags to .CURDIR instead of
.OBJDIR.

Add two new scanners, one for whitespace up to and including end-of-line
(EOL), and one for whitespace other than whitespace at EOL.  Bring
expected test results up-to-date with these beneficial changes to
whitespace handling.

Add mk/helpers.mk that provides utilities PRINTOBJDIR and PRINTOBJDIROF.

Use relative paths like 'dt/core.h' for ARFE header files.  Add
CPPFLAGS+=-I$(.CURDIR)/.. to all of the makefiles to make this work.

Add DPINCS to makefiles that were missing it.

Factor out some initialization and argument parsing; put it into
arfe_parse_options().

Improve file_to_slice() error reporting: print the filename concerned.

Add a utility function, cloccs_are_equal(), that returns true if and
only if the string comprising the left-hand clocc is equal to the string
comprising the right-hand clocc.  Use it here and there, especially in
the `tt' implementation.

Tweak the costs for opening and extending gaps, and tweak clocc_score(),
to get more desirable outputs from dt, it, and tt.  Update tests to
match new expectations.

Add instrumentation that prints a sub-table of the table computed by
findsplitn() as an HTML table.

Fix some bugs in table initialization in findsplitn() & count_records()
that were found with the help of the new instrumentation.

For count_records() debug output, record and print a few generations of
record boundaries.

Add a new member to clocc_t, the number of potential counterpart
clocc_t's in a counterpart template, and for each match-template clocc,
count the number of counterparts it has in the transform template---see
count_match_occ_transform_counterparts().  Eventually I will use the
counterparts to tweak scores for clocc-alignment between inputs and
match templates in tt.

Fix a bug in emit_transformed_text() where it wasn't considering all of
the match clocc_t's in a hash bucket.

Expand cloccs_t and the size of the clocc_t hash tables, too.  XXX needs
to be dynamic!

In dt, it, and tt (?), instead of running findlcs(), run count_records()
and quit if the -c option is passed.  This is just a convenience while I
test count_records(), which is a work in progress.

Fix some NULL pointer dereferences in the decimal- and hexadecimal-number
scanners.

Run the really long-running dt tests last instead of first.

Make tt/testit.sh run the tests on its command line or else all of the
tests in tests/.

In tt, allocate the class occurrences for the transform template on the
heap instead of the stack, so that we don't run out of stack space
and segfault.

Add a test for 'tt' that extracts the packet input/output statistics
from ifconfig -v output and nothing else.

Fix a bug in count_match_occ_transform_counterparts() where I was
calling TAILQ_NEXT() on an uninitialized element.  I'm not sure how I
missed the bug before, since it crashes 'tt' in its first test every
time.

Pick up some changes from the NetBSD CVS repository by wiz@@.  Change
__attribute__((__noreturn__)) to __dead.  Prepare for committing to
NetBSD CVS by adding a .cvsignore or two, making Subversion ignore some
CVS/ directories, etc.
@
text
@$ARFE: README 323 2016-01-25 20:29:30Z dyoung $
$NetBSD: README,v 1.12 2015/12/03 03:29:01 dyoung Exp $

DT---(d)ifferentiate (t)ext---finds a longest common subsequence (LCS)
of two texts where numbers, IPv4 addresses, and "symbols" (a letter or
underscore followed by zero or more letters, digits, or underscores)
are "wild": a span of decimal digits in the first text will match any
decimal-digits span in the second text.  An IPv4 number in the first
text will likewise match any IPv4 number in the second text.  A symbol
matches any symbol.  Currently, DT detects whole
hexadecimal numbers (with and without an 0x-prefix) and decimal numbers.
When DT emits the LCS, it replaces matched pairs of decimal numbers with
their difference, and matched pairs of hexadecimal numbers with their
bitwise-AND combination.  DT replaces matched pairs of IP numbers with
the smallest subnet that contains both.  DT copies the first symbol in a
matched pair to its output.

DT arose from the author's desire to examine the rate of change of
network statistics from several sources (ifconfig -va, netstat -s, a
network daemon's log dumps) without writing several one-off programs.
The experience of writing and using DT inspired the ARFE concept.

See ../README for more information about ARFE.
@


1.12
log
@Bring READMEs up to date.  Update $ARFE$.
@
text
@d1 2
a2 2
$ARFE: README 303 2015-12-03 03:26:34Z dyoung $
$NetBSD: README,v 1.11 2015/12/02 23:39:51 dyoung Exp $
@


1.11
log
@Executive summary: ARFE now understands C-like symbols.  I'm using a
different algorithm to figure out where to subdivide the longest common
subsequence search.  I'm actually computing an edit distance instead of
the longest common subsequence, but the algorithms are duals so there's not
much practical difference.  I've discarded some tests, and added at least
one new one.

Qualify %d for ptrdiff_t, %td.  Quiets compilation on 64-bit Darwin.

Change the type of the dynamic program cells from size_t to cell_t.  For
now, a cell_t is just a struct containing a size_t score.

Add algq(), a routine for finding k such that

        lcs(A[1:m/2], B[1:k]) | lcs(A[m/2+1:m], B[k+1:n]) = lcs(A, B).

algq() is based on the function Half(i, j) defined in Jeff Erickson's
(jeffe@@cs.illinois.edu) lecture notes on advanced dynamic programming.
See http://jeffe.cs.illinois.edu/teaching/algorithms/.

Add to the Makefile (commented out) lines for tracking code coverage.
Use gcov <source file> to see the coverage.

Make the cleandir target remove gcov(1)-related files.

Delete a bunch of dead code and the now unnecessary argument to algc(),
expected_lcs.

We don't need backwards subslices any more, so get rid of that.  When
I got rid of the slice_t member `backward' and all of its uses, GCC
inlined clocc_ends_at() with a really bad effect on performance (>10s
on elmendorf for the t/netstat-s.[01] test, instead of <8s).  I marked
clocc_ends_at() __noinline for a net performance gain.  The gcc version
is (NetBSD nb2 20110806) 4.5.3, btw.

Disable the dbg_assert()s for more reliable performance comparisons.

Rename algq's splitn argument to splitnp since that's my convention
for arguments of that kind.

In algq(), don't get(A, i) m x n times, just get(A, i) m times.

Lightly constify.

Provide __noinline on non-NetBSD systems.

Sprinkle the $ARFE$ keyword.

Rename algc -> findlcs, algq -> findsplitn.

Make Subversion fill $ARFE$ in macaddr.h.

Extract the tags target from {dt,it,tt}/Makefile, put it
in ./Makefile.inc.

We only ever call findsplitn(..., true), so get rid of the do_clocc
argument.

Add an experimental routine, count_records(), that tries to count the
records in its second slice_t argument, using the first slice_t argument
as record template.

Change the class-occurrence (clocc_t) score, clocc_score(), to one plus
the minimum length of the class occurrences, from one plus the product
of the class occurrences' lengths.  This speeds things up a bit.

Simplify findsplitn() by pulling common statements out from if-else
branches, et cetera.

In clocc_starts_in_slice_at(), pass the wlenp argument to
clocc_starts_at().  Nothing passed a non-NULL wlenp to
clocc_starts_in_slice_at(), so this doesn't make any functional
difference.

Compute the edit distance instead of the longest common subsequence.
The one algorithm is a dual of the other.  I may find it easier to
add to the edit distance algorithm improvements like affine gap
penalties, hence the change.

Snapshot of work in progress.  These changes make things quite a bit
slower!  Add affine gap penalties.

Bring count_records() in line with findsplitn(), adding affine gap
penalties.  Update the instrumentation.  Count up the number of gaps
accumulated.

XXX This change makes 'dt netstat-s.0 netstat-s.1' more than twice as
XXX slow as it used to be, owing largely (I think) to the increase in
XXX size of a cell_t, where three ssize_t's track the number of gaps.

Exit with a message and error return code if we run out of slots for
class occurrences.  The class-occurrence array is still statically
allocated---yech.  I'm going to fix it one of these days, I promise.

Stop detecting occurrences of class "string" (KIND_STRING), which
consisted of the names 'abe', 'ada', and 'daria'.  Remove the tests
related to that.

Start detecting occurrences of class "symbol" (KIND_SYMBOL), which
resemble C symbol names: they start with a letter of the alphabet or
underscore.  Following characters are letters, numbers, or underscore.
Update tests to match: the netstat and ifconfig tests produce much more
sensible results, now.  Delete the 'quack<number>quack' tests, since
the symbol detector matches the entire string, now, and the tests don't
stand for any practical use-case.

Add test #5 to tt, which demonstrates how one can use a symbol in the
match template to match a symbol in the input for reproduction in the
transform template.
@
text
@d1 2
a2 2
$ARFE: README 264 2015-10-08 22:28:01Z dyoung $
$NetBSD: README,v 1.10 2015/09/23 19:32:34 dyoung Exp $
d5 6
a10 4
of two texts where the numbers and IPv4 addresses are "wild": a span
of decimal digits in the first text will match any decimal-digits span
in the second text.  An IPv4 number in the first text will likewise
match any IPv4 number in the second text.  Currently, DT detects whole
d15 2
a16 1
the smallest subnet that contains both.
@


1.10
log
@Extract a subroutine, clocc_score(), that uses the lengths of two
class-occurrences (cloccs) to assign their match a score.

Add $NetBSD$, $ARFE$, and BSD license to core.c.

Update $NetBSD$

Remove a dangling comment from tt/tt.c.
@
text
@d1 2
a2 2
$ARFE: README 258 2015-09-23 19:31:17Z dyoung $
$NetBSD: README,v 1.9 2015/09/22 01:12:09 dyoung Exp $
@


1.9
log
@Give dt, it, and tt custom main() routines instead of using a bunch
of conditionals.  Put the core algorithms and data structures into
core.[ch].  Retire dt.h (it became core.h).  Update Makefiles to suit
and alphabetize SRCS.
@
text
@d1 2
a2 2
$ARFE: README 251 2015-09-22 01:09:13Z dyoung $
$NetBSD: README,v 1.8 2015/09/14 02:58:17 dyoung Exp $
@


1.8
log
@Make some changes that let this build and run properly on 64-bit hosts
and on Mac OS X.
@
text
@d1 2
a2 2
$ARFE: README 245 2015-09-11 02:13:21Z dyoung $
$NetBSD: README,v 1.7 2015/09/11 02:12:57 dyoung Exp $
@


1.7
log
@CVS/ is not a test directory, don't try to run a test there.

Update $ARFE$.
@
text
@d1 2
a2 2
$ARFE: README 243 2015-09-11 01:57:04Z dyoung $
$NetBSD: README,v 1.6 2015/09/11 01:50:42 dyoung Exp $
@


1.6
log
@Add a new tool, tt, that transforms its input based on the transform
exemplified by a match/transform-template pair.

Add a data detector for MAC addresses.  Update expected test outputs.
@
text
@d1 2
a2 2
$ARFE: README 236 2015-09-02 22:47:33Z dyoung $
$NetBSD: README,v 1.5 2015/09/02 22:45:47 dyoung Exp $
@


1.5
log
@Commit latest $ARFE$.
@
text
@d1 2
a2 2
$ARFE: README 235 2015-09-02 22:44:54Z dyoung $
$NetBSD: README,v 1.4 2015/09/02 22:43:17 dyoung Exp $
@


1.4
log
@Add $ARFE$, $NetBSD$, and licenses at the top of various files.

Factor the hexadecimal parser out of dt.c.  Put it in hex.[ch].  Start
an IPv4 parser.

Write an IPv4 parser in dt/ipv4.[ch] and start using it.  Reorganize
#includes in dt.c and free the hex parser after it's used.  Update the
expected test results for the IPv4 parser.

In the READMEs, describe the hexadecimal data detection and functions.
Describe how IPv4 addresses are treated.
@
text
@d1 2
a2 2
$ARFE: README 233 2015-09-02 22:33:54Z dyoung $
$NetBSD: README,v 1.3 2015/08/31 01:58:23 dyoung Exp $
@


1.3
log
@In the READMEs, describe the hexadecimal data detection and functions.

Update $ARFE$ in dt.c.
@
text
@d1 2
a2 2
$ARFE: README 225 2015-08-31 01:57:28Z dyoung $
$NetBSD$
d5 9
a13 6
of two texts where the numbers are "wild": a span of decimal digits in
the first text will match any decimal-digits span in the second text.
Currently, DT detects whole hexadecimal numbers (with and without an
0x-prefix) and decimal numbers.  When DT emits the LCS, it replaces
matched pairs of decimal numbers with their difference, and matched
pairs of hexadecimal numbers with their bitwise-AND combination.
@


1.2
log
@Locate in both inputs hexadecimal numbers starting 0x and make them
"wild" in the alignments dt computes.  In dt, bitwise-AND the 0x-hex
numbers.  In it, bitwise-OR them.  Take care not to match a hexadecimal
with a decimal or vice versa!

TBD: identify hexadecimals that don't start 0x.

Remove a little dead code.

Split HB_DEBUG into HB_DEBUG and HB_ASSERT.  The latter just enables the
assertions.

Update old test results for the new treatment of 0x-hexadecimal.  Add
some new tests.
@
text
@d1 2
a2 1
$ARFE: README 216 2015-08-22 05:04:28Z dyoung $
d5 6
a10 4
of two texts where decimal numbers are "wild": a span of decimal digits
in the first text will match a digits span in the second text.  Then it
emits the LCS, replacing matched pairs of decimal numbers with their
difference.
@


1.1
log
@Commit the beginnings of ARFE.

ARFE is a suite of tools for processing record- and field-oriented
digital texts.  ARFE strives to make a useful set of automatic
text-processing functions available at a level of abstraction that both
invites use by lay people and frees programmers from painstakingly
specifying input and output forms.  ARFE stands for (A)d hoc (R)ecord
and (F)ield (E)xtraction.  It is pronounced "arf!"
@
text
@d1 2
@