head	1.6;
access;
symbols;
locks; strict;
comment	@# @;


1.6
date	2015.12.02.23.39.51;	author dyoung;	state Exp;
branches;
next	1.5;
commitid	TzsqL4yA7qqYHqLy;

1.5
date	2015.09.11.01.50.43;	author dyoung;	state Exp;
branches;
next	1.4;
commitid	ptM4knOhaM6kaMAy;

1.4
date	2015.09.02.22.43.17;	author dyoung;	state Exp;
branches;
next	1.3;
commitid	EwjsMg946Wp5oJzy;

1.3
date	2015.08.28.21.44.43;	author dyoung;	state Exp;
branches;
next	1.2;
commitid	doyQMW6a1aISd5zy;

1.2
date	2015.08.22.05.08.48;	author dyoung;	state Exp;
branches;
next	1.1;
commitid	hWHunK8RVFy5Udyy;

1.1
date	2015.08.10.21.10.59;	author dyoung;	state Exp;
branches;
next	;
commitid	dALcDv1uBdaPALwy;


desc
@@


1.6
log
@Executive summary: ARFE now understands C-like symbols.  I'm using a
different algorithm to figure out where to subdivide the longest common
subsequence search.  I'm actually computing an edit distance instead of
the longest common subsequence, but the algorithms are duals so there's not
much practical difference.  I've discarded some tests, and added at least
one new one.

Qualify %d for ptrdiff_t, %td.  Quiets compilation on 64-bit Darwin.

Change the type of the dynamic program cells from size_t to cell_t.  For
now, a cell_t is just a struct containing a size_t score.

Add algq(), a routine for finding k such that

        lcs(A[1:m/2], B[1:k]) | lcs(A[m/2+1:m], B[k+1:n]) = lcs(A, B).

algq() is based on the function Half(i, j) defined in Jeff Erickson's
(jeffe@@cs.illinois.edu) lecture notes on advanced dynamic programming.
See http://jeffe.cs.illinois.edu/teaching/algorithms/.

Add to the Makefile (commented out) lines for tracking code coverage.
Use gcov <source file> to see the coverage.

Make the cleandir target remove gcov(1)-related files.

Delete a bunch of dead code and the now-unnecessary argument to algc(),
expected_lcs.

We don't need backwards subslices any more, so get rid of that.  When
I got rid of the slice_t member `backward' and all of its uses, GCC
inlined clocc_ends_at() with a really bad effect on performance (>10s
on elmendorf for the t/netstat-s.[01] test, instead of <8s).  I marked
clocc_ends_at() __noinline for a net performance gain.  The gcc version
is (NetBSD nb2 20110806) 4.5.3, btw.

Disable the dbg_assert()s for more reliable performance comparisons.

Rename algq's splitn argument to splitnp since that's my convention
for arguments of that kind.

In algq(), don't get(A, i) m x n times, just get(A, i) m times.

Lightly constify.

Provide __noinline on non-NetBSD systems.

Sprinkle the $ARFE$ keyword.

Rename algc -> findlcs, algq -> findsplitn.

Make Subversion fill $ARFE$ in macaddr.h.

Extract the tags target from {dt,it,tt}/Makefile, put it
in ./Makefile.inc.

We only ever call findsplitn(..., true), so get rid of the do_clocc
argument.

Add an experimental routine, count_records(), that tries to count the
records in its second slice_t argument, using the first slice_t argument
as a record template.

Change the class-occurrence (clocc_t) score, clocc_score(), to one plus
the minimum length of the class occurrences, from one plus the product
of the class occurrences' lengths.  This speeds things up a bit.

Simplify findsplitn() by pulling common statements out from if-else
branches, et cetera.

In clocc_starts_in_slice_at(), pass the wlenp argument to
clocc_starts_at().  Nothing passed a non-NULL wlenp to
clocc_starts_in_slice_at(), so this doesn't make any functional
difference.

Compute the edit distance instead of the longest common subsequence.
The one algorithm is a dual of the other.  I may find it easier to
add to the edit distance algorithm improvements like affine gap
penalties, hence the change.

Snapshot of work in progress.  These changes make things quite a bit
slower!  Add affine gap penalties.

Bring count_records() in line with findsplitn(), adding affine gap
penalties.  Update the instrumentation.  Count up the number of gaps
accumulated.

XXX This change makes 'dt netstat-s.0 netstat-s.1' more than twice as
XXX slow as it used to be, owing largely (I think) to the increase in
XXX size of a cell_t, where three ssize_t's track the number of gaps.

Exit with a message and error return code if we run out of slots for
class occurrences.  The class-occurrence array is still statically
allocated---yech.  I'm going to fix it one of these days, I promise.

Stop detecting occurrences of class "string" (KIND_STRING), which
consisted of the names 'abe', 'ada', and 'daria'.  Remove the tests
related to that.

Start detecting occurrences of class "symbol" (KIND_SYMBOL), which
resemble C symbol names: they start with a letter of the alphabet or
underscore.  Following characters are letters, digits, or underscores.
Update tests to match: the netstat and ifconfig tests produce much more
sensible results, now.  Delete the 'quack<number>quack' tests, since
the symbol detector matches the entire string, now, and the tests don't
stand for any practical use-case.

Add test #5 to tt, which demonstrates how one can use a symbol in the
match template to match a symbol in the input for reproduction in the
transform template.
@
text
@wm0: flags=   0<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu    0
	capabilities=2bf80<TSO4,IP4CSUM_Rx,IP4CSUM_Tx,TCP4CSUM_Rx,TCP4CSUM_Tx,UDP4CSUM_Rx,UDP4CSUM_Tx,TCP6CSUM_Tx,UDP6CSUM_Tx>
	enabled=2bf80<TSO4,IP4CSUM_Rx,IP4CSUM_Tx,TCP4CSUM_Rx,TCP4CSUM_Tx,UDP4CSUM_Rx,UDP4CSUM_Tx,TCP6CSUM_Tx,UDP6CSUM_Tx>
	address: 00:0a:0b:cd:01:ef
	media: Ethernet autoselect (   0baseT full-duplex)
	status: active
	input:     112 packets,      11380 bytes,      11 multicasts
	output:      60 packets,      11440 bytes,    0 multicasts
	inet 10.0.1.17 netmask 0xffffff00 broadcast 10.0.1.255
	inet6 fe80::20a:bff:fecd:1ef%wm0 prefixlen  0 scopeid 0x1


@


1.5
log
@Add a new tool, tt, that transforms its input based on the transform
exemplified by a match/transform-template pair.

Add a data detector for MAC addresses.  Update expected test outputs.
@
text
@d2 2
a3 2
	capabilities=2bf80<TSO0,IP0CSUM_Rx,IP0CSUM_Tx,TCP0CSUM_Rx,TCP0CSUM_Tx,UDP0CSUM_Rx,UDP0CSUM_Tx,TCP0CSUM_Tx,UDP0CSUM_Tx>
	enabled=2bf80<TSO0,IP0CSUM_Rx,IP0CSUM_Tx,TCP0CSUM_Rx,TCP0CSUM_Tx,UDP0CSUM_Rx,UDP0CSUM_Tx,TCP0CSUM_Tx,UDP0CSUM_Tx>
d10 1
a10 1
	inet0 fe80::20a:bff:fecd:1ef%wm0 prefixlen  0 scopeid 0x1
@


1.4
log
@Add $ARFE$, $NetBSD$, and licenses at the top of various files.

Factor the hexadecimal parser out of dt.c.  Put it in hex.[ch].  Start
an IPv4 parser.

Write an IPv4 parser in dt/ipv4.[ch] and start using it.  Reorganize
#includes in dt.c and free the hex parser after it's used.  Update the
expected test results for the IPv4 parser.

In the READMEs, describe the hexadecimal data detection and functions.
Describe how IPv4 addresses are treated.
@
text
@d4 1
a4 1
	address:  0: a: b:cd: 0:ef
@


1.3
log
@Update expected test results for the improvements to hexadecimal
detection.

Update $Id$ in dt.c.
@
text
@d9 1
a9 1
	inet  0.0.0. 0 netmask 0xffffff00 broadcast  0.0.0.  0
@


1.2
log
@Locate in both inputs hexadecimal numbers starting 0x and make them
"wild" in the alignments dt computes.  In dt, bitwise-AND the 0x-hex
numbers.  In it, bitwise-OR them.  Take care not to match a hexadecimal
with a decimal or vice versa!

TBD: identify hexadecimals that don't start 0x.

Remove a little dead code.

Split HB_DEBUG into HB_DEBUG and HB_ASSERT.  The latter just enables the
assertions.

Update old test results for the new treatment of 0x-hexadecimal.  Add
some new tests.
@
text
@d2 3
a4 3
	capabilities=0bf 0<TSO0,IP0CSUM_Rx,IP0CSUM_Tx,TCP0CSUM_Rx,TCP0CSUM_Tx,UDP0CSUM_Rx,UDP0CSUM_Tx,TCP0CSUM_Tx,UDP0CSUM_Tx>
	enabled=0bf 0<TSO0,IP0CSUM_Rx,IP0CSUM_Tx,TCP0CSUM_Rx,TCP0CSUM_Tx,UDP0CSUM_Rx,UDP0CSUM_Tx,TCP0CSUM_Tx,UDP0CSUM_Tx>
	address:  0:0a:0b:cd: 0:ef
d10 1
a10 1
	inet0 fe 0:: 0a:bff:fecd:0ef%wm0 prefixlen  0 scopeid 0x1
@


1.1
log
@Commit the beginnings of ARFE.

ARFE is a suite of tools for processing record- and field-oriented
digital texts.  ARFE strives to make a useful set of automatic
text-processing functions available at a level of abstraction that both
invites use by lay people and frees programmers from painstakingly
specifying input and output forms.  ARFE stands for (A)d hoc (R)ecord
and (F)ield (E)xtraction.  It is pronounced "arf!"
@
text
@d9 2
a10 2
	inet  0.0.0. 0 netmask 0xffffff 0 broadcast  0.0.0.  0
	inet0 fe 0:: 0a:bff:fecd:0ef%wm0 prefixlen  0 scopeid 0x0
@

