head 1.13; access; symbols; locks; strict; comment @# @; 1.13 date 2016.01.26.02.22.57; author dyoung; state Exp; branches; next 1.12; commitid TcUPVdKNFlwdWnSy; 1.12 date 2015.12.03.03.29.01; author dyoung; state Exp; branches; next 1.11; commitid KERKKFm8qHpI2sLy; 1.11 date 2015.12.02.23.39.51; author dyoung; state Exp; branches; next 1.10; commitid TzsqL4yA7qqYHqLy; 1.10 date 2015.09.23.19.32.34; author dyoung; state Exp; branches; next 1.9; commitid exic1zXtILEHEpCy; 1.9 date 2015.09.22.01.12.09; author dyoung; state Exp; branches; next 1.8; commitid tbaMkOcWCFfgBbCy; 1.8 date 2015.09.14.02.58.17; author dyoung; state Exp; branches; next 1.7; commitid O988RZ6i630AraBy; 1.7 date 2015.09.11.02.12.57; author dyoung; state Exp; branches; next 1.6; commitid edBwCEt5dcE2iMAy; 1.6 date 2015.09.11.01.50.42; author dyoung; state Exp; branches; next 1.5; commitid ptM4knOhaM6kaMAy; 1.5 date 2015.09.02.22.45.47; author dyoung; state Exp; branches; next 1.4; commitid dclRljt7Q5uToJzy; 1.4 date 2015.09.02.22.43.17; author dyoung; state Exp; branches; next 1.3; commitid EwjsMg946Wp5oJzy; 1.3 date 2015.08.31.01.58.23; author dyoung; state Exp; branches; next 1.2; commitid a1XZsFEXv03Vymzy; 1.2 date 2015.08.22.05.08.48; author dyoung; state Exp; branches; next 1.1; commitid hWHunK8RVFy5Udyy; 1.1 date 2015.08.10.21.10.59; author dyoung; state Exp; branches; next ; commitid dALcDv1uBdaPALwy; desc @@ 1.13 log @Bring NetBSD CVS up-to-date with my private Subversion repository. In dt/core.c, delete dead scratch_t members L1 and L2. Pull the gaps count out of subcell_t and add to scratch_t members G[0..1] and Gclocc to hold the gaps count when/where I use it. This gives me back a bit of speed, but I'm still not back to the previous best: affine gap penalties are expensive. Add a dummy 'test' target to it/Makefile. Really need to come up with some tests for IT. Some version of the DT tests oughta do the trick. Thanks, Thomas Klausner (wiz@@NetBSD.org), for the heads-up about some Makefile problems. Fix them: recurse on 'test' target. Make 'test' target depend on dependall. Fix/improve usage messages. Disable debug assertions for now. Make 'make cleandir' remove *.gcov. In dt/sym.c, don't dereference a NULL pointer trying to emit a symbol that is too long. XXX ought to revisit this. Add the hierarchical-logging library that I developed for CUWiN and start to use it to switch debugging printfs on and off. Repair the 'tags' rule, making it write tags to .CURDIR instead of .OBJDIR. Add two new scanners, one for whitespace up to and including end-of-line (EOL), and one for whitespace other than whitespace at EOL. Bring expected test results up-to-date with these beneficial changes to whitespace handling. Add mk/helpers.mk that provides utilities PRINTOBJDIR and PRINTOBJDIROF. Use relative paths like 'dt/core.h' for ARFE header files. Add CPPFLAGS+=-I$(.CURDIR)/.. to all of the makefiles to make this work. Add DPINCS to makefiles that were missing it. Factor out some initialization and argument parsing; put it into arfe_parse_options(). Improve file_to_slice() error reporting: print the filename concerned. Add a utility function, cloccs_are_equal(), that returns true if and only if the string comprising the left-hand clocc is equal to the string comprising the right-hand clocc. Use it here and there, especially in the `tt' implementation. Tweak the costs for opening and extending gaps, and tweak clocc_score(), to get more desirable outputs from dt, it, and tt. Update tests to match new expectations. Add instrumentation that prints a sub-table of the table computed by findsplitn() as an HTML table. Fix some bugs in table initialization in findsplitn() & count_records() that were found with the help of the new instrumentation. For count_records() debug output, record and print a few generations of record boundaries. Add a new member to clocc_t, the number of potential counterpart clocc_t's in a counterpart template, and for each match-template clocc, count the number of counterparts it has in the transform template---see count_match_occ_transform_counterparts(). Eventually I will use the counterparts to tweak scores for clocc-alignment between inputs and match templates in tt. Fix a bug in emit_transformed_text() where it wasn't considering all of the match clocc_t's in a hash bucket. Expand cloccs_t and the size of the clocc_t hash tables, too. XXX needs to be dynamic! In dt, it, and tt (?), instead of running findlcs(), run count_records() and quit if the -c option is passed. This is just a convenience while I test count_records(), which is a work in progress. Fix some NULL pointer dereferences in the decimal- and hexadecimal-number scanners. Run the really long-running dt tests last instead of first. Make tt/testit.sh run the tests on its command line or else all of the tests in tests/. In tt, allocate the class occurrences for the transform template on the heap instead of the stack, so that we don't run out of stack space and segfault. Add a test for 'tt' that extracts the packet input/output statistics from ifconfig -v output and nothing else. Fix a bug in count_match_occ_transform_counterparts() where I was calling TAILQ_NEXT() on an uninitialized element. I'm not sure how I missed the bug before, since it crashes 'tt' in its first test every time. Pick up some changes from the NetBSD CVS repository by wiz@@. Change __attribute__((__noreturn__)) to __dead. Prepare for committing to NetBSD CVS by adding a .cvsignore or two, making Subversion ignore some CVS/ directories, etc. @ text @$ARFE: README 323 2016-01-25 20:29:30Z dyoung $ $NetBSD: README,v 1.12 2015/12/03 03:29:01 dyoung Exp $ DT---(d)ifferentiate (t)ext---finds a longest common subsequence (LCS) of two texts where the numbers, "symbols" starting with a letter or underscore followed by zero or more letters, numbers, or underscores, and IPv4 addresses are "wild": a span of decimal digits in the first text will match any decimal-digits span in the second text. An IPv4 number in the first text will likewise match any IPv4 number in the second text. Symbols match symbols. Currently, DT detects whole hexadecimal numbers (with and without an 0x-prefix) and decimal numbers. When DT emits the LCS, it replaces matched pairs of decimal numbers with their difference, and matched pairs of hexadecimal numbers with their bitwise-AND combination. DT replaces matched pairs of IP numbers with the smallest subnet that contains both. DT copies the first symbol in a matched pair to its output. DT arose from the author's desire to examine the rate of change of network statistics from several sources (ifconfig -va, netstat -s, a network daemon's log dumps) without writing several one-off programs. The experience of writing and using DT inspired the ARFE concept. See ../README for more information about ARFE. @ 1.12 log @Bring READMEs up to date. Update $ARFE$. @ text @d1 2 a2 2 $ARFE: README 303 2015-12-03 03:26:34Z dyoung $ $NetBSD: README,v 1.11 2015/12/02 23:39:51 dyoung Exp $ @ 1.11 log @Executive summary: ARFE now understands C-like symbols. I'm using a different algorithm to figure out where to subdivide the longest common subsequence search. I'm actually computing an edit distance instead of the longest common sequence, but the algorithms are duals so there's not much practical difference. I've discarded some tests, and added at least one new one. Qualify %d for ptrdiff_t, %td. Quiets compilation on 64-bit Darwin. Change the type of the dynamic program cells from size_t to cell_t. For now, a cell_t is just a struct containing a size_t score. Add algq(), a routine for finding k such that lcs(A[1:m/2], B[1:k]) | lcs(A[m/2+1:m], B[k+1:n]) = lcs(A, B). algq() is based on the function Half(i, j) defined in Jeff Erickson's (jeffe@@cs.illinois.edu) lecture notes on advanced dynamic programming. See http://jeffe.cs.illinois.edu/teaching/algorithms/. Add to the Makefile (commented out) lines for tracking code coverage. Use gcov to see the coverage. Make the cleandir target remove gcov(1)-related files. Delete a bunch of dead code and the now unnecessary argument to algc(), expected_lcs. We don't need backwards subslices any more, so get rid of that. When I got rid of the slice_t member `backward' and all of its uses, GCC inlined clocc_ends_at() with a really bad effect on performance (>10s on elmendorf for the t/netstat-s.[01] test, instead of <8s). I marked clocc_ends_at() __noinline for a net performance gain. The gcc version is (NetBSD nb2 20110806) 4.5.3, btw. Disable the dbg_assert()s for more reliable performance comparisons. Rename algq's splitn argument to splitnp since that's my convention for arguments of that kind. In algq(), don't get(A, i) m x n times, just get(A, i) m times. Lightly constify. Provide __noinline on non-NetBSD systems. Sprinkle the $ARFE$ keyword. Rename algc -> findlcs, algq -> findsplitn. Make Subversion fill $ARFE$ in macaddr.h. Extract the tags target from {dt,it,tt}/Makefile, put it in ./Makefile.inc. We only ever call findsplitn(..., true), so get rid of the do_clocc argument. Add an experimental routine, count_records(), that tries to count the records in its second slice_t argument, using the first slice_t argument as record template. Change the class-occurrence (clocc_t) score, clocc_score(), to one plus the minimum length of the class occurrences, from one plus the product of the class occurrences' lengths. This speeds things up a bit. Simplify findsplitn() by pulling common statements out from if-else branches, et cetera. In clocc_starts_in_slice_at(), pass the wlenp argument to clocc_starts_at(). Nothing passed a non-NULL wlenp to clocc_starts_in_slice_at(), so this doesn't make any functional difference. Compute the edit distance instead of the longest common subsequence. The one algorithm is a dual of the other. I may find it easier to add to the edit distance algorithm improvements like affine gap penalties, hence the change. Snapshot of work in progress. These changes make things quite a bit slower! Add affine gap penalties. Bring count_records() in line with findsplitn(), adding affine gap penalties. Update the instrumentation. Count up the number of gaps accumulated. XXX This change makes 'dt netstat-s.0 netstat-s.1' more than twice as XXX slow as it used to be, owing largely (I think) to the increase in XXX size of a cell_t, where three ssize_t's track the number of gaps. Exit with a message and error return code if we run out of slots for class occurrences. The class-occurrence array is still statically allocated---yech. I'm going to fix it one of these days, I promise. Stop detecting occurrences of class "string" (KIND_STRING), which consisted of the names 'abe', 'ada', and 'daria'. Remove the tests related to that. Start detecting occurrences of class "symbol" (KIND_SYMBOL), which resemble C symbol names: they start with a letter of the alphabet or underscore. Following characters are letters, numbers, or underscore. Update tests to match: the netstat and ifconfig tests produce much more sensible results, now. Delete the 'quackquack' tests, since the symbol detector matches the entire string, now, and the tests don't stand for any practical use-case. Add test #5 to tt, which demonstrates how one can use a symbol in the match template to match a symbol in the input for reproduction in the transform template. @ text @d1 2 a2 2 $ARFE: README 264 2015-10-08 22:28:01Z dyoung $ $NetBSD: README,v 1.10 2015/09/23 19:32:34 dyoung Exp $ d5 6 a10 4 of two texts where the numbers and IPv4 addresses are "wild": a span of decimal digits in the first text will match any decimal-digits span in the second text. An IPv4 number in the first text will likewise match any IPv4 number in the second text. Currently, DT detects whole d15 2 a16 1 the smallest subnet that contains both. @ 1.10 log @Extract a subroutine, clocc_score(), that uses the lengths of two class-occurrences (cloccs) to assign their match a score. Add $NetBSD$, $ARFE$, and BSD license to core.c. Update $NetBSD$ Remove a dangling comment from tt/tt.c. @ text @d1 2 a2 2 $ARFE: README 258 2015-09-23 19:31:17Z dyoung $ $NetBSD: README,v 1.9 2015/09/22 01:12:09 dyoung Exp $ @ 1.9 log @Give dt, it, and tt custom main() routines instead of using a bunch of conditionals. Put the core algorithms and data structures into core.[ch]. Retire dt.h (it became core.h). Update Makefiles to suit and alphabetize SRCS. @ text @d1 2 a2 2 $ARFE: README 251 2015-09-22 01:09:13Z dyoung $ $NetBSD: README,v 1.8 2015/09/14 02:58:17 dyoung Exp $ @ 1.8 log @Make some changes that let this build and run properly on 64-bit hosts and on Mac OS X. @ text @d1 2 a2 2 $ARFE: README 245 2015-09-11 02:13:21Z dyoung $ $NetBSD: README,v 1.7 2015/09/11 02:12:57 dyoung Exp $ @ 1.7 log @CVS/ is not a test directory, don't try to run a test there. Update $ARFE$. @ text @d1 2 a2 2 $ARFE: README 243 2015-09-11 01:57:04Z dyoung $ $NetBSD: README,v 1.6 2015/09/11 01:50:42 dyoung Exp $ @ 1.6 log @Add a new tool, tt, that transforms its input based on the transform exemplified by a match/transform-template pair. Add a data detector for MAC addresses. Update expected test outputs. @ text @d1 2 a2 2 $ARFE: README 236 2015-09-02 22:47:33Z dyoung $ $NetBSD: README,v 1.5 2015/09/02 22:45:47 dyoung Exp $ @ 1.5 log @Commit latest $ARFE$. @ text @d1 2 a2 2 $ARFE: README 235 2015-09-02 22:44:54Z dyoung $ $NetBSD: README,v 1.4 2015/09/02 22:43:17 dyoung Exp $ @ 1.4 log @Add $ARFE$, $NetBSD$, and licenses at the top of various files. Factor the hexadecimal parser out of dt.c. Put it in hex.[ch]. Start an IPv4 parser. Write an IPv4 parser in dt/ipv4.[ch] and start using it. Reorganize #includes in dt.c and free the hex parser after it's used. Update the expected test results for the IPv4 parser. In the READMEs, describe the hexadecimal data detection and functions. Describe how IPv4 addresses are treated. @ text @d1 2 a2 2 $ARFE: README 233 2015-09-02 22:33:54Z dyoung $ $NetBSD: README,v 1.3 2015/08/31 01:58:23 dyoung Exp $ @ 1.3 log @In the READMEs, describe the hexadecimal data detection and functions. Update $ARFE$ in dt.c. @ text @d1 2 a2 2 $ARFE: README 225 2015-08-31 01:57:28Z dyoung $ $NetBSD$ d5 9 a13 6 of two texts where the numbers are "wild": a span of decimal digits in the first text will match any decimal-digits span in the second text. Currently, DT detects whole hexadecimal numbers (with and without an 0x-prefix) and decimal numbers. When DT emits the LCS, it replaces matched pairs of decimal numbers with their difference, and matched pairs of hexadecimal numbers with their bitwise-AND combination. @ 1.2 log @Locate in both inputs hexadecimal numbers starting 0x and make them "wild" in the alignments dt computes. In dt, bitwise-AND the 0x-hex numbers. In it, bitwise-OR them. Take care not to match a hexadecimal with a decimal or vice versa! TBD: identify hexadecimals that don't start 0x. Remove a little dead code. Split HB_DEBUG into HB_DEBUG and HB_ASSERT. The latter just enables the assertions. Update old test results for the new treatment of 0x-hexadecimal. Add some new tests. @ text @d1 2 a2 1 $ARFE: README 216 2015-08-22 05:04:28Z dyoung $ d5 6 a10 4 of two texts where decimal numbers are "wild": a span of decimal digits in the first text will match a digits span in the second text. Then it emits the LCS, replacing matched pairs of decimal numbers with their difference. @ 1.1 log @Commit the beginnings of ARFE. ARFE is a suite of tools for processing record- and field-oriented digital texts. ARFE strives to make a useful set of automatic text-processing functions available at a level of abstraction that both invites use by lay people and frees programmers from painstakingly specifying input and output forms. ARFE stands for (A)d hoc (R)ecord and (F)ield (E)xtraction. It is pronounced "arf!" @ text @d1 2 @