A technique for simple adaptive parsing
The technique relies on a vector quantization algorithm to evaluates the commonness ("popularity") of different passages in the text. More popular passages are taken to be delimiters; less popular ones are takes as fields. In more detail:
- Given a parameter w, the input string is transformed into a sequence of vectors of length w. Each vector is a substring of the input string at a different offset in the string.
- The vectors are further transformed into "one-hot" strings to avoid unintentional similarity between characters with similar Unicode values. Rather than representing each character with a single value in a vector, a sequence of binary values represents a character — a string is transformed into the vector resulting from concatenating such sequences for each character.
- The vectors are clustered with a vector quantization algorithm using a Euclidean distance metric. The popularity of each vector is the size of the cluster in which it is placed.
- In order to refine the popularities, several filters are applied to the clusters:
- Vectors are dropped if the distance from their cluster's defining point (trained neuron, in the case of competitive learning) is more than one standard deviation from the average of such distances. (This reduces the frequency of clusters containing mainly delimiters but some non-delimiter vectors.)
- Clusters are dropped if the standard deviation of distances from the cluster point is above a threshold value. This threshold should be varied automatically, but the current implementation fixes it at 0.001. (This step eliminates clusters containing only non-delimiter vectors.)
- Popularities are shifted such that vectors have maximum ratings if their popularities are most uniform — that is, if their popularity is near a mode. (This adjustment is based on the assumption that delimiters are distinct and that each will repeat n times where n is the number of records in the input.
- A record is a sequence of fields separated by delimiters. The present implementation assumes that records have similar delimiter sequences; the first contiguous sequence of delimiters is assumed to indicate the beginning of a record.
- Fields are identified by searching for sequences of at least w unpopular vectors.
Further reading