temper.std.regex.Sequence.map
Optional support library provided with Temper
{ "version": 3, "file": "java/std/src/main/java/temper/std/regex/Sequence.java", "sources": [ "std/regex.temper.md" ], "sourcesContent": [ "# Regex Data Model and Functionality\n\nThe structural data model for regex patterns enables direct construction, and\nthe Temper regex dialect compiles static regex text patterns to these objects.\n\nA focus here is on providing tools people can actually reach for when they need\nto do text processing. The execution should be faster on backends like Python\nthan writing raw code, and the implementation in backends like C should\napproximate what you'd like to have written manually.\n\nDue to inadequate and distinct Unicode handling in backend regex engines, the\ninitial feature set avoids character classes and properties but is still aware\nof code points. Parsing focused on limited sets of delimiters works best for\nnow.\n\nThe core feature set here focuses on both the data model and utility functions,\nsuch as matching regexes against strings.\n\n## Regex Data Model\n\nAll regexes are composed hierarchically of `Regex` nodes. Regexes are a\nsequence of component parts. For example, `/hi./` is a sequence of\n[CodePoint](#codepoint) `/h/` and `/i/` and dot `/./`.\n\nAnd perhaps the most fundamental `Regex` is the [Sequence](#sequence),\nbecause it enables multiple regex components to be strung together.\n\n```\nexport /*sealed*/ interface Regex {\n```\n\nBefore a regex is used, it must be compiled. 
Some helper functions compile on\nthe fly, although it is faster to reuse a pre-compiled regex.\n\n```\n // TODO(tjp, regex): Make this into a macro behind the scenes.\n // TODO(tjp, regex): `compiled\u003cT\u003e(): CompiledRegex\u003cT\u003e`\n public compiled(): CompiledRegex { new CompiledRegex(this) }\n```\n\nThe simplest use of a regular expression is if it is found in a string.\nRepeatedly calling these methods on a single `Regex` instance is inefficient.\nBetter for reuse is to compile in advance.\n\n```\n public found(text: String): Boolean { compiled().found(text) }\n```\n\nYou can also return match details or perform text replacement. The returned\ngroups map contains an entry for each key in the order defined in the regex\npattern. If no \u0022full\u0022 group is defined, one is added automatically to capture\nthe full matched text.\n\nIn the future, we intend to support customized match types with fields to match\ncapture groups, statically checked where possible.\n\n```\n // TODO(tjp, regex): Also macro because reification.\n\n public find(text: String): Map\u003cString, Group\u003e | Bubble {\n compiled().find(text)\n }\n\n public replace(\n text: String, format: fn (Map\u003cString, Group\u003e): String\n ): String {\n compiled().replace(text, format)\n }\n```\n\nThat's it for what you can do with regex patterns in Temper today, but there's\nmuch more to say on what kinds of regexes can be built.\n\n```\n}\n```\n\n## Regex Item Types\n\nA `Regex` is composed of a potential variety of subtypes.\n\n### Groups\n\nMultiple types of groups exist:\n\n- [Capture](#capture) `/(?\u003cname\u003e...)/` to remember match groups for later use.\n- Non-capturing group syntax `/(?:...)/`, which is simply a [Regex](#regex)\n instance in the data model.\n\n### Capture\n\nTODO(tjp, regex): Change to named captures only!\n\n`Capture` is a [group](#groups) that remembers the matched text for later\naccess. 
Temper supports only named matches, with current intended syntax\n`/(?name = ...)/`.\n\n```\nexport class Capture extends Regex {\n public name: String;\n public /*early*/ item: Regex;\n}\n```\n\n### CodePart\n\nA component of a [CodeSet][#codeset], aka character class, which applies to a\nsubset of regex data types.\n\nHere, \u0022code\u0022 is short for \u0022code point\u0022 although \u0022char\u0022 might work better,\ndepending on expectations.\n\n```\nexport /*sealed*/ interface CodePart extends Regex {}\n```\n\n### CodePoints\n\nOne or more verbatim code points, where the sequence matters if within a\n[Regex](#regex) or not if within a [CodeSet](#codeset). Some escapes in\ntextual regex source, such as `/\\t/`, can be stored as raw code points.\n\nThe `String` here can enable more efficient storage than individual code\npoints, although the source text may require non-capture grouping. For example,\n`/(?:abc)?/` optionally matches the string `\u0022abc\u0022`, whereas `/abc?/` matches\n`\u0022ab\u0022` with an optional `\u0022c\u0022`.\n\n```\nexport class CodePoints extends CodePart {\n public value: String;\n}\n```\n\n### Specials\n\nA number of special match forms exist. In the data model, these are empty\nclasses.\n\n- `.` - `Dot` In default mode, matches any Unicode code point except newline.\n- `^` - `Begin` in default mode matches zero-length at the beginning of a\n string.\n- `\u0024` - `End` in default mode matches zero-length at the end of a string.\n- `\\b` - `WordBoundary` matches zero-length at the boundary between word and\n non-word code points. More sophisticated Unicode compliance is TBD.\n- `\\s` (negated as `\\S`) - `Space` matches any horizontal space code point.\n Details are TBD.\n- `\\w` (negated as `\\W`) - `Word` matches any word code point. Details are TBD.\n This is currently defined in terms of old ASCII definitions because those are\n clear. 
Perhaps this will stay that way, and Unicode properties like `\\p{L}`\n will be used for human language needs.\n- `\\X` - `GraphemeCluster` might not be supported, but [here is some discussion\n of how to implement it](\n https://github.com/rust-lang/regex/issues/54#issuecomment-661905060).\n\n\u003cdetails\u003e\n\n```\nexport /*sealed*/ interface Special extends Regex {}\nexport let Begin: Special = do { class Begin extends Special {}; new Begin() };\nexport let Dot: Special = do { class Dot extends Special {}; new Dot() };\nexport let End: Special = do { class End extends Special {}; new End() };\n// TODO(tjp, regex): We can't easily support this at present across backends.\n// export let GraphemeCluster = do {\n// class GraphemeCluster extends Special {}; new GraphemeCluster()\n// };\nexport let WordBoundary: Special = do {\n class WordBoundary extends Special {}; new WordBoundary()\n};\n\nexport /*sealed*/ interface SpecialSet extends CodePart \u0026 Special {}\nexport let Digit: SpecialSet = do {\n class Digit extends SpecialSet {}; new Digit()\n};\nexport let Space: SpecialSet = do {\n class Space extends SpecialSet {}; new Space()\n};\nexport let Word: SpecialSet = do {\n class Word extends SpecialSet {}; new Word()\n};\n```\n\n\u003c/details\u003e\n\n### CodeRange\n\nA code point range matches any code point in its inclusive bounds, such as\n`/[a-c]/`. In source, `-` is included in a code set either by escaping or by\nincluding it as the first or last character. A `CodeRange` is usually contained\ninside a [CodeSet](#codeset), and syntactically always is.\n\n```\nexport class CodeRange extends CodePart {\n public min: Int;\n public max: Int;\n}\n```\n\n### CodeSet\n\nA set of code points, any of which can match, such as `/[abc]/` matching any of\n`\u0022a\u0022`, `\u0022b\u0022`, or `\u0022c\u0022`. Alternatively, a negated set is the inverse of the code\npoints given, such as `/[^abc]/`, matching any code point that's not any of\nthese. 
This is also often called a character class.\n\nFurther, a subset of [specials](#specials) can also be used in code sets. A\nnegated code set of just a special set often has custom syntax. For example,\nnon-space can be said as either `/[^\\s]/` or `/\\S/`.\n\n```\nexport class CodeSet extends Regex {\n public items: List\u003cCodePart\u003e;\n public negated: Boolean = false;\n}\n```\n\n### Or\n\n`Or` matches any one of multiple options, such as `/ab|cd|e*/`.\n\n```\nexport class Or extends Regex {\n public /*early*/ items: List\u003cRegex\u003e;\n}\n```\n\n### Repeat\n\n`Repeat` matches from an minimum to a maximum number of repeats of a\nsubregex. This can be represented in regex source in a number of ways:\n\n- `?` matches 0 or 1.\n- `*` matches 0 or more.\n- `+` matches 1 or more.\n- `{m}` matches exactly `m` repetitions.\n- `{m,n}` matches between `m` and `n`. Missing `n` is a max of infinity. For\n example, `{0,1}` is equivalent to `?`, and `{1,}` is equivalent to `+`.\n\nBy default, repetitions are greedy, matching as many repetitions as possible.\nIn regex source, any of the above can have `?` appended to indicated reluctant\n(aka non-greedy), matching as few repetitions as possible.\n\n```\nexport class Repeat extends Regex {\n public /*early*/ item: Regex;\n public min: Int;\n public max: Int | Null; // where null means infinite\n public reluctant: Boolean = false;\n}\n```\n\nWe also have convenience builders.\n\n```\nexport let entire(item: Regex): Regex {\n new Sequence([Begin, item, End])\n}\n\nexport let oneOrMore(item: Regex, reluctant: Boolean = false): Repeat {\n { item, min: 1, max: null, reluctant }\n}\n\nexport let optional(item: Regex, reluctant: Boolean = false): Repeat {\n { item, min: 0, max: 1, reluctant }\n}\n```\n\n### Sequence\n\n`Sequence` strings along multiple other regexes in order.\n\n```\nexport class Sequence extends Regex {\n public /*early*/ items: List\u003cRegex\u003e;\n}\n```\n\n## Match Objects\n\nFor detailed match 
results, call `find` on a regex to get a `Map` object from\n`String` keys to `Group` values.\n\n```\n// TODO Go back to a `Match` object with `groups` as a member so we can also\n// TODO easily return metadata alongside groups? Or is simpler better?\n// export class Match { // interface ... \u003cT = Map\u003cString, Group\u003e\u003e {\n// public let groups: Map\u003cString, Group\u003e;\n// }\n\nexport class Group {\n public let name: String;\n public let value: String;\n public let codePointsBegin: Int;\n}\n```\n\n## Compiled Regex Objects\n\nThe compiled form of a regex is mostly opaque, but it can be cached for more\nefficient reuse than working from a source [Regex](#regex-data-model).\n\n\u003cdetails\u003e\n\n```\n// Provides a workaround for access to std/regex from extension methods.\nclass RegexRefs {\n public let codePoints: CodePoints = new CodePoints(\u0022\u0022);\n public let group: Group = { name: \u0022\u0022, value: \u0022\u0022, codePointsBegin: 0 }\n public let orObject: Or = new Or([]);\n}\n\nlet regexRefs = new RegexRefs();\n```\n\n\u003c/details\u003e\n\n```\n// TODO(tjp, regex): Generate subtypes of this interface later.\nexport class CompiledRegex { // interface ... 
\u003cT\u003e {\n```\n\nThe source `Regex` data is still available on compiled objects in case it's\nneeded for composition or other purposes.\n\n```\n public let data: Regex;\n\n public constructor(data: Regex) {\n this.data = data;\n compiled = compileFormatted(format());\n }\n```\n\nA compiled regex exposes many of the same capabilities as `Regex`, but they are\nmore efficient to use repeatedly.\n\n```\n public found(text: String): Boolean { compiledFound(compiled, text) }\n\n public find(text: String): Map\u003cString, Group\u003e | Bubble {\n compiledFind(compiled, text, regexRefs)\n }\n\n public replace(\n text: String, format: fn (Map\u003cString, Group\u003e): String\n ): String {\n compiledReplace(compiled, text, format, regexRefs)\n }\n```\n\nTODO(tjp, regex): Public method for replace with named references.\nTODO(tjp, regex): Any static checking?\n\n\u003cdetails\u003e\n\n```\n let compiled: AnyValue;\n\n // Extension functions on some backends need the private `compiled` value\n // passed in directly.\n @connected(\u0022CompiledRegex::compiledFound\u0022)\n compiledFound(compiled: AnyValue, text: String): Boolean;\n\n @connected(\u0022CompiledRegex::compiledFind\u0022)\n compiledFind(\n compiled: AnyValue, text: String, regexRefs: RegexRefs\n ): Map\u003cString, Group\u003e | Bubble;\n\n @connected(\u0022CompiledRegex::compileFormatted\u0022)\n compileFormatted(formatted: String): AnyValue;\n\n @connected(\u0022CompiledRegex::compiledReplace\u0022)\n compiledReplace(\n compiled: AnyValue,\n text: String,\n format: fn (Map\u003cString, Group\u003e): String,\n regexRefs: RegexRefs,\n ): String;\n\n @connected(\u0022CompiledRegex::format\u0022)\n format(): String { new RegexFormatter().format(data) }\n```\n\n\u003c/details\u003e\n\n```\n}\n```\n\n## Private implementation matters\n\nSome regex logic can be shared across backends. 
These features aren't directly\nexported to the user, however.\n\nThe intent is that these support features only get included in compiled Temper\ncode if usage depends on dynamically constructed regexes. If all regex building\nis done as stable values, we hope to generated backend compiled regexes purely\nat Temper compile time.\n\n### RegexFormatter\n\n\u003cdetails\u003e\n\n```\nclass RegexFormatter {\n let out: ListBuilder\u003cString\u003e = new ListBuilder\u003cString\u003e();\n\n public format(regex: Regex): String {\n pushRegex(regex)\n out.join(\u0022\u0022) { (x);; x }\n }\n\n pushRegex(regex: Regex): Void {\n match (regex) {\n // Aggregate types.\n is Capture -\u003e pushCapture(regex);\n is CodePoints -\u003e pushCodePoints(regex, false);\n is CodeRange -\u003e pushCodeRange(regex);\n is CodeSet -\u003e pushCodeSet(regex);\n is Or -\u003e pushOr(regex);\n is Repeat -\u003e pushRepeat(regex);\n is Sequence -\u003e pushSequence(regex);\n // Specials.\n // Some of these will need to be customized on future backends.\n Begin -\u003e out.add(\u0022^\u0022);\n Dot -\u003e out.add(\u0022.\u0022);\n End -\u003e out.add(\u0022\u0024\u0022);\n WordBoundary -\u003e out.add(\u0022\\\\b\u0022);\n // Special sets.\n Digit -\u003e out.add(\u0022\\\\d\u0022);\n Space -\u003e out.add(\u0022\\\\s\u0022);\n Word -\u003e out.add(\u0022\\\\w\u0022);\n // ...\n }\n }\n\n pushCapture(capture: Capture): Void {\n out.add(\u0022(\u0022);\n // TODO(tjp, regex): Consistent name validation rules for all backends???\n // TODO(tjp, regex): Validate here or in `Capture` constructor???\n // TODO(tjp, regex): Validate here or where against reused names???\n pushCaptureName(out, capture.name);\n pushRegex(capture.item);\n out.add(\u0022)\u0022);\n }\n\n @connected(\u0022RegexFormatter::pushCaptureName\u0022)\n pushCaptureName(out: ListBuilder\u003cString\u003e, name: String): Void {\n // All so far except Python use this form.\n out.add(\u0022?\u003c\u0024{name}\u003e\u0022);\n }\n\n 
pushCode(code: Int, insideCodeSet: Boolean): Void {\n // Expose private property to extension.\n pushCodeTo(out, code, insideCodeSet);\n // TODO(tjp, regex): Implement more in Temper once we can.\n // if (escapeCodes[code] \u0026\u0026 false) {\n // out.add(\u0022\\\\\u0022);\n // // TODO(tjp, regex): How to convert back to strings?\n // }\n }\n\n @connected(\u0022RegexFormatter::pushCodeTo\u0022)\n pushCodeTo(out: ListBuilder\u003cString\u003e, code: Int, insideCodeSet: Boolean): Void;\n\n pushCodePoints(codePoints: CodePoints, insideCodeSet: Boolean): Void {\n for (\n var slice = codePoints.value.codePoints;\n !slice.isEmpty;\n slice = slice.advance(1)\n ) {\n pushCode(slice.read(), insideCodeSet);\n }\n }\n\n pushCodeRange(codeRange: CodeRange): Void {\n out.add(\u0022[\u0022);\n pushCodeRangeUnwrapped(codeRange);\n out.add(\u0022]\u0022);\n }\n\n pushCodeRangeUnwrapped(codeRange: CodeRange): Void {\n pushCode(codeRange.min, true);\n out.add(\u0022-\u0022);\n pushCode(codeRange.max, true);\n }\n\n pushCodeSet(codeSet: CodeSet): Void {\n let adjusted = adjustCodeSet(codeSet, regexRefs);\n match (adjusted) {\n is CodeSet -\u003e do {\n out.add(\u0022[\u0022);\n if (adjusted.negated) {\n out.add(\u0022^\u0022);\n }\n for (var i = 0; i \u003c adjusted.items.length; i += 1) {\n pushCodeSetItem(adjusted.items[i]);\n }\n out.add(\u0022]\u0022);\n }\n else -\u003e pushRegex(adjusted);\n }\n }\n\n @connected(\u0022RegexFormatter::adjustCodeSet\u0022)\n adjustCodeSet(codeSet: CodeSet, regexRefs: RegexRefs): Regex { codeSet }\n\n pushCodeSetItem(codePart: CodePart): Void {\n match (codePart) {\n is CodePoints -\u003e pushCodePoints(codePart, true);\n is CodeRange -\u003e pushCodeRangeUnwrapped(codePart);\n is SpecialSet -\u003e pushRegex(codePart);\n }\n }\n\n pushOr(or: Or): Void {\n if (!or.items.isEmpty) {\n out.add(\u0022(?:\u0022);\n // TODO(tjp, regex): See #822. Until `this` works better, no this in funs.\n // TODO(tjp, regex): So just manually loop here. 
Sometimes faster, anyway?\n pushRegex(or.items[0]);\n for (var i = 1; i \u003c or.items.length; i += 1) {\n out.add(\u0022|\u0022);\n pushRegex(or.items[i]);\n }\n out.add(\u0022)\u0022);\n }\n }\n\n pushRepeat(repeat: Repeat): Void {\n // Always wrap the main sub-pattern here to make life easy\n out.add(\u0022(?:\u0022);\n pushRegex(repeat.item);\n out.add(\u0022)\u0022);\n // Then add the repetition part.\n let min = repeat.min;\n let max = repeat.max;\n if (false) {\n } else if (min == 0 \u0026\u0026 max == 1) {\n out.add(\u0022?\u0022);\n } else if (min == 0 \u0026\u0026 max == null) {\n out.add(\u0022*\u0022);\n } else if (min == 1 \u0026\u0026 max == null) {\n out.add(\u0022+\u0022);\n } else {\n out.add(\u0022{\u0024{min.toString()}\u0022);\n if (min != max) {\n out.add(\u0022,\u0022);\n if (max != null) {\n out.add(max.as\u003cInt\u003e().toString());\n }\n }\n out.add(\u0022}\u0022);\n }\n if (repeat.reluctant) {\n out.add(\u0022?\u0022);\n }\n }\n\n pushSequence(sequence: Sequence): Void {\n // TODO(tjp, regex): Foreach loop/function would be nice.\n for (var i = 0; i \u003c sequence.items.length; i += 1) {\n pushRegex(sequence.items[i]);\n }\n }\n\n // Put this here instead of the data model for now because I'm not sure this\n // makes sense to be part of the public api right now.\n public maxCode(codePart: CodePart): Int | Null {\n match (codePart) {\n is CodePoints -\u003e do {\n // Iterating code points is the hardest of the current cases.\n let value = codePart.value;\n if (value.isEmpty) {\n null\n } else {\n // My kingdom for a fold, or even just a max, in builtins.\n var max = 0;\n for (\n var slice = value.codePoints;\n !slice.isEmpty;\n slice = slice.advance(1)\n ) {\n let next = slice.read();\n if (next \u003e max) {\n max = next;\n }\n }\n max\n }\n }\n // Others below are easy for now.\n is CodeRange -\u003e codePart.max;\n Digit -\u003e \u00229\u0022.codePoints.read();\n Space -\u003e \u0022 \u0022.codePoints.read();\n Word -\u003e 
\u0022z\u0022.codePoints.read();\n // Actually unexpected, ever, but eh.\n else -\u003e null;\n }\n }\n}\n```\n\n\u003c/details\u003e\n" ], "names": [ "Sequence", "Regex", "items" ], "mappings": "AA+Qa,cACwB,CAAA,AADxB,GACwB,CAAA,AADxB,KACwB,CAAA;AADxB,WACwB,CAAA,AADxB,IACwB,CAAA,AADxB,IACwB,CAAA;AADxB,YACwB,MAAA,AADxB,CAAAA,QACwB,WAAA,AADxB,CAAAC,KACwB,EAAA;AAAlB,gBAAkB,AAAX,KAAW,AAAX,CAAA,AAAPA,KAAkB,CAAA,AAAlB,CAAAC,KAAkB,CAAA;AADD,UAAA,AAAvB,CAAAF,QAAA,CACwB,AAAX,IAAW,AAAX,CAAAC,KAAW,CAAA,AAAlB,CAAAC,UAAkB,CAAA,AADD;AACjB,aAAAA,KAAK,AAAL,EAAK,AAAL,CAAAA,UAAK;AAAA,KAAa;AAAlB,gBAAAD,KAAA;AAAA,oBAAAC,KAAA;AAAA;AAAA" }
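The `mappings` field above uses the Base64 VLQ encoding defined by the version 3 source map format. As a rough illustration only, the sketch below decodes the first segment of that field, `AA+Qa`. The helper name `decode_vlq` is hypothetical and is not part of Temper or of any source map library. It handles a single segment, not the `,` and `;` separators of the full field.

```python
from typing import List

# Standard Base64 alphabet used by source map VLQ digits.
BASE64_ALPHABET = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789+/"
)


def decode_vlq(segment: str) -> List[int]:
    """Decode one Base64 VLQ segment into its signed integer fields.

    Hypothetical helper for illustration; it assumes well-formed input.
    """
    values = []
    shift = 0
    accumulator = 0
    for char in segment:
        digit = BASE64_ALPHABET.index(char)
        # The low 5 bits carry data; bit 6 (value 32) marks continuation.
        accumulator |= (digit & 0b11111) << shift
        if digit & 0b100000:
            shift += 5
            continue
        # The lowest bit of the assembled value is the sign bit.
        sign = -1 if accumulator & 1 else 1
        values.append(sign * (accumulator >> 1))
        shift = 0
        accumulator = 0
    return values


# First segment of the "mappings" string shown above.
print(decode_vlq("AA+Qa"))  # [0, 0, 271, 13]
```

Read as source map fields, the result is generated column 0, source index 0 (`std/regex.temper.md`), zero-based source line 271, and source column 13. Column 13 is where a name would begin after the 13-character prefix `export class `, which fits the `Sequence` entry in `names`, though the exact source line was not checked here.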