public abstract class TregexPattern extends Object implements Serializable
The pattern language is based on tgrep and tgrep2. However, unlike these tree pattern matching systems, but like Unix grep, there is no pre-indexing of the data to be searched. Rather, there is a linear scan through the trees where matches are sought. As a result, matching is slower, but a TregexPattern can be applied to an arbitrary set of trees at runtime in a processing pipeline without pre-indexing.
TregexPattern instances can be matched against instances of the Tree class.
The main(java.lang.String[]) method can be used to find matching nodes of a treebank from the command line.
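Take the pattern /^MW/ < IN, which matches a node whose label starts with MW and which immediately dominates an IN.
We then create a pattern, find matches in a given tree, and process those matches as follows: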
// Create a reusable pattern object
TregexPattern patternMW = TregexPattern.compile("/^MW/ < IN");
// Run the pattern on one particular tree
TregexMatcher matcher = patternMW.matcher(tree);
// Iterate over all of the subtrees that matched
while (matcher.findNextMatchingNode()) {
    Tree match = matcher.getMatch();
    // do what we want to do with the subtree
    match.pennPrint();
}
Symbol | Meaning |
---|---|
A << B | A dominates B |
A >> B | A is dominated by B |
A < B | A immediately dominates B |
A > B | A is immediately dominated by B |
A $ B | A is a sister of B (and not equal to B) |
A .. B | A precedes B |
A . B | A immediately precedes B |
A ,, B | A follows B |
A , B | A immediately follows B |
A <<, B | B is a leftmost descendant of A |
A <<- B | B is a rightmost descendant of A |
A >>, B | A is a leftmost descendant of B |
A >>- B | A is a rightmost descendant of B |
A <, B | B is the first child of A |
A >, B | A is the first child of B |
A <- B | B is the last child of A |
A >- B | A is the last child of B |
A <` B | B is the last child of A |
A >` B | A is the last child of B |
A <i B | B is the ith child of A (i > 0) |
A >i B | A is the ith child of B (i > 0) |
A <-i B | B is the ith-to-last child of A (i > 0) |
A >-i B | A is the ith-to-last child of B (i > 0) |
A <: B | B is the only child of A |
A >: B | A is the only child of B |
A <<: B | A dominates B via an unbroken chain (length > 0) of unary local trees. |
A >>: B | A is dominated by B via an unbroken chain (length > 0) of unary local trees. |
A $++ B | A is a left sister of B (same as $.. for context-free trees) |
A $-- B | A is a right sister of B (same as $,, for context-free trees) |
A $+ B | A is the immediate left sister of B (same as $. for context-free trees) |
A $- B | A is the immediate right sister of B (same as $, for context-free trees) |
A $.. B | A is a sister of B and precedes B |
A $,, B | A is a sister of B and follows B |
A $. B | A is a sister of B and immediately precedes B |
A $, B | A is a sister of B and immediately follows B |
A <+(C) B | A dominates B via an unbroken chain of (zero or more) nodes matching description C |
A >+(C) B | A is dominated by B via an unbroken chain of (zero or more) nodes matching description C |
A .+(C) B | A precedes B via an unbroken chain of (zero or more) nodes matching description C |
A ,+(C) B | A follows B via an unbroken chain of (zero or more) nodes matching description C |
A <<# B | B is a head of phrase A |
A >># B | A is a head of phrase B |
A <# B | B is the immediate head of phrase A |
A ># B | A is the immediate head of phrase B |
A == B | A and B are the same node |
A <= B | A and B are the same node or A is the parent of B |
A : B | [this is a pattern-segmenting operator that places no constraints on the relationship between A and B] |
A <... { B ; C ; ... } | A has exactly B, C, etc as its subtree, with no other children. |
AbstractTreebankLanguagePack.getBasicCategoryFunction()
.
Note that Label description regular expressions are matched as find(), as in Perl/tgrep, not as matches(); you need to use ^ or $ to constrain matches to the ends of strings.
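The difference between find() and matches() can be seen with plain java.util.regex, independent of Tregex. The sketch below is only an illustration of the two matching modes; the class and method names (LabelRegexDemo, findsIn, matchesWhole) are hypothetical and not part of the Tregex API.

```java
import java.util.regex.Pattern;

class LabelRegexDemo {
    // find() semantics: the regex may match anywhere inside the label.
    static boolean findsIn(String regex, String label) {
        return Pattern.compile(regex).matcher(label).find();
    }

    // matches() semantics: the regex must span the entire label.
    static boolean matchesWhole(String regex, String label) {
        return Pattern.compile(regex).matcher(label).matches();
    }

    public static void main(String[] args) {
        // An unanchored regex finds "NP" inside a longer label.
        System.out.println(findsIn("NP", "WHNP-1"));
        // Whole-string matching rejects the same label.
        System.out.println(matchesWhole("NP", "WHNP-1"));
        // Anchoring with ^ and $ makes find() behave like matches().
        System.out.println(findsIn("^NP$", "WHNP-1"));
        // Anchoring only the start still allows functional tags after NP.
        System.out.println(findsIn("^NP", "NP-SBJ"));
    }
}
```

Under find() semantics, a label description such as /NP/ would therefore also pick up labels like WHNP-1, which is why the anchors matter.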
(S < VP < NP) means "an S over a VP and also over an NP".
Nodes can be grouped using parentheses '(' and ')' as in S < (NP $++ VP) to match an S over an NP, where the NP has a VP as a right sister.
So, if instead what you want is an S above a VP above an NP, you must write "S < (VP < NP)".
Node B "follows" node A if B or one of its ancestors is a right sibling of A or one of its ancestors. Node B "immediately follows" node A if B follows A and there is no node C such that B follows C and C follows A.
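The two definitions above can be checked on a small tree. The following is a minimal sketch on a toy Node class with parent pointers; it is not how Tregex implements these relations, and every name in it (FollowsDemo, Node, follows, immediatelyFollows) is hypothetical. The sample tree is (S (NP DT NN) (VP VBD (NP NNS))).

```java
import java.util.ArrayList;
import java.util.List;

class FollowsDemo {
    static final class Node {
        final String label;
        Node parent;
        final List<Node> children = new ArrayList<>();
        Node(String label, Node... kids) {
            this.label = label;
            for (Node k : kids) { k.parent = this; children.add(k); }
        }
    }

    // The node itself, then its ancestors up to the root.
    static List<Node> selfAndAncestors(Node n) {
        List<Node> chain = new ArrayList<>();
        for (Node cur = n; cur != null; cur = cur.parent) chain.add(cur);
        return chain;
    }

    // B follows A if B (or an ancestor of B) is a right sibling
    // of A (or of an ancestor of A).
    static boolean follows(Node b, Node a) {
        for (Node x : selfAndAncestors(a)) {
            for (Node y : selfAndAncestors(b)) {
                if (x.parent != null && x.parent == y.parent) {
                    List<Node> sibs = x.parent.children;
                    if (sibs.indexOf(y) > sibs.indexOf(x)) return true;
                }
            }
        }
        return false;
    }

    static void collect(Node n, List<Node> out) {
        out.add(n);
        for (Node c : n.children) collect(c, out);
    }

    // B immediately follows A if B follows A and no node C lies between them.
    static boolean immediatelyFollows(Node b, Node a, Node root) {
        if (!follows(b, a)) return false;
        List<Node> all = new ArrayList<>();
        collect(root, all);
        for (Node c : all) {
            if (follows(b, c) && follows(c, a)) return false;
        }
        return true;
    }

    // Sample tree: (S (NP DT NN) (VP VBD (NP NNS)))
    static final Node DT = new Node("DT");
    static final Node NN = new Node("NN");
    static final Node VBD = new Node("VBD");
    static final Node NNS = new Node("NNS");
    static final Node NP = new Node("NP", DT, NN);
    static final Node OBJ = new Node("NP", NNS);
    static final Node VP = new Node("VP", VBD, OBJ);
    static final Node ROOT = new Node("S", NP, VP);

    public static void main(String[] args) {
        System.out.println(follows(VBD, NN));                  // VBD follows NN
        System.out.println(immediatelyFollows(VBD, NN, ROOT)); // nothing lies between them
        System.out.println(immediatelyFollows(NNS, NN, ROOT)); // VBD lies between them
    }
}
```

Note that under these definitions both VP and VBD immediately follow NN, since neither VP nor VBD follows the other.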
A dominates B through an unbroken chain of unary local trees only if A is also unary. (A (B)) is a valid example that matches A <<: B.
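When specifying that nodes are related via an unbroken chain of nodes matching a description C (as in A <+(C) B), the description C cannot be a full Tregex expression, but only an expression specifying the name of the node. Negation of this description is allowed.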
== has the same precedence as the other relations, so the expression A << B == A << C associates as (((A << B) == A) << C), not as ((A << B) == (A << C)). (Both expressions are equivalent, of course, but this is just an example.)
(NP < NN | < NNS) will match an NP node dominating either an NN or an NNS. (NP > S & $++ VP) matches an NP that is both under an S and has a VP as a right sister.
Expressions stop evaluating as soon as the result is known. For example, if the pattern is NP=a | NNP=b and the NP matches, then variable b will not be assigned even if there is an NNP in the tree.
Relations can be grouped using brackets '[' and ']'. So the expression
NP [< NN | < NNS] & > S
matches an NP that (1) dominates either an NN or an NNS, and (2) is under an S. Without
brackets, & takes precedence over |, and equivalent operators are left-associative. Also note that & is the default combining operator if the operator is omitted in a chain of relations, so that the following two patterns are equivalent:
(S < VP < NP)
(S < VP & < NP)
As another example, (VP < VV | < NP % NP) can be written explicitly as (VP [< VV | [< NP & % NP] ] )
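Because an omitted operator defaults to &, every relation in a chain such as (S < VP < NP) is evaluated against the first node, which differs from the nested form S < (VP < NP). The sketch below contrasts the two readings on a toy tree class. It is an illustration under simplified assumptions (exact label equality, immediate children only), not the Tregex matcher, and all names in it are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

class ChainSemanticsDemo {
    static final class Node {
        final String label;
        final List<Node> children;
        Node(String label, Node... children) {
            this.label = label;
            this.children = Arrays.asList(children);
        }
    }

    // True if node has an immediate child with the given label.
    static boolean hasChild(Node node, String label) {
        for (Node c : node.children) {
            if (c.label.equals(label)) return true;
        }
        return false;
    }

    // Reading of (S < VP < NP), i.e. (S < VP & < NP):
    // both relations are relative to the S node.
    static boolean flatChain(Node s) {
        return s.label.equals("S") && hasChild(s, "VP") && hasChild(s, "NP");
    }

    // Reading of S < (VP < NP): the NP must be a child of the VP.
    static boolean nested(Node s) {
        if (!s.label.equals("S")) return false;
        for (Node c : s.children) {
            if (c.label.equals("VP") && hasChild(c, "NP")) return true;
        }
        return false;
    }

    // (S (NP) (VP (VBD))): the S has both an NP child and a VP child.
    static final Node SUBJECT_ONLY =
        new Node("S", new Node("NP"), new Node("VP", new Node("VBD")));

    // (S (VP (VBD) (NP))): the only NP sits under the VP.
    static final Node OBJECT_ONLY =
        new Node("S", new Node("VP", new Node("VBD"), new Node("NP")));

    public static void main(String[] args) {
        System.out.println(flatChain(SUBJECT_ONLY) + " " + nested(SUBJECT_ONLY));
        System.out.println(flatChain(OBJECT_ONLY) + " " + nested(OBJECT_ONLY));
    }
}
```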
(NP !< NNP) matches only NPs not dominating an NNP. Label descriptions can also be negated with '!': (NP < !NNP|NNS) matches NPs dominating some node that is not an NNP or an NNS.
To match a label description against only the basic category of a node, prefix the description with the @ symbol. For example: (@NP < @/NN.?/)
This can only be used for individual nodes; if you want all nodes to use the basic category, it would be more efficient to use a TreeNormalizer to remove functional tags before passing the tree to the TregexPattern.
S : NP matches only those S nodes in trees that also have an NP node.
(NP < NNP=name) will match an NP dominating an NNP and, after a match is found, the map can be queried with the name to retrieve the matched node using TregexMatcher.getNode(String o)
with (String) argument "name" (TregexParseException
to be thrown. Named nodes
(@NP <, (@NP $+ (/,/ $+ (@NP $+ /,/=comma))) <- =comma)
matches only an NP dominating exactly the four-node sequence NP , NP , (the mother NP cannot have any other daughters). Multiple backreferences are allowed. If a node with no node description does not refer to a previously named node, there is no error; the expression simply will not match anything.
Another way to refer to previously named nodes is with the "link" symbol: '~'.
A link is like a backreference, except that instead of having to be equal to the
referred node, the current node only has to match the label of the referred to node.
A link cannot have a node description, i.e. the '~' symbol must immediately follow a
relation symbol.
The HeadFinder used for the head relations <#, >#, <<#, and >>#, and also the Function mapping from labels to Basic Category tags, can be chosen by using a TregexPatternCompiler.
/ <regex-stuff> /#<group-number>%<variable-name>
For example, the pattern (designed for Penn Treebank trees)
@SBAR < /^WH.*-([0-9]+)$/#1%index << (__=empty < (/^-NONE-/ < /^\*T\*-([0-9]+)$/#1%index))
will match only an SBAR in which the WH- node under the SBAR is coindexed with the trace node that gets the name empty.
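The coindexation check in that pattern amounts to requiring that capture group 1 of both regular expressions yields the same string. The sketch below reproduces just that comparison with java.util.regex on bare label strings. It does not use Tregex, and the class and helper names (VariableGroupDemo, capture, coindexed) are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class VariableGroupDemo {
    // The two label regexes from the pattern above, as Java string literals.
    static final Pattern WH = Pattern.compile("^WH.*-([0-9]+)$");
    static final Pattern TRACE = Pattern.compile("^\\*T\\*-([0-9]+)$");

    // Returns capture group 1 if the regex is found in the label, else null.
    static String capture(Pattern regex, String label) {
        Matcher m = regex.matcher(label);
        return m.find() ? m.group(1) : null;
    }

    // Both labels must bind the shared variable, and to the same value.
    static boolean coindexed(String whLabel, String traceLabel) {
        String whIndex = capture(WH, whLabel);
        String traceIndex = capture(TRACE, traceLabel);
        return whIndex != null && whIndex.equals(traceIndex);
    }

    public static void main(String[] args) {
        System.out.println(coindexed("WHNP-1", "*T*-1")); // same index
        System.out.println(coindexed("WHNP-2", "*T*-1")); // different indices
        System.out.println(coindexed("WHNP", "*T*-1"));   // WH label has no index
    }
}
```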
A | B will not work.
Consider the two patterns /(.*)/#1%foo and /(.*)/#1%bar. You might then want to write a pattern that matches the concatenation of these patterns, /(.*)(.*)/#1%foo#2%bar, but that will not work.
Modifier and Type | Class and Description |
---|---|
static class | TregexPattern.TRegexTreeReaderFactory |
Modifier and Type | Method and Description |
---|---|
static TregexPattern | compile(String tregex): Creates a pattern from the given string using the default HeadFinder and BasicCategoryFunction. |
static void | main(String[] args): Prints out all matches of a tree pattern on each tree in the path. |
TregexMatcher | matcher(Tree t): Get a TregexMatcher for this pattern on this tree. |
TregexMatcher | matcher(Tree t, HeadFinder headFinder): Get a TregexMatcher for this pattern on this tree. |
String | pattern() |
void | prettyPrint(): Print a multi-line representation of the pattern illustrating its syntax to System.out. |
void | prettyPrint(PrintStream ps): Print a multi-line representation of the pattern illustrating its syntax. |
void | prettyPrint(PrintWriter pw): Print a multi-line representation of the pattern illustrating its syntax. |
static TregexPattern | safeCompile(String tregex, boolean verbose): Creates a pattern from the given string using the default HeadFinder and BasicCategoryFunction. |
abstract String | toString() |
public TregexMatcher matcher(Tree t)
Get a TregexMatcher for this pattern on this tree.
Parameters:
t - a tree to match on
public TregexMatcher matcher(Tree t, HeadFinder headFinder)
Get a TregexMatcher for this pattern on this tree. Any Relations which use heads of trees should use the provided HeadFinder.
Parameters:
t - a tree to match on
headFinder - a HeadFinder to use when matching
public static TregexPattern compile(String tregex)
Creates a pattern from the given string using the default HeadFinder and BasicCategoryFunction. To use a different HeadFinder or BasicCategoryFunction, use a TregexPatternCompiler object.
Parameters:
tregex - the pattern string
Throws:
TregexParseException - if the string does not parse
public static TregexPattern safeCompile(String tregex, boolean verbose)
Creates a pattern from the given string using the default HeadFinder and BasicCategoryFunction. To use a different HeadFinder or BasicCategoryFunction, use a TregexPatternCompiler object. Rather than throwing an exception when the string does not parse, simply returns null.
Parameters:
tregex - the pattern string
verbose - whether to log errors when the string doesn't parse
public String pattern()
public abstract String toString()
public void prettyPrint(PrintWriter pw)
public void prettyPrint(PrintStream ps)
public void prettyPrint()
public static void main(String[] args) throws IOException
java edu.stanford.nlp.trees.tregex.TregexPattern [[-TCwfosnu] [-filter] [-h <node-name>]]* pattern filepath
Arguments:
pattern: the tree pattern, which optionally names some set of nodes (i.e., gives them a "handle") with =name (for some arbitrary string "name")
filepath: the path to files with trees. If this is a directory, there will be recursive descent and the pattern will be run on all files beneath the specified directory.
-C suppresses printing of matches, so only the number of matches is printed.
-w causes the whole of a tree that matches to be printed.
-f causes the filename to be printed.
-i <filename> causes the pattern to be matched to be read from <filename> rather than the command line. Don't specify a pattern when this option is used.
-o specifies that each tree node can be reported only once as the root of a match (by default a node will be printed once for every way the pattern matches).
-s causes trees to be printed all on one line (by default they are pretty printed).
-n causes the number of the tree in which the match was found to be printed before every match.
-u causes only the label of each matching node to be printed, not complete subtrees.
-t causes only the yield (terminal words) of the selected node to be printed (or the yield of the whole tree, if the -w option is used).
-encoding <charset_encoding> allows specification of the character encoding of trees.
-h <node-handle> If a -h option is given, the root tree node will not be printed. Instead, for each node-handle specified, the node matched and given that handle will be printed. Multiple nodes can be printed by using the -h option multiple times on a single command line.
-hf <headfinder-class-name> use the specified HeadFinder class to determine headship relations.
-hfArg <string> pass a string argument in to the HeadFinder class's constructor. -hfArg can be used multiple times to pass in multiple arguments.
-trf <TreeReaderFactory-class-name> use the specified TreeReaderFactory class to read trees from files.
-e <extension> Only attempt to read files with the given extension. If not provided, will attempt to read all files.
-v print every tree that contains no matches of the specified pattern, but print no matches to the pattern.
-x Instead of the matched subtree, print the matched subtree's identifying number as defined in tgrep2: a unique identifier for the subtree in the form s:n, where s is an integer specifying the sentence number in the corpus (starting with 1), and n is an integer giving the order in which the node is encountered in a depth-first search starting with 1 at the top node in the sentence tree.
-extract <code> <tree-file> extracts the subtree s:n specified by code from the specified tree-file. Overrides all other behavior of tregex. Can't specify multiple encodings etc. yet.
-extractFile <code-file> <tree-file> extracts every subtree specified by the subtree codes in code-file, which must appear exactly one per line, from the specified tree-file. Overrides all other behavior of tregex. Can't specify multiple encodings etc. yet.
-filter causes this to act as a filter, reading tree input from stdin.
-T causes all trees to be printed as processed (for debugging purposes). Otherwise only matching nodes are printed.
-macros <filename> a file of macro substitutions to use, with one substitution per line in the form original, tab, replacement.
Throws: IOException
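The s:n subtree codes described for the -x option can be illustrated by numbering the nodes of one tree. The sketch below assumes that "the order in which the node is encountered in a depth-first search" means a pre-order, left-to-right traversal. It works on a toy Node class rather than the Tree class, and all names in it (SubtreeCodeDemo, codes, number, sample) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SubtreeCodeDemo {
    static final class Node {
        final String label;
        final List<Node> children;
        Node(String label, Node... children) {
            this.label = label;
            this.children = Arrays.asList(children);
        }
    }

    // Visit the node first, then its children left to right (pre-order),
    // handing out consecutive numbers from the shared counter.
    static void number(Node node, int sentence, int[] next, List<String> out) {
        int n = next[0]++;
        out.add(sentence + ":" + n + " " + node.label);
        for (Node child : node.children) {
            number(child, sentence, next, out);
        }
    }

    // Returns one "s:n label" entry per node; numbering starts at 1 at the root.
    static List<String> codes(Node root, int sentence) {
        List<String> out = new ArrayList<>();
        number(root, sentence, new int[] {1}, out);
        return out;
    }

    // Sample tree: (S (NP DT NN) (VP VBD))
    static Node sample() {
        return new Node("S",
            new Node("NP", new Node("DT"), new Node("NN")),
            new Node("VP", new Node("VBD")));
    }

    public static void main(String[] args) {
        // Treat the sample as the first sentence of a corpus.
        for (String line : codes(sample(), 1)) {
            System.out.println(line);
        }
    }
}
```

For the sample tree treated as sentence 1, the root S receives the code 1:1 and the VP receives 1:5 under this assumed traversal order.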