8000 dot_parser.py: Why does ID not have the `float_number` production rule? · Issue #436 · pydot/pydot · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

dot_parser.py: Why does ID not have the float_number production rule? #436

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
loonatick-src opened this issue Nov 8, 2024 · 4 comments · May be fixed by #420
Open

dot_parser.py: Why does ID not have the float_number production rule? #436

loonatick-src opened this issue Nov 8, 2024 · 4 comments · May be fixed by #420

Comments

@loonatick-src
Copy link
loonatick-src commented Nov 8, 2024

From the DOT language reference:

An ID is one of the following:

  • Any string of alphabetic ([a-zA-Z\200-\377]) characters, underscores ('_') or digits([0-9]), not beginning with a digit;
  • a numeral [-]?(.[0-9]⁺ | [0-9]⁺(.[0-9]*)? );
  • any double-quoted string ("...") possibly containing escaped quotes (")¹;
  • an HTML string (<...>).

An ID is just a string; the lack of quote characters in the first two forms is just for simplicity. There is no semantic difference between abc_2 and "abc_2", or between 2.34 and "2.34". Obviously, to use a keyword as an ID, it must be quoted.

I see that that the second rule (numeral) is omitted from ID dot_parser.py and there is instead a righthand ID that satisfies the complete definition. Is there a particular reason for this?

@loonatick-src loonatick-src changed the title dot_parser.py: Why does ID not have a float_number the float_number production rule? dot_parser.py: Why does ID not have the float_number production rule? Nov 8, 2024
@ferdnyc
Copy link
Member
ferdnyc commented Nov 8, 2024

A lot of things in dot_parser are Very Legacy™, its parsing of identifiers being one of them.

the ID definition is actually based on identifier, which is defined earlier in the grammar as:

identifier = Word(
    pyparsing_unicode.BasicMultilingualPlane.alphanums + "_."
).setName("identifier")

The addition of _ and . allows it to support most floating-point values, though (a) it can't be a negative value, and (b) nothing enforces the fact that an alphanumeric string can't start with a digit.

I suspect it was defined that way partly because ID was also used for both sides of the port definition, and the part of port after the colon can't be a floating-point value. This part of the parser definition would break, if ID were fully defined to accept all legal graphviz syntax:

port = (
    Group(Group(colon + ID) + Group(colon + ID))
    | Group(Group(colon + ID))
).setName("port")
node_id = ID + Optional(port)

But you're correct that the parser definition excludes valid graphviz syntax, as a result.

Both pydot and graphviz will accept this:

graph G { 1.14 [color=red]; }

but pydot will balk at this perfectly valid graph, which graphviz also accepts:

graph G { -1.14 [color=red]; }

...The - at the start of -1.14 breaks dot_parser.py. So, that's yet another parser bug; fun!

@ferdnyc
Copy link
Member
ferdnyc commented Nov 8, 2024

I suspect it was defined that way partly because ID was also used for both sides of the port definition, and the part of port after the colon can't be a floating-point value.

Actually, that's not true — the part immediately after the colon can be, the compass_pt that follows it can't. The fact that ID is used to parse compass_pt, when compass_pt can only ever truly be one of ~9 fixed strings, is definitely a legacy parser bug. It seems like whoever wrote that part of the grammar just punted on the port definition.

@ferdnyc
Copy link
Member
ferdnyc commented Nov 8, 2024

Part of the issue is that dot_parser.py wasn't based on the graphviz grammar presented in the current documentation; it was based on the one from this PDF (which, despite its 2015 datestamp, I believe dates back about a decade earlier) — you'll notice it much more closely resembles the logic in dot_parser, despite not matching the actual grammar used by graphviz today.

https://graphviz.org/pdf/dotguide.pdf#page=34

...id is never even DEFINED in that grammar. At all! It's given a vague and unhelpful textual description: "An id is any alphanumeric string not beginning with a digit, but possibly including underscores; or a number; or any quoted string possibly containing escaped quotes."

(We recently ripped out dot_parser's support for things like this:

port → port-location [port-angle] | port-angle [port-location]
port-location → ’:’ id | ’:’ ’(’ id ’,’ id ’)’
port-angle → ’@’ id

Because, despite being documented in the grammar, port-angle and the tuple form of port-location were never actually implemented anywhere in graphviz, so they were just complicating the parser for no reason.)

loonatick-src added a commit to loonatick-src/graph-diff that referenced this issue Nov 8, 2024
@ferdnyc ferdnyc linked a pull request Dec 6, 2024 that will close this issue
@ferdnyc
Copy link
Member
ferdnyc commented Dec 15, 2024

Once I actually get it right (since it makes somewhat more sweeping changes to the parser definition, intended to shore up our support for unquoting quoted strings when possible — like graphviz itself does — among other things), this will also be fixed by #420.

Also, @loonatick-src, if you're using a patched pydot (which it appears that you are), and assuming it's the version from loonatick-src/pydot@bb33fbe, I'd highly recommend also picking up commits 83c6b87 and aff9908 from my WIP #420 PR, or reimplementing them in your branch.

Those two commits:

  1. Enforce, in the parser, that identifiers must start with an alpha character or an underscore.
  2. Correct the parser definition of float_number. The OneOrMore in the existing definition turned out to be a problem when parsing dotted-number strings (like version numbers). Because this:
    float_number = Combine(
        Optional(minus) + OneOrMore(Word(nums + "."))
    )
    ...will attempt to treat -3.4.5 or 1.2.3 as floats, which they most certainly are not. The updated definition uses:
    float_number = Combine(
        Optional("-")
        + (
            "." - Word(nums)
            | Word(nums) + Optional("." + Optional(Word(nums)))
        )
    )
    ...which is much more correct, as it will only permit these forms: #, -#, .#, -.#, #., -#., #.#, -#.# (where # is any series of digits, regardless of length).

The - joiner for grammar elements, instead of +, is also something I've been trying to use where appropriate — it represents a requirement that the element on the left, when encountered, must be followed by the element on the right.

IOW, when Optional("-") + ("." + Word(nums)) encounters something like -.a, it parses the -, followed by ., and then because the next token (a) doesn't match Word(nums), it'll backtrack and start trying other grammar rules to find one that can successfully parse all of the tokens in -.a.

By instead using Optional("-") + "." - Word(nums), it says that a . (optionally preceded by a -) must be followed by a token matching Word(nums), or the result is a parse error. Makes the grammar less ambiguous, since there's no scenario in which something like .red or -.= would be valid input in the same location where a float is accepted.

(I have another commit that does the same thing with the = sign when parsing attributes.) In an attribute list, after an ID the = sign itself is optional (node1 [some_flag]; is a valid node statement), but if it's present it must be followed by another ID. And in attribute assignments, a non-optional = is also required to be followed by an ID. Which ensures the parser won't try to find some way to parse either of the inner lines of this graph as valid:

graph G {
  some_node [color=, style=box];
  rank=;
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants
0