8000 Enhance parser and name-lookup with better quoting by ferdnyc · Pull Request #420 · pydot/pydot · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

Enhance parser and name-lookup with better quoting #420

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 36 commits into
base: main
Choose a base branch
from

Conversation

ferdnyc
Copy link
Member
@ferdnyc ferdnyc commented Sep 20, 2024

This is Pydot 4.0 work for sure, since its changes are pretty sweeping

Stripping quotes during parsing

In #413 I mentioned that it would enhance round-tripping of our graphs through dot and back again if the parser unquoted input strings where possible. This PR starts the work of doing that.

Because not all strings can be unquoted, instead of enabling unquote_results in pyparsing.QuotedString(), a new parse action possibly_unquote is attached to the double_quoted_string parser element that handles making that decision. It makes sure that any quoted strings stay quoted that need to: double-quoted strings containing what would otherwise parse as an HTML-like string, or as an ID with port values, etc. If it will confuse the pydot code when sent in unquoted, it's left quoted. Otherwise, it's unquoted, which matches dot's own behavior in parsing.

In addition, the lookup functions get_node(), get_subgraph(), and get_edge() are enhanced to also try the quoted versions of their ID arguments, if passed unquoted strings.1 So, we're better protected both coming and going.

The make_quoted utility function will no longer translate newlines and carriage returns into \n and \r — it's unnecessary, as dot quoted strings can contain the raw control characters without issue, and our parser can consume them as well. So, leaving them be will further enhance our ability to round-trip graphs through graphviz commands with better fidelity.

Fixes #413

Notes

  1. For get_edge() in particular, that became extremely complex, because each endpoint has to be separately checked for a possible quoted form, and tested with or without a quoted form of the other endpoint.
    So, for example, get_edge('abc', 'def') will try:

    • In a directed graph,
      1. ('abc', 'def')
      2. ('"abc"', '"def"')
    • In an undirected graph,
      1. ('abc', 'def')
      2. ('"abc"', '"def"')
      3. ('def', 'abc')
      4. ('"def"', '"abc"')

    But when one side is already quoted, the quoted form of the other side still has to be tested as well. So, get_edge('<<i>html</i>>', 'def') will try:

    • In a directed graph,
      1. ('<<i>html</i>>', 'def')
      2. ('<<i>html</i>>', '"def"')
    • In an undirected graph,
      1. ('<<i>html</i>>', 'def')
      2. ('<<i>html</i>>', '"def"')
      3. ('def', '<<i>html</i>>')
      4. ('"def"', '<<i>html</i>>')

    That code is ugly, but I dropped in a long-ass explanatory comment which will hopefully help.

Improved numeric parsing in dot_parser

  1. Negative-value floats will be recognized as such — previously, we didn't allow a leading - sign.
  2. Floats are included in the grammar for ID values, and are legal unquoted in all places where graphviz allows them.
  3. The parser now correctly requires that unquoted identifiers start with a non-digit, meaning an alpha character or _. This was already true in the quoting logic, but now the parser does the right thing too.
  4. The idea of a "right-hand ID" (that is, an ID after an equals sign, in an assignment) is eliminated — IDs are IDs are IDs. There was never supposed to be any difference in the grammar.

The upshot of all those changes is that, in main:

>>> import pydot.dot_parser as dp; gp = dp.graph_definition()
>>> # This was previously an error
>>> g = gp.parse_string(
...     'graph G { -5 [color=red, length=-4.5, width=5]; }')[0]
>>> # And this was previously...
>>> g = gp.parse_string(
...     'graph G { "-5" [color=red, length="-4.5", width=5]; }'
... )[0]
>>> print(g.to_string())
graph G {
"-5" [color=red, length="-4.5", width=5];
}

On this branch:

>>> import pydot.dot_parser as dp; gp = dp.graph_definition()
>>> g = gp.parse_string(
...     'graph G { -5 [color=red, length=-4.5, width=5]; }'
... )[0]
>>> print(g.to_string())
graph G {
-5 [color=red, length=-4.5, width=5];
}
>>> g = gp.parse_string(
...     'graph G { "-5" [color=red, length="-4.5", width=5]; }'
... )[0]
>>> print(g.to_string())
graph G {
-5 [color=red, length=-4.5, width=5];
}

Much more in line with graphviz.

Due to these changes, this PR now...

Fixes: #436

Move constants to pydot.constants

A new file constants.py is added to pydot, containing all of the CONSTANT_LIST_VARIABLES previously defined in pydot.core. (The actual reason for this move is, I also need the DOT_KEYWORDS list in dot_parser now.)

Because all of the constants in pydot.constants are currently imported into pydot.core, the __init__.py file's from pydot.core import * import will still pick them up, which means that things like pydot.NODE_ATTRIBUTES and pydot.OUTPUT_FORMATS will still be defined even after this change.

However, I'm tempted to change that (e.g. by having pydot.core use from pydot import constants, and refer to the constants as constants.CONSTANT_NAME), in order to get the constants lists out of the root pydot namespace. They were never really intended to be referenced from the pydot. level.

@ferdnyc
Copy link
Member Author
ferdnyc commented Sep 20, 2024

Aargh, Well, that failure's for a completely different reason: I forgot you can't do this in Python 3.8:

with (
    open(some_file) as in1,
    open(some_file) as in2
):

(It can be done, but it can't be wrapped in parens and made multi-line until Python 3.10.)

dot_parser has always "cheated" our quoting logic by passing in
values already quoted. That means, for all of the graphs we run
through it in testing, it's not really testing our quoting logic
at all. Remove the quotes from quoted values during parsing, so
that our quoting logic is tested, and so that users who run
graphs through `dot -Tdot` and parse them back in won't get
data with all of the identifiers changed.
If TEST_ERROR_DIR is set when running the tests, and a failure is
encountered in one of the file-driven tests, write out both the
original file and the `.to_string()` for the parsed version.
@ferdnyc ferdnyc force-pushed the parser-suppress-quotes branch from 1382663 to 1037179 Compare September 30, 2024 11:29
@ferdnyc
Copy link
Member Author
ferdnyc commented Sep 30, 2024

Oh! Turns out, the b53.dot failure was due to the missing . quoting from #419, since I started this branch before that was merged. As soon as I rebased this on master, the tests went back to all-green.

This still shouldn't be merged until we're sure we're ready to start development on Pydot 4.0.0, but it's no longer a WIP as it's now passing all tests, same as main. (All adjusted tests, since the expected results of some of the API tests has to be modified for the new get_foo() logic.)

@ferdnyc ferdnyc marked this pull request as ready for review September 30, 2024 18:04
@ferdnyc
Copy link
Member Author
ferdnyc commented Sep 30, 2024

Oh, right, but I still can't do this in Python < 3.10.

with (
    open(some_file) as in1,
    open(some_file) as in2
):

Copy link
github-actions bot commented Sep 30, 2024

Coverage report

Click to see where and how coverage changed

FileStatementsMissingCoverageCoverage
(new stmts)
Lines missing
  src/pydot
  constants.py
  core.py 1007, 1038, 1068, 1134
  dot_parser.py 421
  utils.py 30, 94, 109, 123, 125, 127, 143
  test
  test_api.py
  test_pydot.py 49, 54
Project Total  

This report was generated by python-coverage-comment-action

Copy link
github-actions bot commented Oct 6, 2024

This pull request has conflicts, please resolve those so that the changes can be evaluated.

Copy link

All conflicts have been resolved, thanks!

As long as the strings are quoted, both graphviz and our parser can
handle embedded newlines, so there's no reason to escape them. Makes
round-trips through graphviz commands less disruptive.
Copy link

This pull request has conflicts, please resolve those so that the changes can be evaluated.

@ferdnyc ferdnyc force-pushed the parser-suppress-quotes branch from 955c43e to a1f522f Compare January 29, 2025 04:47
@ferdnyc
Copy link
Member Author
ferdnyc commented Jan 29, 2025

I think this might be done. Even if there's more I could do, it's big enough already that I'd rather save anything further for a separate PR.

Oh, I need to add tests for the whole make-sure-the-parser-doesn't-unquote-keyword-names thing I spotted, and fixed, above. Which I'd rather put in test_parser.py, so I think I'll wait until #452 is merged (which creates that file).

Copy link

This pull request has conflicts, please resolve those so that the changes can be evaluated.

Copy link

All conflicts have been resolved, thanks!

Copy link
github-actions bot commented Apr 5, 2025

This pull request has conflicts, please resolve those so that the changes can be evaluated.

@github-actions github-actions bot removed the conflicts label Apr 5, 2025
Copy link
github-actions bot commented Apr 5, 2025

All conflicts have been resolved, thanks!

@ferdnyc
Copy link
Member Author
ferdnyc commented Apr 5, 2025

I think this branch is now good, but I'm not sure how reviewable it is anymore. I may have to re-build it on top of the current main branch HEAD, to make it comprehensible.

Copy link

This pull request has conflicts, please resolve those so that the changes can be evaluated.

@Peiffap
Copy link
Peiffap commented May 29, 2025

Hi! I've been going through the blame for the fix to #396, and I've ended up on this draft PR. I'm looking at this stuff because in networkx/networkx#7648, someone found that long node names can break stuff. My understanding is that the changes in this PR should enable us to fix that issue as well (and that until they land, there's not much we can do), but I'm nowhere near qualified enough to say for sure. If you have the time, could you take a quick look at that?

@lkk7
Copy link
Member
lkk7 commented May 30, 2025

@Peiffap I'm surprised to see this issue hanging around for a year, it was never reported here in pydot. Thanks for reporting.

I'm not sure how up-to-date this PR is. It could be outdated.
I'll personally only have time to look at the problem on Saturday/Sunday, but maybe @ferdnyc will see it more quickly

And I'm assuming you've tried with the newest pydot (v4), right?

@Peiffap
Copy link
Peiffap commented May 30, 2025

Yes, I'm on v4.0.0.
No problem! Do you want me to create an issue for this? (Edit: #483.)

I'm not sure how up to date it is either - but at least GitHub thinks it can be cleanly merged, which is a nice start.
In the meantime, I've found a pretty minimal workaround/fix, in networkx/networkx#8079, but I'm not sure the "graph nodes are the last nodes in Q.get_nodes(), in their original order" hypothesis always holds. Perhaps one of you can take a look and let me know when you have the time!

@lkk7
Copy link
Member
lkk7 commented May 30, 2025

Yes, you can create an issue, or I'll create it, whatever. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

dot_parser.py: Why does ID not have the float_number production rule? The pydot dot_parser, and round-tripping through dot
3 participants
0