matches missing from figures__xrefs view · Issue #5 · wikipathways/pathway-figure-ocr · GitHub

matches missing from figures__xrefs view #5


Open
AlexanderPico opened this issue Apr 10, 2018 · 6 comments
Labels
bug Something isn't working

Comments

@AlexanderPico
Member

In this example, Cyclin E/A is successfully matched, added to success.txt and match_attempts, but it's missing from figures__xrefs. Here are the results from a query against match_attempts:

pfocr=# select * from match_attempts join transformed_words on transformed_words.id=transformed_word_id where figure_id=769 and transformed_word not like 'dummy%' limit 100;
 ocr_processor_id | figure_id |   word    | transformed_word_id |         transforms_applied          |   id   | transformed_word 
------------------+-----------+-----------+---------------------+-------------------------------------+--------+------------------
                6 |       769 | p16       |              475262 | -n stop                             | 475262 | p16
                6 |       769 | INK4      |              473787 | -n stop                             | 473787 | INK4
                6 |       769 | Mol       |              475266 | -n stop                             | 475266 | MOL
                6 |       769 | CDK       |              463989 | -n stop                             | 463989 | CDK
                6 |       769 | SCF       |              464414 | -n stop                             | 464414 | SCF
                6 |       769 | CDK2      |              464337 | -n stop                             | 464337 | CDK2
                6 |       769 | Suv39H1   |              475294 | -n stop                             | 475294 | SUV39H1
                6 |       769 | SIN3A     |              475295 | -n stop                             | 475295 | SIN3A
                6 |       769 | CyclinE/A |              475305 | -n stop -n nfkc -n deburr -m expand | 475305 | CYCLINA
                6 |       769 | CyclinE/A |              464335 | -n stop -n nfkc -n deburr -m expand | 464335 | CYCLINE
                6 |       769 | E2F/1/2/3 |              463979 | -n stop -n nfkc -n deburr -m expand | 463979 | E2F
                6 |       769 | DHFR      |              475308 | -n stop                             | 475308 | DHFR
                6 |       769 | PCNA      |              475309 | -n stop                             | 475309 | PCNA
                6 |       769 | H2A       |              475310 | -n stop                             | 475310 | H2A

Everything is pulled into the view just fine except for the two CyclinE/A rows. I'm guessing some sort of uniqueness constraint is being applied to the word column when the view is constructed? Though it's odd that it excludes both rows and not just one, right?
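
One way to list exactly which matches never reach the view is an anti-join from match_attempts to figures__xrefs. The following is a minimal sketch only: it runs against an in-memory SQLite database rather than the real pfocr database, and the `figures__xrefs` stand-in table and its columns (`figure_id`, `transformed_word_id`) are assumptions, since the actual view definition is not shown in this issue. The sample rows are copied from the figure 769 output above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE transformed_words (id INTEGER PRIMARY KEY, transformed_word TEXT);
CREATE TABLE match_attempts (figure_id INTEGER, word TEXT, transformed_word_id INTEGER);
-- Stand-in for the real view; its definition and columns are assumptions.
CREATE TABLE figures__xrefs (figure_id INTEGER, transformed_word_id INTEGER);

INSERT INTO transformed_words VALUES
  (464337, 'CDK2'), (475305, 'CYCLINA'), (464335, 'CYCLINE');
INSERT INTO match_attempts VALUES
  (769, 'CDK2', 464337), (769, 'CyclinE/A', 475305), (769, 'CyclinE/A', 464335);
-- Mimic the reported symptom: only CDK2 made it into the view.
INSERT INTO figures__xrefs VALUES (769, 464337);
""")

# Anti-join: keep match_attempts rows that have no counterpart in the view.
missing = conn.execute(
    """
    SELECT ma.word, tw.transformed_word
    FROM match_attempts ma
    JOIN transformed_words tw ON tw.id = ma.transformed_word_id
    LEFT JOIN figures__xrefs fx
      ON fx.figure_id = ma.figure_id
     AND fx.transformed_word_id = ma.transformed_word_id
    WHERE ma.figure_id = ? AND fx.figure_id IS NULL
    ORDER BY tw.transformed_word
    """,
    (769,),
).fetchall()

print(missing)
```

Against the real database the same LEFT JOIN ... IS NULL pattern would need the view's actual join columns substituted in.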

@AlexanderPico AlexanderPico added the bug Something isn't working label Apr 10, 2018
@AlexanderPico
Member Author
AlexanderPico commented Apr 10, 2018

Somehow, in contrast to the example above, this case with two identical words behaves just fine: the individual hits, AKT1 and AKT2, are properly included in figures__xrefs, so it's not simply a matter of excluding non-unique words...

pfocr=# select * from match_attempts join transformed_words on transformed_words.id=transformed_word_id where figure_id=566 and transformed_word not like 'dummy%' limit 100;
 ocr_processor_id | figure_id |  word  | transformed_word_id |         transforms_applied          |   id   | transformed_word 
------------------+-----------+--------+---------------------+-------------------------------------+--------+------------------
                6 |       566 | PI3K   |              462515 | -n stop                             | 462515 | PI3K
                6 |       566 | Akt1/2 |              464705 | -n stop -n nfkc -n deburr -m expand | 464705 | AKT1
                6 |       566 | Akt1/2 |              465819 | -n stop -n nfkc -n deburr -m expand | 465819 | AKT2
                6 |       566 | JNK2   |              465589 | -n stop                             | 465589 | JNK2
                6 |       566 | CIDEA  |              472387 | -n stop                             | 472387 | CIDEA
                6 |       566 | CIDEC  |              472388 | -n stop                             | 472388 | CIDEC

@AlexanderPico
Member Author
AlexanderPico commented Apr 12, 2018

Another case where a match did NOT get pulled into the figures__xrefs view, "CyclinD1":

select * from match_attempts join transformed_words on transformed_words.id=transformed_word_id where figure_id=2026 and transformed_word not like 'dummy%' limit 100;

 ocr_processor_id | figure_id |   word   | transformed_word_id |                          transforms_applied                          |   id   | transformed_word 
------------------+-----------+----------+---------------------+----------------------------------------------------------------------+--------+------------------
                6 |      2026 | CB1      |              476682 | -n stop                                                              | 476682 | CB1
                6 |      2026 | PI3K     |              462515 | -n stop                                                              | 462515 | PI3K
                6 |      2026 | GSK-3β   |              462915 | -n stop -n nfkc -n deburr -m expand -m root -n swaps -n alphanumeric | 462915 | GSK3
                6 |      2026 | D1       |              463085 | -n stop                                                              | 463085 | D1
                6 |      2026 | CyclinD1 |              464644 | -n stop                                                              | 464644 | CYCLIND1

@AlexanderPico
Member Author
AlexanderPico commented Apr 12, 2018

And another, "NF-KB":

select * from match_attempts join transformed_words on transformed_words.id=transformed_word_id where figure_id=1875 and transformed_word not like 'dummy%' limit 100;
 ocr_processor_id | figure_id |      word       | transformed_word_id |                          transforms_applied                          |   id   | transformed_word 
------------------+-----------+-----------------+---------------------+----------------------------------------------------------------------+--------+------------------
                6 |       958 | PI3K/AKTpathway |              462515 | -n stop -n nfkc -n deburr -m expand                                  | 462515 | PI3K
                6 |       958 | PI3K/AKT        |              462522 | -n stop -n nfkc -n deburr -m expand                                  | 462522 | AKT
                6 |       958 | p38             |              462651 | -n stop                                                              | 462651 | p38
                6 |       958 | JNK             |              462633 | -n stop                                                              | 462633 | JNK
                6 |       958 | ERK             |              462776 | -n stop                                                              | 462776 | ERK
                6 |       958 | ROS             |              463928 | -n stop                                                              | 463928 | ROS
                6 |       958 | mTOR            |              463184 | -n stop                                                              | 463184 | MTOR
                6 |       958 | NF-KB           |              462632 | -n stop                                                              | 462632 | NF-KB
                6 |       958 | XIAP            |              463990 | -n stop                                                              | 463990 | XIAP
                6 |       958 | -(PTEN          |              463396 | -n stop -n nfkc -n deburr -m expand -m root -n swaps -n alphanumeric | 463396 | PTEN

@AlexanderPico
Member Author
AlexanderPico commented Apr 12, 2018

Another case with "NF-KB":

select * from match_attempts join transformed_words on transformed_words.id=transformed_word_id where figure_id=3247 and transformed_word not like 'dummy%' limit 100;
 ocr_processor_id | figure_id |  word  | transformed_word_id |         transforms_applied          |   id   | transformed_word 
------------------+-----------+--------+---------------------+-------------------------------------+--------+------------------
                6 |      3247 | RXFP2  |              505535 | -n stop                             | 505535 | RXFP2
                6 |      3247 | Akt    |              462522 | -n stop                             | 462522 | AKT
                6 |      3247 | PYK2   |              469751 | -n stop                             | 469751 | PYK2
                6 |      3247 | AC     |              462893 | -n stop                             | 462893 | AC
                6 |      3247 | CRAF   |              470309 | -n stop                             | 470309 | CRAF
                6 |      3247 | PKA    |              463347 | -n stop                             | 463347 | PKA
                6 |      3247 | IkBa   |              467857 | -n stop                             | 467857 | IKBA
                6 |      3247 | PKC    |              463219 | -n stop                             | 463219 | PKC
                6 |      3247 | NF-KB  |              462632 | -n stop                             | 462632 | NF-KB
                6 |      3247 | MEK1/2 |              463892 | -n stop -n nfkc -n deburr -m expand | 463892 | MEK1
                6 |      3247 | MEK1/2 |              463893 | -n stop -n nfkc -n deburr -m expand | 463893 | MEK2
                6 |      3247 | ERK1/2 |              462520 | -n stop -n nfkc -n deburr -m expand | 462520 | ERK1
                6 |      3247 | ERK1/2 |              462521 | -n stop -n nfkc -n deburr -m expand | 462521 | ERK2

...but why does it match before the hyphen is removed? The lexicon only contains "NFKB".

@ariutta
Member
ariutta commented Apr 13, 2018

The symbols table doesn't contain anything starting with "CYCLIN":

SELECT * FROM symbols WHERE symbol LIKE 'CYC%';

(edit: but does have items starting with "Cyclin")

@ariutta
Member
ariutta commented Apr 14, 2018

Turns out it was the non-alphanumeric characters like dashes.
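
To illustrate why a dash can break a join on symbols, here is a minimal sketch, assuming the comparison is an exact string match. It is not the project's actual transform code: `normalize_symbol` and the toy `lexicon` are hypothetical names introduced only for this example. The sketch upper-cases each word and strips everything that is not an ASCII letter or digit before looking it up.

```python
import re
import unicodedata


def normalize_symbol(text: str) -> str:
    """Apply NFKC, drop non-alphanumeric ASCII characters, and upper-case."""
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"[^A-Za-z0-9]", "", text).upper()


# Toy stand-in for the lexicon; the issue says it contains "NFKB", not "NF-KB".
lexicon = {"NFKB", "PTEN"}

for word in ["NF-KB", "-(PTEN", "XIAP"]:
    key = normalize_symbol(word)
    # An exact lookup of the raw word fails wherever a dash or bracket remains.
    print(f"{word!r}: raw hit={word in lexicon}, normalized={key!r}, hit={key in lexicon}")
```

If both sides of a join are normalized the same way, "NF-KB" and "NFKB" compare equal; if only one side is, the row silently drops out.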
