Use regular expressions to parse image data text files. by erictzeng · Pull Request #1971 · BVLC/caffe · GitHub

Use regular expressions to parse image data text files. #1971

Open. erictzeng wants to merge 1 commit into master.

Conversation

erictzeng
Contributor

Fixes #1951.

This pull request consists of two changes:

  1. Rather than the brittle ifstream method of parsing image data files, this pull request uses regular expressions for more robust matching.
  2. Previously, the parsing code was duplicated across two files, tools/convert_imageset.cpp and src/caffe/layers/image_data_layer.cpp. This pull request pulls that common code out into a new function in src/caffe/util/io.cpp for ease of maintenance.
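
To illustrate why item 1 calls the stream-based approach brittle, here is a minimal sketch of a whitespace-delimited extraction loop. It is an assumption about the general shape of such code, not a quotation from either file, and `ReadListWithStreams` is a hypothetical name.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper (not taken from the pull request): reads
// "filename label" pairs with plain stream extraction.
std::vector<std::pair<std::string, int>> ReadListWithStreams(std::istream& in) {
  std::vector<std::pair<std::string, int>> entries;
  std::string filename;
  int label;
  // operator>> stops at the first whitespace character, so a filename
  // containing a space is cut short. The following int extraction then
  // fails, the stream enters a failed state, and the loop ends early.
  while (in >> filename >> label) {
    entries.emplace_back(filename, label);
  }
  return entries;
}
```

Given a list whose second line is `file name with spaces.jpg 2`, this sketch keeps only the entries before that line and silently drops everything after it, which is the failure mode reported in #1951.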

More details follow.

Each line of the input text file is matched against the following regular expression:

\h*("?)(.+?)\1\h+(\d+)\h*

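Feel free to play around with an interactive version so you can test it out and see what it matches. The first group captures an optional opening quote, the backreference \1 requires the same character (or nothing) after the filename, the lazy (.+?) captures the filename itself, and (\d+) captures the label. This regular expression handles a lot of cases that would've been difficult to handle using the previous naive approach. It captures whitespace within a filename, and allows filenames to be quoted in case, for some insane reason, a file name begins with a space.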

Some concrete examples of really degenerate cases that will parse correctly:

file name with spaces.jpg 1
" file_name_with_leading_space.jpg" 2
file_name_with_"_symbol.jpg 3
" really disgusting " file  ""name  .jpg" 4
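
The examples above can be checked with a small standalone sketch. The pull request uses Boost.Regex, whose Perl syntax supports `\h`. To stay dependency-free, this sketch uses `std::regex` instead and substitutes `[ \t]` for `\h`, which only approximates it (space and tab). The function name `ParseImageListLine` is hypothetical and is not the name used in `src/caffe/util/io.cpp`.

```cpp
#include <regex>
#include <string>
#include <utility>

// Hypothetical helper: parses one line of an image list file.
// Returns true and fills *entry with (filename, label) on a full-line match.
bool ParseImageListLine(const std::string& line,
                        std::pair<std::string, int>* entry) {
  // Assumption: [ \t] stands in for \h, which std::regex does not support.
  // Group 1: optional opening quote. Group 2: filename (lazy).
  // \1: the same quote again, or nothing. Group 3: integer label.
  static const std::regex kPattern(
      R"re([ \t]*("?)(.+?)\1[ \t]+(\d+)[ \t]*)re");
  std::smatch match;
  // regex_match requires the whole line to match, not just a substring.
  if (!std::regex_match(line, match, kPattern)) {
    return false;
  }
  entry->first = match[2].str();
  // Note: std::stoi throws std::out_of_range for labels too large for int.
  entry->second = std::stoi(match[3].str());
  return true;
}
```

Because the whole line must match, the lazy filename group keeps growing until what follows it is the closing quote (if one was opened), whitespace, and a trailing integer. That is why the fourth example yields everything between the outermost quotes as the filename.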

One drawback is that this introduces boost_regex as an additional dependency. However, since we already require Boost, this seems like an acceptable tradeoff.

Implementation-wise, this pull request should be complete, though it's lacking tests, which I will get around to writing at some point in the near future.

@shelhamer
Member

@erictzeng this looks right -- thanks for fixing the brittle format -- but I think you need to update the travis script to install boost regex: https://github.com/BVLC/caffe/blob/master/scripts/travis/travis_install.sh.

@bchu
Contributor
bchu commented Mar 30, 2016

Any updates on this?

Development

Successfully merging this pull request may close these issues.

convert_imageset doesn't handle file names with spaces