Tags: vuamitom/goose
Major: DefaultOutputFormatter#getFormattedText now unescapes HTML, including all HTML entities.

Minor: I have begun converting the usages of DefaultOutputFormatter so that only a single method is needed: getFormattedText(Element topNode).

Bug fixes:
* Cleaning by class name was too restrictive and removed actual content elements. The list of names now only removes classes that end in "meta" instead of any class that contains the word "meta".
* Modified DefaultDocumentCleaner#cleanBadTags to only select from within the body element, to avoid removing the body element itself.
* Added a helper method for removing nodes that handles the case where the node's parentNode is null (already removed). This previously threw an IllegalArgumentException from within jsoup, which caused the extraction to fail.
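The difference between the old and new class-name rule in the first bug fix can be illustrated with a small, self-contained sketch. This is not the actual goose code: the class `MetaClassFilter`, its method names, the sample class names, and the case-insensitive matching are all assumptions made for illustration, and each input is treated as a single class name rather than a full `class` attribute.

```java
import java.util.regex.Pattern;

// Hypothetical illustration only; not the pattern list used by goose.
public class MetaClassFilter {
    // Old behaviour (assumed): match any class name containing "meta".
    private static final Pattern CONTAINS_META =
            Pattern.compile("meta", Pattern.CASE_INSENSITIVE);

    // New behaviour (assumed): match only class names ending in "meta".
    private static final Pattern ENDS_WITH_META =
            Pattern.compile("meta$", Pattern.CASE_INSENSITIVE);

    public static boolean matchesContains(String className) {
        return CONTAINS_META.matcher(className).find();
    }

    public static boolean matchesEndsWith(String className) {
        return ENDS_WITH_META.matcher(className).find();
    }

    public static void main(String[] args) {
        // Sample class names chosen for illustration.
        String[] names = {"post-meta", "metadata-body", "entry-content"};
        for (String name : names) {
            System.out.printf("%-15s contains=%b endsWith=%b%n",
                    name, matchesContains(name), matchesEndsWith(name));
        }
    }
}
```

Under these assumptions, a class such as `metadata-body` would have been removed by the "contains" rule but is kept by the "ends in" rule, while `post-meta` is removed by both.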