Tika HTML File Extraction
To extract the content of an HTML file, Tika uses HtmlParser, a class that extracts both the content and the metadata of an HTML document. The class is located in the org.apache.tika.parser.html package. Its constructors and methods are listed in the tables below.
Tika HtmlParser Constructor
Constructor | Description |
---|---|
public HtmlParser() | It creates an instance of the class. |
public HtmlParser(EncodingDetector encodingDetector) | It creates an instance of the class that uses the given EncodingDetector. |
Tika HtmlParser Methods
Method | Description |
---|---|
public Set<MediaType> getSupportedTypes(ParseContext context) | It returns the set of media types supported by this parser when used with the given parse context. |
public void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException | It parses a document stream into a sequence of XHTML SAX events. |
protected String mapSafeElement(String name) | It maps safe HTML element names to their semantic XHTML equivalents. |
protected boolean isDiscardElement(String name) | It checks whether all content within the given HTML element should be discarded instead of including it in the parse output. |
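public String mapSafeAttribute(String elementName, String attributeName) | It maps a safe HTML attribute name of the given element to its XHTML equivalent. It is deprecated: use the HtmlMapper mechanism to customize the HTML mapping instead. |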
@Field public void setExtractScripts(boolean extractScripts) | It sets whether or not to extract the contents of script elements. |
public boolean getExtractScripts() | It returns whether the parser is set to extract the contents of script elements. |
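
The parse method listed above can be used roughly as follows. This is a minimal sketch, not a definitive implementation: it assumes the Tika parser libraries are on the classpath, and the class name HtmlParserExample, the helper method extractText, and the sample HTML string are hypothetical names chosen for illustration. It uses BodyContentHandler from org.apache.tika.sax to collect the body text and a Metadata object to collect whatever metadata the parser reports.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.html.HtmlParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public class HtmlParserExample {

    // Hypothetical helper: parses an HTML string with HtmlParser and returns
    // the text of its body. Metadata found by the parser is stored in the
    // Metadata object passed in by the caller.
    static String extractText(String html, Metadata metadata)
            throws IOException, SAXException, TikaException {
        BodyContentHandler handler = new BodyContentHandler();
        try (InputStream stream =
                new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8))) {
            new HtmlParser().parse(stream, handler, metadata, new ParseContext());
        }
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        // Sample document (an assumption made for this sketch).
        String html = "<html><head><title>Sample page</title></head>"
                + "<body><h1>Hello</h1><p>Tika parses HTML.</p></body></html>";

        Metadata metadata = new Metadata();
        String text = extractText(html, metadata);

        System.out.println("Content: " + text.trim());

        // Print every metadata entry the parser recorded.
        for (String name : metadata.names()) {
            System.out.println(name + " = " + metadata.get(name));
        }
    }
}
```

To parse a file instead of a string, the ByteArrayInputStream can be replaced with a FileInputStream opened on the HTML file.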