Tika HTML File Extraction

To extract the content of an HTML file, Tika uses HtmlParser, a class that extracts both the content and the metadata of an HTML document. The class is located in the org.apache.tika.parser.html package. Its constructors and methods are listed in the tables below.

Tika HtmlParser Constructor

Constructor	Description
public HtmlParser()	It creates an instance of the class.
public HtmlParser(EncodingDetector encodingDetector)	It creates an instance of HtmlParser that uses the given EncodingDetector.

Tika HtmlParser Methods

Method	Description
public Set<MediaType> getSupportedTypes(ParseContext context)	It returns the set of media types supported by this parser when used with the given parse context.
public void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException	It parses a document stream into a sequence of XHTML SAX events.
protected String mapSafeElement(String name)	It maps safe HTML element names to semantic XHTML equivalents.
protected boolean isDiscardElement(String name)	It checks whether all content within the given HTML element should be discarded instead of including it in the parse output.
public String mapSafeAttribute(String elementName, String attributeName)	It returns the safe attribute name to use for the given attribute of the given element. The Tika Javadoc marks this method as deprecated and recommends the HtmlMapper mechanism for customizing the HTML mapping.
@Field public void setExtractScripts(boolean extractScripts)	It sets whether the contents of script elements are extracted.
public boolean getExtractScripts()	It returns whether extraction of script contents is enabled.
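
The parse method in the table above reports the document as XHTML SAX events to the ContentHandler it is given. The following is a minimal sketch of such a handler, one that collects only the text. To keep it runnable without Tika on the classpath, it is driven here by the JDK's own SAX parser on a small, well-formed XHTML string. The class name SaxTextCollector and the helper extractText are hypothetical names chosen for this illustration; they are not part of Tika.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: gathers character data from SAX events.
// DefaultHandler implements org.xml.sax.ContentHandler, the handler
// type that HtmlParser.parse(...) accepts.
class SaxTextCollector extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Separate the text of adjacent elements with a single space.
        if (text.length() > 0 && text.charAt(text.length() - 1) != ' ') {
            text.append(' ');
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    String getText() {
        return text.toString().trim();
    }

    // Stand-in driver (an assumption made for this sketch): feeds well-formed
    // XHTML through the JDK's SAX parser. With Tika available, the same handler
    // could instead be passed to
    //   new HtmlParser().parse(stream, handler, new Metadata(), new ParseContext());
    static String extractText(String xhtml) {
        SaxTextCollector handler = new SaxTextCollector();
        InputStream stream = new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8));
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(stream, handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse input", e);
        }
        return handler.getText();
    }

    public static void main(String[] args) {
        String xhtml = "<html><head><title>Demo</title></head>"
                + "<body><p>Hello</p><p>World</p></body></html>";
        System.out.println(extractText(xhtml));
    }
}
```

Unlike the JDK parser used in this sketch, HtmlParser is meant for real-world HTML that is not well-formed XML, so in practice only the driver changes while the handler stays the same.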


About Mariano

I'm Ethan Mariano, a software engineer by profession and a reader and writer by passion. I have a good understanding of AngularJS, databases, JavaScript, web development, and digital marketing, and I enjoy exploring other technologies related to software development.
