Tika Html File Extraction

To extract the content of an HTML file, Tika uses HtmlParser, a class that extracts both the content and the metadata of an HTML document. The class is located in the org.apache.tika.parser.html package. Its constructors and methods are listed in the tables below.

Tika HtmlParser Constructor

Constructor | Description
public HtmlParser() | Creates an instance of the class.
public HtmlParser(EncodingDetector encodingDetector) | Creates an HtmlParser instance that uses the given EncodingDetector to determine the character encoding of the input.
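
As a minimal sketch of the no-argument constructor in use, the following program parses a small in-memory HTML string and prints the extracted body text and metadata. It assumes that Tika core and the Tika HTML parser module are on the classpath. The class name HtmlParseExample, the helper extractText, and the sample HTML are made up for illustration. BodyContentHandler is one of several handlers that ship with Tika, and any other ContentHandler could be used instead.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.html.HtmlParser;
import org.apache.tika.sax.BodyContentHandler;

// Hypothetical example class; not part of Tika.
public class HtmlParseExample {

    // Parses the given HTML and returns the text of its body.
    // Metadata discovered during parsing is written into 'metadata'.
    static String extractText(String html, Metadata metadata) throws Exception {
        HtmlParser parser = new HtmlParser();                   // no-argument constructor
        BodyContentHandler handler = new BodyContentHandler();  // collects body text
        try (InputStream stream =
                 new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8))) {
            parser.parse(stream, handler, metadata, new ParseContext());
        }
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        // Sample input, invented for this sketch.
        String html = "<html><head><title>Sample page</title></head>"
                    + "<body><p>Hello, Tika!</p></body></html>";

        Metadata metadata = new Metadata();
        String text = extractText(html, metadata);

        System.out.println("Body text: " + text.trim());
        // Print whatever metadata the parser found; the exact keys depend on the Tika version.
        for (String name : metadata.names()) {
            System.out.println(name + ": " + metadata.get(name));
        }
    }
}
```

To read from a file instead, the ByteArrayInputStream could be swapped for a FileInputStream over the HTML file; the rest of the sketch stays the same.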

Tika HtmlParser Methods

Method | Description
public Set<MediaType> getSupportedTypes(ParseContext context) | Returns the set of media types supported by this parser when used with the given parse context.
public void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException | Parses a document stream into a sequence of XHTML SAX events, which are delivered to the given handler.
protected String mapSafeElement(String name) | Maps safe HTML element names to their semantic XHTML equivalents.
protected boolean isDiscardElement(String name) | Checks whether all content within the given HTML element should be discarded instead of being included in the parse output.
public String mapSafeAttribute(String elementName, String attributeName) | Returns the mapped name of a safe attribute of the given element. The Javadoc directs callers to the HtmlMapper mechanism for customizing the HTML mapping.
@Field public void setExtractScripts(boolean extractScripts) | Sets whether the contents of script elements should be extracted.
public boolean getExtractScripts() | Returns whether extraction of script contents is currently enabled.
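
Because parse reports its result as SAX events, the caller decides what to do with them by supplying a ContentHandler. The sketch below shows what such a handler can look like: it records the element names and the text it receives. To stay runnable with only the JDK, it is driven here by the JDK's own SAX parser on a well-formed XHTML string rather than by Tika. The names SaxEventDemo, CollectingHandler, and collect are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; uses only JDK classes, not Tika.
public class SaxEventDemo {

    // Collects element names and character data from SAX events.
    static class CollectingHandler extends DefaultHandler {
        final List<String> elements = new ArrayList<>();
        final StringBuilder text = new StringBuilder();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            elements.add(qName);             // one entry per opening tag
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);  // text between tags
        }
    }

    // Feeds the XHTML string through the JDK SAX parser (a stand-in for Tika here).
    static CollectingHandler collect(String xhtml) throws Exception {
        CollectingHandler handler = new CollectingHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler;
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><head><title>Demo</title></head>"
                     + "<body><p>Hello</p><p>World</p></body></html>";
        CollectingHandler h = collect(xhtml);
        System.out.println("Elements: " + h.elements);
        System.out.println("Text: " + h.text);
    }
}
```

Since DefaultHandler implements org.xml.sax.ContentHandler, a handler written this way could in principle be passed as the handler argument of HtmlParser.parse. The exact events Tika emits for a given page may differ from what the JDK parser produces for this sample.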


About Mariano

I'm Ethan Mariano, a software engineer by profession and a reader/writer by passion. I have a good understanding of AngularJS, databases, JavaScript, web development, and digital marketing, and I enjoy exploring other technologies related to software development.
