For a future Android project, we need to extract structured text content from documents in as many formats as possible.
An initial feasibility check for the Android platform showed that there is a lot of interest and that several solutions are being discussed.
It is well known that some parser solutions already exist here in the forum. However, most of them are designed for one specific file format. What we are looking for is a single solution that covers as many formats as possible.
Our search for the most suitable solution led to this product:
Apache Tika - a content analysis toolkit
The Apache Tika™ toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF). All of these file types can be parsed through a single interface, making Tika useful for search engine indexing, content analysis, translation, and much more.
The data is extracted from the source document using a parser API.
The Parser interface is the key concept of Apache Tika. It hides the complexity of different file formats and parsing libraries while providing a simple and powerful mechanism for client applications to extract structured text content and metadata from all sorts of documents. All this is achieved with a single method:
void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException;
The parse method takes the document to be parsed and related metadata as input and outputs the results as XHTML SAX events and extra metadata. The parse context argument is used to specify context information (like the current locale) that is not related to any individual document.
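To make the "results as XHTML SAX events" contract more concrete, here is a minimal sketch that uses only classes from the JDK (org.xml.sax is also part of the Android platform). It does not use Tika itself: SketchParser, PlainTextSketchParser, TextCollector and extractText are hypothetical names invented for this illustration, and a plain Map stands in for Tika's Metadata and ParseContext types. It only shows the division of work: the parser reads the stream and emits events, and the client's ContentHandler decides what to keep.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

class ParseContractSketch {

    static final String XHTML = "http://www.w3.org/1999/xhtml";

    /** Hypothetical stand-in with the same shape as the quoted method; NOT Tika's Parser. */
    interface SketchParser {
        void parse(InputStream stream, ContentHandler handler, Map<String, String> metadata)
                throws IOException, SAXException;
    }

    /** Toy parser: reads UTF-8 plain text and emits one XHTML <p> per non-empty line. */
    static class PlainTextSketchParser implements SketchParser {
        @Override
        public void parse(InputStream stream, ContentHandler handler, Map<String, String> metadata)
                throws IOException, SAXException {
            // "Extra metadata" produced by the parser (assumed key name).
            metadata.put("Content-Type", "text/plain");

            AttributesImpl noAttrs = new AttributesImpl();
            BufferedReader reader =
                    new BufferedReader(new InputStreamReader(stream, StandardCharsets.UTF_8));

            handler.startDocument();
            handler.startElement(XHTML, "html", "html", noAttrs);
            handler.startElement(XHTML, "body", "body", noAttrs);
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.trim().isEmpty()) {
                    continue; // skip blank lines
                }
                handler.startElement(XHTML, "p", "p", noAttrs);
                char[] chars = line.toCharArray();
                handler.characters(chars, 0, chars.length);
                handler.endElement(XHTML, "p", "p");
            }
            handler.endElement(XHTML, "body", "body");
            handler.endElement(XHTML, "html", "html");
            handler.endDocument();
        }
    }

    /** Client-side handler: collects the text, ending each paragraph with a newline. */
    static class TextCollector extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("p".equals(localName)) {
                text.append('\n');
            }
        }
    }

    /** Runs the toy parser over a byte array and returns the collected text. */
    static String extractText(byte[] document, Map<String, String> metadata) {
        TextCollector collector = new TextCollector();
        try {
            new PlainTextSketchParser().parse(new ByteArrayInputStream(document), collector, metadata);
        } catch (IOException | SAXException e) {
            throw new IllegalStateException("parse failed", e);
        }
        return collector.text.toString();
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new LinkedHashMap<>();
        byte[] doc = "First line\n\nSecond line\n".getBytes(StandardCharsets.UTF_8);
        System.out.print(extractText(doc, metadata));
        System.out.println(metadata);
    }
}
```

With the real library, the client side would presumably look much the same: the application passes its own ContentHandler to a Tika parser instead of to the toy parser above. Whether that works unchanged on Android is exactly the open question below.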
Since our team lacks sufficient expertise for the realisation, the following questions go to the community:
a) Am I right that Apache Tika can, in principle, be used in Android apps?
b) Would it be possible to create a B4X wrapper for such a seemingly large package?