Data extraction is the process of collecting unstructured data of various types from many different sources. Much of this data is unorganized or has a non-standard structure, but data processing techniques can refine and consolidate most of it so that it can be transformed, stored, and used in later stages. The data can be stored on-site, in cloud storage, or in a hybrid solution that combines the two.
Data extraction is the first step in both ETL (extract, transform, load) and ELT (extract, load, transform) processes, which can themselves be part of a complete data integration strategy. It is core to any data transformation, integration, or refining process, and it is useful to companies in virtually any sector or industry for purposes such as storage transfer, merging databases, further data analysis, or simply archiving.
Zextras External Content Extractor
The external content extractor detects and extracts text and metadata from many different file types (including PPT, XLS, and PDF) and parses them through a single interface. Its functionality is therefore not limited to extracting unstructured data: it also parses and transforms the data, making it ready for purposes such as content analysis, archiving, search engine indexing, and storage transfer.
Apache Tika is a powerful framework for extracting and analyzing data, developed by the Apache Software Foundation. It has been used extensively in academic research and by various major corporations to analyze large amounts of content and to transform data into common formats using information retrieval techniques.
Tika is written in Java and provides a Java library, a server, and command-line tools, so it can also be used from other programming languages. It can identify over 1400 file types from the Internet Assigned Numbers Authority (IANA) taxonomy of MIME types. For the more popular formats, Tika also provides content extraction, metadata extraction, and language identification.
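As a rough illustration of using Tika from another language, the following Python sketch sends a document to a standalone Tika server over HTTP and reads back the extracted text, the metadata, or the detected MIME type. It assumes a Tika server is listening on its default port 9998 on the local host. The helper names (build_request, extract_text, extract_metadata, detect_type) are made up for this example and are not part of Tika or Zextras.

```python
import json
import urllib.request

# Assumption: a Tika server is reachable here (9998 is tika-server's default port).
TIKA_URL = "http://localhost:9998"


def build_request(endpoint, data, accept, base_url=TIKA_URL):
    """Build a PUT request that sends raw document bytes to a Tika server endpoint."""
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/{endpoint.lstrip('/')}",
        data=data,
        method="PUT",
        headers={
            "Accept": accept,
            # Generic type, so the server detects the real format from the content.
            "Content-Type": "application/octet-stream",
        },
    )


def _send(req, timeout=30):
    """Send the request and return the response body as text."""
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def extract_text(data, base_url=TIKA_URL):
    """Return the plain text extracted from the document bytes."""
    return _send(build_request("tika", data, "text/plain", base_url))


def extract_metadata(data, base_url=TIKA_URL):
    """Return the document metadata as a dictionary."""
    return json.loads(_send(build_request("meta", data, "application/json", base_url)))


def detect_type(data, base_url=TIKA_URL):
    """Return the MIME type the server detects for the document bytes."""
    return _send(build_request("detect/stream", data, "text/plain", base_url)).strip()
```

With a server running, reading a file in binary mode and passing its bytes to extract_text or extract_metadata returns the parsed content. The same three calls work for PPT, XLS, and PDF files alike, which is the "single interface" idea in practice.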
Why Apache Tika?
The Tika library used by Zextras runs in the same Java Virtual Machine (JVM) as the mailbox. Alternatively, you can run one or more separate Tika servers that extract content outside the mailbox process, so that if a Tika server crashes, the mailbox JVM remains intact.
The Tika server can run as a Docker container, on the same server as the mailbox, or on any separate server accessible by Zimbra.
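The benefit of keeping extraction in separate processes can be sketched from the client side: if one Tika server is down, the caller moves on to the next one and keeps running. The Python example below is a minimal illustration of that idea only. It does not show how Zextras selects or balances Tika servers, and the function names and server URLs are hypothetical.

```python
import logging
import urllib.request


def fetch_text(base_url, data, timeout=30):
    """Ask one Tika server for the plain text of the document bytes."""
    req = urllib.request.Request(
        url=f"{base_url.rstrip('/')}/tika",
        data=data,
        method="PUT",
        headers={"Accept": "text/plain", "Content-Type": "application/octet-stream"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def extract_with_failover(data, servers, fetch=fetch_text):
    """Try each server in order and return (server, text) from the first that answers.

    Returns (None, None) if every server fails; the caller itself keeps running.
    """
    for server in servers:
        try:
            return server, fetch(server, data)
        except OSError as exc:
            # URLError, timeouts, and refused connections all derive from OSError.
            logging.warning("Tika server %s failed: %s", server, exc)
    return None, None
```

For example, calling extract_with_failover(document_bytes, ["http://tika-1:9998", "http://tika-2:9998"]) with placeholder host names returns text from the second server when the first is unreachable. The fetch parameter exists so that the failover logic can be exercised without a live server.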
You can consult Zextras Suite Documentation for technical information on how to add and use a Tika server.