Apache Tika Disable Tesseract

8/9/2019

Apache Tika is a content detection and analysis framework, written in Java and stewarded by the Apache Software Foundation.^[1] It detects and extracts metadata and text from over a thousand different file types. As well as providing a Java library, it has server and command-line editions suitable for use from other programming languages.

Tika
Stable release
Developer(s)	Apache Software Foundation
Repository
Written in	Java
Operating system	Cross-platform
Type	Search and index API
License	Apache License 2.0
Website	tika.apache.org

History

The project originated as part of the Apache Nutch codebase, to provide content identification and extraction when crawling. In 2007, it was separated out, to make it more extensible and usable by content management systems, other Web crawlers, and information retrieval systems. The standalone Tika was founded by Jérôme Charron, Chris Mattmann and Jukka Zitting.^[2] In 2011 Chris Mattmann and Jukka Zitting released the Manning book "Tika in Action", and the project released version 1.0.

Features

Tika provides capabilities for identification of more than 1400 file types from the Internet Assigned Numbers Authority taxonomy of MIME types. For most of the more common and popular formats,^[3] Tika then provides content extraction, metadata extraction and language identification capabilities.

While Tika is written in Java, it is widely used from other languages.^[4] The RESTful server and the CLI tool let non-Java programs access Tika's functionality.

Notable uses

Tika is used by financial institutions including the Fair Isaac Corporation (FICO)^[5] and Goldman Sachs,^[6] by NASA and academic researchers,^[7] and by major content management systems including Drupal^[8] and Alfresco^[9] to analyze large amounts of content and to make it available in common formats using information retrieval techniques.

On April 4, 2016,^[10] Forbes published an article identifying Tika as one of the key technologies used by more than 400 journalists to analyze 11.5 million leaked documents that exposed an international scandal involving world leaders storing money in offshore shell corporations. The leaked documents and the project to analyze them are referred to as the Panama Papers.

References

^[1] "Apache Tika". Retrieved 2016-04-15.
^[2] "Tika Proposal". Retrieved 2016-04-15.
^[3] "The Apache Software Foundation". Apache Tika formats page. Retrieved 2016-04-16.
^[4] "API Bindings for Tika". Apache Tika. Retrieved 2016-04-17.
^[5] "FICO to Engage Kaggle's Community of 180,000 Data Scientists to Drive Innovation in the FICO Analytic Cloud | FICO®". FICO® | Decisions. Archived from the original on 2016-06-03. Retrieved 2016-04-15.
^[6] "Goldman Sachs Puts Elasticsearch To Work - InformationWeek". InformationWeek. Retrieved 2017-06-21.
^[7] "Studying polar data with the help of Apache Tika". Opensource.com. Retrieved 2016-04-15.
^[8] "Text Extract for Drupal using Tika | Drupal.org". www.drupal.org. Retrieved 2016-04-15.
^[9] "Content Transformation and Metadata Extraction with Apache Tika - alfrescowiki". wiki.alfresco.com. Retrieved 2016-04-15.
^[10] Fox-Brewster, Thomas. "From Encrypted Drives To Amazon's Cloud -- The Amazing Flight Of The Panama Papers". Forbes. Retrieved 2016-04-15.

Retrieved from 'https://en.wikipedia.org/w/index.php?title=Apache_Tika&oldid=901978408'

Nested Class Summary

Nested Classes
Modifier and Type Class and Description
static class TesseractOCRConfig.OUTPUT_TYPE

Constructor Summary

Constructors
Constructor and Description
TesseractOCRConfig()
TesseractOCRConfig(java.io.InputStream is)
Loads properties from InputStream and then tries to close InputStream.
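
The InputStream constructor above is described as loading properties and then trying to close the stream. A minimal sketch of that load-then-close pattern, using only `java.util.Properties`, is shown below. The class name `OcrPropertiesSketch`, the property keys (`language`, `timeout`) and the fallback values are assumptions made for illustration; this is not Tika's own implementation.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

// Hypothetical stand-in for a config object; not Tika's TesseractOCRConfig.
final class OcrPropertiesSketch {
    final String language;
    final int timeoutSeconds;

    private OcrPropertiesSketch(String language, int timeoutSeconds) {
        this.language = language;
        this.timeoutSeconds = timeoutSeconds;
    }

    // Loads key=value pairs from the stream, then closes it.
    static OcrPropertiesSketch load(InputStream is) {
        Properties props = new Properties();
        // try-with-resources closes the stream whether loading succeeds or fails
        try (InputStream in = is) {
            props.load(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Fall back to assumed defaults when a key is absent
        return new OcrPropertiesSketch(
                props.getProperty("language", "eng"),
                Integer.parseInt(props.getProperty("timeout", "120").trim()));
    }

    public static void main(String[] args) {
        byte[] raw = "language=deu\ntimeout=30\n".getBytes(StandardCharsets.ISO_8859_1);
        OcrPropertiesSketch cfg = load(new ByteArrayInputStream(raw));
        System.out.println(cfg.language + " " + cfg.timeoutSeconds);
    }
}
```

Wrapping the stream in try-with-resources means the caller does not need to close it afterwards, which matches the behaviour the constructor description promises.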

Method Summary

Modifier and Type	Method and Description
`boolean`	`getApplyRotation()`
`java.lang.String`	`getColorspace()`
`int`	`getDensity()`
`int`	`getDepth()`
`java.lang.String`	`getFilter()`
`java.lang.String`	`getImageMagickPath()`
`java.lang.String`	`getLanguage()`
`int`	`getMaxFileSizeToOcr()`
`int`	`getMinFileSizeToOcr()`
`TesseractOCRConfig.OUTPUT_TYPE`	`getOutputType()`
`java.lang.String`	`getPageSegMode()`
`boolean`	`getPreserveInterwordSpacing()`
`int`	`getResize()`
`java.lang.String`	`getTessdataPath()`
`java.lang.String`	`getTesseractPath()`
`int`	`getTimeout()`
`int`	`isEnableImageProcessing()`
`void`	`setApplyRotation(boolean applyRotation)` Sets whether or not a rotation value should be calculated and passed to ImageMagick.
`void`	`setColorspace(java.lang.String colorspace)`
`void`	`setDensity(int density)`
`void`	`setDepth(int depth)`
`void`	`setEnableImageProcessing(int enableImageProcessing)` Set the value to 1 if image processing is to be enabled.
`void`	`setFilter(java.lang.String filter)`
`void`	`setImageMagickPath(java.lang.String ImageMagickPath)` Set the path to the ImageMagick executable, needed if it is not on system path.
`void`	`setLanguage(java.lang.String language)`
`void`	`setMaxFileSizeToOcr(int maxFileSizeToOcr)` Set maximum file size to submit file to ocr.
`void`	`setMinFileSizeToOcr(int minFileSizeToOcr)`
`void`	`setOutputType(TesseractOCRConfig.OUTPUT_TYPE outputType)` Set output type from ocr process.
`void`	`setPageSegMode(java.lang.String pageSegMode)`
`void`	`setPreserveInterwordSpacing(boolean preserveInterwordSpacing)` Whether or not to maintain interword spacing.
`void`	`setResize(int resize)`
`void`	`setTessdataPath(java.lang.String tessdataPath)` Set the path to the 'tessdata' folder, which contains language files and config files.
`void`	`setTesseractPath(java.lang.String tesseractPath)` Set the path to the Tesseract executable, needed if it is not on system path.
`void`	`setTimeout(int timeout)` Set maximum time (seconds) to wait for the ocring process to terminate.

Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
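
How several of these settings reach Tesseract can be sketched as a command-line assembly step, loosely mirroring the `doOCR` method in the `TesseractOCRParser` source reproduced further down this page. The class name, method name and enum below are hypothetical, the values passed in are placeholders rather than Tika defaults, and the page separator and extra `-c` options used by the real method are left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper; it only illustrates the argument order, it does not run Tesseract.
final class TesseractCommandSketch {
    enum OutputType { TXT, HOCR }

    static List<String> build(String executable, String input, String output,
                              String language, String pageSegMode,
                              boolean preserveInterwordSpacing, OutputType type) {
        List<String> cmd = new ArrayList<>();
        cmd.add(executable);      // path to the tesseract binary
        cmd.add(input);           // image file to read
        cmd.add(output);          // output base name; Tesseract appends the suffix
        cmd.add("-l");
        cmd.add(language);
        cmd.add("--psm");
        cmd.add(pageSegMode);
        cmd.add("-c");
        cmd.add(preserveInterwordSpacing
                ? "preserve_interword_spaces=1"
                : "preserve_interword_spaces=0");
        // The output type is passed in lower case as the final argument
        cmd.add(type.name().toLowerCase(Locale.US));
        return cmd;
    }

    public static void main(String[] args) {
        System.out.println(build("tesseract", "in.png", "out", "eng", "1", true, OutputType.HOCR));
    }
}
```

Building the command as a list of separate arguments, rather than one concatenated string, avoids quoting problems when paths contain spaces.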

I am using the tika-app jar for my project. Is there a way to disable Tesseract OCR in Tika? There are two constraints that have to be respected:

1. Tesseract cannot be uninstalled.

2. tika.xml can't be edited, as tika-app.jar is used off the shelf.

Is there a way to set the configuration in the Java code by setting the context or parser property to disable OCR?

I tried the code below, but OCR still extracts the text from image files while parsing.

Santhosh
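
One possible workaround is hinted at in the `TesseractOCRParser` source reproduced on this page. A comment in `checkInitialization` mentions users who turn off Tesseract by passing a bogus Tesseract path in a custom `TesseractOCRConfig`, and `hasTesseract` returns false when a non-empty configured path is not an existing directory, in which case `getSupportedTypes` advertises nothing. Whether this works for a given Tika version is an assumption to verify. The sketch below does not call Tika; it only models that directory check with the standard library. The class name, the `binaryRuns` flag (a stand-in for actually executing the binary) and the shortened type list are hypothetical.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical model of the path check; not Tika code.
final class BogusPathSketch {
    static final Set<String> IMAGE_TYPES = Collections.unmodifiableSet(
            new HashSet<>(Arrays.asList("image/png", "image/jpeg", "image/tiff")));

    // True when a non-empty configured folder is not an existing directory.
    static boolean ruledOutByPath(String tesseractFolder) {
        return !tesseractFolder.isEmpty()
                && !Files.isDirectory(Paths.get(tesseractFolder));
    }

    // Types the OCR parser would advertise; an empty set leaves images to other parsers.
    static Set<String> advertisedTypes(String tesseractFolder, boolean binaryRuns) {
        if (ruledOutByPath(tesseractFolder) || !binaryRuns) {
            return Collections.emptySet();
        }
        return IMAGE_TYPES;
    }

    public static void main(String[] args) {
        String bogus = "/definitely/not/a/real/tesseract/dir";
        System.out.println(advertisedTypes(bogus, true));
        System.out.println(advertisedTypes("", true));
    }
}
```

With Tika itself, the equivalent step would presumably be to call `config.setTesseractPath(...)` with such a path and then `parseContext.set(TesseractOCRConfig.class, config)`, as the class comment of the parser shows for the normal case.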


/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
package org.apache.tika.parser.ocr;

import org.apache.commons.exec.CommandLine;
import org.apache.commons.exec.DefaultExecutor;
import org.apache.commons.exec.PumpStreamHandler;
import org.apache.commons.io.FileUtils;
import org.apache.commons.io.IOUtils;
import org.apache.commons.lang3.SystemUtils;
import org.apache.tika.config.Field;
import org.apache.tika.config.Initializable;
import org.apache.tika.config.InitializableProblemHandler;
import org.apache.tika.config.Param;
import org.apache.tika.exception.TikaConfigException;
import org.apache.tika.exception.TikaException;
import org.apache.tika.io.TemporaryResources;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;
import org.apache.tika.mime.MediaTypeRegistry;
import org.apache.tika.parser.AbstractParser;
import org.apache.tika.parser.CompositeParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.external.ExternalParser;
import org.apache.tika.parser.image.ImageParser;
import org.apache.tika.parser.image.TiffParser;
import org.apache.tika.parser.jpeg.JpegParser;
import org.apache.tika.sax.OfflineContentHandler;
import org.apache.tika.sax.XHTMLContentHandler;
import org.apache.tika.utils.XMLReaderUtils;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

import javax.imageio.ImageIO;
import java.awt.Image;
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.FutureTask;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

import static java.nio.charset.StandardCharsets.UTF_8;

/**
 * TesseractOCRParser powered by tesseract-ocr engine. To enable this parser,
 * create a {@link TesseractOCRConfig} object and pass it through a
 * ParseContext. Tesseract-ocr must be installed and on system path or the path
 * to its root folder must be provided:
 *
 * TesseractOCRConfig config = new TesseractOCRConfig();
 * //Needed if tesseract is not on system path
 * config.setTesseractPath(tesseractFolder);
 * parseContext.set(TesseractOCRConfig.class, config);
 *
 */
public class TesseractOCRParser extends AbstractParser implements Initializable {

    private static final Logger LOG = LoggerFactory.getLogger(TesseractOCRParser.class);

    private static volatile boolean HAS_WARNED = false;
    private static final Object[] LOCK = new Object[0];

    private static final long serialVersionUID = -8167538283213097265L;
    private static final Set<MediaType> SUPPORTED_TYPES = Collections.unmodifiableSet(
            new HashSet<>(Arrays.asList(new MediaType[]{
                    MediaType.image("png"), MediaType.image("jpeg"), MediaType.image("tiff"),
                    MediaType.image("bmp"), MediaType.image("gif"), MediaType.image("jp2"),
                    MediaType.image("jpx"), MediaType.image("x-portable-pixmap")
            })));
    private final TesseractOCRConfig defaultConfig = new TesseractOCRConfig();
    private static Map<String,Boolean> TESSERACT_PRESENT = new HashMap<>();
    private static Map<String,Boolean> IMAGE_MAGICK_PRESENT = new HashMap<>();

    @Override
    public Set<MediaType> getSupportedTypes(ParseContext context) {
        // If Tesseract is installed, offer our supported image types
        TesseractOCRConfig config = context.get(TesseractOCRConfig.class, defaultConfig);
        if (hasTesseract(config)) {
            return SUPPORTED_TYPES;
        }
        // Otherwise don't advertise anything, so the other image parsers
        // can be selected instead
        return Collections.emptySet();
    }

    private void setEnv(TesseractOCRConfig config, ProcessBuilder pb) {
        String tessdataPrefix = "TESSDATA_PREFIX";
        Map<String, String> env = pb.environment();

        if (!config.getTessdataPath().isEmpty()) {
            env.put(tessdataPrefix, config.getTessdataPath());
        }
        else if(!config.getTesseractPath().isEmpty()) {
            env.put(tessdataPrefix, config.getTesseractPath());
        }
    }

    public boolean hasTesseract(TesseractOCRConfig config) {
        // Fetch where the config says to find Tesseract
        String tesseract = config.getTesseractPath() + getTesseractProg();

        // Have we already checked for a copy of Tesseract there?
        if (TESSERACT_PRESENT.containsKey(tesseract)) {
            return TESSERACT_PRESENT.get(tesseract);
        }
        //prevent memory bloat
        if (TESSERACT_PRESENT.size() > 100) {
            TESSERACT_PRESENT.clear();
        }
        //check that the parent directory exists
        if (! config.getTesseractPath().isEmpty() &&
                ! Files.isDirectory(Paths.get(config.getTesseractPath()))) {
            TESSERACT_PRESENT.put(tesseract, false);
            return false;
        }

        // Try running Tesseract from there, and see if it exists + works
        String[] checkCmd = { tesseract };
        boolean hasTesseract = ExternalParser.check(checkCmd);
        TESSERACT_PRESENT.put(tesseract, hasTesseract);
        return hasTesseract;
    }

    private boolean hasImageMagick(TesseractOCRConfig config) {
        // Fetch where the config says to find ImageMagick Program
        String ImageMagick = getImageMagickPath(config);

        // Have we already checked for a copy of ImageMagick Program there?
        if (IMAGE_MAGICK_PRESENT.containsKey(ImageMagick)) {
            return IMAGE_MAGICK_PRESENT.get(ImageMagick);
        }
        //prevent memory bloat
        if (IMAGE_MAGICK_PRESENT.size() > 100) {
            IMAGE_MAGICK_PRESENT.clear();
        }
        //check that directory exists
        if (!config.getImageMagickPath().isEmpty() &&
                ! Files.isDirectory(Paths.get(config.getImageMagickPath()))) {
            IMAGE_MAGICK_PRESENT.put(ImageMagick, false);
            return false;
        }
        if (SystemUtils.IS_OS_WINDOWS && config.getImageMagickPath().isEmpty()) {
            LOG.warn("Must specify path for imagemagick on Windows OS to avoid accidental confusion with convert.exe");
            IMAGE_MAGICK_PRESENT.put(ImageMagick, false);
            return false;
        }

        // Try running ImageMagick program from there, and see if it exists + works
        String[] checkCmd = { ImageMagick };
        boolean hasImageMagick = ExternalParser.check(checkCmd);
        IMAGE_MAGICK_PRESENT.put(ImageMagick, hasImageMagick);

        return hasImageMagick;
    }

    private String getImageMagickPath(TesseractOCRConfig config) {
        return config.getImageMagickPath() + getImageMagickProg();
    }

    static boolean hasPython() {
        // check if python is installed and it has the required dependencies for the rotation program to run
        boolean hasPython = false;
        TemporaryResources tmp = null;
        try {
            tmp = new TemporaryResources();
            File importCheck = tmp.createTemporaryFile();
            String prg = "import numpy, matplotlib, skimage, _tkinter";
            OutputStreamWriter out = new OutputStreamWriter(new FileOutputStream(importCheck), Charset.forName("UTF-8"));
            out.write(prg);
            out.close();

            Process p = Runtime.getRuntime().exec("python " + importCheck.getAbsolutePath());
            if (p.waitFor() == 0) {
                hasPython = true;
            }
        } catch (Exception e) {

        } finally {
            IOUtils.closeQuietly(tmp);
        }

        return hasPython;
    }

    public void parse(Image image, ContentHandler handler, Metadata metadata, ParseContext context) throws IOException,
            SAXException, TikaException {
        TemporaryResources tmp = new TemporaryResources();
        FileOutputStream fos = null;
        TikaInputStream tis = null;
        try {
            int w = image.getWidth(null);
            int h = image.getHeight(null);
            BufferedImage bImage = new BufferedImage(w, h, BufferedImage.TYPE_INT_RGB);
            File file = tmp.createTemporaryFile();
            fos = new FileOutputStream(file);
            ImageIO.write(bImage, "png", fos);
            tis = TikaInputStream.get(file);
            parse(tis, handler, metadata, context);
        } finally {
            tmp.dispose();
            if (tis != null)
                tis.close();
            if (fos != null)
                fos.close();
        }
    }

    @Override
    public void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext parseContext)
            throws IOException, SAXException, TikaException {
        TesseractOCRConfig config = parseContext.get(TesseractOCRConfig.class, defaultConfig);
        // If Tesseract is not on the path with the current config, do not try to run OCR
        // getSupportedTypes shouldn't have listed us as handling it, so this should only
        // occur if someone directly calls this parser, not via DefaultParser or similar
        if (! hasTesseract(config))
            return;

        TemporaryResources tmp = new TemporaryResources();
        try {
            TikaInputStream tikaStream = TikaInputStream.get(stream, tmp);

            //trigger the spooling to a tmp file if the stream wasn't
            //already a TikaInputStream that contained a file
            tikaStream.getPath();
            //this is the text output file name specified on the tesseract
            //commandline. The actual output file name will have a suffix added.
            File tmpOCROutputFile = tmp.createTemporaryFile();

            // Temporary workaround for TIKA-1445 - until we can specify
            // composite parsers with strategies (eg Composite, Try In Turn),
            // always send the image onwards to the regular parser to have
            // the metadata for them extracted as well
            _TMP_IMAGE_METADATA_PARSER.parse(tikaStream, new DefaultHandler(), metadata, parseContext);

            XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
            xhtml.startDocument();
            parse(tikaStream, tmpOCROutputFile, parseContext, xhtml, config);
            xhtml.endDocument();
        } finally {
            tmp.dispose();
        }
    }

    /**
     * Use this to parse content without starting a new document.
     * This appends SAX events to xhtml without re-adding the metadata, body start, etc.
     *
     * @param stream inputstream
     * @param xhtml handler
     * @param config TesseractOCRConfig to use for this parse
     * @throws IOException
     * @throws SAXException
     * @throws TikaException
     *
     * @deprecated use {@link #parseInline(InputStream, XHTMLContentHandler, ParseContext, TesseractOCRConfig)}
     */
    public void parseInline(InputStream stream, XHTMLContentHandler xhtml, TesseractOCRConfig config)
            throws IOException, SAXException, TikaException {
        parseInline(stream, xhtml, new ParseContext(), config);
    }

    /**
     * Use this to parse content without starting a new document.
     * This appends SAX events to xhtml without re-adding the metadata, body start, etc.
     *
     * @param stream inputstream
     * @param xhtml handler
     * @param config TesseractOCRConfig to use for this parse
     * @throws IOException
     * @throws SAXException
     * @throws TikaException
     */
    public void parseInline(InputStream stream, XHTMLContentHandler xhtml, ParseContext parseContext,
                            TesseractOCRConfig config)
            throws IOException, SAXException, TikaException {
        // If Tesseract is not on the path with the current config, do not try to run OCR
        // getSupportedTypes shouldn't have listed us as handling it, so this should only
        // occur if someone directly calls this parser, not via DefaultParser or similar
        if (! hasTesseract(config))
            return;

        TemporaryResources tmp = new TemporaryResources();
        try {
            TikaInputStream tikaStream = TikaInputStream.get(stream, tmp);
            File tmpImgFile = tmp.createTemporaryFile();
            parse(tikaStream, tmpImgFile, parseContext, xhtml, config);
        } finally {
            tmp.dispose();
        }
    }

    /**
     * This method is used to process the image to an OCR-friendly format.
     * @param scratchFile input image to be processed
     * @param config TesseractOCRconfig class to get ImageMagick properties
     * @throws IOException if an input error occurred
     * @throws TikaException if an exception timed out
     */
    private void processImage(File scratchFile, TesseractOCRConfig config) throws IOException, TikaException {

        // fetch rotation script from resources
        InputStream in = getClass().getResourceAsStream("rotation.py");
        TemporaryResources tmp = new TemporaryResources();
        File rotationScript = tmp.createTemporaryFile();
        Files.copy(in, rotationScript.toPath(), StandardCopyOption.REPLACE_EXISTING);

        CommandLine commandLine = new CommandLine("python");
        String[] args = {"-W",
                "ignore",
                rotationScript.getAbsolutePath(),
                "-f",
                scratchFile.getAbsolutePath()};
        commandLine.addArguments(args, true);
        String angle = "0";

        DefaultExecutor executor = new DefaultExecutor();
        ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
        PumpStreamHandler streamHandler = new PumpStreamHandler(outputStream);
        executor.setStreamHandler(streamHandler);

        // determine the angle of rotation required to make the text horizontal
        if(config.getApplyRotation() && hasPython()) {
            try {
                executor.execute(commandLine);
                String tmpAngle = outputStream.toString("UTF-8").trim();
                //verify that you've gotten a numeric value out
                Double.parseDouble(tmpAngle);
                angle = tmpAngle;
            } catch(Exception e) {

            }
        }

        // process the image - parameter values can be set in TesseractOCRConfig.properties
        commandLine = new CommandLine(getImageMagickPath(config));
        args = new String[]{
                "-density", Integer.toString(config.getDensity()),
                "-depth ", Integer.toString(config.getDepth()),
                "-colorspace", config.getColorspace(),
                "-filter", config.getFilter(),
                "-resize", config.getResize() + "%",
                "-rotate", angle,
                scratchFile.getAbsolutePath(),
                scratchFile.getAbsolutePath()
        };
        commandLine.addArguments(args, true);
        try {
            executor.execute(commandLine);
        } catch(Exception e) {

        }

        tmp.close();
    }

    private void parse(TikaInputStream tikaInputStream, File tmpOCROutputFile, ParseContext parseContext,
                       XHTMLContentHandler xhtml, TesseractOCRConfig config)
            throws IOException, SAXException, TikaException {
        File tmpTxtOutput = null;
        try {
            File input = tikaInputStream.getFile();
            long size = tikaInputStream.getLength();

            if (size >= config.getMinFileSizeToOcr() && size <= config.getMaxFileSizeToOcr()) {

                // Process image if ImageMagick Tool is present
                if(config.isEnableImageProcessing() == 1 && hasImageMagick(config)) {
                    // copy the contents of the original input file into a temporary file
                    // which will be preprocessed for OCR
                    TemporaryResources tmp = new TemporaryResources();
                    try {
                        File tmpFile = tmp.createTemporaryFile();
                        FileUtils.copyFile(input, tmpFile);
                        processImage(tmpFile, config);
                        doOCR(tmpFile, tmpOCROutputFile, config);
                    } finally {
                        if (tmp != null) {
                            tmp.dispose();
                        }
                    }
                } else {
                    doOCR(input, tmpOCROutputFile, config);
                }

                // Tesseract appends the output type (.txt or .hocr) to output file name
                tmpTxtOutput = new File(tmpOCROutputFile.getAbsolutePath() + "." +
                        config.getOutputType().toString().toLowerCase(Locale.US));

                if (tmpTxtOutput.exists()) {
                    try (InputStream is = new FileInputStream(tmpTxtOutput)) {
                        if (config.getOutputType().equals(TesseractOCRConfig.OUTPUT_TYPE.HOCR)) {
                            extractHOCROutput(is, parseContext, xhtml);
                        } else {
                            extractOutput(is, xhtml);
                        }
                    }
                }
            }
        } finally {
            if (tmpTxtOutput != null) {
                tmpTxtOutput.delete();
            }
        }
    }

    /**
     * no-op
     * @param params params to use for initialization
     * @throws TikaConfigException
     */
    @Override
    public void initialize(Map<String, Param> params) throws TikaConfigException {

    }

    @Override
    public void checkInitialization(InitializableProblemHandler problemHandler)
            throws TikaConfigException {
        //this will incorrectly trigger for people who turn off Tesseract
        //by sending in a bogus tesseract path via a custom TesseractOCRConfig.
        //TODO: figure out how to solve that.
        if (! hasWarned()) {
            if (hasTesseract(defaultConfig)) {
                problemHandler.handleInitializableProblem(this.getClass().getName(),
                        "Tesseract OCR is installed and will be automatically applied to image files unless\n" +
                                "you've excluded the TesseractOCRParser from the default parser.\n" +
                                "Tesseract may dramatically slow down content extraction (TIKA-2359).\n" +
                                "As of Tika 1.15 (and prior versions), Tesseract is automatically called.\n" +
                                "In future versions of Tika, users may need to turn the TesseractOCRParser on via TikaConfig.");
                warn();
            }
        }
    }

    // TIKA-1445 workaround parser
    private static Parser _TMP_IMAGE_METADATA_PARSER = new CompositeImageParser();
    private static class CompositeImageParser extends CompositeParser {
        private static final long serialVersionUID = -2398203346206381382L;
        private static List<Parser> imageParsers = Arrays.asList(new Parser[]{
                new ImageParser(), new JpegParser(), new TiffParser()
        });
        CompositeImageParser() {
            super(new MediaTypeRegistry(), imageParsers);
        }
    }

    /**
     * Run external tesseract-ocr process.
     *
     * @param input
     *          File to be ocred
     * @param output
     *          File to collect ocr result
     * @param config
     *          Configuration of tesseract-ocr engine
     * @throws TikaException
     *           if the extraction timed out
     * @throws IOException
     *           if an input error occurred
     */
    private void doOCR(File input, File output, TesseractOCRConfig config) throws IOException, TikaException {
        ArrayList<String> cmd = new ArrayList<>(Arrays.asList(
                config.getTesseractPath() + getTesseractProg(), input.getPath(), output.getPath(), "-l",
                config.getLanguage(), "--psm", config.getPageSegMode()
        ));
        for (Map.Entry<String, String> entry : config.getOtherTesseractConfig().entrySet()) {
            cmd.add("-c");
            cmd.add(entry.getKey() + "=" + entry.getValue());
        }
        cmd.addAll(Arrays.asList(
                "-c", "page_separator=" + config.getPageSeparator(),
                "-c",
                (config.getPreserveInterwordSpacing()) ? "preserve_interword_spaces=1" : "preserve_interword_spaces=0",
                config.getOutputType().name().toLowerCase(Locale.US)
        ));
        ProcessBuilder pb = new ProcessBuilder(cmd);
        setEnv(config, pb);
        final Process process = pb.start();

        process.getOutputStream().close();
        InputStream out = process.getInputStream();
        InputStream err = process.getErrorStream();

        logStream("OCR MSG", out, input);
        logStream("OCR ERROR", err, input);

        FutureTask<Integer> waitTask = new FutureTask<>(new Callable<Integer>() {
            public Integer call() throws Exception {
                return process.waitFor();
            }
        });

        Thread waitThread = new Thread(waitTask);
        waitThread.start();

        try {
            waitTask.get(config.getTimeout(), TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            waitThread.interrupt();
            process.destroy();
            Thread.currentThread().interrupt();
            throw new TikaException("TesseractOCRParser interrupted", e);
        } catch (ExecutionException e) {
            // should not be thrown
        } catch (TimeoutException e) {
            waitThread.interrupt();
            process.destroy();
            throw new TikaException("TesseractOCRParser timeout", e);
        }
    }

    /**
     * Reads the contents of the given stream and write it to the given XHTML
     * content handler. The stream is closed once fully processed.
     *
     * @param stream
     *          Stream where is the result of ocr
     * @param xhtml
     *          XHTML content handler
     * @throws SAXException
     *           if the XHTML SAX events could not be handled
     * @throws IOException
     *           if an input error occurred
     */
    private void extractOutput(InputStream stream, XHTMLContentHandler xhtml) throws SAXException, IOException {
        xhtml.startElement("div", "class", "ocr");
        try (Reader reader = new InputStreamReader(stream, UTF_8)) {
            char[] buffer = new char[1024];
            for (int n = reader.read(buffer); n != -1; n = reader.read(buffer)) {
                if (n > 0) {
                    xhtml.characters(buffer, 0, n);
                }
            }
        }
        xhtml.endElement("div");
    }

    private void extractHOCROutput(InputStream is, ParseContext parseContext,
                                   XHTMLContentHandler xhtml) throws TikaException, IOException, SAXException {
        if (parseContext == null) {
            parseContext = new ParseContext();
        }

        xhtml.startElement("div", "class", "ocr");
        XMLReaderUtils.parseSAX(is, new OfflineContentHandler(new HOCRPassThroughHandler(xhtml)), parseContext);
        xhtml.endElement("div");
    }

    /**
     * Starts a thread that reads the contents of the standard output or error
     * stream of the given process to not block the process. The stream is closed
     * once fully processed.
     */
    private void logStream(final String logType, final InputStream stream, final File file) {
        new Thread() {
            public void run() {
                Reader reader = new InputStreamReader(stream, UTF_8);
                StringBuilder out = new StringBuilder();
                char[] buffer = new char[1024];
                try {
                    for (int n = reader.read(buffer); n != -1; n = reader.read(buffer))
                        out.append(buffer, 0, n);
                } catch (IOException e) {

                } finally {
                    IOUtils.closeQuietly(stream);
                }

                LOG.debug("{}", out);
            }
        }.start();
    }

    static String getTesseractProg() {
        return System.getProperty("os.name").startsWith("Windows") ? "tesseract.exe" : "tesseract";
    }

    static String getImageMagickProg() {
        return System.getProperty("os.name").startsWith("Windows") ? "convert.exe" : "convert";
    }

    private static class HOCRPassThroughHandler extends DefaultHandler {
        private final ContentHandler xhtml;
        public static final Set<String> IGNORE = unmodifiableSet(
                "html", "head", "title", "meta", "body");

        public HOCRPassThroughHandler(ContentHandler xhtml) {
            this.xhtml = xhtml;
        }

        /**
         * Starts the given element. Table cells and list items are automatically
         * indented by emitting a tab character as ignorable whitespace.
         */
        @Override
        public void startElement(
                String uri, String local, String name, Attributes attributes)
                throws SAXException {
            if (!IGNORE.contains(name)) {
                xhtml.startElement(uri, local, name, attributes);
            }
        }

        /**
         * Ends the given element. Block elements are automatically followed
         * by a newline character.
         */
        @Override
        public void endElement(String uri, String local, String name) throws SAXException {
            if (!IGNORE.contains(name)) {
                xhtml.endElement(uri, local, name);
            }
        }

        /**
         * @see <a href="https://issues.apache.org/jira/browse/TIKA-210">TIKA-210</a>
         */
        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            xhtml.characters(ch, start, length);
        }

        private static Set<String> unmodifiableSet(String... elements) {
            return Collections.unmodifiableSet(
                    new HashSet<>(Arrays.asList(elements)));
        }
    }

    protected boolean hasWarned() {
        if (HAS_WARNED) {
            return true;
        }
        synchronized (LOCK) {
            if (HAS_WARNED) {
                return true;
            }
            return false;
        }
    }

    protected void warn() {
        HAS_WARNED = true;
    }

    @Field
    public void setTesseractPath(String tesseractPath) {
        defaultConfig.setTesseractPath(tesseractPath);
    }

    @Field
    public void setTessdataPath(String tessdataPath) {
        defaultConfig.setTessdataPath(tessdataPath);
    }

    @Field
    public void setLanguage(String language) {
        defaultConfig.setLanguage(language);
    }

    @Field
    public void setPageSegMode(String pageSegMode) {
        defaultConfig.setPageSegMode(pageSegMode);
    }

    @Field
    public void setMinFileSizeToOcr(long minFileSizeToOcr) {
        defaultConfig.setMinFileSizeToOcr(minFileSizeToOcr);
    }

    @Field
    public void setTimeout(int timeout) {
        defaultConfig.setTimeout(timeout);
    }

    @Field
    public void setOutputType(String outputType) {
        defaultConfig.setOutputType(outputType);
    }

    @Field
    public void setPreserveInterwordSpacing(boolean preserveInterwordSpacing) {
        defaultConfig.setPreserveInterwordSpacing(preserveInterwordSpacing);
    }

    @Field
    public void setEnableImageProcessing(int enableImageProcessing) {
        defaultConfig.setEnableImageProcessing(enableImageProcessing);
    }

    @Field
    public void setImageMagickPath(String imageMagickPath) {
        defaultConfig.setImageMagickPath(imageMagickPath);
    }

    @Field
    public void setDensity(int density) {
        defaultConfig.setDensity(density);
    }

    @Field
    public void setDepth(int depth) {
        defaultConfig.setDepth(depth);
    }

    @Field
    public void setColorspace(String colorspace) {
        defaultConfig.setColorspace(colorspace);
    }

    @Field
    public void setFilter(String filter) {
        defaultConfig.setFilter(filter);
    }

    @Field
    public void setResize(int resize) {
        defaultConfig.setResize(resize);
    }

    @Field
    public void setApplyRotation(boolean applyRotation) {
        defaultConfig.setApplyRotation(applyRotation);
    }

    public TesseractOCRConfig getDefaultConfig() {
        return defaultConfig;
    }
}
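
The timeout handling in `doOCR` above follows a general pattern: wait for the external process on a helper thread and give up once the configured timeout has elapsed. A self-contained sketch of that pattern is given below, with a plain `Callable` standing in for the Tesseract process. All names are hypothetical, and unlike `doOCR`, which throws a `TikaException` on timeout, the sketch returns a fallback value to stay dependency-free.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.FutureTask;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical illustration of the wait-with-timeout pattern; not Tika code.
final class TimeoutSketch {

    // Runs work on a helper thread; returns fallback if it does not finish in time.
    static int runWithTimeout(Callable<Integer> work, long timeoutMillis, int fallback) {
        FutureTask<Integer> task = new FutureTask<>(work);
        Thread worker = new Thread(task);
        worker.setDaemon(true); // do not keep the JVM alive for an abandoned task
        worker.start();
        try {
            return task.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            worker.interrupt(); // ask the worker to stop waiting
            return fallback;
        } catch (InterruptedException e) {
            worker.interrupt();
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            return fallback;
        } catch (ExecutionException e) {
            return fallback; // the work itself failed
        }
    }

    public static void main(String[] args) {
        // Finishes immediately, so the real result is returned
        System.out.println(runWithTimeout(() -> 0, 5000, -1));
        // Sleeps past the timeout, so the fallback is returned
        System.out.println(runWithTimeout(() -> { Thread.sleep(10000); return 0; }, 100, -1));
    }
}
```

On Java 8 and later, `Process.waitFor(long, TimeUnit)` offers a more direct way to bound the wait on a process, which may be simpler than a separate `FutureTask` when the thing being waited on is a `Process`.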
