
# Tess4J

## Description

A Java JNA wrapper for Tesseract OCR API. Tess4J is released and distributed under the Apache License, v2.0.

## Features

The library provides optical character recognition (OCR) support for:

- TIFF, JPEG, GIF, PNG, and BMP image formats
- Multi-page TIFF images
- PDF document format




    
    Tess4J - Java Wrapper for Tesseract OCR API


    
        
            Tess4J
        
        
            DESCRIPTION
        
        
            Tess4J is a JNA wrapper for Tesseract OCR
                API; it provides character recognition support for common image formats,
            multi-page images, and PDF documents. The library has been developed and tested
            on Windows and Linux.
        
        
            Tess4J is released and distributed under the 
                Apache License, v2.0. Its official homepage is at 
                    http://tess4j.sourceforge.net.
        
        
            SOFTWARE REQUIREMENTS
        
        
            Java Runtime Environment, 
                JNA, and JAI-ImageIO
            are required. Apache Maven and 
                JUnit are used for program building and unit testing. The Tesseract DLLs
            were built with VS2019 (v142) and therefore depend on the 
                Visual C++ 2019 Redistributable Packages.
        
        
            INSTRUCTIONS
        
        
            Tesseract 4.1.0 and Leptonica 1.78 (via Lept4J) 32- and 64-bit
            DLLs, language data for English, and sample images are bundled with the library.
            Language data packs for
            Tesseract should be decompressed and placed into the tessdata folder.
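
A quick pre-flight check can catch a missing language pack before OCR starts. The sketch below is not part of Tess4J: the class `TessdataCheck` and its methods are hypothetical names, and it assumes the usual Tesseract conventions that each language is stored as `<lang>.traineddata` and that several languages are requested with a `+` separator (for example `eng+deu`, where `deu` is only an illustrative second language).

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of Tess4J.
final class TessdataCheck {

    // Returns the languages from a spec such as "eng+deu" that have no
    // matching "<lang>.traineddata" entry among the given file names.
    static List<String> missingLanguages(Collection<String> fileNames, String languageSpec) {
        List<String> missing = new ArrayList<>();
        for (String lang : languageSpec.split("\\+")) {
            String trimmed = lang.trim();
            if (!trimmed.isEmpty() && !fileNames.contains(trimmed + ".traineddata")) {
                missing.add(trimmed);
            }
        }
        return missing;
    }

    // Convenience overload that lists the contents of a tessdata directory.
    static List<String> missingLanguages(File tessdataDir, String languageSpec) {
        String[] names = tessdataDir.list(); // null if the path is not a directory
        List<String> present = names == null
                ? Collections.<String>emptyList()
                : Arrays.asList(names);
        return missingLanguages(present, languageSpec);
    }

    public static void main(String[] args) {
        // "tessdata" is assumed to be relative to the working directory.
        List<String> missing = missingLanguages(new File("tessdata"), "eng+deu");
        System.out.println(missing.isEmpty()
                ? "All language data found"
                : "Missing language data: " + missing);
    }
}
```

If the returned list is not empty, the corresponding language data packs still need to be placed into the tessdata folder.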
        
        
            On Linux, the shared object library (libtesseract.so) that corresponds to the
            DLL is provided by Tesseract 4.1.0, which can be built from source by following the instructions in the Tesseract Wiki.
        
        
            To unit test, at the command line, execute:
        
        
            
                mvn test
            
        
        
            Support for PDF documents is available through either 
            GPL Ghostscript, which should be installed and included
            in the system path, or PDFBox, if Ghostscript is not available.
        
        
            Images to be OCRed should be scanned at a resolution of 200 to 400 DPI (dots per
            inch) in monochrome (black and white) or grayscale. Scanning at higher
            resolutions will not necessarily result in better recognition accuracy. The actual
            success rates depend greatly on the quality of the scanned image. The typical settings
            for scanning are 300 DPI and 1 bpp (bit per pixel) black and white or 8 bpp grayscale,
            saved in uncompressed TIFF or PNG format. PNG files are usually smaller than other image
            formats and still keep high quality because PNG uses lossless data compression;
            TIFF has the advantage of being able to contain multiple images (pages)
            in a file.
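
For images that already exist on disk, the guidance above can be approximated in code. The following is a minimal sketch using only the standard Java imaging classes. The class `ScanPrep` and its methods are hypothetical names and not part of Tess4J. It estimates DPI from a known physical page width, checks it against the 200 to 400 DPI range, and converts an image to 8 bpp grayscale.

```java
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;

// Hypothetical helper, not part of Tess4J.
final class ScanPrep {
    static final int MIN_DPI = 200; // lower bound suggested in the text above
    static final int MAX_DPI = 400; // upper bound suggested in the text above

    // Estimates scan resolution from pixel width and physical width in inches.
    static double estimateDpi(int widthPixels, double widthInches) {
        if (widthPixels <= 0 || widthInches <= 0) {
            throw new IllegalArgumentException("dimensions must be positive");
        }
        return widthPixels / widthInches;
    }

    static boolean isInSuggestedRange(double dpi) {
        return dpi >= MIN_DPI && dpi <= MAX_DPI;
    }

    // Redraws the source into an 8-bit, single-channel grayscale image.
    static BufferedImage toGrayscale8(BufferedImage source) {
        BufferedImage gray = new BufferedImage(
                source.getWidth(), source.getHeight(), BufferedImage.TYPE_BYTE_GRAY);
        Graphics2D g = gray.createGraphics();
        try {
            g.drawImage(source, 0, 0, null);
        } finally {
            g.dispose();
        }
        return gray;
    }

    public static void main(String[] args) {
        // A page 8.5 inches wide that is 2550 pixels across.
        double dpi = estimateDpi(2550, 8.5);
        System.out.println("Estimated DPI: " + Math.round(dpi)
                + ", in suggested range: " + isInSuggestedRange(dpi));

        BufferedImage gray = toGrayscale8(new BufferedImage(4, 4, BufferedImage.TYPE_INT_RGB));
        System.out.println("Bits per pixel: " + gray.getColorModel().getPixelSize());
    }
}
```

Converting after the fact does not recover detail lost in a poor scan; it only normalizes the pixel format before the image is handed to the OCR engine.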
        
        
            Several built-in functions are also provided for merging several images or PDF files
            into a single one for convenient OCR operations, or for splitting a PDF file that is
            too large into smaller ones, since processing a very large PDF can cause out-of-memory exceptions.
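
When splitting a large PDF, the page ranges have to be worked out first. The sketch below shows one way to compute them. The class `PageRanges` and its `chunk` method are hypothetical names and not part of Tess4J; the resulting 1-based, inclusive ranges would then be passed to whichever split routine is used.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Tess4J.
final class PageRanges {

    // Splits pages 1..totalPages into consecutive ranges of at most
    // pagesPerChunk pages. Each int[] holds {firstPage, lastPage}, inclusive.
    static List<int[]> chunk(int totalPages, int pagesPerChunk) {
        if (totalPages < 0 || pagesPerChunk <= 0) {
            throw new IllegalArgumentException(
                    "totalPages must be >= 0 and pagesPerChunk must be > 0");
        }
        List<int[]> ranges = new ArrayList<>();
        for (int first = 1; first <= totalPages; first += pagesPerChunk) {
            int last = Math.min(first + pagesPerChunk - 1, totalPages);
            ranges.add(new int[] {first, last});
        }
        return ranges;
    }

    public static void main(String[] args) {
        // A 10-page document in chunks of at most 4 pages.
        for (int[] range : chunk(10, 4)) {
            System.out.println(range[0] + "-" + range[1]);
        }
    }
}
```

A suitable chunk size depends on page complexity and available heap, so it is best chosen by experiment.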
        
        
            CODE EXAMPLES
        
        
            The following code example shows common usage of the library. Make sure the tessdata
            folder is populated with the appropriate language data files and the .jar
            files are on the classpath. On Windows, the DLLs will be automatically extracted
            from tess4j.jar to the default temporary directory and loaded.
        
        
package net.sourceforge.tess4j.example;

import java.io.File;
import net.sourceforge.tess4j.*;

public class TesseractExample {
    public static void main(String[] args) {
        // ImageIO.scanForPlugins(); // for server environment
        File imageFile = new File("eurotext.tif");
        ITesseract instance = new Tesseract(); // JNA Interface Mapping
        // ITesseract instance = new Tesseract1(); // JNA Direct Mapping
        // File tessDataFolder = LoadLibs.extractTessResources("tessdata"); // Maven build only; only English data bundled
        // instance.setDatapath(tessDataFolder.getPath());

        try {
            String result = instance.doOCR(imageFile);
            System.out.println(result);
        } catch (TesseractException e) {
            System.err.println(e.getMessage());
        }
    }
}

        
        
            DOCUMENTATION
        
        
            Please visit the website for the library's documentation.