The Extractor


About the Extractor

The Extractor tool is part of the XCL Suite developed by the Cologne team. The Extractor obtains content as well as technical properties of files (such as format, width, height, size, or the number of colours) through the use of the eXtensible Characterisation Extraction Language (XCEL). The characteristics of each file object are saved in an XCDL document.

The Extractor currently supports the following file formats:

  • PNG, TIFF, DOCX, PDF, SVG, CGM, MP3 and WAV (extraction by Extractor parser)
  • BMP, PBM, PCD, PCX, PICT, PPM, PSD, TGA, XBM and XPM (extraction by wrapping ImageMagick; to extract these formats with the Extractor, ImageMagick must be installed)
  • JPEG, JP2, GIF, HTML and AIF (extraction by wrapping JHOVE; to extract these formats with the Extractor, the Java JRE must be installed)
  • DOC (extraction by Microsoft conversion tool, currently not available for public use)

For more about first steps, compilation, usage of the Extractor and the file formats it currently supports, please refer to the XCL Documentation (chapter 5.2, "Extractor").


Download


Version 0.4 (Revision 848 (2010-05-31))

The Extractor together with the Comparator is also available in an 'easy-to-install' version for Windows (Download here).
To use the Extractor 0.4 on Linux and Mac, please download the Extractor source code from Planets SourceForge (Download here) and compile it on your machine. Compiling from source is also possible on Windows.

New Features

  • JPEG, JP2, GIF, HTML and AIF file formats are extracted through the wrapping of JHOVE.
  • WAV and MP3 file formats are now supported by genuine XCELs.
  • SVG and CGM file formats are now supported by genuine XCELs.
  • New XCEL structures <legalValues> and <nonLegalValues> are implemented. These work faster than <validValues> and <nonValidValues>.
  • New processings are implemented for XCEL:
    • setReference
    • registerDictionary
    • getValueFromDictionary
    • addition
    • subtraction
    • multiplication
    • division
    • deinterlace
    • getNthBit
    • synchsafeToInt
  • The property 'fontMetric' is extracted from PDF and DOCX formats.
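
Two of the new processings have names that match well-known bit-level operations. The sketch below is written in Python, not XCEL, and only illustrates what 'synchsafeToInt' and 'getNthBit' plausibly compute. It assumes that 'synchsafeToInt' follows the synchsafe integer definition of the ID3v2 tag format found in MP3 files (seven payload bits per byte, most significant byte first), and that 'getNthBit' counts bits from the least significant bit, which may differ from the Extractor's own convention. The function names synchsafe_to_int and get_nth_bit are hypothetical.

```python
def synchsafe_to_int(data: bytes) -> int:
    """Decode a synchsafe integer (ID3v2 style): 7 payload bits per byte."""
    value = 0
    for byte in data:
        if byte & 0x80:
            # In a synchsafe integer the top bit of every byte is zero.
            raise ValueError("top bit set: not a synchsafe byte")
        value = (value << 7) | byte
    return value


def get_nth_bit(value: int, n: int) -> int:
    """Return bit n of value, where n = 0 is the least significant bit.

    The bit numbering is an assumption made for this illustration.
    """
    if n < 0:
        raise ValueError("bit index must be non-negative")
    return (value >> n) & 1


if __name__ == "__main__":
    # The four size bytes 00 00 02 01 decode to 2 * 128 + 1.
    print(synchsafe_to_int(bytes([0x00, 0x00, 0x02, 0x01])))
    print(get_nth_bit(0b1010, 1))
```

The synchsafe form exists so that a size field can never contain a byte with the top bit set, which keeps it from being mistaken for an MPEG audio frame sync pattern.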

Version 0.3 (Revision 234 (2009-05-31))

The Extractor together with the Comparator is also available in an 'easy-to-install' version for Windows (Download here).

New Features

  • The XCEL files are located in xcl/xcel and the XCEL schemas in xcl/xcel/schemas
  • Better performance
  • Better PDF support:
    • Extraction of hexadecimal-encoded strings from the content stream of PDFs.
    • Extraction of the content stream of PDFs with multiple pages.
    • Image extraction from PDFs.
    • Support for PDFs generated from Microsoft Word 2007 (please use the test files under res/ooxml2pdf/).
  • -n option: NormData is not extracted
  • Adoption of new property names from the XCL Ontology.
  • The properties extracted from a file format are written in XCDL in alphabetical order.
  • Special characters are permitted in the input file name. If the input file name contains special characters, these must be escaped and the file name enclosed in quotation marks.
  • New formats are supported by wrapping the ImageMagick tool: GIF, BMP, JPEG, JP2, PBM, PCD, PCX, PICT, PPM, PSD, SVG, TGA, XBM and XPM.
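
The quoting rule for input file names can be illustrated with a small Python sketch. It assumes a POSIX shell and uses the standard library's shlex.quote, which wraps a name in single quotes instead of escaping individual characters inside double quotes; the Windows command prompt has different quoting rules. The helper name extractor_command and the default binary path rel/Extractor (taken from the version 0.2 notes) are assumptions for illustration.

```python
import shlex


def extractor_command(input_file: str, extractor: str = "rel/Extractor") -> str:
    """Build a POSIX shell command line with the input file name quoted.

    'extractor' is an assumed path to the binary; adjust it to your build.
    """
    return " ".join(shlex.quote(part) for part in (extractor, input_file))


if __name__ == "__main__":
    # Spaces and parentheses would otherwise be interpreted by the shell.
    print(extractor_command("my file (1).png"))
```

When the Extractor is started from a script rather than typed into a shell, passing the arguments as a list (for example to subprocess.run) avoids shell quoting altogether.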

Version 0.2

Extractor Version 0.2 with win32 binaries and testfiles

New Features

  • -G option for GUI alpha version
  • -o option to define output location
  • Droid inclusion for one-parameter command line execution
  • ImageMagick support. If ImageMagick is installed and `identify -verbose` is available, you can convert ImageMagick output to XCDL via rel/Extractor example.bmp (see start.bat for examples)
  • PDF font extraction alpha version (please use the test files under res/testpdf)

Known issues

  • Output via std::cout does not appear in the win32 cmd window. To view the output, redirect it to a file, e.g. rel\Extractor res\basn0g08.png > output.txt.
  • When extracting DOCX files, the temporary directory `targetDir` is not deleted. This must be done manually.
  • For unknown reasons, the GUI reacts very slowly on Windows systems.
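
The redirection workaround can also be applied from a script. The following Python sketch runs a command and writes its standard output to a file, which is what the `> output.txt` redirection does in the shell. The helper name run_to_file is hypothetical, and the Extractor paths in the usage note are simply the ones from the known issue above.

```python
import subprocess
import sys


def run_to_file(command, output_path):
    """Run a command and write its standard output to output_path.

    Returns the exit code of the command.
    """
    with open(output_path, "wb") as out:
        completed = subprocess.run(command, stdout=out, check=False)
    return completed.returncode


if __name__ == "__main__":
    # Stand-in command so the sketch runs without the Extractor installed.
    code = run_to_file([sys.executable, "-c", "print('hello')"], "output.txt")
    print("exit code:", code)
```

With a built Extractor, the call would look like run_to_file([r"rel\Extractor", r"res\basn0g08.png"], "output.txt").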

Version 0.1

Extractor source with testfiles
Win32 executable with testfiles


Compilation/Installation for Windows

An installer is available for Windows users (see above). Alternatively, you can build the software from source by following the steps below. You can build the current version using the Mingw-Qt package, available here. Assuming Xerces-C is installed in C:\lib\xercesc and zlib in C:\lib\zlib-1.2.3, type the following on the command line:
  1. cd Extractor
  2. qmake -project
    • 'qmake -project' generates an extractor.pro file in the current directory. The extractor.pro file must contain the following content (see here); if it does not, please add it manually.
      The header of the extractor.pro file must contain:
      QT+=xml
      CONFIG+=release
      DESTDIR=rel
      OBJECTS_DIR=rel
      TEMPLATE = app
      TARGET =
      LIBS= -L'C:\lib\xercesc\lib' -lxerces-c
      QMAKE_CXXFLAGS+=-DXML_LIBRARY
      QMAKE_CFLAGS+=-DXML_LIBRARY
    • Please make sure that the following entries are added to the DEPENDPATH and INCLUDEPATH of your project file:
      DEPENDPATH += C:\lib\xercesc\include
      INCLUDEPATH += C:\lib\zlib-1.2.3 \
                     C:\lib\xercesc\include
  3. qmake
  4. make


Compilation/Installation for Linux

To build the software from source, follow the steps below. Install the GCC C++ compiler, Qt, Xerces-C and ImageMagick (see the "System Requirements and Dependencies" section on this page or the XCL Documentation chapter of the same name). Assuming Xerces-C is installed in the directory /opt/xercesc, type the following on the command line:
  1. cd Extractor
  2. qmake -project QT+=xml CONFIG+=debug DESTDIR=rel OBJECTS_DIR=rel LIBS="-lxerces-c -L/opt/xercesc/lib"
    • 'qmake -project' generates an extractor.pro file in the current directory. The extractor.pro file must contain the following content (see here).
  3. qmake
  4. make


Compilation/Installation for Mac

Install the GCC C++ compiler, Qt, Xerces-C and ImageMagick (see the "System Requirements and Dependencies" section on this page or the XCL Documentation chapter of the same name). To build the Extractor on Mac OS, perform the following steps (assuming Xerces-C is installed in the directory /opt/local):
  1. cd extractor
  2. make clean
  3. rm -rf rel/extractor.app
  4. qmake -project QT+=xml CONFIG+=debug DESTDIR=rel OBJECTS_DIR=rel LIBS="-lxerces-c -L/opt/local/lib"
  5. Manually add "-I /opt/local/include/" to the include section of the generated extractor.pro file. The extractor.pro file must contain the following content (see here).
  6. qmake
  7. make

System Requirements and Dependencies

The Extractor uses the following software and libraries:

  • GCC C++ compiler
  • Qt (4.6.1 for Linux and Mac, 4.5.3 for Windows). The Qt-Mingw package bundles both g++ and Qt for Windows systems.
  • Xerces-C (version 2.8.0 for Linux, Windows and Mac; on Mac via MacPorts)
  • zlib (1.2.3 for Linux, Windows and Mac OS)
  • Java JRE (1.6 for Linux, Windows and Mac OS)
  • ImageMagick (latest version for all operating systems)

To use the Extractor software, GCC, Qt, Xerces-C and zlib must be installed.
Only the ImageMagick software is optional and can be installed at a later time. Installing ImageMagick allows the Extractor to process a larger number of file formats. For more about the function of ImageMagick within the Extractor software, see chapter 5.2.1.6, "XCDLs generated by Extractor and alternative sources", in the XCL Documentation.
When installing ImageMagick, decide which optional libraries it needs in order to support certain file formats (see here). After the installation, verify that ImageMagick is working properly by typing 'identify -verbose [inputfile]' in a command-line tool (see here). To list all image formats that ImageMagick supports on your system, type 'identify -list format' (see here). The Extractor supports the file formats BMP, PBM, PCD, PCX, PICT, PPM, PSD, TGA, XBM and XPM by wrapping the ImageMagick command-line tool 'identify -verbose', so ImageMagick must be installed before these formats can be extracted.
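
The ImageMagick check described above can be scripted. The following Python sketch looks for the 'identify' executable on the search path and, if it is found, runs 'identify -verbose' on a file and returns the raw text output. The function names find_identify and identify_verbose and the search_path parameter are hypothetical; the sketch does not parse the output and makes no claim about how the Extractor itself invokes ImageMagick.

```python
import shutil
import subprocess


def find_identify(search_path=None):
    """Return the full path of ImageMagick's 'identify', or None if absent.

    search_path overrides the PATH environment variable when given.
    """
    return shutil.which("identify", path=search_path)


def identify_verbose(input_file, search_path=None):
    """Run 'identify -verbose <input_file>' and return its text output."""
    exe = find_identify(search_path)
    if exe is None:
        raise RuntimeError("ImageMagick 'identify' was not found")
    result = subprocess.run(
        [exe, "-verbose", input_file],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if identify reports a failure
    )
    return result.stdout


if __name__ == "__main__":
    location = find_identify()
    if location is None:
        print("ImageMagick does not appear to be installed")
    else:
        print("identify found at", location)
```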
For more information about the file formats currently supported by the Extractor, please refer to the XCL Documentation (chapter 5.2.1.6, "XCDLs generated by Extractor and alternative sources").