Package org.apache.pdfbox.pdfparser
Class NonSequentialPDFParser
- java.lang.Object
  - org.apache.pdfbox.pdfparser.BaseParser
    - org.apache.pdfbox.pdfparser.PDFParser
      - org.apache.pdfbox.pdfparser.NonSequentialPDFParser
-
public class NonSequentialPDFParser extends PDFParser
A PDFParser that first reads startxref and the xref tables to learn which objects are valid, and then parses only those objects. This makes it closer to a conforming parser than the sequential reading done by PDFParser. This class can be used as a PDFParser replacement. parse() must be called first, before page objects can be retrieved, e.g. via getPDDocument(). This class is a much enhanced version of QuickParser, presented in PDFBOX-1104 by Jeremy Villalobos.
-
-
Field Summary
Fields (modifier and type, field, description):
protected static int DEFAULT_TRAIL_BYTECOUNT
protected static char[] EOF_MARKER - EOF-marker.
protected static char[] OBJ_MARKER - obj-marker.
protected SecurityHandler securityHandler - The security handler.
protected static char[] STARTXREF_MARKER - StartXRef-marker.
static java.lang.String SYSPROP_EOFLOOKUPRANGE
static java.lang.String SYSPROP_PARSEMINIMAL
static java.lang.String TMP_FILE_PREFIX
-
Fields inherited from class org.apache.pdfbox.pdfparser.PDFParser
isFDFDocment, xrefTrailerResolver
-
Fields inherited from class org.apache.pdfbox.pdfparser.BaseParser
DEF, document, ENDOBJ, ENDSTREAM, forceParsing, pdfSource, PROP_PUSHBACK_SIZE
-
-
Constructor Summary
Constructors (constructor, description):
NonSequentialPDFParser(java.io.File file, RandomAccess raBuf) - Constructs parser for given file using given buffer for temporary storage.
NonSequentialPDFParser(java.io.File file, RandomAccess raBuf, java.lang.String decryptionPassword) - Constructs parser for given file using given buffer for temporary storage.
NonSequentialPDFParser(java.io.InputStream input) - Constructor.
NonSequentialPDFParser(java.io.InputStream input, RandomAccess raBuf, java.lang.String decryptionPassword) - Constructor.
NonSequentialPDFParser(java.lang.String filename) - Constructs parser for given file using memory buffer.
-
Method Summary
All Methods, Instance Methods, Concrete Methods (modifier and type, method, description):
protected void decrypt(COSBase pb, int objNr, int objGenNr) - Decrypts given object.
protected void decryptDictionary(COSDictionary dict, long objNr, long objGenNr)
protected void decryptString(COSString str, long objNr, long objGenNr) - Decrypts given COSString.
protected void deleteTempFile() - Remove the temporary file.
PDPage getPage(int pageNr) - Returns the page requested with all the objects loaded into it.
int getPageNumber() - Returns the number of pages in a document.
PDDocument getPDDocument() - This will get the PD document that was parsed.
protected java.io.File getPdfFile() - Return the pdf file.
SecurityHandler getSecurityHandler() - Returns security handler of the document or null if document is not encrypted or parse() wasn't called before.
protected long getStartxrefOffset() - Looks for and parses startxref.
protected void initialParse() - The initial parse will first parse only the trailer, the xrefstart and all xref tables to have a pointer (offset) to all the pdf's objects.
boolean isLenient() - Return true if parser is lenient.
protected int lastIndexOf(char[] pattern, byte[] buf, int endOff) - Searches last appearance of pattern within buffer.
void parse() - This will parse the stream and populate the COSDocument object.
protected COSStream parseCOSStream(COSDictionary dic, RandomAccess file) - This will read a COSStream from the input stream using length attribute within dictionary.
protected COSBase parseObjectDynamically(int objNr, int objGenNr, boolean requireExistingNotCompressedObj) - This will parse the next object from the stream and add it to the local state.
protected COSBase parseObjectDynamically(COSObject obj, boolean requireExistingNotCompressedObj) - This will parse the next object from the stream and add it to the local state.
protected void readPattern(char[] pattern) - Reads given pattern from BaseParser.pdfSource.
protected void releasePdfSourceInputStream() - Enable handling of alternative pdfSource implementation.
void setEOFLookupRange(int byteCount) - Sets how many trailing bytes of PDF file are searched for EOF marker and 'startxref' marker.
void setLenient(boolean lenient) - Change the parser leniency flag.
protected void setPdfSource(long fileOffset) - Sets BaseParser.pdfSource to start next parsing at given file offset.
-
Methods inherited from class org.apache.pdfbox.pdfparser.PDFParser
clearResources, getDocument, getFDFDocument, isContinueOnError, parseHeader, parseStartXref, parseTrailer, parseXrefStream, parseXrefStream, parseXrefTable, readVersionInTrailer, setTempDirectory
-
Methods inherited from class org.apache.pdfbox.pdfparser.BaseParser
isClosing, isClosing, isEndOfName, isEOL, isEOL, isWhitespace, isWhitespace, parseBoolean, parseCOSArray, parseCOSDictionary, parseCOSName, parseCOSString, parseCOSString, parseDirObject, readExpectedString, readGenerationNumber, readInt, readLine, readLong, readObjectNumber, readString, readString, readStringNumber, readUntilEndStream, setDocument, skipSpaces
-
-
-
-
Field Detail
-
SYSPROP_PARSEMINIMAL
public static final java.lang.String SYSPROP_PARSEMINIMAL
- See Also:
- Constant Field Values
-
SYSPROP_EOFLOOKUPRANGE
public static final java.lang.String SYSPROP_EOFLOOKUPRANGE
- See Also:
- Constant Field Values
-
DEFAULT_TRAIL_BYTECOUNT
protected static final int DEFAULT_TRAIL_BYTECOUNT
- See Also:
- Constant Field Values
-
EOF_MARKER
protected static final char[] EOF_MARKER
EOF-marker.
-
STARTXREF_MARKER
protected static final char[] STARTXREF_MARKER
StartXRef-marker.
-
OBJ_MARKER
protected static final char[] OBJ_MARKER
obj-marker.
-
securityHandler
protected SecurityHandler securityHandler
The security handler.
-
TMP_FILE_PREFIX
public static final java.lang.String TMP_FILE_PREFIX
- See Also:
- Constant Field Values
-
-
Constructor Detail
-
NonSequentialPDFParser
public NonSequentialPDFParser(java.lang.String filename) throws java.io.IOException
Constructs parser for given file using memory buffer.
- Parameters:
filename - the filename of the pdf to be parsed
- Throws:
java.io.IOException - If something went wrong.
-
NonSequentialPDFParser
public NonSequentialPDFParser(java.io.File file, RandomAccess raBuf) throws java.io.IOException
Constructs parser for given file using given buffer for temporary storage.
- Parameters:
file - the pdf to be parsed
raBuf - the buffer to be used for parsing
- Throws:
java.io.IOException - If something went wrong.
-
NonSequentialPDFParser
public NonSequentialPDFParser(java.io.File file, RandomAccess raBuf, java.lang.String decryptionPassword) throws java.io.IOException
Constructs parser for given file using given buffer for temporary storage.
- Parameters:
file - the pdf to be parsed
raBuf - the buffer to be used for parsing
decryptionPassword - password to be used for decryption
- Throws:
java.io.IOException - If something went wrong.
-
NonSequentialPDFParser
public NonSequentialPDFParser(java.io.InputStream input) throws java.io.IOException
Constructor.
- Parameters:
input - input stream representing the pdf.
- Throws:
java.io.IOException - If something went wrong.
-
NonSequentialPDFParser
public NonSequentialPDFParser(java.io.InputStream input, RandomAccess raBuf, java.lang.String decryptionPassword) throws java.io.IOException
Constructor.
- Parameters:
input - input stream representing the pdf.
raBuf - the buffer to be used for parsing
decryptionPassword - password to be used for decryption.
- Throws:
java.io.IOException - If something went wrong.
-
-
Method Detail
-
setEOFLookupRange
public void setEOFLookupRange(int byteCount)
Sets how many trailing bytes of the PDF file are searched for the EOF marker and the 'startxref' marker. If not set, the default value DEFAULT_TRAIL_BYTECOUNT is used.
If the system property SYSPROP_EOFLOOKUPRANGE is defined, its value is applied on initialization but can be overwritten later.
- Parameters:
byteCount - number of trailing bytes
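The precedence described here (a built-in default, replaced by a system property at initialization, replaced again by an explicit setter call) can be sketched as follows. This is a minimal illustration and not the PDFBox implementation: the class name, the property key example.eofLookupRange and the default of 2048 bytes are assumptions chosen for the example.

```java
/** Sketch: default value, overridden by a system property, overridable by a setter. */
class EofLookupRangeSketch {
    // Hypothetical key and default. The real constants are SYSPROP_EOFLOOKUPRANGE
    // and DEFAULT_TRAIL_BYTECOUNT, whose values are not shown on this page.
    static final String SYSPROP_KEY = "example.eofLookupRange";
    static final int DEFAULT_BYTECOUNT = 2048;

    private int readTrailBytes;

    EofLookupRangeSketch() {
        // Integer.getInteger falls back to the default when the property
        // is undefined or is not a valid integer.
        readTrailBytes = Integer.getInteger(SYSPROP_KEY, DEFAULT_BYTECOUNT);
    }

    /** Overrides whatever was chosen at initialization. */
    void setEOFLookupRange(int byteCount) {
        readTrailBytes = byteCount;
    }

    int getEOFLookupRange() {
        return readTrailBytes;
    }

    public static void main(String[] args) {
        System.setProperty(SYSPROP_KEY, "4096");
        EofLookupRangeSketch parser = new EofLookupRangeSketch();
        System.out.println("after init:   " + parser.getEOFLookupRange());
        parser.setEOFLookupRange(8192);
        System.out.println("after setter: " + parser.getEOFLookupRange());
    }
}
```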
-
initialParse
protected void initialParse() throws java.io.IOException
The initial parse will first parse only the trailer, the xrefstart and all xref tables to have a pointer (offset) to all the pdf's objects. It can handle linearized pdfs, which will have an xref at the end pointing to an xref at the beginning of the file. Last, the root object is parsed.
- Throws:
java.io.IOException - If something went wrong.
-
setPdfSource
protected final void setPdfSource(long fileOffset) throws java.io.IOException
Sets BaseParser.pdfSource to start next parsing at given file offset.
- Parameters:
fileOffset - file offset
- Throws:
java.io.IOException - If something went wrong.
-
releasePdfSourceInputStream
protected final void releasePdfSourceInputStream() throws java.io.IOException
Enable handling of alternative pdfSource implementation.
- Throws:
java.io.IOException - If something went wrong.
-
getStartxrefOffset
protected final long getStartxrefOffset() throws java.io.IOException
Looks for and parses startxref. We first look for the last '%%EOF' marker (within the last DEFAULT_TRAIL_BYTECOUNT bytes, or the range set via setEOFLookupRange(int)) and then go back to find startxref.
- Returns:
- the offset of StartXref
- Throws:
java.io.IOException - If something went wrong.
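The two-step lookup (last '%%EOF' within the trailing bytes, then the last 'startxref' before it) can be illustrated on an in-memory byte array. This is a simplified sketch under stated assumptions, not the library's code: the class and method names are hypothetical, the whole file is assumed to fit in memory, and the returned value is the file offset of the 'startxref' keyword itself.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;

/** Sketch: locate 'startxref' by searching backwards from the last '%%EOF'. */
class StartxrefSketch {

    static long findStartxrefOffset(byte[] pdf, int lookupRange) throws IOException {
        // Only the last lookupRange bytes of the file are inspected.
        int range = Math.min(pdf.length, Math.max(0, lookupRange));
        int tailStart = pdf.length - range;
        // ISO-8859-1 maps every byte to exactly one char, so string
        // indices equal byte offsets within the tail.
        String tail = new String(pdf, tailStart, range, StandardCharsets.ISO_8859_1);

        int eof = tail.lastIndexOf("%%EOF");
        if (eof < 0) {
            throw new IOException("Missing end of file marker '%%EOF'");
        }
        // Go back from the EOF marker to the closest preceding 'startxref'.
        int startxref = tail.lastIndexOf("startxref", eof);
        if (startxref < 0) {
            throw new IOException("Missing 'startxref' marker");
        }
        return (long) tailStart + startxref;
    }

    public static void main(String[] args) throws IOException {
        byte[] pdf = "%PDF-1.4\nxref\nstartxref\n9\n%%EOF\n"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println("startxref keyword at offset " + findStartxrefOffset(pdf, 2048));
    }
}
```

If the lookup range is too small to contain the whole '%%EOF' marker, the sketch fails with an IOException, which is why a configurable range is useful for files with trailing garbage.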
-
lastIndexOf
protected int lastIndexOf(char[] pattern, byte[] buf, int endOff)
Searches for the last appearance of pattern within buffer. The lookup starts before endOff and goes back until 0.
- Parameters:
pattern - pattern to search for
buf - buffer to search pattern in
endOff - offset (exclusive) where lookup starts at
- Returns:
- start offset of pattern within buffer or -1 if pattern could not be found
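A backward search with an exclusive end offset can be written as a plain nested loop. The following is a minimal sketch that mirrors the documented signature; it is an illustration of the technique, not the code shipped with PDFBox, and the class name is hypothetical.

```java
import java.nio.charset.StandardCharsets;

/** Sketch: find the last occurrence of a pattern that ends before endOff. */
class LastIndexOfSketch {

    static int lastIndexOf(char[] pattern, byte[] buf, int endOff) {
        // The match must lie completely before endOff (exclusive).
        int end = Math.min(endOff, buf.length);
        for (int start = end - pattern.length; start >= 0; start--) {
            int i = 0;
            while (i < pattern.length && buf[start + i] == (byte) pattern[i]) {
                i++;
            }
            if (i == pattern.length) {
                return start; // first hit while walking backwards is the last occurrence
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] buf = "%%EOF junk %%EOF\n".getBytes(StandardCharsets.ISO_8859_1);
        char[] marker = "%%EOF".toCharArray();
        System.out.println(lastIndexOf(marker, buf, buf.length));
        System.out.println(lastIndexOf(marker, buf, 15));
    }
}
```

Passing a smaller endOff excludes later matches, which is how a caller can continue searching backwards from a previous hit.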
-
readPattern
protected final void readPattern(char[] pattern) throws java.io.IOException
Reads the given pattern from BaseParser.pdfSource, skipping whitespace at start and end.
- Parameters:
pattern - pattern to be skipped
- Throws:
java.io.IOException - if pattern could not be read
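The behaviour (skip leading whitespace, match the pattern character by character, skip trailing whitespace) can be sketched on a PushbackInputStream. This is a simplified illustration: the class name and the static helper methods are hypothetical, whitespace is limited to the six PDF whitespace bytes, and comment handling is left out.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;

/** Sketch: read an expected keyword surrounded by optional whitespace. */
class ReadPatternSketch {

    // NUL, TAB, LF, FF, CR and SPACE count as whitespace in PDF syntax.
    static boolean isWhitespace(int c) {
        return c == 0 || c == 9 || c == 10 || c == 12 || c == 13 || c == 32;
    }

    static void skipSpaces(PushbackInputStream in) throws IOException {
        int c = in.read();
        while (c != -1 && isWhitespace(c)) {
            c = in.read();
        }
        if (c != -1) {
            in.unread(c); // put back the first non-whitespace byte
        }
    }

    static void readPattern(PushbackInputStream in, char[] pattern) throws IOException {
        skipSpaces(in);
        for (char expected : pattern) {
            int c = in.read();
            if (c != expected) {
                throw new IOException("Expected pattern '" + new String(pattern)
                        + "' but missed at character '" + expected + "'");
            }
        }
        skipSpaces(in);
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "  obj \n<< /Type /Page >>".getBytes(StandardCharsets.ISO_8859_1);
        PushbackInputStream in = new PushbackInputStream(new ByteArrayInputStream(data));
        readPattern(in, "obj".toCharArray());
        System.out.println("next byte: " + (char) in.read());
    }
}
```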
-
parse
public void parse() throws java.io.IOException
This will parse the stream and populate the COSDocument object. This will close the stream when it is done parsing.
-
getPdfFile
protected java.io.File getPdfFile()
Return the pdf file.
- Returns:
- the pdf file
-
isLenient
public boolean isLenient()
Return true if parser is lenient, meaning the auto-healing capacity of the parser is used.
- Returns:
- true if parser is lenient
-
setLenient
public void setLenient(boolean lenient) throws java.lang.IllegalArgumentException
Change the parser leniency flag. This method can only be called before the parsing of the file.
- Parameters:
lenient -
- Throws:
java.lang.IllegalArgumentException - if the method is called after parsing.
-
deleteTempFile
protected void deleteTempFile()
Remove the temporary file. A temporary file is created if this class is instantiated with an InputStream.
-
getSecurityHandler
public SecurityHandler getSecurityHandler()
Returns the security handler of the document, or null if the document is not encrypted or parse() wasn't called before.
- Returns:
- the security handler.
-
getPDDocument
public PDDocument getPDDocument() throws java.io.IOException
This will get the PD document that was parsed. When you are done with this document you must call close() on it to release resources. Overriding the super method was necessary in order to set the security handler.
- Overrides:
getPDDocument in class PDFParser
- Returns:
- The document at the PD layer.
- Throws:
java.io.IOException - If there is an error getting the document.
-
getPageNumber
public int getPageNumber() throws java.io.IOException
Returns the number of pages in a document.
- Returns:
- the number of pages.
- Throws:
java.io.IOException - if PAGES or other needed object is missing
-
getPage
public PDPage getPage(int pageNr) throws java.io.IOException
Returns the page requested with all the objects loaded into it.
- Parameters:
pageNr - starts from 0 to the number of pages.
- Returns:
- the page with the given pagenumber.
- Throws:
java.io.IOException - If something went wrong.
-
parseObjectDynamically
protected final COSBase parseObjectDynamically(COSObject obj, boolean requireExistingNotCompressedObj) throws java.io.IOException
This will parse the next object from the stream and add it to the local state. This is taken from PDFParser and reduced to parsing an indirect object.
- Parameters:
obj - object to be parsed (we only take object number and generation number for lookup start offset)
requireExistingNotCompressedObj - if true, the object to be parsed must not be contained within a compressed stream
- Returns:
- the parsed object (which is also added to document object)
- Throws:
java.io.IOException - If an IO error occurs.
-
parseObjectDynamically
protected COSBase parseObjectDynamically(int objNr, int objGenNr, boolean requireExistingNotCompressedObj) throws java.io.IOException
This will parse the next object from the stream and add it to the local state. This is taken from PDFParser and reduced to parsing an indirect object.
- Parameters:
objNr - object number of object to be parsed
objGenNr - object generation number of object to be parsed
requireExistingNotCompressedObj - if true, the object to be parsed must be defined in xref (comment: null objects may be missing from xref) and it must not be a compressed object within an object stream (this is used to circumvent being stuck in a loop in a malicious PDF)
- Returns:
- the parsed object (which is also added to document object)
- Throws:
java.io.IOException - If an IO error occurs.
-
decryptDictionary
protected final void decryptDictionary(COSDictionary dict, long objNr, long objGenNr) throws java.io.IOException
- Parameters:
dict - the dictionary to be decrypted
objNr - the object number
objGenNr - the object generation number
- Throws:
java.io.IOException - if something went wrong
-
decryptString
protected final void decryptString(COSString str, long objNr, long objGenNr) throws java.io.IOException
Decrypts given COSString.
- Parameters:
str - the string to be decrypted
objNr - the object number
objGenNr - the object generation number
- Throws:
java.io.IOException - if something went wrong
-
decrypt
protected final void decrypt(COSBase pb, int objNr, int objGenNr) throws java.io.IOException
Decrypts given object.
- Parameters:
pb - the object to be decrypted
objNr - the object number
objGenNr - the object generation number
- Throws:
java.io.IOException - if something went wrong
-
parseCOSStream
protected COSStream parseCOSStream(COSDictionary dic, RandomAccess file) throws java.io.IOException
This will read a COSStream from the input stream using the length attribute within the dictionary. If the length attribute is an indirect reference, it is first resolved to get the stream length. This means we copy stream data without testing for 'endstream' or 'endobj', and thus it is no problem if these keywords occur within the stream. We require 'endstream' to be found after the stream data is read.
- Overrides:
parseCOSStream in class BaseParser
- Parameters:
dic - dictionary that goes with this stream.
file - file to write the stream to when reading.
- Returns:
- parsed pdf stream.
- Throws:
java.io.IOException - if an error occurred reading the stream, like problems with reading length attribute, stream does not end with 'endstream' after data read, stream too short etc.
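Reading by declared length, rather than scanning for the 'endstream' keyword, can be shown on a byte array. The sketch below is an illustration only: the class and method names are hypothetical, the length is assumed to be already resolved to a plain integer, and the data is copied in memory instead of being written to a RandomAccess buffer.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Sketch: copy exactly 'length' bytes, then require the 'endstream' keyword. */
class StreamByLengthSketch {
    private static final byte[] ENDSTREAM =
            "endstream".getBytes(StandardCharsets.ISO_8859_1);

    static byte[] readStreamData(byte[] src, int dataStart, int length) throws IOException {
        if (dataStart < 0 || length < 0 || (long) dataStart + length > src.length) {
            throw new IOException("Stream too short for declared length " + length);
        }
        // The payload is copied blindly, so keywords inside it are harmless.
        int pos = dataStart + length;
        byte[] data = Arrays.copyOfRange(src, dataStart, pos);

        // Skip the end-of-line bytes that may separate data and keyword.
        while (pos < src.length && (src[pos] == '\r' || src[pos] == '\n' || src[pos] == ' ')) {
            pos++;
        }
        if (pos + ENDSTREAM.length > src.length
                || !Arrays.equals(Arrays.copyOfRange(src, pos, pos + ENDSTREAM.length), ENDSTREAM)) {
            throw new IOException("Expected 'endstream' after stream data");
        }
        return data;
    }

    public static void main(String[] args) throws IOException {
        String payload = "abc endstream xyz"; // keyword inside the data on purpose
        byte[] src = ("stream\n" + payload + "\nendstream")
                .getBytes(StandardCharsets.ISO_8859_1);
        byte[] data = readStreamData(src, "stream\n".length(), payload.length());
        System.out.println(new String(data, StandardCharsets.ISO_8859_1));
    }
}
```

A wrong declared length usually shows up as a missing 'endstream' right after the copied data, which corresponds to the error cases listed in the Throws clause.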
-
-