How is information compressed in a PDF?

I'm trying to read the author (and title, keywords) fields of a PDF file using PHP. I used TCPDF, and the file is parsed into objects, but the field values are not clean: they have stray characters mixed in. Could this be some kind of compression, and if so, how do I get rid of them?

The following field comes out of the parser like this, although the actual value is clean:

��<�?xml version='1.0' encoding='cp1251'?><�stamps><�stamp></�stamp></�stamps>


Answer 1, authority 100%

Information in a PDF can be compressed at two levels:

  • data compression (text, pictures, etc.),
  • file structure compression.

The compression algorithms differ; for text it is usually Flate or LZW. Learn more here.
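As a rough illustration of the Flate case: a Flate-compressed stream holds zlib-format data, which PHP's zlib extension can inflate with gzuncompress(). The sketch below is only a minimal example under assumptions: the function name inflatePdfStream and the sample content are made up for illustration, the stream bytes are assumed to be already extracted from the file, and any predictor set in the stream's decode parameters is ignored.

```php
<?php
// Hypothetical helper: inflate the raw bytes of a Flate-compressed PDF stream.
// Assumes the bytes between "stream" and "endstream" were already extracted
// and that no predictor post-processing is required.
function inflatePdfStream(string $raw): ?string
{
    // Flate streams use the zlib format, which gzuncompress() understands.
    $decoded = @gzuncompress($raw);

    // gzuncompress() returns false when the data is not valid zlib data.
    return $decoded === false ? null : $decoded;
}

// Stand-in for real stream bytes: compress a sample string, then inflate it.
$sample = gzcompress("BT /F1 12 Tf (Hello) Tj ET");
echo inflatePdfStream($sample), "\n";
```

If the function returns null, the data was either not Flate-compressed or was passed through another filter first.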


Answer 2

str_replace('�', '', $data); // xD
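Stripping characters blindly like this can also remove bytes that belong to the text. One possible explanation for the stray characters, offered here as an assumption rather than a diagnosis, is that the field is not compressed at all but stored as a UTF-16BE text string: PDF text strings may start with the byte order mark FE FF and then use two bytes per character, so ASCII text shows up with a zero byte before every letter. A minimal sketch of converting such a value, with the hypothetical function name pdfTextStringToUtf8 and assuming the mbstring extension is available:

```php
<?php
// Hypothetical helper: convert a PDF text string to UTF-8.
// Assumption: a value starting with the bytes FE FF is UTF-16BE;
// anything else is returned unchanged in this sketch.
function pdfTextStringToUtf8(string $raw): string
{
    if (strncmp($raw, "\xFE\xFF", 2) === 0) {
        // Drop the 2-byte byte order mark, then convert the rest.
        return mb_convert_encoding(substr($raw, 2), 'UTF-8', 'UTF-16BE');
    }

    // No byte order mark: leave the bytes as they are.
    return $raw;
}

// Example input: byte order mark followed by "<?xml" in UTF-16BE.
$field = "\xFE\xFF\x00<\x00?\x00x\x00m\x00l";
echo pdfTextStringToUtf8($field), "\n";
```

Whether this applies depends on the actual bytes in the field, so it is worth dumping them with bin2hex() before choosing a conversion.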