Couldn't open file with special characters in filename #529

@endolith

Describe the bug
On Windows, textract fails to open files whose names contain non-ASCII (Unicode) characters.

To Reproduce

On Windows 10:

import textract
textract.process(r"Making Democracy Count_ How Mathematics Improves Voting.pdf")
textract.process(r"Making Democracy Count_ How Mathematics Improves Voting, -- Ismar Volić.pdf")

The first file opens fine, but the second fails because of the non-ASCII character (ć) in its name:

In [7]: textract.process(r"Making Democracy Count_ How Mathematics Improves Voting, -- Ismar Volić.pdf")
---------------------------------------------------------------------------
ShellError                                Traceback (most recent call last)
Cell In[7], line 1
----> 1 textract.process(r"Making Democracy Count_ How Mathematics Improves Voting, -- Ismar Volić.pdf")

File ~\anaconda3\envs\openai_experiments\Lib\site-packages\textract\parsers\__init__.py:79, in process(filename, input_encoding, output_encoding, extension, **kwargs)
     76 # do the extraction
     78 parser = filetype_module.Parser()
---> 79 return parser.process(filename, input_encoding, output_encoding, **kwargs)

File ~\anaconda3\envs\openai_experiments\Lib\site-packages\textract\parsers\utils.py:46, in BaseParser.process(self, filename, input_encoding, output_encoding, **kwargs)
     36 """Process ``filename`` and encode byte-string with ``encoding``. This
     37 method is called by :func:`textract.parsers.process` and wraps
     38 the :meth:`.BaseParser.extract` method in `a delicious unicode
     39 sandwich <http://nedbatchelder.com/text/unipain.html>`_.
     40
     41 """
     42 # make a "unicode sandwich" to handle dealing with unknown
     43 # input byte strings and converting them to a predictable
     44 # output encoding
     45 # http://nedbatchelder.com/text/unipain/unipain.html#35
---> 46 byte_string = self.extract(filename, **kwargs)
     47 unicode_string = self.decode(byte_string, input_encoding)
     48 return self.encode(unicode_string, output_encoding)

File ~\anaconda3\envs\openai_experiments\Lib\site-packages\textract\parsers\pdf_parser.py:29, in Parser.extract(self, filename, method, **kwargs)
     27             return self.extract_pdfminer(filename, **kwargs)
     28         else:
---> 29             raise ex
     31 elif method == 'pdfminer':
     32     return self.extract_pdfminer(filename, **kwargs)

File ~\anaconda3\envs\openai_experiments\Lib\site-packages\textract\parsers\pdf_parser.py:21, in Parser.extract(self, filename, method, **kwargs)
     19 if method == '' or method == 'pdftotext':
     20     try:
---> 21         return self.extract_pdftotext(filename, **kwargs)
     22     except ShellError as ex:
     23         # If pdftotext isn't installed and the pdftotext method
     24         # wasn't specified, then gracefully fallback to using
     25         # pdfminer instead.
     26         if method == '' and ex.is_not_installed():

File ~\anaconda3\envs\openai_experiments\Lib\site-packages\textract\parsers\pdf_parser.py:44, in Parser.extract_pdftotext(self, filename, **kwargs)
     42 else:
     43     args = ['pdftotext', filename, '-']
---> 44 stdout, _ = self.run(args)
     45 return stdout

File ~\anaconda3\envs\openai_experiments\Lib\site-packages\textract\parsers\utils.py:106, in ShellParser.run(self, args)
    104 # if pipe is busted, raise an error (unlike Fabric)
    105 if pipe.returncode != 0:
--> 106     raise exceptions.ShellError(
    107         ' '.join(args), pipe.returncode, stdout, stderr,
    108     )
    110 return stdout, stderr

ShellError: The command `pdftotext Making Democracy Count_ How Mathematics Improves Voting, -- Ismar Volić.pdf -` failed with exit code 1
------------- stdout -------------
b''------------- stderr -------------
b"Error: Couldn't open file 'Making Democracy Count_ How Mathematics Improves Voting, -- Ismar Volic.pdf'\r\n"

Expected behavior
It should extract the text from the files.
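
Possible workaround: the stderr output above shows `Volic` where the real filename has `Volić`, which suggests the character is lost somewhere between Python and the `pdftotext` command. Below is a minimal sketch, assuming the problem is limited to how the filename reaches the external command. The helper `process_via_ascii_copy` and its `processor` parameter are hypothetical names, not part of textract. It copies the file to a temporary ASCII-only name and processes the copy instead:

```python
import os
import shutil
import tempfile


def process_via_ascii_copy(path, processor=None, **kwargs):
    """Copy `path` to a temporary ASCII-only filename and process the copy.

    Hypothetical helper, not part of textract. `processor` is any callable
    that takes a file path; it defaults to textract.process.
    """
    if processor is None:
        # Imported lazily so the helper can be exercised without textract.
        import textract
        processor = textract.process

    # Keep the original extension, since textract selects its parser from it.
    ext = os.path.splitext(path)[1]

    # Assumption: the temporary directory path itself is ASCII-only. On
    # Windows this may not hold if the user name contains other characters.
    with tempfile.TemporaryDirectory() as tmp_dir:
        safe_path = os.path.join(tmp_dir, "input" + ext)
        shutil.copyfile(path, safe_path)
        return processor(safe_path, **kwargs)
```

Usage would be `process_via_ascii_copy(r"Making Democracy Count_ How Mathematics Improves Voting, -- Ismar Volić.pdf")`. Alternatively, the traceback shows a `method == 'pdfminer'` branch in `pdf_parser.py`, so passing `method='pdfminer'` to `textract.process` might avoid the `pdftotext` shell call entirely; I have not verified either approach on this file.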

Desktop (please complete the following information):

  • OS: Windows 10
  • Textract version: 1.6.5
  • Python version: 3.12.4
  • Virtual environment: yes, in conda
