Open
Description
Describe the bug
On Windows, textract fails to open files whose names contain non-ASCII (Unicode) characters.
To Reproduce
On Windows 10:
import textract
textract.process(r"Making Democracy Count_ How Mathematics Improves Voting.pdf")
textract.process(r"Making Democracy Count_ How Mathematics Improves Voting, -- Ismar Volić.pdf")
The first filename opens fine, but the second fails because of the non-ASCII character ć in the name:
In [7]: textract.process(r"Making Democracy Count_ How Mathematics Improves Voting, -- Ismar Volić.pdf")
---------------------------------------------------------------------------
ShellError Traceback (most recent call last)
Cell In[7], line 1
----> 1 textract.process(r"Making Democracy Count_ How Mathematics Improves Voting, -- Ismar Volić.pdf")
File ~\anaconda3\envs\openai_experiments\Lib\site-packages\textract\parsers\__init__.py:79, in process(filename, input_encoding, output_encoding, extension, **kwargs)
76 # do the extraction
78 parser = filetype_module.Parser()
---> 79 return parser.process(filename, input_encoding, output_encoding, **kwargs)
File ~\anaconda3\envs\openai_experiments\Lib\site-packages\textract\parsers\utils.py:46, in BaseParser.process(self, filename, input_encoding, output_encoding, **kwargs)
36 """Process ``filename`` and encode byte-string with ``encoding``. This
37 method is called by :func:`textract.parsers.process` and wraps
38 the :meth:`.BaseParser.extract` method in `a delicious unicode
39 sandwich <http://nedbatchelder.com/text/unipain.html>`_.
40
41 """
42 # make a "unicode sandwich" to handle dealing with unknown
43 # input byte strings and converting them to a predictable
44 # output encoding
45 # http://nedbatchelder.com/text/unipain/unipain.html#35
---> 46 byte_string = self.extract(filename, **kwargs)
47 unicode_string = self.decode(byte_string, input_encoding)
48 return self.encode(unicode_string, output_encoding)
File ~\anaconda3\envs\openai_experiments\Lib\site-packages\textract\parsers\pdf_parser.py:29, in Parser.extract(self, filename, method, **kwargs)
27 return self.extract_pdfminer(filename, **kwargs)
28 else:
---> 29 raise ex
31 elif method == 'pdfminer':
32 return self.extract_pdfminer(filename, **kwargs)
File ~\anaconda3\envs\openai_experiments\Lib\site-packages\textract\parsers\pdf_parser.py:21, in Parser.extract(self, filename, method, **kwargs)
19 if method == '' or method == 'pdftotext':
20 try:
---> 21 return self.extract_pdftotext(filename, **kwargs)
22 except ShellError as ex:
23 # If pdftotext isn't installed and the pdftotext method
24 # wasn't specified, then gracefully fallback to using
25 # pdfminer instead.
26 if method == '' and ex.is_not_installed():
File ~\anaconda3\envs\openai_experiments\Lib\site-packages\textract\parsers\pdf_parser.py:44, in Parser.extract_pdftotext(self, filename, **kwargs)
42 else:
43 args = ['pdftotext', filename, '-']
---> 44 stdout, _ = self.run(args)
45 return stdout
File ~\anaconda3\envs\openai_experiments\Lib\site-packages\textract\parsers\utils.py:106, in ShellParser.run(self, args)
104 # if pipe is busted, raise an error (unlike Fabric)
105 if pipe.returncode != 0:
--> 106 raise exceptions.ShellError(
107 ' '.join(args), pipe.returncode, stdout, stderr,
108 )
110 return stdout, stderr
ShellError: The command `pdftotext Making Democracy Count_ How Mathematics Improves Voting, -- Ismar Volić.pdf -` failed with exit code 1
------------- stdout -------------
b''
------------- stderr -------------
b"Error: Couldn't open file 'Making Democracy Count_ How Mathematics Improves Voting, -- Ismar Volic.pdf'\r\n"
Expected behavior
It should extract the text from the files.
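One detail in the stderr output above may help narrow this down: pdftotext reports the name as "Volic.pdf", not "Volić.pdf", so the ć seems to have been replaced somewhere between Python and pdftotext. A possible explanation, which is only a guess, is that the argument is converted to the system's legacy ANSI code page. The sketch below checks whether each non-ASCII character in a filename can be represented in a given code page. The helper name `non_ascii_report` and the choice of `cp1252` are assumptions for illustration; the actual code page on the affected machine is not known.

```python
def non_ascii_report(filename, codepage="cp1252"):
    """List each non-ASCII character and whether `codepage` can represent it.

    `codepage` is an assumed value; the real ANSI code page of the
    affected machine may differ.
    """
    report = []
    for ch in filename:
        if ord(ch) > 127:
            try:
                ch.encode(codepage)
                representable = True
            except UnicodeEncodeError:
                representable = False
            report.append((ch, representable))
    return report


# The failing name from this report; \u0107 is the precomposed "ć".
name = ("Making Democracy Count_ How Mathematics Improves Voting, "
        "-- Ismar Voli\u0107.pdf")
print(non_ascii_report(name))  # prints [('ć', False)]
```

Python's codec raises an error for characters outside the code page, while Windows can substitute a similar-looking character, which would be consistent with the "Volic" seen in stderr.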
Desktop (please complete the following information):
- OS: Windows 10
- Textract version: 1.6.5
- Python version: 3.12.4
- Virtual environment: yes, in conda
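Possible workaround
Until this is fixed, one option is to copy the file to a temporary path with an ASCII-only name and process the copy. This is a minimal sketch, not a fix in textract itself. The helper `process_via_ascii_copy` is hypothetical, and it assumes the temporary directory path contains only ASCII characters, which may not hold if the Windows user name is non-ASCII.

```python
import os
import shutil
import tempfile


def process_via_ascii_copy(path, process_fn):
    """Run `process_fn` on a temporary ASCII-named copy of `path`.

    Hypothetical helper: `process_fn` is any callable that takes a
    filename, for example textract.process.
    """
    # Keep the extension so the parser can still be chosen from it,
    # but drop it if the extension itself is not ASCII.
    suffix = os.path.splitext(path)[1]
    if not suffix.isascii():
        suffix = ""

    # mkstemp generates a random ASCII file name in the temp directory.
    fd, tmp_path = tempfile.mkstemp(suffix=suffix)
    os.close(fd)
    try:
        shutil.copyfile(path, tmp_path)
        return process_fn(tmp_path)
    finally:
        # Remove the temporary copy whether or not processing succeeded.
        os.remove(tmp_path)
```

With textract installed, the call would look like `process_via_ascii_copy(r"Making Democracy Count_ How Mathematics Improves Voting, -- Ismar Volić.pdf", textract.process)`. Copying costs extra I/O for large files, but it avoids passing the non-ASCII name to pdftotext at all.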