Feature: detect mixups between two single-byte encodings #18

There is apparently a fair amount of Spanish text out there that went through a mix-up between Windows-1252 and MacRoman before being encoded in UTF-8.

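Because mistaking Latin-1 for Windows-1252 is the only single-byte mixup we detect, we assume that's what happened, and get text that looks like: "PrevŽn diputados inaugurar periodo de sesiones con c—digo penal". The intended text is "Prevén diputados inaugurar periodo de sesiones con código penal": in MacRoman, byte 0x8E is "é" and byte 0x97 is "ó", while in Windows-1252 the same bytes are "Ž" and "—".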
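To make the failure concrete, here is a minimal sketch that reproduces the garbled headline using only Python's built-in codecs. The exact pipeline (MacRoman bytes misread as Latin-1, then stored as UTF-8) is an assumption inferred from the example text, and the variable names are illustrative; this is not ftfy's code.

```python
# Assumed pipeline: text was written as MacRoman bytes, misread as Latin-1,
# then saved as UTF-8. Reading the result as Windows-1252 gives the
# headline quoted in this issue.
intended = "Prevén diputados inaugurar periodo de sesiones con código penal"

mac_bytes = intended.encode("mac_roman")   # é -> 0x8E, ó -> 0x97
misread = mac_bytes.decode("latin-1")      # 0x8E -> U+008E, 0x97 -> U+0097 (C1 controls)
stored = misread.encode("utf-8")           # the wrong characters, validly encoded

# The Latin-1 / Windows-1252 reinterpretation (what a browser would also do):
current_result = stored.decode("utf-8").encode("latin-1").decode("windows-1252")
print(current_result)
```

The reinterpretation is internally consistent, which is why it looks like a successful fix even though the characters are wrong.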

This is not a false positive: the encoding really is incorrect (the text contains the UTF-8 encoding of the wrong characters), and ftfy is trying to fix it, using the same fix that any web browser would use. However, the resulting text makes no sense, because that is not the correct fix for this mixup.

This mixup is apparently common enough that it would be worth fixing as another special case.
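As a rough sketch of what the reversal step of such a special case could look like, the text can be re-encoded as Windows-1252 and decoded as MacRoman. The helper name `undo_macroman_as_cp1252` is hypothetical and not part of ftfy's API.

```python
def undo_macroman_as_cp1252(text: str) -> str:
    """Hypothetical helper: reinterpret Windows-1252-decoded text as MacRoman."""
    try:
        return text.encode("windows-1252").decode("mac_roman")
    except UnicodeEncodeError:
        # Some character has no Windows-1252 byte, so this particular
        # mixup cannot explain the text; leave it unchanged.
        return text


garbled = "PrevŽn diputados inaugurar periodo de sesiones con c—digo penal"
print(undo_macroman_as_cp1252(garbled))
```

Applying this unconditionally would damage correct text (a legitimate "—" would turn into "ó"), so a real special case would also need a heuristic to decide when the MacRoman reading is more plausible. That detection step is not shown here.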

