The Text Cleaner transforms text from AI models, PDFs, websites, and Office documents into clean, continuous text. With one click, it removes all formatting clutter like line breaks, tabs, hyphenation, special characters, and even hidden AI markers. Enjoy cleanly structured text – perfect for your projects at work, in school, or online.
What does it remove?
- Line breaks (single or multiple)
- Tabs and duplicate spaces
- Indents at the beginning of lines (spaces or tabs)
- Non-breaking and invisible characters (e.g. \u00A0, \u200B, \uFEFF)
- All control and formatting markers (soft hyphen, the entire U+2060–U+206F range starting at Word Joiner, bidi marks, invisible mathematical operators)
- AI watermark characters: StegCloak alphabet (U+200B/C/D, U+2060, U+2062), Innamark whitespace method (U+2004, U+2008, U+2009, U+202F, U+205F), Combining Grapheme Joiner, Hangul Fillers
- All Unicode space variants (Em Space, En Space, Thin Space, Ideographic Space, Narrow No-Break Space, etc.) → regular space
- Homoglyphs – similar-looking foreign characters (Cyrillic, Greek, Armenian, IPA, Lisu, and Latin-Extended letters normalized to Latin)
- Typographic symbols ("smart punctuation" such as –, —, " " ' ')
- Markdown and bullet symbols (*, —, 1., • etc.)
- HTML entities (&nbsp;, &amp;, &gt;, decoded into plain text)
- Hyphenation across lines (e.g. "Be-\nreich" → "Bereich")
- Interlinear annotation characters, the object replacement character, and the replacement character (U+FFF9–U+FFFD)
- Variation Selectors (U+FE00–U+FE0F) and Variation Selectors Supplement (U+E0100–U+E01EF) – can carry hidden data on individual characters (new)
- Unicode Tag Block (U+E0000–U+E007F) – exploited for invisible prompt injection into AI assistants (new)
- Non-breaking hyphen (U+2011) → regular hyphen (new)
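To make the list above more concrete, here is a minimal Python sketch of how several of these steps could be chained. It is not the Text Cleaner's actual implementation: the function name `clean_text`, the order of the steps, and the exact character set are assumptions chosen for illustration, and homoglyph normalization and Markdown stripping are left out.

```python
import html
import re
import unicodedata

# Assumed subset of the invisible and formatting characters listed above.
INVISIBLE = re.compile(
    "["
    "\u00ad"                    # soft hyphen
    "\u034f"                    # combining grapheme joiner
    "\u115f\u1160\u3164\uffa0"  # Hangul fillers
    "\u200b-\u200f"             # zero-width characters and LRM/RLM
    "\u202a-\u202e"             # bidi embeddings and overrides
    "\u2060-\u206f"             # word joiner, invisible operators, etc.
    "\ufe00-\ufe0f"             # variation selectors
    "\ufeff"                    # byte order mark / zero width no-break space
    "\ufff9-\ufffd"             # interlinear annotation, object replacement
    "\U000e0000-\U000e007f"     # Unicode tag block
    "\U000e0100-\U000e01ef"     # variation selectors supplement
    "]"
)

# Typographic punctuation and the non-breaking hyphen mapped to ASCII.
PUNCTUATION = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # single smart quotes
    "\u201c": '"', "\u201d": '"',   # double smart quotes
    "\u2013": "-", "\u2014": "-",   # en dash, em dash
    "\u2011": "-",                  # non-breaking hyphen
})


def clean_text(text: str) -> str:
    """Illustrative cleaning pipeline (assumed step order)."""
    text = html.unescape(text)                 # decode HTML entities
    text = INVISIBLE.sub("", text)             # drop invisible characters
    # Map every Unicode space separator (category Zs) to a regular space.
    text = "".join(
        " " if unicodedata.category(ch) == "Zs" else ch for ch in text
    )
    text = text.translate(PUNCTUATION)
    # Rejoin words hyphenated across a line break. This also joins
    # genuine compounds that happen to break at the hyphen.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse line breaks, tabs, indents and repeated spaces.
    return re.sub(r"\s+", " ", text).strip()


raw = "Be-\nreich\u200b &amp; \u201csmart\u201d\u00a0text\t here\U000e0041"
print(clean_text(raw))  # → Bereich & "smart" text here
```

Entities are decoded first so that a decoded non-breaking space is still caught by the space normalization that follows.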
What cannot be detected?
Some watermarking methods operate exclusively at the statistical or semantic level – they leave no character-level traces and cannot be removed by text cleaning:
- SynthID Text (Google DeepMind): Influences token selection during generation via a secret key list. No individual character is changed – the mark is embedded in the statistical frequency distribution of chosen words.
- Statistical token watermarking (OpenAI research, KGW method): Divides the vocabulary into "green" and "red" tokens and favors green ones during generation. Statistically detectable, but invisible at the character level.
- Semantic watermarks: Synonym selection, sentence structure variations, or discourse markers chosen according to a secret scheme. Can only be removed by fully paraphrasing the text.
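The following toy sketch shows why character-level cleaning does not touch a green/red token watermark. It is a heavily simplified stand-in for the KGW idea, not the published algorithm or any vendor's detector: the hash-based `is_green` rule, the key `"demo-key"`, and the green fraction `GAMMA` are all assumptions. The score depends only on which words were chosen, so removing invisible characters leaves it unchanged.

```python
import hashlib
import math

GAMMA = 0.5  # assumed fraction of the vocabulary treated as "green"


def is_green(prev_token: str, token: str, key: str = "demo-key") -> bool:
    """Toy rule: hash (key, previous token, token) into green or red."""
    data = f"{key}|{prev_token}|{token}".encode("utf-8")
    return hashlib.sha256(data).digest()[0] / 256 < GAMMA


def green_z_score(tokens: list[str], key: str = "demo-key") -> float:
    """z-score of the green-token count against the unwatermarked expectation."""
    pairs = list(zip(tokens, tokens[1:]))
    if not pairs:
        return 0.0
    n = len(pairs)
    green = sum(is_green(prev, tok, key) for prev, tok in pairs)
    return (green - GAMMA * n) / math.sqrt(n * GAMMA * (1 - GAMMA))
```

Ordinary text should score near zero, while text whose words were consistently picked from the green list scores several standard deviations higher. Only changing the words themselves, for example by paraphrasing, lowers the score.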
Note on completeness: The detection patterns are updated regularly based on current research publications and security reports. Since AI developers' techniques evolve very rapidly, no guarantee of completeness can be given. We recommend checking the cleaned text with specialized online tools as well – for example a Unicode inspector or an AI detector – to ensure no hidden markers remain.
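As a rough idea of what such a Unicode inspector does, the sketch below lists every character that is likely to be invisible or unusual, with its position, code point, category, and name. The function name `inspect_text` and the choice of flagged categories are assumptions for illustration. Line breaks and tabs are reported too, since cleaned text is expected to be continuous.

```python
import unicodedata

# Characters that are invisible but not covered by the categories below
# (combining grapheme joiner and Hangul fillers).
EXTRA = {"\u034f", "\u115f", "\u1160", "\u3164", "\uffa0"}

# Cf = format, Cc = control, Co = private use, Cn = unassigned,
# Zl/Zp = line and paragraph separators.
FLAGGED_CATEGORIES = {"Cf", "Cc", "Co", "Cn", "Zl", "Zp"}


def inspect_text(text: str) -> list[tuple[int, str, str, str]]:
    """Return (index, code point, category, name) for suspicious characters."""
    findings = []
    for index, ch in enumerate(text):
        cp = ord(ch)
        category = unicodedata.category(ch)
        is_variation_selector = (
            0xFE00 <= cp <= 0xFE0F or 0xE0100 <= cp <= 0xE01EF
        )
        suspicious = (
            category in FLAGGED_CATEGORIES
            or (category == "Zs" and ch != " ")  # any space except U+0020
            or is_variation_selector
            or ch in EXTRA
        )
        if suspicious:
            name = unicodedata.name(ch, "<unnamed>")
            findings.append((index, f"U+{cp:04X}", category, name))
    return findings


for finding in inspect_text("a\u200bb\u00a0c"):
    print(finding)
# → (1, 'U+200B', 'Cf', 'ZERO WIDTH SPACE')
# → (3, 'U+00A0', 'Zs', 'NO-BREAK SPACE')
```

An empty result means none of the flagged character classes remain. It says nothing about statistical or semantic watermarks.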