
Remove Diacritical Marks in a Unicode String

30 Nov 2010, CPOL
With a helper CharMap class, using the VC2010 implementation of C++0x

This contribution comes from this forum question[^] and my inefficient answer[^].


We want to remove diacritical marks[^] from a Unicode string, for instance replacing every occurrence of àáảãạăằắẳẵặâầấẩẫậ with a plain a, using C++0x[^] as implemented in VC2010.


To do that, let's define a C array of const wchar_t* strings, where the first character of each string is the replacement character and the remaining characters are the ones to replace:


// This CODE cannot get formatted by the CP editor
const wchar_t* pchangers[] =
{
L"aàáảãạăằắẳẵặâầấẩẫậ",
L"AÀÁẢÃẠĂẰẮẲẴẶÂẦẤẨẪẬ",
L"OÒÓỎÕỌÔỒỐỔỖỘƠỜỚỞỠỢ",
L"EÈÉẺẼẸÊỀẾỂỄỆ",
L"UÙÚỦŨỤƯỪỨỬỮỰ",
L"IÌÍỈĨỊ",
L"YỲÝỶỸỴ",
L"DĐ",
L"oòóỏõọôồốổỗộơờớởỡợ",
L"eèéẻẽẹêềếểễệ",
L"uùúủũụưừứửữự",
L"iìíỉĩị",
L"yỳýỷỹỵ",
L"dđ"
};
// END CODE
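
To make the layout concrete, here is a minimal sketch that expands a single string of this form into (character to replace, replacement) pairs with a plain loop. The helper name expandChanger is hypothetical and not part of the original code, and the sample uses \u escapes instead of literal accented characters only to sidestep source-encoding problems.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

// Hypothetical helper (not from the article): expands one changer string
// into key -> replacement pairs. changer[0] is the replacement character,
// every later character is a key that maps to it.
std::map<wchar_t, wchar_t> expandChanger(const std::wstring& changer)
{
    std::map<wchar_t, wchar_t> pairs;
    // Start at index 1: index 0 is the replacement itself, not a key.
    for (std::wstring::size_type i = 1; i < changer.size(); ++i)
        pairs.insert(std::make_pair(changer[i], changer[0]));
    return pairs;
}
```

For example, expandChanger(L"a\u00E0\u00E1") yields two entries, both mapping to L'a'; the plain a itself is not a key, so unaccented text passes through untouched.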

The following CharMap class is constructed from a std::vector<std::wstring> of such strings and uses it to populate its std::map<wchar_t, wchar_t> charmap member: the keys are the characters after the first one, and the value for each key is the first character of its string:
#include <map>
#include <vector>
#include <string>
#include <algorithm>
#include <iterator>
class CharMap
{
    std::map<wchar_t, wchar_t> charmap;
public:
    // Each changer holds the replacement character first, then the characters to replace.
    CharMap(const std::vector<std::wstring>& changers)
    {
        std::for_each(changers.begin(), changers.end(), [&](const std::wstring& changer){
            if (changer.empty())
                return; // no replacement character, so nothing to map
            std::transform(changer.begin() + 1, changer.end(), std::inserter(charmap, charmap.end()), [&](wchar_t wc){
                return std::make_pair(wc, changer[0]);});
        });
    }
    // Returns a copy of in with every mapped character replaced.
    std::wstring operator()(const std::wstring& in) const
    {
        std::wstring out(in.length(), L'\0');
        std::transform(in.begin(), in.end(), out.begin(), [&](wchar_t wc) ->wchar_t {
            auto it = charmap.find(wc);
            return it == charmap.end() ? wc : it->second;});
        return out;
    }
};  // class CharMap

CharMap::operator() constructs a std::wstring out of the same length as in, copies each character of in into it, swapping any character found in charmap for its replacement, and returns out.

Now let's put it to work:


#include <iostream>
    
std::vector<std::wstring> changers(pchangers, pchangers + sizeof pchangers / sizeof (wchar_t*));
int main()
{
// This CODE cannot get formatted by the CP editor

std::wcout << CharMap(changers)(L" người mình.mp3 ") << std::endl;
// END unformatted CODE
    return 0;
}
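
The lookup step can also be checked without relying on how the console renders wide output. The following is a minimal sketch under stated assumptions: Table, sampleTable and stripMarks are hypothetical names that do not appear in the original code, the table is trimmed to the three accented characters of the sample string, and the input is assumed to use precomposed characters, as the arrays above do.

```cpp
#include <cassert>
#include <map>
#include <string>

typedef std::map<wchar_t, wchar_t> Table;

// Hypothetical trimmed table: only the accented characters of the sample.
Table sampleTable()
{
    Table table;
    table[L'\u01B0'] = L'u';  // u with horn
    table[L'\u1EDD'] = L'o';  // o with horn and grave
    table[L'\u00EC'] = L'i';  // i with grave
    return table;
}

// Hypothetical helper: returns a copy of in where every character found
// in the table is replaced; all other characters are copied unchanged.
std::wstring stripMarks(const Table& table, const std::wstring& in)
{
    std::wstring out(in);
    for (std::wstring::size_type i = 0; i < out.size(); ++i) {
        Table::const_iterator it = table.find(out[i]);
        if (it != table.end())
            out[i] = it->second;
    }
    return out;
}
```

With this table, stripMarks(sampleTable(), L"ng\u01B0\u1EDDi m\u00ECnh.mp3") compares equal to L"nguoi minh.mp3". Characters written as a base letter followed by a separate combining mark would not match any key and would pass through unchanged.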

Quite a demonstration of the power of C++0x, isn't it?


If you have pasting problems with Unicode strings, download the full code CharMap.zip (1 KB).


cheers,
AR

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)