loplugin:salunicodeliteral
For the c-char in the u'...' literal, the preceding commits consistently use:
* a simple-escape-sequence if the original code already used one
* \0 for U+0000
* the (\ escaped, for ' and \) source character matching U+0020..7E (even if it
is not a basic source character)
* a consistently four-digit hexadecimal-escape-sequence otherwise, \xNNNN
For non-surrogate code points, the last case could probably also use \uNNNN
universal-character-names. However, for one, it isn't quite clear to me whether
conversion of such to members of the execution character set in character
literals (in translation phase 5) is implementation-specific. And for another,
the current C++ standard references the dated (no pun intended) ISO/IEC
10646-1:1993 specification, rather than the current ISO/IEC 10646:2014, and
requires that a universal-character-name designate a character with a specific
"character short name in ISO/IEC 10646", but I do not find a specification of a
"short name" in ISO/IEC 10646:2014 and don't have access to ISO/IEC
10646-1:1993, so am not sure whether that would e.g. cover noncharacters like
U+FFFF.
(The only exception is one occurrence of u'\x6C' in bestFitOpenSymbolToMSFont,
filter/source/msfilter/util.cxx, where it is clear from the context that the
value denotes neither a Unicode code point nor a UTF-16 code unit, but rather an
index into the Wingdings font glyph table.)
Change-Id: If36b94168428ba1e05977c370aceaa7e90131e90
2017-04-28 17:59:50 +02:00
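The four literal conventions listed in the commit message can be illustrated with a minimal sketch. It assumes that `sal_Unicode` is an alias for `char16_t` (as the plugin's diagnostic text suggests for LIBO_INTERNAL_ONLY code); the constant names are hypothetical and chosen only for this illustration.

```cpp
#include <cassert>

// Assumption for this sketch: sal_Unicode is char16_t in LIBO_INTERNAL_ONLY code.
using sal_Unicode = char16_t;

// 1. A simple-escape-sequence is kept where the original code already used one.
constexpr sal_Unicode kNewline = u'\n';

// 2. U+0000 is written as \0.
constexpr sal_Unicode kNul = u'\0';

// 3. U+0020..7E use the matching source character, backslash-escaped for ' and \.
constexpr sal_Unicode kLetterA = u'A';
constexpr sal_Unicode kApostrophe = u'\'';
constexpr sal_Unicode kBackslash = u'\\';

// 4. Everything else uses a four-digit hexadecimal-escape-sequence, \xNNNN.
constexpr sal_Unicode kNoBreakSpace = u'\x00A0';
constexpr sal_Unicode kEuroSign = u'\x20AC';
constexpr sal_Unicode kNoncharacter = u'\xFFFF';

// The literals denote the expected UTF-16 code unit values.
static_assert(kNewline == 0x000A, "simple-escape-sequence");
static_assert(kNul == 0x0000, "U+0000");
static_assert(kLetterA == 0x0041, "printable ASCII source character");
static_assert(kApostrophe == 0x0027, "escaped apostrophe");
static_assert(kBackslash == 0x005C, "escaped backslash");
static_assert(kNoBreakSpace == 0x00A0, "four-digit hex escape");
static_assert(kEuroSign == 0x20AC, "four-digit hex escape");
static_assert(kNoncharacter == 0xFFFF, "noncharacter via hex escape");
```

The hexadecimal escapes sidestep the universal-character-name questions raised above, since `\xNNNN` specifies the code unit value directly.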

/* -*- Mode: C++; tab-width: 4; indent-tabs-mode: nil; c-basic-offset: 4; fill-column: 100 -*- */
/*
 * This file is part of the LibreOffice project.
 *
 * This Source Code Form is subject to the terms of the Mozilla Public
 * License, v. 2.0. If a copy of the MPL was not distributed with this
 * file, You can obtain one at http://mozilla.org/MPL/2.0/.
 */

#ifndef LO_CLANG_SHARED_PLUGINS

#include "check.hxx"
#include "plugin.hxx"

namespace {

bool isAsciiCharacterLiteral(Expr const * expr) {
    if (auto const e = dyn_cast<CharacterLiteral>(expr)) {
        return e->getKind() == CharacterLiteral::Ascii;
    }
    return false;
}

class SalUnicodeLiteral final:
    public loplugin::FilteringPlugin<SalUnicodeLiteral>
{
public:
    explicit SalUnicodeLiteral(loplugin::InstantiationData const & data):
        FilteringPlugin(data) {}

    bool VisitCXXStaticCastExpr(CXXStaticCastExpr const * expr) {
        check(expr);
        return true;
    }

    bool VisitCXXFunctionalCastExpr(CXXFunctionalCastExpr const * expr) {
        check(expr);
        return true;
    }

    bool VisitCStyleCastExpr(CStyleCastExpr const * expr) {
        check(expr);
        return true;
    }

    bool preRun() override {
        return compiler.getLangOpts().CPlusPlus
            && compiler.getPreprocessor().getIdentifierInfo(
                "LIBO_INTERNAL_ONLY")->hasMacroDefinition();
    }

    void run() override {
        if (preRun())
            TraverseDecl(compiler.getASTContext().getTranslationUnitDecl());
    }

private:
    void check(ExplicitCastExpr const * expr) {
        if (ignoreLocation(expr)
            || isInUnoIncludeFile(expr->getExprLoc()))
                //TODO: '#ifdef LIBO_INTERNAL_ONLY' within UNO include files
        {
            return;
        }
        for (auto t = expr->getTypeAsWritten();;) {
            auto const tt = t->getAs<TypedefType>();
            if (tt == nullptr) {
                return;
            }
            if (loplugin::TypeCheck(t).Typedef("sal_Unicode")
                .GlobalNamespace())
            {
                break;
            }
            t = tt->desugar();
        }
        auto const e1 = expr->getSubExprAsWritten();
        auto const loc = compat::getBeginLoc(e1);
        if (loc.isMacroID()
            && compiler.getSourceManager().isAtStartOfImmediateMacroExpansion(
                loc))
        {
            return;
        }
        auto const e2 = e1->IgnoreParenImpCasts();
        if (isAsciiCharacterLiteral(e2) || isa<IntegerLiteral>(e2)) {
            report(
                DiagnosticsEngine::Warning,
                ("in LIBO_INTERNAL_ONLY code, replace literal cast to %0 with a"
                 " u'...' char16_t character literal"),
                e2->getExprLoc())
                << expr->getTypeAsWritten() << expr->getSourceRange();
        }
    }
};

static loplugin::Plugin::Registration<SalUnicodeLiteral> salunicodeliteral("salunicodeliteral");

} // namespace

#endif // LO_CLANG_SHARED_PLUGINS

/* vim:set shiftwidth=4 softtabstop=4 expandtab cinoptions=b1,g0,N-s cinkeys+=0=break: */
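As a rough illustration of what the check in this file looks for, the sketch below shows the three explicit cast forms it visits (static_cast, functional cast, C-style cast) applied to an ASCII character literal or an integer literal, next to the `u'...'` literal that the diagnostic asks for. It assumes `sal_Unicode` is an alias for `char16_t`; the variable names are hypothetical, and whether a given cast is actually reported also depends on the location and macro filters in `check`.

```cpp
#include <cassert>

// Assumption for this sketch: sal_Unicode is char16_t in LIBO_INTERNAL_ONLY code.
using sal_Unicode = char16_t;

// Cast forms the plugin visits; each wraps an ASCII character literal or an
// integer literal, which is the pattern the warning is about.
constexpr sal_Unicode viaStaticCast = static_cast<sal_Unicode>('a');
constexpr sal_Unicode viaFunctionalCast = sal_Unicode(0);
constexpr sal_Unicode viaCStyleCast = (sal_Unicode) 0x20AC;

// The replacements the diagnostic suggests: plain char16_t character literals.
constexpr sal_Unicode literalA = u'a';
constexpr sal_Unicode literalNul = u'\0';
constexpr sal_Unicode literalEuro = u'\x20AC';

// Both spellings yield the same value, so the rewrite is behavior-preserving.
static_assert(viaStaticCast == literalA, "static_cast form");
static_assert(viaFunctionalCast == literalNul, "functional cast form");
static_assert(viaCStyleCast == literalEuro, "C-style cast form");
```

A cast of a non-literal operand, such as `static_cast<sal_Unicode>(n)` for some variable `n`, does not match the literal test in `check` and so would not be reported.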