Mar 26 2010
4M ADSL
Finally upgraded to 4M ADSL today, so downloads should be a bit more pleasant. Too bad BTChina is already gone, sigh...
China Telecom's IVR is surprisingly capable: it even dials out proactively and runs satisfaction surveys over IVR, tracking FCR and KPI metrics, haha...
Mar 23 2010
Ubuntu's multilingual environment and locales
A couple of days ago, while preparing some material, I installed Ubuntu in a virtual machine to play with. I picked English during installation, and since I happened to be writing about character encodings, I gave it a try:
shell> export LANG=zh_CN.GBK
Nothing happened; the locale fell back to the default C.
shell> locale -a
C
en_US.utf8
POSIX
Sure enough, nothing Chinese is there. Search for language packages:
shell> apt-cache search language
That finds a whole pile; let's install the Chinese ones first:
shell> apt-get install language-pack-zh language-pack-zh-base
Once installed, run locale -a again:
C
en_US.utf8
POSIX
zh_CN.utf8
zh_HK.utf8
zh_SG.utf8
zh_TW.utf8
They're all utf8-encoded. Annoying, haha.
shell> ls /var/lib/locales/supported.d/
local zh
So this is where the list of locales the system should generate lives.
Edit /var/lib/locales/supported.d/zh and add:
zh_CN.GBK GBK
zh_CN.GB2312 GB2312
Then run:
shell> locale-gen
or: shell> dpkg-reconfigure locales
to regenerate the locales. Now check whether they were added successfully:
shell> locale -a
C
en_US.utf8
POSIX
zh_CN.gb2312
zh_CN.gbk
zh_CN.utf8
zh_HK.utf8
zh_SG.utf8
zh_TW.utf8
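A quick way to see why a GBK locale is worth generating at all: the same character has a different byte sequence under each encoding. A small check with iconv (a sketch; it assumes glibc's iconv and POSIX od are available):

```shell
# The character 中, written here as its UTF-8 bytes e4 b8 ad,
# becomes the two bytes d6 d0 when converted to GBK.
printf '\344\270\255' | iconv -f UTF-8 -t GBK | od -An -tx1 | tr -d ' \n'
# prints d6d0
```

Any program that honors LANG=zh_CN.GBK will expect the d6d0 form on input and output, which is exactly what the generated locale provides.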
Mar 18 2010
Deprecate tcp_tw_{reuse,recycle}
Use the tcp_tw_{reuse,recycle} kernel parameters with caution
I wrote on this blog before that one way to fix having too many TIME_WAIT connections is to enable fast TCP TIME_WAIT recycling.
But a number of recently reported bugs show that with tcp_tw_recycle enabled, access from clients behind an internal NAT can be affected: with the option on, the kernel assumes a single IP carries only one valid TCP timestamp, so when the clients behind the gateway emit different timestamps, the server drops those TCP segments.
I'd advise caution with both tcp_tw_recycle and tcp_tw_reuse. The original post reads:
We’ve recently had a long discussion about the CVE-2005-0356 time stamp denial-of-service
attack. It turned out that Linux is only vulnerable to this problem when tcp_tw_recycle
is enabled (which it is not by default).
In general these two options are not really usable in today’s internet because they
make the (often false) assumption that a single IP address has a single TCP time stamp /
PAWS clock. This assumption breaks both NAT/masquerading and also opens Linux to denial
of service attacks (see the CVE description)
Due to these numerous problems I propose to remove this code for 2.6.26
Signed-off-by: Andi Kleen
Index: linux/Documentation/feature-removal-schedule.txt
===================================================================
--- linux.orig/Documentation/feature-removal-schedule.txt
+++ linux/Documentation/feature-removal-schedule.txt
@@ -354,3 +354,15 @@ Why: The support code for the old firmwa
and slightly hurts runtime performance. Bugfixes for the old firmware
are not provided by Broadcom anymore.
Who: Michael Buesch
+
+---------------------------
+
+What: Support for /proc/sys/net/ipv4/tcp_tw_{reuse,recycle} = 1
+When: 2.6.26
+Why: Enabling either of those makes Linux TCP incompatible with masquerading and
+ also opens Linux to the CVE-2005-0356 denial of service attack. And these
+ optimizations are explicitly disallowed by some benchmarks. They also have
+ been disabled by default for more than ten years so they’re unlikely to be used
+ much. Due to these fatal flaws it doesn’t make sense to keep the code.
+Who: Andi Kleen
+
--
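Given the post above, the safe course in practice is to leave both options off (they default to 0 anyway). A hedged /etc/sysctl.conf sketch; the knob names are the real ones under /proc/sys/net/ipv4, and the port-range value is just an illustrative choice:

```
# Both default to 0; make it explicit so nobody "optimizes" them on.
net.ipv4.tcp_tw_recycle = 0
net.ipv4.tcp_tw_reuse = 0

# If TIME_WAIT exhaustion is the real problem, widening the
# ephemeral port range is a far less dangerous lever to pull.
net.ipv4.ip_local_port_range = 10240 65535
```

Apply with sysctl -p after editing.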
Mar 09 2010
MySQL full-text indexing and Chinese (MediaWiki Chinese search problems)
Today I dug through MediaWiki's source code. Its Chinese search isn't very accurate and I wanted to know why, so I looked at how its search is implemented.
The database is MySQL, and searching goes through a full-text index table:
CREATE TABLE `searchindex` (
  `si_page` int(10) unsigned NOT NULL,
  `si_title` varchar(255) NOT NULL DEFAULT '',
  `si_text` mediumtext NOT NULL,
  UNIQUE KEY `si_page` (`si_page`),
  FULLTEXT KEY `si_title` (`si_title`),
  FULLTEXT KEY `si_text` (`si_text`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8
MySQL's FULLTEXT support for Chinese has never been great: a raw utf8 string contains no word separators, so the index is useless on it. MediaWiki works around this with a trick: before storing the text, it converts every utf8 character into a U8xxxx token, separated by ASCII spaces, which makes the text searchable.
MediaWiki's character-conversion code is quite useful, heh:
cat wiki/languages/classes/LanguageZh_cn.php
<?php
/**
 * @addtogroup Language
 */
class LanguageZh_cn extends Language {
    function stripForSearch( $string ) {
        # MySQL fulltext index doesn't grok utf-8, so we
        # need to fold cases and convert to hex
        # we also separate characters as "words"
        if( function_exists( 'mb_strtolower' ) ) {
            return preg_replace(
                "/([\\xc0-\\xff][\\x80-\\xbf]*)/e",
                "' U8' . bin2hex( \"$1\" )",
                mb_strtolower( $string ) );
        } else {
            list( , $wikiLowerChars ) = Language::getCaseMaps();
            return preg_replace(
                "/([\\xc0-\\xff][\\x80-\\xbf]*)/e",
                "' U8' . bin2hex( strtr( \"\$1\", \$wikiLowerChars ) )",
                $string );
        }
    }
}
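What stripForSearch produces can be reproduced from the shell for a single character (a sketch; od and tr are POSIX, and the "U8" + bin2hex token format is taken from the code above):

```shell
# 中 is the three UTF-8 bytes e4 b8 ad; MediaWiki indexes it
# as the ASCII "word" U8e4b8ad, which MySQL fulltext can handle.
hex=$(printf '\344\270\255' | od -An -tx1 | tr -d ' \n')
echo "U8${hex}"
# prints U8e4b8ad
```

Because each token is plain ASCII bounded by spaces, MySQL's default tokenizer sees one "word" per Chinese character.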
The code above turns each Chinese character into a U8xxxx token followed by a space, after which MySQL's full-text index can be used. MySQL 5.0 and later can actually build full-text indexes on utf8 text directly, but because of the word-segmentation problem you would still need to space-separate every character, and you would also have to lower the minimum indexed word length, so MediaWiki's approach remains the more convenient one.
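For reference, if you did want to try native utf8 full-text indexing instead, the minimum indexed word length has to come down from its default of 4, since a single CJK character counts as one "word". A hypothetical my.cnf fragment (ft_min_word_len is the real MyISAM variable; the value here is just the obvious choice for CJK):

```
[mysqld]
# Default is 4, which silently drops single-character CJK "words".
ft_min_word_len = 1
```

Existing FULLTEXT indexes only pick this up after a rebuild, e.g. REPAIR TABLE ... QUICK on each MyISAM table.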
Because each character is indexed as its own word and order is not enforced, the results don't match how Chinese is actually read. A small source change fixes that: wrap the terms in quotes as a phrase, and the results become much more precise.
vim wiki/includes/SearchMySQL4.php
and find this code:
if( $this->strictMatching && ($terms[1] == '') ) {
    $terms[1] = '+';
}
$searchon .= $terms[1] . $wgContLang->stripForSearch( $terms[2] );
Change it to:
if( $this->strictMatching && ($terms[1] == '') ) {
    // $terms[1] = '+';
    $terms[1] = '+"';
}
$searchon .= $terms[1] . $wgContLang->stripForSearch( $terms[2] ) . '"';
and exact phrase search works.
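To make the effect concrete: for the two-character query 中文, the boolean-mode search string changes roughly like this (a sketch following the U8 token scheme and the patch above, not MediaWiki's literal output):

```shell
# Before the patch: stripForSearch emits space-separated tokens and the
# '+' effectively binds only the first one, so the characters can match
# anywhere in the page, in any order.
before='+ U8e4b8ad U8e69687'
# After the patch: the tokens are wrapped in double quotes, so MySQL
# boolean mode treats them as an adjacent, ordered phrase.
after='+" U8e4b8ad U8e69687"'
echo "$after"
# prints +" U8e4b8ad U8e69687"
```

The quoted form is what restores the left-to-right adjacency a Chinese reader expects.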
Note that this change only applies to older versions of MediaWiki; recent versions already ship equivalent logic. If you're still not satisfied with the results, the only option is to implement the MySQL indexing part yourself in an external extension; for a public site, the simplest route is a Google search with the site: operator. Heh.
For reference, here is the relevant part of the newer SearchMySQL.php; it's written rather obscurely, haha:
if( preg_match_all( '/([-+<>~]?)(([' . $lc . ']+)(\*?)|"[^"]*")/',
        $filteredText, $m, PREG_SET_ORDER ) ) {
    foreach( $m as $bits ) {
        @list( /* all */, $modifier, $term, $nonQuoted, $wildcard ) = $bits;
        if( $nonQuoted != '' ) {
            $term = $nonQuoted;
            $quote = '';
        } else {
            $term = str_replace( '"', '', $term );
            $quote = '"';
        }
It looks like Chinese terms take the $quote = '"'; branch, i.e. the quotes are added for you automatically.