Mar 09 2010

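MySQL full-text indexing and Chinese text (the MediaWiki Chinese search problem)

Tag: Tech ssmax @ 15:24:59

Today I browsed through the MediaWiki source code. Its Chinese search is not very accurate and I wanted to find out why, so I looked at how the search is implemented.

The database is MySQL, and the search runs against a table with full-text indexes: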

CREATE TABLE `searchindex` (
`si_page` int(10) unsigned NOT NULL,
`si_title` varchar(255) NOT NULL DEFAULT '',
`si_text` mediumtext NOT NULL,
UNIQUE KEY `si_page` (`si_page`),
FULLTEXT KEY `si_title` (`si_title`),
FULLTEXT KEY `si_text` (`si_text`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8

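MySQL's FULLTEXT support for Chinese has never been very good. If you store UTF-8 strings directly, there are no word delimiters, so the index is ineffective. MediaWiki gets around this with a trick: it converts each UTF-8 character into a token of the form U8xxxx and stores the tokens separated by ASCII spaces, which makes the text searchable.

MediaWiki's character conversion code, which is quite useful: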

cat wiki/languages/classes/LanguageZh_cn.php

/**
 * @addtogroup Language
 */
class LanguageZh_cn extends Language {
    function stripForSearch( $string ) {
        # MySQL fulltext index doesn't grok utf-8, so we
        # need to fold cases and convert to hex
        # we also separate characters as "words"
        if( function_exists( 'mb_strtolower' ) ) {
            return preg_replace(
                "/([\\xc0-\\xff][\\x80-\\xbf]*)/e",
                "' U8' . bin2hex( \"$1\" )",
                mb_strtolower( $string ) );
        } else {
            list( , $wikiLowerChars ) = Language::getCaseMaps();
            return preg_replace(
                "/([\\xc0-\\xff][\\x80-\\xbf]*)/e",
                "' U8' . bin2hex( strtr( \"\$1\", \$wikiLowerChars ) )",
                $string );
        }
    }
}

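The code above converts each Chinese character into a U8xxxx token preceded by a space, after which MySQL's full-text index can be used. In fact, MySQL versions after 5.0 can build full-text indexes on UTF-8 characters directly, but because of the word segmentation problem you still have to separate every Chinese character with a space and also lower the minimum indexed word length (ft_min_word_len), so MediaWiki's approach remains the more convenient one.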
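As a rough illustration of the conversion described above, here is a minimal Python sketch of the same idea. It is not MediaWiki code: the function name `strip_for_search` is only borrowed from the PHP method for readability, and the sketch assumes the input is ordinary text that encodes to valid UTF-8.

```python
import re

# One UTF-8 lead byte (0xC0-0xFF) followed by its continuation bytes (0x80-0xBF),
# mirroring the regular expression used in the PHP code above.
_MULTIBYTE = re.compile(rb"[\xc0-\xff][\x80-\xbf]*")


def strip_for_search(text: str) -> str:
    """Lower-case the text and turn every non-ASCII character into ' U8<hex>'."""
    raw = text.lower().encode("utf-8")
    converted = _MULTIBYTE.sub(
        lambda m: b" U8" + m.group(0).hex().encode("ascii"), raw
    )
    # Only ASCII bytes remain after the substitution.
    return converted.decode("ascii")


if __name__ == "__main__":
    # "中" is e4 b8 ad and "文" is e6 96 87 in UTF-8.
    print(strip_for_search("Wiki中文"))  # prints "wiki U8e4b8ad U8e69687"
```

Each Chinese character becomes its own space-separated "word", which is why MySQL can index it, and also why word order is lost unless the query is sent as a phrase.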

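Because each Chinese character is treated as a separate word and the words are not matched in order, the results do not match what Chinese speakers expect from a search. A small change to the source code, wrapping the search term in double quotes so that it is matched as a phrase, gives much more precise results.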

vim wiki/includes/SearchMySQL4.php

Find the following code:

if( $this->strictMatching && ($terms[1] == '') ) {
$terms[1] = '+';
}
$searchon .= $terms[1] . $wgContLang->stripForSearch( $terms[2] );

Change it to:


if( $this->strictMatching && ($terms[1] == '') ) {
// $terms[1] = '+';
$terms[1] = '+"';
}
$searchon .= $terms[1] . $wgContLang->stripForSearch( $terms[2] ) . '"';

After this change, searches match the exact phrase.


Feb 03 2010

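The difference between tun and tap in OpenVPN

Tag: Tech ssmax @ 22:25:03

tun devices encapsulate IPv4 or IPv6 (OSI Layer 3) while tap devices encapsulate Ethernet 802.3 (OSI Layer 2).

I spent the whole afternoon today connecting the networks of several regions with OpenVPN. With tun you get an emulated point-to-point environment: you can reach other IPs in the same subnet, but broadcast does not work, so I could not set up a gateway that forwards traffic to certain other subnets.

Only later did I notice the tap mode. I had never paid attention to what it was for; after checking the manual I found that it emulates a LAN environment. Very nice: broadcast works, and you can designate any gateway you like.

PS: I finally managed to buy a train ticket today... they are really hard to get.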


Dec 24 2009

Linux Kernel Configuration

Tag: Tech ssmax @ 15:38:59
You can determine the amount of System V IPC resources available by looking at the contents of the following files:
  /proc/sys/kernel/shmmax - The maximum size of a shared memory segment.
  /proc/sys/kernel/shmmni - The maximum number of shared memory segments.
  /proc/sys/kernel/shmall - The maximum amount of shared memory
                              that can be allocated.
  /proc/sys/kernel/sem    - The maximum number and size of semaphore sets
                              that can be allocated.
For example, to view the maximum size of a shared memory segment that can be created enter:
  cat /proc/sys/kernel/shmmax

To change the maximum size of a shared memory segment to 256 MB enter:

  echo 268435456 > /proc/sys/kernel/shmmax

To view the maximum number of semaphores and semaphore sets which can be created enter:

cat /proc/sys/kernel/sem

This returns 4 numbers indicating:

 SEMMSL - The maximum number of semaphores in a semaphore set
 SEMMNS - The maximum number of semaphores in the system
 SEMOPM - The maximum number of operations in a single semop call
 SEMMNI - The maximum number of semaphore sets

 For WebSphere MQ:

  • the SEMMSL value must be 128 or greater
  • the SEMOPM value must be 5 or greater
  • the SEMMNS value must be 16384 or greater
  • the SEMMNI value must be 1024 or greater

 To increase the maximum number of semaphores available to WebSphere MQ, you should update the SEMMNS and SEMMNI values.
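The check against the WebSphere MQ minimums listed above can be sketched in a few lines of Python. This is only an illustration: the helper names `parse_sem` and `too_small` are made up for this example, and the sample values used when /proc is not readable are arbitrary.

```python
from typing import Dict, List

# Order of the four numbers in /proc/sys/kernel/sem, as listed above.
SEM_FIELDS = ("SEMMSL", "SEMMNS", "SEMOPM", "SEMMNI")

# Minimum values quoted above for WebSphere MQ.
MQ_MINIMUMS = {"SEMMSL": 128, "SEMMNS": 16384, "SEMOPM": 5, "SEMMNI": 1024}


def parse_sem(line: str) -> Dict[str, int]:
    """Parse the whitespace-separated contents of /proc/sys/kernel/sem."""
    values = [int(v) for v in line.split()]
    if len(values) != len(SEM_FIELDS):
        raise ValueError("expected four numbers, got %r" % line)
    return dict(zip(SEM_FIELDS, values))


def too_small(current: Dict[str, int]) -> List[str]:
    """Return the names of the settings that are below the MQ minimums."""
    return [name for name in SEM_FIELDS if current[name] < MQ_MINIMUMS[name]]


if __name__ == "__main__":
    try:
        with open("/proc/sys/kernel/sem") as f:
            current = parse_sem(f.read())
    except OSError:
        # Arbitrary sample values, used only when /proc is not available.
        current = parse_sem("250 32000 32 128")
    print(current, "too small:", too_small(current))
```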

 

Maximum open files

If the system is heavily loaded, you might need to increase the maximum possible number of open files. If your distribution supports the proc filesystem you can do this by issuing the following command:

  echo 32768 > /proc/sys/fs/file-max

If you are using a pluggable security module such as PAM (Pluggable Authentication Module), ensure that this does not unduly restrict the number of open files for the ‘mqm’ user.

 

TCP Tuning Background

The following is a summary of techniques to maximize TCP WAN throughput.

TCP uses what is called the “congestion window”, or CWND, to determine how many packets can be sent at one time. The larger the congestion window size, the higher the throughput. The TCP “slow start” and “congestion avoidance” algorithms determine the size of the congestion window. The maximum congestion window is related to the amount of buffer space that the kernel allocates for each socket. For each socket, there is a default value for the buffer size, which can be changed by the program using a system library call just before opening the socket. There is also a kernel enforced maximum buffer size. The buffer size can be adjusted for both the send and receive ends of the socket.

To get maximal throughput it is critical to use optimal TCP send and receive socket buffer sizes for the link you are using. If the buffers are too small, the TCP congestion window will never fully open up. If the receiver buffers are too large, TCP flow control breaks and the sender can overrun the receiver, which will cause the TCP window to shut down. This is likely to happen if the sending host is faster than the receiving host. Overly large windows on the sending side are not a big problem as long as you have excess memory.

The optimal buffer size is twice the bandwidth*delay product of the link:

buffer size = 2 * bandwidth * delay

The ping program can be used to get the delay, and tools such as pathrate to get the end-to-end capacity (the bandwidth of the slowest hop in your path). Since ping gives the round trip time (RTT), this formula can be used instead of the previous one:

buffer size = bandwidth * RTT.

For example, if your ping time is 50 ms, and the end-to-end network consists of all 100 BT Ethernet and OC3 (155 Mbps), the TCP buffers should be .05 sec * (100 Mbits / 8 bits) = 625 KBytes. (When in doubt, 10 MB/s is a good first approximation for network bandwidth on high-speed R and E networks like ESnet).
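The arithmetic in the example above is easy to reproduce. The following Python sketch only restates the formula buffer size = bandwidth * RTT; the function name and parameter names are chosen for this illustration.

```python
def buffer_size_bytes(bandwidth_bits_per_s: float, rtt_s: float) -> float:
    """Bandwidth-delay product: bandwidth * RTT, converted from bits to bytes."""
    return bandwidth_bits_per_s / 8 * rtt_s


if __name__ == "__main__":
    # 100 Mbit/s bottleneck and a 50 ms ping time, as in the example above.
    size = buffer_size_bytes(100e6, 0.05)
    print("%.0f KBytes" % (size / 1000))
```

For a 1 Gbps path with the same 50 ms RTT the same formula gives ten times as much, which is why the default maximum buffer sizes discussed next become a problem on fast links.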

There are 2 TCP settings you need to know about: the default TCP send and receive buffer size, and the maximum TCP send and receive buffer size. Note that most UNIX OSes by default have a maximum TCP buffer size that is way too small for 1 Gbps pipes, and all have a maximum that is too small for 10 Gbps flows. For instructions on how to increase the maximum TCP buffer, see the OS specific instructions for setting system defaults.

Linux, FreeBSD, Windows, and OSX all now support TCP autotuning, so you no longer need to worry about setting the default buffer sizes. But for Solaris or other older OSes you’ll need to use the UNIX setsockopt call in your sender and receiver to set the optimal buffer size for the link you are using.

/proc/sys/net/core/rmem_max - Maximum TCP Receive Window
/proc/sys/net/core/wmem_max - Maximum TCP Send Window
/proc/sys/net/ipv4/tcp_timestamps - timestamps (RFC 1323) add 12 bytes to the TCP header…
/proc/sys/net/ipv4/tcp_sack - TCP selective acknowledgements.
/proc/sys/net/ipv4/tcp_window_scaling - support for large TCP Windows (RFC 1323). Needs to be set to 1 if the Max TCP Window


Dec 21 2009

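Running 32-bit applications on 64-bit Linux

Tag: Tech ssmax @ 15:22:44

    Most Linux distributions have a version for x86_64 processors. Typical x86_64 processors include the AMD Athlon II and the Intel Xeon. These distributions have their own dedicated software repositories, which provide binary packages for all the applications they support, so if you are content with installing software the Linux way you may never need to run a 32-bit program.

    Some commercial Linux software, games in particular, is only available in a 32-bit version. For such special reasons you may need to configure your computer to run 32-bit software.

    When you run these 32-bit programs on 64-bit Linux, you will often see this error:

    No such file or directory

    Installing the 32-bit support libraries solves the problem.

    Because x86_64 processors are designed as an extension of x86, they also support 32-bit programs. On Linux, all you need to do is install the libraries these programs require. Fortunately, most distributions already package them. On Ubuntu/Debian, for example, the package is called ia32-libs. To install it, open a terminal and enter: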

    sudo apt-get install ia32-libs


Dec 15 2009

Fix for Windows 2003 not automatically assigning a drive letter to a USB hard drive

Tag: Tech ssmax @ 22:52:35
Start -> Run -> mountvol /e -> Enter -> reboot the machine. Windows 2003 will then automatically assign drive letters to USB hard drives.

Nov 24 2009

Ubuntu VGA resolution in a virtual machine

Tag: Tech ssmax @ 14:27:52

#  FRAMEBUFFER RESOLUTION SETTINGS
#     +-------------------------------------------------+
#          | 640x480    800x600    1024x768   1280x1024
#      ----+--------------------------------------------
#      256 | 0x301=769  0x303=771  0x305=773   0x307=775
#      32K | 0x310=784  0x313=787  0x316=790   0x319=793
#      64K | 0x311=785  0x314=788  0x317=791   0x31A=794
#      16M | 0x312=786  0x315=789  0x318=792   0x31B=795
#     +-------------------------------------------------+

 

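Ubuntu 9.10 uses the new GRUB 2, and the boot parameters seem to have changed quite a bit. To adjust the resolution in a virtual machine:

Method 1: the old vga=788

Edit /etc/default/grub and set GRUB_CMDLINE_LINUX="vga=788"

Save, then run update-grub

But this produces the following message:

vga=788 is deprecated and asks me to use "set gfxpayload=800x600x16;800x600" before the linux line.

In other words, the vga parameter is deprecated and a different method should be used:

Method 2:

Edit /boot/grub/grub.cfg

Find the lines that boot Linux

Add set gfxpayload=800x600x16 (note: without the semicolon), as follows: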

### BEGIN /etc/grub.d/10_linux ###
menuentry "Ubuntu, Linux 2.6.31-14-generic-pae" {
        recordfail=1
        if [ -n ${have_grubenv} ]; then save_env recordfail; fi
        set quiet=1
        insmod ext2
        set root=(hd0,1)
        search --no-floppy --fs-uuid --set 9a441a57-5a71-4800-b46d-2e4c1cec6dee
        set gfxpayload=800x600x16
        linux   /boot/vmlinuz-2.6.31-14-generic-pae root=UUID=9a441a57-5a71-4800-b46d-2e4c1cec6dee ro   quiet splash
        initrd  /boot/initrd.img-2.6.31-14-generic-pae
}


Nov 06 2009

The most common mistakes in cron...

Tag: Tech ssmax @ 12:29:53

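A common mistake is using % in a command without escaping it with a backslash (\). cron turns an unescaped % into a newline, even inside double quotes. For example:

# Wrong:
1 2 3 4 5 touch ~/error_`date "+%Y%m%d"`.txt

The email sent by the cron daemon will contain this error message:

/bin/sh: unexpected EOF while looking for `'''''''
# Correct:
1 2 3 4 5 touch ~/right_$(date +\%Y\%m\%d).txt

# Single quotes are sometimes suggested as a fix, but according to crontab(5) an unescaped % is still turned into a newline, so test this form before relying on it:
1 2 3 4 5 touch ~/error_$(date '+%Y%m%d').txt

# Escaping % inside single quotes can backfire: on cron versions that pass the backslashes through to the shell, this creates a file like ~/error_\2006\04\03.txt
1 2 3 4 5 touch ~/error_$(date '+\%Y\%m\%d').txt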

Here is another common mistake:

# Prepare for the daylight savings time shift
59 1 1-7 4 0 /root/shift_my_times.sh

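At first glance this looks like it runs shift_my_times.sh at 1:59 a.m. on the first Sunday of April, but the entry does not do that.

Unlike the other fields, the third field (day of month) and the fifth field (day of week) are combined with an "or". So this program runs on every day from April 1 to April 7 and also on every other Sunday in April.

The example can be rewritten as follows: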

# Prepare for the daylight savings time shift
59 1 1-7 4 * test `date +\%w` = 0 && /root/shift_my_times.sh
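The "or" between the day-of-month and day-of-week fields can be sketched in Python as follows. This is only an illustration of the rule described above, not cron's actual code: the function name is made up, `None` stands for `*` in a field, and the month, hour and minute fields are left out.

```python
from datetime import date
from typing import Optional, Set


def cron_day_matches(day: date, dom: Optional[Set[int]], dow: Optional[Set[int]]) -> bool:
    """Day test only; dow uses cron numbering, 0 = Sunday ... 6 = Saturday."""
    dom_ok = dom is not None and day.day in dom
    dow_ok = dow is not None and (day.isoweekday() % 7) in dow
    if dom is None and dow is None:
        return True          # "* *": every day
    if dom is None:
        return dow_ok        # only day of week restricted
    if dow is None:
        return dom_ok        # only day of month restricted
    return dom_ok or dow_ok  # both restricted: either one is enough


if __name__ == "__main__":
    first_week = set(range(1, 8))
    april_2010 = [date(2010, 4, d) for d in range(1, 31)]
    # "1-7 4 0": days 1-7 plus every Sunday in April.
    print([d.day for d in april_2010 if cron_day_matches(d, first_week, {0})])
    # "1-7 4 *" combined with the shell test for Sunday: first Sunday only.
    print([d.day for d in april_2010
           if cron_day_matches(d, first_week, None) and d.isoweekday() % 7 == 0])
```

The second list corresponds to the rewritten crontab line: cron selects days 1 to 7, and the `test` command then filters for Sunday.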

Another common mistake is misusing the minute field. The following entry is meant to run a program once every two hours:

# adds date to a log file
* 0,2,4,6,8,10,12,14,16,18,20,22 * * * date >> /var/log/date.log

Instead, the entry above runs the program once every minute during each even hour. The correct setting is:

# runs the date command every even hour at the top of the hour
0 0,2,4,6,8,10,12,14,16,18,20,22 * * * date >> /var/log/date.log
# an even better way
0 */2 * * * date >> /var/log/date.log

Nov 05 2009

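Raising the Windows concurrent connection limit

Tag: Tech ssmax @ 19:02:02

Besides adjusting the concurrent connection limit in tcpip.sys, you also need to raise the Windows concurrent connection ceiling, which by default is only around 5k.

The two most important registry values are TcpNumConnections (the TCP connection limit) and MaxUserPort (the highest port number that can be used for dynamic ports, default 5000).

There are some other tuning parameters as well; the details follow: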

 

Configure the max limit for concurrent TCP connections

To keep the TCP/IP stack from taking all resources on the computer, there are different parameters that control how many connections it can handle. If you run applications that constantly open and close connections (P2P), or provide a service that many clients try to connect to at the same time (a web server like IIS), you can improve the performance of these applications by changing the restriction limits.

There is a parameter that limits the maximum number of connections that TCP may have open simultaneously.

[HKEY_LOCAL_MACHINE \System \CurrentControlSet \Services \Tcpip \Parameters]
TcpNumConnections = 0x00fffffe (Default = 16,777,214)

Note a 16 million connection limit sounds very promising, but there are other parameters (see below) that keep us from ever reaching this limit.

When a client makes a connect() call to make a connection to a server, the client implicitly binds the socket to a local dynamic (anonymous, ephemeral, short-lived) port number. The default range for dynamic ports in Windows is 1024 to 5000, thus giving 3977 outbound concurrent connections for each IP address. It is possible to change the upper limit with this DWORD registry key:

[HKEY_LOCAL_MACHINE \System \CurrentControlSet \Services \Tcpip \Parameters]
MaxUserPort = 5000 (Default = 5000, Max = 65534)

Note it is possible to reserve port numbers so they aren’t used as dynamic ports, in case a certain application needs them. This is done by using the ReservedPorts (Q812873) setting.

Note Vista changes the default range from 1024-5000 to 49152-65535, which can be controlled with the dynamicport setting using netsh. More Info MS KB929851.

More Info The Cable Guy – Ephemeral, Reserved, and Blocked Port Behavior
More Info MS KB Q196271
More Info MS KB Q319502
More Info MS KB Q319504
More Info MS KB Q328476
More Info MS KB Q836429

For each connection a TCP Control Block (TCB – Data structure using 0.5 KB pagepool and 0.5 KB non-pagepool) is maintained. The TCBs are pre-allocated and stored in a table, to avoid spending time on allocating/deallocating the TCBs every time connections are created/closed. The TCB Table enables reuse/caching of TCBs and improves memory management, but the static size limits how many connections TCP can support simultaneously (Active + TIME_WAIT). Configure the size of the TCB Table with this DWORD registry key:

[HKEY_LOCAL_MACHINE \System \CurrentControlSet \Services \Tcpip \Parameters]
MaxFreeTcbs = 2000 (Default = RAM dependent, but usual Pro = 1000, Srv=2000)

To make lookups in the TCB table faster a hash table has been made, which is optimized for finding a certain active connection. If the hash table is too small compared to the total amount of active connections, then extra CPU time is required to find a connection. Configure the size of the hash table with this DWORD registry key (Is allocated from pagepool memory):

[HKEY_LOCAL_MACHINE \System \CurrentControlSet \services \Tcpip \Parameters]
MaxHashTableSize = 512 (Default = 512, Range = 64-65536)

Note Microsoft recommends for a multiprocessor environment, that the value should not be higher than the maximum amount of concurrent connections (MaxFreeTcbs), also if multiprocessor then it might be interesting to look at the registry-key NumTcbTablePartitions (Recommended value CPU-count multiplied by 4).

More Info MS KB Q151418
More Info MS KB Q224585

If you have allocated 1000 TCBs, it doesn’t mean that you will be able to have 1000 active connections, especially if the application is quickly opening and closing connections: after a connection is “closed” it enters the TIME_WAIT state and continues to occupy the port number for 4 minutes (2 * Maximum Segment Lifetime, MSL) before it is actually removed. This behavior is specified in RFC 793, and prevents attempts to reconnect to the same party before the old socket is recognized as closed at both sides. It is possible to change how long a socket should be in TIME_WAIT state before it can be re-used freely:

[HKEY_LOCAL_MACHINE \System \CurrentControlSet \services \Tcpip \Parameters]
TcpTimedWaitDelay = 120 (Default = 240 secs, Range = 30-300)

More Info MS KB Q137984
More Info MS KB Q149532
More Info MS KB Q832954
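To get a feeling for how the dynamic port range and TIME_WAIT interact, here is a back-of-the-envelope calculation in Python. It is a simplification under stated assumptions: every outbound connection uses one dynamic port, and (as the text above describes) a closed connection keeps that port busy for the whole TcpTimedWaitDelay. The function names are made up for this sketch.

```python
def dynamic_port_count(first_port: int, max_user_port: int) -> int:
    """Number of ports in the inclusive range first_port..max_user_port."""
    return max_user_port - first_port + 1


def sustained_connects_per_second(ports: int, timed_wait_delay_s: int) -> float:
    """Rate of new outbound connections at which the port range never runs dry."""
    return ports / timed_wait_delay_s


if __name__ == "__main__":
    default_ports = dynamic_port_count(1024, 5000)    # defaults quoted above
    tuned_ports = dynamic_port_count(1024, 65534)     # MaxUserPort at its maximum
    print(default_ports, round(sustained_connects_per_second(default_ports, 240), 1))
    print(tuned_ports, round(sustained_connects_per_second(tuned_ports, 30), 1))
```

With the defaults, only about 16 short-lived outbound connections per second can be sustained before the port range is exhausted, which is why both settings are usually changed together.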

Note with Win2k the reuse of sockets has been changed: when more than 1000 connections are in the TIME_WAIT state, it starts to mark sockets that have been in TIME_WAIT for more than 60 secs as free. It is possible to configure this limit:

[HKEY_LOCAL_MACHINE \System \CurrentControlSet \services \Tcpip \Parameters]
MaxFreeTWTcbs = 1000 (Default = 1000 sockets)

Note with Win2k3 SP1 the reuse of sockets has been changed again: when it has to re-use sockets in TIME_WAIT state, it checks whether the other party is different from the old socket. This eliminates the need to fiddle with TcpTimedWaitDelay and MaxFreeTWTcbs any more.

If you use an application protocol that doesn’t implement timeout checking, but relies on the TCP/IP timeout checking without specifying how often it should be done, then it is possible to get connections that “never” close if the remote host disconnects without closing the connection properly. By default the TCP/IP timeout check is done every 2 hours by sending a keep-alive packet. It is possible to change how often TCP/IP should check the connections (affects all TCP/IP connections):

[HKEY_LOCAL_MACHINE \System \CurrentControlSet \services \Tcpip \Parameters]
KeepAliveTime = 1800000 (Default = 7,200,000 millisecs)

More Info MS KB Q140325

When data is sent/received the data is copied back and forth to non-paged pool memory for buffering. If there are many connections receiving/sending data, then it is possible to exhaust the non-paged pool memory. The max size of the non-paged pool buffer allocated for each connection is controlled by MaxBufferredReceiveBytes or TCPIP Receive Window depending on which is smallest. More Info MS KB Q296265

Note if using the Professional/Home edition of Windows then it is very likely that it is crippled (By Microsoft) not to handle many concurrent TCP connections. Ex. Microsoft have officially stated that the backlog limit is 5 (200 when Server), so the Professional edition is not able to accept() more than 5 new connections concurrently. More Info MS KB Q127144

Note even if you have optimized Windows to handle many concurrent connections, connections might still be refused when reaching a certain limit if a NAT router/firewall placed in front of it is unable to handle so many concurrent connections.

Note if you have activated SYN-Attack-Protection (enabled by default in Win2k3 SP1) or installed WinXP SP2, a limit is introduced on how many connection attempts (half-open) one can make simultaneously (XP SP2 & Vista = 10; Vista SP2 = no limit). This keeps worms like Blaster and Sasser from spreading too fast, but it also limits other applications that create many new connections simultaneously (like P2P).

EventID 4226: TCP/IP has reached the security limit imposed on the number of concurrent TCP connect attempts

More Info www.LvlLord.de

Windows Vista SP2 removes the limit again, but it can be enabled with the following DWORD registry setting:

[HKEY_LOCAL_MACHINE \SYSTEM \CurrentControlSet \Services \Tcpip \Parameters]
EnableConnectionRateLimiting = 1

More Info MS KB 969710

Related No more than 10 connections to a remote computer


Sep 30 2009

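Small fonts in applications on Windows 2003 SP2 x64

Tag: Tech ssmax @ 18:08:00

Today I started using the Simplified Chinese MSDN Standard Edition of Windows 2003 SP2 R2 (x64) and found that the fonts in applications had become smaller. Where can this be changed? For example, the fonts in the QQ friend groups are tiny. Does anyone know how to fix this? Thanks.

Changing the theme appearance does not solve the problem; I have already tested that. Is there any other way?

I finally sorted it out myself. I hope this helps anyone with a similar problem.
  Open the Registry Editor and go to
  [HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\GRE_Initialize]
  Change the string value "GUIFont.Facename" to Tahoma
  Change the DWORD value "GUIFont.Height" to 8

  [HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\FontSubstitutes]
  Change the string values "MS Shell Dlg 2" and "MS Shell Dlg" to Tahoma
  The result looks like this:
  [HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\GRE_Initialize]
  "GUIFont.Facename"="Tahoma"
  "GUIFont.Height"=dword:00000008

  [HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\FontSubstitutes]
  "MS Shell Dlg 2"="Tahoma"
  "MS Shell Dlg"="Tahoma"

After the change, reboot and the fonts are back to normal.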


Sep 17 2009

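Conflicts between the new Eclipse and FlashGet, 360safe and similar programs

Tag: Tech ssmax @ 14:02:24

The new Galileo release of Eclipse conflicts with FlashGet, 360 Safe Guard (360安全卫士) and similar programs, and as a result fails to start.

Here is the fix:

Add the following argument to the shortcut: -vm "%JAVA_HOME%\jre\bin\javaw.exe"

This tells Eclipse to use javaw explicitly. Eclipse used to use javaw by default; I don't know why that changed.

A possible cause of the problem: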

The reason we need a contiguous memory region for the heap is that we have a bunch of side data structures that are indexed by (scaled) offsets from the start of the heap. For example, we track object reference updates with a “card mark array” that has one byte for each 512 bytes of heap. When we store a reference in the heap we have to mark the corresponding byte in the card mark array. We right shift the destination address of the store and use that to index the card mark array. Fun addressing arithmetic games you can’t do in Java that you get to (have to :-) play in C++.

Usually we don’t have trouble getting modest contiguous regions (up to about 1.5GB on Windohs, up to about 3.8GB on Solaris. YMMV.). On Windohs, the problem is mostly that there are some libraries that get loaded before the JVM starts up that break up the address space. Using the /3GB switch won’t rebase those libraries, so they are still a problem for us.

We know how to make chunked heaps, but there would be some overhead to using them. We have more requests for faster storage management than we do for larger heaps in the 32-bit JVM. If you really want large heaps, switch to the 64-bit JVM. We still need contiguous memory, but it’s much easier to get in a 64-bit address space.
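The "card mark array" addressing trick from the quote above can be sketched in Python. This is a toy model, not JVM code: the class and method names are invented for the illustration, and it indexes by offset from the heap start, which is why the heap has to be one contiguous range for the shift to work.

```python
CARD_SHIFT = 9  # 2**9 = 512 bytes of heap per card, as in the quote


class CardTable:
    """Toy card table: one byte per 512 bytes of a contiguous heap."""

    def __init__(self, heap_start: int, heap_size: int) -> None:
        self.heap_start = heap_start
        self.heap_size = heap_size
        n_cards = (heap_size + (1 << CARD_SHIFT) - 1) >> CARD_SHIFT
        self.cards = bytearray(n_cards)

    def card_index(self, address: int) -> int:
        """Scaled offset from the heap start: a subtraction and a right shift."""
        offset = address - self.heap_start
        if not 0 <= offset < self.heap_size:
            raise ValueError("address outside the heap")
        return offset >> CARD_SHIFT

    def mark(self, address: int) -> None:
        """Record that a reference was stored at this heap address."""
        self.cards[self.card_index(address)] = 1


if __name__ == "__main__":
    table = CardTable(heap_start=0x10000000, heap_size=4096)
    table.mark(0x10000000 + 700)   # falls into the second 512-byte card
    print(list(table.cards))
```

If the heap were split into scattered chunks, a single shift would no longer map an address to its card, which is the overhead the quote refers to.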

