Tuesday, August 31, 2010

Iterative methods for computing the first k eigenvectors

For a symmetric matrix A and a random vector v, applying A to v k times gives an estimate of the eigenvector corresponding to the largest-magnitude eigenvalue of A: as k grows, u = A^k v satisfies Au \approx \lambda_1 u. This assumes |\lambda_1| strictly dominates the other eigenvalues and that v has a nonzero component along the leading eigenvector; in practice u is renormalized at each step. Repeating the procedure on the complement of the eigenvectors found so far (deflation) extends this to the first few eigenvectors.
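A minimal C++ sketch of this power iteration; the per-step normalization (which keeps A^k v from overflowing) and all names are my own additions:
#include <cmath>
#include <vector>

// A minimal power-iteration sketch. `A` is a dense symmetric n-by-n matrix in
// row-major order; assumes |lambda_1| strictly dominates the other eigenvalues
// and that v starts with a nonzero component along the leading eigenvector.
std::vector<double> powerIteration(const std::vector<double>& A,
                                   std::vector<double> v, int n, int steps) {
    for (int step = 0; step < steps; ++step) {
        std::vector<double> u(n, 0.0);
        for (int i = 0; i < n; ++i)              // u = A v
            for (int j = 0; j < n; ++j)
                u[i] += A[i * n + j] * v[j];
        double norm = 0.0;                       // renormalize each step so
        for (int i = 0; i < n; ++i)              // the iterate cannot overflow
            norm += u[i] * u[i];
        norm = std::sqrt(norm);
        for (int i = 0; i < n; ++i) u[i] /= norm;
        v = u;
    }
    return v;  // approximates the leading eigenvector, so Av ~ lambda_1 v
}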

Monday, August 30, 2010

The Intrinsic Relationships between Distributed Computing and Iterative Methods

You can never exchange information directly within a group of task instances; the only proper way to exchange information is through a reducer.
If your problem has a divide-and-conquer structure, you can split it into several partitions and process them with a group of task instances.
But if, unfortunately, your problem cannot be solved by a one-shot divide-and-conquer pass, iterative methods may help you recast it in divide-and-conquer form.
Iterative methods let you solve a problem bit by bit. That is because, of course, solving a bit of a problem is usually much easier than solving the whole problem in a single step.
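To make the pattern concrete, here is a schematic C++ sketch (the fixed-point "problem" it solves is a made-up placeholder, and all names are my own): within one iteration the task instances run independently over their partitions, and the single reduce step is the only place information is exchanged:
#include <numeric>
#include <vector>

// Each iteration: a "map" over partitions (no communication between tasks),
// then one "reduce" that combines partial results into the shared state
// broadcast into the next iteration.
double iterativeMapReduce(const std::vector<std::vector<double> >& partitions,
                          int iterations) {
    double shared = 0.0;  // state broadcast to every task instance
    for (int it = 0; it < iterations; ++it) {
        std::vector<double> partial;
        for (size_t t = 0; t < partitions.size(); ++t) {  // map: tasks never talk
            double sum = 0.0;
            for (size_t i = 0; i < partitions[t].size(); ++i)
                sum += 0.5 * (partitions[t][i] + shared);
            partial.push_back(sum / partitions[t].size());
        }
        // reduce: the only information exchange between task instances
        shared = std::accumulate(partial.begin(), partial.end(), 0.0)
                 / partial.size();
    }
    return shared;  // converges to the mean of the per-partition means
}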

Saturday, August 28, 2010

The Key Word is Learning! Of course!

This is a good picture showing the important concepts of data mining. The key word is learning!

Tuesday, August 24, 2010

Universal hash with mod p

A universal hash family is a class H containing a finite number of hash functions such that, for any two distinct keys i and j, the number of hash functions in H with h(i) = h(j) is at most |H| / m, where m is the size of the hash table. Equivalently, for a hash function chosen uniformly at random from H, Pr{h(i) = h(j)} <= 1/m.
For a sufficiently large prime p (larger than every key) and any 1 < m < p, the functions h(k) = ((ak + b) mod p) mod m, with 1 <= a <= p-1 and 0 <= b <= p-1, form a universal hash family with |H| = p(p-1).
For any given i < j, let s = (ai + b) mod p and l = (aj + b) mod p. If s = l, then a(i - j) = 0 mod p, which is impossible since p is prime, a != 0 mod p, and i != j; hence s != l.
(s, l) is a function of (a, b) once (i, j) is fixed.
Conversely, since (s - l) = a(i - j) mod p, there exists exactly one a for each difference s - l, and then exactly one b; therefore (a, b) is a function of (s, l) once (i, j) is fixed.
So the hash family induces a one-to-one map between pairs (a, b) and pairs (s, l) with s != l when (i, j) is fixed, and |{(s, l)}| = |{(a, b)}| = p(p-1).
A collision h(i) = h(j) occurs exactly when s = l mod m with s != l; the number of such pairs (s, l) is at most p(ceiling(p/m) - 1) <= p(p-1)/m = |H| / m.
A large prime is useful!
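A C++ sketch of this family; the concrete prime (2^31 - 1) and the 64-bit arithmetic are illustrative choices, and the code assumes keys smaller than p:
#include <cstdint>
#include <random>

// h_{a,b}(k) = ((a*k + b) mod p) mod m, with a drawn uniformly from
// {1, ..., p-1} and b from {0, ..., p-1}, so |H| = p(p-1).
struct UniversalHash {
    static const uint64_t p = 2147483647ULL;  // a large prime (2^31 - 1)
    uint64_t a, b, m;

    explicit UniversalHash(uint64_t table_size) : m(table_size) {
        std::mt19937_64 gen(std::random_device{}());
        a = std::uniform_int_distribution<uint64_t>(1, p - 1)(gen);
        b = std::uniform_int_distribution<uint64_t>(0, p - 1)(gen);
    }

    // a*k + b stays below 2^64 because a, b, k are all below 2^31
    uint64_t operator()(uint64_t k) const { return ((a * k + b) % p) % m; }
};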


Thursday, August 19, 2010

A general way to handle categorical features in machine learning

A machine learning algorithm must, at a minimum, be able to compute the distance between two feature vectors, and most algorithms further require the feature space to behave like a Euclidean space (satisfying the Pythagorean theorem). That is fine for ordinary real-valued features, but not for categorical (enumerated) ones. Consider a categorical feature color with values {red, yellow, green}: if we want the difference between any two distinct values to be the same, say 1, then simply numbering the values 0, 1, 2 cannot guarantee it; the distance between red and green would be 2.
A general approach is to make each possible value its own dimension: a feature with k enumerated values is expanded into a k-dimensional feature. Red becomes [1, 0, 0], yellow becomes [0, 1, 0], and so on. This guarantees that any two distinct enumerated values are equidistant (at Euclidean distance sqrt(2)).
This shows that a categorical feature is not intrinsically a 1-dimensional feature but something of higher dimension. Categorical features therefore noticeably enlarge the dimensionality of the feature space, which yields stronger descriptive power, but also demands more samples and can even cause problems with generalization.
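A minimal C++ sketch of this expansion (all names are illustrative):
#include <string>
#include <vector>

// A categorical feature with k possible values becomes a k-dimensional 0/1
// vector, so any two distinct values end up the same distance apart.
std::vector<double> oneHot(const std::string& value,
                           const std::vector<std::string>& categories) {
    std::vector<double> encoded(categories.size(), 0.0);
    for (size_t i = 0; i < categories.size(); ++i)
        if (categories[i] == value) encoded[i] = 1.0;
    return encoded;
}

// Usage: oneHot("red", {"red", "yellow", "green"}) yields [1, 0, 0]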

Wednesday, August 18, 2010

The minimum number of comparisons needed to find the maximum and minimum of a sequence

by Binqiang Zhao
For problems of this kind, the key is to capture the essence of a comparison while making as few concrete assumptions as possible about the shape of the algorithm.
Consider a sequence of n elements, from which we must find the maximum element max and the minimum element min.
The only thing you can do is compare.
Before any comparison, every element could be the maximum and could be the minimum, so max and min each have n candidates, 2n candidates in total.
A comparison removes candidates; the answer is reached when max and min are each down to a single candidate, i.e., after 2n - 2 eliminations.
There are three kinds of comparison:
1. If two elements have not yet taken part in any comparison, compare them against each other: the smaller one remains only a min candidate, and the larger one only a max candidate, so the candidate counts of min and max each drop by 1.
2. An element known to be only a min candidate is compared once against the current min, reducing the min candidate count by 1.
3. An element known to be only a max candidate is compared once against the current max, reducing the max candidate count by 1.
At most floor(n/2) comparisons of kind 1 can be performed.
For kinds 2 and 3, each element that went through kind 1 can take part in only one of the two, so each such comparison eliminates one candidate; their number is ceiling(n/2) * 2 - 2 (two of the eliminations are already implied by kind 1).
Initially, one can compare two elements, assign the larger to max and the smaller to min.
Eliminating all the surplus candidates therefore takes floor(n/2) + ceiling(n/2)*2 - 2 = ceiling(3n/2) - 2 comparisons.
This is also the minimum number of comparisons. Nothing here restricts how the algorithm is designed; no matter how you design it, the worst case cannot use fewer comparisons. That is the beauty of this kind of theoretical analysis.
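A C++ sketch of the resulting algorithm, assuming a nonempty input; each loop iteration spends 3 comparisons on a pair instead of 4, for ceiling(3n/2) - 2 in total:
#include <utility>
#include <vector>

// Elements are taken in pairs: compare the pair to each other first (kind 1),
// then the smaller only against min (kind 2), the larger only against max
// (kind 3).
std::pair<int, int> minAndMax(const std::vector<int>& a) {
    size_t n = a.size(), i;
    int lo, hi;
    if (n % 2 == 0) {            // even n: seed min/max from the first pair
        if (a[0] < a[1]) { lo = a[0]; hi = a[1]; }
        else             { lo = a[1]; hi = a[0]; }
        i = 2;
    } else {                     // odd n: the first element seeds both
        lo = hi = a[0];
        i = 1;
    }
    for (; i + 1 < n; i += 2) {
        int small = a[i], large = a[i + 1];
        if (large < small) std::swap(small, large);  // kind 1
        if (small < lo) lo = small;                  // kind 2
        if (hi < large) hi = large;                  // kind 3
    }
    return std::make_pair(lo, hi);
}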

Two C++ resources

via 酷壳 - CoolShell.cn by 陈皓 on 4/18/10

The first is an A-Z of third-party C++ libraries (http://www.trumphurst.com/cpplibs/cpplibs.php), which includes:

  • a list of open-source third-party C++ libraries
  • a list of commercial third-party C++ libraries
  • companion source code for a number of classic C++ books
  • some C++-related tools

That said, the site seems to have last been updated in 2008.

The second is a tutorial on Boost C++ (http://en.highscore.de/cpp/boost/).

It is probably one of the better-written tutorials around, though it is in English.


Some important algorithms

via 酷壳 - CoolShell.cn by 陈皓 on 7/11/10

Below are some fairly important algorithms. The original article listed 32, but I felt that many of them belong to number theory and have little to do with computing, so I did not include those. Of the ones below, some we use all the time and some almost never; some are very common, some quite obscure. Either way, it is good to know them, and you are welcome to leave a comment with algorithms you find meaningful. (Note: this post is not a translation; most of the algorithm descriptions are excerpted from Wikipedia, whose descriptions are already quite professional.)

  1. A* search algorithm
    Commonly known as the A-star algorithm. It finds the lowest-cost path among multiple nodes on a graph plane, and is often used to compute NPC movement in games or bot movement in online games. Like Dijkstra's algorithm, it can find a shortest path; like BFS, it searches heuristically.
  2. Beam Search
    Beam search is a heuristic method for solving optimization problems, developed on top of branch and bound. It uses a heuristic to estimate the k best paths and continues the search only from those k, i.e., at each level only the satisfactory nodes are kept and all others are discarded permanently, which saves a great deal of running time compared with branch and bound. Beam search was first applied to artificial intelligence in the mid-1970s; in 1976 Lowerre used it for the first time in his HARPY speech recognition system, with the goal of searching several potentially optimal decision paths in parallel to reduce backtracking and obtain a solution quickly.
  3. Binary search
    A search algorithm for finding a particular element in a sorted array. The search starts from the middle element: if the middle element happens to be the one sought, the search ends; if the sought element is greater or smaller than the middle element, the search continues in the corresponding half of the array, again starting from its middle element. Each comparison halves the search range. (A C++ sketch appears after this list.)
  4. Branch and bound
    The branch and bound algorithm searches for a solution in the problem's solution-space tree. Unlike backtracking, branch and bound searches the tree breadth-first or least-cost-first, and every live node gets only one chance to become an expansion node.
  5. Data compression
    Data compression is the technique of reducing the redundancy of data stored on a computer or transmitted in communication, increasing data density and ultimately reducing the storage space the data occupies. It is used very widely in file storage and distributed systems, and it also translates into larger effective media capacity and wider network bandwidth.
  6. Diffie–Hellman key exchange
    Diffie–Hellman key exchange, "D–H" for short, is a security protocol. It lets two parties establish a key over an insecure channel with no prior information about each other; the key can then be used as a symmetric key to encrypt subsequent communication.
  7. Dijkstra's algorithm
    Dijkstra's algorithm, invented by the Dutch computer scientist Edsger Wybe Dijkstra, solves the single-source shortest-path problem on a directed graph. For example, if the graph's vertices represent cities and the edge weights represent driving distances between them, Dijkstra's algorithm can find the shortest route between two cities.
  8. Dynamic programming
    Dynamic programming is a method, used in mathematics and computer science, for solving optimization problems that contain overlapping subproblems. Its basic idea is to decompose the original problem into similar subproblems and obtain the solution of the original problem from the subproblem solutions along the way. The idea underlies many algorithms and is widely applied in computer science and engineering; well-known applications include shortest-path problems, the knapsack problem, project management, and network-flow optimization.
  9. Euclidean algorithm
    In mathematics, the method of successive division, also known as the Euclidean algorithm, computes the greatest common divisor. It first appeared in Euclid's Elements (Book VII, Propositions 1 and 2), and in China it can be traced back to the Nine Chapters on the Mathematical Art of the Eastern Han period.
  10. Expectation-maximization (EM) algorithm
    In statistical computing, the expectation-maximization (EM) algorithm finds maximum-likelihood estimates of parameters in probabilistic models that depend on unobservable latent variables. EM is often used in machine learning, computer vision, and data clustering. The algorithm alternates between two steps: the expectation step (E) computes a likelihood estimate using the current estimates of the latent variables; the maximization step (M) maximizes the likelihood found in the E step to compute the parameter values. The parameter estimates from the M step are then used in the next E step, and the two steps alternate.
  11. Fast Fourier transform (FFT)
    The fast Fourier transform (FFT) is a fast algorithm for the discrete Fourier transform, and can also compute the inverse transform. The FFT has broad applications, such as digital signal processing, multiplying large integers, and solving partial differential equations. This entry describes only the fast algorithms; for the properties and applications of the discrete Fourier transform itself, see the corresponding entry.
  12. Hash function
    A hash function is a way of creating a small digital "fingerprint" from any kind of data. The function scrambles and mixes the data to produce the fingerprint, called a hash value, usually represented as a short string of random-looking letters and digits. A good hash function rarely produces collisions over its input domain; in hash tables and data processing, collisions left to blur the distinction between data make records much harder to find.
  13. Heapsort
    Heapsort is a sorting algorithm designed around the heap data structure. A heap is an almost-complete binary tree that also satisfies the heap property: each child's key is always smaller than (or always larger than) its parent's.
  14. Merge sort
    Merge sort is an efficient sorting algorithm built on the merge operation, and a very typical application of divide and conquer.
  15. RANSAC
    RANSAC is short for "RANdom SAmple Consensus". It is an iterative method for estimating the parameters of a mathematical model from a set of observed data, proposed by Fischler and Bolles in 1981. It is a non-deterministic algorithm in that it produces a reasonable result only with a certain probability, which increases with the number of iterations. Its basic assumption is that the observations contain "inliers" (points that support the estimated model parameters) and "outliers" (points that do not fit the model), and that the data are affected by noise. RANSAC assumes that, given a set of inliers, the model that best fits those points can be computed.
  16. RSA
    RSA is a public-key encryption algorithm and the first algorithm in the world suitable for signing as well as encryption. Its patents have since expired, and it is widely used for encryption in electronic commerce; it is generally believed to be secure as long as the key is sufficiently long.
  17. Union-find
    Union-find is a tree-shaped data structure for handling merge and membership queries on disjoint sets. In practice it is usually represented as a forest.
  18. Viterbi algorithm
    Finding the most probable sequence of hidden states.
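
As promised in item 3, a minimal binary search sketch in C++ (names are my own):
#include <vector>

// Returns the index of `target` in the sorted vector `a`, or -1 if absent.
int binarySearch(const std::vector<int>& a, int target) {
    int lo = 0, hi = static_cast<int>(a.size()) - 1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;       // avoids overflow of (lo + hi)
        if (a[mid] == target) return mid;
        if (a[mid] < target) lo = mid + 1;  // search the upper half
        else hi = mid - 1;                  // search the lower half
    }
    return -1;
}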


Introduction to Algorithms (2nd Edition): an erroneous exercise, 6.3-3

I found a problematic exercise in Introduction to Algorithms, 2nd edition: 6.3-3.
Problem: Show that there are at most ceiling(n/2^(h+1)) nodes of height h in any n-element heap.
Here ceiling(x) = min {n : n >= x and n is an integer}.
Consider n = 4, counting levels upward from the bottom of the tree: the level h = 1 contains 2 nodes, but ceiling(n/2^(h+1)) = ceiling(4/2^(1+1)) = 1, and 2 > 1 is not "at most 1". (Note that if h is instead read as CLRS defines node height, the longest distance down to a leaf, then the n = 4 heap has only one node of height 1 and the stated bound holds; the discrepancy hinges on the reading of h.)
Under the level-counting reading, the bound should be "at least":
For h = 0, the claim clearly holds.
For h > 0, that level is necessarily full, so its node count is 2^(floor(log2(n)) - h) >= 2^(log2(n) - 1 - h) = n / 2^(h+1); since the left-hand side is an integer, the inequality still holds after taking the ceiling of n / 2^(h+1).

The number of comparisons needed to find the smallest and second-smallest elements

The minimum number of comparisons is C = n + ceiling(log2(n)) - 2.
The method:
Smallest element
Pick floor(n/2) * 2 elements, split them evenly into two groups, and compare them one against one, which takes floor(n/2) comparisons; keep the smaller elements, leaving ceiling(n/2) of them. Then process those ceiling(n/2) elements the same way, until a single element remains, which can be shown to be the smallest. Since each comparison eliminates exactly one element, and all elements but one must be eliminated, this takes n - 1 comparisons.
Second-smallest element
During the process above, record every element that was compared against the eventual minimum; at most n - 1 recording units are needed. The second-smallest element must be among those recorded, because the second-smallest can only be eliminated by the minimum itself. Their number is the number of rounds the minimum played, which can be bounded as follows: define C(n) = ceiling(n/2) and C*(n) = min{k : applying C to n k times yields 1}. It is easy to show that C*(n) is nondecreasing and that C*(n) = m for n = 2^m. Since n = 2^(log2(n)) <= 2^ceiling(log2(n)), we get C*(n) <= ceiling(log2(n)); it is also easy to show that C*(n) > floor(log2(n)) when n is not a power of two, hence C*(n) = ceiling(log2(n)). Finding the smallest of these ceiling(log2(n)) recorded elements takes ceiling(log2(n)) - 1 further comparisons, and that element is the second-smallest overall.
The total number of comparisons is therefore C = n + ceiling(log2(n)) - 2.
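A C++ sketch of this tournament, assuming at least two elements; every entrant carries the list of values it has beaten, and the runner-up is the smallest value on the champion's list (all names are my own):
#include <utility>
#include <vector>

std::pair<int, int> minAndSecondMin(const std::vector<int>& a) {
    // each entrant: (value, values it has eliminated so far)
    std::vector<std::pair<int, std::vector<int> > > round;
    for (size_t i = 0; i < a.size(); ++i)
        round.push_back(std::make_pair(a[i], std::vector<int>()));
    while (round.size() > 1) {               // knockout rounds: n - 1 comparisons
        std::vector<std::pair<int, std::vector<int> > > next;
        for (size_t i = 0; i + 1 < round.size(); i += 2) {
            bool leftWins = round[i].first <= round[i + 1].first;
            std::pair<int, std::vector<int> >& w = leftWins ? round[i] : round[i + 1];
            std::pair<int, std::vector<int> >& l = leftWins ? round[i + 1] : round[i];
            w.second.push_back(l.first);     // winner records whom it beat
            next.push_back(w);
        }
        if (round.size() % 2 == 1)           // odd one out advances for free
            next.push_back(round.back());
        round = next;
    }
    // scan the champion's ceiling(log2(n)) opponents: ceiling(log2(n)) - 1 comparisons
    int second = round[0].second[0];
    for (size_t i = 1; i < round[0].second.size(); ++i)
        if (round[0].second[i] < second) second = round[0].second[i];
    return std::make_pair(round[0].first, second);
}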

Monday, August 16, 2010

The Most Important Algorithms (in CS and Math)

via Data, Knowledge & Life on 7/18/10


  Just about every colleague I talk to has a few favorite algorithms of their own. Below are the 32 classes of algorithms that Christoph Koutschan lists as the most important in computer science and mathematics, in alphabetical order. The coverage is broad and the assessments are on the mark.
  1. A* search algorithm
    Graph search algorithm that finds a path from a given initial node to a given goal node. It employs a heuristic estimate that ranks each node by an estimate of the best route that goes through that node. It visits the nodes in order of this heuristic estimate. The A* algorithm is therefore an example of best-first search.
  2. Beam Search
    Beam search is a search algorithm that is an optimization of best-first search. Like best-first search, it uses a heuristic function to evaluate the promise of each node it examines. Beam search, however, only unfolds the first m most promising nodes at each depth, where m is a fixed number, the beam width.
  3. Binary search
    Technique for finding a particular value in a linear array, by ruling out half of the data at each step.
  4. Branch and bound
    A general algorithmic method for finding optimal solutions of various optimization problems, especially in discrete and combinatorial optimization.
  5. Buchberger's algorithm
    In computational algebraic geometry and computational commutative algebra, Buchberger's algorithm is a method of transforming a given set of generators for a polynomial ideal into a Gröbner basis with respect to some monomial order. One can view it as a generalization of the Euclidean algorithm for univariate gcd computation and of Gaussian elimination for linear systems.
  6. Data compression
    Data compression or source coding is the process of encoding information using fewer bits (or other information-bearing units) than an unencoded representation would use through use of specific encoding schemes.
  7. Diffie-Hellman key exchange
    Cryptographic protocol which allows two parties that have no prior knowledge of each other to jointly establish a shared secret key over an insecure communications channel. This key can then be used to encrypt subsequent communications using a symmetric key cipher.
  8. Dijkstra's algorithm
    Algorithm that solves the single-source shortest path problem for a directed graph with nonnegative edge weights.
  9. Discrete differentiation
    I.e., the formula f'(x) ≈ (f(x+h) - f(x-h)) / (2h).
  10. Dynamic programming
    Dynamic programming is a method for reducing the runtime of algorithms exhibiting the properties of overlapping subproblems and optimal substructure.
  11. Euclidean algorithm
    Algorithm to determine the greatest common divisor (gcd) of two integers. It is one of the oldest algorithms known, since it appeared in Euclid's Elements around 300 BC. The algorithm does not require factoring the two integers.
  12. Expectation-maximization algorithm (EM-Training)
    In statistical computing, an expectation-maximization (EM) algorithm is an algorithm for finding maximum likelihood estimates of parameters in probabilistic models, where the model depends on unobserved latent variables. EM alternates between performing an expectation step, which computes the expected value of the latent variables, and a maximization step, which computes the maximum likelihood estimates of the parameters given the data and setting the latent variables to their expectation.
  13. Fast Fourier transform (FFT)
    Efficient algorithm to compute the discrete Fourier transform (DFT) and its inverse. FFTs are of great importance to a wide variety of applications, from digital signal processing to solving partial differential equations to algorithms for quickly multiplying large integers.
  14. Gradient descent
    Gradient descent is an optimization algorithm that approaches a local minimum of a function by taking steps proportional to the negative of the gradient (or the approximate gradient) of the function at the current point. If instead one takes steps proportional to the gradient, one approaches a local maximum of that function; the procedure is then known as gradient ascent.
  15. Hashing
    A function for summarizing or probabilistically identifying data. Typically this means one applies a mathematical formula to the data, producing a string which is probably more or less unique to that data. The string is much shorter than the original data, but can be used to uniquely identify it.
  16. Heaps (heap sort)
    In computer science a heap is a specialized tree-based data structure. Heaps are favourite data structures for many applications: Heap sort, selection algorithms (finding the min, max or both of them, median or even any kth element in sublinear time), graph algorithms.
  17. Karatsuba multiplication
    For systems that need to multiply numbers in the range of several thousand digits, such as computer algebra systems and bignum libraries, long multiplication is too slow. These systems employ Karatsuba multiplication, which was discovered in 1962.
  18. LLL algorithm
    The Lenstra-Lenstra-Lovasz lattice reduction (LLL) algorithm is an algorithm which, given a lattice basis as input, outputs a basis with short, nearly orthogonal vectors. The LLL algorithm has found numerous applications in cryptanalysis of public-key encryption schemes: knapsack cryptosystems, RSA with particular settings, and so forth.
  19. Maximum flow
    The maximum flow problem is finding a legal flow through a flow network that is maximal. Sometimes it is defined as finding the value of such a flow. The maximum flow problem can be seen as special case of more complex network flow problems. The maximal flow is related to the cuts in a network by the Max-flow min-cut theorem. The Ford-Fulkerson algorithm computes the maximum flow in a flow network.
  20. Merge sort
    A sorting algorithm for rearranging lists (or any other data structure that can only be accessed sequentially, e.g. file streams) into a specified order.
  21. Newton's method
    Efficient algorithm for finding approximations to the zeros (or roots) of a real-valued function. Newton's method is also a well-known algorithm for finding roots of equations in one or more dimensions. It can also be used to find local maxima and local minima of functions.
  22. Q-learning
    Q-learning is a reinforcement learning technique that works by learning an action-value function that gives the expected utility of taking a given action in a given state and following a fixed policy thereafter. A strength of Q-learning is that it is able to compare the expected utility of the available actions without requiring a model of the environment.
  23. Quadratic sieve
    The quadratic sieve algorithm (QS) is a modern integer factorization algorithm and, in practice, the second fastest method known (after the number field sieve, NFS). It is still the fastest for integers under 110 decimal digits or so, and is considerably simpler than the number field sieve.
  24. RANSAC
    RANSAC is an abbreviation for "RANdom SAmple Consensus". It is an algorithm to estimate parameters of a mathematical model from a set of observed data which contains "outliers". A basic assumption is that the data consists of "inliers", i.e., data points which can be explained by some set of model parameters, and "outliers" which are data points that do not fit the model.
  25. RSA
    Algorithm for public-key encryption. It was the first algorithm known to be suitable for signing as well as encryption. RSA is still widely used in electronic commerce protocols, and is believed to be secure given sufficiently long keys.
  26. Schönhage-Strassen algorithm
    In mathematics, the Schönhage-Strassen algorithm is an asymptotically fast method for multiplication of large integer numbers. The run-time is O(N log(N) log(log(N))). The algorithm uses Fast Fourier Transforms in rings.
  27. Simplex algorithm
    In mathematical optimization theory, the simplex algorithm is a popular technique for numerical solution of the linear programming problem. A linear programming problem consists of a collection of linear inequalities on a number of real variables and a fixed linear functional which is to be maximized (or minimized).
  28. Singular value decomposition (SVD)
    In linear algebra, SVD is an important factorization of a rectangular real or complex matrix, with several applications in signal processing and statistics, e.g., computing the pseudoinverse of a matrix (to solve the least squares problem), solving overdetermined linear systems, matrix approximation, numerical weather prediction.
  29. Solving a system of linear equations
    Systems of linear equations belong to the oldest problems in mathematics and they have many applications, such as in digital signal processing, estimation, forecasting and generally in linear programming and in the approximation of non-linear problems in numerical analysis. An efficient way to solve systems of linear equations is given by the Gauss-Jordan elimination or by the Cholesky decomposition.
  30. Strukturtensor
    In pattern recognition: computes a measure for every pixel which tells you whether this pixel is located in a homogeneous region, belongs to an edge, or is a vertex.
  31. Union-find
    Given a set of elements, it is often useful to partition them into a number of separate, nonoverlapping groups. A disjoint-set data structure is a data structure that keeps track of such a partitioning. A union-find algorithm is an algorithm that performs two useful operations on such a data structure:
    Find: Determine which group a particular element is in.
    Union: Combine or merge two groups into a single group. (A C++ sketch appears after this list.)
  32. Viterbi algorithm
    Dynamic programming algorithm for finding the most likely sequence of hidden states - known as the Viterbi path - that result in a sequence of observed events, especially in the context of hidden Markov models.
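
As referenced in item 31, here is a minimal union-find sketch in C++ with the standard path-compression and union-by-size refinements; all names are my own:
#include <numeric>
#include <utility>
#include <vector>

struct UnionFind {
    std::vector<int> parent, size;

    explicit UnionFind(int n) : parent(n), size(n, 1) {
        std::iota(parent.begin(), parent.end(), 0);  // each element starts alone
    }

    int find(int x) {                       // Find: which group is x in?
        while (parent[x] != x) {
            parent[x] = parent[parent[x]];  // path compression
            x = parent[x];
        }
        return x;
    }

    void unite(int x, int y) {              // Union: merge the two groups
        int rx = find(x), ry = find(y);
        if (rx == ry) return;
        if (size[rx] < size[ry]) std::swap(rx, ry);
        parent[ry] = rx;                    // attach the smaller tree to the larger
        size[rx] += size[ry];
    }
};
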
  After seeing this list, Daniel Lemire gave his own Top 5:
  • Binary search is the first non-trivial algorithm I remember learning.
  • The Fast Fourier transform (FFT) is an amazing algorithm. Combined with the convolution theorem, it lets you do magic.
  • While hashing is not an algorithm, it is one of the most powerful and useful ideas in Computer Science. It takes minutes to explain it, but years to master.
  • Merge sort is the most elegant sorting algorithm. You can explain it in three sentences to anyone.
  • While not an algorithm per se, the Singular Value Decomposition (SVD) is the most important Linear Algebra concept I don't remember learning as an undergraduate. (And yes, I went to a good school. And yes, I was an A student.) It can help you invert singular matrices and do other similar magic.


An incorrect algorithm for generating random permutations

Randomly permute an array A[n] of pairwise-distinct elements so that every permutation occurs with equal probability, 1/n!.
An incorrect approach:
for (int i = 0; i < n; i++)
{
    int j = rand(0, n); // generate a random integer in the interval [0, n)
    swap(A[i], A[j]);
}
Why it is wrong:
The sequence of j values generated by this process forms a point in the space {0, 1, ..., n-1}^n, which contains n^n points. Each such point determines exactly one permutation of A, so under uniformity every permutation of A would correspond to the same number m of points of {0, 1, ..., n-1}^n. Therefore:
1/n! = m / n^n, so m = n^n / n! would have to be an integer; but this fails for general n, e.g., n = 3 (27/6 is not an integer).
Hence the random permutations produced are not uniformly distributed.
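For contrast, a sketch of the standard correct method (the Fisher-Yates shuffle) in C++: at step i, draw j uniformly from [i, n) instead of [0, n), giving n * (n-1) * ... * 1 = n! equally likely choice sequences, one per permutation:
#include <random>
#include <utility>
#include <vector>

void randomPermutation(std::vector<int>& a) {
    std::mt19937 gen(std::random_device{}());
    int n = static_cast<int>(a.size());
    for (int i = 0; i < n - 1; ++i) {
        std::uniform_int_distribution<int> dist(i, n - 1);  // j in [i, n)
        std::swap(a[i], a[dist(gen)]);
    }
}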

Stable Sort

We call a sorting algorithm stable if it guarantees that for any A[i] = A[j] with i < j, the corresponding positions after sorting satisfy i' < j'.
Stable sorts include: bubble sort, insertion sort, merge sort.
Unstable sorts include: heap sort.
Note that whether quicksort is stable depends on the implementation.
Any sorting algorithm can be turned into a stable one by extending each key with its original order number. E.g., <1, 9, 8, 1> can be extended to <1.1, 9.2, 8.3, 1.4>.
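A C++ sketch of this key-extension trick (names are illustrative): pair each key with its original position, so ties are broken by position and any comparison sort behaves stably:
#include <algorithm>
#include <utility>
#include <vector>

void stableByKeyExtension(std::vector<int>& a) {
    std::vector<std::pair<int, size_t> > keyed;
    for (size_t i = 0; i < a.size(); ++i)
        keyed.push_back(std::make_pair(a[i], i));  // extend key with order number
    std::sort(keyed.begin(), keyed.end());         // pairs compare (key, position)
    for (size_t i = 0; i < a.size(); ++i) a[i] = keyed[i].first;
}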
A stable version costs at least as much time and memory as an unstable one. Abstractly, stability is a constraint, and any constraint on the form of an algorithm's output carries a cost that the unconstrained version does not have to pay: a stable result is a special case of an unstable one.