Statistical Consulting

Our services include consulting on statistical methods, data visualization, and production of analytic reports. SupStat Inc. provides data analysis in a wide variety of subject areas, including mathematical modeling, machine learning, econometrics, and biostatistics.


Our team includes experts in statistical sampling techniques, survey instrument development and testing, qualitative and quantitative experimental design, human factors research, and clinical trials management. We work closely with our clients to align our expertise with each project’s individual needs. SupStat designs and analyzes studies for all sizes of data, from small, specialized field tests to large, nationally representative samples.


We have a fantastic group of experts in graphics and visualization to help clients display data with the greatest possible clarity and efficiency. Our team is trained to take advantage of modern data analysis software to overcome the constraints of simple spreadsheet-based graphics.


Applying cluster analysis to study patterns of physical activity in pregnant women in the Blossom Project

Helping researchers in animal behavior improve survey design by using random forest analyses and multivariate logistic regression to highlight significant questions

Building a scoring system for predicting incidence of animal disease using the machine learning techniques of group lasso regularization and ROC analysis


Transwarp Technologies: Cache Strategy for CDNs

Problem: Content delivery networks (CDNs) cache web pages for information retrieval tasks; however, storage and speed limitations require that cached content be constrained such that only the most relevant data is retained.

Solution: R-generated heat maps were used to categorize web pages according to hit rate and then store those above a specified hit-rate threshold within the TDH-HDFS database structure. This system was deployed in 2013 and is able to manage up to 9 million records per second.
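As a rough illustration of the threshold step only, the following Python sketch keeps the pages whose hit rate meets a cutoff. It is not the deployed system: the function name `pages_to_cache`, the definition of hit rate as a page's share of all requests, and the sample log are assumptions made for this example.

```python
from collections import Counter


def pages_to_cache(request_log, min_hit_rate):
    """Return {page: hit_rate} for pages at or above the threshold.

    Assumption: "hit rate" means the page's share of all requests in the log.
    """
    counts = Counter(request_log)
    total = sum(counts.values())
    if total == 0:
        return {}
    rates = {page: n / total for page, n in counts.items()}
    return {page: rate for page, rate in rates.items() if rate >= min_hit_rate}


# Hypothetical request log, one entry per page request.
log = ["/home", "/home", "/news", "/home", "/about",
       "/news", "/home", "/home", "/news", "/faq"]

cached = pages_to_cache(log, min_hit_rate=0.25)
print(sorted(cached))
```

In practice the threshold would be tuned against available storage, which is the trade-off described in the problem statement.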

Beijing Municipal Health Bureau

Problem: Cardiovascular disease is the number one killer among the elderly in Beijing. To improve preventive care, the BMHB keeps health profiles on every elderly citizen in Beijing, but needs to be able to identify which of nearly 200 health characteristics is most linked to cardiovascular disease.

Solution: Group Lasso and partial correlation methods were used to select the characteristics most highly linked to the occurrence of cardiovascular disease. The model built on these features exhibited 98% classification accuracy and was used to develop an early warning system now in use by the BMHB.

Financial and Economic Committee of China

Problem: The FEC must provide annual estimates of the next year’s tax revenue to inform present policy decisions. Stamp Duty revenue is notoriously difficult to predict, despite the involvement of several teams of analysts from different disciplines.

Solution: We used a variety of time-series models to predict, with 93.7% accuracy, a 10.62% drop in Stamp Duty revenue in 2012. Other government departments predicted the opposite, and their estimates were far from the actual figure. Our model also revealed that stock turnover, cash in circulation, and GDP are the most relevant factors driving Stamp Duty revenue.

New movie rating prediction

Problem: Many past examples show that a poor reputation hurts a film's gross, so some production companies wanted to predict a new film's rating before its release.

Solution: We used Item Response Theory to select reliable members of the rating team and collaborative filtering to handle missing data. We then applied a Bayesian model to optimize the final rating prediction. Our forecasting system achieved a mean absolute error of 3.5% on movies’ Rotten Tomatoes ratings. The system was then used to estimate the ratings of two independent films entered in the 2013 Sundance Film Festival.
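For readers unfamiliar with the metric, mean absolute error is the average size of the gap between predicted and actual ratings. The short Python sketch below shows the calculation. The ratings in it are made-up numbers chosen for illustration, not data from this project, and the function name is our own.

```python
def mean_absolute_error(predicted, actual):
    """Average absolute difference between paired predictions and outcomes."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)


# Illustrative ratings on a 0-100 scale (hypothetical values).
predicted = [72.0, 85.0, 40.0, 91.0]
actual = [75.0, 80.0, 44.0, 93.0]

mae = mean_absolute_error(predicted, actual)
print(mae)
```

On a 0-100 rating scale, a result of this kind reads as the number of points by which a typical prediction misses.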

Transwarp Technologies: Bill Query System

Problem: Telecommunications billing data has exploded in recent years due to the rapid development of mobile technology and changes in usage patterns. Storage and retrieval at these scales pose challenges to the speed and efficiency of data analysis and management.

Solution: A Transwarp platform on x86 clusters was used to provide a 30-fold increase in query performance over RISC platforms. This new system can handle 30 TB of user billing data per month.

Citibike Data Collection, Prediction and Application with R

Problem: Citibike is New York City's bike sharing system. Users often have difficulty finding available bikes to rent or open docks to return a bike. Predicting availability and guiding users to the right station is therefore important. Because users follow complex and diverse routes, it is hard to satisfy everyone, especially during rush hour.

Solution: We built a program that scrapes station data and saves it to our database automatically. Using this data, we applied time series and machine learning models to predict the number of bikes at every station. Based on these models, we built a website for the Citibike system that helps users plan their trips better.

Early Warning System for Cardiovascular Disease

Problem: Government hospitals and health service stations had been building health profiles for every elderly resident of Beijing. The bureau wanted to use these profiles to build an early warning system to help prevent chronic diseases among the elderly. The data had nearly 200 variables, with severe multicollinearity and a large amount of missing data.

Solution: We applied several imputation methods to handle the missing data. After cleaning the data, we combined group Lasso and partial correlation to select only 7 variables and combinations from the nearly 200 variables and built a classification model. The model classified cardiovascular disease risk with 98% accuracy. Based on our model, the Bureau decided to build an early warning system through which every elderly resident of Beijing can log into their health profile and get a prediction of their cardiovascular disease risk.
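To show why partial correlation helps with correlated predictors, here is a minimal Python sketch of the first-order case: the correlation between two variables after removing the linear effect of a third. It is a textbook formula, not the project's selection procedure, and the variable names and toy values are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def partial_correlation(x, y, z):
    """Correlation of x and y after controlling for z (first-order formula)."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Toy, centered values (hypothetical): both measurements are driven by the
# same confounder plus unrelated noise.
confounder = [1, 1, -1, -1]
marker = [2, 0, 0, -2]
outcome = [2, 0, -2, 0]

print(pearson(marker, outcome))
print(partial_correlation(marker, outcome, confounder))
```

In this toy data the two measurements look moderately correlated, yet the association vanishes once the shared confounder is controlled for. Screening out such variables is one way to thin a set of highly collinear predictors.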

Transwarp Technologies: Urban real-time video surveillance

Problem: Video surveillance systems have become standard practice for coordinating emergency response, controlling traffic, and enhancing public security in urban areas.

Solution: We used Transwarp to help a city build an urban real-time video surveillance system.

  1. Performance improvement. Our system takes less than one-fifth of the time required by the original Oracle database.
  2. Higher speed. Each analysis takes less time than on a traditional database.
  3. Each node of the Hadoop cluster is both a computation node and a storage node.
  4. The access bandwidth of HDFS is the aggregate bandwidth of the entire network, which can reach hundreds of Gbps or more.
  5. Storage location awareness. Tasks can be assigned to the node that stores the video and make full use of local hard disk bandwidth to further improve throughput.
  6. During a long-running video search task, if the primary task scheduler fails, the backup scheduler automatically takes over so that tasks continue running.
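The storage location awareness described in the list can be pictured with a small Python sketch that sends each search task to the least-loaded node already holding the video block. This is a simplified illustration, not the scheduler used by Hadoop or Transwarp. The function `assign_tasks`, the node names, and the block names are all hypothetical.

```python
def assign_tasks(block_locations, node_load):
    """Map each video block to the least-loaded node that stores a copy.

    block_locations: {block_id: [node, ...]} listing nodes holding the block
    node_load: {node: number of tasks already assigned}
    Falls back to any known node when a block has no listed location.
    """
    load = dict(node_load)  # copy so the caller's dict is not modified
    assignments = {}
    for block, nodes in block_locations.items():
        candidates = nodes if nodes else list(load)
        chosen = min(candidates, key=lambda n: load.get(n, 0))
        assignments[block] = chosen
        load[chosen] = load.get(chosen, 0) + 1
    return assignments


# Hypothetical cluster: each block is replicated on two of three nodes.
block_locations = {
    "cam01_0900": ["node-a", "node-b"],
    "cam01_1000": ["node-a", "node-c"],
    "cam02_0900": ["node-b", "node-c"],
}
node_load = {"node-a": 0, "node-b": 0, "node-c": 0}

plan = assign_tasks(block_locations, node_load)
for block, node in plan.items():
    print(block, "->", node)
```

Because every task runs where its data already sits, video is read from local disk rather than across the network, which is the throughput benefit the list refers to.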