Fushing’s research interests (2022)

“Data.n.science: CEDA in Complex systems”

A: Research theme in Data.n.science.

My research theme focuses on discovering the scientific intelligence contained in data, both by analyzing real-world databases and by solving my own scientific problems with databases that I create from complex dynamic systems of interest. This theme rests on a reality: data curators encode their domain knowledge and intelligence into their structured databases, and in some cases I am the data curator myself. Facing this reality with databases in hand, my research is typically geared toward analyzing these databases not just to recover the encoded intelligence, but to go beyond it. The room for going beyond the data curators’ domain knowledge and intelligence is described in P. W. Anderson’s 1972 Science paper titled “More is different”. The more data we have, the more discoveries about the multiscale heterogeneous nature of a system can be made (see [F-7] for details). The room for me to become a real scientist is wide open via the Internet because data is everywhere, and so is intelligence.

From the perspective of this research theme, I have been developing Categorical Exploratory Data Analysis (CEDA) coupled with a highly adaptable Major Factor Selection (MFS) protocol as a paradigm for studying a wide range of complex systems.

A data-driven wisdom of mine: Data’s Multiscale Information Content with Structural Heterogeneity Solves the Majority of Scientific Problems.

B: Complex dynamic systems of my current interest.

a. Human physical activities and their precision learnings with wearable sensors: MLB pitching, walking, Tango dance (see [G1-1], [G1-2], [G1-3]).

b. Human health within large survey databases and causal connectivity among chronic diseases: BRFSS, NHLBI (see [G6-1], [G6-2]).

c. Quantitative Trait Loci (QTL) on plants and genome-wide selection with high order interactions: Tomato’s Late Blight disease, seed’s phenotypic characteristics and genetic connections of rice (see [G2-1]).

d. Large-scale longitudinal social or psychological surveys with self-rating data: Rosenberg Self-Esteem Scale, Harter Scale (see [G3-1], [G3-2]).

e. Covid-19 in Taiwan from March 2022 and inferred equilibrium: tempo-spatial dynamics of controlled-spreading in Taiwan with district- and age-specific curves of infection (see [G7-1]).

f. AI & Social justice related to computer vision and data-driven regulating philosophy: annotate and extract emotion evolution of Adele’s facial expressions in a song. (With my PhD student Abdul-Hakeem Omotayo)

g. Color-aging and -reconstruction in paintings: van Gogh’s 3 Sunflowers and 3 The bedroom (See [G4-1], [G4-2]).

h. From comparing to mimicking pianists’ playing dynamics: Complex mixing in spectrum data encoded with pianists’ idiosyncratic styles (https://drive.google.com/file/d/1yQfm904qb1K-eh0CSOhllyyGvXiJmmP4/view?usp=sharing).

i. California drought and wildfire:

Different bioregions reveal rather distinct wildfire spikes, while they share relatively similar drought conditions that are apparently regulated by the alternating but arrhythmic El Niño and La Niña.

C: Take-home scientific messages from my research.

1. Information content in data of complex systems manifests multiscale heterogeneity that provides systemic understanding going beyond data curators’ intelligence (see [F-5], [F-7]).

2. Data’s information content is critically tied to the categorical nature of all data types, which makes it possible for the contingency table to become the fundamental computational platform and to facilitate associations based on Theoretical Information Measurements (see [F-1] to [F-7]).

3. Conditional Shannon entropy drives the major factor selection protocol to discover the multiscale heterogeneity underlying the dynamics of complex systems, which neither need modeling nor can be modeled well in any a priori fashion (see [F-4], [F-5], [F-6], [F-7]).

4. Visualizations of multiscale heterogeneity provide authentic resolutions to scientific topics and issues by revealing constrained relations between deterministic and stochastic structures on multiscale levels (see [F-4], [F-5], [F-6], [F-7]).

5. The equality between training and testing data sets cannot be assumed, and the post-marketing population typically outgrowing the union of the training and testing data sets is a huge concern in science and society (see [F-4]).
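As a minimal illustration of messages 2 and 3, the sketch below computes the conditional Shannon entropy of a categorical response given a categorical covariate directly from a contingency table. The function names and the toy table are assumptions made for illustration only; this is not the MFS protocol of [F-5] to [F-7], just the basic quantity such a protocol would build on.

```python
import math

def shannon_entropy(counts):
    """Shannon entropy (in bits) of the distribution given by nonnegative counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def conditional_entropy(table):
    """H(response | covariate) for a contingency table of counts.

    table[i][j] = count of covariate category i with response category j.
    Computed as the row-total-weighted average of the row entropies.
    """
    n = sum(sum(row) for row in table)
    return sum((sum(row) / n) * shannon_entropy(row) for row in table)

# Hypothetical 3x2 table: rows = covariate categories, columns = response categories.
toy_table = [
    [40, 0],   # covariate category 1: response fully determined
    [20, 20],  # covariate category 2: response maximally uncertain
    [0, 20],   # covariate category 3: response fully determined
]

# Marginal response counts are the column sums of the table.
response_counts = [sum(col) for col in zip(*toy_table)]
h_marginal = shannon_entropy(response_counts)    # H(response)
h_conditional = conditional_entropy(toy_table)   # H(response | covariate)
entropy_drop = h_marginal - h_conditional        # uncertainty removed by the covariate
```

In this toy table the covariate removes the uncertainty in two of its three categories and none in the middle one, which is the kind of heterogeneity a single global summary would hide.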

D: Computational developments.

Based on Theoretical Information Measurements, my most recent computational developments culminated in the major factor selection (MFS) protocol in categorical exploratory data analysis (CEDA). This MFS protocol is used as a way of studying many complex systems and making inferences with structured data. The accommodating capacity of the MFS protocol is listed below; see [F-5] to [F-8] for details.

I. Multi-dimensional (possibly highly dependent) response variables vary across multiscale levels within (temporal-spatial) dynamics.

II. Structural dependency among covariate variables constitutes multiscale heterogeneity-oriented collections of major factors.

III. Natural effects of high order interactions are revealed for high order major factors through the contingency table platform which can capture very high order dependency.

IV. Databases derived from large longitudinal studies are found to be coupled with mechanisms that vary along the temporal axis, in the sense that any structural assumptions on the involved stochastic processes, such as the short-memory property, are likely unrealistic and truly counter-productive.

V. With proper annotations, spatial-specific time series or growth curves would be equipped with the capability of manifesting spatial dynamics across the entire region under study.

VI. Extremely imbalanced data (from rare events or diseases) would be allowed to show expansions of conditional entropies along categories’ ordinal axis that would lead to an entirely new concept of major factor selection.

VII. Response-vs-Covariate (Re-Co) dynamics with or without functional structures is allowed to be composed of heterogeneous component-wise mechanisms.

VIII. Mimicking all or part of the data set becomes a legitimate route for reliability evaluations.

IX. Explicit de-associating constructions (conditioning on the contingency table platform) for high-dimensional variables are the necessary visible path to confirming causality, because discovering which variables provide how much extra information beyond which other variables can settle many scientific issues via data’s information content.

X. Categorical covariate variables’ individual effects as well as their interacting effects can and should be evaluated in whole without the artificial need of binary surrogate variables.
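Items III and X can be illustrated with a small sketch, under the assumption that a feature-set's effect is measured by the conditional entropy of the response given the joint categories of that feature-set. The data, the feature names A, B, C and the exhaustive search over orders 1 and 2 are hypothetical choices for illustration, not the published MFS protocol. The response here depends on A and B only through their interaction, so no single covariate lowers the entropy while the pair {A, B} removes it entirely, and no binary surrogate variables are needed.

```python
import math
from collections import Counter
from itertools import combinations, product

def entropy(labels):
    """Shannon entropy (in bits) of a list of categorical labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def conditional_entropy(response, keys):
    """H(response | keys), where keys[i] is the joint covariate category of record i."""
    groups = {}
    for key, y in zip(keys, response):
        groups.setdefault(key, []).append(y)
    n = len(response)
    return sum(len(ys) / n * entropy(ys) for ys in groups.values())

# Hypothetical categorical covariates; every combination appears once.
levels = {"A": ["a1", "a2"], "B": ["b1", "b2"], "C": ["c1", "c2"]}
names = list(levels)
rows = list(product(*levels.values()))

# Response is "high" exactly when A and B disagree (a pure order-2 interaction);
# C is irrelevant by construction.
response = ["high" if (a == "a1") != (b == "b1") else "low" for a, b, c in rows]

# Conditional entropy of the response for every feature-set of order 1 and 2.
scores = {}
for order in (1, 2):
    for subset in combinations(range(len(names)), order):
        keys = [tuple(row[i] for i in subset) for row in rows]
        scores["+".join(names[i] for i in subset)] = conditional_entropy(response, keys)

# Feature-sets sorted from most to least informative.
ranking = sorted(scores, key=scores.get)
```

Sorting `scores` puts "A+B" first with zero conditional entropy, while "A", "B" and "C" alone all leave the full one bit of uncertainty in place.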

E: Impacts of CEDA in Data Analysis.

1. Conditional entropy-based association can effectively accommodate non-linear relations and complicated interactions among multiple groups of features. All linearity-based statistical modeling methodologies, such as all types of regression, ANOVA, log-linear models, GLMM and GEE, become rather limited from the perspective of discovery in data analysis when unrealistic structures of dependency need to be imposed to make the techniques work (see [F-5], [F-6], [F-7], [F-8], [G5-1], [G5-2], [G6-1], [G6-2]).

2. The contingency table is a basis for mimicking data for reliability evaluations that accommodates the authentic stochasticity in data, and its dimension is rather flexible for avoiding the effects of the curse of dimensionality. These two features couple CEDA-based inferences with coherent finite-sample reliability evaluations and avoid the dangers of applying bootstrapping blindly without recognizing the data’s intrinsic stochasticity (see [F-5], [F-6], [F-7], [G5-1], [G5-2]).

3. The major factor selection protocol can discover intrinsic lower- and higher-order effects of covariate feature-sets. All model-based feature selection techniques appear to be limited in capacity, because assuming knowledge of the model underlying the dynamics under study is likely too big a step to take if the scientific goal is discovery (see [F-1], [F-5], [F-6], [F-7], [G5-1], [G5-2]).

4. The principle of CEDA is discovering data’s authentic information content, so that we can discover something new. In comparison, the likelihood principle in all model-based mathematical statistics becomes ill-fitted for discovery tasks, and its asymptotic theories are apparently incoherent because they ignore reality (see [F-1], [F-5], [F-6], [F-7], [G5-1], [G5-2]).
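Point 2 above can be sketched as follows, under one simple reading of "mimicking" that is an assumption here and may differ from the schemes in [F-4] to [F-7]: each row of the observed contingency table is redrawn from a multinomial with the observed row proportions, keeping the row totals fixed, and the spread of the conditional entropy across mimicked tables serves as a finite-sample reliability band. The table, the seed and the number of mimicries are hypothetical.

```python
import math
import random

def conditional_entropy(table):
    """H(column variable | row variable) in bits, from a table of counts."""
    n = sum(sum(row) for row in table)
    h = 0.0
    for row in table:
        total = sum(row)
        for c in row:
            if c > 0:
                h -= (total / n) * (c / total) * math.log2(c / total)
    return h

def mimic_table(table, rng):
    """Draw one mimicked table: each row is resampled from a multinomial with
    the observed row proportions, and the row total is kept fixed."""
    mimic = []
    for row in table:
        total = sum(row)
        if total == 0:
            mimic.append([0] * len(row))
            continue
        draws = rng.choices(range(len(row)), weights=row, k=total)
        mimic.append([draws.count(j) for j in range(len(row))])
    return mimic

# Hypothetical observed table: rows = covariate categories, columns = response categories.
observed = [[30, 10], [10, 30], [20, 20]]
h_observed = conditional_entropy(observed)

rng = random.Random(2022)  # fixed seed so the sketch is reproducible
mimicked = [mimic_table(observed, rng) for _ in range(200)]
values = sorted(conditional_entropy(t) for t in mimicked)

# Empirical band covering roughly the central 95% of the mimicked values.
low, high = values[5], values[194]
```

Because the mimicries are drawn from the table itself, the band reflects the stochasticity at the table's own sample size rather than an asymptotic approximation.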

F: Stepwise Developments on CEDA:

1. Histogram to categorize quantitative features:

[F-1] Fushing H. and Roy, T. (2018) Complexity of possibly-gapped histogram and analysis of histogram (ANOHT). Royal Society-Open Science, doi: 10.1098/rsos.171026.

2. Entropy, association, contingency table and categorical-pattern-matching:

[F-2] Fushing, Liu, S-Y, Hsieh, Y-C and McCowan, B. (2018). From patterned response dependency to structural covariate dependency: categorical-pattern-matching. PLoS One. https://doi.org/10.1371/journal.pone.0198253.

3. Multiclass Classification (MCC) and Response Manifold Analytics (RMA): Categorical patterns are truthful and informative!!

[F-3] Fushing H. and Elizabeth P. Chou (2021). Categorical Exploratory Data Analysis: From Multiclass Classification and Response Manifold Analytics perspectives of baseball pitching dynamics. Entropy, 23(7), 792; https://doi.org/10.3390/e23070792.

4. Mimicking Data Matrix: This is the right way for true reliability of multiscale patterns!

[F-4] Fushing H., Elizabeth P. Chou and Ting-Li Chen. (2021). Mimicking complexity of structured data matrix’s information content: Categorical Exploratory Data Analysis. Entropy 23(5), 594; https://doi.org/10.3390/e23050594.

5. Feature Selection for Major Factors: This is the right approach for studying a collection of Complex Systems.

[F-5] Ting-Li Chen, Elizabeth P. Chou and Fushing Hsieh. (2022). Categorical Nature of Major Factor Selection via Information Theoretic Measurements. Entropy. 23(12), 1684; https://doi.org/10.3390/e23121684.

[F-6] Elizabeth P. Chou, Ting-Li Chen and Fushing Hsieh. (2022). Unraveling Hidden Major Factors by Breaking Heterogeneity into Homogeneous Parts within Many-System Problems. Entropy. 24(2), 170; https://doi.org/10.3390/e24020170.

[F-7] Fushing Hsieh, Elizabeth P. Chou, Ting-Li Chen. (2022). Multiscale major factor selections for complex system data with structural dependency and heterogeneity (submitted).

6. Issues related to conditional entropy evaluations in CEDA!

[F-8] Ting-Li Chen, Fushing H and Elizabeth P. Chou. (2022). Practical guidelines on evaluating Information Theoretical measurements for discovering major factors and making inferences in Categorical Exploratory Data Analysis. (Submitted.)

G: Related research topics.

1. Wearable sensors: Mimicking many Dimensional Rhythmic Time Series Analysis: information flows among time series.

<Ideas> Computed cycles need to be structurally represented so that their composition of serial biomechanical states can be discovered. Different biomechanical states have distinct variations across all cycles, yet the sums of their durations are nearly constant cycle lengths. This fact implies a complicated constraining relation between deterministic and stochastic structures. The mimicking mechanism becomes rather intricate when we attempt to develop precision learning algorithms specific to a target individual.

[G1-1] Fushing H. and Xiaodong Wang. (2020) From learning gait signatures of many individuals to reconstructing gait dynamics of one single individual. Frontiers in Applied Mathematics and Statistics, 12. https://doi.org/10.3389/fams.2020.564935.

[G1-2] Xi Yang and Fushing H. (2022) Mimicking gait dynamics: a step toward precision learning of human activities. (submitted)

[G1-3] Emanuela Furfaro, Xi Yang and Fushing H. (2022) Mimicking Tango dancing dynamics: a precision learning. (On-going)

2. QTL and GWAS: All covariate features are categorical!!

<Ideas> Genotypes are categorical. The digital coding scheme doesn’t turn them into metric variables. GWAS based on linear regression models is biased and has only limited capacity for discovering significant genes or markers for QTL purposes.
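A toy sketch of why numeric coding can mislead, using entirely hypothetical data that are not taken from [G2-1]: assume a marker where only the heterozygote is resistant. Coding the genotypes as 0/1/2 and measuring linear association finds nothing, while the conditional entropy of the phenotype given the genotype category shows that the genotype determines the phenotype completely.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of categorical labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def conditional_entropy(response, covariate):
    """H(response | covariate) for two equally long lists of categories."""
    groups = {}
    for key, y in zip(covariate, response):
        groups.setdefault(key, []).append(y)
    n = len(response)
    return sum(len(ys) / n * entropy(ys) for ys in groups.values())

def pearson(x, y):
    """Pearson correlation coefficient of two numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical marker: only the heterozygote "Aa" is resistant.
genotypes = ["AA"] * 10 + ["Aa"] * 10 + ["aa"] * 10
phenotype = ["susceptible"] * 10 + ["resistant"] * 10 + ["susceptible"] * 10

# Conventional additive coding treats the genotype as a metric variable.
coding = {"AA": 0, "Aa": 1, "aa": 2}
x = [coding[g] for g in genotypes]
y = [1.0 if p == "resistant" else 0.0 for p in phenotype]

linear_association = pearson(x, y)                       # numerically zero: linear view sees nothing
h_phenotype = entropy(phenotype)                         # uncertainty before using the marker
h_given_genotype = conditional_entropy(phenotype, genotypes)  # zero: genotype fixes the phenotype
```

The pattern is deliberately extreme; with real markers the contrast would be a matter of degree.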

[G2-1] Leticia, Li-Yu Liu and Fushing H. (2022). Categorical exploratory data analysis for discovering gene and gene-to-gene states in tomato's Late Blight disease. (submitted)

3. Longitudinal self-rating data.

<Ideas> Self-rating data is categorical on an ordinal axis. The concepts of mean and variance are not valid. Uncertainty evaluated via entropy and displayed along the ordinal axis turns out to be very suitable for such data. When a temporal axis is involved, CEDA and its major factor selection protocol provide very informative answers matching the data curators’ original scientific intentions and questions.

[G3-1] Emanuela Furfaro, Fushing Hsieh. (2022) Ordinal conditional entropy displays reveal intrinsic characteristics of Rosenberg Self-Esteem Scale. (submitted)

[G3-2] Emanuela Furfaro, Fushing Hsieh and Emilio Ferrer. (2022) Longitudinal data analysis on Harter self-worth rating on late adultescence conditional entropy. (submitted).

4. Color Physics for painting aging:

<Ideas> Each painting provides millions of 5-dimensional data points in RGB and Euclidean spaces. Different colors’ manifolds in RGB space correspond to different scatter plots in 2D Euclidean space. Their relational patterns are keys for figuring out colors’ aging effects and for carrying out color reconstruction. We have finished two studies on van Gogh’s paintings.

[G4-1] Shuting Liao; Patrice Koehl; Jennifer Schultens; Fushing Hsieh. (2021). The geometry of colors in van Gogh’s Sunflowers. Heritage Science. 9:136 https://doi.org/10.1186/s40494-021-00608-y.

[G4-2] Shuting Liao and Fushing Hsieh. (2022). Learning color reconstruction from van Gogh’s Bedroom. (submitted).

5. CEDA and Statistics.

<Ideas> CEDA’s major factor selection ([F-7]) brings out a collection of feature-sets of varying sizes or different orders of interacting effects that specify component-mechanisms in a discovery fashion. In contrast, feature selection techniques in Statistics are all model based. We first work out CEDA to replace all modeling methodologies in Survival Analysis, such as the Cox proportional hazards model, variants of the Accelerated Failure Time model, and additive models. We then continue to replace all GLMM and log-linear models, with or without a temporal axis.

[G5-1] Shuting Liao and Fushing Hsieh. (2022). CEDA with major factor selection for survival data. (Submitted).

[G5-2] Emanuela Furfaro, Fushing Hsieh. (2022). CEDA with major factor selection for categorical longitudinal data. (On-going).

6. CEDA and causality.

<Ideas> CEDA’s major factor selection ([F-7]) is capable of finding which feature-sets can provide extra information beyond which feature-sets. Such capacity will provide a basis for carrying out Granger Causality. This capacity also is useful for checking whether matching is properly carried out. We analyze large observational and survey databases to show the merits of CEDA’s major factor selection.

[G6-1] Shuting Liao and Fushing Hsieh. (2022). CEDA with major factor selection for NHLBI databases from causality perspective. (On-going).

[G6-2] Fushing Hsieh, Emanuela Furfaro, Ting-Li Chen and Elizabeth Chou. (2022). CEDA with major factor selection for BRFSS survey data from causality perspective. (On-going).

7. CEDA and tempo-spatial functional Covid-19 infection data.

<Ideas> We extract characteristic features from each district-specific infection growth curve, and then apply CEDA techniques to discover the spreading dynamics of Covid-19 from the initial hotspot in Taipei and then across all of Taiwan.

[G7-1] Ting-Li Chen, Elizabeth Chou and Fushing Hsieh (2022). Controlled spreading of Covid-19 in Taiwan and its inferred equilibrium: functional data via CEDA with major factor selection. (On-going).

H: Other publications in various fields (after 2018):

a. Neuroscience

[H-1] J. Zheng, M. Liang, A. Ekstrom, L. Ge, W. Yu and H. Fushing. (2018) On Association Study of Scalp EEG Data Channels Under Different Circumstances. WASA 2018 paper.

[H-2] Hsieh Fushing and Jingyi Zheng (2019). Unraveling pattern-based mechanics defining self-organized recurrent behaviors in a complex system: a Zebrafish's calcium brain-wide imaging example. Front. Appl. Math. Stat., https://doi.org/10.3389/fams.2019.00013

[H-3] Zheng, J., Linqiang Ge and Hsieh Fushing. (2018). A Data-driven approach to predict epileptic seizures from brain-wide calcium Imaging Video Data. BIOKDD 2018.

[H-4] Zheng, J., Fushing H. and Ge, L. (2019). A Data-driven Approach to Predict and Classify Epileptic Seizures From Brain-wide Calcium Imaging Video Data. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB). doi: 10.1109/TCBB.2019.2895077.

[H-5] Jingyi Zheng, Mingli Liang, Sujata Sinha, Linqiang Ge, Wei Yu, Arne Ekstrom, and Fushing Hsieh (2021). Time-frequency analysis of scalp EEG with Hilbert-Huang transform and deep learning. IEEE Journal of Biomedical and Health Informatics (J-BHI), 4, 1549-1559. doi: 10.1109/JBHI.2021.3110267.

b. Animal sciences.

[H-6] Balasubramaniam, K. N., Beisner, B. A., Guan, J., Vandeleest, J., Fushing, H., Atwill, E. R., & McCowan, B. J. (2018). Social network community structure is associated with the sharing of commensal E. coli among captive rhesus macaques (Macaca mulatta). PeerJ, 6, e4271.

[H-7] McCowan B, Vandeleest J, Balasubramaniam K, Hsieh Fushing, Nathman A, Beisner B. 2022 Measuring dominance certainty and assessing its impact on individual and societal health in a nonhuman primate model: a network approach. Phil. Trans. R. Soc.B 377: 20200438. https://doi.org/10.1098/rstb.2020.0438

[H-8] McVey C., Fushing H., D. Manriquez, P. Pinedo, K. Horback. (2020). Mind the Queue: A case study in visualizing heterogeneous behavioral patterns in livestock sensor data using unsupervised machine learning techniques. Frontiers in Veterinary Science, Animal Behavior and Welfare. 13. https://doi.org/10.3389/fvets.2020.00523.

[H-9] Catherine McVey, Fushing Hsieh, Diego Manriquez, Pablo Pinedo, Kristina Horback. (2021) Livestock Informatics Toolkit: Visually characterizing complex behavioral patterns across multiple sensor platforms using information theoretic approaches and unsupervised machine learning. Sensors. https://www.mdpi.com/1424-8220/22/1/1

c. Autism.

[H-10] Dwyer, P., X. Wang, R. De Meo-Monteil; Fushing H.; C. D. Saron; S. M. Rivera. (2020). Defining Clusters of Young Autistic and Typically-Developing Children Based on Loudness-Dependent Auditory Electrophysiological Responses. Molecular Autism, 11. https://doi.org/10.1186/s13229-020-00352-3

[H-11] Dwyer, P., X. Wang, R. De Meo-Monteil; Fushing H.; C. D. Saron; S. M. Rivera. (2021). Using Clustering to Examine Inter-Individual Variability in Topography of Auditory Event-Related Potentials in Autism and Typical Development. Brain Topography. doi: 10.1007/s10548-021-00863-z

d. Quantitative Finance

[H-12] Xiaodong Wang and Fushing Hsieh (2022) Unraveling S&P500 stock volatility and networks -An encoding and decoding approach. Quantitative Finance, 22, 997-1016.

e. Agronomy

[H-13] Fushing H., Olivia Lee, Constantin Heitkamp, Hildegarde Heymann, Susan E. Ebeler, Roger B. Boulton and Patrice Koehl. (2019) Unraveling the Regional Specificities of Malbec Wines from Mendoza, Argentina, and from Northern California. Agronomy, 9, 234; doi:10.3390/agronomy9050234.

[H-14] Shuting Liao, Li-Yu Liu, Ting-An Chen, Kuang-Yu Chen and Fushing Hsieh. (2021). Color-complexity enabled exhaustive color-dots identification and spatial patterns testing in images. PLoS One. https://doi.org/10.1371/journal.pone.0251258.

f. Computer game and machine learning

[H-15] Li, Y., Cheng, M. Fujii, K., Fushing, H. and Hsieh, C-J. (2018). Learning from group comparison: exploiting higher order interactions. NIPS_2018 conference.

[H-16] Guan, J. and H. Fushing. (2018). Coupling geometry on binary bipartite networks: hypotheses testing on pattern geometry and nestedness. Frontiers in Applied Mathematics and Statistics 4. doi: 10.3389/fams.2018.00038.

[H-17] Guan J., Fushing H. and Koehl P. (2019) DCG++: A data-driven metric for geometric pattern recognition. PLoS One. https://doi.org/10.1371/journal.pone.021783.

[H-18] Fushing H. and X. Wang. (2020). Coarse- and fine-scale geometric information content of Multiclass Classification and implied Data-driven Intelligence. MLDM 2020.

I: Ideas regarding Data.n.Science in society.

Idea-1: “Real data is everywhere, and so is real intelligence.” This should be one mission for data.n.science.

Idea-2: “By bringing out authentic information content in real data, ideal data science must offer authentic knowledge and render no concerns about producing misinformation.” Data.n.science should be responsible to society.

Idea-3: “Data science’s visualization capacity has to make it a super tool for presenting data-driven information to the public and to individuals.” Data.n.science needs to connect to the real world.