Pandas: Reproduction of Microarray Study Results

This is a step-by-step guide to reproduce the results of a microarray study by using Pandas, scikit-learn, and other Python scientific libraries. Here, my target is the principal component analysis (PCA) conducted in Lukk M, Kapushesky M, Nikkilä J, et al. A global map of human gene expression. Nat Biotechnol. 2010;28:322-324. doi:10.1038/nbt0410-322.

Data retrieval

The authors provide all their datasets at the repository 'E-MTAB-62 - Human gene expression atlas of 5372 samples representing 369 different cell and tissue types, disease states and cell lines.' Let's look and see what's there.

Four files are listed in addition to lots of raw data. Namely,

E-MTAB-62.idf.txt
Investigation description
E-MTAB-62.sdrf.txt
Sample and data relationship
E-MTAB-62.processed.2.zip
Microarray data processed, tab-seperated, and compressed
A-AFFY-33.adf.txt
Array design

First, I import Python modules I need in this project.

import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
from sklearn.decomposition import PCA

Next, I load E-MTAB-62.sdrf.txt to associate sample attributes such as cell types and origins with the microarray data to be downloaded soon.

sample_data_rel = pd.read_csv(
    'https://www.ebi.ac.uk/arrayexpress/files/E-MTAB-62/E-MTAB-62.sdrf.txt',
    sep='\t',
    index_col='Source Name'
)

sample_data_rel consists of 33 columns for sample attributes and 5372 rows for as many samples. For example, the column "Characteristics[4 meta-groups]" indicates that a sample is related to which of neoplasm, cell line, normal, and disease.

>>> sample_data_rel['Characteristics[4 meta-groups]'].value_counts()

neoplasm     2315
cell line    1256
normal       1033
disease       765
Name: Characteristics[4 meta-groups], dtype: int64

Most samples derive from humans, but surprisingly, there happen to be a few mouse samples as well.

>>> sample_data_rel['Characteristics[Organism]'].value_counts()

Homo sapiens    5369
Mus musculus       3
Name: Characteristics[Organism], dtype: int64

I get rid of them since I need only human data.

sample_data_rel = sample_data_rel.loc[
    sample_data_rel['Characteristics[Organism]'] == 'Homo sapiens'
]

Below is the complete list of sample attributes and variables.

All attributes and unique variables of sample_data_rel

Characteristics[OperatorVariation]
Justin,,Lamb                    324
Milton,W,Taylor                 308
Roel,,Verhaak                   284
Benjamin,,Haibe-Kains           273
Michael,,Hummel                 213
            ...
Name: Characteristics[OperatorVariation], Length: 163, dtype: int64


Characteristics[DataSource]
GSE5258     324
GSE7123     308
GSE1159     284
GSE4475     213
E-AFMX-6    195
            ...
Name: Characteristics[DataSource], Length: 205, dtype: int64


Characteristics[4 meta-groups]
neoplasm     2315
cell line    1256
normal       1033
disease       765
Name: Characteristics[4 meta-groups], dtype: int64


Characteristics[15 meta-groups]
solid tissue neoplasm cell line        831
breast cancer                          672
leukemia                               567
normal solid tissue                    566
normal blood                           467
            ...
Name: Characteristics[15 meta-groups], dtype: int64


Characteristics[369 groups]
breast cancer                              672
mononuclear cell infection                 314
acute myeloid leukemia                     295
B-cell lymphoma                            213
MCF7 breast epithelial adenocarcinoma      213
            ...
Name: Characteristics[369 groups], Length: 368, dtype: int64


Characteristics[groups with 10 and more replicates]
                                       761
breast cancer                          672
mononuclear cell infection             314
acute myeloid leukemia                 295
B-cell lymphoma                        213
            ...
Name: Characteristics[groups with 10 and more replicates], Length: 97, dtype: int64


Characteristics[Blood/NonBlood meta-groups]
non blood    3447
blood        1922
Name: Characteristics[Blood/NonBlood meta-groups], dtype: int64


Characteristics[Organism]
Homo sapiens    5369
Name: Characteristics[Organism], dtype: int64


Characteristics[OrganismPart]
blood            1089
mammary gland    1033
bone marrow       733
                  287
lung              286
            ...
Name: Characteristics[OrganismPart], Length: 131, dtype: int64


Characteristics[CellType]
                                     3333
peripheral blood mononuclear cell     452
blast cell, mononuclear cell          284
CD138+ plasma cell                    142
Leukocyte                             107
            ...
Name: Characteristics[CellType], Length: 115, dtype: int64


Characteristics[CellLine]
            4112
mcf7         213
cultured      88
pc3           64
k562          48
            ...
Name: Characteristics[CellLine], Length: 264, dtype: int64


Characteristics[DiseaseState]
                                   1271
breast cancer                       686
acute myeloid leukemia              322
hepatitis c                         192
diffuse large B-cell lymphoma       160
            ...
Name: Characteristics[DiseaseState], Length: 231, dtype: int64


Characteristics[DevelopmentalStage]
          4813
adult      404
embryo     110
fetus       42
Name: Characteristics[DevelopmentalStage], dtype: int64


Characteristics[DiseaseStage]
                         4233
primary                   500
aggressive                141
grade 2                    74
lymph node metastasis      59
            ...
Name: Characteristics[DiseaseStage], dtype: int64


Characteristics[Sex]
                 3034
male             1272
female           1016
unknown sex        34
mixed sex           9
            ...
Name: Characteristics[Sex], dtype: int64


Characteristics[Age]
                      4678
10 days to 12 days      23
69                      18
65                      17
62                      17
            ...
Name: Characteristics[Age], Length: 189, dtype: int64


Characteristics[Generalization 6 on reduced data]
hematopoietic system           1854
solid tissue                   1382
                                875
cell line                       628
brain                           286
            ...
Name: Characteristics[Generalization 6 on reduced data], dtype: int64


Labeled Extract Name
GSM23227.CEL_L     1
GSM118757.CEL_L    1
GSM118815.CEL_L    1
GSM118790.CEL_L    1
GSM118950.CEL_L    1
            ...
Name: Labeled Extract Name, Length: 5369, dtype: int64


Label
biotin    5369
Name: Label, dtype: int64


Hybridization Name
GSM23227.CEL     1
GSM118757.CEL    1
GSM118815.CEL    1
GSM118790.CEL    1
GSM118950.CEL    1
            ...
Name: Hybridization Name, Length: 5369, dtype: int64


Array Design REF
A-AFFY-33    5369
Name: Array Design REF, dtype: int64


Array Data File
GSM23227.CEL     1
GSM118757.CEL    1
GSM118815.CEL    1
GSM118790.CEL    1
GSM118950.CEL    1
            ...
Name: Array Data File, Length: 5369, dtype: int64


Comment [ArrayExpress FTP file]
ftp://ftp.ebi.ac.uk/pub/databases/microarray/data/experiment/MTAB/E-MTAB-62/E-MTAB-62.raw.94.zip     71
ftp://ftp.ebi.ac.uk/pub/databases/microarray/data/experiment/MTAB/E-MTAB-62/E-MTAB-62.raw.93.zip     70
ftp://ftp.ebi.ac.uk/pub/databases/microarray/data/experiment/MTAB/E-MTAB-62/E-MTAB-62.raw.91.zip     68
ftp://ftp.ebi.ac.uk/pub/databases/microarray/data/experiment/MTAB/E-MTAB-62/E-MTAB-62.raw.92.zip     66
ftp://ftp.ebi.ac.uk/pub/databases/microarray/data/experiment/MTAB/E-MTAB-62/E-MTAB-62.raw.90.zip     65
            ...
Name: Comment [ArrayExpress FTP file], Length: 121, dtype: int64


Protocol REF
P-MTAB-2072    5369
Name: Protocol REF, dtype: int64


Derived Array Data Matrix File
hgu133a_rma_okFiles_080619_MAGETAB.csv    5369
Name: Derived Array Data Matrix File, dtype: int64


Comment [Derived ArrayExpress FTP file]
ftp://ftp.ebi.ac.uk/pub/databases/microarray/data/experiment/MTAB/E-MTAB-62/E-MTAB-62.processed.2.zip    5369
Name: Comment [Derived ArrayExpress FTP file], dtype: int64


Factor Value[4 meta-groups]
neoplasm     2315
cell line    1256
normal       1033
disease       765
Name: Factor Value[4 meta-groups], dtype: int64


Factor Value[15 meta-groups]
solid tissue neoplasm cell line        831
breast cancer                          672
leukemia                               567
normal solid tissue                    566
normal blood                           467
            ...
Name: Factor Value[15 meta-groups], dtype: int64


Factor Value[369 groups]
breast cancer                              672
mononuclear cell infection                 314
acute myeloid leukemia                     295
B-cell lymphoma                            213
MCF7 breast epithelial adenocarcinoma      213
            ...
Name: Factor Value[369 groups], Length: 368, dtype: int64


Factor Value[Blood-NonBlood meta-groups]
non blood    3447
blood        1922
Name: Factor Value[Blood-NonBlood meta-groups], dtype: int64


Factor Value[6 meta-groups]
hematopoietic system           1854
solid tissue                   1382
                                875
cell line                       628
brain                           286
            ...
Name: Factor Value[6 meta-groups], dtype: int64


Factor Value[4 groups from blood to incompletely diff]
the rest                       2976
blood                          1922
incompletely differentiated     263
connective                      208
Name: Factor Value[4 groups from blood to incompletely diff], dtype: int64


Factor Value[96 groups]
                                       761
breast cancer                          672
mononuclear cell infection             314
acute myeloid leukemia                 295
B-cell lymphoma                        213
            ...
Name: Factor Value[96 groups], Length: 97, dtype: int64

Then I load E-MTAB-62.processed.2.zip, the processed microarray data.

affymetrix = (
    pd.read_csv(
        'https://www.ebi.ac.uk/arrayexpress/files/E-MTAB-62/E-MTAB-62.processed.2.zip',
        sep='\t',
        index_col='Hybridization REF',
        dtype='object' #1,2
    )  
    .drop('CompositeElement REF') #3
    .astype('float32') #4
    .T #5
    .loc[sample_data_rel.index] #6
)
  1. The string 'RMA' in the 2nd row "CompositeElement REF" prevents conversion to floating point values and triggers an annoyingly long warning unless dtype is specified as 'object.' Skipping the row by adding a parameter skiprows=[1] won't help avoid the bother.
  2. The data occupy 8.1 GB of memory at this point.
  3. The 2nd row gets removed.
  4. By converting the values to float32, the data size is now reduced to 457 MB. Generally, three significant figures are more than sufficient in biology, and so float16 will be probably OK as well.
  5. The data consists of 22283 rows for as many genes and 5322 columns for as many samples. I transpose the matrix so that it is suitable for PCA, with samples in rows and genes in columns.
  6. I align the indices of this dataframe and sample_data_rel for convenience and also to remove the mouse portion of the data.

The microarray data is standardized sample-wise, as is stated in the paper's supplementary material and evidenced by the low variances in data distribution across samples (below).

>>> descr = (
    affymetrix
    .T # transposed back to the original format
    .describe()
)

>>> pd.merge(
    descr.mean(axis=1).rename('mean'),
    descr.var(axis=1).rename('variance'),
    left_index=True,
    right_index=True
)

        mean       variance
count   22283.00   0.000000
mean    6.876087   0.016922
std     1.795443   0.004527
min     3.461061   0.001730
25%     5.563369   0.005386
50%     6.723452   0.001895
75%     7.967647   0.007483
max     14.25868   0.024582

PCA

Finally, I get scikit-learn to perform PCA...

n_components = 2
pca = PCA(n_components=n_components)
Xp = pca.fit_transform(affymetrix)
Xp = pd.DataFrame(
    Xp,
    index=affrymetrix.index, 
    columns=[f'PC{k+1}' for k in range(n_components)]
)

... and plot the results.

fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(20, 10))
for i, col in enumerate(
    [
        'Factor Value[4 groups from blood to incompletely diff]',
        'Characteristics[4 meta-groups]'
    ]
):

    sns.scatterplot(
        ax=axes[i],
        data=Xp,
        x='PC1',
        y='PC2',
        hue=sample_data_rel[col],
        alpha=0.3,
    )
    axes[i].axis('square')

These plots appear identical to those in the paper.

If, as usual, I normalized the dataset, which is already normalized sample-wise as I wrote above, it would be doubly normalized (sample-wise and gene-wise) and give totally different results.

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X = sc.fit_transform(affymetrix)
Xp = pca.fit_transform(X)
Xp = pd.DataFrame(
    Xp, index=affymetrix.index, columns=[f'PC{k+1}' for k in range(n_components)]
)

fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(20, 10))
for i, col in enumerate(
    [
        'Factor Value[4 groups from blood to incompletely diff]',
        'Characteristics[4 meta-groups]'
    ]
):

    sns.scatterplot(
        ax=axes[i],
        data=Xp,
        x='PC1',
        y='PC2',
        hue=sample_data_rel[col],
        alpha=0.3,
    )
    axes[i].axis('square')
principal component analysis of microarray data after double normalization

Leave a Reply

Your email address will not be published. Required fields are marked *